M-Machine Publications

The MIT Multi-ALU Processor

I. Introduction

The Multi-ALU Processor (MAP) chip is a component of the M-Machine multicomputer, an experimental machine being developed at MIT to test architectural concepts motivated by the constraints of semiconductor technology and the demands of programming systems, such as faster execution of fixed-size problems and easier use of parallel computers. Each six-chip M-Machine node consists of a MAP chip and 8 MBytes of synchronous DRAM (SDRAM) with ECC. The MAP chip employs a novel architecture for exploiting instruction-level parallelism as well as mechanisms to enable large-scale multiprocessing, including an on-chip integrated network interface and router and mechanisms for data sharing among processors. In addition, the MAP employs an efficient capability-based addressing scheme to provide protection and ensure data integrity. The MAP chip is implemented with 7.5 million transistors in a 5-metal, 0.5-micron process. All of the datapath layout is complete, and we are currently performing place and route of the standard-cell control logic. Tapeout is scheduled for June 1997.

II. MAP Architecture

The MAP chip contains three 64-bit execution clusters, a unified cache divided into four banks, an external memory interface, and a communication subsystem consisting of a network interface and a router. Two of the clusters have two integer units, a floating-point multiply-add unit, and a floating-point divide/square-root unit. The third cluster has only two integer units. Two crossbar switches interconnect these components. Clusters make memory requests to the appropriate bank of the interleaved cache over the 142-bit-wide 3x4 crossbar M-Switch. The 88-bit-wide 9x3 crossbar C-Switch is used for inter-cluster communication and to return data from the memory system. Both switches support up to three transfers per cycle; each cluster may send and receive one transfer per cycle. The 64KB unified on-chip cache is organized as four 16KB banks that are word-interleaved to permit accesses to consecutive addresses to proceed in parallel. The cache banks are pipelined with a three-cycle read latency, including switch traversal. Each cluster has its own 8KB instruction cache, which fetches instructions from the unified cache on instruction cache misses. A 128-entry TLB is used to implement virtual memory.

The MAP employs a form of Processor Coupling to control the multiple ALUs. Each of the three clusters has its own independent instruction stream. However, threads running on those clusters may communicate and synchronize with one another very quickly by writing into each other's register files via the C-Switch. Scoreboard bits on the registers are used to synchronize these register-register transfers, as well as to indicate when the data from a load has returned from the non-blocking memory system. Clusters may also communicate and synchronize through globally broadcast condition codes, a cluster barrier instruction (CBAR), and through the on-chip cache.
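The word-interleaved bank selection described above can be sketched as follows. This is an illustrative model, not the actual MAP hardware: it assumes 64-bit (8-byte) words and four banks as stated in the text, with the bank chosen by the low bits of the word address so that consecutive words land in different banks.

```python
# Hypothetical sketch of word-interleaved cache bank selection.
# Assumes 8-byte words and four banks, per the description in the text.
WORD_BYTES = 8
NUM_BANKS = 4

def bank_of(addr: int) -> int:
    """Select the cache bank for a byte address using the low word-address bits."""
    return (addr >> 3) & (NUM_BANKS - 1)

# Four consecutive word addresses map to four different banks,
# so the accesses can proceed in parallel over the M-Switch.
banks = [bank_of(a) for a in range(0x1000, 0x1000 + 4 * WORD_BYTES, WORD_BYTES)]
```

With this mapping, a unit-stride stream of loads spreads evenly across all four banks, which is what lets the four-bank cache sustain multiple accesses per cycle.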
Each cluster's execution units are shared among six threads that are concurrently resident in the cluster's pipeline registers. These threads are multithreaded on a cycle-by-cycle basis by a synchronization (SZ) pipeline stage. Some of the hardware thread slots are reserved for exception, event, and message handlers, so that they can run in parallel with user threads and incur no invocation overhead when they start up.

The integrated communication subsystem includes hardware support for message injection, routing, and extraction. A message is formatted in the general-purpose registers and delivered atomically to the network interface with a SEND instruction. An 8-entry Global Translation Lookaside Buffer (GTLB) translates global virtual addresses to physical node identifiers, providing application-independent address mapping in much the way a traditional TLB maps virtual addresses to physical memory. The two-dimensional network consists of the on-chip routers, which connect directly to the routers on adjacent nodes through the pads. The routers implement two message priorities sharing four virtual channels. A message that arrives at its final destination is placed into the incoming message queue, which is mapped to a register name in a dedicated thread slot. Software can then extract the message and perform the required action. The scoreboard bit on the register indicates whether any words remain in the incoming message queue.

The MAP memory system implements protection between threads in a globally shared address space. Guarded pointers encode a capability in the top 10 bits of a 64-bit MAP word, and a tag bit prevents pointers from being forged. Hardware checking of a pointer's permissions and segment bounds prevents unauthorized memory access on load, store, and jump instructions. A combination of hardware and software mechanisms on the MAP chip is used to implement fast and flexible data sharing across M-Machine nodes.
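The guarded-pointer check described above can be modeled in software. The text states only that the capability occupies the top 10 bits plus an out-of-band tag bit; the specific split assumed here (4 permission bits, a 6-bit log2 segment length, 54 address bits, power-of-two-aligned segments) and the meaning of the write-permission bit are assumptions for illustration, not the documented MAP encoding. Addresses and offsets are treated as word-granular.

```python
# Illustrative model of a guarded-pointer permission and bounds check.
# Field widths (4 permission bits, 6 segment-length bits, 54 address bits)
# and the power-of-two segment alignment are ASSUMPTIONS; the source text
# says only "capability in the top 10 bits" plus an out-of-band tag bit.
ADDR_BITS = 54

def check_access(ptr: int, offset: int, need_write: bool) -> bool:
    perm = (ptr >> 60) & 0xF           # permission field (top 4 bits)
    seg_log2 = (ptr >> 54) & 0x3F      # log2 of segment length (assumed encoding)
    addr = ptr & ((1 << ADDR_BITS) - 1)
    target = addr + offset
    # Segment base: the address with the low seg_log2 bits cleared
    # (valid because segments are assumed power-of-two aligned here).
    base = addr & ~((1 << seg_log2) - 1)
    in_bounds = base <= target < base + (1 << seg_log2)
    writable = bool(perm & 0b0010)     # hypothetical write-permission bit
    return in_bounds and (writable or not need_write)
```

The point of the scheme is visible even in this sketch: because the bounds and permissions travel inside the pointer itself and the tag bit prevents forging one, every load, store, and jump can be checked in hardware without consulting a separate protection table.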
In addition to the virtual and physical page numbers, each page table entry includes two state bits for each cache line (a total of 128 bits with 512-word pages and 8-word cache lines). These bits encode the states READ-ONLY, READ-WRITE, DIRTY, and INVALID, allowing cache-line-sized items to be shared across processors. When the memory system detects a load to an INVALID line, an event is invoked in a dedicated thread slot. The software event handler can then send a message to the home node of the data, using the automatic translation of the GTLB, requesting a copy of the line. The remote message handler automatically invoked by hardware upon message arrival at the home node retrieves the data and returns it with another SEND instruction. When the line arrives, another software handler is invoked in a dedicated thread slot; it installs the line and delivers the requested word directly to the destination register of the original load instruction.
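The requesting-node side of this software coherence protocol can be sketched as below. The state names come from the text, but the handler signature, the per-line state table, and the request callback are hypothetical stand-ins for the M-Machine's dedicated-thread-slot handlers and SEND-based messaging.

```python
# Hypothetical sketch of the event handler that runs when a load touches
# an INVALID line. State names follow the text; everything else (the state
# table, the send_request callback) is illustrative, not M-Machine code.
from enum import Enum

class LineState(Enum):
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2
    DIRTY = 3

def on_load_event(line_states, line, send_request):
    """Invoked in a dedicated thread slot on a load to a tracked line.
    For an INVALID line, request a copy from the home node (which the
    GTLB would locate by translating the global virtual address)."""
    if line_states[line] is LineState.INVALID:
        send_request(line)
        return "miss-pending"
    return "hit"

states = {0x40: LineState.INVALID, 0x48: LineState.READ_WRITE}
requested = []
on_load_event(states, 0x40, requested.append)  # INVALID: message goes out
on_load_event(states, 0x48, requested.append)  # local hit: no message
```

The key design point mirrored here is that the common hit case never enters software at all; only the INVALID transition pays for a handler, and that handler starts with no invocation overhead because its thread slot is already resident.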

III. MAP Chip Implementation

A unique set of constraints dictated the MAP implementation of the M-Machine architecture. The M-Machine team had the advantage of a "clean-sheet" design with no requirement to support a pre-existing ISA. The corresponding disadvantage was the lack of pre-existing infrastructure, most notably the absence of cell libraries, datapath components, and experience with the target fabrication process. Throughout the project, the limited size of the M-Machine team was a fundamental constraint. From inception in 1992, through architecture specification, RTL modelling, circuit design, floorplanning, and physical design, to the planned tapeout in June 1997, an average of nine engineers, with a peak of twenty, worked regularly on the MAP implementation. Despite limited manpower, over the course of the project the M-Machine team designed, developed, and characterized a full standard cell library; a composable datapath cell library; 5 RAM arrays; and a broad family of datapath components, including a 64b adder/subtractor, a 64b barrel shifter, a radix-8 multiplier array, and a 7-ported register file. In addition, a set of CMOS and low-voltage simultaneous bidirectional I/O pads was designed for the project. The critical components of the execution datapath, such as the multiplier array, adder, and register files, were implemented with a full-custom methodology. The majority of the latches, multiplexers, and buffers in the execution datapaths were implemented by explicit placement (tiling) of the composable datapath cells and standard cells. The control logic was implemented via standard cell place and route. The majority of the circuits in the MAP design are implemented in static CMOS logic, both for design simplicity and to minimize the effort required to fully characterize their functionality and performance. The most notable exceptions are the domino multiplier array, the SRAM arrays, and the simultaneous bidirectional pad drivers.
While all of the circuit and logic design was performed at MIT, a fundamental project decision was to collaborate with an industrial design center on the physical design of the chip. The M-Machine team eventually selected Cadence Spectrum Design (CSD) as its partner for the MAP implementation. The extensive chip-building experience of the CSD engineers was critical to the success of the project. The original design of the MAP chip contained over 13 million transistors and consisted of 4 clusters, each with an IU, MU, and FPU, 1 Mbit of on-chip unified cache, and a 3D mesh router. As the project progressed, it became apparent that this design was too large for the target die size. Some of the key lessons of the MAP implementation experience resulted from the process of paring the original design down to its final form. Several factors contributed to underestimating the size of the original design: the design methodology, an underappreciation of the complexity of the control logic, and overly optimistic expectations of the process technology. The use of the semi-custom datapath layout methodology resulted in an average 40% area growth in the datapaths; however, the additional flexibility significantly reduced the time and effort required to make engineering changes and fix errors. Somewhat surprisingly, the quantity and complexity of the random control logic, and not the density of the datapaths, has dictated both circuit performance and chip area. The resulting MAP chip is a 7.5 million transistor microprocessor with a die size of 18mm x 18mm. It is implemented in a 5-level-metal 0.7um drawn (0.5um effective) CMOS process. IBM Corporation is manufacturing the MAP chip described in this presentation for MIT. Each MAP die will be packaged in an MCM-L chip carrier with 5 16Mb SDRAM TSOPs.