M-Machine Project Page

Last updated July 17, 1998

The 5-million-transistor Multi-ALU Processor (MAP) chip is the core processing element of each node in the M-Machine multicomputer. The design was first taped out on June 9, 1998.

The M-Machine is a fine-grained multicomputer being designed and built by the Concurrent VLSI Architecture group at Stanford University (the group was previously affiliated with the Artificial Intelligence Laboratory at MIT).

The M-Machine project explores new computer architectures that match semiconductor technology trends in two ways: multi-ALU processing nodes exploit increased circuit density more efficiently, and global communication is minimized and carefully managed to reduce the performance impact of increasing wire delays. Building arithmetic units, register files, and memories, and replicating many of them on an integrated circuit, is straightforward. Organizing and controlling many arithmetic units on a chip efficiently and productively, however, requires new architecture technology. In addition, new agile memory management and event handling mechanisms are required to support multiple simultaneously executing threads.

In particular, the M-Machine project is focusing on the following issues:

Multi-ALU node organization
Methods for controlling multi-ALU nodes
Cache organization for multi-ALU nodes
Memory protection for scalable computers
Memory resource management for scalable computers
Exception handling methods for multi-ALU nodes
Communication and synchronization mechanisms for scalable computers
Efficient network interfaces for scalable computers

The M-Machine is designed to efficiently execute programs with any or all granularities of parallelism. On the MAP (Multi-ALU Processor, the processing core of the M-Machine), parallel instruction sequences (H-Threads) run concurrently on the three clusters to exploit instruction-level parallelism (ILP) across all 9 function units. H-Threads may also be used to exploit loop-level parallelism or fine-grain thread-level parallelism. To exploit coarse-grain thread-level parallelism and to mask variable pipeline, memory, and communication delays, the MAP interleaves the 9-wide instruction streams from different tasks (V-Threads) within each cluster on a cluster-by-cluster and cycle-by-cycle basis, thus sharing the execution resources among all active tasks.

This arrangement of V-Threads (Vertical Threads) and H-Threads (Horizontal Threads) is summarized in the figure above. Five V-Threads are resident in the cluster register files. Each V-Thread consists of three H-Threads, one on each cluster. Each H-Thread consists of a sequence of 3-wide instructions containing integer, memory, and floating point operations. On each cluster the H-Threads from the different V-Threads are interleaved over the execution units.
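The V-Thread/H-Thread arrangement can be illustrated with a small Python model. This is only a sketch: the counts (three clusters, five resident V-Threads, 3-wide instructions) come from the text, but the round-robin selection policy, the `stalled_until` stall model, and all names (`HThread`, `make_threads`, `pick`, `simulate`) are assumptions made for illustration and do not describe the actual MAP issue logic.

```python
from dataclasses import dataclass

NUM_CLUSTERS = 3  # from the text: one H-Thread of each V-Thread per cluster
NUM_VTHREADS = 5  # from the text: five V-Threads resident in the register files


@dataclass
class HThread:
    vthread: int
    cluster: int
    instructions: list      # each entry is a 3-wide (integer, memory, fp) tuple
    pc: int = 0
    stalled_until: int = 0  # hypothetical stall model, not from the source

    def ready(self, cycle):
        return self.pc < len(self.instructions) and cycle >= self.stalled_until


def make_threads(n_instr):
    """Build threads[cluster][vthread], each holding n_instr placeholder instructions."""
    return [
        [HThread(v, c, [("int", "mem", "fp")] * n_instr) for v in range(NUM_VTHREADS)]
        for c in range(NUM_CLUSTERS)
    ]


def pick(cluster_threads, cycle, last):
    """Round-robin choice (assumed policy) of the next ready H-Thread on one cluster."""
    n = len(cluster_threads)
    for offset in range(1, n + 1):
        idx = (last + offset) % n
        if cluster_threads[idx].ready(cycle):
            return idx
    return None  # no H-Thread is ready: the cluster idles this cycle


def simulate(threads, cycles):
    """Return, per cycle, the V-Thread index issued on each cluster (or None)."""
    last = [-1] * NUM_CLUSTERS
    trace = []
    for cycle in range(cycles):
        row = []
        for c in range(NUM_CLUSTERS):
            idx = pick(threads[c], cycle, last[c])
            if idx is not None:
                threads[c][idx].pc += 1
                last[c] = idx
            row.append(idx)
        trace.append(row)
    return trace
```

In this model each cluster makes its own choice every cycle, so a stalled H-Thread on one cluster does not hold back the other clusters or the other V-Threads, which is the latency-masking effect the interleaving is meant to provide.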

The M-Machine architecture specifies 9 ALUs per node, which would deliver 900 MIPS and 600 MFLOPS per node with a 100 MHz clock. The actual silicon implementation includes only 7 ALUs per node (6 integer units and 1 floating-point unit instead of 6 integer units and 3 floating-point units) due to chip area restrictions.
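The peak figures quoted for the 9-ALU architecture can be reproduced with simple arithmetic. The sketch below assumes each ALU issues one operation per cycle. It further assumes that each floating-point unit is credited with two floating-point operations per cycle (for example, a multiply and an add); the page does not state how the 600 figure was counted, so that factor is a guess that happens to match. The function name `peak_mops` is hypothetical.

```python
def peak_mops(units, clock_mhz, ops_per_unit_per_cycle=1):
    """Peak millions of operations per second for a group of function units."""
    return units * clock_mhz * ops_per_unit_per_cycle


CLOCK_MHZ = 100  # clock rate quoted in the text

# 9 ALUs, one operation each per cycle (assumption).
architected_mips = peak_mops(9, CLOCK_MHZ)

# 3 floating-point units, two flops each per cycle (assumption, see lead-in).
architected_mflops = peak_mops(3, CLOCK_MHZ, ops_per_unit_per_cycle=2)

print(architected_mips, architected_mflops)
```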

M-Machine Design Team

The M-Machine team consists of hardware and software designers at MIT/Stanford, Cadence Spectrum Design, and Caltech/Syracuse. Past members of the M-Machine design team include the Microelectronics Center of North Carolina (MCNC).

MIT/Stanford

William J. Dally

Whay Sing Lee

Steve Keckler

Marco Fillo

Andrew Chang

Nick Carter

Albert Ma

Keith Klayman

Dan Hartman

Parag Gupta

Andrew Chen