EEC Architecture

The EEC Architecture is a hierarchical design that emphasizes full software control and flexibility for both time- and space-multiplexing, together with a storage hierarchy optimized for low power consumption.

Tiles

At the highest level, a chip consists of multiple Tiles. Each Tile contains a switch in the chip-wide data network and is responsible for granting arbitrated access to that network to each of its Ensembles.

The chip would also include scalar processor cores for running the host OS and I/O blocks for moving data on- and off-chip.

Figure 1. A 3x3 array of Tiles, with each Tile containing 4 Ensembles and a Network Switch.

Ensembles

Each Tile is made up of multiple Ensembles. An Ensemble consists of a medium-sized (2k-word) local storage array that is shared by multiple processing units. The Ensemble is responsible for granting arbitrated access to the Ensemble memory to each of its Compute Engines, and for passing requests for network access up to its enclosing Tile.

Figure 2. A single Tile, with 4 Ensembles and a Network Switch. Each Ensemble is made up of 4 Compute Engines, an Ensemble RAM, local RAMs, and an arbiter.
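
The Tile / Ensemble / Compute Engine hierarchy described so far can be sketched as a small data model. This is only an illustration: the counts come from the example configuration in Figures 1 and 2, the class and field names are hypothetical, and "2k word" is assumed to mean 2048 words.

```python
from dataclasses import dataclass, field
from typing import List

# Counts taken from the example configuration in Figures 1 and 2.
# All class and field names here are hypothetical.
TILE_ROWS, TILE_COLS = 3, 3
ENSEMBLES_PER_TILE = 4
ENGINES_PER_ENSEMBLE = 4
ENSEMBLE_RAM_WORDS = 2 * 1024  # "2k word" Ensemble memory, assuming k = 1024


@dataclass
class ComputeEngine:
    engine_id: int


@dataclass
class Ensemble:
    engines: List[ComputeEngine]
    # Storage array shared by all Compute Engines in the Ensemble.
    ram: List[int] = field(default_factory=lambda: [0] * ENSEMBLE_RAM_WORDS)


@dataclass
class Tile:
    row: int
    col: int
    ensembles: List[Ensemble]


def build_chip() -> List[Tile]:
    """Build the 3x3 example array as a flat list of Tiles."""
    tiles = []
    for r in range(TILE_ROWS):
        for c in range(TILE_COLS):
            ensembles = [
                Ensemble([ComputeEngine(i) for i in range(ENGINES_PER_ENSEMBLE)])
                for _ in range(ENSEMBLES_PER_TILE)
            ]
            tiles.append(Tile(r, c, ensembles))
    return tiles


chip = build_chip()
total_engines = sum(len(e.engines) for t in chip for e in t.ensembles)
print(total_engines)  # 3 * 3 * 4 * 4 = 144
```

With these example counts the chip holds 144 Compute Engines; a real configuration may use different numbers at each level.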

Compute Engine

Within each Ensemble there are multiple Compute Engines. The Compute Engine does the actual computation and sequencing operations. Each Compute Engine can communicate with its neighboring Compute Engines and all other Compute Engines within its Ensemble.

Figure 3. A Compute Engine with a Data Path made up of 2 ALU Groups, 2 indexable data register files (one with predicates, one without), a smaller non-indexable register file, a Local Communications Unit, and a Local Memory Interface. The Instruction Path consists of separate Instruction Register Files for each unit (with the two ALU Groups able to share their Instruction Register Files), a Sequencer, and an Instruction Loader.

  1. ALU Group - The basic computational unit of the Compute Engine. It contains local operand storage, bypass muxing, and a MAC/ALU unit.
    • The MAC is capable of doing 16x16-bit multiply-accumulate operations with a 40-bit extended-precision result fed into one of two accumulator registers.
    • The ALU supports 32-bit arithmetic and logical operations, as well as 16-bit and 8-bit SIMD operations on packed data.
    • The ALU/MAC supports predicated operation, with sub-word predicates for SIMD operations.
    • The ALU/MAC is fed from two Operand Registers which provide cheap local storage for frequently reused data. These are also bypassed to allow back-to-back computation.
    • The inputs to the Operand Registers are muxed from the outputs of the other units in the Compute Engine.
    • The results of the ALU and its condition codes are fed to the rest of the Compute Engine.
  2. Indexed Register Files (IdxRF) - The IdxRFs reduce the overhead of sequential access to local data by providing auto-increment/decrement access to an array of data.
    • Each IdxRF contains 2 indices for both the read and write ports.
    • One of the two IdxRFs in the Compute Engine also contains an indexed predicate register file, which is accessed in step with the data register file but whose reads and writes are independently controlled.
  3. Local Communications (LCOMM) - The LCOMM provides access to the inter-Compute Engine communications resources.
    • Each of the two ports into and out of the LCOMM can send or receive a word to one of the neighbors of the Compute Engine on each cycle.
    • If there is no data to read when a read is attempted, or if the last written data has not been remotely read when a second write is attempted, the LCOMM sends a stall signal to the Compute Engine.
  4. Local Memory Interface (LMEM) - Both the Ensemble memory and the Compute Engine's local memory are accessed through the LMEM.
    • The LMEM handles arbitration to the requested memory and stalls the Compute Engine when necessary.
    • The LMEM can contain various address generators.
  5. Instruction Register Files (IRFs) - The IRFs are small local memories that store the instructions executed on the corresponding unit, so that instructions do not have to be fetched from a cache as needed.
    • The IRFs are controlled by the sequencer, which selects the correct instruction to execute on each cycle.
    • A given unit can share access to multiple IRFs, allowing efficient SIMD operation and/or an expanded instruction storage.
  6. Sequencer - The Sequencer directs control flow in the Compute Engine by receiving condition codes from the ALUs and telling the IRFs which instruction to execute on a given cycle.
    • The Sequencer contains 4 Zero-Overhead counters that can be used for either efficient looping or counting.
    • Sequencer instructions normally consist of opcodes that select the combination of IRF entries needed to execute the current instruction, but the Sequencer can also inject a full instruction directly into any one unit in a cycle. This prevents infrequently used instructions from consuming valuable IRF space.
  7. Instruction Loader - The Instruction Loader loads the IRFs, the Sequencer, and the initial contents of the register files before execution begins.
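
The arithmetic described for the ALU Group in item 1 can be illustrated with a small behavioral sketch. It models a 16x16 MAC feeding a 40-bit accumulator and a predicated 16-bit SIMD add on a packed 32-bit word. Several details are assumptions rather than facts from this page: operands are treated as signed two's-complement values, the accumulator wraps on overflow (the hardware may saturate instead), and a lane whose predicate is false simply keeps its first operand. The function names are hypothetical.

```python
MASK16 = 0xFFFF
ACC_BITS = 40


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac16(acc: int, a: int, b: int) -> int:
    """Return acc + a*b for signed 16-bit a, b and a 40-bit accumulator.

    Wraparound on overflow is an assumption; real hardware may saturate.
    """
    product = to_signed(a, 16) * to_signed(b, 16)
    return to_signed(acc + product, ACC_BITS)


def simd_add16(x: int, y: int, pred=(True, True)) -> int:
    """Add two 32-bit words as two independent packed 16-bit lanes.

    pred[i] enables lane i (lane 0 is the low half). A disabled lane
    keeps the value from x; this behavior is an assumption.
    """
    result = 0
    for lane in range(2):
        shift = 16 * lane
        xl = (x >> shift) & MASK16
        yl = (y >> shift) & MASK16
        out = (xl + yl) & MASK16 if pred[lane] else xl
        result |= out << shift
    return result


# Dot product of (1, 3, -5) and (2, 4, 6) accumulated with the MAC.
acc = 0
for a, b in [(1, 2), (3, 4), (-5 & MASK16, 6)]:
    acc = mac16(acc, a, b)
print(acc)  # 2 + 12 - 30 = -16

# The low lane overflows without carrying into the high lane.
print(hex(simd_add16(0x0001FFFF, 0x00010001)))  # 0x20000
```

The extra accumulator bits matter because a single 16x16 product can need up to 32 bits, so a 40-bit accumulator leaves 8 bits of headroom for summing many products before overflow.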

The full Compute Engine definition is somewhat more complicated than what is shown above because of the control required for each register file. However, the major components of the two ALU Groups and the register files can still be seen in the full definition below:

Figure 4. The connectivity for a full Compute Engine as specified above. The sub-units of the two ALU Groups and the three Register Files are shown in red.
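
The LCOMM stall rule from item 3 of the component list (a read with no data available, or a second write before the previous word has been read remotely) can be sketched as a one-word channel with a full/empty flag. This is a behavioral illustration only: the class and method names are hypothetical, and the stall signal is modeled as a Python exception, whereas the hardware would presumably hold the Compute Engine until the operation can complete.

```python
class Stall(Exception):
    """Raised where the LCOMM would assert its stall signal."""


class LcommPort:
    """One-word link between two neighboring Compute Engines (hypothetical model)."""

    def __init__(self) -> None:
        self._word = 0
        self._full = False  # True while a written word is waiting to be read

    def write(self, word: int) -> None:
        # A second write stalls until the neighbor has read the first word.
        if self._full:
            raise Stall("previous word not yet read by the neighbor")
        self._word = word
        self._full = True

    def read(self) -> int:
        # A read stalls when there is nothing to read.
        if not self._full:
            raise Stall("no data available to read")
        self._full = False
        return self._word


port = LcommPort()
port.write(7)
try:
    port.write(8)  # the neighbor has not read 7 yet
    stalled = False
except Stall:
    stalled = True
value = port.read()  # the neighbor consumes the word
port.write(8)        # the retried write now succeeds
print(stalled, value)  # True 7
```

Because each port holds a single word, the sender and receiver are kept in step without any explicit synchronization instructions in this model.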


Efficient Embedded Computing - Concurrent VLSI Architecture Group - Stanford Department of Electrical Engineering

Last Updated: $Id: index.html,v 1.4 2006/05/10 22:37:30 davidbbs Exp $