EEC Architecture
The EEC Architecture is a hierarchical design that emphasizes full software
control and flexibility for both time- and space-multiplexing, together with
a storage hierarchy optimized for low power consumption.
Tiles
At the highest level, a chip consists of multiple Tiles. Each Tile contains
a switch in the chip-wide data network. The Tile is responsible for granting
arbitrated access to the network to each of its Ensembles.
The chip would also include scalar processor cores for running the host OS
and I/O blocks for moving data on- and off-chip.

Figure 1. A 3x3 array of Tiles, with each Tile containing
4 Ensembles and a Network Switch.
Ensembles
Each Tile is made up of multiple Ensembles. An Ensemble consists of a
medium-sized (2k word) local storage array that is shared by multiple processing units.
The Ensemble is responsible for granting arbitrated access to the Ensemble memory
to each of its Compute Engines, and passing requests for network access up to
its enclosing Tile.

Figure 2. A single Tile, with 4 Ensembles and a Network
Switch. Each Ensemble is made up of 4 Compute Engines, an Ensemble RAM, local
RAMs, and an arbiter.
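The containment hierarchy described so far can be sketched as plain data structures. This is only an illustrative model, not part of the EEC design itself: the class and function names are hypothetical, the counts (a 3x3 array of Tiles, 4 Ensembles per Tile, 4 Compute Engines per Ensemble) are taken from the examples in Figures 1 and 2, and "2k word" is assumed to mean 2048 words. The network switch and arbiters are not modeled.

```python
from dataclasses import dataclass, field

ENSEMBLE_RAM_WORDS = 2048  # assumption: "2k word" Ensemble memory = 2048 words


@dataclass
class ComputeEngine:
    index: int


@dataclass
class Ensemble:
    engines: list
    # Shared Ensemble memory, one list entry per word.
    ram: list = field(default_factory=lambda: [0] * ENSEMBLE_RAM_WORDS)


@dataclass
class Tile:
    ensembles: list  # each Tile also holds a network switch (not modeled here)


def build_chip(rows=3, cols=3, ensembles_per_tile=4, engines_per_ensemble=4):
    """Build a rows x cols grid of Tiles using the counts from Figures 1 and 2."""
    return [
        [
            Tile([
                Ensemble([ComputeEngine(i) for i in range(engines_per_ensemble)])
                for _ in range(ensembles_per_tile)
            ])
            for _ in range(cols)
        ]
        for _ in range(rows)
    ]


def count_engines(chip):
    """Count every Compute Engine on the chip."""
    return sum(len(e.engines) for row in chip for tile in row for e in tile.ensembles)


chip = build_chip()
total_engines = count_engines(chip)  # 3 * 3 * 4 * 4 Compute Engines
```

With the figure's example counts, this configuration holds 144 Compute Engines in total.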
Compute Engine
Within each Ensemble there are multiple Compute Engines. The Compute Engine
does the actual computation and sequencing operations. Each Compute Engine can
communicate with its neighboring Compute Engines and all other Compute Engines
within its Ensemble.

Figure 3. A Compute Engine with a Data Path made
up of 2 ALU Groups, 2 indexable data register files (one with predicates,
one without), a smaller non-indexable register file, a Local Communications
Unit, and a Local Memory Interface. The Instruction Path consists of separate
Instruction Register Files for each unit (with the two ALU Groups able
to share their Instruction Register Files), a Sequencer, and an Instruction
Loader.
- ALU Group - The basic computational
unit of the Compute Engine contains local operand storage, bypass muxing,
and a MAC/ALU unit.
- The MAC is capable of doing 16x16 MAC operations
with a 40-bit extended precision result fed into one of two accumulator
registers.
- The ALU supports 32-bit arithmetic and logical operations,
as well as 16-bit and 8-bit SIMD operations on packed data.
- The ALU/MAC supports predicated operation, with
sub-word predicates for SIMD operations.
- The ALU/MAC is fed from two Operand Registers which
provide cheap local storage for frequently reused data. These are
also bypassed to allow back-to-back computation.
- The inputs to the Operand Registers are muxed from
the outputs of the other units in the Compute Engine.
- The results of the ALU and condition codes are fed
to the rest of the Compute Engine.
- Indexed Register Files (IdxRF) - The
IdxRFs reduce the overhead of sequential access to local data by providing
auto-increment/decrement access to an array of data.
- Each IdxRF contains 2 indices for both the read and
write ports.
- One of the two IdxRFs in the Compute Engine also
contains an indexed predicate register file which is accessed in step
with the data register file, but whose read and write is independently
controlled.
- Local Communications (LCOMM) - The
LCOMM provides access to the inter-Compute Engine communications resources.
- Each of the two ports into and out of the LCOMM
can send or receive a word to one of the neighbors of the Compute
Engine on each cycle.
- If there is no data to read when a read is attempted,
or if the last written data has not been remotely read when a second
write is attempted, the LCOMM sends a stall signal to the Compute Engine.
- Local Memory Interface (LMEM) - The
local Ensemble Memory and the Compute Engine memory are accessed through
the LMEM.
- The LMEM handles arbitration to the requested
memory and stalls the Compute Engine when necessary.
- The LMEM can contain various address generators.
- Instruction Register Files (IRFs) -
The IRFs are small local memories that store the instructions
executed on the corresponding unit, so that instructions need not be
fetched from a cache as needed.
- The IRFs are controlled by the sequencer, which
selects the correct instruction to execute on each cycle.
- A given unit can share access to multiple IRFs,
allowing efficient SIMD operation and/or an expanded instruction storage.
- Sequencer - The Sequencer controls
the flow of execution in the Compute Engine: it receives
condition codes from the ALUs and tells the IRFs which instruction
to execute on a given cycle.
- The Sequencer contains 4 Zero-Overhead counters
that can be used for either efficient looping or counting.
- The Sequencer instructions normally consist of opcodes
that select the combination of IRFs needed to execute the current instruction,
but the Sequencer can also inject a full instruction directly into any one unit
in a cycle. This prevents infrequently used instructions from consuming
valuable IRF space.
- Instruction Loader - The Instruction
Loader loads the IRFs, the Sequencer, and the initial contents of the
register files before execution begins.
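To make the ALU Group arithmetic above concrete, the following is a minimal behavioral sketch of a 16x16 MAC with a 40-bit accumulator and of a predicated 16-bit SIMD add on packed data. The function names are hypothetical, and several details are assumptions not stated in the text: operands are treated as signed two's-complement values, the accumulator wraps rather than saturates on overflow, and a lane whose predicate is false simply keeps its first operand.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac16(acc, a, b):
    """Return acc + a*b for 16-bit signed operands, wrapped to 40 bits (assumed)."""
    product = to_signed(a, 16) * to_signed(b, 16)
    return to_signed(acc + product, 40)


def simd_add16(x, y, predicates):
    """Add two 32-bit words as two packed 16-bit lanes.

    predicates[i] enables lane i (lane 0 is the low half-word). A disabled
    lane keeps the value from x (an assumed convention). Carries do not
    propagate between lanes.
    """
    result = 0
    for lane in range(2):
        shift = 16 * lane
        xa = (x >> shift) & 0xFFFF
        ya = (y >> shift) & 0xFFFF
        lane_out = (xa + ya) & 0xFFFF if predicates[lane] else xa
        result |= lane_out << shift
    return result
```

The extra 8 bits of accumulator headroom beyond the 32-bit product are what allow many products to be summed before overflow becomes a concern.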
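The LCOMM stall rule described in the list above (a read stalls when no data is available, and a write stalls when the previously written word has not yet been read remotely) behaves like a one-word link between neighbors. The class below is a hypothetical software model of that rule only; the real LCOMM has two ports and raises a stall signal in hardware, which is represented here by a returned flag.

```python
class LcommLink:
    """One-word link between two neighboring Compute Engines (hypothetical model)."""

    def __init__(self):
        self._word = None  # None means the link currently holds no data

    def write(self, word):
        """Return True on success, or False to signal a stall (previous word unread)."""
        if self._word is not None:
            return False
        self._word = word
        return True

    def read(self):
        """Return (ok, word); ok is False to signal a stall (nothing to read)."""
        if self._word is None:
            return False, None
        word, self._word = self._word, None
        return True, word
```

For example, on a fresh link `read()` reports a stall, a first `write(7)` succeeds, a second `write(8)` reports a stall, and the write can succeed again once the neighbor has read the 7.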
The full Compute Engine definition is somewhat more complicated than what is
shown above because of the control of each register file. However, the major
components, the two ALUs and the register files, can be seen below in the full definition:

Figure 4. The connectivity for a full Compute Engine as
specified above. The sub-units of the two ALU Groups and the three Register
Files are shown in red.
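As a rough illustration of how two of the units listed earlier cooperate, the sketch below combines an IdxRF-style auto-incrementing read with a loop counter standing in for one of the Sequencer's counters. All names are hypothetical, only a single read index is modeled (the text describes 2 indices for the read and write ports), and wrapping the index at the end of the register file is an assumption.

```python
class IndexedRegisterFile:
    """Register file with an auto-incrementing read index (hypothetical model)."""

    def __init__(self, values):
        self.regs = list(values)
        self.read_index = 0

    def read_postinc(self, step=1):
        """Return the register at the read index, then advance the index by step."""
        value = self.regs[self.read_index]
        # Assumption: the index wraps around at the end of the register file.
        self.read_index = (self.read_index + step) % len(self.regs)
        return value


def sum_with_loop_counter(rf, count):
    """Accumulate `count` consecutive registers from rf."""
    acc = 0
    counter = count  # stands in for one of the Sequencer's zero-overhead counters
    while counter > 0:
        acc += rf.read_postinc()  # no separate index-update instruction needed
        counter -= 1
    return acc
```

In software the index update and counter decrement are explicit statements; the point of the IdxRF and the Sequencer counters, as described above, is that the datapath instruction stream does not have to spend instructions on them.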
Efficient Embedded Computing - Concurrent VLSI Architecture Group - Stanford Department of Electrical Engineering
Last Updated: $Id: index.html,v 1.4 2006/05/10 22:37:30 davidbbs Exp $