Efficient Embedded Computing (EEC)

Creating a programmable architecture competitive with ASICs in power and area.

Principal Investigator: William J. Dally

Students: James Balfour, David Black-Schaffer, James Chen, JongSoo Park, Vishal Parikh

Overview

Most computing cycles today are performed in embedded signal- and image-processing systems such as cell phones and digital TVs. Almost all of this processing is currently performed by application-specific integrated circuits (ASICs), as they are 30-100x more efficient in terms of cost and power than the most efficient programmable processors and DSPs.

The EEC project is developing an energy-efficient programmable architecture that is at least as efficient as an ASIC implementation for a broad range of applications, while still providing the flexibility of programmability. This is achieved through a novel architecture that systematically eliminates much of the overhead of conventional programmable processors, and through the use of custom circuits.

Results

(August 2007) Initial results comparing a baseline EEC architecture to a standard embedded RISC core, with both designs taken through place and route, demonstrate energy and area efficiency improvements of over 30x on a variety of benchmarks.

Technology

The basic architecture is a tiled, multi-level hierarchy of computation and control units matched to local memories and an on-chip network. The structure is arranged to provide the required data, computation, and control bandwidth at each level while minimizing power consumption. The EEC project aims to match the efficiencies of ASICs while maintaining programmability by addressing the data-movement, control, and computation overhead of traditional programmable processors.

Data movement in the EEC processor is managed through explicit, cost-exposed communications channels. These provide several levels of communications hierarchy for efficient streaming of data and inter-tile communication. By keeping all data movement exposed, the compiler tool chain can effectively evaluate and optimize the power consumption of a program while maintaining real-time performance guarantees.
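As a rough illustration of why exposed data-movement costs help a compiler, the following minimal sketch compares two candidate schedules by their estimated movement energy. The level names follow the storage hierarchy described on this page, but the per-access costs, the function names, and the two schedules are hypothetical values chosen for illustration; they are not measured EEC numbers or part of the EEC tool chain.

```python
# Hypothetical per-access energy costs in arbitrary units.
# These are illustrative assumptions, not measured EEC values.
ACCESS_COST = {"orf": 1.0, "xrf": 4.0, "local_mem": 20.0, "network": 60.0}


def movement_energy(access_counts):
    """Estimate data-movement energy from the number of accesses per level."""
    return sum(ACCESS_COST[level] * count for level, count in access_counts.items())


def cheaper_schedule(first, second):
    """Return whichever candidate schedule has the lower estimated energy."""
    return first if movement_energy(first) <= movement_energy(second) else second


# Two made-up schedules for the same kernel: one fetches every operand from
# local memory, the other reuses operands held in the register files.
naive = {"local_mem": 1000}
blocked = {"local_mem": 100, "xrf": 300, "orf": 600}

best = cheaper_schedule(naive, blocked)
```

Because every access is attributed to an explicit level, the comparison reduces to a weighted sum; a real scheduler would also have to respect timing constraints, which this sketch ignores.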

Baseline Architecture

EEC Baseline Architecture - Embedded Tiled System overview and efficient processor architecture

Our baseline EEC processor includes many novel features that reduce the power required to deliver instructions and data to the functional units.

For data delivery, the lowest level of storage consists of small Operand Register Files (ORFs) integrated directly into the ALUs, which provide very efficient access to data with short-term locality, and an exposed result register, which avoids the cost of writing into a register file when the result will be consumed immediately. Larger, dual Indexed Register Files (XRFs) make up the second level of the memory hierarchy. The XRFs provide cheap hardware indexing to avoid the instruction overhead of unrolling loops, and efficient half-word access to match common embedded data sizes. Local memory is accessed through the Memory Interface Unit (MIU), which contains a programmable address generator whose instructions can be composed to execute complex memory access patterns efficiently without using the data path for address calculations.

For instruction delivery, we have integrated shallow Instruction Register Files (IRFs) within each of the functional units to provide low-power instruction delivery. The IRFs are loaded and managed by software under the control of the sequencer, providing great flexibility for software optimization and instruction selection. Evaluating each of these features in detail will allow us to better understand where the inefficiencies are in current processors and how to overcome them.
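To make the idea of a composable address generator more concrete, here is a minimal software sketch of a nested strided access pattern. The function name, the `(count, stride)` encoding, and the example addresses are assumptions made for illustration; the actual MIU instruction set is not described on this page.

```python
def address_pattern(base, dims):
    """Yield addresses for a nested strided pattern.

    dims is a list of (count, stride) pairs, outermost loop first.
    This encoding is a hypothetical stand-in for composed MIU instructions.
    """
    def walk(offset, level):
        if level == len(dims):
            yield offset
            return
        count, stride = dims[level]
        for i in range(count):
            yield from walk(offset + i * stride, level + 1)

    yield from walk(base, 0)


# A 2x3 block of a row-major image with 8-element rows, starting at address 16.
block = list(address_pattern(16, [(2, 8), (3, 1)]))
```

In this model the whole block is described by two `(count, stride)` pairs, so no data-path arithmetic is spent on computing each address.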

Data Supply & Storage: Efficient Register Files

The control structure in the EEC provides support for fine-grain parallelism, low-overhead SIMD processing, and synchronization. Instruction encoding is a combination of per-cycle instruction selection and coarser-grained functional unit configuration, which combine to reduce the size of the instruction and the execution power. Multi-level zero-overhead looping and register indexing extend the flexibility of the instruction set to further reduce the instruction overhead.
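The effect of register indexing combined with zero-overhead looping can be sketched with a toy model in which a single multiply-accumulate operation sweeps across two indexed register files. The class, its auto-increment behaviour, and the dot-product kernel are hypothetical simplifications for illustration, not the EEC instruction set.

```python
class IndexedRegisterFile:
    """Toy register file with one auto-incrementing index pointer (hypothetical)."""

    def __init__(self, values):
        self.regs = list(values)
        self.index = 0

    def read_next(self):
        """Read the register at the index pointer, then advance the pointer."""
        value = self.regs[self.index]
        self.index = (self.index + 1) % len(self.regs)
        return value


def dot_product(coeffs, samples):
    """Accumulate products using one repeated operation over two register files."""
    xrf_a = IndexedRegisterFile(coeffs)
    xrf_b = IndexedRegisterFile(samples)
    acc = 0
    # The for-loop stands in for a hardware loop counter: the single
    # multiply-accumulate below is the only operation in the loop body.
    for _ in range(len(coeffs)):
        acc += xrf_a.read_next() * xrf_b.read_next()
    return acc


result = dot_product([1, 2, 3], [4, 5, 6])
```

Without indexing, each iteration would need its own instruction naming different registers (an unrolled loop), or extra instructions to update a counter and branch; here the loop body stays a single operation regardless of trip count.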

Instruction Supply & Storage: Efficient control

Parallelism: Supporting DLP, TLP, and ILP efficiently

The basic computation data-path in the EEC is replicated across all the tiles within the hierarchy. This extensive re-use justifies custom design of the ALU, register files, and switches within the data-path. By optimizing the computational units at this level, significant savings can be realized in the data-path compared with standard-cell layout techniques.

Tools

A flexible simulator for the EEC architecture is currently under development along with a multi-level compiler and scheduler.

Scheduling: Power-aware scheduling


Efficient Embedded Computing - Concurrent VLSI Architecture Group - Stanford Department of Electrical Engineering

$Id: index.html,v 1.10 2008/01/16 16:53:50 davidbbs Exp $
