ELM Architecture

Overview

Global System

Figure 1: ELM Architecture [taken from CAL, 1/08]

ELM is an embedded processor targeting compute-intensive workloads. We built ELM with the belief that current processors will not be energy-efficient enough to supply the throughput needed by future embedded applications. By studying these workloads, we knew that ELM must take advantage of both task-level parallelism (TLP) and data-level parallelism (DLP). For a known amount of DLP, ASICs are extremely efficient since they can be manufactured for a given spec with little to no overhead. However, every new task requires a new chip. ELM has several features that help us exploit large amounts of DLP while still allowing for programmable TLP.

Our goal in designing ELM is simple: eliminate all overhead. Starting from the ground up, we specified the computational units in each small Ensemble Processor (EP). We opted to focus exclusively on integer and fixed-point programs since the embedded applications that we are targeting do not require floating point. Next, we looked for the means of supplying the ALUs with data as quickly and efficiently as possible. We also designed an instruction delivery system that we were able to keep small and low-energy because of the small size of application kernels. We chose to make ELM a tiled architecture, as shown above. Each of the tiles, called an Ensemble, consists of four EPs and a centralized memory. By focusing on eliminating control and supply overheads, we are able to achieve a 23x energy gain over a RISC processor at the same throughput across several embedded kernels. The breakdown of that energy gain is shown in the figure below.

Energy Savings

Figure 2: ELM vs. RISC Energy Consumption [CAL, 1/08]

Data Supply

Data Supply Energy

Figure 3: Data Supply Energies [CAL, 1/08]

In ELM's architecture, we have implemented several features to reduce data supply energy consumption without compromising throughput. An overview of some of these features is as follows:

  • Separate computation and memory execution pipelines: Using a Very Long Instruction Word (VLIW), we can issue one instruction to each pipeline every cycle. Having these two paths increases throughput by moving data in parallel with computation. In a simple pipeline, by contrast, loads and stores are interleaved with computation instructions, increasing the number of instructions fetched. Our memory pipeline is used for communication and for setting up vector loads/stores in addition to scalar loads and stores.
  • A hierarchical organization of register files: By keeping the data temporally and physically closer to the point of consumption, we reduce overhead from the movement of data bits. Each EP, for example, contains a general purpose register file that is accessed in the decode stage and a much smaller (4-entry) Operand Register File (ORF) that can be accessed in the execute stage. By keeping the ORF small, placing it close to the ALU, and accessing it in the execute stage, we dissipate only a small amount of energy when loading operands from it.
  • Vector registers: In order to reduce the number of instructions needed to move data into the EPs, we can use vector loads and stores. These vector operations maximize throughput by streaming in data across the global network.
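
As a rough illustration of the first two points, the sketch below counts instruction fetches with and without a separate memory pipeline and estimates operand-read energy for a two-level register hierarchy. It is a toy model, not ELM's simulator: the function names, the ideal-pairing assumption, and the per-access energy values passed in are all hypothetical placeholders rather than measured numbers.

```python
# Toy model only. Names, the ideal-scheduling assumption, and any energy
# values supplied by the caller are illustrative, not ELM measurements.

def single_issue_fetches(alu_ops: int, mem_ops: int) -> int:
    """A simple pipeline fetches one instruction per ALU or memory operation."""
    return alu_ops + mem_ops


def dual_pipeline_fetches(alu_ops: int, mem_ops: int) -> int:
    """A two-slot instruction word pairs one ALU op with one memory op.

    Assumes ideal scheduling: every operation that can be paired is paired,
    so the longer of the two streams determines the fetch count.
    """
    return max(alu_ops, mem_ops)


def operand_read_energy(reads: int, orf_fraction: float,
                        e_orf: float, e_grf: float) -> float:
    """Total operand-read energy when a fraction of reads is served by the ORF.

    e_orf and e_grf are per-access energies in arbitrary units.
    """
    orf_reads = reads * orf_fraction
    grf_reads = reads - orf_reads
    return orf_reads * e_orf + grf_reads * e_grf


if __name__ == "__main__":
    print(single_issue_fetches(100, 60))    # fetches in a simple pipeline
    print(dual_pipeline_fetches(100, 60))   # fetches with paired slots
    # Placeholder energies: ORF access assumed 4x cheaper than GRF access.
    print(operand_read_energy(1000, 0.5, e_orf=1.0, e_grf=4.0))
```

With these placeholder inputs, a kernel of 100 compute and 60 memory operations needs 160 fetches in the simple pipeline and 100 with paired slots; the real benefit depends on how well a given kernel's loads and stores overlap with its computation.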

Instruction Supply

Instruction Supply Energy

Figure 4: Instruction Supply Energies [CAL, 1/08]

In order to efficiently deliver instructions and control signals to computation blocks, we have utilized the following ideas:

  • Instruction Registers (IRs): This is one of our most important features. Instead of fetching instructions out of an instruction cache and hardware-managed memory hierarchy, we execute out of much smaller (64-entry) software-managed IRs. Because of their size, these small SRAM arrays have a much lower energy per access than a typical instruction cache and can also be placed physically near the pipelines. Since embedded kernels are often small, we expect that loading and reloading these IRs will not use much of the overall computation time and energy. For more information on our IRs, please see our July CAL paper.
  • Execution Modes: ELM supports both MIMD and SIMD modes (Multiple/Single Instruction, Multiple Data). In MIMD operation, each EP in an Ensemble executes out of its own IRs, allowing for TLP. In SIMD operation, all four EPs execute the same instruction to take advantage of DLP with a lower instruction issue cost.
  • Communication: In order to transmit data between EPs, we have designed a communication system that keeps with the theme of dissipating only as much energy as necessary. For example, we have designed separate mechanisms for local and global communication. We are also able to use statically routed channels for streaming data between Ensembles.
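
To make the IR and execution-mode points concrete, here is a minimal counting sketch. The 64-entry IR capacity and the four EPs per Ensemble come from the text above; everything else is an assumption made for illustration: the function names are hypothetical, SIMD mode is modeled as one IR read shared by all four EPs, and loading one instruction into the IRs is assumed to cost the same as executing one.

```python
# Counting sketch under stated assumptions; not a model of ELM's actual hardware.

IR_ENTRIES = 64        # IR capacity stated in the text
EPS_PER_ENSEMBLE = 4   # EPs per Ensemble stated in the text


def fits_in_irs(kernel_instructions: int) -> bool:
    """True if the whole kernel can stay resident in the IRs."""
    return kernel_instructions <= IR_ENTRIES


def instruction_reads(kernel_instructions: int, iterations: int,
                      simd: bool) -> int:
    """IR reads for one Ensemble running a loop kernel on all four EPs.

    Assumption: in SIMD mode one IR read is shared by all EPs, while in
    MIMD mode each EP reads from its own IRs.
    """
    per_ep = kernel_instructions * iterations
    return per_ep if simd else per_ep * EPS_PER_ENSEMBLE


def ir_load_fraction(kernel_instructions: int, iterations: int) -> float:
    """Fraction of instruction slots spent loading the IRs.

    Assumption: the kernel fits in the IRs, is loaded once, and loading one
    instruction costs the same as executing one.
    """
    loads = kernel_instructions
    executed = kernel_instructions * iterations
    return loads / (loads + executed)


if __name__ == "__main__":
    print(fits_in_irs(32))                          # kernel stays resident
    print(instruction_reads(32, 100, simd=False))   # MIMD read count
    print(instruction_reads(32, 100, simd=True))    # SIMD read count
    print(ir_load_fraction(32, 99))                 # load overhead fraction
```

Under these assumptions, a 32-instruction kernel run for 100 iterations costs a quarter as many IR reads in SIMD mode as in MIMD mode, and the one-time load becomes a small fraction of the work as the iteration count grows.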