Stream Architecture


The Imagine Stream Architecture

Overview

Imagine is a programmable single-chip processor that supports the stream programming model. The figure to the right shows a block diagram of the Imagine stream processor. The Imagine architecture supports 48 ALUs organized as 8 SIMD clusters. Each cluster contains 6 ALUs and several local register files, and executes completely static VLIW instructions. The stream register file (SRF) is the nexus for data transfers on the processor. The memory system, arithmetic clusters, host interface, microcontroller, and network interface all interact by transferring streams to and from the SRF.
Imagine is a coprocessor that is programmed at two levels: the kernel level and the stream level. Kernel functions are coded using KernelC, whose syntax is based on the C language. Kernels may access local variables, read input streams, and write output streams, but may not make arbitrary memory references. Kernels are compiled into microcode programs that sequence the units within the arithmetic clusters to carry out the kernel function on successive stream elements. Kernel programs are loaded into the microcontroller's control store by loading streams from the SRF. At the application level, Imagine is programmed in StreamC. StreamC provides basic functions for manipulating streams and for passing streams between kernel functions. See the Imagine Stream Programming page for more information on the stream programming model.
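The split between the two levels can be illustrated with a minimal sketch. It is written in Python rather than KernelC or StreamC, whose syntax is not shown on this page, and the names scale_kernel, offset_kernel, and stream_program are hypothetical. Each kernel only reads its input stream and writes an output stream, and the stream-level function only passes whole streams between kernels.

```python
from typing import Iterable, List


def scale_kernel(in_stream: Iterable[float], gain: float) -> List[float]:
    """Kernel level: read one input stream, write one output stream.

    The kernel touches only its parameters and stream elements,
    mirroring the rule that kernels make no arbitrary memory references.
    """
    return [x * gain for x in in_stream]


def offset_kernel(in_stream: Iterable[float], bias: float) -> List[float]:
    """A second hypothetical kernel that adds a constant to every element."""
    return [x + bias for x in in_stream]


def stream_program(samples: List[float]) -> List[float]:
    """Stream level: chain kernels by handing whole streams from one to the next."""
    scaled = scale_kernel(samples, gain=2.0)
    return offset_kernel(scaled, bias=1.0)


result = stream_program([1.0, 2.0, 3.0])
```

In this sketch the intermediate stream (scaled) plays the role of data that would stay in the SRF between kernels instead of returning to memory.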

Bandwidth Hierarchy

One of the key architectural innovations of the Imagine architecture is the three-tiered storage bandwidth hierarchy. The bandwidth hierarchy enables the architecture to provide the instruction and data bandwidth necessary to efficiently operate 48 ALUs in parallel. The hierarchy consists of a streaming memory system (2.1GB/s), a 128KB stream register file (25.6GB/s), and direct forwarding of results among arithmetic units via local register files (435GB/s). Using this hierarchy to exploit the parallelism and locality of streaming media applications, Imagine is able to sustain performance of up to 18.3GOPS on key applications. This performance is comparable to that of special-purpose processors, yet Imagine is still easily programmable for a wide range of applications. Imagine is designed to fit on a 2.56cm^2 0.18um CMOS chip and to operate at 400MHz.
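The step between tiers is easier to appreciate as ratios. The short calculation below uses only the three bandwidth figures quoted above; the variable names are illustrative.

```python
# Bandwidth of each tier in GB/s, as quoted in the text.
MEMORY_GBPS = 2.1   # streaming memory system
SRF_GBPS = 25.6     # stream register file
LRF_GBPS = 435.0    # local register files in the cluster array

# How much more bandwidth each tier offers than the one below it.
srf_over_memory = SRF_GBPS / MEMORY_GBPS
lrf_over_srf = LRF_GBPS / SRF_GBPS
lrf_over_memory = LRF_GBPS / MEMORY_GBPS

print(f"SRF / memory: {srf_over_memory:.1f}x")
print(f"LRF / SRF:    {lrf_over_srf:.1f}x")
print(f"LRF / memory: {lrf_over_memory:.0f}x")
```

Each tier provides roughly an order of magnitude more bandwidth than the one below it, so an application keeps the ALUs busy only if most of its data references are satisfied from the upper tiers.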

Stream Register File (SRF)

The SRF is a 128KB memory organized to handle streams. The SRF can hold any number of streams of any length. The only limitation is the actual size of the SRF. Streams are referenced using a stream descriptor, which includes a base address in the SRF, a stream length, and the record size of data elements in the stream.
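A stream descriptor can be pictured as a small record with the three fields listed above. The sketch below is an illustration, not the hardware format: the class name, the choice of words as the unit for the base address and record size, and the 4-byte word are all assumptions.

```python
from dataclasses import dataclass

SRF_BYTES = 128 * 1024               # SRF capacity from the text
WORD_BYTES = 4                       # assumption: 32-bit words
SRF_WORDS = SRF_BYTES // WORD_BYTES  # 32768 words under that assumption


@dataclass(frozen=True)
class StreamDescriptor:
    base: int         # base address in the SRF, in words (assumed unit)
    length: int       # stream length, in records (assumed unit)
    record_size: int  # words per record (assumed unit)

    def size_words(self) -> int:
        """Total SRF space the stream occupies."""
        return self.length * self.record_size

    def fits_in_srf(self) -> bool:
        """The only limit on a stream is the physical size of the SRF."""
        return self.base >= 0 and self.base + self.size_words() <= SRF_WORDS


pixels = StreamDescriptor(base=0, length=1024, record_size=4)
```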
An array of 22 64-word stream buffers is used to allow read or write access to 22 stream clients simultaneously. The clients are the units which access streams out of the SRF, such as the memory system, network interface, and arithmetic clusters. The internal memory array is 32 words wide, allowing it to fill or drain half of one stream buffer every two cycles, providing a total bandwidth of 25.6GB/s for all 22 streams.
Each stream client may access its dedicated stream buffer every cycle if there is data available to be read or space available to be written. The eight stream buffers serving the clusters are accessed eight words at a time, one word per cluster. The eight stream buffers serving the network interface are accessed two words at a time. The other six stream buffers are accessed a single word at a time. The peak bandwidth of the stream buffers is therefore 86 words per cycle, allowing peak stream demand to exceed the SRF bandwidth during short transients. Stream buffers are bidirectional, but may only be used in a single direction for the duration of each logical stream transfer.
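The bandwidth figures in this section can be reproduced with a few lines of arithmetic. The sketch assumes 32-bit (4-byte) words and the 400MHz clock quoted in the Bandwidth Hierarchy section; the variable names are illustrative.

```python
CLOCK_HZ = 400e6   # design clock rate from the text
WORD_BYTES = 4     # assumption: 32-bit words

# The memory array moves 32 words (half a 64-word buffer) every two cycles.
array_words_per_cycle = 32 / 2
srf_gbps = array_words_per_cycle * WORD_BYTES * CLOCK_HZ / 1e9

# Peak demand on the client side of the 22 stream buffers.
cluster_words = 8 * 8   # eight buffers, eight words at a time
network_words = 8 * 2   # eight buffers, two words at a time
other_words = 6 * 1     # six buffers, one word at a time
peak_client_words = cluster_words + network_words + other_words

# Ratio of peak client demand to what the array can supply per cycle.
oversubscription = peak_client_words / array_words_per_cycle
```

Under these assumptions the array supplies 16 words per cycle (25.6GB/s) while clients can request up to 86 words per cycle, so the stream buffers absorb a peak demand more than five times the sustained SRF rate.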

Memory System

As described above, all Imagine memory references are made using stream load and store instructions that transfer an entire stream between memory and the SRF. This stream load/store architecture is similar in concept to the scalar load/store architecture of contemporary RISC processors. It simplifies programming and allows the memory system to be optimized for stream throughput, rather than the throughput of individual, independent accesses. The memory system provides 2.1GB/s of bandwidth to off-chip SDRAM storage via four independent 32-bit wide SDRAM banks operating at 143MHz. The system can perform two simultaneous stream memory transfers. To support these simultaneous transfers, four streams (two index streams and two data streams) connect the memory system to the SRF. Imagine addressing modes support sequential, constant stride, indexed (scatter/gather), and bit-reversed accesses on a record-by-record basis.
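The four addressing modes differ only in how the sequence of record addresses is generated. The functions below are a sketch of that idea: the function names, the use of word addresses, and the parameter units are assumptions, and the hardware's actual address generation is not described here.

```python
def sequential(base, n, record_words):
    """Records stored back to back starting at base."""
    return [base + i * record_words for i in range(n)]


def strided(base, n, stride):
    """Records separated by a constant stride (in words)."""
    return [base + i * stride for i in range(n)]


def indexed(base, index_stream, record_words):
    """Gather/scatter: an index stream selects which records to touch."""
    return [base + i * record_words for i in index_stream]


def bit_reversed(base, n, record_words):
    """Record i is found at the bit-reversed index of i (n a power of two)."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    addresses = []
    for i in range(n):
        reversed_index = int(format(i, "0{}b".format(bits))[::-1], 2)
        addresses.append(base + reversed_index * record_words)
    return addresses
```

In each case the address is computed once per record, and the words of a record are then transferred together, which is what "on a record-by-record basis" means in this sketch.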

Cluster Array

Eight arithmetic clusters, controlled by a single microcontroller, perform kernel computations on streams of data. Each cluster operates on one record of a stream so that eight records are processed simultaneously. As shown in the figure to the right, each cluster includes three adders, two multipliers, one divide/square root unit, one 128-entry scratch-pad register file, and one intercluster communication unit. This mix of arithmetic units is well suited to our experimental kernels. However, the architectural concept is compatible with other mixes and types of arithmetic units within the clusters.
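The SIMD execution pattern can be sketched as follows: the same kernel operation is applied to eight records at a time, one per cluster. The function names are hypothetical, and the handling of a final partial group of records is an assumption made for the sketch.

```python
NUM_CLUSTERS = 8  # from the text


def run_kernel_simd(kernel, stream):
    """Apply kernel to a stream in groups of NUM_CLUSTERS records.

    Within a group, cluster c handles the c-th record, and every
    cluster performs the same operation (single microcontroller).
    """
    output = []
    for start in range(0, len(stream), NUM_CLUSTERS):
        group = stream[start:start + NUM_CLUSTERS]
        output.extend(kernel(record) for record in group)
    return output


def groups_needed(num_records):
    """Number of eight-record groups required to cover the stream."""
    return -(-num_records // NUM_CLUSTERS)  # ceiling division


squares = run_kernel_simd(lambda r: r * r, list(range(10)))
```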
Each input of every functional unit in the cluster is fed by a separate local register file (LRF). These local register files store kernel constants, parameters, and local variables, reducing the required SRF bandwidth. Each cluster has 17 LRFs (4 32-entry and 13 16-entry LRFs) for a total of 272 words per cluster and 2176 words across the eight clusters. Each local register file has one read port and one write port. The 17 local register files collectively provide 54.4 GB/s of peak data bandwidth per cluster, for a total bandwidth of 435 GB/s within the cluster array.
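The LRF bandwidth figures can be cross-checked with a little arithmetic. Assuming 32-bit words and the 400MHz design clock, 54.4 GB/s corresponds to 34 word-wide ports per cluster, which matches 17 register files (4 + 13) with one read port and one write port each. The variable names below are illustrative.

```python
CLOCK_HZ = 400e6   # design clock rate from the text
WORD_BYTES = 4     # assumption: 32-bit words
NUM_CLUSTERS = 8

lrfs_per_cluster = 4 + 13   # 32-entry plus 16-entry register files
ports_per_lrf = 2           # one read port and one write port

# Every port can move one word per cycle at peak.
per_cluster_gbps = (
    lrfs_per_cluster * ports_per_lrf * WORD_BYTES * CLOCK_HZ / 1e9
)
cluster_array_gbps = per_cluster_gbps * NUM_CLUSTERS
```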
Additional storage is provided by a 256-word scratch-pad register file, the second unit from the right in the figure. It can be indexed with a base address specified in the instruction word and an offset specified in a local register. The scratch-pad allows for coefficient storage, short arrays, small lookup tables, and some local register spilling.
The intercluster communication unit, labelled CU in the figure, allows data to be transferred between clusters over the intercluster network using arbitrary communication patterns. The communication units are useful for kernels such as the Fast Fourier Transform, where interaction is required between adjacent stream elements.
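An intercluster transfer can be modeled as a permutation: each cluster names the cluster whose value it wants to receive. The sketch below uses an XOR partner pattern, which is a common FFT butterfly exchange and is chosen here only as an illustration; the function names are hypothetical, and the hardware supports arbitrary patterns.

```python
NUM_CLUSTERS = 8  # from the text


def intercluster_permute(values, source_of):
    """values[c] is the word held by cluster c.

    Cluster c receives the word held by cluster source_of[c].
    """
    return [values[source] for source in source_of]


def butterfly_sources(distance, n=NUM_CLUSTERS):
    """Each cluster pairs with the cluster whose index differs in one bit."""
    return [c ^ distance for c in range(n)]


held = list("abcdefgh")
swapped_neighbours = intercluster_permute(held, butterfly_sources(1))
swapped_halves = intercluster_permute(held, butterfly_sources(4))
```

With distance 1 adjacent clusters trade values, and with distance 4 the lower and upper halves of the cluster array trade values, which are the exchanges successive butterfly stages would need in this model.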
The adders and multipliers are fully pipelined and perform single precision floating point arithmetic, 32-bit integer arithmetic, and 16-bit or 8-bit parallel subword integer operations, as found in MMX and other multimedia extensions. The divide/square root unit is not pipelined and operates only on single precision floating point and 32-bit integers. The divider can support two simultaneous operations, with latencies ranging from 16 to 23 cycles depending on the operation and data type. The 48 total arithmetic units, six units replicated across eight clusters, provide a peak computation rate of over 16GOPS for both single precision floating point and 32-bit integer arithmetic.

Network Interface

The network interface connects the SRF to four bidirectional links (400MB/s per link in each direction) that can be configured in an arbitrary topology to interconnect Imagine processors. A send instruction executed on the source Imagine processor reads a stream from the SRF and directs it onto one of the links and through the network as specified by a routing header. At the destination Imagine processor, a receive instruction directs the arriving stream into the SRF. The send and receive instructions both specify a tag to allow a single node to discriminate between multiple arriving messages.
Using the stream model, it is easy to partition an application over multiple Imagine processors using the network. To partition an application across two processors, the application is adapted by dividing the stream-level code across the two processors, inserting a send instruction at one end, and inserting a receive instruction at the other.
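The partitioning step can be sketched with a toy model of one link. The Link class, its method names, and the two kernels are hypothetical; only the idea of tagged send and receive operations comes from the text, and routing headers are left out.

```python
from collections import defaultdict, deque


class Link:
    """Toy model of a network link carrying tagged streams."""

    def __init__(self):
        self._arrived = defaultdict(deque)

    def send(self, tag, stream):
        """Source side: read a stream from the SRF and put it on the link."""
        self._arrived[tag].append(list(stream))

    def receive(self, tag):
        """Destination side: take the oldest arrived stream with this tag."""
        return self._arrived[tag].popleft()


def first_half(samples, link):
    """Runs on processor A: one kernel, then a send in place of the next kernel."""
    filtered = [x * 2 for x in samples]
    link.send(tag=7, stream=filtered)


def second_half(link):
    """Runs on processor B: a receive in place of the previous kernel."""
    filtered = link.receive(tag=7)
    return [x + 1 for x in filtered]


link = Link()
first_half([1, 2, 3], link)
output = second_half(link)
```

The tag lets the receiver pick out the stream it expects even if streams from other senders arrive on the same node first.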

Stream Controller

A host processor issues stream-level instructions to Imagine with encoded dependency information. The stream controller buffers these instructions in a scoreboard and issues them when their resource requirements and dependency constraints are satisfied.
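The issue rule can be sketched as a small scheduling loop. This is an illustration only: the instruction names, the resource names, the one-instruction-per-resource rule, and the assumption that every instruction finishes within one step are all hypothetical simplifications, not the actual scoreboard design.

```python
def schedule(instructions):
    """Group instructions into issue steps.

    instructions: list of (name, resource, dependencies) in program order.
    An instruction issues once all of its dependencies have completed
    and its resource is not already taken in the current step.
    """
    completed = set()
    pending = list(instructions)
    steps = []
    while pending:
        busy = set()
        issued = []
        for name, resource, deps in pending:
            if resource not in busy and set(deps) <= completed:
                busy.add(resource)
                issued.append(name)
        if not issued:
            raise ValueError("dependencies can never be satisfied")
        completed.update(issued)  # assumed: all issued work finishes in the step
        pending = [ins for ins in pending if ins[0] not in completed]
        steps.append(issued)
    return steps


program = [
    ("load_a", "mem0", []),
    ("load_b", "mem1", []),
    ("kernel", "clusters", ["load_a", "load_b"]),
    ("store", "mem0", ["kernel"]),
]
issue_steps = schedule(program)
```

In this example the two loads issue together because they use different memory stream resources, the kernel waits for both, and the store waits for the kernel.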

Host Interface

The host interface allows an Imagine processor to be mapped into the host processor's address space, so the host processor can read and write Imagine memory. The host processor also executes programs that issue the appropriate stream-level instructions to the Imagine processor. These instructions are written to special memory-mapped locations in the host interface.

gajh@cva.stanford.edu