Project Overview


Imagine is designed to achieve the performance of a special-purpose image-processing, signal-processing, or graphics engine with the flexibility of a general-purpose computer. Our cycle-accurate simulations indicate that a single Imagine chip can sustain in excess of 10 GFLOPS (32-bit floating point) or 20 GOPS (16-bit fixed point) on problems ranging from signal processing to image processing to polygon-based graphics. By minimizing unneeded communication, Imagine realizes a power efficiency of over 3 GFLOPS/W on applications of interest, compared with less than 300 MFLOPS/W for the most efficient conventional processors and DSPs. Several Imagine chips can be combined to achieve higher performance: a single circuit board with an 8 x 8 array of Imagine chips can sustain over a TeraOP of performance with a power dissipation of less than 1 kW.
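As a rough check on the board-level claim, the per-chip figures quoted above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes performance scales perfectly across the 64 chips, and it estimates chip power by dividing the quoted throughput by the quoted efficiency, which is our assumption and ignores memory, network, and board overhead.

```python
# Figures quoted in the overview (all are stated as lower bounds).
GOPS_PER_CHIP = 20        # sustained 16-bit fixed-point GOPS per chip
GFLOPS_PER_CHIP = 10      # sustained 32-bit floating-point GFLOPS per chip
GFLOPS_PER_WATT = 3       # quoted power efficiency
BOARD_ROWS, BOARD_COLS = 8, 8

chips = BOARD_ROWS * BOARD_COLS

# Assumption: throughput scales linearly with the number of chips.
board_gops = chips * GOPS_PER_CHIP

# Assumption: chip power is approximately throughput / efficiency.
watts_per_chip = GFLOPS_PER_CHIP / GFLOPS_PER_WATT
board_watts = chips * watts_per_chip

print(f"{chips} chips, about {board_gops / 1000:.2f} TOPS")
print(f"estimated chip power on the board: about {board_watts:.0f} W")
```

Under these assumptions the board reaches about 1.28 TOPS, and the arithmetic chips alone draw a few hundred watts, which is consistent with the stated bounds of over a TeraOP and under 1 kW.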

Imagine achieves flexible performance by using a streaming architecture that exposes the parallelism and locality inherent in many signal- and image-processing applications. Fast stream operations are supported through a combination of vector processing, VLIW arithmetic clusters, a streaming memory system, and conditional vector operations. Vector performance is achieved by interleaving stream elements among the eight computation clusters, each of which contains several distinct arithmetic units. Each Imagine chip has a network interface that allows high-speed communication among Imagine chips, so that multiprocessor solutions can be constructed when a single chip does not provide enough performance for a particular problem.
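The interleaving of stream elements among the eight clusters can be illustrated with a small software model. This is a sketch only, not Imagine's programming interface: the function names `interleave` and `run_kernel` are hypothetical, and the round-robin assignment (element i goes to cluster i mod 8) is our assumption about what "interleaving" means here.

```python
NUM_CLUSTERS = 8  # number of arithmetic clusters, from the text


def interleave(stream, num_clusters=NUM_CLUSTERS):
    """Split a stream into per-cluster lanes (assumed round-robin order)."""
    return [stream[c::num_clusters] for c in range(num_clusters)]


def run_kernel(stream, kernel, num_clusters=NUM_CLUSTERS):
    """Apply the same kernel to every lane, then restore the stream order.

    In hardware the lanes would execute in lockstep; this model simply
    loops over them one after another.
    """
    lanes = interleave(stream, num_clusters)
    results = [[kernel(x) for x in lane] for lane in lanes]

    out = [None] * len(stream)
    for c, lane in enumerate(results):
        out[c::num_clusters] = lane  # de-interleave back into stream order
    return out


if __name__ == "__main__":
    print(interleave(list(range(16)))[0])            # [0, 8]
    print(run_kernel(list(range(10)), lambda x: x * x))
```

Every cluster runs the same kernel on a different slice of the stream, which is the data parallelism the text refers to.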

A large general-purpose stream register file forms the nexus of the chip and is connected to the clusters, the network, the memory system, and a host processor. The stream register file is program controlled and serves as a staging area for data used by the other units. Keeping frequently used data in the stream register file reduces the number of memory accesses.

The Imagine chip is controlled by a host processor, which accesses Imagine control and status registers and issues commands via a host interface. The host interface is also connected to the stream register file, so that data can be loaded into the machine and then sent to the memory system, the network, or any other unit connected to the register file.

Historically, special-purpose processors have offered better cost/performance than programmable processors for three reasons:

They devote a larger fraction of their silicon area to arithmetic units.

They eliminate register and memory bandwidth bottlenecks by wiring arithmetic units directly together and providing many specialized memories.

They eliminate the control overhead of fetching and interpreting instructions.

Imagine matches these advantages while retaining the flexibility of a programmable signal processor. Imagine devotes a large fraction of its silicon to arithmetic, with 48 32-bit floating-point arithmetic units. These units are divided into eight arithmetic clusters, each comprising three adders, two multipliers, and a divide/square-root unit. The arithmetic units support 32-bit integer and floating-point operations and can be subdivided to provide 96 16-bit units or 192 8-bit units. The six units within each cluster operate under VLIW control to exploit instruction-level parallelism. The eight clusters operate in lockstep under common program control to exploit data parallelism. All of the arithmetic units except the divide/square-root unit are fully pipelined.
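The unit counts quoted above follow directly from the cluster organization. The short calculation below reproduces them; the variable and function names are ours, and the assumption that every 32-bit unit splits evenly into narrower sub-units is inferred from the 96 and 192 figures in the text.

```python
CLUSTERS = 8
UNITS_PER_CLUSTER = {"adder": 3, "multiplier": 2, "divide_sqrt": 1}

units_per_cluster = sum(UNITS_PER_CLUSTER.values())   # 6 units per VLIW cluster
units_32bit = CLUSTERS * units_per_cluster            # 48 units in total


def subword_units(width_bits):
    """Number of arithmetic units when each 32-bit unit is subdivided.

    Assumption: each 32-bit unit splits into 32 // width_bits sub-units.
    """
    return units_32bit * (32 // width_bits)


for width in (32, 16, 8):
    print(f"{width:2d}-bit units: {subword_units(width)}")
```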

Imagine overcomes the bandwidth bottlenecks of global register files and memory systems by using a three-level bandwidth hierarchy organized to support stream operations. Streams are transferred between memory and a stream register file (SRF) by a four-bank streaming memory system (2 GB/s) that reorders references to improve bandwidth. Once a stream is loaded from memory, it is typically circulated between the SRF and the arithmetic clusters several times before the result is returned to memory, exploiting the 32 GB/s bandwidth of the SRF. Finally, during a computation kernel, intermediate results are forwarded directly between the local register files associated with the arithmetic units, without needing to return to the global register file, using the 544 GB/s local register bandwidth. On representative benchmark programs, exploiting the locality inherent in stream applications in this manner reduces bandwidth demands on global register ports by a factor of 20 compared to a typical scalar architecture.
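The three bandwidth figures above can be compared as ratios to see how steeply the hierarchy scales. The sketch below uses only the numbers quoted in the text; the level names are our own labels.

```python
# Peak bandwidth of each level of the hierarchy, in GB/s (from the text).
BANDWIDTH_GB_S = {
    "memory": 2,                  # four-bank streaming memory system
    "srf": 32,                    # stream register file
    "local_register_files": 544,  # register files inside the clusters
}

base = BANDWIDTH_GB_S["memory"]
ratios = {level: bw / base for level, bw in BANDWIDTH_GB_S.items()}

for level, ratio in ratios.items():
    print(f"{level:22s} {BANDWIDTH_GB_S[level]:4d} GB/s  ({ratio:g}x memory)")

# Step between adjacent levels of the hierarchy.
srf_over_memory = BANDWIDTH_GB_S["srf"] / BANDWIDTH_GB_S["memory"]
local_over_srf = BANDWIDTH_GB_S["local_register_files"] / BANDWIDTH_GB_S["srf"]
```

The ratio is 1 : 16 : 272 overall, with each level offering roughly 16 to 17 times the bandwidth of the one below it. One way to read this, as an interpretation rather than a claim from the source, is that an application must reuse data heavily at the inner levels to keep the arithmetic units busy.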

Imagine overcomes the performance-limiting effects of conditional operations by sorting streams according to a conditional variable rather than through conditional control flow. These conditional stream operations divide data into homogeneous sets that can then be processed without the overhead of conditional control instructions. Compared to the conventional approaches of branch prediction or predication, conditional stream operations allow very high levels of instruction and data parallelism to be exploited without incurring a large penalty on every unpredictable conditional operation.
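The idea of sorting a stream by a condition, instead of branching per element, can be sketched in software as follows. This is an illustrative model only: the names `conditional_split` and `process` are hypothetical, and the real hardware mechanism is not described in this overview.

```python
def conditional_split(stream, predicate):
    """Partition a stream into two homogeneous streams by a condition."""
    true_stream = [x for x in stream if predicate(x)]
    false_stream = [x for x in stream if not predicate(x)]
    return true_stream, false_stream


def process(stream, predicate, kernel_true, kernel_false):
    """Run a branch-free kernel over each homogeneous stream.

    Neither kernel contains a data-dependent branch; the condition is
    evaluated once, when the stream is split.
    """
    true_stream, false_stream = conditional_split(stream, predicate)
    return ([kernel_true(x) for x in true_stream],
            [kernel_false(x) for x in false_stream])


if __name__ == "__main__":
    samples = [3, -1, 4, -5, 9]
    positives, negatives = process(
        samples,
        predicate=lambda x: x >= 0,
        kernel_true=lambda x: x * 2,    # work done on one class of elements
        kernel_false=lambda x: -x,      # different work for the other class
    )
    print(positives, negatives)         # [6, 8, 18] [1, 5]
```

Note that in this sketch the two result streams do not preserve the original interleaving of elements; an application that needs the original order would have to carry indices along with the data.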

By combining the efficiency, ease of modeling, and level-of-detail advantages of image-based rendering with the high performance of the Imagine architecture, this single-chip processor, with a modest amount of external RAM, will perform high-quality (1024x768), real-time (30 frames/sec) animation of complex, realistic scenes. Target applications include flight simulation, distributed battlefield simulation, walk-throughs of virtual buildings and vehicles, and visualization of terrain databases. The Imagine architecture will also provide order-of-magnitude performance improvements on other image- and signal-processing applications, such as synthetic aperture radar. Imagine should serve as a model for future generations of commercial signal and image processors.

gajh@cva.stanford.edu