Imagine is designed to achieve the performance of a special-purpose
image-processing, signal-processing, or graphics engine with the flexibility of
a general-purpose computer. Our
cycle-accurate simulations indicate that a single Imagine chip is able to
sustain in excess of 10GFLOPS (32-bit floating point) or 20GOPS (16-bit fixed
point) on problems ranging from signal processing and image processing to
polygon-based graphics. By minimizing unneeded communication, Imagine is able to
realize a power efficiency of over 3GFLOPS/W on applications of interest
compared to less than 300MFLOPS/W for the most efficient conventional processors
and DSPs. Several Imagine chips can
be combined to achieve higher performance. A single circuit board with an 8 x 8 array of Imagine chips can
sustain over a TeraOP of performance with a power dissipation of less than 1kW.
Imagine achieves flexible performance by using a streaming architecture that
exposes the parallelism and locality inherent in many signal- and
image-processing applications. Fast
stream operations are supported through a combination of vector processing, VLIW
arithmetic clusters, a streaming memory system, and conditional vector
operations. Vector
performance is achieved by interleaving stream elements among the eight computation
clusters, each containing six arithmetic units. Each Imagine chip has a
network interface to allow high-speed communication among Imagine chips, so that
multiprocessor solutions can be constructed when a single chip does not provide enough
performance for a particular problem.
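
The interleaving of stream elements among the eight clusters can be illustrated with a minimal software sketch. It assumes a simple round-robin assignment (element i goes to cluster i mod 8), which the text does not spell out; the names `interleave` and `run_kernel` are hypothetical and do not correspond to any Imagine tool or API.

```python
NUM_CLUSTERS = 8  # from the text: eight computation clusters


def interleave(stream, num_clusters=NUM_CLUSTERS):
    """Assumed round-robin layout: element i goes to cluster i % num_clusters."""
    return [stream[c::num_clusters] for c in range(num_clusters)]


def run_kernel(stream, kernel, num_clusters=NUM_CLUSTERS):
    """Apply the same kernel in every cluster, then restore stream order."""
    lanes = interleave(stream, num_clusters)
    # Every cluster executes the same operation on its own slice of the stream.
    results = [[kernel(x) for x in lane] for lane in lanes]
    out = [None] * len(stream)
    for c, lane in enumerate(results):
        out[c::num_clusters] = lane
    return out


stream = list(range(20))
lanes = interleave(stream)
squared = run_kernel(stream, lambda x: x * x)
```

Here the clusters are modeled as plain Python lists processed one after another; on the chip the eight clusters would work on their elements at the same time.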
A large general-purpose stream register file forms the nexus of the chip, and is
connected to the clusters, the network, the memory system, and a host processor. The
stream register file is program-controlled, and serves as a staging area for data that is
used by the other units. The number of memory accesses is reduced by keeping frequently
used data in the stream register file.
The Imagine chip is controlled by a host processor, which accesses Imagine's control and
status registers and issues commands via a host interface. The host interface is also
connected to the stream register file, so that data may be loaded into the machine and
then sent to the memory system, network, or any other unit connected to the register file.
Historically, special-purpose processors have offered better cost-performance than
programmable processors for three reasons:
- They devote a larger fraction of their silicon area to arithmetic units.
- They eliminate register and memory bandwidth bottlenecks by wiring arithmetic units
directly together and providing many specialized memories.
- They eliminate the control overhead of fetching and interpreting instructions.
Imagine matches these advantages while retaining the flexibility of a
programmable signal processor. Imagine devotes a large fraction of its silicon area to 48
32-bit floating-point arithmetic units. These
units are divided into eight arithmetic clusters each comprising three adders,
two multipliers, and a divide/square-root unit. The arithmetic units support 32-bit integer and
floating-point operations and can be subdivided to provide 96 16-bit units or
192 8-bit units. The six units
within each cluster are operated under VLIW control to exploit instruction-level
parallelism. The eight clusters
operate in lockstep under common program control to exploit data parallelism. All of the arithmetic units except the divide/square-root
unit are fully pipelined.
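
The unit counts quoted above follow from the cluster organization, as the short check below shows. The helper `subdivided_units` is a hypothetical name, and it assumes that a 32-bit unit splits evenly into narrower units.

```python
CLUSTERS = 8
# Per-cluster unit mix as given in the text.
UNITS_PER_CLUSTER = {"adder": 3, "multiplier": 2, "divide_sqrt": 1}

units_per_cluster = sum(UNITS_PER_CLUSTER.values())
units_32bit = CLUSTERS * units_per_cluster


def subdivided_units(width_bits, base_units=units_32bit, base_width=32):
    """Number of narrower units if each 32-bit unit splits evenly (assumption)."""
    return base_units * (base_width // width_bits)


units_16bit = subdivided_units(16)
units_8bit = subdivided_units(8)
```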
Imagine overcomes the bandwidth bottlenecks of global
register files and memory systems by using a three-level bandwidth hierarchy
organized to support stream operations. Streams
are transferred between memory and a stream register file (SRF) by a four-bank
streaming memory system (2GB/s) that reorders references to improve bandwidth.
Once a stream is loaded from memory, it is typically circulated between
the SRF and the arithmetic clusters several times before the result is returned to
memory, exploiting the 32GB/s bandwidth of the SRF.
Finally, during a computation kernel, intermediate results are forwarded
directly between local register files associated with the arithmetic units
without needing to return to the global register file, using the 544GB/s local
register bandwidth. On
representative benchmark programs, exploiting the locality inherent in stream
applications in this manner reduces bandwidth demands on global register ports
by a factor of 20 compared to a typical scalar architecture.
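
The ratios between the three levels of the bandwidth hierarchy can be worked out directly from the figures above. The sketch below also includes a toy traffic model, `memory_traffic`, which is a hypothetical helper and not a measurement: it assumes a chain of kernels in which every kernel produces an output stream the same size as its input.

```python
# Bandwidth figures quoted in the text, in GB/s.
MEMORY_BW = 2
SRF_BW = 32
LOCAL_BW = 544

srf_to_memory = SRF_BW / MEMORY_BW
local_to_srf = LOCAL_BW / SRF_BW
local_to_memory = LOCAL_BW / MEMORY_BW


def memory_traffic(stream_bytes, num_kernels, staged_in_srf):
    """Bytes moved to and from memory for a chain of kernels (toy model)."""
    if staged_in_srf:
        # Load the input stream once and store the final result once;
        # intermediate streams stay in the stream register file.
        return 2 * stream_bytes
    # Otherwise every kernel reads its input from memory and writes it back.
    return 2 * stream_bytes * num_kernels


with_srf = memory_traffic(1024, 4, staged_in_srf=True)
without_srf = memory_traffic(1024, 4, staged_in_srf=False)
```

Under these assumptions, a four-kernel chain staged in the SRF moves a quarter of the memory traffic of the unstaged version, which is the kind of saving the hierarchy is meant to capture.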
Imagine overcomes the performance-limiting effects of
conditional operations by sorting streams according to a conditional variable
rather than through conditional control flow.
These conditional stream operations divide data into homogeneous sets that
can then be processed without the overhead of conditional control instructions. Compared to conventional approaches of branch prediction or
predication, conditional stream operations enable very high levels of
instruction and data parallelism to be exploited without incurring a large
penalty on every unpredictable conditional operation.
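
The idea of sorting a stream by a condition can be sketched in software as follows. This is only an illustration of the concept, not Imagine's hardware mechanism; `conditional_split` and `process` are hypothetical names, and the two kernels (negate, double) are arbitrary examples.

```python
def conditional_split(stream, predicate):
    """Route each element to one of two output streams based on a condition."""
    true_stream, false_stream = [], []
    for x in stream:
        (true_stream if predicate(x) else false_stream).append(x)
    return true_stream, false_stream


def process(stream):
    """Split once, then run a branch-free kernel over each homogeneous set."""
    negatives, non_negatives = conditional_split(stream, lambda x: x < 0)
    # Each kernel below is straight-line code with no data-dependent branch.
    flipped = [-x for x in negatives]
    doubled = [x * 2 for x in non_negatives]
    return flipped, doubled


flipped, doubled = process([3, -1, 4, -5, 0])
```

The condition is evaluated once per element during the split, so the kernels that follow never test it again.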
By exploiting the efficiency, ease of modeling, and level-of-detail advantages of
image-based rendering and the high performance of the Imagine architecture,
this single-chip processor, with a modest amount of external RAM, will perform high-quality
(1024x768), real-time (30 frames/sec) animation of complex, realistic scenes in
support of applications such as flight simulation, distributed battlefield simulation,
walk-throughs of virtual buildings and vehicles, and visualization of terrain databases.
The Imagine architecture will also provide order-of-magnitude performance improvements on
other image and signal processing applications, such as synthetic aperture radar. Imagine
should serve as a model for future generations of commercial signal and image processors.