Eight arithmetic clusters, controlled by a single microcontroller,
perform kernel computations on streams of data. Each cluster
operates on one record of a stream so that eight records are
processed simultaneously. As shown in the figure to the right, each
cluster includes three adders, two multipliers, one divide/square
root unit, one 256-entry scratch-pad register file, and one
intercluster communication unit. This mix of arithmetic units is
well suited to our experimental kernels. However, the architectural
concept is compatible with other mixes and types of arithmetic units
within the clusters.
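
The one-record-per-cluster execution model described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware's actual control flow: the names `run_kernel` and `scale_and_offset` are hypothetical, and the loop processes a batch sequentially where the hardware would run all eight clusters in lockstep.

```python
NUM_CLUSTERS = 8  # from the text: eight arithmetic clusters


def run_kernel(kernel, stream):
    """Apply `kernel` to every record of `stream`, eight records per step."""
    output = []
    for start in range(0, len(stream), NUM_CLUSTERS):
        batch = stream[start:start + NUM_CLUSTERS]
        # Conceptually, cluster i handles batch[i]; every cluster executes
        # the same instruction sequence issued by the single microcontroller.
        output.extend(kernel(record) for record in batch)
    return output


def scale_and_offset(record):
    # Hypothetical kernel, chosen only for illustration.
    return 2 * record + 1


result = run_kernel(scale_and_offset, list(range(10)))
```

A stream of ten records takes two steps in this model: the first step fills all eight clusters, and the second uses only two of them.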

Each input of every functional unit in the cluster is fed by a
separate local register file (LRF). These local register files store
kernel constants, parameters, and local variables, reducing the
required SRF bandwidth. Each cluster has 15 LRFs (2 32-entry and 13
16-entry LRFs) for a total of 272 words per cluster and 2176 words
across the eight clusters. Each local register file has one read
port and one write port. The 15 local register files collectively
provide 54.4 GB/s of peak data bandwidth per cluster, for a total
bandwidth of 435 GB/s within the cluster array.
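
As a quick check, the array-wide totals follow from the per-cluster figures by multiplying by the number of clusters. The short sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
NUM_CLUSTERS = 8
LRF_WORDS_PER_CLUSTER = 272        # words of LRF storage per cluster
LRF_BW_PER_CLUSTER_GBS = 54.4      # peak LRF bandwidth per cluster, GB/s

# Scale the per-cluster figures up to the whole cluster array.
total_lrf_words = LRF_WORDS_PER_CLUSTER * NUM_CLUSTERS      # 2176 words
total_lrf_bw_gbs = LRF_BW_PER_CLUSTER_GBS * NUM_CLUSTERS    # about 435 GB/s
```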

Additional storage is provided by a 256-word scratch-pad register
file, the second unit from the right in the figure. It can be indexed
with a base address specified in the instruction word and an offset
specified in a local register. The scratch-pad allows for
coefficient storage, short arrays, small lookup tables, and some
local register spilling.
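
The base-plus-offset indexing can be modelled as follows. This is a minimal sketch, not the hardware's definition: the class name `ScratchPad` and the wrap-around behavior for out-of-range addresses are assumptions made for illustration, since the text does not say how such addresses are handled.

```python
class ScratchPad:
    SIZE = 256  # words, per the text

    def __init__(self):
        self.words = [0] * self.SIZE

    def _index(self, base, offset):
        # `base` models the address field of the instruction word and
        # `offset` models a value held in a local register.
        # Assumption: the sum wraps around the size of the register file.
        return (base + offset) % self.SIZE

    def write(self, base, offset, value):
        self.words[self._index(base, offset)] = value

    def read(self, base, offset):
        return self.words[self._index(base, offset)]


# Store a small lookup table (squares) starting at base address 16.
sp = ScratchPad()
TABLE_BASE = 16
for i, square in enumerate([0, 1, 4, 9]):
    sp.write(TABLE_BASE, i, square)

# The offset would normally be computed at run time by the kernel.
looked_up = sp.read(TABLE_BASE, 3)
```

Keeping the base in the instruction and the offset in a register lets one kernel instruction index a different table entry in each cluster, which is what makes small lookup tables and short arrays practical.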

The intercluster communication unit, labelled *CU* in the
figure, allows data to be transferred between clusters over the
intercluster network using arbitrary communication patterns. The
communication units are useful for kernels such as the Fast Fourier
Transform, where interaction is required between adjacent stream
elements.
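
One way to picture an arbitrary communication pattern is to give each cluster the index of the cluster it reads from. The sketch below is an assumed software model (the function `communicate` and the pairwise-exchange pattern are illustrative, not taken from the hardware description); it shows a butterfly-style exchange of the kind an FFT stage might use.

```python
NUM_CLUSTERS = 8


def communicate(values, source_of):
    """Return the list in which cluster i receives values[source_of[i]]."""
    if len(values) != NUM_CLUSTERS or len(source_of) != NUM_CLUSTERS:
        raise ValueError("expected one entry per cluster")
    return [values[src] for src in source_of]


# One value held by each of the eight clusters.
values = [10, 11, 12, 13, 14, 15, 16, 17]

# Butterfly-style pattern: cluster i exchanges data with cluster i XOR 1.
partner = [i ^ 1 for i in range(NUM_CLUSTERS)]
exchanged = communicate(values, partner)

# A different pattern: rotate the data by one cluster position.
rotated = communicate(values, [(i + 1) % NUM_CLUSTERS for i in range(NUM_CLUSTERS)])
```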

The adders and multipliers are fully pipelined and perform
single-precision floating-point arithmetic, 32-bit integer arithmetic,
and 16-bit or 8-bit parallel subword integer operations, as found in
MMX and other multimedia extensions. The divide/square root unit is
not pipelined and operates only on single-precision floating-point
and 32-bit integer operands. It can support two simultaneous
operations, with latencies ranging from 16 to 23 cycles depending on
the operation and data type. The 48 total arithmetic units, six
units replicated across eight clusters, provide a peak computation
rate of over 16 GOPS for both single-precision floating-point and
32-bit integer arithmetic.
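
To illustrate what a parallel subword operation does, the sketch below adds two 32-bit words as two independent 16-bit lanes. The function name `packed_add16` is hypothetical, and wraparound on overflow is an assumption: the text does not say whether the subword operations wrap or saturate.

```python
def packed_add16(a, b):
    """Add two 32-bit words as two independent 16-bit lanes (wraparound)."""
    # Low lane: bits 0-15. A carry out of the lane is discarded.
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    # High lane: bits 16-31, computed without any carry from the low lane.
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (hi << 16) | lo


x = 0x0001FFFF  # lanes: high = 0x0001, low = 0xFFFF
y = 0x00020001  # lanes: high = 0x0002, low = 0x0001

packed_sum = packed_add16(x, y)        # lanes: high = 0x0003, low = 0x0000
plain_sum = (x + y) & 0xFFFFFFFF       # ordinary 32-bit add carries across lanes
```

The two results differ because the ordinary add lets the carry from the low half propagate into the high half, while the packed add keeps the lanes separate.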