Exploring the VLSI Scalability of Stream Processors.
William J. Dally,
Ujval J. Kapasi,
John D. Owens,
Computer Systems Laboratory
To appear in the Proceedings of the Ninth Symposium on
High Performance Computer Architecture,
February 8-12, 2003, Anaheim, California, USA, pp. 153-164.
Stream processors are high-performance programmable processors
optimized to run media applications.
Recent work has shown these processors to be more
area- and energy-efficient than conventional programmable
architectures. This paper explores the scalability of stream
architectures to future VLSI technologies where over a thousand
floating-point units on a single chip will be feasible.
Two techniques for increasing the number of ALUs in a stream processor
are presented: intracluster and intercluster scaling. These scaling
techniques are shown to be cost-efficient up to tens of ALUs per cluster
and up to hundreds of arithmetic clusters. A 640-ALU stream processor
with 128 clusters and 5 ALUs per cluster is shown to be feasible in
45-nanometer technology, sustaining over 300 GOPS on kernels and
providing a 15.3x kernel speedup and an 8.0x application speedup over
a 40-ALU stream processor, with a 2% degradation in area per ALU and a
7% degradation in energy dissipated per ALU operation.
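
The figures quoted above can be cross-checked with a little arithmetic. The sketch below is illustrative only: it uses just the numbers stated in the abstract, and the "scaling efficiency" ratio (speedup divided by the increase in ALU count) is an assumed convenience metric, not one defined in the paper. All function and variable names are hypothetical.

```python
# Figures taken directly from the abstract.
CLUSTERS = 128
ALUS_PER_CLUSTER = 5
BASELINE_ALUS = 40
KERNEL_SPEEDUP = 15.3
APP_SPEEDUP = 8.0


def total_alus(clusters: int, alus_per_cluster: int) -> int:
    """ALU count of a processor built from identical arithmetic clusters."""
    return clusters * alus_per_cluster


def scaling_efficiency(speedup: float, alus: int, baseline_alus: int) -> float:
    """Assumed metric: achieved speedup relative to the growth in ALU count.

    A value of 1.0 would mean speedup scales linearly with the number of ALUs.
    """
    return speedup / (alus / baseline_alus)


alus = total_alus(CLUSTERS, ALUS_PER_CLUSTER)
alu_ratio = alus / BASELINE_ALUS
kernel_eff = scaling_efficiency(KERNEL_SPEEDUP, alus, BASELINE_ALUS)
app_eff = scaling_efficiency(APP_SPEEDUP, alus, BASELINE_ALUS)

print(f"total ALUs: {alus}")
print(f"ALU increase over baseline: {alu_ratio:.0f}x")
print(f"kernel scaling efficiency: {kernel_eff:.1%}")
print(f"application scaling efficiency: {app_eff:.1%}")
```

Under this reading, the 640-ALU configuration has 16 times the ALUs of the 40-ALU baseline, so the reported kernel speedup is close to linear in ALU count while the application speedup is roughly half of linear.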