Imagine

 

Imagine Performance

 

The Imagine kernel compiler and cycle-accurate simulator were used to generate the following performance results for four applications and three representative media processing kernels. Table 1 lists the results for these applications and kernels.

| | Arithmetic Bandwidth | Application Performance |
|---|---|---|
| Applications | | |
| Stereo Depth Extraction | 11.92 GOPS (16-bit) | 320x240 8-bit gray scale at 198 fps |
| MPEG-2 Encoding | 15.35 GOPS (16- and 8-bit) | 320x288 24-bit color at 287 fps |
| QR Decomposition | 10.46 GFLOPS | 192x96 matrix decomposition in 1.44 ms |
| Polygon Rendering | 5.91 GOPS (floating-point and integer) | 35.6 fps for 720x720 "ADVS" benchmark |
| Polygon Rendering with Real-Time Shading Language | 4.64 GOPS (floating-point and integer) | 16.3M pixels/second; 11.1M vertices/second |
| Kernels | | |
| Discrete Cosine Transform | 22.6 GOPS (16-bit) | 34.8 ns per 8x8 block (16-bit) |
| 7x7 Convolution | 25.6 GOPS (16-bit) | 1.5 µs per row of 320 16-bit pixels |
| FFT | 6.9 GFLOPS | 7.4 µs per 1,024-point floating-point complex FFT |

Table 1: Application and Kernel performance.
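The two result columns in Table 1 can be related with simple arithmetic. The sketch below, in Python, divides each arithmetic rate by the corresponding work rate to estimate how many operations are spent per pixel or per block. The helper names are hypothetical, and treating one frame as exactly width x height pixels is an assumption made here for illustration, not something the table states.

```python
# Figures are copied from Table 1; helper names are hypothetical.

def pixels_per_second(width, height, fps):
    """Pixel rate, assuming one frame is exactly width * height pixels."""
    return width * height * fps

def ops_per_unit(ops_per_second, units_per_second):
    """Average number of arithmetic operations spent on each unit of work."""
    return ops_per_second / units_per_second

# Applications: arithmetic rate divided by pixel rate.
stereo_pixel_rate = pixels_per_second(320, 240, 198)
mpeg2_pixel_rate = pixels_per_second(320, 288, 287)
stereo_ops_per_pixel = ops_per_unit(11.92e9, stereo_pixel_rate)
mpeg2_ops_per_pixel = ops_per_unit(15.35e9, mpeg2_pixel_rate)

# Kernels: arithmetic rate multiplied by time per unit of work.
dct_ops_per_block = 22.6e9 * 34.8e-9          # one 8x8 block
conv_ops_per_pixel = 25.6e9 * 1.5e-6 / 320    # one row holds 320 pixels

print(f"stereo depth: {stereo_ops_per_pixel:.0f} ops/pixel")
print(f"MPEG-2:       {mpeg2_ops_per_pixel:.0f} ops/pixel")
print(f"DCT:          {dct_ops_per_block:.0f} ops/block")
print(f"convolution:  {conv_ops_per_pixel:.0f} ops/pixel")
```

Under these assumptions the convolution kernel works out to roughly 120 operations per output pixel, which is plausible for a 7x7 window (49 multiplies and 48 adds, plus overhead).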

 

Figure 1. Measured application bandwidth for each level of the bandwidth hierarchy.

Figure 1 shows the sustained bandwidth used by these applications and demonstrates the effectiveness of the three-level bandwidth hierarchy. Each level of the hierarchy sustains an order of magnitude more bandwidth than the previous level. The kernels sustain a high computation rate and require hundreds of gigabytes per second of local register bandwidth. The stream register file (SRF), which provides higher bandwidth than a general-purpose global register file, cannot supply even half of the data bandwidth used by the arithmetic units. Therefore, without the small, fast local register files at the bottom of the bandwidth hierarchy, Imagine would not be able to achieve such high sustained performance on media kernels.

Imagine's stream architecture allows it to achieve high sustained performance for these media processing kernels. For the FFT kernel, for instance, an average of over 21 arithmetic operations is issued every cycle, for a sustained performance of 6.9 GFLOPS. The inherent parallelism in media applications and the Imagine bandwidth hierarchy result in similarly high sustained performance across a variety of media processing applications.
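The FFT figure can be sanity-checked against the execution time in Table 1. The sketch below assumes the conventional 5 N log2(N) floating-point operation count for a radix-2 complex FFT; the page does not say which counting convention was used, so this is only a consistency check, and `fft_flops` is a hypothetical helper name.

```python
import math

def fft_flops(n):
    """Conventional 5*N*log2(N) flop estimate for a radix-2 complex FFT.

    Assumption: the page does not state how its operations were counted.
    """
    return 5 * n * math.log2(n)

flops = fft_flops(1024)           # operations in one 1,024-point FFT
seconds_per_fft = 7.4e-6          # 7.4 us per FFT, from Table 1
gflops = flops / seconds_per_fft / 1e9

print(f"{flops:.0f} flops per FFT, {gflops:.2f} GFLOPS")
```

With this convention the estimate comes out close to the 6.9 GFLOPS quoted above, which suggests the reported rate and execution time are mutually consistent.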

 

gajh@cva.stanford.edu