M-Machine Publications

Exploiting Fine-Grain Thread Level Parallelism on the MIT Multi-ALU Processor


Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level, with a grain size of a single instruction, or by partitioning applications into coarse threads with grain sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. Because this fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on-chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2 to 2.1 are demonstrated on a suite of applications.
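
To illustrate why communication and synchronization cost dominates at this grain size, the following is a minimal sketch of a cost model. It is an assumption made for illustration, not the paper's methodology: the work is divided evenly across the processors, and each parallel region pays one fixed overhead in cycles. The function name `fine_grain_speedup` and all overhead values are hypothetical and are not measurements from the MAP chip.

```python
def fine_grain_speedup(work_cycles, processors, overhead_cycles):
    """Speedup of an evenly divided task that pays a fixed overhead.

    Simplified model (an assumption, not taken from the paper):
        parallel time = work_cycles / processors + overhead_cycles
    """
    if work_cycles <= 0 or processors <= 0:
        raise ValueError("work_cycles and processors must be positive")
    if overhead_cycles < 0:
        raise ValueError("overhead_cycles must be non-negative")
    parallel_cycles = work_cycles / processors + overhead_cycles
    return work_cycles / parallel_cycles


if __name__ == "__main__":
    # 60 cycles of work split into three 20-cycle tasks on three processors.
    # The overhead values are hypothetical, chosen only to show the trend.
    for overhead in (0, 2, 20, 120):
        speedup = fine_grain_speedup(60, 3, overhead)
        print(f"overhead={overhead:>3} cycles  speedup={speedup:.2f}")
```

Under this model, a 20-cycle task gains from parallel execution only while the overhead stays well below the task's own run length. Once the overhead exceeds the work saved, the speedup falls below 1, which is why mechanisms that are an order of magnitude cheaper make such short tasks worthwhile.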