ELM Programming

The ELM programming toolchain takes user code, partitions it, schedules data transfers, and outputs C code for each Ensemble Processor (EP). The high-level language (an older version is described in [1]) uses a stream programming model, in which an application is described as a collection of kernels (or filters) connected by data streams (Figure 1). The high-level partitioner transforms the application to meet real-time constraints with high processor utilization: bottleneck kernels are parallelized, while kernels that underutilize their processors are merged. The low-level compiler takes each partition and generates an executable for the ELM micro-architecture.
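To make the model concrete, here is a minimal sketch of kernels connected by streams, written in Python rather than in the ELM language itself. The names `Stream`, `Kernel`, and `run` are hypothetical and chosen only for illustration; the sketch also assumes each kernel consumes one item per input and produces one item per output on every firing, which the real language does not require.

```python
from collections import deque


class Stream:
    """A FIFO channel connecting a producer kernel to a consumer kernel."""

    def __init__(self):
        self.fifo = deque()

    def push(self, item):
        self.fifo.append(item)

    def pop(self):
        return self.fifo.popleft()

    def __len__(self):
        return len(self.fifo)


class Kernel:
    """Fires when every input stream holds an item (hypothetical firing rule)."""

    def __init__(self, name, fn, inputs, outputs):
        self.name, self.fn = name, fn
        self.inputs, self.outputs = inputs, outputs

    def can_fire(self):
        # A kernel with no inputs never fires here; sources are pre-filled streams.
        return bool(self.inputs) and all(len(s) > 0 for s in self.inputs)

    def fire(self):
        args = [s.pop() for s in self.inputs]
        results = self.fn(*args)  # fn returns one value per output stream
        for stream, value in zip(self.outputs, results):
            stream.push(value)


def run(kernels):
    """Keep firing ready kernels until no kernel can make progress."""
    progress = True
    while progress:
        progress = False
        for kernel in kernels:
            if kernel.can_fire():
                kernel.fire()
                progress = True


# Graph: a -> scale -> scaled; (scaled, b) -> combine -> (total, diff)
a, b, scaled, total, diff = Stream(), Stream(), Stream(), Stream(), Stream()
for x in [1, 2, 3]:
    a.push(x)
for y in [10, 20, 30]:
    b.push(y)

scale = Kernel("scale", lambda x: (2 * x,), [a], [scaled])
combine = Kernel("combine", lambda x, y: (x + y, y - x), [scaled, b], [total, diff])
run([scale, combine])

results_total = list(total.fifo)  # [12, 24, 36]
results_diff = list(diff.fifo)    # [8, 16, 24]
```

The `combine` kernel has two inputs and two outputs, so the graph needs no artificial interleaving of streams to express it.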

Compared to StreamIt, our language has the following advantages. First, it supports multiple stream inputs and outputs per kernel, which avoids the contrived stream interleavings that result from a single-input, single-output constraint. Second, its computation model is not limited to synchronous data flow (SDF). SDF is a good fit when the properties of filter execution can be determined statically, but neither our language nor our back-end computation model assumes SDF, so we can support a wider range of embedded applications. This matters most for applications such as 3D rendering, which have not traditionally been categorized as embedded applications but will soon be incorporated into many embedded devices. Third, the optimization objective is not limited to minimizing execution time. For embedded systems, the common objective is to minimize energy consumption subject to real-time constraints. Rather than extracting as much parallelism as possible, our programming system parallelizes bottleneck kernels judiciously and avoids the excessive communication and synchronization overhead of over-parallelization. Fourth, our language supports constructs for exploiting locality. Last, and probably most important, our language and programming system will be evaluated on the ELM architecture, which is designed for the stream programming model (although our programming system can target other architectures, and ELM is not limited to the stream programming model). The programming system back-end will generate code for streaming applications in a natural way, instead of adding workarounds to support architectures that were not designed for the stream programming model.

Figure 1. Stream Programming Model [2]

The partitioner analyzes the input program (Figure 2) based on the kernel parameters. First, it performs a data-flow analysis on the sizes of kernel inputs and outputs and automatically inserts buffers and insets for correctness (Figure 3). Because adjacent kernels can produce and consume data of different sizes, without these buffers and insets a producer kernel might overwrite earlier data or a consumer kernel might read invalid data. Second, the partitioner parallelizes time-consuming kernels to meet real-time data-rate constraints (Figure 4). Third, it time-multiplexes simple kernels onto shared processors to increase processor utilization (Figure 5). Finally, it maps the parallelized and multiplexed partitions to physical processors.

Figure 2. Input Program

Figure 3. Automatic Insertion of Buffers for Correctness
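The buffer-insertion step can be sketched as a pass over the edges of the kernel graph. This is a simplified illustration, not the partitioner's actual algorithm: it assumes one-dimensional, fixed per-firing sizes, ignores insets, and sizes each buffer with the classic rule for a single fixed-rate edge, `produce + consume - gcd(produce, consume)`. The function name `insert_buffers` and the kernel names are hypothetical.

```python
from math import gcd


def insert_buffers(edges):
    """Splice a buffer node into every edge whose sizes do not match.

    edges: list of (producer, out_size, consumer, in_size), where out_size is
    the number of items written per producer firing and in_size the number
    read per consumer firing (assumed fixed for this sketch).
    """
    result = []
    for producer, out_size, consumer, in_size in edges:
        if out_size == in_size:
            # Sizes agree, so the kernels can be connected directly.
            result.append((producer, out_size, consumer, in_size))
            continue
        # Assumed sizing rule for a single fixed-rate edge.
        capacity = out_size + in_size - gcd(out_size, in_size)
        buf = f"buffer[{capacity}]:{producer}->{consumer}"
        result.append((producer, out_size, buf, out_size))
        result.append((buf, in_size, consumer, in_size))
    return result


program = [
    ("camera", 1, "convolve", 5),     # mismatch: buffer of 5 items
    ("convolve", 1, "threshold", 1),  # sizes match: no buffer
    ("threshold", 2, "encode", 3),    # mismatch: buffer of 4 items
]
buffered = insert_buffers(program)
```

After the pass, every edge connects endpoints that agree on transfer size, so no producer writes into storage its consumer is still reading.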

Figure 4. Automatic Parallelization to Meet Real-time Constraints

Figure 5. Automatic Time-multiplexing to Increase Utilization
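The parallelization and time-multiplexing steps can be illustrated with a small greedy sketch. It is an assumption-laden simplification rather than the partitioner's real algorithm: kernels form a linear pipeline, each kernel has a known cost in cycles per sample, replication is assumed to scale linearly, communication cost is ignored, and only adjacent kernels are merged. The function `partition`, the `budget` parameter, and the kernel names are hypothetical.

```python
from math import ceil


def partition(kernels, budget):
    """Group a linear pipeline into partitions that each meet the data rate.

    kernels: ordered list of (name, cycles_per_sample).
    budget:  cycles one processor can spend per sample at the required rate.
    Returns a list of (kernel_names, processor_count) tuples.
    """
    partitions = []
    group, load = [], 0
    for name, cost in kernels:
        if cost > budget:
            # Bottleneck kernel: close the current group, then replicate.
            if group:
                partitions.append((group, 1))
                group, load = [], 0
            partitions.append(([name], ceil(cost / budget)))
        elif load + cost <= budget:
            # Light kernel: time-multiplex it onto the current processor.
            group.append(name)
            load += cost
        else:
            # Current processor is full: start a new one.
            partitions.append((group, 1))
            group, load = [name], cost
    if group:
        partitions.append((group, 1))
    return partitions


pipeline = [
    ("demosaic", 30),
    ("denoise", 40),
    ("convolve", 250),
    ("threshold", 20),
    ("pack", 10),
]
plan = partition(pipeline, budget=100)
processors = sum(count for _, count in plan)
```

With a budget of 100 cycles per sample, `convolve` is replicated across three processors while the light kernels on either side share one processor each, for five processors in total instead of the seven that one-kernel-per-processor mapping plus replication would need.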

The compiler page describes the low-level compiler.

[1] D. Black-Schaffer, "Block Parallel Programming for Real-time Applications on Multi-core Processors," PhD thesis, Stanford University, 2008.

[2] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner, "Imagine: Media Processing with Streams," IEEE Micro, 2001.