ELM Programming System |
|
|
The programming system of ELM is composed of two levels. The high-level [1] language utilizes a stream programming model. In this model, the application is described with a collection of kernels (or filters) connected by data streams (Figure 1). The high-level partitioner modifies the application to meet real-time constraints with high processor utilization: kernels in the bottleneck are parallelized while kernels underutilizing processors are merged. The low-level compiler and scheduler takes each partition and generates executables for the ELM micro-architecture. Figure 1. Stream Programming Model [2] The partitioner analyzes the input program (figure 2) based on the kernel parameters. First, it performs a data flow analysis on the sizes of kernel inputs and outputs to automatically insert buffers and insets for correctness (figure 3). Since adjacent kernels can produce/consume different sizes of data, a producer kernel may overwrite previous data or a consumer kernel may read invalid data without these buffers and insets. Second, the partitioner parallelizes time-consuming kernels to meet real-time data rate constraints (figure 4). Third, it applies time-multiplexing to simple kernels in order to increase processor utilization (figure 5). Finally, it maps parallelized/multiplexed partitions to physical processors. Figure 2. Input Program Figure 3. Automatic Insertion of Buffers for Correctness Figure 4. Automatic Parallelization to Meet Real-time Constraints Figure 5. Automatic Time-multiplexing to Increase Utilization The low-level compiler needs novel compilation algorithms for ELM's architectural features. For data delivery, the distributed hierarchical register organization poses a challenge on register allocation and instruction scheduling. When the register organization becomes distributed and non-uniform, we have a more serious phase ordering problem between allocations and scheduling [3]. This problem also occurs in the distributed hierarchical register organization. For example, if we allocate register first, the minimum latency between instructions changes depending on which register-level is allocated to the intermediate value as shown in figure 6. If we use a unified allocation and scheduling algorithm, the unified algorithm will meet a case when it cannot make any progress without backtracking. We developed an algorithm for avoiding this case and plan to compare the algorithm with a non-unified algorithm and an algorithm with backtracking. Figure 6. Phase Ordering Problem On the instruction delivery side, the compiler should minimize the traffic between the ensemble memory and the instruction register file. By inserting instruction block load instructions at appropriate program points, it not only reduces the memory traffic from the EPs, but also hides the latency of loading instructions. [1] David Black-Schaffer, "Block Parallel Programming for Real-time Applications on Multi-core Processors", Stanford PhD Thesis, 2008 [2] B. Khailany., W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner,"Imagine: Media Processing with Stream." IEEE Micro, 2001 [3] K. Kailas, K. Ebcioglu, and A. Agrawala, "CARS: A New Code Generation Framework for Clustered ILP Processors." HPCA, 2001 | |