# Data Stream Generation for Concurrent Computation in VLSI Signal Processors\*

Tay-Jyi Lin, Chein-Wei Jen

Dept. of Electronics Engineering National Chiao Tung University, Hsinchu, Taiwan Fax: 886-3-5710580, Email: {tjlin, cwjen}@ee.nctu.edu.tw

#### **Abstract**

Signal processing usually requires extremely high Fortunately, with advance in computing power. VLSI technology, the required performance could be achieved with more functional units performing concurrent computations on a chip. Supplying demanding data streams to these computational units soon becomes the system-performance bottleneck because of slow off-chip I/O and memory with less We have proposed a data improvement speed. stream generation (DSG) scheme that explores data reuse property existing in most signal processing algorithms while supplying appropriate data sequences. This scheme can make the VLSI signal processors much more latency and cost efficient. In the illustrating example, our DSG supplies the specified streams with 4-cycle setup time at the beginning (latency), instead of 48 cycles for each block in conventional FIFO approaches and only requires 1/6-storage elements.

### 1. Introduction

Standard microprocessors today cannot provide a cost-efficient platform with enough computing power for dramatically growing multimedia applications with intensive-computation requirements. Application-specific circuits are an obvious choice, but lack flexibility. Digital signal processors with separate data/program accesses, adapted interconnections and functional units could only satisfy speech-/audio-processing requirements. With an embedded micro-

controller and customized hardware accelerators, application specific instruction-set processors (ASIPs) [1] with both flexibility and computing power are very attractive. Most VLSI signal-processing applications adopt this ASIP model as the basic underlying platform for multimedia computing.

Conventional ASIPs consist of one embedded micro-controller to synchronize all operations and handle some sequential tasks, some customized hardware accelerators such as array coprocessors, SIMD and MMX-like datapath that explore the concurrency in DSP algorithms to provide computing power boost. A memory subsystem is also included as shown in figure 1(a). Data-hungry concurrent computations usually block the memory accesses by other functional units, or lose some performance themselves due to bandwidth limitation on system bus. Address generation and data formation, possibly at bit level, might exhaust the micro-controller.



**Figure 1** (a) Conventional ASIP Model (b) Modified ASIP Model with our Proposed SIU

<sup>\*</sup>This work was supported by the NSC, Taiwan under Grant NSC89-2215-E009-003

A stream interface unit (SIU) is inserted as an alternative [2], to decouple the accelerating DSP datapath from the system bus as shown in figure 1(b). The customized SIU datapath handles data format conversions if necessary and support a very high data rate to the accelerating DSP datapath while maintaining the lowest system bus occupation and hence less off-chip data acquisition. The asymmetric data rate property at the SIU I/O comes from massive data reuse.

Design methodologies for algorithm-specific array processors in accelerating DSP datapath were well developed but performance might be limited by I/O constraints. Our proposed data stream generation (DSG) scheme can release these I/O constraints via the customized SIU that could be automatically synthesized. In next section we describe the DSG problem. SIU implementation issues and our proposed two-stage procedure for automatic circuit generation are summarized in section 3. Section 4 concludes our work.

#### 2. Data Stream Generation

Our stream interface unit (SIU) supplies huge data streams required by concurrent computation while keeping low data acquisition rate from the system bus. The asymmetric SIU I/O-rate could be in the form of different clock rates, different data widths or both. For simplicity, we assume the single-input / multiple-output under one synchronous clock, i.e. the SIU receives scalar inputs and generates array outputs.



Figure 2 Output Sequence Specification

Suppose there are three tasks in sequence, which access identical 4\*4 data block with elements A~P but

in different orders as figure 2 — three widely-used scanning ways in image applications. Parallelizing these tasks directly triples the system bandwidth. We can eliminate redundant memory accesses to keep the same bandwidth as the original sequential hardware by introducing a small latency and some storage elements. The proposed SIU for 3-channel data stream generation is shown in figure 3. The hardware cost could be measured as the number of storage elements together with the routing complexity for the MUX-based interconnection network.



Figure 3 3-Channel Data Stream Generation

We first assume that the SIU acquires data in rasterscan order from the system bus. Each channel is equivalent to an individual data format conversion (DFC) problem. The DFC methodology, which applies lifetime analysis followed by forwardbackward register allocation [3], is generalized here to handle asymmetric I/O rates with multiple data outputs. To maintain a smooth data flow into the parallel functional units, a small latency is introduced, which equals the most negative difference between T<sub>in</sub> (data arriving time) and Tz1 (data consumed time with zero latency). Lifetime for each variable is then derived from its birth (Tin) till death (the last consumed time of the three channels, i.e. MAX [T<sub>out</sub>]). Minimum register count is then obtained, which equals the maximum number of concurrent live variables, via lifetime analysis. The 3-channel SIU requires 9-cycle latency and 13 storage elements as shown in figure 4. Internal copies of identical data samples are not allowed for simplicity.

## 3. Stream Interface Unit (SIU) Implementation

Following (1) *I/O specification* (2) *circuit mapping*, the SIU could be systematically generated. More sophisticated I/O specification than raster-scan can efficiently reduce latency and SIU complexity with some assistance from the microcontroller and the onchip memory subsystem. Minimizing the storage and routing complexity in the MUX-based interconnection network while keeping an acceptable low latency leads the I/O specification to an NP-complete problem. A linear-time heuristic scheduler is shown to efficiently reduce almost half latency and storage elements in figure 5 while re-transmitting the same data item is not allowed for minimum bandwidth requirement.

Forward-backward register allocation [3] allocates the variables to the storage elements that are initially self-organized as a FIFO. The routing problem is significantly simplified based on this underlying FIFO structure with some additional feedback paths. Figure 6 shows the allocation table and the SIU hardware for scheduled input. The SIU hardware consists of 8 storage elements; one 7-input, two 6-input MUX for output ports and only one 2-input MUX for data feedback in the MUX-based interconnection network. Various register allocation schemes [4-7] could be utilized here as well with some adaptation of the input scheduling heuristics.

The methodology for SIU generation can be easily packaged as a soft-IP (silicon Intellectual Property [11]) that finds applications in many fields. Once the data streams are specified, tradeoff among latency, register count and interconnection complexity can be evaluated via various heuristic schedulers and register allocation schemes. Automatic pattern generation for functional verification and testing can be also supported. Users can easily instantiate our SIU module in the top-level Verilog design.

#### 4. Conclusion

The concept of System-on-Chip (SoC) and increasing gates available makes heterogeneous architectures more practical. We propose an alternative general computation model for embedded systems with data-

flow intensive tasks in this paper. The proposed data stream generation (DSG) scheme makes data-driven computing blocks much more efficient in terms of power, speed, and memory usage with an additional customized stream interface unit (SIU). A two-step procedure with simplified algorithms is summarized for automatic SIU circuit generation. We use a 3-channel SIU example to demonstrate the validity of our DSG scheme. The resulting hardware shows great improvement both in area and latency over conventional heterogeneous VLSI signal processors.

#### Reference

- 1. I. Kuroda, T. Nishitani, "Multimedia processors," *Proceedings of the IEEE*, June 1998
- 2. M. Schonfeld, M. Schwiegershausen, P. Pirsch, "Synthesis of Intermediate Memories for the Data Supply to Processor Arrays," *Algorithms and Parallel VLSI Architectures II*, 1992
- 3. K. K. Parhi, "Systematic synthesis of DSP data format converters using life-time analysis and forward-backward register allocation," *IEEE Trans. Circuits Syst. II*, July 1992
- Kazuhito Ito, et al. "ILP-Based Cost-Optimal DSP Synthesis with Module Selection and Data Format Conversion," *IEEE Trans. VLSI*, December 1998
- K. Srivatsan, C. Chakrabarti, and L. Lucke, "Low Power Data Format Converter Design Using Semi-Static Register Allocation," *ICCD*, 1995
- J. Bae, V. K. Prasanna, "Synthesis of Area-Efficient and High Throughput Rate Data Format Converters," *IEEE Trans. VLSI*, December 1998
- 7. M. Majumdar, K. K Parhi, "Design of Data Format Converters Using Two-Dimensional Register Allocation," *IEEE Trans. Circuits Syst. II*, April 1998
- 8. Francky Catthoor, et al, "Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design," *Kluwer Academic Publishers*, 1998
- 9. Preeti Ranjan Panda, et al, "Memory Issues in Embedded Systems-On-Chip," *Kluwer Academic Publishers*, 1999
- K. K. Parhi, "VLSI Digital Signal Processing Systems: Design and Implementation," John Wiley & Sons, 1999
- 11. M. Keating, P. Bricaud, "Reuse Methodology Manual for System-on-a-Chip Designs," *Kluwer Academic Publisher*, 1998



Figure 4 Lifetime Analysis for Raster-Scan Input: latency =  $|MIN (Tdiff1\sim3)|$  and life period = Tinput  $\sim MAX (Tout1\sim3)$ . The SIU needs 13 storage elements and 9-cycle latency.



Figure 5 Lifetime Analysis for Scheduled Input: "Small MIN  $[T_{zl}]$  first" reduces latency. When variables with identical MIN  $[T_{zl}]$  exist, latency does not depend on how these variables are scheduled. "Small MAX  $[T_{zl}]$  first" schedules late-dying ones last to efficiently decrease their lifetime and hence storage requirement. The scheduled-input SIU only requires 8 storage elements and 4-cycle latency.

| т  |       | 1      | 2     | 2      | 4        | 5      |        | 7     | 0     |          |          |          |
|----|-------|--------|-------|--------|----------|--------|--------|-------|-------|----------|----------|----------|
| T  | input | reg l  | reg 2 | reg 3  | reg 4    | reg 5  | reg 6  | reg 7 | reg 8 | output 1 | output 2 | output 3 |
| 0  | A     |        |       |        |          |        |        |       |       |          |          |          |
| 1  | E     | A      |       |        |          |        |        |       |       |          |          |          |
| 2  | В     | E      | A     |        |          |        |        |       |       |          |          |          |
| 3  | I     | В      | E     | A      |          |        |        |       |       |          |          |          |
| 4  | F     | I      | В     | E      | A(1,2,3) |        |        |       |       | A        | A        | A        |
| 5  | M     | F      | I     | B(3)   | E(1,2)   |        |        |       |       | E        | E        | В        |
| 6  | C     | M      | F     | I(1)   | B(2)     | E(3)   |        |       |       | I        | В        | E        |
| 7  | D     | C      | M(1)  | F(2,3) | I        | В      |        |       |       | M        | F        | F        |
| 8  | G     | D      | C(3)  | M      | F        | I(2)   | B(1)   |       |       | В        | I        | С        |
| 9  | J     | G      | D(3)  | C      | M(2)     | F(1)   | I      |       |       | F        | M        | D        |
| 10 | N     | J(1,2) | G(3)  | D      | C        | M      |        | I     |       | J        | J        | G        |
| 11 | H(3)  | N(1,2) | J     | G      | D        | C      | M      |       | I     | N        | N        | Н        |
| 12 | K     | Н      | N     | J      | G        | D      | C(1,2) | M     | I(3)  | C        | C        | I        |
| 13 | 0     | K      | Н     | N      | J(3)     | G(1,2) | D      |       | M     | G        | G        | J        |
| 14 | L     | 0      | K(1)  | Н      | N        |        |        | D(2)  | M(3)  | K        | D        | M        |
| 15 | P     | L      | O(1)  | K      | H(2)     | N(3)   |        |       | D     | О        | Н        | N        |
| 16 |       | P      | L     | О      | K(2,3)   | Н      |        |       | D(1)  | D        | K        | K        |
| 17 |       |        | P     | L(3)   | O(2)     |        | H(1)   |       |       | Н        | 0        | L        |
| 18 |       |        |       | P      | L(1,2)   | O(3)   |        |       |       | L        | L        | 0        |
| 19 |       |        |       |        | P(1,2,3) |        |        |       |       | P        | P        | P        |



**Figure 6 Forward Backward Register Allocation for Scheduled Input**: The MUX-based network requires one 7-input, two 6-input MUX in I/O and one 2-input MUX in datapath.