

#### Coprocessing Datapath Generation in Configurable DSP Platforms

*Tay-Jyi Lin, Tzung-Shian Yang,* and *Chein-Wei Jen*
Department of Electronics Engineering, National Chiao Tung University
2001/8/16


# Heterogeneous Platform



## **Proposed DSP Platform**



- The number of datapaths and the number of parallel functional units (FUs) inside each are *scalable* to meet the performance requirements.
- The coprocessing datapaths are *configurable* for various applications, based on the algorithms to be executed, written in C/C++.
- The performance boost comes from
  - elimination of control overhead (program flow)
  - fewer load/store operations, thanks to a dedicated SIU data generator (data generation)
  - parallel processing via SIMD-like functional units
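To make the last point concrete, here is a minimal sketch of how several MAC units working in lockstep shorten a dot product. It is a software model only, not the generated datapath: `parallel_mac`, `MacResult`, and the lane count are hypothetical names introduced for illustration, and each "step" stands for one pass in which every lane consumes one operand pair.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the dot product and the number of lockstep passes.
struct MacResult {
    long sum;
    std::size_t steps;
};

// Hypothetical model: `lanes` MAC units each consume one (a, b) pair per step.
MacResult parallel_mac(const std::vector<int> &a,
                       const std::vector<int> &b,
                       std::size_t lanes) {
    assert(a.size() == b.size() && lanes > 0);
    std::vector<long> acc(lanes, 0);   // one accumulator per MAC unit
    std::size_t steps = 0;
    for (std::size_t base = 0; base < a.size(); base += lanes, ++steps) {
        for (std::size_t lane = 0; lane < lanes && base + lane < a.size(); ++lane) {
            acc[lane] += static_cast<long>(a[base + lane]) * b[base + lane];
        }
    }
    long sum = 0;
    for (long v : acc) sum += v;       // final reduction across lanes
    return {sum, steps};
}
```

With eight operand pairs, four lanes need two steps where a single MAC needs eight; the loop bookkeeping that a software-only version would spend instructions on is the control overhead the slide refers to.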

### CASCADE – Configurable And SCAlable Dsp Environment

| Micro-Controller | Data-driven Accelerators |
|---|---|
| C/C++ host programs | |
| Prior compilation & performance evaluation | |
| Parallelism analysis & task dispatch | |
| Code replacement (software driver) | Operation allocation/scheduling |
| Compilation | Optimal binding |
| | Dataflow control optimization |
| Executable | Synthesizable Verilog |

Final step for both columns: co-simulation & performance evaluation.

```cpp
void dispatch_fun(fix *Din, int size, fix *Dout)
{
    int index1 = 0, index2 = 0;   /* inputs sent, outputs received */
    fix din, dout;
    bool valid_in, valid_out;

    /* exception-handling code goes here */

    while (index2 < size) {
        valid_in = (index1 < size);        /* is there still input to send? */
        if (valid_in)
            din = Din[index1++];           /* otherwise resend the last sample, marked invalid */
        IO_Operation(din, valid_in, &dout, &valid_out);
        Dout[index2] = dout;               /* overwritten until a valid result arrives */
        index2 = index2 + valid_out;       /* advance only on a valid output */
    }
}
```

- The data-driven accelerators are synchronized with the host's instructions.
- The interfacing software can be compiled and scheduled (by the compiler and RTOS) as usual.
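The valid-in/valid-out handshake used by `dispatch_fun` can be tried out against a software stand-in for the accelerator. The sketch below is an assumption-laden model, not the generated hardware: `AcceleratorModel`, its fixed pipeline latency, the doubling operation, and `fix` as a plain `int` are all hypothetical choices made so the example runs on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

using fix = int;  // assumption: stand-in for the slide's fixed-point type

// Hypothetical accelerator model: doubles each valid sample and returns it
// `latency` calls later, the way a pipelined datapath would.
class AcceleratorModel {
    struct Slot {
        fix value;
        bool valid;
    };
    std::deque<Slot> pipe_;

public:
    explicit AcceleratorModel(std::size_t latency)
        : pipe_(latency, Slot{0, false}) {}   // pipeline starts full of bubbles

    // One host-driven step: push one input slot, pop one output slot.
    void io_operation(fix din, bool valid_in, fix *dout, bool *valid_out) {
        pipe_.push_back(Slot{static_cast<fix>(2 * din), valid_in});
        Slot out = pipe_.front();
        pipe_.pop_front();
        *dout = out.value;
        *valid_out = out.valid;
    }
};

// Driver in the style of dispatch_fun: keep stepping the accelerator,
// feeding bubbles once the input is exhausted, until all results are back.
std::vector<fix> dispatch(AcceleratorModel &acc, const std::vector<fix> &in) {
    std::vector<fix> out;
    out.reserve(in.size());
    std::size_t index1 = 0;
    while (out.size() < in.size()) {
        bool valid_in = index1 < in.size();
        fix din = valid_in ? in[index1++] : 0;   // bubble to flush the pipeline
        fix dout = 0;
        bool valid_out = false;
        acc.io_operation(din, valid_in, &dout, &valid_out);
        if (valid_out) out.push_back(dout);
    }
    return out;
}
```

Because every accelerator step is triggered by a host call, the host never has to poll or take an interrupt; the latency only shows up as a few extra calls that carry invalid inputs at the end of the stream.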


# Performance Improvement of MJ-like Encoder

|                   | DCT kernel (cycles) | DCT kernel (time) | 320×240 frame (time) |
|-------------------|---------------------|-------------------|----------------------|
| Software alone    | 5,595               | 111.9 µs          | 152.928 ms           |
| With Accelerators | 246                 | 4.92 µs           | 24.552 ms            |

- The host micro-controller is a 50-MHz ARM7TDMI.
- The data-driven accelerator is composed of 4 MACs.
- An 8-by-8 DCT, the quantization specified in the JPEG standard, run-length coding, and Huffman coding are performed on the 320×240 frame.
- An ideal memory subsystem (no memory stall) is assumed for simplicity.
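The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes the frame is processed as (320/8) × (240/8) = 1200 blocks of 8×8 samples; the slide does not state the block count, but the numbers fit it, since the time left over for the non-DCT stages comes out at about 18.65 ms in both rows. The helper names (`cycles_to_us`, `split_frame`, `speedup`) are introduced here for illustration.

```cpp
#include <cassert>

constexpr double kClockMHz = 50.0;                       // host clock from the slide
constexpr int kBlocksPerFrame = (320 / 8) * (240 / 8);   // 1200, an assumption

// Convert a cycle count to microseconds at the host clock rate.
constexpr double cycles_to_us(double cycles) { return cycles / kClockMHz; }

struct FrameBreakdown {
    double dct_ms;    // time spent in DCT kernels over the whole frame
    double other_ms;  // everything else (quantization, RLC, Huffman, ...)
};

// Split a measured frame time into DCT and non-DCT parts.
FrameBreakdown split_frame(double frame_ms, double dct_cycles) {
    assert(dct_cycles > 0.0);
    double dct_ms = kBlocksPerFrame * cycles_to_us(dct_cycles) / 1000.0;
    return {dct_ms, frame_ms - dct_ms};
}

// Ratio of a baseline figure to an accelerated one.
double speedup(double before, double after) { return before / after; }
```

Under that assumption the DCT kernel speeds up by roughly 22.7×, while the whole frame speeds up by roughly 6.2×: once the DCT is accelerated, the stages still running in software account for most of the remaining 24.552 ms.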





- The proposed coprocessing datapath generation is
  - easily *configurable*
  - *scalable* in performance
  - amenable to simple low-power management (e.g. our example)
- The accelerating datapaths are *driven by host instructions* in the software interface, which is also auto-generated; this simplifies the synchronization problem.
- The coprocessing datapaths boost the performance of low-cost microcontrollers to *lengthen their product life span*.
- Automatic generation of the accelerators with simple software-controlled interfacing dramatically *reduces the development time*.

