# Design of a 125µW, Fully-Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications

Tsu-Ming Liu, Ching-Che Chung, Chen-Yi Lee

National Chiao-Tung University, 1001, Ta Hsueh Road, HsinChu 300, Taiwan, ROC, +886-3-5731849

{mingle, wildwolf, cylee}@si2lab.org

## ABSTRACT

A dual-standard MPEG-2 and H.264/AVC video decoder is demonstrated in a 0.18$\mu$m CMOS process [1]. The key design issues of this IC are discussed, including the improvement of area and power efficiency. Power dissipation is greatly lowered through architectural exploration. Measurement results show that real-time MPEG-2 and H.264/AVC decoding of QCIF@15fps is achieved at 1.15MHz with a power dissipation of 108$\mu$W and 125$\mu$W, respectively, at a 1V supply voltage.

### **Categories and Subject Descriptors**

B.7.1 [Hardware]: Integrated Circuits -- Types and Design Styles

General Terms: Design, Measurement, Performance.

Keywords: MPEG-2, H.264/AVC, Mobile, Low-power.

## **1. INTRODUCTION**

One of the primary goals in designing a video decoding system for mobile applications is power reduction. Although existing video standards greatly reduce the transmission bandwidth, the power dissipation they introduce becomes a major challenge for battery-operated systems. In this paper, we consider a dual-standard video decoder as an extreme case to illustrate a power-efficient design. We first concentrate on the memory system, since it generally accounts for a large portion of the power dissipation in multimedia systems. Specifically, we pre-store the useful data and strike a better compromise between internal and external memory accesses. We then develop a novel motion compensation module and deblocking filter [1] to lower the required working frequency. Moreover, we describe a design flow that meets low-power requirements by means of available EDA tools. Finally, the measured results show that sub-mW power dissipation can be achieved for real-time decoding of quarter common intermediate format (QCIF, 176×144) sequences at 15fps in both the MPEG-2 and H.264/AVC video standards.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DAC 2006, July 24–28, 2006, San Francisco, California, USA.

Copyright 2006 ACM 1-59593-381-6/06/0007...\$5.00.

Ting-An Lin, Sheng-Zen Wang

MediaTek Inc., No. 1, Dusing Road 1, Science-based Industrial Park, HsinChu 300, Taiwan, ROC, +886-3-5670766

{Zion\_Lin, SJ\_Wang}@mtk.com.tw



Figure 1: The system block diagram.

## 2. SYSTEM ARCHITECTURE

Figure 1 shows the block diagram of the dual-standard video decoder chip. A 22.75Kb embedded SRAM stores local pixel data. Two 4MB external frame memories are connected to the SDRAM interface (I/F) via a 64-bit system bus. SDRAM accesses are issued by both the motion compensation module and the deblocking filter. To reduce the bandwidth between the external frame memory and the deblocking filter, a separate data bus and display engine are employed for on-screen display (OSD) through a direct display I/F. Note that most functional blocks in H.264/AVC are similar to those in MPEG-2, including the entropy decoder, motion compensation, and intra prediction. However, the IDCT modules of MPEG-2 and H.264/AVC differ so much that they are difficult to combine, and the deblocking filters pose the same integration problem. To enhance area efficiency, we propose solutions for the 4×4/8×8 IDCT and the in/post-loop deblocking filter [1].

## 3. CHIP DESIGN

Designing and implementing a video decoding chip that meets low-power demands can be a complicated and time-consuming process. Overcoming this challenge requires improvements at both the architectural and methodological levels.

### 3.1 Low-Power Design Strategy

Improving the memory hierarchy or reducing the memory size is very effective for lowering power dissipation because the memory system accounts for about 70% of core power dissipation [2]. Figure 2 depicts a three-level memory hierarchy in which a slice pixel SRAM stores rows of pixels, since H.264/AVC accesses logically adjacent pixels in the vertical direction. However, storing every row of upper neighboring pixels is unnecessary when the subsequent decoding process does not depend on them. Hence, we propose a line-pixel-lookahead (LPL) scheme to eliminate the unused pixels. In particular, a 19.2Kb slice pixel SRAM caches the pixels of upper neighbors, and the LPL scheme predicts whether the follow-up pixel data should be kept. In the LPL scheme, the TAG prediction issues a Decoding TAG (D. TAG), and the Neighboring TAG (N. TAG) equals the previous D. TAG after one row of TAGs has been buffered. The key observation behind the TAG prediction is that upper neighboring pixels need not be pre-stored when the corresponding blocks use "horizontal prediction mode" in intra prediction or "SKIP mode" in the deblocking filter [1]. Together, the three-level memory hierarchy and the LPL scheme achieve a 51% power saving compared to a conventional design without any memory hierarchy.
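The TAG mechanism can be illustrated with a minimal behavioral sketch. It is not the chip's actual logic: the `BlockModes` container, the function names, and the one-TAG-per-block granularity are hypothetical, and the rule that either condition alone clears the TAG simply follows the wording of the paragraph above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlockModes:
    """Per-block decisions (hypothetical container, not from the paper)."""
    intra_horizontal: bool  # intra prediction uses "horizontal prediction mode"
    deblock_skip: bool      # deblocking filter uses "SKIP mode"


def decoding_tag(block: BlockModes) -> int:
    """Return 1 if upper-neighbor pixels must be kept for this block, else 0.

    Assumed rule: either condition alone makes the upper pixels unnecessary.
    """
    return 0 if (block.intra_horizontal or block.deblock_skip) else 1


def run_lpl(rows: List[List[BlockModes]]) -> List[Tuple[List[int], List[int]]]:
    """Return the (N.TAG row, D.TAG row) pair for every block row.

    N.TAG is the D.TAG row of the previous block row, i.e. a one-row delay.
    """
    n_tags = [0] * len(rows[0])  # there is no row above the first one
    trace = []
    for row in rows:
        d_tags = [decoding_tag(b) for b in row]
        trace.append((n_tags, d_tags))
        n_tags = d_tags  # buffer one row of TAGs
    return trace


def kept_fraction(rows: List[List[BlockModes]]) -> float:
    """Fraction of blocks whose upper-neighbor pixels are kept."""
    tags = [decoding_tag(b) for row in rows for b in row]
    return sum(tags) / len(tags)


rows = [
    [BlockModes(False, False), BlockModes(False, False), BlockModes(False, False)],
    [BlockModes(True, False), BlockModes(False, False), BlockModes(False, True)],
]
for n_tags, d_tags in run_lpl(rows):
    print("N.TAG", n_tags, "D.TAG", d_tags)
print("kept fraction:", kept_fraction(rows))
```

In this toy input, two of the three blocks in the second row clear their TAG, so their upper-neighbor pixels would not need to occupy the slice pixel SRAM.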

In addition to the improved memory hierarchy, the proposed motion compensation (MC) and deblocking filter (DF) modules eliminate redundant memory accesses. In both modules, we allocate internal buffers to reuse the neighboring pixels. These pixels can therefore be fetched from internal storage instead of external memory, which reduces the number of processing cycles. As a result, the modules lower the required working frequency as well as the power consumption without degrading system performance. Under real-time decoding, these designs reduce the required working frequency by approximately 60% with only a few additional buffers and a small amount of extra logic.
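To give a feel for why reusing neighboring pixels saves external accesses, the sketch below counts reference pixels for H.264/AVC luma interpolation, whose 6-tap filter needs an (N+5)×(N+5) window for an N×N block. It assumes, purely for illustration, a datapath that works in 4×4 units whose reference windows overlap fully (all units share one motion vector). The function names are hypothetical, and the resulting ratio is a best-case illustration, not the 60% frequency reduction reported above.

```python
TAPS = 6  # length of the H.264/AVC luma half-sample interpolation filter


def ref_window(block_w: int, block_h: int, taps: int = TAPS) -> int:
    """Reference pixels needed to interpolate one block at a fractional position."""
    return (block_w + taps - 1) * (block_h + taps - 1)


def fetches_without_reuse(mb_size: int = 16, unit: int = 4) -> int:
    """Every unit block fetches its full window from external memory."""
    units = (mb_size // unit) ** 2
    return units * ref_window(unit, unit)


def fetches_with_reuse(mb_size: int = 16) -> int:
    """Overlapping pixels are kept in an internal buffer and fetched only once."""
    return ref_window(mb_size, mb_size)


without = fetches_without_reuse()
reused = fetches_with_reuse()
print(without, reused, f"saving = {1 - reused / without:.1%}")
```

When neighboring units have different motion vectors the windows overlap less, so the achievable saving depends on the content.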



Figure 2: Three-level memory hierarchy with LPL scheme.



Figure 3: A design flow for this video decoder.

### 3.2 Design Flow

While power savings can be achieved by exploring different architectures, careful use of advanced EDA tool features during the synthesis and P&R phases can also play a key role. Figure 3 depicts a design flow that enables an efficient low-power design. It starts from a C-language model and Verilog RTL descriptions, proceeds through synthesis and routing with Cadence<sup>®</sup> RTL Compiler and SoC Encounter<sup>TM</sup>, and ends with chip fabrication and verification on an Agilent 93000 SOC Test System. We use the layout estimator with the process technology file in place of the wire-load model to facilitate timing closure. Moreover, a toggle count format (TCF) file, which provides the average switching activity of the nets over time, is generated so that the tools can automatically optimize the power of the design. In the placement and routing phase, we perform SI-prevention, timing-aware routing. Altogether, architectural improvement greatly reduces power dissipation, and EDA tools reduce it further. Hence, we have various design alternatives for reaching a better compromise between performance and power in an earlier design phase.
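Switching activity matters because dynamic power grows with how often each net toggles. The sketch below applies the textbook relation of ½·C·V² energy per transition to a list of nets. The net capacitances and toggle rates are made-up illustrative values, the function name is hypothetical, and the code does not read or reproduce the TCF file format.

```python
from typing import Iterable, Tuple


def dynamic_power(nets: Iterable[Tuple[float, float]], vdd: float) -> float:
    """Sum 0.5 * C * Vdd^2 * toggle_rate over all nets.

    nets: (capacitance in farads, toggles per second) pairs (illustrative values).
    Returns power in watts.
    """
    return sum(0.5 * cap * vdd ** 2 * rate for cap, rate in nets)


# Hypothetical nets: a lightly loaded busy net and a heavily loaded quiet net.
example_nets = [(10e-15, 1.0e6), (50e-15, 0.1e6)]

p_1v0 = dynamic_power(example_nets, vdd=1.0)
p_1v8 = dynamic_power(example_nets, vdd=1.8)
print(f"{p_1v0:.3e} W at 1.0 V, {p_1v8:.3e} W at 1.8 V, ratio {p_1v8 / p_1v0:.2f}")
```

Because the supply voltage enters quadratically, the same activity costs about 3.24 times more dynamic power at 1.8V than at 1V in this simple model.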

Table 1: Chip features.

| Item                   | Value                              |
|------------------------|------------------------------------|
| Specification          | Dual MPEG-2 SP@ML, H.264/AVC BL@L4 |
| Technology             | Standard 0.18µm 1P6M CMOS          |
|                        | 1.8V core, 3.3V I/O                |
| Die Size               | 3.9mm×3.9mm                        |
| Package                | 208-pin CQFP                       |
| Logic Gates            | 303.78K                            |
| Internal Memory        | 22.75Kb SRAM                       |
| External Memory        | 4MB×2 SDRAM                        |
| Max. System Clock      | 100MHz                             |
| Max. Throughput        | 101.04Mpixels/sec                  |
| Core Power (MPEG-2)    | 108µW (1.15MHz@1V, QCIF@15fps)     |
| Core Power (H.264/AVC) | 125µW (1.15MHz@1V, QCIF@15fps)     |

## 4. IMPLEMENTATION RESULTS

Table 1 summarizes the features of this IC. The design targets sub-mW power consumption for real-time decoding at QCIF resolution and 15fps (frames per second) in mobile applications. This power consumption can be further reduced through voltage scaling. Measurements indicate that the chip operates at a working frequency of 1.15MHz with a supply voltage of 1V. As a result, video decoding of QCIF@15fps consumes 125$\mu$W and 108$\mu$W for H.264/AVC and MPEG-2, respectively.
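The reported operating point can be turned into a per-macroblock cycle budget and a per-frame energy figure with simple arithmetic. The sketch below uses only the numbers quoted above plus the 16×16 macroblock size common to both standards. The function names are hypothetical, and the results are derived estimates rather than measured values.

```python
def macroblocks_per_frame(width: int, height: int, mb_size: int = 16) -> int:
    """Number of 16x16 macroblocks in one frame."""
    return (width // mb_size) * (height // mb_size)


def cycles_per_macroblock(clock_hz: float, width: int, height: int, fps: int) -> float:
    """Average clock cycles available to decode one macroblock in real time."""
    return clock_hz / (macroblocks_per_frame(width, height) * fps)


def energy_per_frame_uj(power_uw: float, fps: int) -> float:
    """Average energy per decoded frame: microwatts / (frames/s) = microjoules."""
    return power_uw / fps


QCIF_W, QCIF_H, FPS, CLOCK_HZ = 176, 144, 15, 1.15e6

print("macroblocks per frame:", macroblocks_per_frame(QCIF_W, QCIF_H))
print("cycles per macroblock:", round(cycles_per_macroblock(CLOCK_HZ, QCIF_W, QCIF_H, FPS), 1))
print("H.264/AVC energy per frame (uJ):", round(energy_per_frame_uj(125, FPS), 2))
print("MPEG-2 energy per frame (uJ):", round(energy_per_frame_uj(108, FPS), 2))
```

Under these figures a QCIF frame holds 99 macroblocks, which leaves roughly 774 cycles per macroblock at 1.15MHz.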

## 5. CONCLUSIONS

A single-chip MPEG-2 SP@ML and H.264/AVC BL@L4 video decoder is fabricated in a 0.18$\mu$m 1P6M CMOS technology with an area of 15.21mm<sup>2</sup>. The low-power requirement is met through several design improvements, especially at the architectural level. MPEG-2 and H.264/AVC video decoding of QCIF sequences at 15fps is achieved at a clock frequency of 1.15MHz, consuming 108$\mu$W and 125$\mu$W, respectively, at a 1V supply voltage.

## 6. REFERENCES

- [1] Tsu-Ming Liu *et al.*, "A 125µW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications," *ISSCC Digest of Technical Papers*, pp. 402-403, Feb. 2006.
- [2] Tsu-Ming Liu *et al.*, "An 865-μW H.264/AVC Video Decoder for Mobile Applications," *IEEE Asian Solid-State Circuit Conference (A-SSCC'05)*, pp. 301-304, Nov. 2005.