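### National Chung Cheng University

## Graduate Institute of Computer Science and Information Engineering

#### Master's Thesis

應用於動態系統效能調整之嵌入式溫度與電路延遲感測器與系統驗證技術開發

On-chip Temperature and Delay Sensors for Adaptive System Design and System Verification with SVA

Graduate student: 張家銘

Advisor: Dr. 鍾菁哲

#### July 2011 (ROC year 100)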

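National Chung Cheng University, Master's Program

Consent Form for the Degree Examination

#### Department of Computer Science and Information Engineering

I agree that the thesis submitted by 張家銘, a graduate student under my supervision,

On-chip Temperature and Delay Sensors for Adaptive System Design and System Verification with SVA (應用於動態系統效能調整之嵌入式溫度與電路延遲感測器與系統驗證技術開發)

be put forward for the master's thesis examination.

Advisor: (signature), June 2, 2011 (ROC year 100)

#### National Chung Cheng University Master's Thesis Examination Approval Form

#### Department of Computer Science and Information Engineering

#### Thesis submitted by graduate student 張家銘

The thesis On-chip Temperature and Delay Sensors for Adaptive System Design and System Verification with SVA (應用於動態系統效能調整之嵌入式溫度與電路延遲感測器與系統驗證技術開發) has been reviewed by this committee and meets the standard for a master's thesis.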



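#### Abstract (translated from the Chinese 摘要)

In this thesis, we discuss the importance of adaptive system design, which releases the performance a circuit should inherently have. We propose two delay sensors for adaptive system design that are simpler, more portable, and more reliable. In addition, we develop an improved temperature sensor with low sensitivity to voltage variation.

We trace the progression from the shortcomings of traditional SoC design, to dynamic voltage and frequency scaling systems that use existing conventional sensors, and then to the introduction of error detection circuits. Their ability to sense on-chip variation ranges from coarse to highly precise, and from detection over a large chip area down to the level of individual combinational logic paths. Correspondingly, integrating these different sensors into a system involves different levels of implementation difficulty.

Chapter 4 of the thesis discusses the temperature sensor. Using a leakage-current delay cell reduces the effect of dynamic voltage variation, which is difficult to calibrate, on the temperature sensor.

The temperature sensor and delay sensing circuit we developed were also ported to a field-programmable gate array (FPGA) and integrated with UniRISC from another project, realizing an adaptively adjusted SoC that can be demonstrated live.

In this research, we implemented the delay sensor in a 65 nm process and demonstrated an adaptive system with dynamic performance adjustment that integrates the delay sensor.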

#### Abstract

In this thesis, we discuss the benefit of adaptive system design, which releases the maximum performance that circuits should have. Two types of delay monitor are proposed; they are simpler, more portable, and more robust. An improved smart temperature sensor with lower dependence on supply voltage is also proposed.

The discussion proceeds from the shortcomings of traditional system design, to dynamic voltage and frequency scaling (DVFS) systems, and then to the introduction of error detection circuits. The ability to detect on-chip variation improves from coarse to highly precise, and from the whole chip to every path in the circuit. The difficulty of integration varies with the sensors used.

Chapter 4 of the thesis covers the thermal sensor. With a leakage delay cell, the effect of dynamic voltage variation, which is hard to calibrate, can be reduced.

The developed thermal sensor and delay monitor have been ported to an FPGA. Integrated with the UniRISC system, they provide a live demonstration of an adaptively scaled SoC.

In this research, we fabricated a test chip in a 65 nm process. With the delay monitor integrated, an adaptive system is demonstrated on the test chip.

### Contents

| Section   | Title                                                             |
|-----------|-------------------------------------------------------------------|
| Chapter 1 | Introduction                                                      |
| 1.1       | Introduction to Adaptive System Design                            |
| 1.2       | Motivation                                                        |
| 1.3       | Types of Variations                                               |
| 1.4       | Thesis Organization                                               |
| Chapter 2 | Sensors for Adaptive System Design                                |
| 2.1       | Combination of Various Sensors                                    |
| 2.2       | Canary Circuit                                                    |
| 2.2.1     | Critical Path Monitor                                             |
| 2.2.2     | Delay Monitor                                                     |
| 2.3       | Error Detection Circuit                                           |
| 2.3.1     | Razor and Razor II                                                |
| 2.3.2     | Error Detection Sequence                                          |
| 2.3.3     | Low Cost Error Detection Circuit                                  |
| Chapter 3 | Adaptive System Architecture                                      |
| 3.1       | System Architecture                                               |
| 3.2       | DLX Processor                                                     |
| 3.3       | Clock Generator                                                   |
| 3.4       | Sensors Integration                                               |
| 3.5       | Adaptive Clock Control                                            |
| Chapter 4 | Smart Temperature Sensor                                          |
| 4.1       | Introduction                                                      |
| 4.2       | Proportional to Absolute Temperature (PTAT) Circuit               |
| 4.3       | Low Supply Voltage Sensitive Temperature Sensor                   |
| 4.4       | Sensor Architecture                                               |
| 4.5       | Calibration                                                       |
| 4.6       | Summary                                                           |
| Chapter 5 | Experimental Results                                              |
| 5.1       | Adaptive System with Delay Monitor                                |
| 5.2       | Delay Monitor and Temperature Sensor Implementation on FPGA       |
| 5.2.1     | Delay Monitor and Smart Temperature on CCU SoC Criti-core Project |
| 5.2.2     | Temperature Sensor Implementations on FPGA                        |
| 5.2.3     | Delay Monitor Sensor Implementation on FPGA                       |
| 5.2.4     | Sensors Integration                                               |
| 5.2.5     | Sensor Calibration                                                |
| 5.2.6     | Criti-Core Project Demo                                           |
| Chapter 6 | System Verification with SVA                                      |
| 6.1       | Introduction                                                      |
| 6.2       | MPEG-2 Decoder IP                                                 |
| 6.3       | SVA for Inverse Discrete Cosine Transform (IDCT)                  |
| 6.4       | SVA for MPEG-2 Decoder IP                                         |
| Chapter 7 | Conclusions and Future Work                                       |
| References |                                                                  |



## **List of Figures**

| Figure    | Caption                                                                                    |
|-----------|--------------------------------------------------------------------------------------------|
| Fig. 2.1  | DVFS system with various sensors [8]                                                       |
| Fig. 2.2  | Critical-path timing monitor used in 65-nm microprocessor [9]                              |
| Fig. 2.3  | Delay Monitor                                                                              |
| Fig. 2.4  | Timing diagram of delay monitor                                                            |
| Fig. 2.5  | (a) Razor I latch (b) Timing diagram of Razor I latch                                      |
| Fig. 2.6  | Data path with min-delay < Detection window                                                |
| Fig. 2.7  | (a) Transition detector with time borrowing (TDTB) and (b) timing diagram of TDTB [16]     |
| Fig. 2.8  | (a) Double sampling with time borrowing (DSTB) and (b) timing diagram of DSTB              |
| Fig. 2.9  | Low cost error detection circuit                                                           |
| Fig. 2.10 | Timing diagram of error detection circuit                                                  |
| Fig. 2.11 | Timing diagram of HSPICE simulation of low cost error detection circuit                    |
| Fig. 3.1  | System architecture                                                                        |
| Fig. 3.2  | 5-pipeline stages DLX                                                                      |
| Fig. 3.3  | Clock Generator                                                                            |
| Fig. 3.4  | DCO mux type                                                                               |
| Fig. 3.5  | DCO output frequency                                                                       |
| Fig. 3.6  | Time diagram of post-layout simulation                                                     |
| Fig. 4.1  | Architecture of conventional BJT based temperature sensor                                  |
| Fig. 4.2  | Schematic of the leakage delay cell                                                        |
| Fig. 4.3  | PTAT Pulse's width of leakage delay cell type                                              |
| Fig. 4.4  | Sensor architecture                                                                        |
| Fig. 4.5  | Timing diagram of IPTAT pulse generator                                                    |
| Fig. 4.6  | Single slope calibration with 0-70 °C                                                      |
| Fig. 4.7  | Two slope calibration                                                                      |
| Fig. 4.8  | Three slope                                                                                |
| Fig. 4.9  | Second order approximation with 0-100 °C                                                   |
| Fig. 4.10 | Second order approximation with 0-80 °C                                                    |
| Fig. 5.1  | Chip microphotograph                                                                       |
| Fig. 5.2  | Optimal frequency                                                                          |
| Fig. 5.3  | Criti-core reliability-central SoC systems architecture                                    |
| Fig. 5.4  | Thermal sensor implemented on FPGA                                                         |
| Fig. 5.5  | Delay monitor used in FPGA                                                                 |
| Fig. 5.6  | Timing diagram of delay monitor                                                            |
| Fig. 5.7  | Calibration circuit                                                                        |
| Fig. 5.8  | Architecture used in project demo                                                          |
| Fig. 5.9  | Cores and sensors floorplanning, identified by color                                       |
| Fig. 5.10 | CPR calibration flow                                                                       |
| Fig. 5.11 | Demo environment                                                                           |
| Fig. 5.12 | Adaptive clock control                                                                     |
| Fig. 6.1  | MPEG-2 IP                                                                                  |
| Fig. 6.2  | The 2D IDCT decoder diagram                                                                |
| Fig. 6.3  | Navigate the SVA checker with Verdi                                                        |
| Fig. 6.4  | An example of run-level encoding                                                           |
| Fig. 6.5  | Waveform of assertion data_out_chk                                                         |
| Fig. 6.6  | Assertions statistics                                                                      |



## **List of Tables**

| Table     | Caption                                              |
|-----------|------------------------------------------------------|
| Table 1.1 | Types of variation [3]                               |
| Table 4.1 | Compare of voltage variation on PTAT's pulse         |
| Table 5.1 | Resolution of delay cell used in calibration circuit |



# Chapter 1

## Introduction

#### **1.1 Introduction to Adaptive System Design**

Today's silicon devices keep growing in size and complexity: chips integrate more functions, run hotter, and consume more power. As fabrication technology scales into the deep sub-micron regime, on-chip variation gets worse [1, 2]. The traditional chip design flow builds in margins that account for process, voltage, and temperature (PVT) variation. These margins translate into higher power consumption or degraded performance and limit the real capability of a design.

An adaptive system comes with various types of monitoring sensors and supports dynamic frequency scaling and voltage control. It can maximize the performance or minimize the power consumption of the system.

#### **1.2 Motivation**

In a very large SoC design, a local region may be subject to multiple sources of variation, which produce a worse local environment than expected. The sum of these bad local conditions may exceed the originally designed guard margin and crash the whole system. The simple solution is to make the guard margin even larger, but this further degrades performance and consumes more power.

The above is the pessimistic view of variation. In contrast, if the local conditions are better than the lower limit of the guardbands, the circuit can be sped up. In the next section we discuss the types of variation and the limits of traditional sensors.

### **1.3 Types of Variations**

During production and in their operating environments, chips suffer from various types of variation. These variations can be classified temporally and spatially. Table 1.1 lists them.

Because the conversion rate of traditional voltage and thermal sensors is limited, these sensors cannot detect fast changes. They cannot detect transistor-level local process variation either.

|        | Static (extremely slow)      | Dynamic (slow change)              | Dynamic (fast change) |
|--------|------------------------------|------------------------------------|-----------------------|
| Local  | Within-die process variation | Temperature hot-spot               | Local IR drop         |
|        |                              |                                    | Cross-talk            |
|        |                              |                                    | Clock-tree jitter     |
| Global | Die-to-die process variation | Environment temperature            | Clock jitter          |
|        | NBTI [4]                     | Battery device supply voltage drop | Supply voltage jitter |
|        | Electromigration             |                                    |                       |

Table 1.1 Types of variation [3]

### **1.4 Thesis Organization**

In this thesis, we discuss the sensors used in the adaptive system and the integration of various sensors. The rest of the thesis is organized as follows.

In chapter 2, we discuss sensors that monitor environmental conditions and combine process, voltage, and temperature information. The capabilities and restrictions of these sensors and their design complexity are discussed as well.

In chapter 3, the architecture of the proposed adaptive system is presented, including the microprocessor, the clock generator, and the monitor sensors. The error recovery mechanism used when errors occur, which cooperates with the control of the system clock, is also described.

In chapter 4, an improved version of our previous smart temperature sensor [5] is proposed. Using the leakage delay cell newly developed in another study [6] significantly reduces the impact of voltage variation on the generation of the PTAT (proportional to absolute temperature) pulse and thus improves the precision of the smart temperature sensor.

In chapter 5, the simulation results of the test chip are shown first, followed by the FPGA ports of the thermal sensor and the delay monitor sensor, which are integrated into the CCU Criti-core project.

In chapter 6, system verification with SystemVerilog Assertions (SVA) is presented. Chapter 7 gives the conclusions and future work.

# Chapter 2

## **Sensors for Adaptive System Design**

To obtain the environmental information that lets a system adaptively adjust its working clock or voltage, various types of sensors have been proposed. They fall into three main categories: combinations of traditional sensors, canary circuits, and error detection circuits.

### 2.1 Combination of Various Sensors



Fig. 2.1 DVFS system with various sensors [8]

Fig. 2.1 shows a DVFS system that works with existing traditional sensors and stores pre-characterized information in a lookup table (LUT). The main problem with this type of system is that every sensor is subject to multiple sources of variation, which introduces a first error. Combining the sensor readings through the lookup table introduces a second error. The design and calibration effort for multiple sensors is also large.

### 2.2 Canary Circuit

#### 2.2.1 Critical Path Monitor

Critical path monitors are used to detect the overall effect of variation on the critical path at once, including process, voltage, and temperature variation [9-11].



Fig. 2.2 Critical-path timing monitor used in 65-nm microprocessor [9]

The critical path monitor uses various logic gates and wires to mimic the delay of the real critical path. Fig. 2.2 shows the critical-path timing monitor used in the POWER7 microprocessor [12].

#### 2.2.2 Delay Monitor

We propose a simple delay monitor circuit. With a small and simple design, it can monitor PVT variation on chip.

In this design, delay buffers are used as the critical path replica. We use the concept of FO4, which stands for "fanout-of-four inverter delay". The delay of a path composed of different logic gates can be divided by the FO4 delay, and the normalized delay holds constant over a wide range of process, temperature, and voltage [13].

The effect of on-chip variation differs more between gates and wires than between different gates. However, for local detection within a small area and without long wire connections, the precision of the replica is acceptable.



Fig. 2.4 Timing diagram of delay monitor

Fig. 2.3 shows the design of delay monitor. Timing diagram is shown in Fig. 2.4.

The signal Pulse is connected to the critical path replica, which adds the critical path's delay; the output signal Pulse\_dly is then used to sample Pulse. The sampled result in FF1 indicates whether the critical path delay has exceeded one clock period. Pulse\_dly is then connected to the Speed Up Margin block, which is the guard band that ensures the monitored circuit still works correctly after speeding up to a higher clock.
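The sampling behavior described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: Pulse is taken to be high for exactly one clock period, the second sampling point after the Speed Up Margin block (called `ff2` here) and the three decision names are hypothetical, and all delays are idealized numbers rather than values from the test chip.

```python
def delay_monitor(clock_period_ns, replica_delay_ns, speed_up_margin_ns):
    """Behavioral model of the delay monitor decision.

    Assumption: Pulse rises at t = 0 and stays high for one clock period,
    so a flip-flop clocked at time t samples 1 only while t < period.
    """
    def sample_pulse(t_ns):
        return 1 if t_ns < clock_period_ns else 0

    # FF1 is clocked by Pulse_dly, i.e. Pulse delayed by the critical path replica.
    ff1 = sample_pulse(replica_delay_ns)
    # Hypothetical second sample taken after the Speed Up Margin block.
    ff2 = sample_pulse(replica_delay_ns + speed_up_margin_ns)

    if ff1 == 0:
        return "slow_down"  # replica delay exceeds one clock period
    if ff2 == 1:
        return "speed_up"   # still inside the period even with the guard band
    return "hold"           # inside the period, but no margin left to speed up


print(delay_monitor(10.0, 6.0, 1.0))
```

With a 10 ns period, a 6 ns replica delay leaves room to speed up, while a 9.5 ns replica delay with a 1 ns margin would hold the current clock.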

### **2.3 Error Detection Circuit**

To deal with fast variations and exploit path-level variation, several error detection circuits have been proposed. Some of these circuits can also detect single event upsets.


#### 2.3.1 Razor and Razor II

Fig. 2.5 (a) Razor I latch (b) Timing diagram of Razor I latch

Fig. 2.5 shows the Razor latch and its timing diagram [3, 14, 15]. It is the first error detection circuit widely used in modern adaptive systems. It exploits path-level delay detection and detects the real errors that occur at any monitored register. Therefore, the Razor latch can detect fast-changing variation. However, this type of error detection cannot prevent errors from occurring, so it needs a recovery mechanism supported by the system.



Fig. 2.6 Data path with min-delay < Detection window

Another constraint exists in these types of error detectors. To detect a late signal, the detector opens a detection window after the rising clock edge. As shown in Fig. 2.6, the output of FF1 connects to FF2 and FF3 through a set of combinational logic. The transitions at points A and B both fall in the detection window. The detector cannot distinguish whether the data is late by more than one period or whether the min-delay of the data path falls in the window.

$$\begin{cases}
T_{\text{max-delay}} < T_{\text{period}} + T_{\text{detection-window}} \\
T_{\text{min-delay}} > T_{\text{detection-window}}
\end{cases}$$
(Eq. 2.1)

The two constraints are described in (Eq. 2.1). This is a trade-off: a wider detection window requires a larger min-delay, which means more area overhead. A narrower detection window restricts the max-delay of paths more severely; if the max-delay exceeds the limit, the violation will not be detected as an error. The undetected error may cause incorrect execution of the circuit.
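The two constraints of (Eq. 2.1) can be checked mechanically for a given data path. The sketch below is illustrative only; the function name and the example delays are assumptions, not values from the thesis.

```python
def check_detection_window(t_max_delay, t_min_delay, t_period, t_window):
    """Return the Eq. 2.1 constraints violated by a data path (times in ns)."""
    violations = []
    # Late data must still arrive before the detection window closes.
    if not t_max_delay < t_period + t_window:
        violations.append("max-delay")
    # Short paths must not toggle inside the detection window.
    if not t_min_delay > t_window:
        violations.append("min-delay")
    return violations


# Hypothetical path: 10 ns period, 1 ns detection window.
print(check_detection_window(10.5, 2.0, 10.0, 1.0))
```

Widening `t_window` relaxes the first check but tightens the second, which is the trade-off described above.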

#### 2.3.2 Error Detection Sequence

The Razor latch suffers from high area overhead and much higher clock energy. Another type of error detection circuit has been proposed: the transition detector with time borrowing (TDTB) latch and the double sampling with time borrowing (DSTB) latch [16].



Fig. 2.7 (a) (TDTB) and (b) timing diagram of TDTB [16]

The first proposed design is the transition detector with time-borrowing (TDTB) latch. Its circuit and timing diagram are shown in Fig. 2.7. The delayed exclusive-or serves as a transition detector: when signal D changes, XOR\_O outputs a corresponding pulse. When CLK is low, transistor p1 is on and transistor n1 is off, so the dynamic gate is pre-charged. When CLK rises, p1 turns off and n1 turns on; if XOR\_O stays at logic low, no discharge path exists. If signal D changes during the high phase of CLK, a pulse occurs that overlaps the CLK high phase, the dynamic gate is discharged, and ERROR rises. The ERROR signals of the TDTBs in the same pipeline stage are aggregated into one set-dominant latch (SDL).
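The detection rule of the TDTB, an error whenever D toggles while CLK is high, can be expressed as a short behavioral check. This sketch makes simplifying assumptions that are not in the source: CLK rises at integer multiples of the period, the XOR\_O pulse width is treated as zero, and the function name and example times are hypothetical.

```python
def tdtb_error(transition_times_ns, period_ns, high_phase_ns):
    """Return True if any transition of D falls in the high phase of CLK.

    Assumption: CLK rises at t = 0, period, 2*period, ... and stays high
    for high_phase_ns; the XOR_O pulse is modeled as an instant.
    """
    for t in transition_times_ns:
        phase = t % period_ns  # position of the transition inside its cycle
        if phase < high_phase_ns:
            return True        # dynamic gate discharges, ERROR rises
    return False


# D toggles at 8 ns and 21 ns with a 10 ns clock that is high for 3 ns.
print(tdtb_error([8.0, 21.0], 10.0, 3.0))
```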



Fig. 2.8 (a) Double sampling with time borrowing (DSTB) and (b) time diagram of DSTB

Fig. 2.8 (a) shows the second proposed design, double sampling with time borrowing (DSTB). It is similar to the TDTB latch, except that the transition detector is replaced by a master-slave flip-flop. The timing diagram is shown in Fig. 2.8 (b).

Because TDTB and DSTB already require the min-delay to be larger than the high phase of the input clock, the datapath register can be replaced by a latch, which also reduces the clock energy.

#### 2.3.3 Low Cost Error Detection Circuit

The error detection circuits described above cannot work with a general 50% duty-cycle clock. The required duty-cycle controller adds design and area overhead, and it may limit usability if the application needs a 50% duty-cycle clock.

Another issue is that every monitored register requires its own error detection circuit. In an aggressively pipelined design, the timing of every pipeline stage may be very tight, so many registers need to be monitored. The overhead of these designs is therefore still high.



Fig. 2.9 Low cost error detection circuit.

Fig. 2.9 shows the low cost error detection circuit we have developed. In this design, every eight registers share one error latch and one detection window generator. The detailed timing diagram is shown in Fig. 2.10.



Fig. 2.10 Timing diagram of error detection circuit

The detection circuit is composed of standard cells. In its implementation, some gates are reduced to equivalent logic circuits with lower propagation delay. The delays of the eight different paths also need to be balanced.



Fig. 2.11 Timing diagram of hspice simulation of low cost error detection circuit

Fig. 2.11 shows the timing diagram of the HSPICE simulation of the low cost error detection circuit. The transition of d4\_d falls in the detection window $dw_n$, so the latch srl\_q is set to logic 1.



# Chapter 3

## **Adaptive System Architecture**

### 3.1 System Architecture



The main architecture of the designed system is shown in Fig. 3.1. There are four main blocks. The clock generator generates the required clock frequency according to the input freq\_code. The controller takes information from the delay monitor to determine the optimum frequency. The delay monitor is in charge of monitoring the chip's PVT condition. The DLX processor is the test circuit, and its performance is affected by the PVT condition.

### **3.2 DLX Processor**



Fig. 3.2 5-pipeline stages DLX

To show the potential of the adaptive system and the ability to integrate the delay monitor into a generic ASIC design, a RISC microprocessor is implemented. The microprocessor is a 32-bit DLX processor with five pipeline stages [17]. It is built with a 64-byte instruction cache, a 64-byte data cache, and a register file of 16 32-bit entries.

For simplicity, the program run on this test chip is built in as ROM. The expected execution results are built into the Failure Checker, which checks the results automatically.

### **3.3 Clock Generator**



Fig. 3.3 Clock Generator.



Fig. 3.4 DCO mux type.

Fig. 3.3 and Fig. 3.4 illustrate the architecture of the clock generator. reference\_clk takes a 100 MHz external clock. The required frequency is selected by freq\_code. The required clock precision is not very high, so using a counter to lock to the frequency of ref\_clk is sufficient. The phase does not need to be locked either. The output characteristic of the DCO is shown in Fig. 3.5.



Fig. 3.5 DCO output frequency.

### **3.4 Sensors Integration**

Integrating the delay monitor is very simple; it does not affect the original data path of the monitored microprocessor.

When the low cost error detection circuits are integrated, the pipeline stage registers with critical timing are replaced by error detection circuits. The min-delay of the data path must be set as a hold-time constraint at the automatic place and route step.

#### **3.5 Adaptive Clock Control**

In the first version of the test chip, the clock frequency is controlled by the signals Updown and Lock.

DCO\_lock indicates whether the DCO is locked. After the DCO is locked, the controller examines Lock and Updown: while Lock is logic low, the clock frequency is adjusted according to Updown, with logic 1 for speed up and logic 0 for slow down, until Lock rises.
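The control rule above can be summarized as one update step of the frequency code. This is a minimal sketch, not the controller on the test chip: it assumes that a larger `freq_code` selects a higher frequency and that the code is limited to a hypothetical range of 0 to 63.

```python
def adaptive_clock_step(freq_code, dco_locked, lock, updown,
                        min_code=0, max_code=63):
    """Return the next frequency code for one control step.

    Assumptions: larger codes mean higher frequency; the 0..63 range
    is a placeholder, not the range used on the test chip.
    """
    # Nothing changes until the DCO is locked, or once Lock has risen.
    if not dco_locked or lock:
        return freq_code
    step = 1 if updown else -1  # Updown = 1 speeds up, Updown = 0 slows down
    return max(min_code, min(max_code, freq_code + step))


code = 10
for updown in (1, 1, 0):        # two speed-up steps, then one slow-down step
    code = adaptive_clock_step(code, dco_locked=True, lock=False, updown=updown)
print(code)
```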



Fig. 3.6 Time diagram of post-layout simulation.

Fig. 3.6 shows the timing diagram of the clock convergence process.

# Chapter 4

## **Smart Temperature Sensor**

### 4.1 Introduction

Temperature sensors are widely used today to measure local temperature on chip. In a many-core system or a large SoC, there may be many thermal sensors on chip. Calibrating every sensor is too expensive and inefficient. Smart temperature sensors have been proposed to solve these problems.

Previously proposed smart temperature sensors handle process variation but have no resistance to voltage variation [5, 18]. Process variation is static and can be calibrated. Voltage variation, however, is dynamic, so it cannot be calibrated before the sensors are shipped from the factory.

### 4.2 Proportional to Absolute Temperature (PTAT) Circuit

There are many types of PTAT pulse generator, including the bipolar junction transistor (BJT) based PTAT circuit [19] and the delay-line based PTAT circuit [20].



Fig. 4.1 Architecture of conventional BJT based temperature sensor.

Fig. 4.1 shows the architecture of the conventional BJT based PTAT circuit. $\Delta V_{BE}$ is the difference between the $V_{BE}$ of the two BJTs, which is proportional to absolute temperature. The subsequent analog-to-digital converter (ADC) converts $V_{PTAT}$ against $V_{Ref}$ into a digital code. However, variation of the $V_{Ref}$ used in the ADC affects the precision of the output code.

The other type, the delay-line based PTAT pulse generator, suffers from voltage variation too.

### 4.3 Low Supply Voltage Sensitive Temperature Sensor

A PTAT pulse generator that uses a p-MOS working in the cut-off region has been proposed [21]. The leakage current of an off p-MOS is less dependent on supply voltage, so this type of PTAT pulse generator reduces the effect of voltage variation.



Fig. 4.2 Schematic of the leakage delay cell.

In this section we use the leakage delay cell developed in [6] as the PTAT pulse generator's delay line. Fig. 4.2 shows the schematic of leakage delay cell.



Fig. 4.3 PTAT Pulse's width of leakage delay cell type

Fig. 4.3 shows the PTAT pulse width under $\pm 10\%$ voltage variation at the typical process corner.

| Temp. \ PTAT | Leakage Delay Cell | Delay Buffer |
|--------------|--------------------|--------------|
| 0°C          | 11.53%             | 26.49%       |
| 10°C         | 10.81%             | 26.03%       |
| 20°C         | 10.03%             | 26.00%       |
| 30°C         | 9.59%              | 25.79%       |
| 40°C         | 9.41%              | 25.60%       |
| 50°C         | 8.73%              | 25.22%       |
| 60°C         | 8.26%              | 24.99%       |
| 70°C         | 7.86%              | 25.08%       |
| 80°C         | 7.01%              | 24.59%       |
| 90°C         | 6.72%              | 24.63%       |
| 100°C        | 6.08%              | 24.69%       |

Table 4.1 Comparison of voltage variation on PTAT's pulse

Table 4.1 compares the effect of voltage variation on the PTAT pulse width of the leakage cell type versus the delay buffer type. The error ratio is computed as $(T_{0.9V} - T_{1.1V}) / T_{1.0V}$ under the typical case, where $T_{xV}$ is the PTAT pulse width at a supply voltage of $x$ V. The effect of voltage variation is reduced with the use of the leakage cell.
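As a worked example of the error ratio, read here as the difference between the pulse widths at 0.9 V and 1.1 V divided by the width at 1.0 V, the sketch below evaluates it for one set of pulse widths. The widths are made-up illustrative numbers, not measurements behind Table 4.1, and the function name is hypothetical.

```python
def voltage_error_ratio(t_0v9, t_1v0, t_1v1):
    """Error ratio (T_0.9V - T_1.1V) / T_1.0V of a PTAT pulse width."""
    return (t_0v9 - t_1v1) / t_1v0


# Hypothetical pulse widths in ns at 0.9 V, 1.0 V and 1.1 V.
ratio = voltage_error_ratio(105.0, 100.0, 95.0)
print(f"{ratio:.2%}")
```

A pulse that is 5% wider at 0.9 V and 5% narrower at 1.1 V gives an error ratio of 10%, which is the order of magnitude reported for the leakage delay cell.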

### 4.4 Sensor Architecture



Fig. 4.4 Sensor architecture

The architecture of the developed sensor is shown in Fig. 4.4.



Fig. 4.5 Timing diagram of IPTAT pulse generator

To achieve enough resolution, a cyclic pulse generator composed of a leakage cell DCO and a counter is used. The timing diagram of the IPTAT pulse generator is shown in Fig. 4.5. Since the pulse width of this IPTAT pulse generator is much wider than that of other IPTAT generators, the TDC can be designed to use the 100 MHz reference clock directly, which minimizes the influence of PVT variation on the TDC circuit.
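Counting the pulse width directly with the 100 MHz reference clock amounts to quantizing it in 10 ns steps. The following sketch shows that conversion under simplifying assumptions: the pulse is aligned to a reference clock edge, the pulse widths are hypothetical, and `tdc_code` is an illustrative name rather than a block in the design.

```python
def tdc_code(pulse_width_ns, ref_clock_mhz=100.0):
    """Number of full reference clock periods that fit in the pulse.

    Assumption: the pulse starts on a reference clock edge, so the code
    is simply the pulse width divided by the reference period, rounded down.
    """
    period_ns = 1000.0 / ref_clock_mhz  # 10 ns for a 100 MHz reference
    return int(pulse_width_ns // period_ns)


print(tdc_code(1234.0))
```

The wider the pulse, the smaller the relative quantization error of one reference period, which is why a wide IPTAT pulse allows this simple TDC.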

### 4.5 Calibration

In this section, we discuss various types of conversion from TDC code to temperature.



Fig. 4.6 Single slope calibration with 0-70 °C

In Fig. 4.6, we use two-point calibration (at 20 °C and 70 °C); the error is below 10 °C within 20-70 °C.



Fig. 4.7 Two slope calibration

In Fig. 4.7, we use three-point calibration; the error is below 10 °C.



Fig. 4.8 Three slope

In Fig. 4.8, we use four-point calibration; the error is below 6 °C.



Fig. 4.9 Second order approximation with 0-100 °C

In Fig. 4.9, we use a second-order approximation with data from 0-100 °C; the error is below 15 °C.



Fig. 4.10 Second order approximation with 0-80 °C

In Fig. 4.10, we use a second-order approximation with data from 0-80 °C; the error within 0-80 °C is below 8 °C.
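The single, two, and three slope calibrations can all be seen as piecewise-linear interpolation through two, three, or four calibration points. The sketch below illustrates that idea only; the calibration points are invented, the function names are hypothetical, and the second-order approximation is not covered.

```python
from bisect import bisect_right


def make_piecewise_calibration(points):
    """Build a TDC-code-to-temperature converter from calibration points.

    points: list of (tdc_code, temperature_c) pairs; N points give N - 1 slopes.
    """
    pts = sorted(points)
    codes = [c for c, _ in pts]

    def to_temperature(code):
        # Select the segment containing the code; codes outside the
        # calibrated range are extrapolated with the nearest end segment.
        i = bisect_right(codes, code)
        i = max(1, min(len(pts) - 1, i))
        (c0, t0), (c1, t1) = pts[i - 1], pts[i]
        return t0 + (code - c0) * (t1 - t0) / (c1 - c0)

    return to_temperature


# Hypothetical two-point (single slope) calibration at 20 and 70 degrees C.
single_slope = make_piecewise_calibration([(100, 20.0), (200, 70.0)])
print(single_slope(150))
```

Adding a third or fourth point only changes the list passed to `make_piecewise_calibration`, which mirrors the step from single slope to two and three slope calibration.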

### 4.6 Summary

Using the leakage delay cell effectively reduces the effect of voltage variation and improves the precision of the thermal sensor. It can also be combined with the calibration method used in our previous research.

# Chapter 5

## **Experimental Results**

### **5.1 Adaptive System with Delay Monitor**



Fig. 5.1 Chip microphotograph

The first version of the delay monitor is implemented in the UMC 65 nm standard performance (SP) CMOS process. Fig. 5.1 shows the chip microphotograph of the first version of the delay monitor.



Fig. 5.2 Optimal frequency

Fig. 5.2 shows the optimal frequency of the test chip under supply voltage variation from 0.9 V to 1.1 V.

### 5.2 Delay Monitor and Temperature Sensor Implementation on FPGA

#### 5.2.1 Delay Monitor and Smart Temperature on CCU SoC Criti-core Project



Fig. 5.3 Criti-core reliability-central SoC systems architecture.

This research was supported in part by the National Science Council of Taiwan, R.O.C., under Grants NSC98-2220-E-194-013 and NSC99-2220-E-194-011. We designed the thermal sensor and the delay monitor sensor and integrated them into this project. The architecture of the Criti-core multicore SoC system is shown in Fig. 5.3.

The final-year demo of the National Science Council (NSC) project integrates the thermal sensor and the delay monitor sensor into UniRISC cores. We implement these sensors on an FPGA to evaluate the adaptive scaling multicore platform.

#### 5.2.2 Temperature Sensor Implementations on FPGA

Implementing the thermal sensor and the delay monitor sensor on an FPGA raises a first problem: the EDA tools of the FPGA design flow we used perform logic simplification and optimization. During synthesis and translation, the delay line buffers are absorbed. We therefore use the primitive gate TBUF in the CLB (Configurable Logic Block) as the basic delay unit and wrap it in a module. The Xilinx ISE Design Suite provides a KEEP constraint to keep the design hierarchy throughout the implementation flow [22]. With this constraint applied to the delay line, the delay line is preserved with its designed hierarchy.

The thermal sensor on the FPGA is shown in Fig. 5.4. The basic architecture is the same as in last year's demo. The delay line buffer is updated to TBUF, and the sensor is integrated into dual UniRISC cores. The floorplanning and implementation flow for multiple cores with multiple thermal sensors was surveyed, too.



Fig. 5.5 Delay monitor used in FPGA

The FPGA version of the delay monitor is shown in Fig. 5.5. The architecture is basically the same as the ASIC version in section 2.2.2, except that the detailed design is adapted to the characteristics of the FPGA.



Fig. 5.6 Timing diagram of delay monitor

Fig. 5.6 shows the timing diagram of the delay monitor. The Pulse Generator takes CLK\_GLOBAL\_IN as its clock and generates a pulse every 8 cycles whose width is exactly one period of the input clock. The generated Pulse connects to the Critical Path Replica (CPR), which adds a delay equal to that of the most critical path. The delay of the CPR is determined by static timing analysis after a test place and route, and it is calibrated after the bit image is downloaded to the FPGA. The Error\_latch takes Pulse\_out as its clock to sample Pulse and outputs whether the delay of the CPR exceeds one clock period.



Fig. 5.7 Calibration circuit

Fig. 5.7 shows the details of the calibration circuit. In the implementation of the calibration circuit, the coarse-tune stage uses a primitive tri-state buffer, the TBUF of the Virtex-II logic tile. On the FPGA, half of the delay is contributed by the routing matrix. If TBUF were also used as the fine-tune cell, the net routing delay would be too large to achieve the required resolution, even though the delay of TBUF itself is small. We therefore use the primitive gate MUXCY, which is designed for high-speed carry chains, in the fine-tune circuit. These MUXCY gates are placed in a straight line and routed through dedicated paths. In the automatic place-and-route process, the calibration circuit must be placed and routed first, before the routing matrices become too congested. As mentioned above, wire routing accounts for a large part of the propagation delay; if the wires are not routed uniformly, the routing must be adjusted manually. Table 5.1 shows the approximate resolution of the calibration circuit.

Table 5.1 Resolution of delay cell used in calibration circuit

|                 | BUFT in CLB | MUXCY chain |
|-----------------|-------------|-------------|
| Resolution (ns) | 2           | 0.2         |

In the demo architecture, the clock used in the FPGA comes from the oscillator on the baseboard. It is routed through several buffers so that AHB can be used in synchronous mode. The jitter performance of CLK\_GLOBAL\_IN is not as good as that of the clock generators on the FPGA tile. If the CPR delay is close to the period of CLK\_GLOBAL\_IN, Error\_latch starts to alternate between logic 1 and 0. We therefore designed a counter that counts the number of times Error\_latch is logic 0 (that is, the CPR delay exceeds the period) within a specific window.
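The windowed counter can be illustrated with a small software model. This is only a sketch: the function name `count_errors`, the window length of 8 samples, and the sample values are assumptions made for illustration and are not taken from the implemented hardware.

```python
def count_errors(error_latch_samples, window):
    """Count, per window, how many Error_latch samples are logic 0.

    A logic 0 sample means the CPR delay exceeded the clock period.
    Samples left over after the last full window are ignored.
    """
    counts = []
    for start in range(0, len(error_latch_samples) - window + 1, window):
        chunk = error_latch_samples[start:start + window]
        counts.append(sum(1 for s in chunk if s == 0))
    return counts


# Hypothetical sample stream: the first window is near the timing limit
# (Error_latch toggles because of jitter), the second window is clean.
samples = [1, 1, 0, 1, 1, 0, 0, 1,
           1, 1, 1, 1, 1, 1, 1, 1]
print(count_errors(samples, window=8))  # prints [3, 0]
```

Counting over a window turns a jittery single-bit signal into a number that can be compared against a threshold.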

#### 5.2.4 Sensors Integration



Fig. 5.8 Architecture used in project demo.

Fig. 5.8 shows the architecture used in the Criti-core project demo. Two UniRISC cores are implemented on the FPGA. One delay monitor sensor serves the whole FPGA, and a thermal sensor is attached to each core. The sensors are connected through the Advanced Microcontroller Bus Architecture (AMBA) [23]. The floorplan of the cores and sensors is shown in Fig. 5.9.



Fig. 5.9 Cores and sensors floorplanning, identified by color.

The clock used on the FPGA comes from an oscillator on the baseboard. This clock source is an ICS307 serially programmable clock generator [24]. The output frequency can be configured by setting the system register OSC0. The clock generator is configured by three parameters:

- VCO Divider Word (VDW) = 4 to 511
- Reference Divider Word (RDW) = 1 to 127
- Output Divider (OD) = 2 to 10

$$\text{Output Frequency (MHz)} = 24 \times 2 \times \frac{VDW+8}{(RDW+2) \cdot OD}$$
 (Eq. 5.1)

The output of the clock source is calculated as in (Eq. 5.1). Because the UniRISC core does not support dynamic clock scaling, the clock is scaled without any protection. RDW and OD are fixed to 22 and 10 respectively, so the output frequency is determined by VDW alone. VDW is changed in single increments, with a 10 ms wait after each change for the clock to lock.

Since the FPGA is synchronous to the AHB host clock, clock control must be configured by the ARM baseboard. To speed up the development cycle, the delay monitor's information is sent to the ARM baseboard, and the adaptive clock control is then performed by the ARM.
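As a worked example of (Eq. 5.1), the short sketch below evaluates the output frequency for the fixed RDW = 22 and OD = 10. The function name `ics307_output_mhz` and the choice of VDW = 167 are illustrative assumptions; the parameter ranges are those listed above.

```python
REF_MHZ = 24.0  # reference frequency used in Eq. 5.1


def ics307_output_mhz(vdw, rdw=22, od=10):
    """Evaluate Eq. 5.1 for the given divider settings."""
    if not 4 <= vdw <= 511:
        raise ValueError("VDW must be in 4..511")
    if not 1 <= rdw <= 127:
        raise ValueError("RDW must be in 1..127")
    if not 2 <= od <= 10:
        raise ValueError("OD must be in 2..10")
    return REF_MHZ * 2 * (vdw + 8) / ((rdw + 2) * od)


# With RDW = 22 and OD = 10 the formula reduces to (VDW + 8) / 5 MHz.
print(ics307_output_mhz(167))  # prints 35.0
```

With these fixed dividers, each single increment of VDW changes the output frequency by 0.2 MHz.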

#### 5.2.5 Sensor Calibration

Each thermal sensor is calibrated separately with two-point calibration.



The calibration flow of the CPR is illustrated in Fig. 5.10. After the image is downloaded to the FPGA, the working frequency is adjusted to find the maximum speed at which the two cores work correctly. The corresponding clock period is exactly the length to which the CPR must be calibrated. Under the same environmental conditions, the calibration bits are tuned to lengthen or shorten the CPR delay to the appropriate length according to the Error\_num reported by the delay monitor sensor.

#### 5.2.6 Criti-Core Project Demo

Demo environment:

- RealView Platform Baseboard for ARM926EJ-S + Versatile/LT-XC2V8000
- FLUKE 54 II Thermometer
- Samsung 17" SyncMaster LCD
- TATUNG THD-90J hair dryer
- AXD Debugger for ARM Developer Suite 1.2

Fig. 5.11 Demo environment

The demo environment is set up as in Fig. 5.11. The development board loads a JPEG decoder onto the UniRISC cores on the FPGA. The decoded images are output to the VGA interface of the baseboard. The sensed temperature and delay information is shown in the ARM AXD Debugger. The thermometer's probes are attached to the surface of the FPGA chip. The thermometer is used to check the precision of the thermal sensor and to make sure the FPGA does not overheat during the demo.

We demonstrate the following two scenarios:

1. Without adaptive clock control, the cores run at the maximum speed of 35 MHz while the FPGA is heated by the hair dryer. When the FPGA's temperature rises to 60 °C, decoding core 2 starts to show a broken image.
2. With adaptive clock control, the core speed is determined by the delay monitor.

Fig. 5.12 shows the adaptive clock control flow of the demo. With adaptive clock control, the cores can run at a higher speed than expected when the temperature is low. When the temperature rises, adaptive clock control slows down the core clock, so the two cores continue to work correctly.
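The control behaviour described here can be approximated by a simple software loop. The sketch below is not the flow of Fig. 5.12; it is a minimal model under stated assumptions: the host steps VDW down by one when the delay monitor reports errors and up by one otherwise, and the monitor is replaced by a hypothetical function `fake_error_num` that reports errors whenever the clock period is shorter than an assumed CPR delay. All function names, the window size, and the delay values are illustrative.

```python
def period_ns(vdw, rdw=22, od=10):
    """Clock period in ns, derived from Eq. 5.1 (24 MHz reference)."""
    freq_mhz = 24.0 * 2 * (vdw + 8) / ((rdw + 2) * od)
    return 1000.0 / freq_mhz


def adaptive_step(vdw, error_num, threshold=0, vdw_min=4, vdw_max=511):
    """One decision: slow down on errors, otherwise try one step faster."""
    if error_num > threshold:
        return max(vdw_min, vdw - 1)
    return min(vdw_max, vdw + 1)


def fake_error_num(vdw, cpr_delay_ns, window=256):
    """Hypothetical monitor model: every sample in the window fails
    once the clock period is shorter than the CPR delay."""
    return window if period_ns(vdw) < cpr_delay_ns else 0


def run(vdw, cpr_delay_ns, steps):
    """Apply the control decision repeatedly and record the VDW history."""
    history = []
    for _ in range(steps):
        vdw = adaptive_step(vdw, fake_error_num(vdw, cpr_delay_ns))
        history.append(vdw)
        # On the real board the host would wait about 10 ms here
        # for the clock generator to lock before sampling again.
    return history


cool = run(167, cpr_delay_ns=28.0, steps=6)       # assumed "cool" CPR delay
hot = run(cool[-1], cpr_delay_ns=30.0, steps=20)  # assumed "heated" CPR delay
print(cool, hot[-2:])
```

In this simplified model the controller settles into alternating between the last passing and the first failing VDW value. A practical controller would probably add margin or hysteresis so that the cores never run at the failing setting.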



### Chapter 6

## System Verification with SVA

### **6.1 Introduction**

A large-scale system design combines many function blocks, synchronous and asynchronous buses, and IPs. The increasing complexity makes verification a huge challenge [25].

Due to the nature of the Verilog language, it is hard to check complex timing relations between signals.

In this chapter, we introduce an example application of SystemVerilog Assertions (SVA) to an MPEG-2 decoder IP. SVA checkers are inserted into this IP, and the results are navigated with the Novas Verdi Property Tool and Novas nWave.

### 6.2 MPEG-2 Decoder IP



Fig. 6.1 shows the block diagram of the MPEG-2 IP used in this chapter. The MPEG-2 IP supports various kinds of MPEG-2 decoding and the AMBA Advanced High-performance Bus (AHB). It is composed of a bitstream analyzer (BA), a texture decoder (TD), a reusable data manager (RDM), and a motion compensation (MC) unit.

### 6.3 SVA for Inverse Discrete Cosine Transform (IDCT)

The inverse discrete cosine transform (IDCT) is an important module in an MPEG-2 decoder. The discrete cosine transform is considered to be the most effective transform coding technique in practice for image and video compression. Using this technique, blocks of video data are converted into the transform domain for more efficient data compression.

The 8x8 IDCT is defined in (Eq. 6.1):

$$x(i,j) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\, C(v)\, X(u,v) \cos\left(\frac{(2i+1)u\pi}{16}\right) \cos\left(\frac{(2j+1)v\pi}{16}\right)$$
 (Eq. 6.1)

where $x(i,j),\ i,j = 0, 1, \ldots, 7$ is the pixel value, $X(u,v),\ u,v = 0, 1, \ldots, 7$ is the transformed coefficient, and

$$C(u), C(v) = \begin{cases} \frac{1}{\sqrt{2}} & \text{for } u, v = 0\\ 1 & \text{for } u, v \ge 1 \end{cases}$$

The two-dimensional IDCT can be simplified by applying a 1D IDCT along the rows and then along the columns (or vice versa). This module uses this technique to perform the 2D IDCT with a 64-entry transpose memory. The block diagram of the 2D IDCT module is shown in Fig. 6.2.
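The row-column decomposition can be checked numerically with a small floating-point model. This is a sketch for illustration only: the function names are assumptions, the hardware module works with fixed-point arithmetic, and the `transpose` step merely stands in for the role of the 64-entry transpose memory.

```python
import math

N = 8


def c(u):
    """Scaling factor C(u) from Eq. 6.1."""
    return 1 / math.sqrt(2) if u == 0 else 1.0


def idct_1d(X):
    """8-point 1D IDCT; applying it twice yields the 1/4 factor of Eq. 6.1."""
    return [0.5 * sum(c(u) * X[u] * math.cos((2 * i + 1) * u * math.pi / 16)
                      for u in range(N))
            for i in range(N)]


def transpose(m):
    """Swap rows and columns (the job of the transpose memory)."""
    return [list(row) for row in zip(*m)]


def idct_2d_separable(X):
    """1D IDCT along the rows, transpose, 1D IDCT again, transpose back."""
    first_pass = [idct_1d(row) for row in X]
    second_pass = [idct_1d(row) for row in transpose(first_pass)]
    return transpose(second_pass)


def idct_2d_direct(X):
    """Direct evaluation of the double sum in Eq. 6.1."""
    return [[0.25 * sum(c(u) * c(v) * X[u][v]
                        * math.cos((2 * i + 1) * u * math.pi / 16)
                        * math.cos((2 * j + 1) * v * math.pi / 16)
                        for u in range(N) for v in range(N))
             for j in range(N)]
            for i in range(N)]


# Arbitrary test coefficients: both methods should agree up to rounding.
X = [[(3 * u + 5 * v) % 7 - 3 for v in range(N)] for u in range(N)]
a, b = idct_2d_separable(X), idct_2d_direct(X)
print(max(abs(a[i][j] - b[i][j]) for i in range(N) for j in range(N)) < 1e-9)
```

The direct form needs 64 multiply-accumulate terms per output pixel, whereas the separable form needs 8 per pixel in each of the two passes, which is why the row-column structure is attractive in hardware.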



SVA checkers are added to the 1D IDCT, the transpose memory, and the whole 2D IDCT module. With SVA, the 2D IDCT can be implemented and checked easily. In this example we also inserted a bug in the transpose memory. As a result, the 2D IDCT checker fails, the 1D IDCT checker still passes, and the transpose memory's checker fails, which tells us that the bug is in the transpose memory. Fig. 6.3 shows the execution of the 2D IDCT checker; the waveform indicates that an error occurred because the output differs from the golden answer, which is generated by SVA.

|    | Message Analyzer |
|----|------------------|
| 1  | a_2d_dout: assert property( |
| 2  | p_dout |
| 3  | ); |
| 4  |  |
| 5  | property p_dout; |
| 6  | @(posedge clk) (\$rose(dout_en) \|-> ( ##k ((dout >= gold2[k]) ? ((dout - gold2[k]) <= 1) : ((dout - gold2[k]) >= -1)))); |
| 7  |  |
| 8  | endproperty |
| 9  |  |



Fig. 6.3 Navigate the SVA checker with Verdi
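The comparison made by the property in Fig. 6.3 accepts an output sample when it differs from the golden value by at most 1. The following sketch is a software model of that comparison, not the SVA code itself; the function names and the sample values are assumptions for illustration.

```python
def within_one(dout, gold):
    """Mirror the ternary in the property: allow a difference of at most 1."""
    if dout >= gold:
        return (dout - gold) <= 1
    return (dout - gold) >= -1


def failing_indices(douts, golds):
    """Return the indices k where dout is not within 1 of the golden value."""
    return [k for k, (d, g) in enumerate(zip(douts, golds))
            if not within_one(d, g)]


# Hypothetical outputs: index 2 is off by 2 and would fail the check.
print(failing_indices([10, 12, 7], [10, 11, 9]))  # prints [2]
```

A tolerance of one presumably absorbs rounding differences between the hardware data path and the golden model.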

### 6.4 SVA for MPEG-2 Decoder IP

In this section, we focus on the interfaces between modules. We insert checkers on the BA-TD and TD-MC interfaces.

The bitstream analyzer (BA) contains a run-level decoder and a FIFO that bridges to the texture decoder (TD). Run-level encoding is a technique that uses a shorter set of bits to represent consecutive zeros. Fig. 6.4 shows an example of run-level encoding; decoding is the reverse of the encoding procedure.
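The idea can be sketched in a few lines. The pair format used here, (number of preceding zeros, non-zero level), with trailing zeros implied by the block length, is a common convention and is assumed for illustration; it is not necessarily the exact format of Fig. 6.4 or of the MPEG-2 bitstream syntax.

```python
def run_level_encode(coeffs):
    """Encode a coefficient list as (run_of_zeros, level) pairs."""
    pairs, run = [], 0
    for value in coeffs:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs  # trailing zeros are implied by the known block length


def run_level_decode(pairs, length):
    """Reverse the encoding: expand each run of zeros, then append the level."""
    out = []
    for run, level in pairs:
        out.extend([0] * run)
        out.append(level)
    out.extend([0] * (length - len(out)))  # restore the trailing zeros
    return out


coeffs = [5, 0, 0, 3, 0, 1, 0, 0]
pairs = run_level_encode(coeffs)
print(pairs)  # prints [(0, 5), (2, 3), (1, 1)]
print(run_level_decode(pairs, len(coeffs)) == coeffs)  # prints True
```

A checker on the run-level decoder can use the same round-trip idea: the decoded stream must match the expected coefficient sequence.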



Fig. 6.4 An example of run-level encoding.

The interface between TD and MC is a 256-entry FIFO; SVA checkers are inserted into this FIFO.



Fig. 6.5 Waveform of assertion data\_out\_chk

Fig. 6.5 shows a successful assertion of the run-level decoder and its waveform.

| FSDB/Prop Type          | Total Prop | Fail | Pass | Inco |
|-------------------------|------------|------|------|------|
| evaluate_result.fsdb.vf | 13         | 0    | 13   | 0    |
| Assert                  | 13         | 0    | 13   | 0    |
| Assume                  | 0          | 0    | 0    | 0    |
| Cover                   | 0          | 0    | 0    | 0    |
| Others                  | 0          | 0    | 0    | 0    |

Fig. 6.6 Assertions statistics

Fig. 6.6 shows that all 13 assertions passed.



### Chapter 7

### **Conclusions and Future Work**

In this thesis, we have proposed an adaptive system design that includes two types of monitor sensors, a clock controller, and a microprocessor. The first sensor is a simple delay monitor, which is suited to small designs. The second is a low-cost error detection circuit, which has been shown to work in HSPICE-level simulation. Furthermore, it does not need a duty-cycle controller, so the clock jitter introduced by a duty-cycle controller is reduced.

An improved thermal sensor is also proposed. HSPICE simulation shows that using a leakage delay cell instead of a delay buffer reduces the error from 25.79% to 9.59% under $\pm 10\%$ voltage variation.

Some work remains to be done in the future:

First, the delay monitor does not account for differences in wire delay. It cannot monitor long wire delays, since the delay characteristics of wires differ from those of logic gates. This matters because, as the fabrication process scales into the sub-micron range, the proportion of wire delay is no longer negligible.

Second, the low-cost error detection circuit is composed of standard cells. Although this gives higher portability, the use of standard cells limits the design to the logic-gate level; a full-custom design should yield a smaller detector and a more flexible design.

Third, the implemented microprocessor is a simple five-stage pipeline system, a much smaller design than others, so the difference between wire delay and logic gate delay cannot be observed. For long-term research, it would be better to use an existing open-source processor, for example the LEON3 SPARC V8 Processor core [26]. With such an existing open-source SoC system, most tools are already available, including the GCC cross compiler for LEON3, the advanced high-performance bus, and a JTAG controller. This would let us focus on sensor development.

In view of the above, much work and research remain to be done.



## References

- [1] Keith A. Bowman, Steven G. Duvall, and James D. Meindl, "Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration," *IEEE Journal of Solid-State Circuits*, vol. 37, no. 2, pp. 183-190, 2002.
- [2] Sani Nassif, "Delay variability: sources, impacts and trends," in *Digest of Technical Papers, IEEE Solid-State Circuits Conference (ISSCC)*, Feb. 2000, pp. 368-369.
- [3] David Bull, Shidhartha Das, Karthik Shivashankar, Ganesh S. Dasika, Krisztian Flautner, and David Blaauw, "A Power-Efficient 32 bit ARM Processor Using Timing-Error Detection and Correction for Transient-Error Tolerance and Adaptation to PVT Variation," *IEEE Journal of Solid-State Circuits*, vol. 46, no. 1, pp. 18-31, 2011.
- [4] Wikipedia, "Negative bias temperature instability". Available: http://en.wikipedia.org/wiki/Negative\_bias\_temperature\_instability
- [5] Ching-Che Chung and Cheng-Ruei Yang, "An Autocalibrated All-Digital Temperature Sensor for On-Chip Thermal Monitoring," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 58, no. 2, pp. 105-109, 2011.
- [6] Ching-Che Chung and Chia-Lin Chang, "A 600 kHz to 1.2 GHz all-digital delay-locked loop in 65nm CMOS technology," *IEICE Electronics Express (ELEX)*, vol. 8, pp. 518-524, Apr. 2011.
- [7] Kang Kunhyuk, Park Sang Phill, Kim Keejong, and Kaushik Roy, "On-Chip Variability Sensor Using Phase-Locked Loop for Detecting and Correcting Parametric Timing Failures," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 18, no. 2, pp. 270-280, 2010.
- [8] James W. Tschanz, Kim Nam Sung, Saurabh Dighe, Jason Howard, Gregory Ruhl, Sriram Vanga, Siva Narendra, Yatin Hoskote, Howard Wilson, Carol Lam, Matthew Shuman, Carlos Tokunaga, Dinesh Somasekhar, Stephen Tang, David Finan, Tanay Karnik, Nitin Borkar, Nasser Kurd, and Vivek K. De, "Adaptive Frequency and Biasing Techniques for Tolerance to Dynamic Temperature-Voltage Variations and Aging," in *Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International*, 2007, pp. 292-604.
- [9] Alan Drake, Robert Senger, Harmander Deogun, Gary Carpenter, Soraya Ghiasi, Tuyet Nguyen, Nguyen James, Michael Floyd, and Vikas Pokala, "A Distributed Critical-Path Timing Monitor for a 65nm High-Performance Microprocessor," in *Digest of Technical Papers, IEEE Solid-State Circuits Conference (ISSCC)*, Feb. 2007, pp. 398-399.
- [10] Mohamed Elgebaly and Manoj Sachdev, "Variation-Aware Adaptive Voltage Scaling System," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 15, no. 5, pp. 560-571, 2007.
- [11] James Tschanz, Keith Bowman, Steve Walstra, Marty Agostinelli, Tanay Karnik, and Vivek De, "Tunable replica circuits and adaptive voltage-frequency techniques for dynamic voltage, temperature, and aging variation tolerance," in *VLSI Circuits, 2009 Symposium on*, Jun. 2009, pp. 112-113.
- [12] Malcolm Ware, Karthick Rajamani, Michael Floyd, Bishop Brock, Juan C. Rubio, Freeman Rawson, and John B. Carter, "Architecting for Power Management: The IBM® POWER7<sup>™</sup> Approach," in *High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on*, 2010, pp. 1-11.
- [13] Ron Ho, Kenneth W. Mai, and Mark A. Horowitz, "The future of wires," *Proceedings of the IEEE*, vol. 89, no. 4, pp. 490-504, 2001.
- [14] Shidhartha Das, David Roberts, Lee Seokwoo, Sanjay Pant, David Blaauw, Todd Austin, Krisztian Flautner, and Trevor Mudge, "A self-tuning DVS processor using delay-error detection and correction," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 4, pp. 792-804, 2006.
- [15] Shidhartha Das, Carlos Tokunaga, Sanjay Pant, Wei-Hsiang Ma, Sudherssen Kalaiselvan, Kevin Lai, David M. Bull, and David T. Blaauw, "RazorII: In Situ Error Detection and Correction for PVT and SER Tolerance," *IEEE Journal of Solid-State Circuits*, vol. 44, no. 1, pp. 32-48, 2009.
- [16] Keith A. Bowman, James W. Tschanz, Kim Nam Sung, Janice C. Lee, Chris B. Wilkerson, Shih-Lien L. Lu, Tanay Karnik, and Vivek K. De, "Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance," *IEEE Journal of Solid-State Circuits*, vol. 44, no. 1, pp. 49-63, 2009.
- [17] David A. Patterson and John L. Hennessy, *Computer Organization and Design The Hardware Software Interface*: Morgan Kaufmann.
- [18] Chen Poki, Chen Chun-Chi, Peng Yu-Han, Wang Kai-Ming, and Wang Yu-Shin, "A Time-Domain SAR Smart Temperature Sensor With Curvature Compensation and a 3σ Inaccuracy of -0.4 °C ~ +0.6 °C Over a 0 °C to 90 °C Range," *Solid-State Circuits, IEEE Journal of*, vol. 45, no. 3, pp. 600-609, 2010.
- [19] M. A. P. Pertijs, K. A. A. Makinwa, and J. H. Huijsing, "A CMOS smart temperature sensor with a 3σ inaccuracy of ±0.1 °C from -55 °C to 125 °C," *Solid-State Circuits, IEEE Journal of*, vol. 40, no. 12, pp. 2805-2815, 2005.
- [20] Chen Poki, Chen Chun-Chi, Tsai Chin-Chung, and Lu Wen-Fu, "A time-to-digital-converter-based CMOS smart temperature sensor," *Solid-State Circuits, IEEE Journal of*, vol. 40, no. 8, pp. 1642-1648, 2005.
- [21] Eisuke Saneyoshi, Koichi Nose, Mikihiro Kajita, and Masayuki Mizuno, "A 1.1V 35μm × 35μm thermal sensor with supply voltage sensitivity of 2°C/10%-supply for thermal management on the SX-9 supercomputer," in *VLSI Circuits, 2008 IEEE Symposium on*, Jun. 2008, pp. 152-153.

- [22] Xilinx®, "Constraints Guide". Available: http://www.xilinx.com/itp/xilinx10/books/docs/cgd/cgd.pdf
- [23] ARM Limited., "Application Note 119 Implementing AHB Peripherals in Logic Tiles". Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0119e/index.html
- [24] Inc. Integrated Circuit Systems, "ICS307 Serially Programmable Clock Source". Available: http://www.datasheetcatalog.org/datasheet/icst/ICS307M-02T.pdf
- [25] Srikanth Vijayaraghavan and Meyyappan Ramanathan, *A Practical Guide for SystemVerilog Assertions*, 2005.
- [26] Aeroflex Gaisler AB, "LEON3 SPARC V8 Processor core". Available: http://www.gaisler.com/cms/index.php?option=com\_content&task=view&id=13&Itemid=53

