



## Session Oral 1 (8/7 Wed. 13:30 – 15:00)

Session Topic: Circuit Design for Advanced Sensing Applications

Session Chair: Kun-Chih (Jimmy) Chen (National Sun Yat-Sen University)

Room: 6F 樂廳

1. 13:30 – 13:43 (SA11) A CMOS Temperature-to-Digital Converter Based on a Chopped Continuous-time Delta-Sigma Modulator

Po-Yu Li, Wei-En Lee, and Tsung-Hsien Lin

National Taiwan University

In this work, a resistor-based temperature sensor is presented. The resistive temperature sensing module is embedded in a 2nd-order 1-bit continuous-time delta-sigma modulator (CTDSM) to realize a hardware- and energy-efficient temperature-to-digital converter. To achieve low noise, a chopping technique is applied to mitigate the mismatch and flicker noise of the first-stage amplifier. A finite impulse response (FIR) filter is added to the feedback path to address the noise fold-back caused by the input-dependent error voltage that arises when chopping is used. The proposed circuit consumes 183.6 $\mu$W from a 1.8-V supply voltage. The temperature range is -40 °C to 100 °C. At a single conversion time of 333 $\mu$s, the temperature resolution is better than 0.003 °C (1$\sigma$), which leads to a resolution FoM of 0.55 pJ·°C$^2$.
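
As a quick sanity check, the quoted figure of merit is consistent with the commonly used resolution FoM for temperature sensors: energy per conversion multiplied by the square of the resolution. The sketch below assumes that definition (the abstract does not spell it out), and the function name is hypothetical.

```python
def resolution_fom_pj_c2(power_w, t_conv_s, resolution_c):
    """Resolution FoM = (power x conversion time) x resolution^2, in pJ*degC^2.

    Assumed definition; the abstract only quotes the final value.
    """
    energy_j = power_w * t_conv_s          # energy per conversion in joules
    return energy_j * resolution_c ** 2 * 1e12

# Values quoted in the abstract: 183.6 uW, 333 us, 0.003 degC (1 sigma).
fom = resolution_fom_pj_c2(183.6e-6, 333e-6, 0.003)
print(f"{fom:.2f} pJ*degC^2")
```

With the quoted power, conversion time, and resolution, this evaluates to about 0.55 pJ·°C², in line with the reported value.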

2. 13:43 – 13:56 (SA12) A Novel High-Resolution Area-Efficient All-Digital Temperature Sensor Design in Temporal Domain

Kun-Chih (Jimmy) Chen and Chung-Hsien (Joseph) Chiu

National Sun Yat-Sen University

Due to advances in process technology, the power density of modern Systems-on-Chip (SoCs) has increased, resulting in serious thermal problems. Many dynamic thermal management (DTM) schemes have been proposed in recent years, and they rely on on-chip temperature sensors to capture full-chip temperature information. Conventional temperature sensors contain ADCs to convert the temperature-sensitive voltage or current signal to a digital output. However, the area overhead of a conventional analog temperature sensor becomes large when the required sensing accuracy becomes strict. To solve this problem, we exploit the temperature sensitivity of signal propagation delay to design an all-digital temperature sensor. The proposed all-digital thermal sensor uses a single delay line to generate a temperature-sensitive pulse whose width is proportional to the measured temperature. Time-to-digital converters (TDCs) then convert the pulse width to digital temperature information. Implemented in the TSMC 0.18-$\mu$m CMOS process, the proposed all-digital temperature sensor achieves finer sensing resolution than related works. The achieved sensing error is within -1.0 °C to 1.0 °C.

3. 13:56 – 14:09 (SA13) Power-efficient Cyclic Voltammetric Electrochemical Sensing Readout Circuitry with Current-Reducer Ramp Waveform Generation

Yi-Chia Chen, Shao-Yung Lu, Siang-Sin Shan, and Yu-Te Liao

National Chiao Tung University

This paper presents an electrochemical sensing chip with an integrated current-reducer pattern generator and a low-noise chopper-stabilization potentiostat circuit. The pattern generator, utilizing the current-reducer technique and pseudo resistors, creates a sub-Hz ramp signal for the cyclic voltammetric measurement without large passive components. The proposed design adopts a chopper-stabilization potentiostat and a time-based converter to reduce the effects of amplitude noise. The design is fabricated in a 0.18-$\mu$m CMOS process and achieves a 41-pA current resolution over a current range of ±5 $\mu$A while maintaining an R$^2$ linearity of 0.998. The power consumption of the design is 16 $\mu$W when a 5-$\mu$A sensing current is detected. The power efficiency of the readout interface is 0.31 and the sensing current dynamic range is 108 dB. The design is fully integrated into a single chip.

4. 14:09 – 14:22 (SA14) A 2.4 GHz Gm-Boosted Complementary Current-Reuse Colpitts VCO with FoM of 189 dBc/Hz in 0.18 $\mu$m CMOS

Yu-Chieh Huang, Sheng-Kai Chang, and Kuang-Wei Cheng

National Cheng Kung University

This paper presents a differential Colpitts voltage-controlled oscillator (VCO) for low-power and low-phase-noise applications. Gm-boosted and complementary current-reuse techniques improve the current efficiency and achieve low power consumption. In addition, a clover-shaped inductor is utilized to reduce inductive crosstalk. The prototype VCO is fabricated in a 0.18-$\mu$m CMOS technology and dissipates 1.4 mW from a 1.1-V supply voltage. The measurement results show that the VCO operates from 2.34 to 2.55 GHz, a tuning range of 8.6%, and achieves a phase noise of -122.85 dBc/Hz at 1-MHz offset, with an FoM of 189 dBc/Hz.
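
The reported numbers can be cross-checked against the widely used oscillator figure of merit, FoM = -L(Δf) + 20·log10(f0/Δf) - 10·log10(P/1 mW), and the usual definition of tuning range relative to the band center. The sketch below assumes these definitions and assumes the phase noise was measured near the band center (2.445 GHz); neither is stated in the abstract, and the function names are hypothetical.

```python
import math

def vco_fom_dbc_hz(phase_noise_dbc_hz, f0_hz, offset_hz, power_w):
    """Standard VCO FoM, reported as a positive number."""
    return (-phase_noise_dbc_hz
            + 20 * math.log10(f0_hz / offset_hz)
            - 10 * math.log10(power_w / 1e-3))

def tuning_range_pct(f_lo_hz, f_hi_hz):
    """Tuning range as a percentage of the center frequency."""
    return 100 * (f_hi_hz - f_lo_hz) / ((f_hi_hz + f_lo_hz) / 2)

f_lo, f_hi = 2.34e9, 2.55e9
f_center = (f_lo + f_hi) / 2               # assumed measurement frequency
fom = vco_fom_dbc_hz(-122.85, f_center, 1e6, 1.4e-3)
tr = tuning_range_pct(f_lo, f_hi)
print(f"FoM = {fom:.1f} dBc/Hz, tuning range = {tr:.1f}%")
```

Under these assumptions the result lands close to the quoted 189 dBc/Hz and 8.6%.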

5. 14:22 – 14:35 (SA15) A 13.56-MHz Wireless Power Transfer Transmitter with Impedance Compression Network for Biomedical Applications

Fu-Wen Chang and Ping-Hsuan Hsieh

National Tsing Hua University

A wireless power transfer (WPT) transmitter with a class-E power amplifier is presented in this work. In the target application of implantable biomedical systems, variations in load condition and coil separation significantly degrade the power transfer capability (PTC) and power conversion efficiency (PCE). In the proposed design, an impedance compression network (ICN) is adopted to compress the resulting impedance variation and stabilize the performance. Duty-cycle control is introduced for further improvement. The design is implemented in a 0.18-$\mu$m CMOS process and operates at 13.56 MHz. Simulation results show that more than 30 mW of output power can be obtained for coupling coefficients from 0.06 to 0.30. The maximum output power is 58.97 mW, and the maximum power conversion efficiency is 54.4% with a drain efficiency of 81.0%. Compared to conventional structures, the proposed design achieves a wide dynamic range while meeting the target output power and efficiency.

6. 14:35 – 14:48 (SA16) A High-Efficiency Power Management IC with Power-Aware Multi-path Rectifier for Wide-Range RF Energy Harvesting

Hao-Yi Kuo, Shu-Hsuan Lin, Chen-Yi Kuo, and Yu-Te Liao

National Chiao Tung University

A highly integrated CMOS power-management system with a wide RF input range for ultra-high-frequency (UHF) wireless energy harvesting is presented. To avoid environment-caused sudden power loss and to scavenge energy efficiently, the proposed power management system adopts a power-aware rectifier architecture and adaptive DC-DC conversion ratios according to the input power level. The proposed system was fabricated in a 0.18-$\mu$m CMOS process. The system achieved a peak RF/DC conversion efficiency of 59%, a sensitivity of -11.6 dBm, and a 13.5-dB RF input range for at least 20% power efficiency at a 100-k$\Omega$ load. In the high input power region (>-9 dBm), the proposed architecture improves efficiency by about 15% compared to a conventional rectifier followed by a linear regulator. The peak efficiency of the entire system is 37%.

7. 14:48 – 15:01 (SA17) A High-Power-Efficiency Low-Input Low-Output Thermoelectric Energy Harvesting Interface for Internet-of-Things Devices

Meng-Jung Tsou, Tze-Yun Su, Philex Ming-Yan Fan, and Po-Hung Chen

National Chiao Tung University

This paper presents an energy harvesting interface for low-voltage, energy-efficient Internet-of-Things (IoT) devices in standard 0.18-$\mu$m CMOS technology. The proposed boost converter, along with the capacitive bootstrapping technique, converts the voltage from a thermoelectric generator to a near-threshold output voltage. The proposed capacitive bootstrapping technique generates a positive and negative bias pair to reduce the significant conduction losses of the power transistors in low-voltage operation. In addition, to efficiently extend the system's output power range, the internal bias voltages are automatically adjusted according to the loading conditions. The converter combines a constant on-time technique with digital zero-current detection to achieve both low power consumption and low reverse inductor current. The experimental results demonstrate a maximum power conversion efficiency of 76% over a 1–500 $\mu$W load range.

## Session Oral 2 (8/7 Wed. 13:30 – 15:00)

Session Topic: Power, Sensor, and Other Analog Techniques

Session Chair: Ching-Jan Chen (National Taiwan University) and Hung-Wen Lin (Yuan Ze University)

Room: 6F 御廳

1. 13:30 – 13:43 (S0142) Single-inductor two-boost converter with bidirectional energy flow (Best Paper Candidates)

Hung-Hsien Wu, Chi-Hsiang Huang, Yen-Yu Chen, Yi-Ting Liou, and Chia-Ling Wei

National Cheng Kung University

A single-inductor two-boost dc-dc converter with bidirectional energy flow is proposed. This work combines two dc-dc converters into one by sharing a single inductor, and it is capable of both storing energy and powering the output load by using the bidirectional inductor current. With the storing state, the variations in the output voltage are typically negligible when the input voltage changes. According to the measured results, the converter can start up successfully with a 0.55-V input voltage. The voltage range of the storage element is 1.2–1.4 V, and the maximum output power of the proposed converter is 18 mW with its output voltage set to 1.8 V.

2. 13:43 – 13:56 (S0162) A Low Noise Optical Encoder with Background Light Cancellation Using Photodiodes in Series (Best Paper Candidates)

You-Shin Chen (1), Tzu-Hsiang Hsu (1), Chien-Wen Chen (2) and Chih-Cheng Hsieh (1)

(1) National Tsing Hua University and (2) Industrial Technology Research Institute

This paper presents a low-noise readout circuit for an optical encoder. Both absolute and incremental encoders are implemented on chip with dual sensor arrays and the corresponding readout circuits. In the absolute encoder, 42 pixel columns with an adjustable dual-threshold quantizer are implemented with digitized outputs. In the incremental encoder, four quadrature sinusoidal signals, phase-shifted by 90° from each other, are generated for interpolation. Two photodiodes of opposite phase placed in series are proposed for background light cancellation without adding any extra circuit. A differential transimpedance amplifier (TIA) follows the photodiodes to eliminate the residual signals and convert the differential currents into voltage signals. A programmable gain amplifier (PGA) is implemented to fit the input range of the following 12-b SAR ADC. Measurement results show that the SNR reaches 60 dB and the maximum displacement error is 0.22 $\mu$m.
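
For readers unfamiliar with interpolation from quadrature signals, a common textbook approach forms differential sine and cosine terms from the four phases and takes the arctangent. The sketch below illustrates that generic technique only; it is not the authors' readout chain, and the offset, amplitude, and 20-µm pitch are hypothetical values chosen for the demonstration.

```python
import math

def interpolate_phase(s0, s90, s180, s270):
    """Return the electrical angle in [0, 2*pi) from four quadrature samples."""
    sin_d = s0 - s180      # differential pair cancels the common offset
    cos_d = s90 - s270
    return math.atan2(sin_d, cos_d) % (2 * math.pi)

# Synthetic samples with a common offset; the true angle is 1.0 rad.
theta, offset, amp = 1.0, 0.7, 0.3
samples = [offset + amp * math.sin(theta + k * math.pi / 2) for k in range(4)]

angle = interpolate_phase(*samples)
pitch_um = 20.0                                   # hypothetical grating pitch
position_um = angle / (2 * math.pi) * pitch_um    # position within one period
print(f"angle = {angle:.4f} rad, position = {position_um:.3f} um")
```

Because each differential term subtracts two signals that share the same offset, a constant background level drops out before the arctangent is taken.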

3. 13:56 – 14:09 (S0151) A 0.5V Real-time Computational CMOS Image Sensor with Programmable Kernel for Always-on Feature Extraction (Best Paper Candidates)

Tzu-Hsiang Hsu and Chih-Cheng Hsieh

National Tsing Hua University

This paper presents a 0.5-V computational CMOS image sensor (C2IS) with array-parallel computing capability for always-on feature extraction. By applying the developed pulse-width modulation (PWM) pixel and switch-current integration, the in-sensor 8-directional matrix-parallel multiply-accumulate (MAC) operation is realized. Moreover, the analog-domain convolution-on-readout (COR) operation, the programmable 3×3 kernel with 4-bit weights, and the tunable-resolution column-parallel ADC (1b to 8b) are implemented to achieve real-time feature extraction without additional memory. The C2IS prototype has been fabricated and verified to deliver raw and feature images at 480 fps with a power consumption of 77 and 91 $\mu$W and a resultant FoM of 9.8 and 11.6 pJ/pix/frame, respectively.
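
The energy FoM in pJ/pix/frame is conventionally power divided by the product of pixel count and frame rate. The abstract does not state the array size, so the sketch below works backward from the quoted figures to the pixel count they imply; the comparison against a 128×128 array is an inference, not a figure from the source, and the function names are hypothetical.

```python
def energy_per_pixel_frame(power_w, n_pixels, fps):
    """Energy FoM in joules per pixel per frame."""
    return power_w / (n_pixels * fps)

def implied_pixels(power_w, fom_j, fps):
    """Pixel count implied by a quoted power, FoM, and frame rate."""
    return power_w / (fom_j * fps)

raw_px = implied_pixels(77e-6, 9.8e-12, 480)     # raw-image mode
feat_px = implied_pixels(91e-6, 11.6e-12, 480)   # feature-image mode
print(round(raw_px), round(feat_px))

# Assumed 128 x 128 array, for comparison only.
fom_raw = energy_per_pixel_frame(77e-6, 128 * 128, 480)
print(f"{fom_raw * 1e12:.1f} pJ/pix/frame")
```

Both modes imply roughly sixteen thousand pixels, which is why a 128×128 array is a plausible reading of the quoted FoM.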

4. 14:09 – 14:22 (S0016) A Wide-Range Capacitive DC-DC Converter with 2D-MPPT for Soil/Solar Energy Extraction

I-Che Ou (1), Jia-Ping Yang (1), Chia-Hung Liu (1), Kai-Jie Huang (1), Kun-Ju Tsai (2), Yu Lee (2), Yuan-Hua Chu (2), and Yu-Te Liao (1)

(1) National Chiao Tung University and (2) Industrial Technology Research Institute

This paper presents a capacitive DC-DC converter with adaptive DC-DC conversion ratios and maximum power point tracking (MPPT) for soil and solar energy extraction. To cope with the varying input power ranges of the soil and solar energy sources, a two-dimensional power tracking loop with time-based current slope detection is employed. The design was fabricated in a 0.18-$\mu$m CMOS process, achieving >80% efficiency over a throughput power range of 360 $\mu$W to 25 mW in the soil mode and 400 $\mu$W to 10 mW in the solar mode, with a peak system efficiency of 89.5%.

5. 14:22 – 14:35 (S0187) A Low-Area Programmable Low-Pass-Filter with Automatic -3dB Frequency Calibration

Zhi-Sheng Zhang, Tzu-Hao Lin, and Hung-Wen Lin

Yuan Ze University

This paper proposes a low-pass filter (LPF) with a wide $f_{-3\mathrm{dB}}$ range and its $f_{-3\mathrm{dB}}$ calibration circuit. By using a replica LPF cell to build an oscillator and controlling the oscillation frequency, the $f_{-3\mathrm{dB}}$ of the LPF can be programmed across different process corners. Simulation results show that the calibrated $f_{-3\mathrm{dB}}$ frequency deviates from the target frequency by about 6%. The power consumption of the calibration system is about 3.7% of that of the LPF core.

6. 14:35 – 14:48 (S0124) A Bandgap Voltage Reference Circuit with Calibration Technique for Reducing Process Variation

Chiao-Han Yang and Shuenn-Yuh Lee

National Cheng Kung University

In this paper, we present a high-precision bandgap voltage reference (BGR) circuit with a calibration technique to reduce process variation. In traditional BGR circuits, the process variation can be calibrated to achieve near-zero voltage variation by first-order compensation, but the temperature coefficient still varies from 5 to 50 ppm/°C across process corners. To obtain a more precise reference voltage, a new compensation circuit is implemented in this paper to accomplish second-order compensation. The proposed circuit can automatically calibrate the reference voltage of the BGR circuit when the BJTs and MOSFETs operate in the TT corner and the resistors operate in any process corner. The BGR circuit is realized in the TSMC 0.18-$\mu$m CMOS process and occupies an active area of 0.8217 mm$^2$. The average temperature coefficient is 29.39 ppm/°C from 0 °C to 120 °C, the average power consumption is 104.94 $\mu$W under a 1.8-V supply voltage, and the power supply rejection ratio (PSRR) is -39.6 dB at 100 Hz and -40.7 dB at 1 kHz.

7. 14:48 – 15:01 (S0075) Distributed Diode-Triggered SCR for Broadband ESD Protection in CMOS Technology

Chun-Yu Lin, Yu-Hsuan Lai, and Zih-Jyun Dai

National Taiwan Normal University

Electrostatic discharge (ESD) protection design is needed for integrated circuits; however, the ESD protection devices beside the I/O pad may negatively affect circuit performance. To achieve both excellent ESD robustness and good broadband performance, a silicon-controlled rectifier (SCR) is combined with trigger diodes and a matching inductor to form a novel distributed diode-triggered SCR ($\pi$-SCR), which is presented in this work. Compared with the conventional $\pi$-diode in silicon, the $\pi$-SCR reduces the clamping voltage during the critical positive-to-VSS (PS) ESD test, and the high-frequency performance is not seriously degraded (insertion loss <2 dB within 0–20 GHz in this work). In addition, the $\pi$-SCR does not suffer from latchup in low-voltage CMOS technology, and it has the potential to further reduce the clamping voltage during other ESD tests. Therefore, the $\pi$-SCR is a good choice for broadband ESD protection in CMOS technology.

## Session Oral 3 (8/7 Wed. 13:30 – 15:00)

Session Topic: AI Computing and Acceleration

Session Chair: Shih-Hsu Huang (Chung Yuan Christian University) and Wei-Kai Cheng (Chung Yuan Christian University)

Room: 6F 茗廳

1. 13:30 – 13:43 (SB11) Low Accuracy Loss and Hardware-Friendly Model Compression Technique for DNNs

Ya-Chu Chang and Juinn-Dar Huang

National Chiao Tung University

Deep neural networks (DNNs) are broadly utilized in numerous machine learning applications nowadays. In a large DNN with several hidden layers, the number of weights (or coefficients) required to complete an entire perceptron is indeed huge. This excessive number of weights not only requires a big chunk of memory but also creates significant memory access traffic, which is a heavy burden especially for small-scale embedded systems and edge devices. Therefore, several compression techniques have been proposed during the past few years. We present a new hardware-friendly model compression technique in this paper. It can achieve a compression rate of 20X~30X while keeping the accuracy loss below 1%.

2. 13:43 – 13:56 (SB12) A Simulator for Evaluating the Fault-Tolerance Capability of DNNs

Yung-Yu Tsai and Jin-Fu Li

National Central University

The deep neural network (DNN) is considered an effective technique for artificial intelligence applications. A DNN consists of a large number of neurons arranged in multiple layers. Typically, a DNN has overprovisioned neurons such that it has the property of fault tolerance [1][2]. However, how to evaluate the fault-tolerance capability of DNNs is an important issue. In this paper, we propose a simulator to estimate the loss of inference accuracy due to faults in a DNN model or hardware accelerator. The simulator is implemented on the Keras and TensorFlow platforms. It can evaluate the fault-tolerance capability of a DNN at the model and hardware levels. The proposed simulator can estimate the accuracy loss of a DNN model caused by a faulty neuron, a faulty link, or a faulty input. Fault injection is done through bitwise operations on the TensorFlow parameters. The simulator integrates the TensorFlow and Keras platforms to evaluate the accuracy of a DNN model with faulty elements. The simulator can also estimate the inference accuracy loss of a DNN accelerator caused by faulty buffers. Simulation results of accuracy with respect to different fault rates are presented for the LeNet and 4C2F models.
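
To illustrate what bitwise fault injection on a parameter can look like, the sketch below flips a single bit in the IEEE 754 single-precision encoding of a weight. It is a generic, standard-library illustration and not the authors' simulator or its TensorFlow-based mechanism; the function name and the choice of float32 are assumptions.

```python
import struct

def flip_bit_float32(value, bit):
    """Flip one bit of a float32 value (bit 0 = mantissa LSB, bit 31 = sign)."""
    if not 0 <= bit <= 31:
        raise ValueError("bit index must be in 0..31")
    # Reinterpret the float32 as a 32-bit unsigned integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask toggles exactly one bit, then convert back.
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty

weight = 0.5
sign_fault = flip_bit_float32(weight, 31)      # sign bit flipped
exponent_fault = flip_bit_float32(weight, 23)  # lowest exponent bit flipped
print(sign_fault, exponent_fault)              # -0.5 1.0
```

A flip in the sign or exponent field changes the weight far more than a flip in the low mantissa bits, which is one reason accuracy loss depends on where faults land as well as on the fault rate.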

3. 13:56 – 14:09 (SB13) Approximate Systolic Array-based Processor for AI Computation

Wei-Kai Tseng, Huan-Jan Chou, Ning-Chi Huang, and Kai-Chiang Wu

National Chiao Tung University

Approximate computing is an emerging strategy which trades computational accuracy for computational cost in terms of performance, energy, and/or area. We propose a novel sensor-based approximate adder, Carry Truncate Adder (CTA), for high-performance energy-efficient arithmetic computation, while considering the accuracy requirement of error-tolerant applications. On top of a fully-optimized ripple carry adder, the performance of our adder is enhanced by 2.17X. When applied in error-tolerant applications such as image processing and handwritten digit recognition, our approximate adder leads to very promising quality of results compared to the case when an accurate adder is used. Systolic arrays are widely used as matrix multiplication accelerators for DNNs. To improve the performance and energy efficiency of a systolic array, we apply the idea of timing speculation based on our proposed CTA into a systolic array. By using in-situ sensors for approximate multiplier-accumulator (MAC) computation in a systolic array, the computation which needs longer propagation time will drop the next result of multiplication and occupy two (adjacent) MACs to complete the current multiplication and accumulation.

In the experiments, compared to the original systolic array (without any approximation), our proposed approximate systolic array can reduce the clock period from 8.57 to 5.39 (ns) with only 1% accuracy loss on MNIST dataset.

4. 14:09 – 14:22 (SB14) Approximate Logic Circuit Design for AI Applications

Wei-Hung Lin, Hsu-Yu Kao, and Shih-Hsu Huang

Chung Yuan Christian University

To reduce the power consumption of an embedded system, the design of approximate logic circuits appears to be a promising solution for many error-resilient applications. In this paper, we introduce approximate logic circuit design for AI applications. We have developed a circuit library of approximate logic circuits for AI applications. Moreover, we have constructed a neural network (NN) design framework that lets users apply the circuit library to develop their AI applications. So far, we have implemented ICNet, a well-known semantic segmentation NN, using the proposed NN design framework. Experimental results show that, compared with the original ICNet model, even if all the multiplications and activation functions are replaced by our approximate logic circuits, the accuracy loss is only 0.1%. Our future work is to provide more approximate logic circuits in the NN design framework to widen the trade-off space. We will also try to develop more AI applications based on the NN design framework.

5. 14:22 – 14:35 (SB15) Dataflow Exploration Framework for Data Reuse of CNN Computation

Xiang-Yi Liu, Yuan-Chih Lo, Tsai-Yu Tsai, and Wei-Kai Cheng

Chung Yuan Christian University

In edge intelligence systems, memory accesses become the bottleneck of the DNN hardware accelerator because hardware resources are limited. In contrast to CPU or GPU architectures, dataflow processing is an effective method to improve the efficiency of data migration in the DNN hardware accelerator. However, memory accesses consume a high percentage of energy in this type of DNN architecture. In this paper, we propose a modified dataflow approach based on Eyeriss to reduce the data volume of external memory accesses. Experimental results show that our dataflow approach can reduce the migration of kernel and input feature map data between external DRAM and the internal buffer.

6. 14:35 – 14:48 (S0168) AIP: Saving the DRAM Access Energy of CNNs Using Approximate Inner Products

Cheng-Hsuan Cheng and Ren-Shuo Liu

National Tsing Hua University

In this work, we propose AIP (Approximate Inner Product), which approximates the inner products of CNNs' fully connected (FC) layers by using only a small fraction (e.g., one-sixteenth) of the parameters. We observe that FC layers possess several characteristics that naturally fit AIP: the dropout training strategy, rectified linear units (ReLUs), and the top-n operator. Experimental results show that DRAM access energy can be reduced by 48% at the cost of only a 2% loss in top-5 accuracy (for VGG-f).

7. 14:48 – 15:01 (S0030) Filter Pruning based on Dynamic Convolutional Neural Network for Surveillance Video

Chun-Ya Tsai, De-Qin Gao, and Shanq-Jang Ruan

National Taiwan University of Science and Technology

Large-scale surveillance video analysis is becoming important with the development of intelligent cities; however, the heavy computational resources required by state-of-the-art deep learning models make real-time processing hard to implement. Because surveillance videos generally exhibit high scene similarity, we propose an effective compression architecture called dynamic convolution, which reuses previous feature maps to reduce the amount of computation, and we combine it with filter pruning to further speed up processing.

## Session Oral 4 (8/7 Wed. 13:30 – 15:00)

Session Topic: SoC Design in Emerging System

Session Chair: Yu-Hsuan Lee (Yuan Ze University) and Ching-Hwa Cheng (Feng Chia University)

Room: 7F 論語廳+大學廳

1. 13:30 – 13:43 (S0094) 20 Mbit/s Visible Light Communication Based on 16QAM OFDM Transmission by Using Red LED

Zhen-Hao Zhu (1), Wei-Ting Lin (1), Yu-Jung Wang (1), Siou-Lin You (1), Cheng-You Ho (2), Chi-Lun Hsu (2), Chun-Hsing Lee (2), and Hsi-Pin Ma (1)

(1) National Tsing Hua University and (2) Industrial Technology Research Institute

In this paper, we present a 20 Mbit/s visible light communication (VLC) system based on a red light-emitting diode (LED). For VLC systems, intensity modulation with direct detection (IM/DD) is widely used, which means that the transmitted signal must be non-negative and real-valued. In this system, we implemented the DC-biased optical orthogonal frequency division multiplexing (DCO-OFDM) architecture and solved the synchronization problem in OFDM systems. In addition, the least squares (LS) algorithm is used to estimate the channel response. The bit error rate (BER) is below 0.00048 after 15 cm of free-space transmission.
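
The non-negative, real-valued constraint is what distinguishes DCO-OFDM from conventional OFDM: Hermitian symmetry across the subcarriers makes the IFFT output real, and a DC bias lifts it above zero. The sketch below shows that generic construction with a plain inverse DFT from the standard library. The bit-to-symbol mapping, the symbol size, and the choice of the smallest clipping-free bias are illustrative assumptions, not details of the presented system.

```python
import cmath

# Gray-coded amplitude per 2 bits; an illustrative 16QAM mapping.
LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}

def qam16(bits):
    """Map each group of 4 bits to one 16QAM symbol."""
    return [complex(LEVELS[tuple(bits[i:i + 2])], LEVELS[tuple(bits[i + 2:i + 4])])
            for i in range(0, len(bits), 4)]

def dco_ofdm_symbol(data, bias=None):
    """Build one real, non-negative OFDM symbol from a list of QAM symbols."""
    n = 2 * (len(data) + 1)                   # IDFT size
    spectrum = [0j] * n                       # bins 0 and n/2 stay empty
    for k, sym in enumerate(data, start=1):
        spectrum[k] = sym
        spectrum[n - k] = sym.conjugate()     # Hermitian symmetry
    # Inverse DFT; the imaginary parts cancel up to rounding error.
    real = [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]
    if bias is None:
        bias = -min(real)                     # smallest bias that avoids clipping
    return [max(0.0, x + bias) for x in real]  # clip anything still negative

bits = [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
tx = dco_ofdm_symbol(qam16(bits))
print(len(tx), min(tx) >= 0.0)
```

Practical transmitters usually apply a fixed bias and accept some clipping distortion rather than adapting the bias per symbol as this sketch does.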

2. 13:43 – 13:56 (S0078) A Design Platform for Smart Sensor Development

Chun-Ming Huang, Chih-Chyau Yang, Yi-Jie Hsieh, Chun-Wen Cheng, Yu-Tsang Chang, and Chien-Ming Wu

Taiwan Semiconductor Research Institute, National Applied Research Laboratories

Due to rapid advances in IC fabrication and electronic design automation technologies, integrating a smart sensor design into a single chip has become practical. To help the MEMS sensor teams in Taiwan academia accelerate their smart sensor development, this paper presents a design platform for smart sensor development which consists of a common platform unit and a sensor unit. The proposed design platform provides FPGA, Smart Sensor on Chip (SSoC), and Smart Sensor in Package (SSiP) solutions for smart sensor implementation. The deliverables and design flows for the FPGA-based, SSoC, and SSiP smart sensor implementations are also presented in this paper. The proposed smart sensor common platform unit was taped out in a UMC 0.18-$\mu$m process for silicon proof. The experimental results show that the presented smart sensor platform is well suited to smart sensor development.

3. 13:56 – 14:09 (S0097) An Efficient VLSI Architecture Design for Real-time Image Object Tracking Applied to UAV

Yao-Fong Huang (2), Yu-Syuan Jhang (2), Ming-Hwa Sheu (2), Shin-Chi Lai (1), and Chi-Chia Sun (3)

(1) Nanhua University, (2) National Yunlin University of Science and Technology, and (3) National Formosa University

This paper presents an object tracking algorithm based on sample grouping in terms of hue, saturation, and value (HSV), which has been applied to an unmanned aerial vehicle (UAV). The background group is determined from the information around the selected target object so that the target object can be effectively extracted from the background. By using the group information to separate the background and the object, HSV samples can be extracted from the object area and grouped, yielding the candidate sample groups. We propose a Gaussian mixture sample distribution (GMSD) method to enhance the performance. The object candidates obtained after grouping are used to search for the best matching region coordinate, where the matching criterion is based on the minimum difference between the group central value and the standard deviation. The experimental results show that the proposed method can process full-HD video in real time, and its tracking performance is better than that of previous works. Additionally, the proposed method can handle illumination variation, object scaling variation, fast object motion, object shading, and objects moving out of view.

4. 14:09 – 14:22 (S0140) A Novel 3-D Oximeter Image System for Breast Cancer Diagnostics

Wen-Jun Wu, Jia-Jiun Guo and Wai-Chi Fang

National Chiao Tung University

Diffuse optical tomography (DOT) is a relatively novel, noninvasive, and nonionizing technique for breast tumor diagnosis. This study proposes a 3-D oximeter image system as a novel approach for early detection of breast cancer. The proposed system reconstructs the oxygen saturation (SpO2) distribution map of breast tissue in real time using two wavelengths, 735 nm and 890 nm. The measuring method uses the diffused image of the breast sample to create a three-dimensional distribution map of the entire measured region. The system's functionality is validated using an experimental human breast phantom. The proposed system can effectively detect tumors at a depth of 20 mm. The system identified tumors within the breast tissue with a sensitivity of 87.5% and a specificity of 93.7%.

5. 14:22 – 14:35 (S0045) Hierarchy Quadruple-Voltage Low-Power Chip Design Methodology

Ching-Hwa Cheng

Feng Chia University

A Hierarchy Multiple-Voltage EDA framework, HMulti-Vdd, is proposed to effectively reduce power consumption. HMulti-Vdd can be used to identify how many voltage domains are best for a performance-power optimized multiple-Vdd chip. Power consumption can be reduced by up to 50% while the performance loss of the chip is kept within 5%. The HMulti-Vdd EDA framework reduces the manual effort of designing a multiple-Vdd chip. This automated design framework identifies the separation into high-voltage and low-voltage modules in the front-end and back-end chip synthesis stages. HMulti-Vdd includes performance, area, and power design guidance, and works jointly with several commercial circuit-synthesis and physical place-and-route EDA tools when designing the chip. By applying HMulti-Vdd, a multiple-Vdd chip can be quickly redesigned based on power, delay-time, and gate-count optimization requirements.

6. 14:35 – 14:48 (S0080) A CHF Detection System with 1D CNN

Bei-Lin Chuang (1), Yan-Hong Lin (2), Chi-Sheng Hung (2), and Hsi-Pin Ma (1)

(1) National Tsing Hua University and (2) National Taiwan University Hospital and National Taiwan University

With the recent significant increase in cardiovascular diseases, automatic classification of electrocardiogram (ECG) signals has come to play a very important part in the clinical diagnosis of these diseases. In this paper, a method based on a 1D convolutional neural network (CNN) is proposed to classify ECG signals. The proposed detection system is composed of three parts: data pre-processing, model establishment, and classification. The model is trained with the neural network structure to obtain the classification result. The recognition accuracy between CHF and Control reaches 97.57% on the training set and 95.31% on the testing set, significantly outperforming several typical ECG classification methods.

7. 14:48 – 15:01 (S0171) DI-SSD: Desymmetrized Interconnection Architecture and Dynamic Timing Calibration for Solid-State Drives

Ruei-Fong Chiu, Jian-Hao Huang, and Ren-Shuo Liu

National Tsing Hua University

NAND flash-based solid-state drives (SSDs) have long been architected such that the interconnections between a flash controller and the associated flash memory chips operate at a symmetric speed in both directions. However, this commonly accepted and widely used architecture is suboptimal for SSDs because reading flash cells is 10 to 20× faster than writing them. In response, we propose the desymmetrized interconnection SSD architecture (DI-SSD) and dynamic timing calibration (DTC), which selectively push the flash-to-controller speed to the limit. We conduct comprehensive experiments, including characterizing real SSD products, using industrial-strength IC test equipment to emulate a flash controller that adopts DTC, and simulating DI-SSD, to demonstrate the benefits of our proposals.

## Session Oral 5 (8/7 Wed. 15:30 – 17:00)

Session Topic: Advanced Circuits and Signal Processing Systems for Biomedical Applications

Session Chair: Shin-Chi Lai (Nanhua University) and Po-Yu Kuo (National Yunlin University of Science & Technology)

Room: 6F 樂廳

1. 15:30 – 15:43 (SA21) A 10-Bits Low-Power SAR ADC for Biomedical Systems

Chi-Chang Lu and Sheng-Yen Lai

National Formosa University

A low-energy and area-efficient switching scheme is proposed to design a low-power successive approximation register analog-to-digital converter (SAR ADC). There are several reasons for the significant reduction in power consumption. First, in the first and second conversions, the switching energy of the capacitive digital-to-analog converter (CDAC) is zero because the voltage difference across the capacitor does not change. Moreover, in the third and subsequent conversions, only one switch changes state, and it switches either from Vref to 1/2 Vref or from 1/2 Vref to ground. Compared with the conventional CDAC, the switching energy of the proposed CDAC is reduced by 97.66% and the total required capacitance is reduced by 75%. The 10-bit low-power SAR ADC has been designed in TSMC 0.18-$\mu$m 1P6M technology. At a 1.2-V supply and a conversion rate of 100 kS/s, with a 12.4-kHz input signal, the dynamic parameters SNDR and SFDR are 60.52 dB and 67.34 dB, respectively. The ENOB is 9.76 bits. The static parameter DNL is between -0.32 LSB and 0.36 LSB, and the INL is between -0.23 LSB and 0.25 LSB.
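
The quoted ENOB follows from the SNDR through the standard relation ENOB = (SNDR - 1.76 dB) / 6.02 dB, which holds for a full-scale sinusoidal input. A minimal check, with a hypothetical function name:

```python
def enob_bits(sndr_db):
    """Effective number of bits from SNDR for a full-scale sine input."""
    return (sndr_db - 1.76) / 6.02

enob = enob_bits(60.52)
print(f"{enob:.2f} bits")   # 9.76 bits
```

For comparison, an ideal 10-bit converter would reach an SNDR of 6.02 × 10 + 1.76 = 61.96 dB under the same relation.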

2. 15:43 – 15:56 (SA22) Multi-functional Cushion and Visual Feedback Rehabilitation Training System

Po-Cheng Su, Ming-Ta Ke, Ya-Hsin Hsueh

National Yunlin University of Science and Technology

In this study, we create a multi-function cushion and visual feedback rehabilitation training system. The system warns the user when the sitting posture is tilted or when the user is slipping down from the chair. In addition, we design a computer game to raise users' interest in rehabilitation training. The system can run on battery power, so it does not depend on a mains supply and is easier to use. The purpose of the study is to help people correct their sitting posture and avoid unhealthy postures that lead to spine tilt. With an appropriate game interface, users can carry out the relevant training, increase their concentration during training, and achieve better rehabilitation results.

3. 15:56 – 16:09 (SA23) Investigations on Impact of Sensing Characteristics for RuO2 Urea Biosensor Affected by Power Noise

Po-Yu Kuo and Ze-Lin Lian

National Yunlin University of Science & Technology

Urea biosensors have been studied and developed for many years, and many researchers have designed high-performance urea biosensors using different sensing films. However, the characteristics of most reported biosensors were measured using simple instrumentation amplifiers. Such a simple measurement system does not consider the noise generated by the power supply. The biosensor signal generally lies below 1 kHz and is easily affected by the power-line frequency component (60 Hz), so this non-ideal effect cannot be ignored. In this paper, the impact of power supply noise on a ruthenium dioxide (RuO2) urea biosensor is analyzed. A new readout circuit is demonstrated in which a notch filter suppresses the 60-Hz power-line frequency component.
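To illustrate how a notch removes a 60-Hz component while leaving the rest of the sub-1-kHz band largely untouched, here is a minimal sketch of a digital second-order (biquad) notch using the widely used audio-EQ-cookbook coefficients. This is only an illustration: the abstract does not say how the notch in the readout circuit is realized, and the sampling rate `fs` and quality factor `q` below are assumed values.

```python
import cmath
import math


def notch_coefficients(f0: float, fs: float, q: float):
    """Biquad notch coefficients (b, a) centered at f0 Hz for sample rate fs Hz."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b = (1.0, -2 * math.cos(w0), 1.0)
    a = (1 + alpha, -2 * math.cos(w0), 1 - alpha)
    return b, a


def gain_at(f: float, fs: float, b, a) -> float:
    """Magnitude of the filter's frequency response at f Hz."""
    z1 = cmath.exp(-2j * math.pi * f / fs)  # z^-1 evaluated on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return abs(num / den)


FS = 1000.0  # assumed sampling rate in Hz
b, a = notch_coefficients(f0=60.0, fs=FS, q=30.0)  # assumed Q

for f in (10.0, 60.0, 200.0):
    print(f, gain_at(f, FS, b, a))
```

The gain is essentially zero at 60 Hz and close to unity at 10 Hz and 200 Hz, which is the behavior the abstract relies on.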

4. 16:09 – 16:22 (SA24) A Prototype Design of EEG Sensing System with a Compact Sliding DFT Algorithm

Shin-Chi Lai(1), S M Salahuddin Morsalin(1), Yu-Syuan Jhang(2), and Ming-Hwa Sheu(2)

(1) Nanhua University and (2) National Yunlin University of Science and Technology

This paper presents an electroencephalography (EEG) signal acquisition system design that incorporates circuit design and spectrum analysis. The data acquisition procedure consists of four stages: 1) the original EEG signal is acquired by an active electrode and an instrumentation amplifier with a small gain; 2) the signal quality is improved by a band-pass filter and a band-stop filter built with IC op-amps; 3) the EEG signals are converted into digital codes by the analog-to-digital converter (ADC) integrated in a micro-controller; 4) a compact sliding discrete Fourier transform is applied to obtain the desired real-time spectrum information. The experimental results show that the system can acquire and store the EEG signals efficiently.
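Stage 4 relies on a sliding DFT, which updates a spectrum bin with each new sample instead of recomputing a full transform. The sketch below implements the textbook single-bin recurrence X_k(n) = (X_k(n−1) − x(n−N) + x(n))·e^{j2πk/N}; it is not the authors' compact variant, whose details are not given in the abstract, and the class and function names are illustrative.

```python
import cmath
from collections import deque


class SlidingDFT:
    """Tracks DFT bin k over the most recent N samples, one update per sample."""

    def __init__(self, n_window: int, k: int):
        self.twiddle = cmath.exp(2j * cmath.pi * k / n_window)
        self.buf = deque([0.0] * n_window, maxlen=n_window)  # window starts as zeros
        self.value = 0j

    def update(self, sample: float) -> complex:
        oldest = self.buf[0]     # sample about to leave the window
        self.buf.append(sample)  # maxlen drops the oldest sample automatically
        self.value = (self.value - oldest + sample) * self.twiddle
        return self.value


def direct_dft_bin(window, k: int) -> complex:
    """Reference: bin k of the DFT computed directly from the window contents."""
    n = len(window)
    return sum(v * cmath.exp(-2j * cmath.pi * k * m / n) for m, v in enumerate(window))


sdft = SlidingDFT(n_window=16, k=3)
for n in range(40):
    latest = sdft.update(float((n * 7) % 5) - 2.0)  # arbitrary test signal

print(abs(latest - direct_dft_bin(list(sdft.buf), 3)))  # difference is at rounding level
```

Each update costs one complex multiplication and two additions per bin, which is why the approach suits real-time spectrum monitoring on a micro-controller.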

5. 16:22 – 16:35 (SA25) Analysis and Discussion of an Analog High-Order Low-Pass Filter Performance Evaluation Techniques

Hsin-Wen Ting(1) and Chi-Yuan Chen(2)

(1) National Kaohsiung University of Science and Technology and (2) ILI Technology Corporation

Analog filters are fundamental building blocks for biomedical applications, and their functionality is directly related to the requirements of bioelectric signal acquisition and noise removal. This paper analyzes and discusses a technique for evaluating an analog high-order low-pass filter. The evaluation procedure is divided into three modes: the first estimates the passband characteristic, the second estimates the stopband characteristic, and the third evaluates the performance of the analog filter by analyzing the relationship between the attenuations in the passband and the stopband. The technique is investigated to quantify the trade-offs between evaluation accuracy and hardware cost, to estimate the probability of having a specification-passed device, and to estimate the evaluation time. The evaluated result is related to a "ratio" rather than a "specific value." The experimental verification demonstrates the functionality, effectiveness, and feasibility of the technique.

6. 16:35 – 16:48 (SA26) A Voltage Compensation Embedded All Digital Temperature Sensors

Jyun-Da Huang, Jhih-Yu Syu, and Po-Hui Yang

National Yunlin University of Science and Technology

In this paper, an all-digital temperature-sensing delay cell that uses negative feedback to compensate for VDD variation is proposed. The delay cell is placed in the differential loop of a ring oscillator, so that temperature changes are quantified through the oscillation frequency. Because it produces a digital temperature code, the sensing system can be easily embedded in a digital system chip to provide absolute temperature monitoring for high-performance system chips. The circuit has been verified in a 0.18-µm CMOS process with a temperature range of 0°C to 100°C and a temperature resolution of 0.1°C. With a supply voltage of 1.8 V ± 10%, the sensitivity to supply variation is as low as 0.016 MHz/mV. Compared with traditional differential-oscillator-based temperature sensors, the voltage sensitivity is improved by up to 90%.

7. 16:48 – 17:01 (S0163) Synthesis of Nondeterministic Behavior in Recombinase-Based Genetic Circuits

Zi-Jun Lin, Wei-Chih Huang, and Jie-Hong Roland Jiang

**National Taiwan University**

Recombinases have been exploited in synthetic biology as a technique to engineer genetic circuits for various application tasks. Prior works mostly studied the construction of combinational or sequential circuits with deterministic behavior. Nevertheless, nondeterminism is ubiquitous in biochemical systems and is an essential resource that enables various biochemical processes, such as cell differentiation and pattern formation. In this work, we study the synthesis of nondeterministic recombinase-based genetic circuits specified by a Boolean relation. We develop methods to create nondeterminism and to synthesize the intended nondeterministic circuit. Experiments evaluate how effectively the synthesis methodology reduces the DNA sequence length of the constructed genetic circuits.

## Session Oral 6 (8/7 Wed. 15:30 – 17:00)

Session Topic: EDA, Testing, and AI

Session Chair: Yu-Guang Chen (Yuan Ze University) and Tong-Yu Hsieh (National Sun Yat-sen University)

Room: 6F 御廳

1. 15:30 – 15:43 (S0189) Clock-less DFT for Dual-rail Asynchronous Circuits (Best Paper Candidates)

Chia-Cheng Pai, Tsai-Chieh Chen, Yi-Zhan Hsieh, and James Chien-Mo Li

**National Taiwan University** 

In this paper, we propose an asynchronous circuit scan (A-scan) latch, which can flip between Valid and Empty so that we can shift in and out without any clock. We also propose circuit models that enable traditional ATPG to generate high-test-coverage patterns for A-scan. Our stuck-at test coverage is 99.64%. This paper provides a DFT and ATPG solution for testing asynchronous circuits.

2. 15:43 – 15:56 (S0041) NCTUcell: A DDA-Aware Cell Library Generator for FinFET Structure with Implicitly Adjustable Grid Map (Best Paper Candidates)

Yih-Lang Li (1), Shih-Ting Lin (1), Shinichi Nishizawa (2), Hong-Yan Su (1), Ming-Jie Fong (1), Oscar Chen (3), and Hidetoshi Onodera (4)

(1) National Chiao Tung University, (2) Saitama University, (3) AnaGlobe Technology, Inc., and (4) Kyoto University

For the 7-nm technology node, cell placement with drain-to-drain abutment (DDA) requires additional filler cells, which increases the placement area. This is the first work to fully automatically synthesize a DDA-aware cell library with an optimized number of drains on the cell boundary, based on the ASAP 7-nm PDK. We propose a DDA-aware transistor placement. Previous works ignore the use of the M0 layer in cell routing; we are the first to propose an ILP-based M0 routing planning. To improve routing resource utilization, we propose an implicitly adjustable grid map that lets the maze router explore more routing solutions. Experimental results show that block placement using the DDA-aware cell library reduces the number of filler cells by 70.9%, which achieves a block area reduction of 5.7%.

3. 15:56 – 16:09 (S0107) AIFood: A Large-scale Food Image Dataset for Ingredient Recognition (Best Paper Candidates)

Gwo Giun (Chris) Lee, Chin-Wei Huang, Jia-Hong Chen, Shih-Yu Chen

National Cheng Kung University

In this paper, we introduce a large-scale food image dataset named AIFood, which is constructed for ingredient recognition in food image research. The AIFood dataset includes 24 categories and a total of 372,095 food images from around the world. We collect food images from eight existing food image datasets and a food website. The food images are relabeled using the 24 categories. We first label each image using existing food information such as the dish name and ingredient information. Next, we manually check the food images to find undiscovered ingredients and relabel them. Every image can be labeled with more than one category. In addition, food images may have color cast or uneven contrast problems, which may degrade the performance of an image recognition system. We therefore apply a preprocessing method that combines automatic white balancing and contrast-limited adaptive histogram equalization to improve the visual quality of the food images. Constraints defined on the luminance and chrominance of an image determine whether the image is to be preprocessed.

4. 16:09 – 16:22 (S0156) ROAD: An Asymmetric Aging Approach for Improving Reliability of Multi-core Systems

Yu-Guang Chen (1), Jian-Ting Ke (2), Shu-Ting Cheng (3), and Ing-Chao Lin (2)

(1) National Central University, (2) National Cheng Kung University, and (3) Yuan Ze University

Multi-core systems have been widely applied in modern computers to obtain stronger computing power and better performance. However, Negative-Bias Temperature Instability (NBTI) has become one of the most serious reliability threats. Previous researchers proposed various task assignment and/or dynamic voltage frequency scaling algorithms to tolerate NBTI by keeping all cores in the multi-core system under similar aging conditions (symmetric aging). We observe that symmetric aging may reduce the lifetime of a multi-core system. If a critical task (i.e., a task with tight timing constraints) arrives when the system has already operated for years, it is possible that none of the equally aged cores can complete the critical task within its timing constraints. This unavoidable timing failure then shortens the lifetime of the system. Based on this observation, this paper proposes a novel reliability improvement framework that realizes the concept of asymmetric aging through task graph Retiming, task Ordering, task Assignment under asymmetric aging, and Dynamic voltage selection (ROAD) for multi-core systems. Experimental results show that our approach can significantly increase the system lifetime with no or insignificant energy overhead.

5. 16:22 – 16:35 (S0209) Morphed Standard Cell Layouts for Pin Length Reduction

Cheng-Wei Tai and Rung-Bin Lin

Yuan Ze University

This article presents a concept called morphed layouts which are layouts of a standard cell with different footprints on the pins of each layout for pin length reduction. The proposed approach can on average reduce total pin length by 12.1% and total wire length by 3.4% without via count increase.

6. 16:35 – 16:48 (S0150) Time-Frame Folding: Back to the Sequentiality

Po-Chun Chien and Jie-Hong Roland Jiang

**National Taiwan University** 

In this paper we formulate time-frame folding (TFF) as the reverse operation of time-frame unfolding (TFU), commonly known as time-frame expansion in automatic test pattern generation (ATPG) and (un)bounded model checking. While the latter converts a sequential circuit into a combinational one with respect to some expansion bound of k time-frames, the former attempts the opposite. TFF arises naturally in the context of testbench generation and bounded strategy generalization, and yet remains unstudied. Unlike TFU, TFF can be highly non-trivial as the sub-circuit of each time-frame can be distinct. We propose an algorithm that finds a minimum-state finite state machine consistent with the input-output behavior of the combinational circuit under folding. Empirical evaluation of our method demonstrates its ability in circuit size compaction and suggests potential use in different application domains.

7. 16:48 – 17:01 (S0085) An Effective Heuristic for 1st- to 2nd-Order Threshold Logic Gate Transformation

Li-Cheng Zheng (1), Yung-Chih Chen (2), and Jing-Yang Jou (1)

(1) National Central University and (2) Yuan Ze University

This paper presents a non-ILP based method for transforming a 1st-order threshold logic gate (TLG) to a 2nd-order TLG with lower hardware implementation cost. The method works by first extracting 2nd-order inputs based on two sufficient conditions, and then optimizing the weights and the threshold value. The experimental results show that the quality of the proposed method is competitive with the ILP-based method and the proposed method is much more efficient.
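As background, a 1st-order TLG outputs 1 when a weighted sum of its inputs reaches the threshold, while a 2nd-order TLG may also weight products of input pairs. The sketch below evaluates both forms and checks that a hand-picked pair of gates computes the same function, x1·x2 + x3. The example gates and the helper `tlg` are illustrative assumptions; they are not produced by the authors' heuristic, and the paper's cost metric is not reproduced here.

```python
from itertools import product


def tlg(inputs, weights, threshold, pair_weights=None):
    """Threshold logic gate: 1 if the weighted sum reaches the threshold, else 0.

    pair_weights maps an index pair (i, j) to the weight of the product x_i * x_j,
    which is what makes the gate 2nd-order.
    """
    total = sum(w * x for w, x in zip(weights, inputs))
    for (i, j), w in (pair_weights or {}).items():
        total += w * inputs[i] * inputs[j]
    return int(total >= threshold)


def first_order(x):
    # x1 + x2 + 2*x3 >= 2
    return tlg(x, (1, 1, 2), 2)


def second_order(x):
    # x1*x2 + x3 >= 1, with smaller weights and threshold
    return tlg(x, (0, 0, 1), 1, {(0, 1): 1})


equivalent = all(first_order(x) == second_order(x) for x in product((0, 1), repeat=3))
print(equivalent)
```

Here the 2nd-order gate needs smaller weights and a smaller threshold for the same function, which hints at why such a transformation can lower implementation cost.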

## Session Oral 7 (8/7 Wed. 15:30 – 17:00)

Session Topic: Neural Network Accelerators

Session Chair: Chi-Chia Sun (National Formosa University) and Pei-Jun Lee (National Chi Nan University)

Room: 6F 茗廳

1. 15:30 – 15:43 (S0208) A bio-potential and piezoelectric sensing system with intelligent real-time computing, compressing and communication (Best Paper Candidates)

Ing-Jer Huang, Hsu-Kang Dow, and Shih-Jung Pao

National Sun Yat-sen University

We present a wearable bio-signal sensing system that monitors electromyography (EMG), electrocardiogram (ECG), vibration, and temperature for elderly care. The system contains analog front-end (AFE) circuits for bio-potential signal sampling, amplification, and digitization, a phase-locked loop (PLL) circuit, a digital signal controller for AFE control and signal calibration, and a 32-bit microprocessor for compression and communication. The prototype is implemented on a small 5 × 5 cm development board, which is then attached to a smart T-shirt and arm bands as a lightweight wearable system. A tablet-based game has been implemented to encourage elderly people to exercise by controlling the game character with their muscles while their muscle fatigue and heart rate variation are monitored.

2. 15:43 – 15:56 (S0177) An Electrocardiogram Classification System with Neural Network Hardware Implementation (Best Paper Candidates)

Yu-Yi Liao, Peng-Wei Huang, and Shuenn-Yuh Lee

National Cheng-Kung University

This paper presents a real-time identification system for electrocardiogram (ECG) classification with a neural network (NN) classifier. The identification flow of the proposed system consists of the following steps: 1) collecting the ECG lead II signal; 2) filtering the original signals with the wavelet transform; 3) calculating twenty feature values and normalizing them; 4) using principal component analysis (PCA) to reduce the number of features; 5) classifying normal beats, premature atrial complexes (PAC), and premature ventricular contractions (PVC) with the classifier. The accuracy of the proposed method is evaluated using different normal and abnormal ECG signals taken from the standard MIT-BIH arrhythmia database. The proposed system is verified in software: the software part is written in Python and tested in MATLAB, and the hardware is implemented as a chip fabricated in TSMC 0.18-µm CMOS technology. All machine learning processors, including preprocessing, feature extraction, and the classifier, are implemented on a chip. The training data and testing data are independent of each other; in other words, a person included in the training data set never appears in the testing data set, so the test is blind. The accuracy of the proposed system is about 95.45% in the software verification, which shows that the proposed architecture is effective for ECG classification.

3. 15:56 – 16:09 (S0072) An Acoustic DSP Processor with CNN-FFT Accelerators for Speech Enhancement (Best Paper Candidates)

Yu-Chi Lee (1), Tai-Shih Chi (2), and Chia-Hsiang Yang (1)

(1) National Taiwan University and (2) National Chiao Tung University

This paper proposes an acoustic DSP processor with a neural network core for speech enhancement. Accelerators for the convolutional neural network (CNN) and the fast Fourier transform (FFT) are embedded, and a CNN-based speech enhancement algorithm is adopted. An array of multiply-accumulate (MAC) and coordinate rotation digital computer (CORDIC) engines is deployed to efficiently compute linear and nonlinear functions. Hardware sharing is applied to reduce hardware area by leveraging the high similarity between CNN and FFT computations. The proposed DSP processor chip is fabricated in a 40-nm CMOS technology with a core area of 4.3 mm^2. The chip's power dissipation is 2.17 mW at an operating frequency of 5 MHz. The speech intelligibility can be enhanced by up to 41% under low-SNR conditions.
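For readers unfamiliar with CORDIC, the sketch below shows the classic rotation-mode iteration, which evaluates sine and cosine using only shift-and-add style updates plus a small table of arctangents. It is a floating-point illustration of the general technique, not the chip's engine: the iteration count and the function name are assumptions, and a hardware version would use fixed-point shifts instead of multiplying by 2^-i.

```python
import math


def cordic_sin_cos(theta: float, iterations: int = 24):
    """Return (sin(theta), cos(theta)) using rotation-mode CORDIC.

    Converges for |theta| up to about 1.74 rad (the sum of the table angles).
    """
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Pre-scale by the accumulated CORDIC gain so the result has unit length.
    gain = 1.0
    for angle in angles:
        gain *= math.cos(angle)

    x, y, z = gain, 0.0, theta
    for i, angle in enumerate(angles):
        direction = 1.0 if z >= 0 else -1.0
        # In hardware, the multiplications by 2**-i are plain right shifts.
        x, y = x - direction * y * 2.0 ** -i, y + direction * x * 2.0 ** -i
        z -= direction * angle
    return y, x


s, c = cordic_sin_cos(0.5)
print(s, math.sin(0.5))
print(c, math.cos(0.5))
```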

4. 16:09 – 16:22 (S0159) Low-Complexity Neural Network-Based Digital Pre-Distortion of 5G Wideband Power Amplifiers with Hybrid Architecture Design

You-Cheng Lu, Ching-Chun Liao, Sin-Sheng Wong, and An-Yeu (Andy) Wu

**National Taiwan University** 

Digital pre-distortion (DPD) is exploited for power amplifier (PA) linearization. Because the modulations employed in 5G communications require ultra-high linearity, the performance of PA linearization based on conventional memory polynomial (MP)-based DPD becomes limited. Moreover, despite its superior linearization performance, the recently proposed deep learning-based DPD is too complex to be implemented efficiently in hardware. In this paper, a low-complexity hybrid neural network (NN)-based DPD for 5G wideband PAs is proposed. The linearization error is jointly compensated by a hybrid architecture that combines the proposed NN compensation model with a coarse-grained MP linearizer. The simulation results show that, with comparable linearization performance, the total number of parameters can be reduced by 80% compared with the state-of-the-art NN model.

5. 16:22 – 16:35 (S0160) Recurrent Neural Network-based Equalizer with Utilization of Coding Gain in Advance

Chieh-Fang Teng, Han-Mo Ou, and An-Yeu (Andy) Wu

**National Taiwan University** 

Recently, deep learning has been exploited in many fields with revolutionary breakthroughs. In light of this, deep learning-assisted communication systems have also attracted much attention in recent years and have the potential to break the conventional design rules for communication systems. In this work, a recurrent neural network-based equalizer is proposed, which not only eliminates channel fading but also exploits the code structure to utilize coding gain in advance. The equalizer in a conventional block-based design may destroy the code structure and degrade the coding gain available to the decoder. In contrast, our proposed approach increases the overall utilization of coding gain, with a gain of more than 1.5 dB.

6. 16:35 – 16:48 (S0118) A New and Efficient SVM Accelerator Design

Jian-Jhang Chen, Jer-Min Jou and Ming-Han Shieh

National Cheng Kung University

Support vector machines (SVMs) are widely used in various artificial intelligence (AI) applications. Because AI applications have high computational complexity and real-time requirements, it is critical to speed up the SVM operation efficiently. Most of the SVM computation lies in the kernel functions, which dominate the overall SVM speed and need to be implemented with special hardware. In this paper, we design a new SVM hardware accelerator that efficiently speeds up the calculation of kernel functions by changing the form of the decision function and by tiling its loops. We also design a new, efficient fixed-width multiplier with very low errors for use in this SVM accelerator. As a result, our SVM accelerator has a significantly improved detection speed compared with others, and the fixed-width multiplier has lower errors than other approximate multipliers.

7. 16:48 – 17:01 (S0105) CNN Training Acceleration Solution Based on the FloatSD Technique and Its Systolic Array FPGA Implementation

Mu-Kai Sun, Chu-King Kung, and Tzi-Dar Chiueh

**National Taiwan University** 

In this paper, we propose a CNN training acceleration system design based on an FPGA implementation of the floating-point signed digit (FloatSD) number representation and update method [1]. The FloatSD technique exploits the imprecision-tolerant characteristic of neural network training and adopts only a couple of non-zero digits in a neural network weight, reducing the convolution multiplication to the addition of two shifted partial products. Furthermore, the mantissa field and the exponent field of neuron activations and gradients during training are also quantized. In addition, we describe in detail how the overall system operates through cooperation between software and acceleration hardware in the FPGA. Finally, we present the design of a systolic-array-based PE Cube for executing convolution based on FloatSD arithmetic. The measured power consumption results indicate that the FloatSD MAC is more than 20 times more energy efficient than its FP32 MAC counterpart.
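The claim that a weight with only two non-zero signed digits turns a multiplication into the addition of two shifted partial products can be illustrated with plain integers. The sketch below is a simplified stand-in for the idea, assuming a weight written as a sum of signed powers of two; it does not reproduce the actual FloatSD format (exponent field, digit grouping, or update rule), and the function name is illustrative.

```python
def shift_add_multiply(x: int, digits) -> int:
    """Multiply x by a weight given as signed power-of-two digits.

    digits is a list of (sign, shift) pairs, so the weight equals
    sum(sign * 2**shift). Each term is a shifted copy of x, so no
    general multiplier is needed.
    """
    return sum(sign * (x << shift) for sign, shift in digits)


# Weight 7 expressed with two non-zero signed digits: 2**3 - 2**0.
weight_digits = [(+1, 3), (-1, 0)]
print(shift_add_multiply(13, weight_digits))  # 13 * 7 = 91
```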

## Session Oral 8 (8/7 Wed. 15:30 – 17:00)

Session Topic: VLSI Design for Biomedical Applications

Session Chair: Yuan-Ho Chen (Chang Gung University) and Shih-Lun Chen (Chung Yuan Christian University)

Room: 7F 論語廳+大學廳

1. 15:30 – 15:43 (SC21) VLSI Chip Design for Wireless Body Sensor Network

Shih-Lun Chen

**Chung Yuan Christian University** 

Nowadays, applications of wireless body sensor networks (WBSNs), the Internet of Things (IoT), and wearable devices are becoming increasingly widespread. These applications provide an effective solution for sustained monitoring, mobile health, self-health management, and biological analysis in home-care systems. Because wearable and portable applications demand light weight, VLSI implementation has become a significant trend. For this purpose, we propose a VLSI chip design that includes an asynchronous interface, a register bank, a reconfigurable filter, a lossless data encoder, an encryption encoder, an error-correction coding encoder, a power management unit, a resolution controller, a multi-sensor controller, and a QRS complex detector. The proposed VLSI chip design was synthesized in a TSMC 0.18-µm CMOS process and can operate at a 100-MHz processing rate. Compared with previous designs, this design achieves higher performance, higher security, higher reliability, more functions, more flexibility, higher compatibility, and lower cost.

2. 15:43 – 15:56 (SC22) Hardware/Software Codesign for Portable Optical Coherence Tomography (OCT) Applications

Song-Nien Tang, Fu-Chue Tsai, and Yu-Ci Li

Chung Yuan Christian University

This paper presents a hardware-software codesign scheme for the image formation of the Fourier-domain optical coherence tomography (FDOCT) system. Using a hardware processor, the fast Fourier transform (FFT) together with its front-end DC noise removal and re-sampling operations can be efficiently performed. In cooperation with the hardware unit, a software platform can properly execute the flexible magnitude compression and dynamic range mapping which are associated with the OCT image display. The proposed design could be developed based on a small-scale system, which lends support to the trend of portable FDOCT applications. System-level design verification was performed using an FPGA module and a mobile phone to evaluate the efficacy of the proposed hardware/software codesign scheme. By accessing the raw data through a pattern generator, the 16.2-fps OCT image display (in the 1024x1000 resolution) can be achieved in the developed FPGA-phone verification system.

3. 15:56 – 16:09 (SC23) A Fourier-Domain Optical Coherence Tomography (FDOCT) Imaging System Using the GPU of Raspberry PI for Portable OCT Applications

Song-Nien Tang, Fu-Chue Tsai, and Yu-Ci Li

**Chung Yuan Christian University** 

Recently, optical coherence tomography (OCT) technology, which is based on the interference principle, has been widely applied to medical image inspection. Fourier-domain OCT (FDOCT) is currently the mainstream modality. In this project, we present an FDOCT image formation system based on the Raspberry PI platform that is capable of performing all FDOCT imaging operations, including DC removal, re-sampling, real-valued fast Fourier transform (RFFT), magnitude compression, and display processing. Moreover, to improve the OCT imaging rate, the RFFT computation is accelerated using the graphics processing unit (GPU) of the Raspberry PI. High-throughput RFFT calculation is achieved through sixteen-path parallel processing combined with an operation-efficient RFFT algorithm. In system measurements, a frame of OCT image with a resolution of 1024 (axial) by 1000 (lateral) pixels can be displayed in 2.7 seconds. Also, the OCT imaging rate with GPU acceleration was 5 times as fast as that of purely CPU-based imaging operations.

4. 16:09 – 16:22 (SC24) VLSI Implementation of the Integral Pulse Frequency Modulation Model for Heart Rate Variability System

Yen Juan, Shung-Ping Wang, and Yuan-Ho Chen

**Chang Gung University** 

Heart rate variability (HRV) can be used to assess autonomic control activities. In the HRV analysis method, an integral pulse frequency modulation (IPFM) model functions as a pacemaker and generates a series of heartbeats. In this study, the IPFM model was implemented in a VLSI chip, and the activity spectrum of the autonomic nervous system was estimated using a compressed sensing (CS) method. The chip is designed in the TSMC 180-nm CMOS process. In this model, a look-up table method is used to calculate the sine/cosine operations and many nonlinear operations to achieve a low-cost design. In the matrix operations, we use a multiplexer to control the signal so that the CS algorithm can be easily applied. The results show that the proposed chip has a gate count of 10.2 k at a 62.5-MHz operating frequency. The VLSI implementation of the CS method allows the HRV spectrum to be estimated effectively.
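In the usual formulation of the IPFM model, the quantity 1 + m(t) is integrated over time, and a heartbeat is emitted (and the integrator reset) each time the integral reaches a threshold, where m(t) represents the autonomic modulation. The sketch below simulates that behavior with simple Euler integration. The threshold, time step, and modulation signal are assumed values chosen for illustration and are not taken from the paper.

```python
import math


def ipfm_beats(modulation, threshold=1.0, duration=5.5, dt=0.001):
    """Return beat times (s) from an integral pulse frequency modulation model.

    modulation is a function m(t); a beat fires whenever the running
    integral of 1 + m(t) reaches the threshold.
    """
    beats = []
    integral = 0.0
    for n in range(round(duration / dt)):
        t = n * dt
        integral += (1.0 + modulation(t)) * dt
        if integral >= threshold:
            beats.append(t + dt)
            integral -= threshold  # keep the overshoot for the next interval
    return beats


# No modulation: beats are evenly spaced, one per threshold interval.
steady = ipfm_beats(lambda t: 0.0)

# Slow sinusoidal modulation (assumed 0.25 Hz): beat intervals now vary.
modulated = ipfm_beats(lambda t: 0.2 * math.sin(2 * math.pi * 0.25 * t))

print(steady)
print(modulated)
```

The variation in the intervals of the second sequence is the heart rate variability whose spectrum the chip estimates.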

5. 16:22 – 16:35 (SC25) A VLSI Implementation of Low Cost Independent Component Analysis (ICA) for Biomedical Signal Separation

Shung-Ping Wang, Yen Juan, and Yuan-Ho Chen

**Chang Gung University** 

Independent component analysis (ICA) is a recently developed algorithm for blind source separation (BSS). The ICA algorithm can directly separate a number of mixed signals without any information about the mixing process or the source signals. It is suitable for digital signal processing, particularly for dealing with biomedical signals. In this study, we develop a hardware implementation of the extended infomax ICA algorithm for the separation of super-Gaussian signal sources using integrated circuitry (IC). To reduce circuit area and achieve low cost, our proposed design is based on systolic array multiplication, which removes many multiplications from the circuit. We also use a lookup table to replace the complicated calculation of the hyperbolic function $\tanh\theta$. When implemented in the TSMC 0.18-$\mu$m CMOS process, the proposed ICA circuit achieves an operating frequency of 50 MHz with a gate count of 47 k. According to our simulation results, the architecture is applicable to the separation of mixed medical signals into independent sources.
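Replacing the tanh calculation with a lookup table trades a small approximation error for much simpler hardware. The sketch below shows one way such a table could work: precompute tanh on a uniform grid, saturate outside the grid, and return the nearest entry. The table size and input range are assumptions for illustration; the abstract does not state the table parameters used in the chip.

```python
import math

LUT_SIZE = 256  # assumed number of table entries
X_MAX = 4.0     # assumed input range [-X_MAX, X_MAX]; tanh is nearly flat beyond it
STEP = 2 * X_MAX / (LUT_SIZE - 1)
TANH_LUT = [math.tanh(-X_MAX + i * STEP) for i in range(LUT_SIZE)]


def tanh_lut(x: float) -> float:
    """Approximate tanh(x) with a nearest-entry table lookup and saturation."""
    if x <= -X_MAX:
        return TANH_LUT[0]
    if x >= X_MAX:
        return TANH_LUT[-1]
    index = round((x + X_MAX) / STEP)
    return TANH_LUT[index]


# Worst-case error over a fine grid of inputs.
worst = max(abs(tanh_lut(i / 100) - math.tanh(i / 100)) for i in range(-600, 601))
print(worst)
```

Because the slope of tanh never exceeds 1, the nearest-entry error is bounded by half the grid step, so the table size directly sets the accuracy.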

6. 16:35 – 16:48 (S0108) Variation-Resilient Design Techniques for Energy-Constrained Systems

Bing-Chen Wu and Tsung-Te Liu

National Taiwan University

Process, voltage, and temperature (PVT) variations substantially increase the variability of digital CMOS logic and reduce operation robustness, especially for energy-constrained systems with aggressive voltage scaling. This paper reviews several variation-resilient design techniques that address PVT variations to improve the energy efficiency of digital CMOS VLSI circuits. The scope includes static and adaptive design techniques for design-time and run-time optimization, respectively. In addition, an emerging adaptive design strategy that incorporates a fully integrated voltage regulator for system-level optimization is also introduced.

7. 16:48 – 17:01 (S0212) Design of the Compiler for a Reconfigurable Accelerator for Edge Computing with Binarized Convolutional Neural Networks

Ching-Zong Chang, Chi-Jhe Li, and Hsin-Chou Chi

National Dong Hwa University

The recent rapid advance of deep learning and Internet of Things (IoT) technology has triggered many applications. For these applications, one of the keys is an optimized convolutional neural network (CNN). In addition, high-performance hardware is critical for supporting the required massive computation. In this paper, we propose a versatile compiler for an FPGA accelerator with binarized CNNs. The compiler accepts a high-level description of the CNN and automatically allocates the FPGA circuits efficiently based on the description. An abstract API layer is also provided, so that deep learning engineers do not need knowledge of FPGAs. We have implemented the accelerator with our compiler. The performance evaluation results show that our system significantly outperforms popular CPU and GPU solutions with good accuracy.

## Session Oral 9 (8/8 Thu. 11:30 – 12:42)

Session Topic: Data Converter, RF/mm-Wave Circuits, and High-Speed Building Blocks

Session Chair: Wei-Bin Yang (Tamkang University) and Chun-Hsing Li (National Tsing Hua University)

Room: 6F 樂廳

1. 11:30 – 11:42 (S0204) Monolithic CMOS Microwave Heater with Programmable Thermostat Function

Tzu-Yu Tseng, Hsiao-Chin Chen and Jenq-Shiou Leu

National Taiwan University of Science and Technology

A monolithic CMOS microwave heater with a programmable thermostat function is implemented for thermotherapy. To achieve microwave heating, the heater adopts a 2.4-GHz oscillator to generate the microwave and uses an amplifier with an LC resonance load to enhance the signal strength. The load inductor is employed as the heat applicator because a large electric field is created around it. The microwave heater is then integrated with temperature sensors, an 11-bit SAR ADC, and a comparator to achieve the thermostat function. Dissipating 145 mW, the amplifier delivers an output power of 13.2 dBm. The temperature sensors achieve a sensing range from 20 °C to 50 °C with a sensitivity of 47 mV/°C. When 427 mg of agar phantom is placed above the heater and the thermostat function is activated, the phantom can be heated by 1 °C with 0.1 °C accuracy.

2. 11:42 – 11:54 (S0146) Device Area Allocation for Yield Optimization in Integrated Circuits

Poki Chen and Ahmad Shahid Bhatti

National Taiwan University of Science and Technology

Very few papers focus on integrated circuit layout rather than design. Most of them are devoted to generating layout patterns that cancel the error caused by systematic mismatch. Even fewer deal with random mismatch, especially by allocating proper areas to critical devices for yield optimization. In this paper, area allocation strategies for yield optimization are proposed for some important analog circuits. To demonstrate the performance, both full-coverage simulations and theoretical analyses are presented. A test chip of the most representative circuit, an R-2R ladder network, has been realized in a TSMC 0.35-$\mu$m standard CMOS process to verify the proposed area allocation strategy. To further ease the burden on analog IC designers and layout engineers, a rule of thumb based on device weights, which gives area allocation with at least close-to-optimum yield, is also presented.

3. 11:54 – 12:06 (S0122) A 14-bit Low Power 2-MS/s SAR ADC with Residue Oversampling

Sheng-Wen Huang and Soon-Jyh Chang

National Cheng Kung University

This paper presents a 14-bit successive-approximation register (SAR) analog-to-digital converter (ADC), which adopts Residue Oversampling and Detect-and-Skip (DAS) techniques to meet high-resolution and low-power requirements. The proof-of-concept prototype was fabricated in TSMC 40-nm CMOS technology. At a 2-MS/s sampling rate, the measured peak SNDR is 64.94 dB without calibration. With a 0.03 standard deviation at each unit capacitor, static performance shows that INL and DNL are +1.25/-0.80 and +1.05/-0.97, respectively.
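To relate the measured SNDR to the nominal 14-bit resolution, the standard ideal-quantizer relation SNDR = 6.02·N + 1.76 dB (for a full-scale sine input) can be inverted to give an effective number of bits. This is a generic textbook conversion, not a figure reported in the abstract; the function name is illustrative.

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits from SNDR in dB, via SNDR = 6.02 * N + 1.76."""
    return (sndr_db - 1.76) / 6.02


# Peak SNDR quoted in the abstract (measured without calibration).
print(f"ENOB ~ {enob_from_sndr(64.94):.2f} bits")
```

For the quoted 64.94 dB this gives about 10.5 effective bits.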

4. 12:06 – 12:18 (S0060) A 10-Gb/s Equalizer with Digital Adaptation

Jui-Cheng Hsiao, Dai-En Jhou, Hsiu Hsien Ting, and Tai-Cheng Lee

National Taiwan University

An equalizer using a digital adaptive algorithm is proposed to minimize hardware cost. The proposed algorithm uses two analog reference levels to detect the low-frequency and high-frequency components of the input amplitude, respectively. By monitoring the two reference levels, the proposed equalizer can tune its high-frequency gain to compensate for the channel loss appropriately. This work has been fabricated in a 40-nm process, and the equalizer core circuit occupies 0.014 mm$^2$ and consumes 10 mW from a 1-V supply.

5. 12:18 – 12:30 (S0003) A Flip-Chip-Assembled W-Band Receiver in 90-nm CMOS and IPD Technologies

Te-Yen Chiu, Wan-Ting Hsieh, and Chun-Hsing Li

National Tsing Hua University

A flip-chip-assembled W-band receiver composed of a 90-nm CMOS chip and an integrated-passive-device (IPD) carrier is presented in this work. The chip, which integrates a low-noise amplifier (LNA), a single-sideband mixer, a frequency doubler (FD), and a wide-band variable-gain amplifier (VGA), is flip-chip packaged to the IPD carrier through a low-loss interconnect. Experimental results show that the proposed packaged receiver provides a variable gain from 11.3 to 48.2 dB with an input 1-dB compression point from -43.7 to -29 dBm at an RF frequency of 90 GHz. The IF bandwidth and minimum noise figure are 1.0 GHz and 7.8 dB, respectively. The proposed receiver consumes only 73.9 mW from a 1.2-V supply. To the best of the authors' knowledge, this is the first W-band CMOS receiver assembled on an IPD carrier reported thus far.

6. 12:30 – 12:42 (S0063) An S-Band CMOS Mixer-First Single-RF-Port Duplexing FMCW Radar

Hao-Chung Chou (1), Chun-Chieh Peng (1), Yu-Jiu Wang (2), and Ta-Shun Chu (1)

(1) National Tsing Hua University and (2) Tron Future Tech Inc.

A mixer-first single-RF-port duplexing RF frontend is proposed and implemented for frequency-modulated continuous-wave (FMCW) radar applications in this paper. The RF frontend is a bidirectional simultaneous frequency up-and-down converter. Equations for the basic parameters of the frontend are derived to provide design criteria. The proposed radar architecture has been evaluated with an S-band (3.3-3.6 GHz) FMCW radar. The radar chip is fabricated in a 65-nm CMOS process, and it consumes 190 mW of DC power under a 1.2-V supply. A wireless distance measurement has verified the function of the radar chip.

## Session Oral 10 (8/8 Thu. 11:30 – 12:30)

Session Topic: Emerging Trends of Intelligent Healthcare

Session Chair: Shanq-Jang Ruan (National Taiwan University of Science and Technology)

Room: 6F 御廳

1. 11:30 – 11:42 (S0095) A Low-Power Multi-mode Sensing Device for Sleep Apnea Homecare

Po-Cheng Hsu, Yu-Chiang Cheng, and Hsi-Pin Ma

**National Tsing Hua University** 

In this paper, we propose a low-power wearable device that supports various operation modes for sleep apnea homecare. The proposed device measures temperature, humidity, ECG, respiration, and body posture. The body posture part provides two modes, ECG-3axis mode and ECG-9axis mode, which let the user choose between measuring a 3-axis signal and a 9-axis signal. For data storage, two versions are available: the Bluetooth version and the NAND Flash version. In addition, we detect the quality of the ECG signal to ensure the correctness of the stored ECG data. If the ECG signal quality is poor, we turn off the sensor to save power and alert the user. The most power-consuming mode of this sensor is the ECG-9axis mode in the Bluetooth version, yet its average current consumption is only 7.5523 mA. With a 300-mAh battery, the device lifetime is up to 39.72 hours.
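The quoted lifetime follows from dividing the battery capacity by the average current. The sketch below reproduces that arithmetic under the idealized assumption that the full rated capacity is usable (no derating, converter loss, or self-discharge); the function name is illustrative.

```python
def battery_life_hours(capacity_mah: float, avg_current_ma: float) -> float:
    """Idealized runtime: rated capacity (mAh) divided by average current (mA)."""
    return capacity_mah / avg_current_ma


# Worst-case mode from the abstract: ECG-9axis mode, Bluetooth version.
print(f"{battery_life_hours(300.0, 7.5523):.2f} h")
```

With 300 mAh and 7.5523 mA this gives about 39.72 hours, matching the figure in the abstract.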

2. 11:42 – 11:54 (S0083) Sample Preparation for Reactant Minimization on Digital Microfluidic Biochips under Timing Constraints

Ling-Yen Song, Yu-Ying Li, Yung-Chun Lei, and Juinn-Dar Huang

National Chiao Tung University

Sample preparation is one of the essential processes for most biochemical assays on biochips. Many studies have addressed the reactant minimization problem during sample preparation. Nevertheless, those approaches minimize reactant consumption at the cost of extra operations, which may lead to reactant deterioration and even wrong results. In this paper, we propose a time-constrained sample preparation algorithm for reactant minimization on digital microfluidic biochips (DMFBs). The experimental results show that our algorithm achieves a 33% reactant reduction over a delay-optimal method with the same operation time. Meanwhile, the proposed method reduces the operation count by 5% compared with a state-of-the-art reactant minimization algorithm under the same reactant consumption.

3. 11:54 – 12:06 (S0178) Stride Count and Walking Distance Measurement via Knee Angle Calculation

Teng-Chia Wang (1), Yan-Ping Chang (1), Chun-Jui Chen (1), Chia-Chun Lin (1), Yung-Chih Chen (2), and Chun-Yao Wang (1)

(1) National Tsing Hua University and (2) Yuan Ze University

To calculate the knee angle, stride count, and walking distance, we propose a system, iKneePad, that fuses two Bluetooth-enabled 9-axis sensors mounted on the thigh and shank segments. The rates of change of the hip and knee angles are used to determine the beginning and the end of a stride. The thigh length, shank length, hip angle, and knee angle are used to calculate the walking distance. The experimental results show that the accuracy of the stride count is 100%, and the absolute mean errors of the knee angle are 2.99° and 1.42° for the maximum and minimum flexion angles, respectively. For walking distance, the mean error rates are -2.40% and -2.26% for short (10 m) and long (33 m) distances, respectively. The proposed system also provides instant feedback to users on an Android smartphone during rehabilitation or exercise with iKneePad.

4. 12:06 – 12:18 (S0213) Wearable Parkinson's Disease Finger Tapping Quantitative Evaluation Chip Design Combined with Impedance and Accelerometer Sensing

Yu-Chuan Lu (1), Zhi-Xiong Feng (1), and I-Chyn Wey (1,2)

(1) Chang Gung University and (2) Chang Gung Memorial Hospital

In this paper, we propose using accelerometers and impedance measurements to realize a wearable sensing chip design that quantitatively evaluates finger tapping in patients with Parkinson's disease (PD). The accelerometer is used to measure and evaluate large motions, and the impedance measurement captures the fine detail of finger tapping. By evaluating 10 subjects during normal finger tapping and comparing the results with simulated PD shaking, the proposed approach processes the sensing signals and computes the characteristic signals in the time domain, which is more time-, power-, and hardware-efficient. In this way, we can accurately distinguish the symptoms of finger fibrillation in patients with PD while meeting the demands of a wearable device.

5. 12:18 – 12:30 (S0197) Fast Remaining Useful Life Estimation by Using K-means-based Data Labeling Mechanism and Non-time Related Artificial Neural Network

Kun-Chih (Jimmy) Chen and Geng-Ming (Kevin) Liu

National Sun Yat-sen University

Health prognostics benefits industry by enabling more efficient machinery maintenance and has been widely discussed since the fourth industrial revolution. However, due to unpredictable human error and uncontrollable chain reactions during machine operation, it is difficult to estimate the remaining useful life (RUL) of machinery with a lightweight estimation method, which leads to high computing power and cost. To address the tradeoff between computing cost and the efficiency of RUL estimation, we propose an artificial neural network (ANN) accompanied by a K-means-based labeling algorithm to construct feasible targets (i.e., health prognostic) for learning. Compared with conventional approaches, the proposed method does not need historical data and achieves similar RUL estimation results with much lower computational complexity.

## Session Oral 11 (8/8 Thu. 11:30 – 12:30)

Session Topic: Memory/Computing Cooperation & Optimization

Session Chair: Hsie-Chia Chang (National Chiao Tung University)

Room: 6F 茗廳

1. 11:30 – 11:42 (SC11) Memory-contention Aware Warp Scheduler for Computing GPU

Chien-Ming Chiu, Kuan-Chung Chen, Jhi-Han Jheng, Kuan-Lin Huang, Tsung-Han Tsou, Feng-Ming Hsu, Juin-Ming Lu, and Chung-Ho Chen

**National Cheng Kung University** 

We first introduce a computing GPU that supports both the OpenCL and TensorFlow frameworks. The proposed GPU targets deployment in edge AI computing devices. To address the memory contention problem, a serious performance bottleneck in resource-limited GPUs, we propose a Memory Contention Aware Warp Scheduler (MAWS) to strike a dynamic balance between the memory workload requirements and the given memory resources. By measuring the Load/Store Unit (LSU) stall ratio in a sampling interval and accurately monitoring the variations in memory contention, MAWS finds a suitable warp concurrency that fits the limited memory resources well and, as a result, significantly improves the effective throughput.

2. 11:42 – 11:54 (SC12) Establishing Cooperation Between Camera Applications and Flash-Based Storage to Improve JPEG File Reliability

Yu-Chun Kuo, Chia-Yu Hu, Ruei-Fong Chiu, and Ren-Shuo Liu

National Tsing Hua University

NAND flash-based storage, such as SD cards and eMMC chips, is the most widely used medium for camera applications. In this work, we propose to establish cooperation between camera applications and flash-based storage to improve the reliability of JPEG files stored in the storage. We conduct real-system experiments by storing JPEG files on flash chips to evaluate the benefits of our proposed techniques. Experimental results demonstrate that the reliability of JPEG files can be significantly enhanced.

3. 11:54 – 12:06 (SC13) Considerations of Integrating Computing-In-Memory and Processing-In-Sensor into Convolutional Neural Network Accelerators for Low-Power Edge Devices

Kea-Tiong Tang (1), Wei-Chen Wei (1), Zuo-Wei Yeh (1), Tzu-Hsiang Hsu (1), Yen-Cheng Chiu (1), Cheng-Xin Xue (1), Yu-Chun Kuo (1), Tai-Hsing Wen (1), Mon-Shu Ho (2), Chung-Chuan Lo (1), Ren-Shuo Liu (1), Chih-Cheng Hsieh (1), and Meng-Fan Chang (1)

(1) National Tsing Hua University and (2) National Chung Hsing University

In the quest to explore emerging deep learning algorithms on edge devices, developing low-power and low-latency deep learning accelerators (DLAs) has become a top priority. To achieve this goal, data processing techniques in the sensor and memory that utilize the array structure have drawn much attention. Processing-in-sensor (PIS) solutions could reduce data transfer, and computing-in-memory (CIM) macros could reduce memory access and intermediate data movement. We propose a new architecture that integrates PIS and CIM to realize a low-power DLA. The advantages of using these techniques and the challenges from a system point of view are discussed.

4. 12:06 – 12:18 (SC14) STT-MRAM for Deep Convolutional Neural Network Acceleration

Chih-Cheng Chang, Chun-Hsien Li, Tian-Sheuan Chang, and Tuo-Hung Hou

National Chiao Tung University

Binary STT-MRAM is a highly anticipated embedded non-volatile memory technology in advanced logic nodes (< 28 nm). Enabling its in-memory computing (IMC) capability is critical for enhancing edge AI. Based on the soon-to-be-available STT-MRAM, we report the first binary deep convolutional neural network capable of both local and remote learning. Exploiting the intrinsic cumulative switching probability, accurate online training on CIFAR-10 color images (~90%) is realized using a relaxed endurance spec (switching 20 times) and a hybrid digital/IMC design. For offline training, the accuracy loss due to imprecise weight placement can be mitigated using a rapid non-iterative training-with-noise and fine-tuning scheme.

5. 12:18 – 12:30 (SC15) Efficient Design of Multiple Writes for Algorithmic Multi-ported Memory

Bo-Ya Chen, Bo-En Chen, and Bo-Cheng Lai

National Chiao Tung University

This paper proposes REMAP+, a novel design that enables an efficient write scheme for algorithmic multi-ported memory and attains better performance with smaller area. REMAP+ applies the banking structure of memory design and implements the remap table with SRAM cells instead of costly registers. In the remap table, REMAP+ keeps only the most significant bit of write addresses to utilize the table space more efficiently. The hash write controller is simplified with the first-fit algorithm to handle write conflicts with shorter latency. REMAP+ is implemented in a pipeline scheme to further increase the processing throughput. For a 3W1R memory with 16K depth, REMAP+ attains 22% shorter access latency and 31.3% smaller area compared with the previous design.

## Session Oral 12 (8/8 Thu. 11:30 – 12:30)

Session Topic: Design and Optimization for Storage, Memory, and System

Session Chair: Chien-Chung Ho (National Chung Cheng University)

Room: 7F 論語廳+大學廳

1. 11:30 – 11:42 (SD11) On Enhancing Lifetime and Performance of Fine-tuning Neural Networks Through an Overhead-Reduced Design on NVM-based System

Szu-Yu Chen and Chien-Chung Ho

**National Chung Cheng University** 

The convolutional neural network (CNN), a class of neural network, has become one of the dominant applications in the computer vision field. Due to the need to reduce training time and improve training performance, fine-tuning is widely adopted to avoid the time-consuming procedure of training a neural network from scratch. Because of the continuously growing data/model size and the large number of CNN techniques supported on neural network systems, the DRAM size must increase significantly. However, it is impractical to scale up the DRAM size on the neural network system, since DRAM incurs issues such as scaling limitations and leakage power. This work aims at resolving the issues caused by running large-scale CNN applications on DRAM. To explore a cost-efficient solution for large-scale CNNs without using DRAM, this work proposes to exploit non-volatile memory (NVM) as main memory because of its high scalability, low read latency, and near-zero leakage power. However, the inherent properties of NVM, such as longer write latency and worse endurance, can significantly degrade the performance of CNNs and reduce the lifetime of NVM. To improve the performance and lifetime of NVM-based CNN systems, this work proposes a split-FFT approach to reduce the number of write operations on NVM while fine-tuning neural networks. In addition, this work proposes a writing strategy based on the characteristics of fine-tuning neural networks with the proposed split-FFT approach. To examine the effectiveness of the proposed approaches, a series of experiments was conducted. The experimental results show that the proposed approaches successfully even out the bit flips and enhance the lifetime of NVM. More specifically, compared to the conventional FFT convolution approach, a 1.95x performance improvement and a 50% reduction in bit flips were achieved.

2. 11:42 – 11:54 (SD12) Design a Fault Tolerant System Using System-Level Redundancy

Sih-Kai Shen and Peng-Sheng Chen

National Chung Cheng University

In this paper, we design a fault-tolerant system using system-level redundancy techniques. The whole structure consists of a primary system connected to a redundant system using network socket programming APIs. A heartbeat mechanism checks whether the primary system is alive. GlusterFS, an open-source distributed file system, aggregates the disk storage resources to provide dependable storage. In addition, distributed multithreaded checkpointing (DMTCP) stores the execution states of the application to allow resumption after a failure. A GUI tool is also developed to assist users in building the proposed fault-tolerant system. Preliminary experimental results for benchmarking and test situations show that the proposed approach can improve fault tolerance and allow the operation to resume after failure.
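The heartbeat idea can be illustrated with a small timeout monitor that the redundant system might run: each message from the primary resets a timer, and the primary is presumed dead once no message arrives within a timeout. This is a minimal sketch, not the authors' implementation; the class name, the timeout value, and the injectable clock are assumptions, and the socket transport is omitted.

```python
import time


class HeartbeatMonitor:
    """Tracks the most recent heartbeat and reports whether the peer looks alive."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s       # hypothetical failure-detection window
        self._clock = clock              # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self) -> None:
        """Call whenever a heartbeat message arrives from the primary."""
        self._last_beat = self._clock()

    def primary_alive(self) -> bool:
        """True while the last heartbeat is no older than the timeout."""
        return (self._clock() - self._last_beat) <= self.timeout_s


# Usage with a fake clock: the primary goes silent after t = 1.5 s.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=2.0, clock=lambda: now[0])
now[0] = 1.5
monitor.beat()
now[0] = 4.0
print(monitor.primary_alive())  # silent for 2.5 s, longer than the 2.0 s timeout
```

In a real deployment, the redundant system would start the takeover (for example, restarting the application from its latest checkpoint) once `primary_alive()` turns false.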

3. 11:54 – 12:06 (SD13) Pthread's Spinlock is Unfair

Shi-Wu Lo

National Chung Cheng University

Pthread is a standard library defined by POSIX. Most operating systems, including BSD, Linux, Solaris, HPUX, and Windows, use pthreads as their thread library. Even when a higher-level programming language defines an object-oriented or function-based thread library, the underlying layer uses Pthread. For example, Java, Android, and OpenMP use Pthread to implement their multi-thread libraries. In our research, we found that the implementation of the spinlock in GNU's Pthread library is unfair on many-core architectures: a few cores are several times more likely to acquire the lock than other cores. We reimplemented the lock and unlock functions of Pthread's spinlock library using an algorithm called ticket-lock. The performance of the new spinlock drops by 10%, but it guarantees fairness.
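The fairness property of a ticket lock comes from serving waiters strictly in the order they arrived. The sketch below shows that ordering rule in Python; it is not the authors' reimplementation and not a spinlock. A real ticket spinlock would draw tickets with an atomic fetch-and-add and busy-wait, whereas this illustration uses a condition variable. The class and helper names are hypothetical.

```python
import threading
import time


class TicketLock:
    """FIFO lock: each caller draws a ticket and waits until it is served."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0   # next ticket number to hand out
        self._now_serving = 0   # ticket currently allowed to hold the lock

    def acquire(self) -> int:
        with self._cond:
            my_ticket = self._next_ticket
            self._next_ticket += 1
            while self._now_serving != my_ticket:
                self._cond.wait()
        return my_ticket

    def release(self) -> None:
        with self._cond:
            self._now_serving += 1      # hand the lock to the next ticket holder
            self._cond.notify_all()

    def tickets_issued(self) -> int:
        with self._cond:
            return self._next_ticket


def fifo_demo(n_threads: int = 3) -> list:
    """Queue workers one by one behind a held lock; return the order they got it."""
    lock = TicketLock()
    order = []

    def worker(i):
        lock.acquire()
        order.append(i)
        lock.release()

    lock.acquire()  # main thread holds ticket 0
    threads = []
    for i in range(n_threads):
        t = threading.Thread(target=worker, args=(i,))
        t.start()
        # Wait until worker i has drawn its ticket before starting the next one.
        while lock.tickets_issued() < i + 2:
            time.sleep(0.001)
        threads.append(t)
    lock.release()
    for t in threads:
        t.join()
    return order
```

Calling `fifo_demo(3)` returns the workers in arrival order, because no later arrival can overtake an earlier ticket. A plain test-and-set spinlock gives no such guarantee.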

4. 12:06 – 12:18 (SD14) Real-World Anomaly Detection in Videos Using Spatio-Temporal Autoencoders

Po-Ju Lin and Pao-Ann Hsiung

National Chung Cheng University

Surveillance videos capture a variety of realistic anomalies, which are challenging to detect due to the fuzzy definition of anomalous behavior and complex monitoring scenarios. This article proposes a novel spatio-temporal autoencoder network using 3DCNN and ConvLSTM to learn the characteristics of video anomalies. Experimental results show that this method can detect anomalies in the video with at least 2.4% improvement in AUC accuracy compared to the state-of-the-art ConvLSTM.

5. 12:18 – 12:30 (SD15) A Novel Approach for Story Generation

Wei Lin, Ting-Hsuan Chien, and Rong-Guey Chang

National Chung Cheng University

The sequence transformer models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. High-accuracy models usually connect the encoder and decoder through an attention mechanism. Neural story generation is an important task: if computers can learn to tell stories, they can help people do more things. The sequence2sequence model combined with an attention mechanism is already used for Chinese poetry generation. However, it is difficult to apply to Chinese story generation, because Chinese poetry generation follows certain rules. Therefore, we use 500 human-labeled paragraph summaries from the classic novel "Demi-Gods and Semi-Devils" (天龍八部) to train the transformer network in a low-resource setting. In our experiment, we obtained a low loss rate across different epochs.

## Session Oral 13 (8/9 Fri. 10:20 – 11:20)

Session Topic: Design and Implementation of Deep Learning Based Object Detection Technologies

Session Chair: Kuan-Hung Chen (Feng-Chia University) and Chih-Peng Fan (National Chung Hsing University)

Room: 6F 樂廳

1. 10:20 – 10:32 (SC31) Drunk Driving Detection from Face Images Using Deep Neural Networks

Chia-Yu Wang, Te-Yun Ma, and Robert Chen-Hao Chang

**National Chung Hsing University** 

Drunk driving often leads to accidents that cause severe injury or death, and government regulations aim to prevent it. This paper introduces a drunk driving detection system that uses facial images captured by a webcam. A breathalyzer was used to label the data as drunk driving or not. A deep neural network is trained to identify whether an individual is drunk driving. This resolves the issue of traditional machine learning not being general enough. Experimental results show that the proposed detection system achieves better validation accuracy and test accuracy than other works.

2. 10:32 – 10:44 (SC32) Pedestrian Direction Detection Using YOLO-Based Deep Learning Networks

Shih-Chieh Lin, Min-Chi Lin, Yin-Tsung Hwang, and Chih-Peng Fan

National Chung Hsing University

In this paper, a simple and effective deep-learning-based detection and recognition design using the YOLO (You Only Look Once) network is studied for pedestrian direction detection. The proposed image-based detector provides both the direction and position of pedestrians simultaneously while the intelligent self-propelled vehicle moves through crowds. Experimental results report the precision and recall achieved by the proposed YOLO-based design.

3. 10:44 – 10:56 (SC33) Car Collision-Avoidance Warning System with Deep Learning on Portable Devices

Chuan-Wei Huang, Yu-Hau Huang, Yu-Chieh Chung, and Yeong-Kang Lai

National Chung Hsing University

Car collision-avoidance warning systems on portable devices, which aim to alert drivers about the driving environment, have become more and more popular. In these systems, robust and reliable car detection is a critical step. This paper presents a vision-based vehicle detection system using a deep learning approach on portable platforms. We focus on a mobile system with a camera mounted on the vehicle. Integrating detection with tracking is also discussed to illustrate the benefits of deep learning for car detection. Finally, we present experimental results demonstrating high efficiency on a portable device mounted on a car. The proposed car collision-avoidance warning system is suitable for self-driving car applications.

4. 10:56 – 11:08 (SC34) A reconfigurable processing unit hardware design for AI accelerator

Ching-Shun Wang, Yu-Cheng Hsueh, Hui-Ru Chung, and Chung-Bin Wu

National Chung Hsing University

In this paper, a reconfigurable and high-throughput processing unit hardware design for deep learning neural network accelerators is proposed. To reduce data access between DRAM and the processing units, an architecture with high data reuse and high computational-unit utilization is provided. The proposed architecture implements quantization-aware INT8 precision, a 64-bit AXI bus protocol, and parallel processing with 72 sets of processing units. The internal memory usage is 200 Kbytes, and the proposed design, operating at 100 MHz, provides a throughput of 12.5 GOPS with an average processing-unit utilization of 98.82%.

5. 11:08 – 11:20 (SC35) Design and Implementation of Visual Deep Learning Network via HLS

Yu-Ta Lu, Wen-Shen Gu, and Kuan-Hung Chen

Feng-Chia University

Object detection is a top-priority technology for machines to become intelligent. Deep learning algorithms have brought obvious improvements in detection performance; however, the accompanying huge model size, in terms of hundreds of layers and thousands of megabytes of weights, hinders physical realization. This article describes the hardware design and implementation of image recognition with a CNN model and details the design process of the hardware accelerator. Instead of manually coded Verilog, a High Level Synthesis (HLS) tool is adopted to generate the RTL code used on the FPGA, bridging the challenging gap between deep learning algorithms and customized hardware. As a result, a compressed CNN model, i.e., the Agile model [1], with a small size of only 2189 KB, can be successfully deployed on the PYNQ-Z2 FPGA board for recognition of cars, motorcycles, buses, and pedestrians.

## Session Oral 14 (8/9 Fri. 10:20 – 11:20)

Session Topic: Machine Learning for EDA

Session Chair: Ing-Chao Lin (National Cheng Kung University)

Room: 6F 茗廳

1. 10:20 – 10:32 (SB21) Register Clustering by Effective Mean Shift Algorithm

Iris Hui-Ru Jiang and Tung-Wei Lin

**National Taiwan University** 

With the wide adoption of FinFET technology in mass production, dynamic power has become the bottleneck to achieving low power. Therefore, clock power reduction is crucial in modern IC design. Register clustering can effectively save clock power because it significantly reduces the number of clock sinks, register pin capacitance, clock routed wire length, and the number of clock buffers. In this talk, we review prior works on register clustering and present effective mean shift to naturally form clusters according to register distribution without placement disruption. Unlike clique partitioning and k-means, effective mean shift fulfills the requirements of a good register clustering algorithm because it needs no prespecified number of clusters, is insensitive to initialization, is robust to outliers, is tolerant of various register distributions, is efficient and scalable, and balances clock power reduction against timing degradation. Experimental results show that effective mean shift achieves a superior balance between power and timing, as well as efficiency and scalability.
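For readers unfamiliar with the underlying technique, the sketch below shows textbook mean shift with a flat kernel on 2-D points. It is not the authors' effective mean shift variant: the bandwidth, the merge radius, and the interpretation of points as register placement coordinates are assumptions made only for illustration. It does show why no cluster count needs to be specified in advance: every point climbs to a local density mode, and points sharing a mode form a cluster.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift; returns (cluster_centers, label_per_point)."""
    modes = []
    for x, y in points:
        for _ in range(max_iter):
            # Neighbors inside the window centered on the current estimate.
            near = [(px, py) for px, py in points
                    if math.hypot(px - x, py - y) <= bandwidth]
            if not near:
                break
            nx = sum(p[0] for p in near) / len(near)
            ny = sum(p[1] for p in near) / len(near)
            shift = math.hypot(nx - x, ny - y)
            x, y = nx, ny
            if shift < tol:      # converged to a density mode
                break
        modes.append((x, y))

    # Points whose modes (nearly) coincide belong to the same cluster.
    centers, labels = [], []
    for mx, my in modes:
        for i, (cx, cy) in enumerate(centers):
            if math.hypot(mx - cx, my - cy) < bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append((mx, my))
            labels.append(len(centers) - 1)
    return centers, labels


# Hypothetical register coordinates forming two well-separated groups.
regs = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = mean_shift(regs, bandwidth=3.0)
print(centers, labels)
```

In this toy layout the two groups are found without specifying that there are two clusters, which is the property the abstract contrasts with k-means.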

2. 10:32 – 10:44 (SB22) Early-stage power grid analysis based on machine learning techniques

Chi-Hsien Pao and Yu-Min Lee

National Chiao Tung University

As technology advances, the growing size of the power grid makes IR drop analysis more challenging, and it can no longer be solved efficiently by traditional methods. We therefore propose a machine-learning-based static IR drop prediction model that can predict the IR drop of any given node on the power grid. Whenever designers locally change the power grid structure for optimization or robustness, the whole circuit does not have to be re-analyzed as a brand-new circuit, which is time-consuming. We extract eleven scalable features that capture the behavior of the power grid and use XGBoost, which is based on a regression tree ensemble, as our prediction model. XGBoost is extremely powerful and has been used by most of the winners in many competitions. Experimental results show that our method can be applied accurately and efficiently to large-scale designs.

3. 10:44 – 10:56 (SB23) Decision-Tree-Based Classification Method for SMD Electronic Components

Yun-Jie Ni (1), Yi-Ting Chen (2), Yan-Jhih Wang (2), and Tsung-Yi Ho (1)

(1) National Tsing Hua University and (2) FootPrintKu Inc.

Achieving automation through machine learning has been a growing trend in recent years. However, some know-how, such as footprint drawing for printed circuit boards (PCBs), resides only in engineers' minds and is not recorded. Drawing rules for footprints, such as the area of the Design for Assembly (DFA) bound and the route keepout, vary between component types. Therefore, a classification method for footprints without type labels in databases is needed. In this paper, we propose a decision-tree-based classification method for Surface Mounting Device (SMD) electronic components. Decision trees can handle numeric and categorical data and are interpretable, which helps in analyzing drawing rules. Information in footprints, such as the pad count, component height, and pin pitch of SMD components, is used as input features. The objective function of the decision trees is adjusted to reduce the number of leaf nodes. Thus, this method provides a better way to classify SMD electronic components and helps analyze design rules.

4. 10:56 – 11:08 (SB24) On Efficient Learning-based Performance Exploration for Analog Circuit Synthesis

Po-Cheng Pan, Chien-Chia Huang, and Hung-Ming Chen

National Chiao Tung University

An efficient synthesis technique for modern analog circuits is important yet challenging due to the repeated re-synthesis process. Precisely exploring the performance limits of an analog circuit in the required technology is time-consuming. This work presents a learning-based framework for searching the performance limits of analog circuits. With a hierarchical architecture, the dimension of the solution space can be reduced. Bayesian linear regression and support vector machine models are selected to speed up the algorithm and achieve better performance quality. Experimental results on two analog circuits show that our approach achieves up to 9x runtime speed-up without sacrificing performance quality.

5. 11:08 – 11:20 (SB25) Machine Learning-based Pin Accessibility Prediction and Optimization during Placement

Tao-Chun Yu and Shao-Yun Fang

National Taiwan University of Science and Technology

With the continuous scaling down of process nodes, standard cells become much smaller and cell counts increase dramatically. Pin accessibility has become one of the major issues causing design rule violations (DRVs). To tackle this problem, many recent works apply machine-learning-based techniques to predict whether a local region has a DRV by using global routing (GR) congestion and local pin density as the main features during training. Empirically, however, DRV occurrence is not necessarily strongly correlated with these two features in advanced nodes. In this paper, we propose a deep-learning-based DRV predictor that does not rely on GR congestion and pin density to identify whether a DRV will occur due to bad pin accessibility. Experimental results show that the proposed models are superior to those of previous studies in terms of all quantitative metrics. Additionally, the number of DRVs can be dramatically reduced by applying the proposed model-guided detailed placement flow.