順序 | 口頭報告論文 |
1 | 11:30 – 11:42 (SC11) Memory-contention Aware Warp Scheduler for Computing GPU Chien-Ming Chiu, Kuan-Chung Chen, Jhi-Han Jheng, Kuan-Lin Huang, Tsung-Han Tsou, Feng-Ming Hsu, Juin-Ming Lu, and Chung-Ho Chen National Cheng Kung University Abstract: We will first introduce a computing GPU which supports both OpenCL and TensorFlow framework. The proposed GPU aims at the deployment of edge AI computing devices. To address the memory contention problem of the GPU, a serious performance bottleneck in resource-limited GPUs, we propose a Memory Contention Aware Warp Scheduler (MAWS) to strike a dynamic balance between the memory workload requirements and the given memory resources. By measuring the Load/Store Unit (LSU) Stall ratio in a sampling interval and accurately monitoring the variations in memory contention, MAWS finds a suitable warp concurrency that fits the limited memory resources well, and as a result significantly improve the effective throughput. |
2 | 11:42 – 11:54 (SC12) Establishing Cooperation Between Camera Applications and Flash-Based Storage to Improve JPEG File Reliability Yu-Chun Kuo, Chia-Yu Hu, Ruei-Fong Chiu, and Ren-Shuo Liu National Tsing Hua University Abstract: NAND flash-based storage such as SD cards and eMMC chips are the most widely used media for Camera Applications. In this work, we propose to establish cooperation between camera applications and flash-based storage to improve the reliability of JPEG files stored in the storage. We conduct realsystem experiments by storing JPEG files on flash chips to evaluate the benefits of our proposed techniques. Experimental results demonstrate that the reliability of JPEG files can be significantly enhanced. |
3 | 11:54 – 12:06 (SC13) Considerations of Integrating Computing-In-Memory and Processing-In-Sensor into Convolutional Neural Network Accelerators for Low-Power Edge Devices Kea-Tiong Tang (1), Wei-Chen Wei (1), Zuo-Wei Yeh (1), Tzu-Hsiang Hsu (1), Yen-Cheng Chiu (1), Cheng-Xin Xue (1), Yu-Chun Kuo (1), Tai-Hsing Wen (1), Mon-Shu Ho (2), Chung-Chuan Lo (1), Ren-Shuo Liu (1), Chih-Cheng Hsieh (1), and Meng-Fan Chang (1) (1)National Tsing Hua University and (2) National Chung Hsin University. Abstract: In quest to explore emerging deep learning algorithms at edge devices, developing low-power and low-latency deep learning acceelerators (DLAs) have become top priority. To achieve this goal, data processing techniques in sensor and memory utilizing the array structure have drawn much attention. Processing-in-sensor (PIS) solutions could reduce data transfer, computingin-memory (CIM) macros could reduce memory access and intermediate data movement. We propose a new architecture to integrate PIS and CIM to realize low-power DLA. The advantages of using these techniques and the challenges from system point-of-view are discussed. |
4 | 12:06 – 12:18 (SC14) STT-MRAM for Deep Convolutional Neural Network Acceleration Chih-Cheng Chang, Chun-Hsien Li, Tian-Sheuan Chang, and Tuo-Hung Hou National Chiao Tung University Abstract: Binary STT-MRAM is a highly anticipated embedded non-volatile memory technology in advanced logic nodes < 28 nm. How to enable its in-memory computing (IMC) capability is critical for enhancing AI Edge. Based on the soon-available STT-MRAM, we report the first binary deep convolutional neural network capable of both local and remote learning. Exploiting intrinsic cumulative switching probability, accurate online training of CIFAR-10 color images (~ 90%) is realized using a relaxed endurance spec (switching 20 times) and hybrid digital/IMC design. For offline training, the accuracy loss due to imprecise weight placement can be mitigated using a rapid non-iterative training-with-noise and fine-tuning scheme. |
5 | 12:18 – 12:30 (SC15) Efficient Design of Multiple Writes for Algorithmic Multi-ported Memory Bo-Ya Chen, Bo-En Chen, and Bo-Cheng Lai National Chiao Tung University Abstract: This paper proposes REMAP+, a novel design that enables efficient write scheme for algorithmic multi-ported memory, and attains better performance with smaller area. REMAP+ applies the banking structure of memory design and implements the remap table with SRAM cells instead of costly registers. In the remap table, REMAP+ only keeps the most significant bit of write addresses to more efficiently utilize the space in the table. The hash write controller is simplified with the first fit algorithm to handle write conflict with shorter latency. REMAP+ is implemented in a pipeline scheme to further increase the processing throughput. For a 3W1R memory with 16K depth, REMAP+ has attained 22% shorter access latency and 31.3% smaller area when compared with the previous design. |
指導單位:
教育部資訊及科技教育司
主辦單位:
臺灣積體電路設計學會
承辦單位:
國立中正大學電機工程學系
國立中正大學資訊工程學系
協辦單位:
科技部工程司工程科技推展中心
國研院台灣半導體研究中心
中國電機工程學會
智慧聯網整合推動聯盟中心
中央研究院資訊科學研究所
贊助單位:
聯發科技股份有限公司
日月光半導體製造股份有限公司
威鋒電子股份有限公司
奇景光電股份有限公司
荷蘭商益華國際電腦科技(股)公司台灣分公司
台灣新思科技股份有限公司
財團法人工業技術研究院
瑞昱半導體股份有限公司
聯詠科技股份有限公司
鈦思科技股份有限公司
台灣是德科技股份有限公司
和澄科技股份有限公司
愛爾蘭商明導股份有限公司台灣分公司
一元素科技股份有限公司
力旺電子股份有限公司
博鑫醫電股份有限公司
群聯電子股份有限公司
晶心科技股份有限公司