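The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29) will be held in Montreal, Canada, from February 27 to March 1, 2023. HPCA is a major conference in the computer architecture field. It focuses on high-performance computer architecture and covers topics including accelerators, approximate computing, compilers and programming languages for new architectures, CPU/microarchitecture, GPGPU/GPU computing, FPGA/CGRA/reconfigurable systems, cloud and edge computing, and near-/in-memory computing.
This year HPCA is co-located with three other academic conferences, CGO, CC, and PPoPP, and is the first in-person edition since the pandemic. Professor Nelson Amaral of the University of Alberta, a collaborating professor and senior advisor to Huawei's Toronto Heterogeneous Compiler Lab, serves as conference chair. As a sponsor of HPCA-29, the Toronto Heterogeneous Compiler Lab will send a team of experts to the conference for on-site exchanges.
This article previews the following agenda highlights: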
01 Architecture
HPCA Session 3B: Datacenters and HPC
Paper:
RAMBDA: RDMA-driven Acceleration Framework for Memory-intensive µs-scale Datacenter Applications
Authors:
Yifan Yuan (UIUC/Intel Labs), Jinghan Huang (UIUC), Yan Sun (UIUC), Tianchen Wang (UIUC), Jacob Nelson (Microsoft Research), Dan Ports (Microsoft Research), Yipeng Wang (Intel Labs), Ren Wang (Intel Labs), Charlie Tai (Intel Labs), Nam Sung Kim (UIUC)
Abstract:
Responding to the “datacenter tax” and “killer microseconds” problems for datacenter applications, diverse solutions including Smart NIC-based ones have been proposed. Nonetheless, they often suffer from high overhead of communications over network and/or PCIe links. To tackle the limitations of the current solutions, this paper proposes ORCA, a holistic network and architecture co-design solution that leverages current RDMA and emerging cache-coherent off-chip interconnect technologies. Specifically, ORCA consists of four hardware and software components: (1) unified abstraction of inter- and intra-machine communications managed by one-sided RDMA write and cache-coherent memory write; (2) efficient notification of requests to accelerators assisted by cache coherence; (3) cache-coherent accelerator architecture directly processing requests received by NIC; and (4) adaptive device-to-host data transfer for modern server memory systems consisting of both DRAM and NVM exploiting state-of-the-art features in CPUs and PCIe. We prototype ORCA with a commercial system and evaluate three popular datacenter applications: in-memory key-value store, chain replication-based distributed transaction system, and deep learning recommendation model inference. The evaluation shows that ORCA provides 30.1∼69.1% lower latency, up to 2.5× higher throughput, and ∼3× higher power efficiency than the current state-of-the-art solutions.
Leading expert:

Charlie Tai, Intel Labs
Charlie T. Tai is a senior architect of Intel Architecture Labs, and has focused on communications and networking technologies and strategies. He joined Intel in 1992 and has held a number of positions in research, architecture development, and program management. He has represented Intel in the UPnP Forum Steering Committee since its inception, and also has represented Intel in many other standards and industry organizations. He received his M.S. and Ph.D. degrees in computer science from UCLA, and his B.S. in computer science from National Taiwan University.
Leading professor:

Nam Sung Kim, UIUC
Nam Sung Kim is a full professor of electrical and computer engineering at the University of Illinois at Urbana-Champaign and an IEEE and ACM Fellow. His interdisciplinary research incorporates device, circuit, architecture, and software for power-efficient computing. He has published more than 200 refereed articles in highly selective conferences and journals in the fields of digital circuits, processor architecture, and computer-aided design. His three most frequently cited papers have more than 4500 citations, and the total number of citations of all his papers exceeds 12000.
HPCA Session 7B: Microarchitecture and Memory Systems
Paper:
A Storage-Effective BTB Organization for Servers
Authors:
Truls Asheim (Norwegian University of Science and Technology), Boris Grot (University of Edinburgh), Rakesh Kumar (Norwegian University of Science and Technology)
Abstract:
Many contemporary applications feature multi-megabyte instruction footprints that overwhelm the capacity of branch target buffers (BTB) and instruction caches (L1-I), causing frequent front-end stalls that inevitably hurt performance. BTB capacity is crucial for performance as a sufficiently large BTB enables the front-end to accurately resolve the upcoming execution path and steer instruction fetch appropriately. Moreover, it also enables highly effective fetch-directed instruction prefetching that can eliminate a large portion of L1-I misses. For these reasons, commercial processors allocate vast amounts of storage capacity to BTBs.
This work aims to reduce BTB storage requirements by optimizing the organization of BTB entries. Our key insight is that storing branch target offsets, instead of full or compressed targets, can drastically reduce BTB storage cost as the vast majority of dynamic branches have short offsets requiring just a handful of bits to encode. Based on this insight, we size the ways of a set-associative BTB to hold a different number of target offset bits such that each way stores offsets within a particular range. Doing so enables a dramatic reduction in storage for target addresses. Our final design, called BTB-X, uses an 8-way set-associative BTB with differently sized ways that enables it to track about 2.24x more branches than a conventional BTB and 1.3x more branches than a storage-optimized state-of-the-art BTB organization, called PDede, with the same storage budget.
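The offset idea in this abstract can be illustrated with a minimal Python sketch. It is not the BTB-X design itself: modelling the offset as a signed distance from the branch to its target, the per-way widths in `WAY_OFFSET_BITS`, and the helper names `offset_bits` and `pick_way` are all assumptions made for illustration, and the paper's actual encoding and way sizes may differ.

```python
from typing import Optional

# Hypothetical offset-field width (in bits) of each way of an 8-way set.
# These numbers are illustrative only and are not taken from the paper.
WAY_OFFSET_BITS = [4, 6, 8, 10, 12, 16, 24, 48]


def offset_bits(pc: int, target: int) -> int:
    """Return the bits needed to hold (target - pc) as a signed integer."""
    delta = target - pc
    n = 1
    # A signed n-bit field covers [-2**(n-1), 2**(n-1) - 1].
    while not (-(1 << (n - 1)) <= delta <= (1 << (n - 1)) - 1):
        n += 1
    return n


def pick_way(pc: int, target: int) -> Optional[int]:
    """Pick the narrowest way whose offset field can hold this branch."""
    need = offset_bits(pc, target)
    for way, width in enumerate(WAY_OFFSET_BITS):
        if width >= need:
            return way
    return None  # The offset does not fit in any way of this toy model.


if __name__ == "__main__":
    # A short forward branch lands in the narrowest way.
    print(pick_way(0x1000, 0x1004))
    # A very distant target needs the widest way.
    print(pick_way(0, 1 << 40))
```

Under these assumptions, short branches occupy only a few bits of target storage, which is the intuition behind sizing the ways differently.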
Paper:
HoPP: Hardware-Software Co-Designed Page Prefetching for Disaggregated Memory
Authors:
Haifeng Li (Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences), Ke Liu (Institute of Computing Technology, Chinese Academy of Sciences), Ting Liang (Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences), Zuojun Li (Institute of Computing Technology, Chinese Academy of Sciences), Tianyue Lu (Institute of Computing Technology, Chinese Academy of Sciences), Hui Yuan (Huawei), Yinben Xia (Huawei), Yungang Bao (ICT, CAS), Mingyu Chen (Institute of Computing Technology, Chinese Academy of Sciences), Yizhou Shan (Huawei Cloud)
Abstract:
Memory disaggregation is a promising direction to mitigate memory contention in datacenters. To make memory disaggregation practical, prior efforts expose remote memory to applications transparently via the virtual memory subsystem’s swapping interface. However, because of the semantic gap between the OS and applications, the OS can learn an application's memory access sequence only through page faults. This approach has two limitations. First, it learns little from page faults’ access history, which leads to sub-optimal prefetching predictions. Second, a page fault can still occur even if there is a prefetch-hit, which leads to a large kernel overhead. To address such limitations, our key insight is to decouple the address capturing from page faults by collecting full memory access traces in the memory controller. Using this idea, we build HoPP, a hardware-software co-designed prefetching framework. HoPP adds hardware modules to the memory controller to feed sufficient hot pages to the OS in real time, which has three benefits in HoPP’s software design: 1) it improves existing prefetching algorithms with simple revamps, and also offers more insights to build better policies; 2) the prefetch algorithm can run as a separate data path alongside the normal remote data path via page faults, potentially hiding the swap latency from applications, and enabling fine-grained control over prefetching behaviors; 3) the prefetch-hit overhead can be eliminated by early page table entry (PTE) injection, i.e., injecting the PTE for the prefetched page as soon as it returns. We implemented a proof-of-concept prototype using commodity servers along with a hardware-based memory tracking tool called HMTT to emulate a modified memory controller. Results show that compared to Fastswap and Leap, the HoPP-optimized prefetching algorithm achieves over 90% accuracy and coverage, which leads to up to 59% completion time improvement for various datacenter applications.
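As a rough illustration of what "feeding hot pages to the OS" from a full access trace could look like, the following Python sketch counts accesses per page and reports the pages that cross a threshold. It is only a toy model under stated assumptions: the 4 KiB page size (`PAGE_SHIFT`), the threshold rule, and the function name `hot_pages` are hypothetical and are not taken from HoPP, whose hardware modules and policies are described in the paper.

```python
from collections import Counter
from typing import Iterable, List

PAGE_SHIFT = 12  # Assumes 4 KiB pages; an illustrative choice.


def hot_pages(trace: Iterable[int], threshold: int) -> List[int]:
    """Return page numbers accessed at least `threshold` times in the trace."""
    # Map every address to its page number and count accesses per page.
    counts = Counter(addr >> PAGE_SHIFT for addr in trace)
    return sorted(page for page, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical trace of physical addresses seen at the memory controller.
    trace = [0x1000, 0x1008, 0x1FF8, 0x2000, 0x5000, 0x5010]
    print(hot_pages(trace, threshold=2))
```

Unlike a page-fault-driven view, a trace like this includes every access to pages that are already mapped, which is the extra information the abstract refers to.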
Leading professor:

Mingyu Chen, Institute of Computing Technology, Chinese Academy of Sciences
Mingyu Chen (陈明宇) received the BEng degree in electronics engineering and information science from the University of Science and Technology of China, Hefei, China, and the PhD degree in computer architecture from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, China, in 1994 and 2000, respectively. He is currently a professor at ICT, CAS, and the University of Chinese Academy of Sciences (UCAS), Beijing, China. His current research interests include high-performance computer architecture and operating systems.
02 Neural Networks - Acceleration & Edge Computing
HPCA Session 2A: Accelerators
Paper:
HIRAC: A Hierarchical Accelerator with Sorting-based Packing for SpGEMMs in DNN Applications
Authors:
Hesam Shabani (Lehigh University), Abhishek Singh (Lehigh University), Bishoy Youhana (Lehigh University), Xiaochen Guo (Lehigh University)
Abstract:
Not yet available
Paper:
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design
Authors:
Haoran You (Georgia Tech), Zhanyi Sun (Rice University), Huihong Shi (Nanjing University), Zhongzhi Yu (Georgia Tech), Yang Zhao (Rice University), Yongan Zhang (Georgia Tech), Chaojian Li (Georgia Tech), Baopu Li (Oracle Health and AI), Yingyan (Celine) Lin (Georgia Tech)
Abstract:
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively.
HPCA Session 4A: Neural Networks and Accelerators 2
Paper:
ISOSceles: Accelerating Sparse CNNs through Inter-Layer Pipelining
Authors:
Yifan Yang (MIT), Joel Emer (MIT/NVIDIA), Daniel Sanchez (MIT)
Abstract:
Not yet available
Leading professor:

Daniel Sanchez, MIT
Daniel Sanchez is an Associate Professor at MIT's Electrical Engineering and Computer Science Department and a member of the Computer Science and Artificial Intelligence Laboratory. He works in computer architecture and computer systems. His current research focuses on large-scale multicores with hundreds to thousands of cores, scalable and efficient memory hierarchies, architectures with quality-of-service guarantees, and scalable runtimes and schedulers. Before joining MIT in September 2012, he earned a Ph.D. in Electrical Engineering from Stanford University, where he worked with Professor Christos Kozyrakis.
HPCA Session 4C: Quantum and FPGAs
Paper:
Duet: Creating Harmony between Processors and Embedded FPGAs
Authors:
Ang Li (Princeton University), August Ning (Princeton University), David Wentzlaff (Princeton University)
Abstract:
The demise of Moore's Law has led to the rise of hardware acceleration. However, the focus on accelerating stable algorithms in their entirety neglects the abundant fine-grained acceleration opportunities available in broader domains and squanders host processors' compute power. This paper presents Duet, a scalable, manycore-FPGA architecture that promotes embedded FPGAs (eFPGA) to be equal peers with processors through non-intrusive, bi-directionally cache-coherent integration. In contrast to existing CPU-FPGA hybrid systems in which the processors play a supportive role, Duet unleashes the full potential of both the processors and the eFPGAs with two classes of post-fabrication enhancements: fine-grained acceleration, which partitions an application into small tasks and offloads the frequently-invoked, compute-intensive ones onto various small accelerators, leveraging the processors to handle dynamic control flow and less accelerable tasks; hardware augmentation, which employs eFPGA-emulated hardware widgets to improve processor efficiency or mitigate software overheads in certain execution models. An RTL-level implementation of Duet is developed to evaluate the architecture with high fidelity. Experiments using synthetic benchmarks show that Duet can reduce the processor-accelerator communication latency by up to 82% and increase the bandwidth by up to 9.5x. The RTL implementation is further evaluated with seven application benchmarks, achieving 1.5-24.9x speedup.
HPCA Session 5A: Cloud and Edge Computing
Paper:
eNODE: Energy-Efficient and Low-Latency Edge Inference and Training of Neural ODEs
Authors:
Junkang Zhu (University of Michigan, Ann Arbor), Yaoyu Tao (University of Michigan, Ann Arbor), Zhengya Zhang (University of Michigan, Ann Arbor)
Abstract:
Not yet available
Paper:
SpecFaaS: Accelerating Serverless Applications with Speculative Function Execution
Authors:
Jovan Stojkovic (University of Illinois at Urbana-Champaign), Tianyin Xu (University of Illinois at Urbana-Champaign), Hubertus Franke (IBM Research), Josep Torrellas (University of Illinois at Urbana-Champaign)
Abstract:
Serverless computing automates fine-grained resource scaling and simplifies the development and deployment of online services with stateless functions. However, it is still non-trivial for users to allocate appropriate resources due to various function types, dependencies, and input sizes. Misconfiguration of resource allocations leaves functions either under-provisioned or over-provisioned and leads to continuous low resource utilization. This paper presents Freyr, a new resource manager (RM) for serverless platforms that maximizes resource efficiency by dynamically harvesting idle resources from over-provisioned functions to under-provisioned functions. Freyr monitors each function's resource utilization in real-time, detects over-provisioning and under-provisioning, and learns to harvest idle resources safely and accelerates functions efficiently by applying deep reinforcement learning algorithms along with a safeguard mechanism. We have implemented and deployed a Freyr prototype in a 13-node Apache OpenWhisk cluster. Experimental results show that 38.8% of function invocations have idle resources harvested by Freyr, and 39.2% of invocations are accelerated by the harvested resources. Freyr reduces the 99th-percentile function response latency by 32.1% compared to the baseline RMs.
Leading professor:

Josep Torrellas, UIUC
Josep Torrellas is Professor and Willett Faculty Scholar in the Department of Computer Science and a research faculty for the Universal Parallel Computing Research Center at the University of Illinois at Urbana-Champaign. Torrellas has made many contributions to shared-memory multiprocessor architectures and thread-level speculation (TLS) over more than thirty years. He was the first to apply TLS ideas to parallel architectures and programs: He used speculative multithreading to eliminate stalling due to synchronization (Speculative Synchronization), identify and debug data races (ReEnact), monitor memory accesses (I-Watcher), provide fault tolerance (ReVive), and inexpensively enforce sequential consistency (BulkSC). Some of these ideas impacted IBM's Blue Gene and other commercial machines.
Paper:
Know Your Enemy To Save Cloud Energy: Energy-Performance Characterization of Machine Learning Serving
Authors:
Jun Yeol Ryu (Sungkyunkwan University), Jongseok Kim (Sungkyunkwan University), Euiseong Seo (Sungkyunkwan University)
Abstract:
Not yet available
HPCA Session 6A: Industry Track Session
Paper:
High Performance and Power Efficient Accelerator for Cloud Inference
Authors:
Jianguo Yao (SJTU/Enflame-Tech Inc.), Hao Zhou (Enflame-Tech Inc.), Yalin Zhang (Enflame-Tech Inc.), Ying Li (Enflame-Tech Inc.), Chuang Feng (Enflame-Tech Inc.), Shi Chen (Enflame-Tech Inc.), Jiaoyan Chen (Enflame-Tech Inc.), Yongdong Wang (Enflame-Tech Inc.), Qiaojuan Hu (Enflame-Tech Inc.)
Abstract:
Not yet available
Source:
https://hpca-conf.org/2023/
The BiSheng Compiler official account will continue to follow technical developments at HPCA-29 and bring you more technical coverage!
