
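The 2023 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) will be held in Montreal, Canada, from February 25 to March 1, 2023. CGO is a leading conference in the compiler field. It focuses on optimization and code generation techniques and related issues, covering topics that range from purely static to fully dynamic approaches, from purely software-based methods to architecture-specific features, and support for code generation and optimization.
This year CGO is co-located with three other academic conferences, CC, HPCA, and PPoPP, and is the first in-person edition since the pandemic. The Huawei Toronto Heterogeneous Compiler Lab, a sponsor of CGO 2023, will send a team of experts to the conference. Yaoqing Gao, director of Huawei's Compiler and Programming Languages Lab, and Professor Nelson Amaral of the University of Alberta, a senior advisor and collaborating professor of the lab, are co-organizing the CGO 2023 workshop Languages, Architectures, and Tools for Heterogeneous Computing Workshop (LATHC).
This article previews the following agenda highlights: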
01 State-of-the-Art Compilers and Programming Tools
CGO Session 1: It's all about loops!
Paper:
Code Synthesis for Sparse Tensor Format Conversion and Optimization
Authors:
Tobi Popoola (Boise State University), Tuowen Zhao (University of Utah), Aaron St. George (Boise State University), Kalyan Bhetwal (Boise State University), Michelle Strout (University of Arizona), Mary Hall (University of Utah), Catherine R. M. Olschanowsky (Boise State University)
Abstract:
Many scientific applications compute using sparse data and store that data in a variety of sparse formats because each format has unique space and performance benefits. Optimizing applications that use sparse data involves translating the sparse data into the chosen format and transforming the computation to iterate over that format. This paper presents a formal definition of sparse tensor formats and an automated approach to synthesize the transformation between formats. This approach is unique in that it supports ordering constraints not supported by other approaches and synthesizes the transformation code in a high-level intermediate representation suitable for applying composable transformations such as loop fusion and temporary storage reduction. We demonstrate that the synthesized code for COO to CSR with optimizations is 2.85x faster than TACO, Intel MKL, and SPARSKIT while the more complex COO to DIA is 1.4x slower than TACO but faster than SPARSKIT and Intel MKL using the geometric average of execution time.
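To make the COO to CSR conversion mentioned in the abstract concrete, here is a minimal hand-written sketch in Python. It is not the code synthesized by the paper's approach; it is the common counting-sort style conversion, and the function name `coo_to_csr` and its parameters are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def coo_to_csr(
    n_rows: int,
    rows: Sequence[int],
    cols: Sequence[int],
    vals: Sequence[float],
) -> Tuple[List[int], List[int], List[float]]:
    """Convert COO triplets (rows, cols, vals) into CSR arrays.

    Hypothetical helper for illustration only. Entries within a row keep
    the relative order they had in the COO input.
    """
    nnz = len(vals)

    # Pass 1: count the nonzeros in each row (stored shifted by one).
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1

    # Pass 2: prefix sum turns the counts into row start offsets.
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]

    # Pass 3: scatter every entry into the next free slot of its row.
    next_slot = row_ptr[:-1]  # slicing makes a copy, row_ptr stays intact
    col_idx = [0] * nnz
    values = [0.0] * nnz
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        col_idx[k] = c
        values[k] = v
        next_slot[r] += 1

    return row_ptr, col_idx, values
```

The three passes (count, prefix sum, scatter) are separate loops here; loop fusion and temporary storage reduction, the transformations named in the abstract, are the kind of optimizations that would target such intermediate passes and arrays.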
Leading professor:

Mary Hall, University of Utah
Mary Hall is a professor in the Computer Science department at the University of Utah. She directs the Compiler Technology to Optimize Performance (CTOP) research group. Her research interests cover automatic performance tuning, model-guided empirical optimization, interprocedural analysis and optimization, parallelizing compilers, programming support for optimization and parallelization, PIM-based architectures, and compiling to FPGA-based systems.
CGO Session 2: Tool and Practical Experience
Paper:
Lifting Code Generation of Cardiac Physiology Simulation to Novel Compiler Technology
Authors:
Arun Thangamani (ICube Lab., University of Strasbourg and INRIA Nancy-Grand Est), Tiago Trevisan Jost (ICube Lab., University of Strasbourg and INRIA Nancy-Grand Est), Vincent Loechner (ICube Lab., University of Strasbourg and INRIA Nancy-Grand Est), Stéphane Genaud (ICube Lab., University of Strasbourg and INRIA Nancy-Grand Est), Bérenger Bramas (ICube Lab., University of Strasbourg and INRIA Nancy-Grand Est)
Abstract:
The study of numerical models for the human body has become a major focus of the research community in biology and medicine. For instance, numerical ionic models of a complex organ, such as the heart, must be able to represent individual cells and their interconnections through ionic channels, forming a system with billions of cells, and requiring efficient code to handle such a large system. The modeling of the electrical system of the heart combines a compute-intensive kernel that calculates the intensity of current flowing through cell membranes, and feeds a linear solver for computing the electrical potential of each cell.
Considering this context, we propose limpetMLIR, a code generator and compiler transformer to accelerate the kernel phase of ionic models and bridge the gap between compiler technology and electrophysiology simulation. LimpetMLIR makes use of the MLIR infrastructure, its dialects, and transformations to drive forward the study of ionic models, and accelerate the execution of multi-cell systems. Experiments conducted in 43 ionic models show that our limpetMLIR based code generation greatly outperforms current state-of-the-art simulation systems by an average of 2.9×, reaching peak speedups of more than 15× in some cases. To the best of our knowledge, this is the first work that deeply connects an optimizing compiler infrastructure to electrophysiology models of the human body, showing the potential benefits of using compiler technology in the simulation of human cell interactions.
CGO Session 6: Tool and Practical Experience Ⅱ
Paper:
Bridging Control-Centric and Data-Centric Optimization
Authors:
Tal Ben-Nun (Lawrence Livermore National Laboratory), Berke Ates (ETH Zurich), Alexandru Calotoiu (ETH Zurich), Torsten Hoefler (ETH Zurich)
Abstract:
With the rise of specialized hardware and new programming languages, code optimization has shifted its focus towards promoting data locality. Most production-grade compilers adopt a control-centric mindset - instruction-driven optimization augmented with scalar-based dataflow - whereas other approaches provide domain-specific and general purpose data movement minimization, which can miss important control-flow optimizations. As the two representations are not commutable, users must choose one over the other. In this paper, we explore how both control- and data-centric approaches can work in tandem via the Multi-Level Intermediate Representation (MLIR) framework. Through a combination of an MLIR dialect and specialized passes, we recover parametric, symbolic dataflow that can be optimized within the DaCe framework. We combine the two views into a single pipeline, called DCIR, showing that it is strictly more powerful than either view. On several benchmarks and a real-world application in C, we show that our proposed pipeline consistently outperforms MLIR and automatically uncovers new optimization opportunities with no additional effort.
Leading professor:

Torsten Hoefler, ETH Zurich
Torsten Hoefler is a Professor of Computer Science at ETH Zurich, a member of Academia Europaea, and a Fellow of the IEEE. He directs the Scalable Parallel Computing Laboratory (SPCL) at D-INFK ETH Zurich. He received his PhD degree in 2007 at Indiana University and started his first professor appointment in 2011 at the University of Illinois at Urbana-Champaign. Following a “Performance as a Science” vision, he combines mathematical models of architectures and applications to design optimized computing systems.
Paper:
Parsimony: Enabling SIMD/Vector Programming in Standard Compiler Flows
Authors:
Vijay Kandiah (Northwestern University), Daniel Lustig (NVIDIA), Oreste Villa (NVIDIA), David Nellans (NVIDIA), Nikos Hardavellas (Northwestern University)
Abstract:
Achieving peak throughput on modern CPUs requires maximizing the use of single-instruction, multiple-data (SIMD) or vector compute units. Single-program, multiple-data (SPMD) programming models are an effective way to use high-level programming languages to target these ISAs. Unfortunately, many SPMD frameworks have evolved to have either overly-restrictive language specifications or under-specified programming models, and this has slowed the widescale adoption of SPMD-style programming. This paper introduces Parsimony (PARallel SIMd), an SPMD programming approach built with semantics designed to be compatible with multiple languages and to cleanly integrate into the standard optimizing compiler toolchains for those languages. We first explain the Parsimony programming model semantics and how they enable a standalone compiler IR-to-IR optimization pass that can perform vectorization independently of other compiler passes, improving the language and toolchain compatibility of SPMD programming. We then demonstrate an LLVM prototype of the Parsimony approach that matches the performance of ispc, a popular but more restrictive SPMD programming language, as well as achieving 97% of the performance of hand-written AVX-512 SIMD intrinsics on over 70 benchmarks ported from the Simd Library. We finally discuss where Parsimony has exposed parts of existing language and compiler flows where slight improvements could further enable improved SPMD program vectorization.
02 Neural Networks - Acceleration & Edge Computing
CGO Keynote
Talk title:
PyTorch 2.0 - the Journey to Bringing Compiler Technologies to the Core of PyTorch
Speaker:
Peng Wu
About the speaker:

Peng Wu, Meta AI
Dr. Peng Wu is the Engineering Manager for the PyTorch Compiler(s) team at Meta. She founded the Programming Technologies Lab at Huawei (one of the first in a major Chinese company) in 2015.
CGO Session 7: Neural Network Accelerators
Paper:
Accelerating Deep Neural Networks on Mobile Multicore NPUs
Authors:
Hanwoong Jung (Samsung Advanced Institute of Technology), Hexiang Ji (Samsung R&D Institute China Xian), Alexey Pushchin (Samsung R&D Institute Russia), Maxim Ostapenko (Samsung Advanced Institute of Technology), Wenlong Niu (Samsung R&D Institute China Xian), Ilya Palachev (Samsung R&D Institute Russia), Yutian Qu (Samsung R&D Institute China Xian), Pavel Fedin (Samsung R&D Institute Russia), Yuri Gribov (Samsung R&D Institute Russia), Heewoo Nam (Samsung Advanced Institute of Technology), Dongguen Lim (Samsung Advanced Institute of Technology), Hyunjun Kim (Samsung Advanced Institute of Technology), Joonho Song (Samsung Advanced Institute of Technology), Seungwon Lee (Samsung Advanced Institute of Technology), Hwansoo Han (Sungkyunkwan University)
Abstract:
Neural processing units (NPUs) have become indispensable parts of mobile SoCs. Furthermore, integrating multiple NPU cores into a single chip becomes a promising solution for ever-increasing computing power demands in mobile devices. This paper addresses techniques to maximize the utilization of NPU cores and reduce the latency of on-device inference. Mobile NPUs typically have a small amount of local memory (or scratch pad memory, SPM) that provides space only enough for input/output tensors and weights of one layer operation in deep neural networks (DNNs). Even in multicore NPUs, such local memories are distributed across the cores. In such systems, executing network layer operations in parallel is the primary vehicle to achieve performance. By partitioning a layer of DNNs into multiple sub-layers, we can execute them in parallel on multicore NPUs. Within a core, we can also employ pipelined execution to reduce the execution time of a sub-layer. In this execution model, synchronizing parallel execution and loading/storing intermediate tensors in global memory are the main bottlenecks. To alleviate these problems, we propose novel optimization techniques which carefully consider partitioning direction, execution order, synchronization, and global memory access. Using six popular convolutional neural networks (CNNs), we evaluate our optimization techniques in a flagship mobile SoC with three cores. Compared to the highest-performing partitioning approach, our techniques improve performance by 23%, achieving a speedup of 2.1x over single-core systems.
Paper:
PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM
Authors:
Yongwon Shin (Pohang University of Science and Technology), Juseong Park (Pohang University of Science and Technology), Sungjun Cho (Pohang University of Science and Technology), Hyojin Sung (Pohang University of Science and Technology)
Abstract:
Processing-in-Memory (PIM) has evolved over decades into a viable solution to mitigate the main memory bottleneck by placing computational logic in or near memory devices. Recently, DRAM manufacturers shared their ideas of commercial digital DRAM-PIM with HW constraint-aware MAC logic, which showed a significant speedup for memory-intensive operations in deep learning models. While convolutional neural networks have not been the main target for PIM acceleration due to high arithmetic intensity and data reuse, recent CNN models increasingly adopt computationally lightweight blocks with 1x1 and depthwise convolutional layers. Motivated by the potential for software interfaces that extend the scope of DRAM-PIM acceleration to 1x1 convolutional layers without hardware changes, we propose PIMFlow, an end-to-end compiler and runtime support to offload CNN models on a PIM-enabled GPU memory. PIMFlow not only supports task- and data-parallel execution across GPU and PIM, but also transforms DL model graphs to expose more PIM acceleration opportunities. PIMFlow achieves up to 34% end-to-end speedup and reduces energy consumption by 24% on average for a range of CNN models.
03 Languages, Architectures, and Tools for Heterogeneous Computing Workshop (LATHC)
Experts from the Toronto Heterogeneous Compiler Lab will give three talks at the LATHC workshop.
Talk 1
Topic:
Matrix Computation Acceleration in the Presence of Data Layout Conversions
Speaker:

Amy Wang, Senior Compiler Expert, Toronto Heterogeneous Compiler Lab
Abstract:
Accelerators for performing matrix multiplications have flourished due to the importance of GEMM in applications from the HPC and AI/ML domains. The matrix multiplication hardware often requires a special data layout, for instance, the row interleaved layout in Intel's advanced matrix extensions (AMX) and the 4D fractal layout in Huawei's DaVinci cube unit. However, programmers and traditional software stacks are accustomed to 2D data layouts such as the row and column major layouts. Commonly used BLAS libraries support only row and column major layouts. Thus, to leverage the power of the accelerators, pre- and post-processing on the host to convert data into the required accelerator layout is needed. This additional processing is overhead that eats into the end-to-end application performance. The application is modified to either perform the conversions on demand or, when possible, push the conversions to the entry and exit of the application such that data is kept in the accelerator layout throughout, in order to minimize the overhead.
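The pre- and post-processing step described in the abstract can be pictured with a generic tiled layout. The sketch below converts a flat row-major matrix into tile-major order and back. It is only an illustration: the tile shape and the helper names `to_tiled` and `from_tiled` are assumptions, and the real AMX and DaVinci layouts differ in their details.

```python
from typing import List, Sequence


def to_tiled(a: Sequence[float], rows: int, cols: int, tr: int, tc: int) -> List[float]:
    """Reorder a flat row-major rows x cols matrix into tile-major order.

    Tiles of shape tr x tc are stored one after another; elements inside
    a tile stay row-major. Hypothetical layout, for illustration only.
    """
    if rows % tr or cols % tc:
        raise ValueError("matrix shape must be a multiple of the tile shape")
    out: List[float] = []
    for bi in range(rows // tr):          # tile row
        for bj in range(cols // tc):      # tile column
            for i in range(tr):           # row inside the tile
                base = (bi * tr + i) * cols + bj * tc
                out.extend(a[base:base + tc])
    return out


def from_tiled(t: Sequence[float], rows: int, cols: int, tr: int, tc: int) -> List[float]:
    """Inverse of to_tiled: restore the flat row-major matrix."""
    if rows % tr or cols % tc:
        raise ValueError("matrix shape must be a multiple of the tile shape")
    out = [0.0] * (rows * cols)
    k = 0
    for bi in range(rows // tr):
        for bj in range(cols // tc):
            for i in range(tr):
                for j in range(tc):
                    out[(bi * tr + i) * cols + bj * tc + j] = t[k]
                    k += 1
    return out
```

Each call touches every element once, which is the overhead the abstract refers to. Converting only at application entry and exit means paying for one `to_tiled` and one `from_tiled` instead of a pair around every matrix operation.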
Talk 2
Topic:
Active mask computed in predication instruction vs stored in branch unit for SIMT execution
Speaker:

Kevin Lin, GPU Compiler Expert, Toronto Heterogeneous Compiler Lab
Abstract:
This paper explores the trade-off between using an active mask explicitly generated by compiler predication instructions and one implicitly used in the divergent PC table of a GPU branch unit. To quantify the results, we constructed a GPU simulator to evaluate different approaches for handling divergent execution in common applications for SIMT execution.
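As a rough picture of the first option in this trade-off, the toy model below runs an if/else over the lanes of one warp using explicit active masks, as predicated code would. It is not the simulator from the talk; the function name, the lane model, and the example branch are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def run_if_else_predicated(x: Sequence[int]) -> Tuple[List[int], List[bool], List[bool]]:
    """Toy SIMT model of: if x > 0: y = 2 * x  else: y = -x.

    Each element of x is the input of one lane. Both paths are issued
    for the whole warp; a per-lane mask decides which lanes commit.
    """
    n = len(x)
    active = [True] * n                      # all lanes active on entry
    y = [0] * n

    # A compare instruction writes the predicate for every lane.
    pred = [xi > 0 for xi in x]
    then_mask = [a and p for a, p in zip(active, pred)]
    else_mask = [a and not p for a, p in zip(active, pred)]

    # "then" path: only lanes enabled in then_mask write a result.
    for lane in range(n):
        if then_mask[lane]:
            y[lane] = 2 * x[lane]

    # "else" path: the complementary lanes write a result.
    for lane in range(n):
        if else_mask[lane]:
            y[lane] = -x[lane]

    return y, then_mask, else_mask
```

In the alternative design named in the abstract, the masks would not appear in the instruction stream; the branch unit would keep them alongside the pending program counters of the divergent paths.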
Talk 3
Topic:
Structure Peeling Using Runtime Memory Identifiers
Speaker:

Henry Kao, BiSheng Compiler Engineer, Toronto Heterogeneous Compiler Lab
Abstract:
Structure Peeling is a compiler performance optimization that transforms an array-of-structures (AoS) into a structure-of-arrays (SoA). Instead of structures being placed contiguously in AoS form, Structure Peeling will transform the memory layout of the AoS such that the same fields of the contiguous structures are grouped together in their own contiguous memory regions (SoA form). This transformation can improve the spatial locality of memory accesses and hence improve the performance of an application. We propose a novel method of Structure Peeling which allows us to safely peel multiple AoSs when static analysis cannot determine a single memory region where uses of an AoS may point to. We introduce a unique identifier, memory ID, as a tag for each live copy of an AoS that exists in the program. The memory ID is set and referenced at runtime to determine which of the multiple memory regions are accessed, eliminating the need to statically determine where each AoS originates from. Compared to state-of-the-art techniques, we are able to obtain 13% more speedup in the SPEC CPU2017 MCF application.
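The idea of tagging each peeled AoS with a memory ID can be sketched in a few lines of Python. This is only a conceptual model, not the compiler transformation itself: the class `PeeledRegions`, the `(mem_id, index)` reference tuple, and the field names in the usage example are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple

Ref = Tuple[int, int]  # (memory ID, element index): stands in for a pointer


class PeeledRegions:
    """Holds every peeled AoS as a structure-of-arrays, keyed by memory ID."""

    def __init__(self) -> None:
        self._regions: List[Dict[str, List[Any]]] = []

    def peel(self, aos: Sequence[Dict[str, Any]]) -> int:
        """Peel one AoS (a sequence of records) into SoA form.

        Returns the memory ID assigned to this live copy.
        """
        fields = list(aos[0].keys()) if aos else []
        soa = {f: [rec[f] for rec in aos] for f in fields}
        self._regions.append(soa)
        return len(self._regions) - 1

    def load(self, ref: Ref, field: str) -> Any:
        """Read a field; the memory ID selects the region at runtime."""
        mem_id, idx = ref
        return self._regions[mem_id][field][idx]

    def store(self, ref: Ref, field: str, value: Any) -> None:
        """Write a field through a (memory ID, index) reference."""
        mem_id, idx = ref
        self._regions[mem_id][field][idx] = value
```

A usage example: with `regions = PeeledRegions()`, calling `a = regions.peel([{"cost": 1, "flow": 0}, {"cost": 4, "flow": 2}])` and then `regions.load((a, 1), "cost")` reads the second record's `cost` from the region tagged `a`. Code that receives a reference does not need to know statically which AoS it came from, because the memory ID travels with the reference.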
Sources:
https://conf.researchr.org/home/cgo-2023
https://jnamaral.github.io/LATHC/
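The BiSheng Compiler official account will continue to follow the technical developments at CGO 2023 and bring you more in-depth technical content!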
