
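The 2023 Principles and Practice of Parallel Programming (PPoPP) conference will be held from February 25 to March 1, 2023, in Montreal, Canada. PPoPP is a leading conference in parallel programming, covering research on concurrent and parallel systems across theoretical foundations, techniques, languages, compilers, runtime systems, tools, and practical experience. As parallel architectures spread through consumer markets and data centers, PPoPP is increasingly interested in work that addresses new parallel workloads and the problems arising from extreme-scale applications or cloud platforms, as well as techniques that improve parallel programming productivity or work toward better synergy with such emerging architectures.
This year's PPoPP will be co-located with three other academic conferences, CGO, CC, and HPCA, and is the first in-person edition since the pandemic. As a sponsor of PPoPP 2023, the Huawei Toronto Heterogeneous Compiler Lab will send a team of experts to the conference to exchange ideas on leading-edge trends in areas such as neural network acceleration, edge computing, and GPU compilation.
This article previews the following agenda highlights: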
PPoPP Session 2: Programming
Paper:
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs
Authors:
William S. Moses (Massachusetts Institute of Technology), Ivan Radanov Ivanov (Tokyo Institute of Technology), Jens Domke (RIKEN Center for Computational Science), Toshio Endo (Tokyo Institute of Technology), Johannes Doerfert (Lawrence Livermore National Laboratory), Oleksandr Zinenko (Google)
Abstract:
While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model.
We propose an alternative approach that automatically translates programs written in one programming model (CUDA), into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer making use of transpiled CUDA PyTorch kernels outperforms the PyTorch CPU native backend by 2.7×.
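As a rough illustration of what running a GPU-style kernel on CPU threads means, the sketch below executes a grid of blocks as a parallel loop over blocks, with a sequential loop over the threads inside each block. It is not the paper's Polygeist/MLIR pipeline: the names `vec_add_kernel` and `launch_on_cpu` are hypothetical, the kernel is a plain Python function, and barriers and shared memory, which a real transpiler must handle, are ignored.

```python
from concurrent.futures import ThreadPoolExecutor


def vec_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """A CUDA-style kernel body: each logical thread handles one element."""
    i = block_idx * block_dim + thread_idx  # global index, as in CUDA
    if i < len(out):  # bounds guard, as in a typical CUDA kernel
        out[i] = a[i] + b[i]


def launch_on_cpu(kernel, grid_dim, block_dim, *args):
    """Hypothetical launcher: blocks run on a CPU thread pool,
    threads within a block run as an ordinary sequential loop."""

    def run_block(block_idx):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

    with ThreadPoolExecutor() as pool:
        # Consume the iterator so that any exception in a block is raised here.
        list(pool.map(run_block, range(grid_dim)))


a = list(range(10))
b = [10 * x for x in range(10)]
out = [0] * 10
# 3 blocks of 4 threads cover 12 logical threads, enough for 10 elements.
launch_on_cpu(vec_add_kernel, 3, 4, a, b, out)
```

Each block writes a disjoint slice of `out`, so the blocks can run in any order without synchronization; that independence is what makes the outer loop safe to parallelize in this sketch.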
PPoPP Session 7: Machine Learning
Paper:
Elastic Averaging for Efficient Pipelined DNN Training
Authors:
Zihao Chen (East China Normal University), Chen Xu (East China Normal University), Weining Qian (East China Normal University), Aoying Zhou (East China Normal University)
Abstract:
Nowadays, the size of DNN models has grown rapidly. To train a large model, pipeline parallelism-based frameworks partition the model across GPUs and slice each batch of data into multiple micro-batches. However, pipeline parallelism suffers from a bubble issue and low peak utilization of GPUs. Recent work tries to address the two issues, but fails to exploit the benefit of vanilla pipeline parallelism, i.e., overlapping communication with computation. In this work, we employ an elastic averaging-based framework which explores elastic averaging to add multiple parallel pipelines. To help the framework exploit the advantage of pipeline parallelism while reducing the memory footprints, we propose a schedule, advance forward propagation. Moreover, since the numbers of parallel pipelines and micro-batches are essential to the framework performance, we propose a profiling-based tuning method to automatically determine the settings. We integrate those techniques into a prototype system, namely AvgPipe, based on PyTorch. Our experiments show that AvgPipe achieves a 1.7x speedup over state-of-the-art solutions of pipeline parallelism on average.
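The abstract does not spell out AvgPipe's update rule, so the following is only a generic sketch of an elastic averaging step in the style of elastic averaging SGD: each parallel replica takes its own gradient step while being pulled toward a shared center copy, and the center moves toward the replicas. The function name `elastic_step`, the plain-list parameter layout, and the values of `lr` and `alpha` are assumptions made for illustration.

```python
def elastic_step(replicas, center, grads, lr=0.1, alpha=0.05):
    """One generic elastic averaging step (illustrative, not AvgPipe's code).

    replicas: list of parameter vectors, one per parallel pipeline
    center:   shared reference parameter vector
    grads:    list of gradient vectors, one per replica
    """
    new_replicas = []
    for x, g in zip(replicas, grads):
        # Local gradient step plus an elastic pull toward the center.
        new_replicas.append(
            [xi - lr * gi - alpha * (xi - ci) for xi, gi, ci in zip(x, g, center)]
        )
    # The center moves toward the replicas by the same elastic force.
    new_center = [
        ci + alpha * sum(x[j] - ci for x in replicas)
        for j, ci in enumerate(center)
    ]
    return new_replicas, new_center


# Two replicas on either side of the center, with zero gradients:
# both are pulled toward the center while the center stays where it is.
replicas, center = elastic_step(
    [[1.0], [3.0]], [2.0], [[0.0], [0.0]], lr=0.0, alpha=0.5
)
```

The elastic term lets replicas explore independently between synchronizations, which is the property that makes it possible to run several pipelines side by side without forcing them to hold identical weights.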
Paper:
TGOpt: Redundancy-Aware Optimizations for Temporal Graph Attention Networks
Authors:
Yufeng Wang (University of Illinois at Urbana-Champaign), Charith Mendis (University of Illinois at Urbana-Champaign)
Abstract:
Temporal Graph Neural Networks are gaining popularity in modeling interactions on dynamic graphs. Among them, Temporal Graph Attention Networks (TGAT) have gained adoption in predictive tasks, such as link prediction, in a range of application domains. Most optimizations and frameworks for Graph Neural Networks (GNNs) focus on GNN models that operate on static graphs. While a few of these optimizations exploit redundant computations on static graphs, they are either not applicable to the self-attention mechanism used in TGATs or do not exploit optimization opportunities that are tied to temporal execution behavior.
In this paper, we explore redundancy-aware optimization opportunities that specifically arise from computations that involve temporal components in TGAT inference. We observe considerable redundancies in temporal node embedding computations, such as recomputing previously computed neighbor embeddings and time-encoding of repeated time delta values. To exploit these redundancy opportunities, we developed TGOpt which introduces optimization techniques based on deduplication, memoization, and precomputation to accelerate the inference performance of TGAT. Our experimental results show that TGOpt achieves a geomean speedup of 4.9× on CPU and 2.9× on GPU when performing inference on a wide variety of dynamic graphs, with up to 6.3× speedup for the Reddit Posts dataset on CPU.
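The idea of memoizing the time encoding of repeated time delta values can be sketched in a few lines. This is not TGOpt's implementation: the cosine-based `time_encode` function, the fixed `FREQS` values, and the `CALLS` counter are all hypothetical stand-ins, and a real model would use learned frequencies and tensor operations.

```python
import math
from functools import lru_cache

# Hypothetical fixed frequencies; a trained model would learn these.
FREQS = (1.0, 0.1, 0.01, 0.001)

# Counts how often the encoding is actually computed (for illustration).
CALLS = {"count": 0}


@lru_cache(maxsize=None)
def time_encode(delta: float) -> tuple:
    """Encode a time delta as a vector of cosines; results are cached,
    so a repeated delta is looked up instead of recomputed."""
    CALLS["count"] += 1
    return tuple(math.cos(delta * w) for w in FREQS)


def encode_all(deltas):
    """Encode every delta in order, reusing cached results for repeats."""
    return [time_encode(d) for d in deltas]


# Six lookups, but only three distinct delta values.
deltas = [0.0, 5.0, 5.0, 0.0, 12.0, 5.0]
encodings = encode_all(deltas)
```

After this run `CALLS["count"]` is 3 rather than 6, because only the distinct deltas trigger a computation. The saving grows with how often the same delta recurs in the workload.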
Paper:
DSP: Efficient GNN Training with Multiple GPUs
Authors:
Zhenkun Cai (The Chinese University of Hong Kong), Qihui Zhou (The Chinese University of Hong Kong), Xiao Yan (Southern University of Science and Technology), Da Zheng (Amazon Web Services), Xiang Song (Amazon Web Services), Chenguang Zheng (The Chinese University of Hong Kong), James Cheng (The Chinese University of Hong Kong), George Karypis (Amazon Web Services)
Abstract:
Jointly utilizing multiple GPUs to train graph neural networks (GNNs) is crucial for handling large graphs and achieving high efficiency. However, we find that existing systems suffer from high communication costs and low GPU utilization due to improper data layout and training procedures. Thus, we propose a system dubbed Distributed Sampling and Pipelining (DSP) for multi-GPU GNN training. DSP adopts a tailored data layout to utilize the fast NVLink connections among the GPUs, which stores the graph topology and popular node features in GPU memory. For efficient graph sampling with multiple GPUs, we introduce a collective sampling primitive (CSP), which pushes the sampling tasks to data to reduce communication. We also design a producer-consumer-based pipeline, which allows tasks from different mini-batches to run congruently to improve GPU utilization. We compare DSP with state-of-the-art GNN training frameworks, and the results show that DSP consistently outperforms the baselines under different datasets, GNN models and GPU counts. The speedup of DSP can be an order of magnitude and is over 2x in most cases.
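A producer-consumer pipeline of the kind mentioned in the abstract can be sketched with a bounded queue: a sampler thread prepares mini-batches while a trainer thread consumes earlier ones, so the two stages overlap. This is a toy CPU-only illustration, not DSP's design; `run_pipeline`, the fake sampling step, and the `sum` used in place of training are all assumptions.

```python
import queue
import threading


def run_pipeline(num_batches, queue_size=2):
    """Overlap a sampling stage and a training stage via a bounded queue."""
    q = queue.Queue(maxsize=queue_size)  # bounded, so the sampler cannot run far ahead
    trained = []

    def sampler():
        for batch_id in range(num_batches):
            # Stand-in for graph sampling: produce three fake node ids.
            sample = [batch_id * 10 + k for k in range(3)]
            q.put((batch_id, sample))
        q.put(None)  # sentinel: no more batches

    def trainer():
        while True:
            item = q.get()
            if item is None:
                break
            batch_id, sample = item
            # Stand-in for a training step on the sampled mini-batch.
            trained.append((batch_id, sum(sample)))

    threads = [threading.Thread(target=sampler), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trained


results = run_pipeline(4)
```

The queue bound is the main design knob in this sketch: a larger bound lets sampling run further ahead of training at the cost of holding more prepared mini-batches in memory.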
Paper:
PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs
Authors:
Chunyang Wang (Beihang University), Desen Sun (Beihang University), Yuebin Bai (Beihang University)
Abstract:
Dynamic Graph Neural Networks (DGNNs) have been widely applied in various real-life applications, such as link prediction and pandemic forecast, to capture both static structural information and temporal characteristics from dynamic graphs. Combining both time-dependent and -independent components, DGNNs manifest substantial parallel computation and data reuse potentials, but suffer from severe memory access inefficiency and data transfer overhead under the canonical one-graph-at-a-time training pattern. To tackle these challenges, we propose PiPAD, a Pipelined and PArallel DGNN training framework for the end-to-end performance optimization on GPUs. From both algorithm and runtime level, PiPAD holistically reconstructs the overall training paradigm from the data organization to computation manner. Capable of processing multiple graph snapshots in parallel, PiPAD eliminates unnecessary data transmission and alleviates memory access inefficiency to improve the overall performance. Our evaluation across various datasets shows PiPAD achieves 1.22x - 9.57x speedup over the state-of-the-art DGNN frameworks on three representative models.
Source:
https://ppopp23.sigplan.org/
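The BiSheng Compiler official account will continue to follow technical developments at PPoPP 2023 and bring you more in-depth technical coverage.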
