
Background

High-performance computing (HPC) uses parallel computing systems to process large volumes of data and compute-intensive tasks that ordinary computers cannot handle. It splits a large-scale computing task into subtasks, distributes the subtasks to servers for parallel execution, and then aggregates the partial results to obtain the final result, delivering powerful computing capability.
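The split/distribute/aggregate pattern described above can be sketched in miniature with Python's process pool standing in for a cluster of compute nodes. The task, the chunking strategy, and the worker function here are illustrative assumptions, not part of any HPC 22.0 API.

```python
from multiprocessing import Pool


def partial_sum(chunk):
    """Worker: compute one subtask (here, a sum of squares)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    # 1. Split the large task into roughly equal chunks.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # 2. Distribute the chunks to workers for parallel computing.
    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)
    # 3. Aggregate the partial results into the final result.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

A real HPC cluster follows the same three steps, but the "workers" are separate servers and the distribution and aggregation happen over a high-speed interconnect, typically via a scheduler and an MPI library.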

The HPC solution has been widely used in industries such as government, manufacturing, meteorology, and electronic design automation (EDA), but it still faces the following challenges:

  • Complex command interaction
    • Most HPC cluster users are not IT professionals and are not familiar with job and database management commands.
    • The resource management dimensions differ from conventional perspectives, so users must learn new concepts before using the system.
    • The resource management configuration is complex, hard to understand, and thus error-prone.
  • Complex industrial applications

    HPC solutions are applied across a wide variety of industries. Multiple types of software coexist and must be upgraded continuously.

  • Data security

    Data is transmitted frequently, so both transmission efficiency and data security must be ensured.

  • Exception monitoring

    Cluster exceptions must be detected in a timely manner, multiple clusters must be managed in a unified way, and abnormal user behavior must be kept in check.

  • Discontinuous workflow

    Traditional HPC systems do not fit users' existing workflows. Users must log in to different terminals to process and transmit data, and no unified portal integrates the simulation service flows.

  • Low resource utilization

    Users work only on their own desktop workstations. This isolated working mode makes collaboration difficult and may cause low resource utilization.

  • Small-scale clusters

    The clusters are small, so multiple clusters are required to support services. This increases the O&M workload, and small clusters cannot run large-scale MPI jobs.

  • Low cluster throughput and low resource utilization

    Cluster throughput decreases as the cluster scale increases. The architecture of existing open-source and commercial schedulers is outdated and cannot support new applications. As a result, each cluster supports only one service, further reducing resource utilization.

  • Slow message passing interface (MPI) performance

    Low MPI communication performance becomes a key bottleneck of cluster computing.

HPC 22.0 addresses the preceding challenges. By integrating three Huawei-developed applications, Donau Portal (an offline unified cluster management and scheduling platform), Donau Scheduler (a cluster management scheduler), and Hyper MPI (a high-performance communication library), HPC 22.0 builds the core software system of HPC clusters. It enhances computing performance, enables intelligent cluster management and scheduling, and optimizes communication interface performance.