Government HPC

Scenarios

A supercomputer is among the most advanced information technology systems a country can build, and its capabilities reflect national technological strength. Supercomputing provides essential technical support for scientific research: computers with powerful computing, communication, and data processing capabilities are used to process data and online transactions, provide information services, and perform scientific and engineering computing. Supercomputing plays an important role in scientific and engineering computing across multiple disciplines, including new material design, nanostructure research, global climate change research, industrial engineering design, aerospace vehicle manufacturing, and AI research.

Provinces and cities in China that intend to stand out in the technology competition are endeavoring to build national or regional government HPC centers. In the short run, a government HPC center provides scenario-based HPC and AI capabilities to boost the regional economy, industry upgrades, scientific research, and urban services. In the long run, it provides industry-leading services, such as HPC, AI, and big data processing, to streamline the value chain and enhance collaboration across an ecosystem of shared success. The government HPC center is also a new driver of the digital economy and the foundation for City, Industrial, and Scientific Intelligent Twins.

At the application layer, the government HPC center provides resources and tool services to support diversified businesses and customers. Most of these are large-load workloads from scientific research, government, and security-sensitive industries. Resources need to be flexibly scheduled based on load to reduce power consumption and improve resource utilization. The government HPC center applies to a broad array of fields, including quantum chemistry, molecular simulation, weather forecasting and research, oil and gas exploration, hydrodynamics, structural mechanics, and nuclear reaction simulation. To support fast-developing scientific research, the economy, and national defense, better HPC infrastructure and environments are needed. Demand for HPC applications is growing exponentially, and HPC has been applied to an ever wider range of fields. In particular, HPC is essential for many emerging sectors, such as resources and the environment, aerospace, new materials, new energy, healthcare, finance, the Internet, and cultural industries.

Challenges

  • Diversified computing resources

    As HPC center applications are of various types and deal with complex requirements, CPU+GPU heterogeneous computing is needed to meet different requirements. In this case, storage with higher bandwidth and lower latency is required to fully unleash the computing potential. In addition, the HPC system needs to adapt to different performance models to cope with diversified service loads, for example, supporting workloads limited by memory bandwidth as well as those limited by operations per second (OPS).
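The bandwidth-versus-OPS distinction above can be illustrated with the roofline model, which caps a kernel's attainable performance by either the compute peak or the memory traffic limit. The peak figures below are illustrative assumptions, not measurements of any specific system.

```python
# Sketch: classifying workloads as bandwidth-bound or compute-bound
# using the roofline model. PEAK values are assumed, illustrative figures.

PEAK_FLOPS = 2.0e12      # assumed node compute peak, FLOP/s
PEAK_BANDWIDTH = 2.0e11  # assumed memory bandwidth, bytes/s

def attainable_flops(arithmetic_intensity):
    """Roofline model: performance is capped by compute or by memory
    traffic, whichever limit is hit first.
    arithmetic_intensity: FLOPs performed per byte of data moved."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * arithmetic_intensity)

def bound_by(arithmetic_intensity):
    """Name the limiting resource for a given arithmetic intensity."""
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where the caps cross
    return "compute" if arithmetic_intensity >= ridge else "bandwidth"
```

A kernel with low arithmetic intensity (e.g., sparse solvers) lands on the bandwidth roof, while dense linear algebra lands on the compute roof, which is why a heterogeneous system must provision both resources.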

  • Mainstream networks: IB/OPA

    Not all application software requires high network bandwidth. However, a large amount of application software that performs large-scale parallel computing still needs an InfiniBand (IB) or Omni-Path (OPA) network, which combines high bandwidth with low latency to boost application performance and scalability. IB and OPA networks have become increasingly essential for many HPC applications to handle the growing inter-node communication pressure caused by multi-core processors that deliver stronger single-node computing performance. In addition, high-performance clusters access data over the network to meet the requirements of the shared file system, so high-bandwidth IB and OPA networks will drive rapid growth in data access performance.
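Why latency matters as much as bandwidth can be sketched with the classic Hockney point-to-point model, T = α + n/β (startup latency plus size over bandwidth). The latency and bandwidth figures below are illustrative assumptions, not vendor specifications.

```python
# Sketch: Hockney model for one message transfer, T = alpha + n / beta.
# Parameter values are assumed, illustrative figures only.

def transfer_time(n_bytes, latency_s, bandwidth_Bps):
    """Estimated time to move one message of n_bytes over a link."""
    return latency_s + n_bytes / bandwidth_Bps

# Assumed figures: ~1 us latency on a 100 Gbit/s fabric (IB-class)
# vs ~50 us latency on a 10 Gbit/s Ethernet link.
def ib_time(n_bytes):
    return transfer_time(n_bytes, 1e-6, 12.5e9)

def eth_time(n_bytes):
    return transfer_time(n_bytes, 50e-6, 1.25e9)
```

For the small messages typical of tightly coupled parallel codes, the startup latency term dominates, so the low-latency fabric wins by far more than the 10x bandwidth ratio alone would suggest.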

  • Parallelization and tiering of file systems

    The HPC center runs a large number of applications with strong computing demands. In addition to some I/O-intensive applications, the concurrent reads and writes of massive numbers of tasks also place a heavy load on the shared file system. To handle massive data volumes and unified image files, the HPC center needs a single partition holding massive numbers of files. The common solution today is a parallel file system, which uses software to unify multiple storage spaces into a single partition with concurrent read/write access. This approach removes the bottleneck of any single hardware device and greatly improves scalability and performance.
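How a parallel file system spreads one file's load across devices can be sketched with round-robin striping, the placement scheme used by Lustre-style systems. The stripe size and target count below are hypothetical parameters chosen for illustration.

```python
# Sketch: round-robin striping of a file across storage targets, the
# basic mechanism by which a parallel file system lets many clients
# read/write one file concurrently. Parameters are illustrative.

def stripe_map(file_size, stripe_size, n_targets):
    """Return, for each stripe-sized chunk of the file, the index of
    the storage target that holds it (round-robin placement)."""
    n_chunks = -(-file_size // stripe_size)  # ceiling division
    return [chunk % n_targets for chunk in range(n_chunks)]
```

Because consecutive chunks land on different targets, a large sequential read is served by all targets in parallel, which is where the aggregate-bandwidth gain comes from.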

    In terms of data security, permissions vary across a large number of users at different levels, so data security levels also vary among users. In addition, the type of files accessed varies with the application: some applications work with large files, others with large numbers of small files. Therefore, tiered storage is recommended in the solution design.
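A tiering policy of the kind described can be sketched as a simple placement rule keyed on file size and access recency. The tier names and thresholds below are assumptions for illustration, not part of any specific product.

```python
# Sketch: placing a file on a storage tier by size and access recency.
# Tier names and thresholds are assumed, illustrative values.

def choose_tier(size_bytes, days_since_access):
    """Pick a storage tier for a file (hypothetical policy)."""
    if days_since_access > 90:
        return "archive"         # cold data goes to the capacity tier
    if size_bytes < 1 << 20:     # under 1 MiB: metadata-heavy small files
        return "flash"           # low-latency tier for small-file IOPS
    return "parallel-fs"         # large streaming files on the parallel FS
```

The point of the rule is that small-file workloads stress metadata and latency (favoring flash) while large files stress bandwidth (favoring the parallel file system), so no single tier serves both well.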

  • Refined management and scheduling systems

    The HPC center has a large number of users with different permissions, which makes user management difficult. The O&M team of the HPC center mainly serves these users and must strictly manage user rights and accounting. Therefore, the HPC center demands high-performance management and scheduling software that supports a wide range of functions, such as flexible allocation of policies and permissions, job accounting, job preemption, user login restriction, alarm reporting, and quick system recovery. In addition, rules and regulations need to be formulated to standardize the application for, use of, and allocation of resources.
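The job-accounting function mentioned above can be sketched as aggregating core-hours per user from completed job records, which is the basis on which schedulers charge and enforce quotas. The record format is a simplifying assumption.

```python
# Sketch: per-user core-hour accounting, as an HPC scheduler's
# accounting module might compute it. The (user, cores, hours)
# record layout is an assumption for illustration.

from collections import defaultdict

def core_hours(jobs):
    """jobs: iterable of (user, cores, runtime_hours) records.
    Returns the total core-hours charged to each user."""
    usage = defaultdict(float)
    for user, cores, hours in jobs:
        usage[user] += cores * hours
    return dict(usage)
```

Totals like these feed quota enforcement and preemption decisions: a user over quota can be restricted or have low-priority jobs preempted.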

  • Growing requirements for energy-efficient HPC centers

    Due to its large scale, an HPC center typically consumes tens of thousands or even hundreds of thousands of kWh of power a year, so electricity expenses remain high. Improving energy efficiency therefore greatly reduces device energy consumption and the resulting O&M costs. Low-power processors, energy-saving software, and infrastructure with high cooling efficiency (such as water-cooled units or closed cooling cabinets) are used to reduce energy consumption. In addition, as liquid cooling technology has matured in recent years, more and more users have started to adopt liquid-cooled servers.
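The standard metric for the energy efficiency discussed above is Power Usage Effectiveness (PUE): total facility energy divided by the energy delivered to IT equipment, with 1.0 as the ideal. The figures in the example are illustrative.

```python
# Sketch: computing Power Usage Effectiveness (PUE), the standard
# data-center efficiency metric. Example figures are illustrative.

def pue(total_facility_kwh, it_equipment_kwh):
    """PUE = total facility energy / IT equipment energy; 1.0 is ideal.
    Everything above 1.0 is overhead (cooling, power distribution)."""
    return total_facility_kwh / it_equipment_kwh
```

Efficient cooling, such as liquid cooling, lowers the non-IT share of facility energy and thus pushes PUE closer to 1.0.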

  • Increasing attention to data security

    A large amount of user data is stored in a high-performance cluster, so user data security requires close attention. In particular, to deal with the increasing number of malicious attacks on clusters in recent years, security protection has become an important part of HPC system R&D. Data security involves the following aspects:

    • The system may be remotely attacked by hackers, or data may be stolen by other users or through the loss of user names and passwords. The solution is to use a firewall, an encrypted file system, and system login based on encrypted authentication.
    • System data may be lost due to device faults, earthquakes, or fires. Data backup is the solution to this issue.
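One small but concrete piece of the backup practice above is verifying that a backup copy still matches its source, for example with SHA-256 checksums. This is a minimal sketch of that check, not a full backup system.

```python
# Sketch: verifying a backup copy against its source with SHA-256
# checksums, a simple safeguard against silent corruption or tampering.

import hashlib

def sha256_digest(data: bytes) -> str:
    """Hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def backup_is_intact(source: bytes, backup: bytes) -> bool:
    """True if the backup's checksum matches the source's."""
    return sha256_digest(source) == sha256_digest(backup)
```

In practice the source digest is recorded at backup time and compared on restore, so corruption is detected before the bad copy is trusted.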