Function Overview
- Donau Portal provides a unified portal for you to manage software and hardware resources and submitted jobs in an HPC cluster, allowing more appropriate job scheduling and resource allocation. It also provides functions such as data management and remote visualization to ensure that the computing capabilities of the cluster are fully utilized.
- System Management
- Allows you to configure user synchronization based on user authentication to synchronize local users, NIS users, LDAP users, or AD users to the system, synchronize LDAP user groups, create and delete user groups, and assign permissions based on users and user groups. User synchronization requires manual operations. Automatic synchronization is not supported. User groups cannot be nested, and secondary permission assignment to users in user groups is not allowed.
- Job Management
- Allows you to submit common background jobs and VNC jobs, and to suspend, resume, terminate, and requeue offline jobs. Allows you to submit jobs to third-party schedulers. Allows you to view the historical job list, filter jobs by keyword, and refresh the job list. Allows you to view the job task list, filter tasks by keyword, and refresh the task list. Allows you to view basic and advanced information about jobs and tasks. Allows you to view the basic and advanced information about jobs and the CPU and memory usage of jobs, and to switch to the job directory.
OBS 2.0 Supported
- Allows you to upload one or more data files and supports resumable upload of large files (≤ 100 GB). Allows you to download a single file, displays files and folders in the way the File Explorer does, and implements permission isolation between users. Allows you to add, delete, modify, query, copy, paste, decompress, and compress files and folders, and supports online viewing of TXT files. Allows you to associate data files with simulation applications or job templates, and to open data files with a double-click. Supports multi-cluster data transfer.
- Integrates simulation applications into the system based on the execution scripts, displays the applications as forms externally, and supports intelligent template integration. Integrates remote Linux applications into the system based on the execution scripts and associates data files with application templates. Integrates remote Windows applications into the system based on the execution scripts. Allows you to test integrated simulation applications to ensure that the applications run properly based on the returned results. Allows you to publish, cancel publishing, save as, and delete integrated and remote applications.
- Allows you to use the WebUI to start applications deployed on the remote Linux/Windows server. Displays all active sessions in joined tables, allows you to manually disconnect created remote sessions, and allows idle sessions to be automatically released. Limits the number of sessions that can be used by a user. Allows you to select the visualization node using a random algorithm.
- Sends notifications when job statuses change. Displays job details based on the job ID in the pushed notifications. Receives job notifications from multiple heterogeneous clusters. Displays the data upload and download progress.
- Allows you to create monitoring items and dashboards that contain multiple monitoring items by editing configuration files. Monitors the usage of hardware resources (CPU, GPU, memory, temporary partitions, and swap partitions) and node resources. Monitors the number of packets sent and received by the CNP, PFC, and NIC, and the number of retransmitted packets due to packet loss. Monitors the numbers of running and submitted jobs in a cluster. Monitors the number of completed jobs in a cluster by status.
- Allows you to create multi-dimensional reports and dashboards that contain multiple charts by editing configuration files. Allows you to export reports and charts as data (CSV files) or snapshots (PNG files). Analyzes the usage of hardware resources (CPU, GPU, memory, temporary partitions, and swap partitions) and node resources. Analyzes the number of active jobs, memory, temporary partitions, swap partitions, running duration, CPU usage duration, CPU usage duration of the system, and CPU usage duration of user code from the dimensions of time, job status, cluster, user group, user, queue, and application. Analyzes the number of completed jobs, memory, running duration, CPU usage duration, and waiting duration from the dimensions of time, job status, cluster, user group, user, queue, and application.
- Allows you to view the pricing details of a tenant or account based on the user permissions, and export the details as data (CSV files) or snapshots (PNG files). Displays the pricing trend of the whole year, and allows you to create, edit, and delete pricing plans.
- Collects completed job data, cluster server data, cluster core data, and script data of Donau Scheduler.
OBS 2.0 Supported
- Donau Scheduler is a core component in the high-performance computing (HPC) software stack. It manages cluster resources and user jobs, and schedules computing tasks to proper cluster resources based on certain rules.
- System Management
- Allows you to configure user synchronization based on user authentication to synchronize local users, NIS users, LDAP users, or AD users to the system, and assign permissions based on users and user groups. Secondary permission assignment to users in a user group is not supported. Allows you to import the license and check the validity of the license. Allows you to manually activate or revoke a license, and obtain the revocation code for applying for a new license. Allows you to view the license usage information, displays a warning message before the license expires, and periodically displays a message when the license expires or is invalid. Allows you to configure the expiration time of the access token and refresh token, and check whether the refresh token is enabled.
- Job Management
- Job submission:
  - Allows you to submit common HPC job parameters, including the job name, user group, job execution duration, queue to which a user belongs, job execution command, job description, job scheduling time, number of job copies, environment variables, input and output redirection, account, log path redirection, and task execution path.
  - Allows you to set inter-job dependencies, job priorities, runtime resource restrictions, prehook/posthook, retry times upon job/task failures, and job/task timeout interval.
  - Allows you to specify compute node labels, resource requirements, and CPU and memory binding requirements when submitting jobs.
  - Supports common serial jobs, MPI jobs (including openmpi, hmpi, mpich, intelmp, and cosched jobs), array jobs, blocking jobs, and interactive jobs.
  - Allows you to submit a single job or jobs in batches using scripts.
- Job control:
  - Supports recovery/batch recovery, suspension/batch suspension, termination/batch termination, restart/batch restart, and re-submission/batch re-submission of jobs and tasks.
  - Allows you to submit remarks such as reasons for a control operation. The command can be used together with any control command.
- Job query:
  - Allows you to query the brief information, detailed information, specified fields, and custom fields of a job or task, and query the detailed reason why a task is pending.
  - Allows you to filter jobs by job ID, job name, user name, user group, queue, account, time, and status.
  - Allows you to filter tasks by index, execution node, and status.
  - Allows you to query the brief and detailed information, specified fields, and custom fields of jobs and tasks at the same time.
  - Allows you to filter jobs and tasks by job ID, user name, user group, queue, account, time, execution node, and status.
  - Allows you to query a large number of jobs or tasks by page or query information about jobs or tasks with the specified page or quantity, as well as query help information. The output can be in long/wide format or JSON format.
- Job scheduling:
  - Supports scheduling policies such as FIFO, Gang Scheduler, Fairshare, and preemption, as well as CPU-memory affinity scheduling.
  - Supports multi-dimensional scheduling of resources, such as Kunpeng, Ascend, and x86 servers, CPUs, memory, GPUs, Ascend acceleration cards, and custom resources.
  - Supports multiple resource pools, dynamically allocates resources such as CPUs, memory, GPUs, or nodes based on loads, and borrows resources across resource pools.
  - Selects candidate nodes randomly or based on the registration sequence, maximum available resources, or specified label.
  - Allows you to set the scheduling limit based on the maximum number of jobs that can be run in a cluster, the maximum number of scheduling times of a queue, account, or user, and the maximum number of pending jobs of a user.
  - Allows you to configure the brief and detailed pending reason for scheduling.
  - Allows you to enable or disable a scheduling phase, and configure the scheduling phases for the purpose of achieving high scheduling efficiency, ensuring fair resource allocation, or delivering the maximum number of jobs.
- Job execution:
  - Supports startup of serial, parallel, MPI, array, blocking, and interactive jobs.
  - Allows you to specify the report period, collect task resource consumption information based on the cgroup, collect information such as the job running duration, CPU usage duration, average memory, and peak memory, and run daemon processes for jobs.
  - Allows you to set the memory limit based on the cgroup and allows jobs to inherit the system limit.
  - Allows you to configure the prehook/posthook that take effect globally, the number of retry times, execution timeout interval, and the blocklist of nodes where prehook failures occur.
  - Supports the prehook/posthook at the job level in MPI jobs and the prehook/posthook at the task level in non-MPI jobs.
  - Outputs basic job information and running data to the specified output file or the prehook phase, and reports the job result with standard error information.
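The difference between the FIFO and Fairshare policies listed under job scheduling can be illustrated with a small sketch. This is not the Donau Scheduler implementation: the `Job` record, the `usage` and `shares` dictionaries, and the usage-to-share ratio used for ranking are all assumptions chosen only to show the general idea.

```python
from collections import namedtuple

# Hypothetical job record; the field names are illustrative only.
Job = namedtuple("Job", "job_id user submit_time cores")


def fifo_order(jobs):
    """FIFO: jobs are dispatched strictly in submission order."""
    return sorted(jobs, key=lambda j: (j.submit_time, j.job_id))


def fairshare_order(jobs, usage, shares):
    """Simplified fair share: always serve the user with the lowest
    usage/share ratio, taking that user's oldest pending job."""
    usage = dict(usage)          # work on a copy of the caller's data
    pending = fifo_order(jobs)
    ordered = []
    while pending:
        users = {j.user for j in pending}
        # Ties are broken by user name so the result is deterministic.
        user = min(users, key=lambda u: (usage.get(u, 0) / shares.get(u, 1), u))
        job = next(j for j in pending if j.user == user)
        pending.remove(job)
        ordered.append(job)
        # Charge the dispatched cores to the user's accumulated usage.
        usage[user] = usage.get(user, 0) + job.cores
    return ordered


if __name__ == "__main__":
    jobs = [
        Job(1, "alice", 0, 4),
        Job(2, "alice", 1, 4),
        Job(3, "bob", 2, 4),
    ]
    print([j.job_id for j in fifo_order(jobs)])
    print([j.job_id for j in fairshare_order(jobs, {}, {"alice": 1, "bob": 1})])
```

With equal shares, FIFO dispatches the jobs as 1, 2, 3, while the fair-share sketch interleaves the two users and dispatches them as 1, 3, 2.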
- Cluster Management
- Cluster resource information management:
  - Allows you to query the brief and detailed information about a queue, and customize the queue.
  - Allows you to query the brief and detailed information about a node, including the name, total CPUs, available CPUs, CPU topology, total memory, available memory, swap, tmp, average CPU load/minute, GPUs, node storage information, number of jobs on the node, labels, and SDRs.
  - Allows you to enable or disable a node, set the reason for enabling or disabling a node, define the memory usage threshold of a node, customize resources based on node configurations, add, delete, modify, and query node labels, remove nodes, and manually or automatically suspend or resume idle nodes.
  - Allows you to query job statistics, customize user configurations, make user synchronization configurations take effect in real time, dynamically assign job, node, and queue management permissions to users, view multi-dimensional statistics of the entire cluster, view brief and detailed account information, and customize account configurations.
  - Supports online configuration of account, resource pool, and resource allocation policies of resource pools.
- Job data lifecycle management:
  - Allows you to configure the maximum storage duration of real-time data, historical data, and archived data.
  - Allows you to define storage locations and data types of data files.
OBS 2.0 Supported
- Hyper MPI is developed based on Open MPI and the Open UCX P2P communication framework. It integrates the UCX COLL and UCG frameworks for collective communication, and implements acceleration algorithms for collective operations based on the integrated frameworks. Hyper MPI features ultimate performance, massive processing capability, and portability. It applies to manufacturing, meteorology, and government HPC scenarios. Hyper MPI helps build an HPC ecosystem based on Huawei-developed Kunpeng servers in the long term.
- MPI_Allreduce
- MPI_Allreduce is an MPI group reduction function. Allreduce performs a mathematical operation (for example, addition or multiplication) or logical operation (for example, AND or OR) on the send buffer of each independent process, and then synchronizes the result to the receive buffer of all processes in the communication domain. The following algorithms are supported:
  - Algorithm 1: Recursive doubling
  - Algorithm 2: Node-aware Recursive + Binomial (intra)
  - Algorithm 3: Socket-aware Recursive + Binomial (intra)
  - Algorithm 4: Ring, accelerating AllReduce large-packet operations
  - Algorithm 5: Node-aware Recursive + K-nomial (intra)
  - Algorithm 6: Socket-aware Recursive + K-nomial (intra)
  - Algorithm 7: Node-aware K-nomial
  - Algorithm 8: Socket-aware K-nomial
  - Algorithm 11: Node-aware Parallel
  - Algorithm 12: basic Rabenseifner algorithm
  - Algorithm 13: Node-aware Rabenseifner
  - Algorithm 14: Socket-aware Rabenseifner

  Only algorithm 1 supports discontinuous data structures and noncommutative operations.
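As an illustration of the recursive doubling idea named in algorithm 1, the following sketch simulates the exchange pattern in plain Python, without MPI. It is a textbook form of the technique, not the Hyper MPI implementation. The function name `allreduce_recursive_doubling` is hypothetical, and the sketch assumes a power-of-two number of ranks.

```python
from operator import add


def allreduce_recursive_doubling(values, op=add):
    """Simulate an allreduce over len(values) ranks.

    values[i] is the send buffer of rank i; the returned list holds the
    receive buffer of every rank after the reduction.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    buf = list(values)
    distance = 1
    while distance < n:
        nxt = []
        for rank in range(n):
            # Partner rank differs from this rank in exactly one bit.
            partner = rank ^ distance
            lo, hi = min(rank, partner), max(rank, partner)
            # Combine lower-rank data first so both partners compute the
            # same result even if op is not commutative.
            nxt.append(op(buf[lo], buf[hi]))
        buf = nxt
        distance *= 2  # log2(n) exchange steps in total
    return buf


if __name__ == "__main__":
    print(allreduce_recursive_doubling([1, 2, 3, 4]))
    print(allreduce_recursive_doubling(["a", "b", "c", "d"]))
```

After log2(n) exchange steps, every rank holds the full result: `[10, 10, 10, 10]` for the sum, and `['abcd', 'abcd', 'abcd', 'abcd']` for string concatenation, which is associative but not commutative.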
- MPI_Barrier
- MPI_Barrier is an MPI synchronization function. It synchronizes all processes in the communication domain: no process returns from the call until every process in the domain has invoked it. The following algorithms are supported:
  - Algorithm 1: Recursive doubling, which accelerates Barrier collective operations
  - Algorithm 2: Node-aware Recursive + Binomial (intra)
  - Algorithm 3: Socket-aware Recursive + Binomial (intra)
  - Algorithm 4: Node-aware Recursive + K-nomial (intra)
  - Algorithm 5: Socket-aware Recursive + K-nomial (intra)
  - Algorithm 6: Node-aware K-nomial
  - Algorithm 7: Socket-aware K-nomial
  - Algorithm 10: Node-aware Parallel

  Algorithms 3, 4, 5, 6, and 7 do not support PPN imbalance. Algorithms 6 and 10 do not support discontinuous ranks.
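The barrier semantics can be demonstrated without MPI. The sketch below uses `threading.Barrier` from the Python standard library as a stand-in for MPI_Barrier, with threads playing the role of MPI processes. It shows only the synchronization guarantee, not any of the algorithms listed above, and the function name `run_barrier_demo` is hypothetical.

```python
import threading


def run_barrier_demo(nthreads=4):
    """Return the ordered list of (phase, rank) events recorded by the workers."""
    barrier = threading.Barrier(nthreads)
    events = []
    lock = threading.Lock()

    def worker(rank):
        with lock:
            events.append(("before", rank))
        # No thread passes this point until all nthreads have reached it.
        barrier.wait()
        with lock:
            events.append(("after", rank))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for phase, rank in run_barrier_demo(4):
        print(phase, rank)
```

The order of ranks within each phase can vary between runs, but every "before" event is always recorded ahead of every "after" event.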
- MPI_Bcast
- MPI_Bcast is an MPI broadcast operation function. The root process sends data in the buffer to all the other processes in the communication domain so that all processes obtain the same data. The following algorithms are supported:
  - Algorithm 2: Topo-aware Binomial tree
  - Algorithm 3: Topo-aware K-nomial tree
  - Algorithm 4: Topo-aware K-nomial tree + Binomial tree (intra)

  The preceding algorithms support discontinuous data structures. Algorithms 3 and 4 do not support PPN imbalance and discontinuous ranks.
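To show how a binomial tree spreads data from the root, the following sketch computes a send schedule and replays it in plain Python. It is a generic, topology-unaware binomial tree, so it does not reflect the topo-aware variants that Hyper MPI provides. The function names `binomial_bcast_schedule` and `simulate_bcast` are hypothetical.

```python
def binomial_bcast_schedule(nprocs, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver) ranks."""
    rounds = []
    have = 1  # number of root-relative ranks that already hold the data
    while have < nprocs:
        sends = []
        for rel in range(have):
            dst = rel + have
            if dst < nprocs:
                # Translate root-relative ranks back to real ranks.
                sends.append(((rel + root) % nprocs, (dst + root) % nprocs))
        rounds.append(sends)
        have *= 2  # the set of data holders doubles every round
    return rounds


def simulate_bcast(data, nprocs, root=0):
    """Replay the schedule and return the buffer held by every rank."""
    buffers = {root: data}
    for sends in binomial_bcast_schedule(nprocs, root):
        for src, dst in sends:
            if src not in buffers:
                raise RuntimeError("sender has no data yet")
            buffers[dst] = buffers[src]
    return [buffers[rank] for rank in range(nprocs)]


if __name__ == "__main__":
    for number, sends in enumerate(binomial_bcast_schedule(8), start=1):
        print("round", number, sends)
    print(simulate_bcast("payload", 6, root=2))
```

For 8 processes and root 0, the schedule takes three rounds: `[(0, 1)]`, then `[(0, 2), (1, 3)]`, then `[(0, 4), (1, 5), (2, 6), (3, 7)]`.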
