Security Check and Hardening
Security checks and hardening protect the system and network against threats such as hacker attacks, data leakage, and system breakdown. They also help meet regulatory compliance requirements and protect user privacy and information security.
Routine Antivirus Software Check
Periodically scan clusters and Spark components for viruses. This protects clusters from viruses, malicious code, spyware, and other malicious programs, reducing risks such as system breakdown and information leakage. Mainstream antivirus software is recommended for the scans.
Vulnerability Fixing
To ensure the security of the production environment and reduce the risk of attacks, enable the firewall and periodically fix the following vulnerabilities:
- OS vulnerabilities
- JDK vulnerabilities
- Hadoop and Spark vulnerabilities
- ZooKeeper vulnerabilities
- Kerberos vulnerabilities
- OpenSSL vulnerabilities
- Vulnerabilities in other components
The following uses CVE-2021-37137 as an example.
Vulnerability description:
In Netty 4.1.17, the Snappy frame decoder does not restrict the chunk length, so crafted input can cause excessive memory usage. The vulnerability ID is CVE-2021-37137.
In decoupled storage-compute scenarios, the system uses the hdfs-ceph service (version 3.2.0) for object storage. This service depends on aws-java-sdk-bundle-1.11.375.jar, which is affected by this vulnerability. You are advised to apply the vulnerability patch promptly to prevent hacker attacks.
Impact:
Netty versions earlier than 4.1.68
Handling suggestion:
The vendor has released a patch that fixes the vulnerability. For details, visit GitHub.
SSH Hardening
Installation and deployment require an SSH connection to the server. Because the root user has all operation permissions, logging in to the server as the root user poses security risks. You are advised to perform installation and deployment as a common user and to disable root user login using SSH to improve system security. The procedure is as follows:
Check the PermitRootLogin configuration item in /etc/ssh/sshd_config.
- If the value is no, root user login using SSH is disabled.
- If the value is yes, change it to no and restart the sshd service for the change to take effect.
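The check above can be scripted. The following is a minimal sketch, not part of the product: the function name permit_root_login is hypothetical, and the parser is deliberately simplified (it reads only the global section before the first Match block and does not follow Include directives). It relies on the sshd rule that the first value found for a keyword is the one used.

```python
from typing import Optional


def permit_root_login(config_text: str) -> Optional[str]:
    """Return the PermitRootLogin value from sshd_config text, lower-cased.

    Returns None if the keyword is not set in the global section; the
    effective default then depends on the OpenSSH version in use.
    """
    for raw_line in config_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # Keyword and value may be separated by whitespace or by "=".
        parts = line.replace("=", " ").split()
        keyword = parts[0].lower()
        if keyword == "match":
            # Simplification: stop at the first conditional block.
            break
        if keyword == "permitrootlogin" and len(parts) > 1:
            # sshd uses the first value it finds for a keyword.
            return parts[1].lower()
    return None
```

To use it, pass the contents of /etc/ssh/sshd_config to the function. A result other than "no" means that root login using SSH is not fully disabled.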
Notification of Data Disclosure Risks
The security configuration items ock.ucache.rpc.enableAuthentication, ock.ucache.rpc.enableTLS, and ock.ucache.rpc.enableAuthorization in the ock.conf file and the security configuration switch of ZooKeeper can be disabled. However, disabling authentication and transmission encryption exposes the system to spoofing and information leakage risks.
Configuring Address Randomization and Kernel Address Space Protection in Compilation Options
To protect memory addresses while programs are running, you are advised to enable address randomization by setting randomize_va_space to 2, for example, by running the echo 2 > /proc/sys/kernel/randomize_va_space command. You are also advised to enable kernel address space protection by using kernel address space layout randomization (KASLR), PaX, Supervisor Mode Access Prevention (SMAP), or Supervisor Mode Execution Prevention (SMEP) in compilation options.
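The current randomization setting can be verified with a short script. The following is a minimal sketch; the function names are hypothetical. The value meanings follow the Linux kernel documentation: 0 disables randomization, 1 randomizes the stack, mmap base, and VDSO, and 2 additionally randomizes the heap.

```python
from pathlib import Path

# Meaning of each value of kernel.randomize_va_space.
ASLR_MODES = {
    "0": "disabled",
    "1": "partial (stack, mmap base, and VDSO)",
    "2": "full (partial plus heap)",
}


def describe_aslr(raw_value: str) -> str:
    """Map the raw content of randomize_va_space to a description."""
    value = raw_value.strip()
    if value not in ASLR_MODES:
        raise ValueError(f"unexpected randomize_va_space value: {value!r}")
    return ASLR_MODES[value]


def read_aslr(path: str = "/proc/sys/kernel/randomize_va_space") -> str:
    """Read the current setting from procfs (Linux only)."""
    return Path(path).read_text()
```

Calling describe_aslr(read_aslr()) on a hardened node is expected to report the full mode. Note that a value written with echo does not survive a reboot; a persistent setting is normally made through the kernel.randomize_va_space sysctl key.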
Updating Keys
The OmniShuffle service must be restarted after keys are updated, so plan the key update period to limit service interruption.
Use kmc_tool to periodically update keys.
- --updateMK: updates all master keys.
Command format: ./kmc_tool all --updateMK
- Update the keytab and whitelist.
Importing a CRL
After a CRL file is generated, you can specify its path in the configuration file to complete the configuration. After the configuration is complete, restart the OCKD process for the CRL to take effect.

Restricting Access from IP Addresses Outside the Cluster
To prevent DoS attacks from outside the cluster, you are advised to configure the cluster firewall to restrict access from IP addresses outside the cluster.
A big data cluster environment typically has multiple NICs, which are used for the service network (for high-bandwidth data transmission) and the management network (for cluster management with lower bandwidth). You are advised to bind the OmniShuffle listening ports to the service network. In addition, configure the firewall so that the service network of each node receives only packets from the network segments in the cluster. This defends against DoS attacks from outside the cluster.
This document uses a typical networking of one primary node and three compute nodes (secondary 01, secondary 02, and secondary 03) as an example. Each node has two NICs: 10GE NIC A (management network segment 90.90.1.*) and 100GE NIC B (service network segment 192.168.1.*). In this case, take the following measures to mitigate DoS attacks from outside the service cluster.
- Ensure that the OmniShuffle network communication is implemented through NIC B on each node.
Set ock.ucache.rpc.transport.devices in the ock.conf file to the device name of NIC B.
- Configure the following firewall policy for each node.
Set iptables or ACL rules so that NIC B (the service network interface) of each node accepts only packets whose source address is in the service network segment 192.168.1.*.
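The firewall policy above can be sketched as a pair of iptables rules generated per node. This is an illustration under stated assumptions, not a verified production policy: the device name eth1 stands in for the actual name of NIC B, the segment 192.168.1.* is written as 192.168.1.0/24, and the function names are hypothetical. The script only builds the command strings; it does not apply them.

```python
import ipaddress
from typing import List


def service_nic_rules(device: str, cluster_cidr: str) -> List[str]:
    """Build iptables commands that admit only in-cluster sources on a NIC.

    Rules are evaluated in order, so the ACCEPT rule must precede the DROP.
    """
    # Validate the network segment before it is embedded in a command.
    network = ipaddress.ip_network(cluster_cidr)
    return [
        f"iptables -A INPUT -i {device} -s {network} -j ACCEPT",
        f"iptables -A INPUT -i {device} -j DROP",
    ]


def is_cluster_source(address: str, cluster_cidr: str) -> bool:
    """Return True if a source address belongs to the cluster segment."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cluster_cidr)


if __name__ == "__main__":
    # "eth1" is a placeholder for the device name of NIC B.
    for rule in service_nic_rules("eth1", "192.168.1.0/24"):
        print(rule)
```

Review the generated commands against any existing rules on the node before running them as a privileged user, because appended rules take effect only after the rules already in the chain.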
Configuring a Kerberos Authentication Ticket
Authentication for both the OmniShuffle service and ZooKeeper is implemented through Kerberos. To prevent spoofing caused by replay attacks in Kerberos authentication, you are advised to set the validity period of the identity authentication ticket to the shortest value that your services can tolerate.
Recommended Environment Variable Configuration
In Recommended Configuration of HCOM Environment Variables, environment variables such as UCX_TCP_TX_MAX_BUFS, UCX_TCP_RX_MAX_BUFS, UCX_RC_VERBS_TX_MAX_BUFS, UCX_RC_VERBS_RX_MAX_BUFS, UCX_RC_MLX5_TX_MAX_BUFS, and UCX_RC_MLX5_RX_MAX_BUFS default to -1, which means that there is no upper limit on the memory used by the underlying communication library UCX. To prevent service unavailability caused by excessive memory usage, you are advised to set the maximum number of buffers to 131072. With a single buffer size of 8 KB, this limits the buffer pool to 1 GB (131072 x 8 KB). Adjust the values based on the memory configuration and service traffic of your server.
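The sizing arithmetic can be reproduced for other memory budgets. The following is a minimal sketch; the function names are hypothetical, and the 8 KB buffer size is the example value used in this section, which may differ from the buffer size in your deployment.

```python
from typing import List

# Environment variables named in this section.
UCX_BUF_VARS = [
    "UCX_TCP_TX_MAX_BUFS",
    "UCX_TCP_RX_MAX_BUFS",
    "UCX_RC_VERBS_TX_MAX_BUFS",
    "UCX_RC_VERBS_RX_MAX_BUFS",
    "UCX_RC_MLX5_TX_MAX_BUFS",
    "UCX_RC_MLX5_RX_MAX_BUFS",
]

BUF_SIZE_BYTES = 8 * 1024  # Assumed single buffer size (8 KB).


def pool_bytes(max_bufs: int, buf_size: int = BUF_SIZE_BYTES) -> int:
    """Upper bound on the size of one buffer pool, in bytes."""
    return max_bufs * buf_size


def max_bufs_for_budget(budget_bytes: int, buf_size: int = BUF_SIZE_BYTES) -> int:
    """Largest buffer count whose pool fits within a memory budget."""
    return budget_bytes // buf_size


def export_lines(max_bufs: int = 131072) -> List[str]:
    """Shell export statements that apply one limit to every variable."""
    return [f"export {name}={max_bufs}" for name in UCX_BUF_VARS]
```

For example, pool_bytes(131072) evaluates to 1073741824 bytes (1 GB), and max_bufs_for_budget(512 * 1024 * 1024) gives 65536 for a 512 MB budget.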