Ceph Tuning
Modifying Ceph Configuration
- Purpose
Adjust the Ceph configuration to make full use of system resources.
- Procedure
You can modify any Ceph configuration parameter by editing the /etc/ceph/ceph.conf file. For example, you can add osd_pool_default_size = 4 to /etc/ceph/ceph.conf to change the default number of data copies to 4, and then run the systemctl restart ceph.target command to restart the Ceph daemons so that the change takes effect.
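In ceph.conf, that edit would look like the following sketch (the value 4 is only illustrative, and the restart command applies to the local node only):

    # /etc/ceph/ceph.conf (excerpt)
    [global]
    osd_pool_default_size = 4

    # Restart the Ceph daemons on this node so the change takes effect
    systemctl restart ceph.target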
The preceding operation takes effect only on the current Ceph node. To apply a change to the entire Ceph cluster, you must modify the ceph.conf file on every Ceph node and restart the Ceph daemons on each of them. Table 1 describes the Ceph optimization items.
Table 1 Ceph parameter configuration

[global]
- cluster_network: Network segment, separate from the public network, used for OSD replication and data balancing to relieve pressure on the public network. Recommended value: 192.168.4.0/24. You can set this parameter as required, as long as it differs from the public network segment.
- public_network: Public network segment used for client and MON traffic. Recommended value: 192.168.3.0/24. You can set this parameter as required, as long as it differs from the cluster network segment.
- osd_pool_default_size: Default number of data copies (replicas) for a pool. Recommended value: 3.
- osd_memory_target: Amount of memory that each OSD process is allowed to consume. Recommended value: 4294967296 (4 GB).
For details about how to optimize other parameters, see Table 2.
Table 2 Other parameter configuration

[global]
- osd_pool_default_min_size: Minimum number of replicas required for a PG to accept I/O. A degraded PG that still has at least this many copies can continue to serve I/O. Default value: 0. Recommended value: 1.
- cluster_network: Network segment, separate from the public network, used for OSD replication and data balancing to relieve pressure on the public network. Default value: / (not set). Recommended value: 192.168.4.0/24.
- osd_memory_target: Amount of memory that each OSD process is allowed to consume. Default value: 4294967296. Recommended value: 4294967296.

[mon]
- mon_clock_drift_allowed: Clock drift (in seconds) allowed between MONs. Default value: 0.05. Recommended value: 1.
- mon_osd_min_down_reporters: Minimum number of OSDs that must report a peer OSD as down before the MONs mark it down. Default value: 2. Recommended value: 13.
- mon_osd_down_out_interval: Number of seconds Ceph waits after an OSD goes down before marking it out. Default value: 600. Recommended value: 600.

[osd]
- osd_journal_size: OSD journal size (in MB). Default value: 5120. Recommended value: 20000.
- osd_max_write_size: Maximum size (in MB) of data that an OSD can write at a time. Default value: 90. Recommended value: 512.
- osd_client_message_size_cap: Maximum size (in bytes) of client data that can be held in memory. Default value: 100. Recommended value: 2147483648.
- osd_deep_scrub_stride: Number of bytes read in one deep-scrub operation. Default value: 524288. Recommended value: 131072.
- osd_map_cache_size: Size (in MB) of the cache that stores the OSD map. Default value: 50. Recommended value: 1024.
- osd_recovery_op_priority: Recovery priority. The value ranges from 1 to 63; a larger value means higher resource usage. Default value: 3. Recommended value: 2.
- osd_recovery_max_active: Number of recovery requests active at the same time. Default value: 3. Recommended value: 10.
- osd_max_backfills: Maximum number of backfills allowed on an OSD. Default value: 1. Recommended value: 4.
- osd_min_pg_log_entries: Minimum number of PG log entries to retain. Default value: 3000. Recommended value: 30000.
- osd_max_pg_log_entries: Maximum number of PG log entries to retain. Default value: 3000. Recommended value: 100000.
- osd_mon_heartbeat_interval: Interval (in seconds) at which an OSD pings a MON. Default value: 30. Recommended value: 40.
- ms_dispatch_throttle_bytes: Maximum total size (in bytes) of messages waiting to be dispatched. Default value: 104857600. Recommended value: 1048576000.
- objecter_inflight_ops: Maximum number of in-flight I/O requests allowed. This parameter is used for client traffic control; if the number of outstanding requests exceeds the threshold, application I/O is blocked. The value 0 means the number of requests is not limited. Default value: 1024. Recommended value: 819200.
- osd_op_log_threshold: Number of operation logs displayed at a time. Default value: 5. Recommended value: 50.
- osd_crush_chooseleaf_type: Bucket type used when the CRUSH rule uses chooseleaf. Default value: 1. Recommended value: 0.
- journal_max_write_bytes: Maximum number of journal bytes that can be written at a time. Default value: 10485760. Recommended value: 1073714824.
- journal_max_write_entries: Maximum number of journal entries that can be written at a time. Default value: 100. Recommended value: 10000.

[client]
- rbd_cache: Whether the RBD cache is enabled. Default value: True. Recommended value: True.
- rbd_cache_size: RBD cache size (in bytes). Default value: 33554432. Recommended value: 335544320.
- rbd_cache_max_dirty: Maximum number of dirty bytes allowed when the cache works in writeback mode; a value of 0 switches the cache to writethrough mode. Default value: 25165824. Recommended value: 134217728.
- rbd_cache_max_dirty_age: Duration (in seconds) for which dirty data is kept in the cache before being flushed to the drives. Default value: 1. Recommended value: 30.
- rbd_cache_writethrough_until_flush: Compatibility option for virtio drivers earlier than linux-2.6.32; it prevents data from being written back when no flush request has been sent. When this parameter is enabled, librbd handles I/O in writethrough mode and switches to writeback mode only after the first flush request is received. Default value: True. Recommended value: False.
- rbd_cache_max_dirty_object: Maximum number of objects held in the cache. The default value 0 means the limit is calculated from the RBD cache size: librbd logically splits a drive image into 4 MB chunks, each of which is abstracted as an object managed by the librbd cache. Increasing this value can improve performance. Default value: 0. Recommended value: 2.
- rbd_cache_target_dirty: Amount of dirty data that triggers writeback. The value cannot exceed rbd_cache_max_dirty. Default value: 16777216. Recommended value: 235544320.
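For reference, the recommended values from Table 1 and Table 2 could be combined into a ceph.conf similar to the following sketch. The network segments are examples only, and every value should be validated against your own hardware and workload before it is applied; some settings, such as the journal sizes, are only relevant to specific OSD back ends.

    [global]
    cluster_network = 192.168.4.0/24
    public_network = 192.168.3.0/24
    osd_pool_default_size = 3
    osd_pool_default_min_size = 1
    osd_memory_target = 4294967296

    [mon]
    mon_clock_drift_allowed = 1
    mon_osd_min_down_reporters = 13
    mon_osd_down_out_interval = 600

    [osd]
    osd_journal_size = 20000
    osd_max_write_size = 512
    osd_client_message_size_cap = 2147483648
    osd_deep_scrub_stride = 131072
    osd_map_cache_size = 1024
    osd_recovery_op_priority = 2
    osd_recovery_max_active = 10
    osd_max_backfills = 4
    osd_min_pg_log_entries = 30000
    osd_max_pg_log_entries = 100000
    osd_mon_heartbeat_interval = 40
    ms_dispatch_throttle_bytes = 1048576000
    objecter_inflight_ops = 819200
    osd_op_log_threshold = 50
    osd_crush_chooseleaf_type = 0
    journal_max_write_bytes = 1073714824
    journal_max_write_entries = 10000

    [client]
    rbd_cache = true
    rbd_cache_size = 335544320
    rbd_cache_max_dirty = 134217728
    rbd_cache_max_dirty_age = 30
    rbd_cache_writethrough_until_flush = false
    rbd_cache_max_dirty_object = 2
    rbd_cache_target_dirty = 235544320

Copy the modified file to every Ceph node and restart the Ceph daemons on each node, as described above.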
Optimizing the PG Distribution
- Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
- Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool-name} pg_num {pg-num} and ceph osd pool set {pool-name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool-name} pg_num and ceph osd pool get {pool-name} pgp_num commands to check the number of PGs/PGPs in the storage pool.
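For example, applying the sizing formula from Table 3 below to a hypothetical cluster with 12 OSDs and 3 replicas gives 12 x 100 / 3 = 400, which rounds up to 512. The pool name vms is only a placeholder:

    # Hypothetical pool "vms" on a 12-OSD cluster with 3 replicas:
    # (12 x 100) / 3 = 400, rounded up to the next power of 2 -> 512
    ceph osd pool create vms 512 512

    # Or adjust an existing pool (pgp_num should follow pg_num)
    ceph osd pool set vms pg_num 512
    ceph osd pool set vms pgp_num 512

    # Verify the result
    ceph osd pool get vms pg_num
    ceph osd pool get vms pgp_num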
The default balancer mode is none; you can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default; run the ceph balancer on or ceph balancer off command to enable or disable it.
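The balancer commands can be combined as in the following sketch; ceph balancer status is optional but useful for confirming the mode:

    # Switch the balancer to upmap mode and enable it
    ceph balancer mode upmap
    ceph balancer on

    # Note: upmap mode requires clients of Luminous or later. If switching the
    # mode is rejected, you may first need to run:
    #   ceph osd set-require-min-compat-client luminous

    # Check the balancer state and the evaluated distribution score
    ceph balancer status
    ceph balancer eval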
Table 3 describes the PG distribution parameters.
Table 3 PG distribution parameters

- pg_num: Total PGs = (Total_number_of_OSDs x 100) / max_replication_count, rounded up to the nearest power of 2. Default value: 8. Symptom: a warning is displayed if the number of PGs is insufficient. Suggestion: calculate the value using the formula.
- pgp_num: Number of PGPs; keep it the same as the number of PGs. Default value: 8. Suggestion: calculate the value using the formula and set it to the same value as pg_num.
- ceph balancer mode: Enables the balancer plugin and sets its mode. Default value: none. Symptom: if PGs are unevenly distributed, some OSDs may be overloaded and become bottlenecks. Recommended value: upmap.
- The number of PGs carried by each OSD must be the same or close to the same. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plugin can be used to optimize the PG distribution. Run the ceph balancer eval or ceph pg dump command to view the PG distribution (see the sketch after this list).
- Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution; if the distribution no longer changes, it is optimal.
- The PG distribution of each OSD affects the load balancing of written data. In addition to balancing the number of PGs on each OSD, the primary PGs also need to be distributed as evenly as possible across the OSDs.
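A quick way to check both balance criteria could look like the following sketch. ceph osd df prints the PG count per OSD in its PGS column, and ceph balancer eval prints a score where lower values indicate a more even distribution:

    # Number of PGs hosted by each OSD (see the PGS column)
    ceph osd df

    # Overall distribution score (lower is better); compare before and after balancing
    ceph balancer eval

    # Detailed PG-to-OSD mapping, including the acting primary of each PG
    ceph pg dump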
Binding OSDs to CPU Cores
- Purpose
Bind each OSD process to a fixed CPU core.
- Procedure
Add osd_numa_node = <NUM> to the corresponding [osd.n] section of the /etc/ceph/ceph.conf file, as shown in the example below.
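For example, the following sketch binds a hypothetical osd.0 to NUMA node 0; the section name [osd.0] and the node number are illustrative only:

    # /etc/ceph/ceph.conf (excerpt)
    [osd.0]
    osd_numa_node = 0

    # Restart the OSD for the binding to take effect
    systemctl restart ceph-osd@0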
Table 4 lists the optimization items.
Table 4 OSD core binding parameters

[osd.n]
- osd_numa_node: Binds the osd.n daemon to a specified idle NUMA node, that is, a node other than the nodes that handle NIC software interrupts. Default value: none. Symptom: if OSD processes run on the same CPU cores that handle NIC interrupts, those cores may be overloaded. Suggestion: to balance the CPU load, avoid running OSD processes on the same NUMA node as NIC interrupt handling (or other processes with high CPU usage).
- The Ceph OSD daemons and the NIC software-interrupt handling must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph spreads OSD processes across all CPU cores. Add the osd_numa_node parameter to the ceph.conf file to keep each OSD process and the NIC interrupt handling (or other processes with high CPU usage) off the same NUMA node.
- Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/<port name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 so that the OSDs and the NIC software interrupts do not share CPU cores, as in the example below.
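Putting the two steps together, a hedged example might look as follows; the port name enp189s0f0 and osd.0 are placeholders, and the verification command simply reads back the configured value from the running OSD:

    # 1. Find the NUMA node of the NIC
    cat /sys/class/net/enp189s0f0/device/numa_node
    # Example output: 2 -> choose a different node, e.g. 0 or 1, for the OSDs

    # 2. After setting osd_numa_node in ceph.conf and restarting the OSD,
    #    confirm the value the daemon is using
    ceph daemon osd.0 config get osd_numa_node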
Optimizing Compression Algorithm Configuration Parameters
- Purpose
Adjust the compression algorithm configuration parameters to optimize the performance of the compression algorithm.
- Procedure
The default value of bluestore_min_alloc_size_hdd in Ceph is 32 KB. This parameter affects the size of the data finally written after compression; set it to a smaller value to maximize the compression ratio.
By default, Ceph uses five threads per OSD process to handle I/O requests. After the compression algorithm is enabled, this number of threads can become a performance bottleneck; increase it to get the most out of the compression algorithm.
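A corresponding ceph.conf excerpt using the recommended values could look like this sketch. Note that bluestore_min_alloc_size_hdd is generally applied when an OSD is created, so it typically affects only OSDs deployed after the change; the threading parameters take effect after the OSDs are restarted.

    # /etc/ceph/ceph.conf (excerpt)
    [osd]
    bluestore_min_alloc_size_hdd = 8192
    osd_op_num_shards_hdd = 12
    osd_op_num_threads_per_shard_hdd = 2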
The following table describes the compression-related parameters:
- bluestore_min_alloc_size_hdd: Minimum allocation size for HDD data disks in the BlueStore storage engine. Default value: 32768. Recommended value: 8192.
- osd_op_num_shards_hdd: Number of shards for HDD data drives in an OSD process. Default value: 5. Recommended value: 12.
- osd_op_num_threads_per_shard_hdd: Number of threads of an OSD process for each HDD data drive shard. Default value: 1. Recommended value: 2.