Ceph Tuning

Modifying Ceph Configuration

  • Purpose

    Adjust the Ceph configuration to maximize system resource usage.

  • Procedure

    You can modify any Ceph configuration parameter by editing the /etc/ceph/ceph.conf file. For example, to change the default number of copies to 4, add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file and run the systemctl restart ceph.target command to restart the Ceph daemon processes for the change to take effect.

    The preceding operations take effect only on the current Ceph node. For a modification to take effect on the entire Ceph cluster, you need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon processes on each of them. Table 1 describes the Ceph optimization items.
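    The following is a minimal /etc/ceph/ceph.conf excerpt that applies the recommended values in Table 1, followed by the restart command. The network segments are only examples and must be replaced with the segments used in your environment:

        [global]
        # Separate replication/backfill traffic from client traffic.
        cluster_network = 192.168.4.0/24
        public_network = 192.168.3.0/24
        # Keep three copies of each object by default.
        osd_pool_default_size = 3
        # Allow each OSD daemon to use about 4 GB of memory.
        osd_memory_target = 4294967296

    Run the following command on each node after editing the file:

        systemctl restart ceph.target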

    Table 1 Ceph parameter configuration

    [global]

    cluster_network
        Description: Network segment used for OSD replication and data balancing. Configuring a segment different from the public network relieves the pressure on the public network.
        Suggestion: Recommended value: 192.168.4.0/24. You can set this parameter as required as long as it is different from the public network segment.

    public_network
        Description: Network segment used by clients and MONs to access the Ceph cluster.
        Suggestion: Recommended value: 192.168.3.0/24. You can set this parameter as required as long as it is different from the cluster network segment.

    osd_pool_default_size
        Description: Default number of copies of each object in a pool.
        Suggestion: Recommended value: 3

    osd_memory_target
        Description: Amount of memory that each OSD process is allowed to use.
        Suggestion: Recommended value: 4294967296 (4 GB)

    For details about how to optimize other parameters, see Table 2.

    Table 2 Other parameter configuration

    [global]

    osd_pool_default_min_size
        Description: Minimum number of replicas required for a pool to accept I/O. A degraded PG can still serve I/O as long as it has at least this many copies.
        Default value: 0
        Recommended value: 1

    cluster_network
        Description: Network segment used for OSD replication and data balancing. Configuring a segment different from the public network relieves the pressure on the public network.
        Default value: /
        Recommended value: 192.168.4.0/24

    osd_memory_target
        Description: Amount of memory that each OSD process is allowed to use.
        Default value: 4294967296
        Recommended value: 4294967296

    [mon]

    mon_clock_drift_allowed
        Description: Maximum clock drift (in seconds) allowed between MONs.
        Default value: 0.05
        Recommended value: 1

    mon_osd_min_down_reporters
        Description: Minimum number of OSDs that must report a peer OSD as down before the MONs mark it down.
        Default value: 2
        Recommended value: 13

    mon_osd_down_out_interval
        Description: Number of seconds that Ceph waits before marking a down OSD as out.
        Default value: 600
        Recommended value: 600

    [osd]

    osd_journal_size
        Description: OSD journal size (in MB).
        Default value: 5120
        Recommended value: 20000

    osd_max_write_size
        Description: Maximum size (in MB) of data that an OSD can write at a time.
        Default value: 90
        Recommended value: 512

    osd_client_message_size_cap
        Description: Maximum size (in bytes) of client data messages that can be held in memory.
        Default value: 100
        Recommended value: 2147483648

    osd_deep_scrub_stride
        Description: Number of bytes read at a time during deep scrubbing.
        Default value: 524288
        Recommended value: 131072

    osd_map_cache_size
        Description: Size (in MB) of the cache that stores OSD maps.
        Default value: 50
        Recommended value: 1024

    osd_recovery_op_priority
        Description: Priority of recovery operations. The value ranges from 1 to 63. A larger value indicates higher resource usage for recovery.
        Default value: 3
        Recommended value: 2

    osd_recovery_max_active
        Description: Maximum number of recovery requests active at the same time per OSD.
        Default value: 3
        Recommended value: 10

    osd_max_backfills
        Description: Maximum number of backfill operations allowed per OSD.
        Default value: 1
        Recommended value: 4

    osd_min_pg_log_entries
        Description: Minimum number of PG log entries to retain.
        Default value: 3000
        Recommended value: 30000

    osd_max_pg_log_entries
        Description: Maximum number of PG log entries to retain.
        Default value: 3000
        Recommended value: 100000

    osd_mon_heartbeat_interval
        Description: Interval (in seconds) at which an OSD pings a MON.
        Default value: 30
        Recommended value: 40

    ms_dispatch_throttle_bytes
        Description: Maximum size (in bytes) of messages waiting to be dispatched.
        Default value: 104857600
        Recommended value: 1048576000

    objecter_inflight_ops
        Description: Maximum number of in-flight (unacknowledged) I/O requests allowed. This parameter is used for client traffic control. If the threshold is exceeded, application I/O is blocked. The value 0 indicates that the number is not limited.
        Default value: 1024
        Recommended value: 819200

    osd_op_log_threshold
        Description: Number of operation logs displayed at a time.
        Default value: 5
        Recommended value: 50

    osd_crush_chooseleaf_type
        Description: Bucket type used when the CRUSH rule uses chooseleaf (0 indicates osd, 1 indicates host).
        Default value: 1
        Recommended value: 0

    journal_max_write_bytes
        Description: Maximum number of journal bytes that can be written at a time.
        Default value: 10485760
        Recommended value: 1073714824

    journal_max_write_entries
        Description: Maximum number of journal entries that can be written at a time.
        Default value: 100
        Recommended value: 10000

    [client]

    rbd_cache
        Description: Whether the RBD cache is enabled.
        Default value: True
        Recommended value: True

    rbd_cache_size
        Description: RBD cache size (in bytes).
        Default value: 33554432
        Recommended value: 335544320

    rbd_cache_max_dirty
        Description: Maximum amount of dirty data (in bytes) allowed when the cache works in writeback mode. If the value is 0, the cache works in writethrough mode.
        Default value: 25165824
        Recommended value: 134217728

    rbd_cache_max_dirty_age
        Description: Duration (in seconds) for which dirty data is kept in the cache before being flushed to the drives.
        Default value: 1
        Recommended value: 30

    rbd_cache_writethrough_until_flush
        Description: Provides compatibility with virtio drivers earlier than linux-2.6.32, which never send flush requests, and therefore prevents data from being written back unsafely when no flush request is sent. When this parameter is enabled, librbd processes I/O in writethrough mode and switches to writeback mode only after the first flush request is received.
        Default value: True
        Recommended value: False

    rbd_cache_max_dirty_object
        Description: Maximum number of objects in the cache. The default value 0 indicates that the limit is calculated from the RBD cache size. By default, librbd logically splits a drive image into 4 MB chunks, and each chunk is cached as an object managed by librbd. You can increase the value of this parameter to improve performance.
        Default value: 0
        Recommended value: 2

    rbd_cache_target_dirty
        Description: Amount of dirty data that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty.
        Default value: 16777216
        Recommended value: 235544320
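    After restarting the daemons, you can check whether a parameter has taken effect by querying the running daemon through its admin socket on the local node. The following is only a sketch; osd.0 is an example daemon ID:

        # Value currently used by a running OSD daemon.
        ceph daemon osd.0 config get osd_memory_target
        ceph daemon osd.0 config get osd_max_write_size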

Optimizing the PG Distribution

  • Purpose

    Adjust the number of PGs on each OSD to balance the load on each OSD.

  • Procedure

    By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num and ceph osd pool get {pool_name} pgp_num commands to check the number of PGs/PGPs in the storage pool.
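    The following is a sketch of these commands. The pool name (testpool), OSD count, and replica count are only examples; calculate pg-num using the formula in Table 3:

        # Example: 12 OSDs and 3 copies.
        # Total PGs = (12 x 100) / 3 = 400, rounded up to the nearest power of 2 = 512.
        ceph osd pool create testpool 512 512

        # Change the PG/PGP count of an existing pool and check the result.
        ceph osd pool set testpool pg_num 512
        ceph osd pool set testpool pgp_num 512
        ceph osd pool get testpool pg_num
        ceph osd pool get testpool pgp_num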

    The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. Run the ceph balancer on or ceph balancer off command to enable or disable it.
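    The corresponding command sequence is as follows (ceph balancer status additionally shows whether the balancer is active):

        # Switch the balancer to upmap mode and enable it.
        ceph balancer mode upmap
        ceph balancer on
        # Check the balancer state and the score of the current PG distribution.
        ceph balancer status
        ceph balancer eval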

    Table 3 describes the PG distribution parameters.

    Table 3 PG distribution parameters

    pg_num
        Description: Total PGs = (Total_number_of_OSD x 100)/max_replication_count. Round up the result to the nearest integer power of 2.
        Default value: 8
        Symptom: A warning is displayed if the number of PGs is insufficient.
        Suggestion: Calculate the value based on the formula.

    pgp_num
        Description: Sets the number of PGPs to be the same as that of PGs.
        Default value: 8
        Symptom: It is recommended that the number of PGPs be the same as the number of PGs.
        Suggestion: Calculate the value based on the formula.

    ceph_balancer_mode
        Description: Enables the balancer plugin and sets the plugin mode to upmap.
        Default value: none
        Symptom: If the number of PGs is unbalanced, some OSDs may be overloaded and become bottlenecks.
        Recommended value: upmap

    • The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plugin can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution, as shown in the example after this list.
    • Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
    • The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.
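    The following commands provide a quick way to check how evenly PGs (and primary PGs) are spread across OSDs. This is only a sketch; the output columns may vary slightly between Ceph releases:

        # Per-OSD utilization; the PGS column shows how many PGs each OSD hosts.
        ceph osd df
        # PG-to-OSD mapping, including the up/acting OSD sets and the primary OSD of each PG.
        ceph pg dump pgs_brief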

Binding OSDs to CPU Cores

  • Purpose

    Bind each OSD process to a fixed CPU core.

  • Procedure

    Add osd_numa_node = <NUM> to the /etc/ceph/ceph.conf file.

    Table 4 lists the optimization items.

    Table 4 OSD core binding parameters

    [osd.n]

    osd_numa_node
        Description: Binds the osd.n daemon process to a specified idle NUMA node, that is, a node other than the nodes that process NIC software interrupts.
        Default value: none
        Symptom: If OSD processes and NIC interrupt handling run on the same CPU cores, those cores may be overloaded.
        Suggestion: To balance the CPU load, avoid running each OSD process and the NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.

    • The Ceph OSD daemon process and NIC software interrupt process must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph evenly allocates OSD processes to all CPU cores. You can add the osd_numa_node parameter to the ceph.conf file to avoid running each OSD process and NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.
    • Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/<port name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 so that the OSDs and the NIC software interrupts do not share the same CPU cores, as shown in the sketch after this list.
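    The following is a sketch of the binding workflow. The port name (enp3s0), NUMA node numbers, and OSD IDs are only examples and must be adapted to the actual environment:

        # Query the NUMA node to which the NIC belongs (enp3s0 is an example port name).
        cat /sys/class/net/enp3s0/device/numa_node

    If the command returns 2, add sections similar to the following to /etc/ceph/ceph.conf on that node, and then restart the corresponding OSD daemons (for example, systemctl restart ceph-osd@0):

        [osd.0]
        osd_numa_node = 0

        [osd.1]
        osd_numa_node = 1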

Optimizing Compression Algorithm Configuration Parameters

  • Purpose

    Adjust the compression algorithm configuration parameters to optimize the performance of the compression algorithm.

  • Procedure

    The default value of bluestore_min_alloc_size_hdd for Ceph is 32 KB. This parameter affects the size of the data finally stored after compression. Set it to a smaller value to maximize the compression ratio of the compression algorithm.

    By default, Ceph uses five threads to process I/O requests in an OSD process. After compression is enabled, this number of threads can become a performance bottleneck. Increase the number of threads to maximize the performance of the compression algorithm.
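    The following /etc/ceph/ceph.conf excerpt applies the recommended values from the table below. Note that bluestore_min_alloc_size_hdd is applied when an OSD is created, so changing it affects only OSDs deployed after the change:

        [osd]
        # Smaller allocation unit for HDD OSDs (takes effect only for newly created OSDs).
        bluestore_min_alloc_size_hdd = 8192
        # More shards and threads per HDD OSD to avoid an I/O thread bottleneck when compression is enabled.
        osd_op_num_shards_hdd = 12
        osd_op_num_threads_per_shard_hdd = 2

    Restart the OSD daemons (systemctl restart ceph.target) for the shard and thread settings to take effect.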

    The following table describes the compression-related parameters:

    bluestore_min_alloc_size_hdd
        Description: Minimum size of objects allocated to the HDD data disks in the BlueStore storage engine.
        Default value: 32768
        Recommended value: 8192

    osd_op_num_shards_hdd
        Description: Number of shards for an HDD data drive in an OSD process.
        Default value: 5
        Recommended value: 12

    osd_op_num_threads_per_shard_hdd
        Description: Average number of threads of an OSD process for each HDD data drive shard.
        Default value: 1
        Recommended value: 2

Enabling Bcache

Bcache is a block-layer cache in the Linux kernel. It uses SSDs as a cache for HDDs to accelerate I/O. To enable the Bcache kernel module, you need to recompile the kernel. For details, see the Bcache User Guide (CentOS & openEuler 20.03).

Using the I/O Passthrough Tool

The I/O passthrough tool is a process-level optimization tool for balanced Ceph cluster scenarios. It automatically detects and optimizes OSDs in the Ceph cluster. For details, see the Kunpeng BoostKit for SDS I/O Passthrough Tool User Guide.