
Ultra-large Clusters

In many cluster architectures, cluster members exist mainly to tell a centralized interface which nodes they can access; the central interface then serves clients through two-level scheduling. In a petabyte- to exabyte-scale system, this scheduling layer becomes the biggest bottleneck.

Ceph eliminates this bottleneck: both OSDs and clients are cluster-aware. Ceph clients and OSDs know about the other OSDs in the cluster, each OSD can communicate directly with other OSDs and monitors, and clients interact with OSDs directly.

Ceph clients, monitors, and OSDs can interact with each other directly, which means that OSDs can use the CPU and memory of their local nodes to perform tasks that would overwhelm a central server. This design distributes the computing load and brings the following benefits:

  • OSDs serve clients directly. Every network device has a limit on concurrent connections, so under heavy concurrency the physical limits of a centralized system become evident. By letting clients communicate with OSD nodes directly, Ceph removes this single point of failure (SPOF) and improves both performance and total system capacity. A Ceph client maintains sessions with the specific OSDs it needs, rather than with a central server.
  • OSD membership and status: After joining the cluster, a Ceph OSD continuously reports its status. At the lowest level, an OSD's status is up or down, indicating whether it is running and able to serve requests. If an OSD is down but still in the cluster, it may be faulty. An OSD that is not running (for example, one that has crashed) cannot report its own failure, so the Ceph monitor periodically pings OSDs to confirm that they are running. Ceph also authorizes each OSD to check whether neighboring OSDs are down, update the cluster map, and report to the monitor. This delegation keeps the monitor a lightweight process.
  • Data scrubbing: As part of maintaining data consistency and cleanliness, OSDs can scrub objects in a placement group (PG). A Ceph OSD compares object metadata with that of the copies stored on other OSDs to catch OSD defects or file system errors, typically daily. OSDs can also perform deep scrubbing, typically weekly, comparing object data bit by bit to find bad sectors that light scrubbing does not detect.
  • Replication: Like Ceph clients, OSDs use the CRUSH algorithm, but OSDs use it to calculate where object copies should be stored (and for rebalancing). In a typical write scenario, a client uses the CRUSH algorithm to map an object to a storage pool and a PG, then consults the CRUSH map to identify the primary OSD of that PG. The client writes the object to the primary OSD of the target PG. Using its own copy of the CRUSH map, the primary OSD locates the secondary and tertiary OSDs, replicates the object to the corresponding PGs on those OSDs (as many OSDs as there are copies), and acknowledges the write to the client once all copies are stored.
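The write path described above can be sketched as a toy model. This is a minimal sketch, not Ceph's implementation: the seeded hash below merely stands in for the CRUSH algorithm, and all names (`PG_COUNT`, `OSDS`, `client_write`, and so on) are illustrative assumptions.

```python
import hashlib

# Toy model of Ceph's primary-copy write path. The real placement is
# computed by CRUSH against the cluster map; a seeded hash stands in here.

PG_COUNT = 8                               # placement groups in a hypothetical pool
OSDS = [f"osd.{i}" for i in range(6)]      # hypothetical OSD names
REPLICAS = 3                               # three-copy pool

def object_to_pg(name: str) -> int:
    """Hash the object name onto a placement group."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % PG_COUNT

def pg_to_osds(pg: int) -> list[str]:
    """Deterministically pick an ordered OSD set for a PG (stand-in for CRUSH).
    The first entry acts as the primary, the rest as secondary/tertiary."""
    ranked = sorted(OSDS, key=lambda o: hashlib.md5(f"{pg}:{o}".encode()).hexdigest())
    return ranked[:REPLICAS]

def client_write(name: str, data: bytes, stores: dict) -> str:
    """The client computes placement itself and contacts only the primary;
    the primary then fans the object out to the replica OSDs."""
    pg = object_to_pg(name)
    primary, *replicas = pg_to_osds(pg)
    for osd in [primary, *replicas]:       # primary stores, then replicates
        stores.setdefault(osd, {})[name] = data
    return primary                         # acknowledged once all copies exist
```

Because both clients and OSDs run the same placement function, no central lookup table is consulted on the data path, which is the property the surrounding text emphasizes.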
Figure 1 Ceph three-copy write

Because OSDs handle replication themselves, they relieve clients of that burden while ensuring high data reliability and safety.
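The peer heartbeat scheme described in the list above, where monitors delegate liveness checks to neighboring OSDs and receive only failure reports, can be sketched as follows. This is a simplified model under assumed names and timings (`HEARTBEAT_GRACE`, `report_failure`), not Ceph's actual protocol.

```python
import time

# Toy model of OSD peer heartbeats: each OSD tracks when it last heard
# from its neighbours and reports only suspected failures to the monitor,
# so the monitor stays a lightweight process.

HEARTBEAT_GRACE = 20.0   # seconds without a reply before a peer is suspect

class Monitor:
    def __init__(self):
        self.osd_state = {}                  # osd name -> "down"

    def report_failure(self, reporter: str, failed: str):
        # Real monitors wait for multiple reporters; one report suffices here.
        self.osd_state[failed] = "down"

class OSD:
    def __init__(self, name: str, monitor: Monitor):
        self.name = name
        self.monitor = monitor
        self.last_seen = {}                  # peer name -> last heartbeat time

    def receive_heartbeat(self, peer: "OSD"):
        self.last_seen[peer.name] = time.monotonic()

    def check_peers(self):
        """Report any peer whose heartbeat is older than the grace period."""
        now = time.monotonic()
        for peer, seen in self.last_seen.items():
            if now - seen > HEARTBEAT_GRACE:
                self.monitor.report_failure(self.name, peer)
```

The design point is the direction of traffic: healthy clusters generate almost no monitor load, because only stale heartbeats result in a message to the monitor.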
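The two scrub levels from the list above can likewise be illustrated with a small sketch. The dicts below stand in for per-OSD object stores, and comparing sizes versus content digests is only an analogy for metadata versus bit-for-bit comparison; function names are assumptions.

```python
import hashlib

# Toy model of PG scrubbing. Light scrub compares object metadata (sizes
# here); deep scrub compares content digests, catching corruption that
# leaves the metadata unchanged.

def light_scrub(replicas: list[dict]) -> list[str]:
    """Flag objects whose metadata (size) disagrees across replicas."""
    bad = []
    for name in replicas[0]:
        sizes = {len(r.get(name, b"")) for r in replicas}
        if len(sizes) > 1:
            bad.append(name)
    return bad

def deep_scrub(replicas: list[dict]) -> list[str]:
    """Flag objects whose content digests disagree, even when sizes match."""
    bad = []
    for name in replicas[0]:
        digests = {hashlib.sha256(r.get(name, b"")).hexdigest() for r in replicas}
        if len(digests) > 1:
            bad.append(name)
    return bad
```

A flipped bit that preserves the object's size passes the light scrub but fails the deep one, which is why the deep pass runs less frequently yet remains necessary.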