Slow Fault Recovery of Three or More Nodes in the Global Cache Cluster

Symptom

Three or more nodes in a cluster become faulty at the same time. Even if the faults are rectified immediately, the cluster takes a long time to recover, and the recovery duration increases with the number of faulty nodes. For example, recovering five faulty nodes in a 10-node environment can take up to 30 minutes.

Cause Analysis

When nodes become faulty, the normal nodes run background operations to release their connections to the faulty nodes and the associated resources. These operations take some time to complete. If the faulty nodes are recovered while this resource release is still in progress, complex timing (sequencing) issues occur between the release and the recovery. As a result, the recovery time is prolonged.

Solution

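The cluster can still be recovered in this situation. If multiple nodes are faulty at the same time, wait for a period of time before recovering them, so that the normal nodes can finish releasing their connections and resources. This effectively shortens the cluster recovery time.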
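
The "wait, then recover" step could be scripted roughly as follows. This is only a minimal sketch: this article does not state how long to wait or how to check that the normal nodes have finished releasing resources. The `cleanup_done` callback is therefore a hypothetical, site-specific check that you would have to supply, and the interval and timeout defaults are placeholders, not Global Cache recommendations.

```python
import time
from typing import Callable


def wait_for_settle(
    cleanup_done: Callable[[], bool],
    poll_interval_s: float = 10.0,   # placeholder value, not a documented setting
    timeout_s: float = 600.0,        # placeholder value, not a documented setting
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Wait until cleanup_done() returns True or timeout_s elapses.

    cleanup_done is a hypothetical check supplied by the operator; it should
    return True once the normal nodes have released their connections to the
    faulty nodes. Returns True if the check passed, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if cleanup_done():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))
```

Faulty-node recovery would start only after `wait_for_settle(...)` returns. If no such check is available in your environment, a callback that always returns `False` turns the function into a fixed delay of `timeout_s` seconds.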