The Global Cache Process Restarts and Is Suspended After All ZooKeeper Faults Are Rectified

Symptom

After all ZooKeeper faults are rectified, the Global Cache process restarts unexpectedly. The restart is then suspended for 50 minutes and eventually fails.

Cause Analysis

After the ZooKeeper faults are rectified, the CCM fails to obtain the distributed lock and proactively restarts. The CCM is then suspended during the restart, so services are not recovered.

Solution

Manually scale in the faulty node and then scale it out again so that services are not affected.

Run the following commands:

# Access mgrtool.
attach CCM
ccm whoami # Check the CCM master.

# Access the CCM master mgrtool.
# Mark the faulty node as permanently faulty. (Before removing a node from the cluster, shut down the Global Cache process on that node.)
ccm set permanentFault

# Restore the permanently faulty node to the cluster. (Ensure that its drive contains no data, that is, the BDM has been formatted.)
ccm start failback
# Start the scale-out.
ccm start scaleout
# Check the scale-out status.
ccm show scaleout status
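
The status check in the last step usually has to be repeated until the scale-out finishes. The following Python code is a minimal sketch of such a polling loop, not part of Global Cache or mgrtool. The names wait_for_scaleout, read_status, and is_done are hypothetical. The caller supplies both how the output of "ccm show scaleout status" is captured and which text indicates completion, because neither is specified here.

```python
import time
from typing import Callable


def wait_for_scaleout(
    read_status: Callable[[], str],
    is_done: Callable[[str], bool],
    timeout_s: float = 3600.0,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll read_status() until is_done(status) is true or timeout_s elapses.

    read_status: hypothetical callable that returns the text printed by
                 "ccm show scaleout status" (how it is captured is up to you).
    is_done:     hypothetical predicate deciding whether that text means
                 the scale-out has completed.
    Returns True on completion, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        status = read_status()
        if is_done(status):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The sleep and clock parameters exist only so that the loop can be exercised without real waiting. In normal use the defaults are sufficient, and timeout_s should be chosen to suit the expected scale-out duration in your environment.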
