Poor Network Connection in the ZooKeeper Cluster

Symptom

If the NIC of the ZooKeeper leader node is congested, data synchronization between the follower nodes and the leader node times out, and the leader node removes the affected follower nodes from the cluster. If more than half of the nodes are removed, the cluster loses its quorum and no ZooKeeper node can provide services. You can locate the fault based on the gcache.log file in /var/log/gcache.

On a follower node:

On the leader node:

Cause Analysis

Network congestion makes data synchronization between ZooKeeper servers take too long, so transactions time out. If the leader server does not receive a heartbeat from a follower node within the period specified by syncLimit (counted in ticks of tickTime), a network I/O error occurs on the follower node. The follower node then disconnects from the leader node and shuts down. This is a native ZooKeeper fault.
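The timeout rule described above can be illustrated with a minimal Python sketch. It is a simplified model for reasoning about the window, not ZooKeeper's actual implementation. The function names are hypothetical, and the default values (tickTime of 2000 ms, syncLimit of 5) are assumed from the sample configuration shipped with ZooKeeper.

```python
def sync_window_ms(tick_time_ms: int, sync_limit: int) -> int:
    """Return the synchronization window in milliseconds.

    syncLimit is counted in ticks, so the window is tickTime * syncLimit.
    """
    return tick_time_ms * sync_limit


def follower_timed_out(last_heartbeat_ms: int, now_ms: int,
                       tick_time_ms: int = 2000, sync_limit: int = 5) -> bool:
    """Simplified check: has the follower been silent longer than the window?

    The defaults are assumptions taken from the ZooKeeper sample
    configuration; use the values from your own zoo.cfg.
    """
    silent_for_ms = now_ms - last_heartbeat_ms
    return silent_for_ms > sync_window_ms(tick_time_ms, sync_limit)
```

With the assumed defaults, the window is 10 seconds, so a follower whose heartbeats are delayed by congestion for longer than that is treated as out of sync.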

Solution

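In the zoo.cfg configuration file, increase the values of syncLimit (default value: 5 x tickTime) and initLimit (default value: 10 x tickTime). Both parameters are specified as a number of ticks, so the effective timeout is the parameter value multiplied by tickTime.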
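As a rough aid for choosing new values, the following Python sketch parses zoo.cfg-style text and reports the effective timeouts in milliseconds. It assumes the stock ZooKeeper property names tickTime, syncLimit, and initLimit. The helper names and the values in EXAMPLE_CFG are illustrative only, not recommendations for a particular cluster.

```python
def parse_zoo_cfg(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def effective_timeouts_ms(settings: dict) -> dict:
    """Convert the tick-based limits into milliseconds."""
    tick_time_ms = int(settings["tickTime"])
    return {
        "sync_ms": tick_time_ms * int(settings["syncLimit"]),
        "init_ms": tick_time_ms * int(settings["initLimit"]),
    }


# Illustrative values only: both limits doubled from 5 and 10.
EXAMPLE_CFG = """
# zoo.cfg excerpt (hypothetical values)
tickTime=2000
initLimit=20
syncLimit=10
"""

timeouts = effective_timeouts_ms(parse_zoo_cfg(EXAMPLE_CFG))
print(timeouts)
```

Larger limits make the cluster more tolerant of congestion but also slower to detect a follower that has really failed, so increase them gradually.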

Separate ZooKeeper servers from the service-layer network to prevent network congestion caused by heavy traffic at the service layer.
