The Latency Is High When a Large Number of Containers Are Started
Symptom
During a test that concurrently schedules 100 containers on two Kubernetes worker nodes, the average container startup latency on the Kunpeng platform is much longer than that on the x86 platform. Breaking down the deployment process shows that the scheduling duration accounts for nearly 90% of the total latency, indicating congestion.
Key Process and Cause Analysis
The following script is used for the test. It concurrently creates 100 jobs and breaks the startup latency of each pod down into the Scheduled, Pulled, Created, and Started phases:

#!/bin/bash
total=100

# Clean up jobs and events left over from the previous run.
echo "start delete old pods."
kubectl get jobs | grep "yanzu-test" | awk '{print $1}' | grep -v NAME | xargs kubectl delete job > /dev/null 2>&1
kubectl delete events --all
echo "delete old pods finish"

echo "starting..."
start=$(date +%s)
echo $start
date

# Concurrently create $total jobs.
declare -i i=1
while ((i<=total))
do
    echo "start" $i
    kubectl create -f run.yaml &
    let ++i
done

# Wait until every pod has completed.
echo "wait for 30s"
sleep 30s
completed=$(kubectl get pods | grep "yanzu-test" | awk '{print $3}' | grep -v Completed)
while [ -n "$completed" ]
do
    echo "waiting for complete:" $completed
    sleep 3
    completed=$(kubectl get pods | grep "yanzu-test" | awk '{print $3}' | grep -v Completed)
done
echo "finish create jobs"

# Accumulate the per-phase durations and track the maximums.
declare -i sum=0 max=0 sum_sch=0 sum_cre=0 sum_sta=0 max_sch=0 max_cre=0 max_sta=0 sum_pul=0 max_pul=0
pods=$(kubectl get pods | grep "yanzu-test" | awk '{print $1}')
for pod in $pods
do
    # Total latency: container start time minus the moment the script started.
    timet=$(kubectl describe pod $pod | grep "Started:" | awk '{print $6}')
    p_timet=$(date --date $timet)
    # echo "Pod $pod start time is $p_timet"
    mils=$(date +%s --date $timet)
    used=$((mils-start))
    if ((used>max))
    then
        max=$used
    fi

    # Per-phase timestamps taken from the pod's events.
    sch_time=$(kubectl get events -o=custom-columns=LastSeen:.metadata.creationTimestamp,Type:.type,Object:.involvedObject.name,Reason:.reason | grep "$pod" | grep "Scheduled" | awk '{print $1}')
    pul_time=$(kubectl get events -o=custom-columns=LastSeen:.lastTimestamp,Type:.type,Object:.involvedObject.name,Reason:.reason | grep "$pod" | grep "Pulled" | awk '{print $1}')
    cre_time=$(kubectl get events -o=custom-columns=LastSeen:.lastTimestamp,Type:.type,Object:.involvedObject.name,Reason:.reason | grep "$pod" | grep "Created" | awk '{print $1}')
    sta_time=$(kubectl get events -o=custom-columns=LastSeen:.lastTimestamp,Type:.type,Object:.involvedObject.name,Reason:.reason | grep "$pod" | grep "Started" | awk '{print $1}')
    sch_time_t=$(date +%s --date $sch_time)
    sch_used=$((sch_time_t-start))
    pul_time_t=$(date +%s --date $pul_time)
    pul_used=$((pul_time_t-sch_time_t))
    cre_time_t=$(date +%s --date $cre_time)
    cre_used=$((cre_time_t-pul_time_t))
    sta_time_t=$(date +%s --date $sta_time)
    sta_used=$((sta_time_t-cre_time_t))
    if ((sch_used>max_sch))
    then
        max_sch=$sch_used
    fi
    if ((pul_used>max_pul))
    then
        max_pul=$pul_used
    fi
    if ((cre_used>max_cre))
    then
        max_cre=$cre_used
    fi
    if ((sta_used>max_sta))
    then
        max_sta=$sta_used
    fi
    sum=$((sum+used))
    sum_sch=$((sum_sch+sch_used))
    sum_pul=$((sum_pul+pul_used))
    sum_cre=$((sum_cre+cre_used))
    sum_sta=$((sum_sta+sta_used))
done

# Report the averages and maximums (in seconds).
echo "Avg Scheduled Time: $((sum_sch/total))"
echo "Avg Pulled Time: $((sum_pul/total))"
echo "Avg Created Time: $((sum_cre/total))"
echo "Avg Started Time: $((sum_sta/total))"
echo "Avg Time: $((sum/total))"
echo "Max Scheduled Time: $max_sch"
echo "Max Pulled Time: $max_pul"
echo "Max Created Time: $max_cre"
echo "Max Started Time: $max_sta"
echo "Max Time: $max"
echo "finish test"
The run.yaml manifest referenced by the script is as follows:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: yanzu-test-
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 0
  activeDeadlineSeconds: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: my-container
        image: openjdk:8u292-jdk
        resources:
          requests:
            memory: "1Gi"
            cpu: "1000m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        command: [ "/bin/bash", "-c", "sleep 21s" ]
The procedure is as follows:
- In the while loop, execute kubectl create -f run.yaml 100 times to concurrently create 100 jobs, each of which deploys a pod that starts a container.
- Use the script to compute the latency breakdown. The scheduling duration ratio is found to be extremely high.
- Container startup duration = Container startup time - Command input time
- Pod scheduling duration = Pod scheduling success time - Command input time
- Run the kubectl logs command to check the controller-manager log. Messages containing "Throttling request" are found.
Figure 1 Controller-manager log
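The log check above can be sketched as follows. The pod name suffix and the sample log lines are illustrative assumptions, not captured from a real cluster, and the exact wording of the throttling message varies by Kubernetes version:

```shell
#!/bin/bash
# Sketch of the log check. sample.log stands in for the output of, e.g.:
#   kubectl logs -n kube-system kube-controller-manager-<node-name>
# The log lines below are illustrative only.
cat > sample.log <<'EOF'
I0101 00:00:01.000000 1 request.go:665] Throttling request took 1.04s, request: POST:https://10.0.0.1:6443/api/v1/namespaces/default/pods
I0101 00:00:02.000000 1 event.go:291] "Event occurred" object="default/yanzu-test-xxxxx" reason="SuccessfulCreate"
EOF

# Count the rate-limited requests.
grep -c "Throttling request" sample.log   # prints 1
```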
In the scheduling phase, Kubernetes concurrently creates the 100 Job and Pod objects through the controller-manager, and the scheduler then assigns the 100 pods to suitable worker nodes. Throughout this process, the controller-manager and scheduler frequently send HTTP requests to the apiserver. In each of these components, the number of API requests per second and the peak number of burst API requests per second are controlled by the kube-api-qps (50 by default) and kube-api-burst (100 by default) parameters, respectively. If these thresholds are set too small, API requests that exceed them are rate-limited and become throttling requests. As a result, responses slow down, pod creation and scheduling take longer, and the latency increases.
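As a back-of-the-envelope illustration of why the cap matters, suppose each Job triggers several write requests from the controller-manager (Job creation, Pod creation, status updates, events). The per-Job request count below is an assumption for illustration, not a measured value:

```shell
#!/bin/bash
# Rough estimate only: assume ~5 write requests per Job (an assumption,
# not a measurement) against the default 50 QPS client-side limit.
jobs=100
writes_per_job=5
qps=50   # default kube-api-qps stated above
echo "$(( jobs * writes_per_job / qps ))s of queueing at the rate limiter"
```

Even this rough model yields several seconds spent purely waiting at the client-side rate limiter, matching the observed scheduling-dominated latency.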
Conclusion and Solution
Before initializing the cluster, perform the following steps to configure related parameters:
- Run the kubeadm config command to generate an init-config.yaml configuration file.
kubeadm config print init-defaults > init-config.yaml
- Modify init-config.yaml to add the corresponding startup parameters. The following example uses kubeadm-1.23.1-0.aarch64 to show how to modify the kube-controller-manager startup parameters.
- Open the init-config.yaml file.
vim init-config.yaml
- Modify the following configuration items.
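The exact configuration items were not preserved in this document. As a hedged example, raising the limits for kube-controller-manager (and, analogously, kube-scheduler) in the ClusterConfiguration section of init-config.yaml might look like the following; the values 100 and 200 are illustrative choices, not values taken from the original document:

```yaml
# Illustrative extraArgs snippet for init-config.yaml (kubeadm v1beta3,
# which matches kubeadm 1.23). The values are examples, not mandated ones.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    kube-api-qps: "100"
    kube-api-burst: "200"
scheduler:
  extraArgs:
    kube-api-qps: "100"
    kube-api-burst: "200"
```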
- Run the kubeadm init command to initialize the cluster.
kubeadm init --config=init-config.yaml
- After the modification, run the script again to perform the test. As shown in Figure 2, the scheduling latency drops to only about 1 second, much shorter than the previous 8 seconds.
