Merge pull request #50310 from shyamjvs/block-on-master-startup

Automatic merge from submit-queue

Block on master-creation step for large clusters (>=50 nodes) in kube-up

I recently noticed a failure in our 5000-node scale test where the master failed to initialize in time. kube-up still went on to create all 5000 nodes, because it does not block on master creation. It turned out the master VM had not even been created:

```
W0808 10:00:49.340] ERROR: (gcloud.compute.instances.create) Could not fetch resource:
... Try a different zone, or try again later.
```

Even some of our 100-node tests occasionally flake during cluster startup, with the master validation step timing out, and I think the cause is the same (see https://github.com/kubernetes/kubernetes/issues/49453).
We should block on that step for large clusters.
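
To make the difference between the blocking and non-blocking paths concrete, here is a minimal bash sketch. The names `start_master` and `bring_up` are hypothetical stand-ins and are not part of kube-up; `start_master` is hard-coded to fail so that the two paths can be compared. The threshold of 50 mirrors the check in this patch.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the master-creation step (not a kube-up function).
# It always fails, simulating a VM that could not be created.
start_master() {
  return 1
}

# Hypothetical driver: runs start_master in the foreground for large clusters
# and in the background for small ones, then reports whether node creation
# would begin.
bring_up() {
  local num_nodes="$1"
  if [[ "${num_nodes}" -ge 50 ]]; then
    # Blocking: the failure is seen before any node work starts.
    if ! start_master; then
      echo "master failed; skipping nodes"
      return 1
    fi
  else
    # Non-blocking: the exit status of the background job is not checked
    # here, so node creation proceeds regardless.
    start_master &
  fi
  echo "creating ${num_nodes} nodes"
}
```

With these assumptions, `bring_up 100` stops before the node step, while `bring_up 3` goes on to the node step even though the background job failed.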

cc @kubernetes/sig-scalability-misc @gmarek
Kubernetes Submit Queue 2017-08-08 13:56:50 -07:00 committed by GitHub
commit 3e0eff9f55

@@ -953,6 +953,7 @@ function delete-subnetworks() {
 #
 # Assumed vars:
 # KUBE_TEMP: temporary directory
+# NUM_NODES: #nodes in the cluster
 #
 # Args:
 # $1: host name
@@ -1044,7 +1045,13 @@ function create-master() {
   create-certs "${MASTER_RESERVED_IP}"
   create-etcd-certs ${MASTER_NAME}
-  create-master-instance "${MASTER_RESERVED_IP}" &
+  if [[ "${NUM_NODES}" -ge "50" ]]; then
+    # We block on master creation for large clusters to avoid doing too much
+    # unnecessary work in case master start-up fails (like creation of nodes).
+    create-master-instance "${MASTER_RESERVED_IP}"
+  else
+    create-master-instance "${MASTER_RESERVED_IP}" &
+  fi
 }
 # Adds master replica to etcd cluster.