From 28198919f86d5badc3cb3d7d0d09f9d51561014e Mon Sep 17 00:00:00 2001
From: Clayton Coleman
Date: Wed, 5 Jan 2022 15:47:59 -0500
Subject: [PATCH] release-note: Describe issues around node admission in 1.22

The 1.22 release fixed an issue where the resources used by terminating
pods were not always properly accounted for. As a consequence, certain
workloads that saturate a single node with pods may see increased pod
creation failures until existing pods fully terminate. Inform users of
that change and link to where it will be addressed in the future.
---
 CHANGELOG/CHANGELOG-1.22.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/CHANGELOG/CHANGELOG-1.22.md b/CHANGELOG/CHANGELOG-1.22.md
index 4c26a5f7911..9dfa497aec4 100644
--- a/CHANGELOG/CHANGELOG-1.22.md
+++ b/CHANGELOG/CHANGELOG-1.22.md
@@ -826,6 +826,10 @@ A regression bug was found where guaranteed Pods with multiple containers do not
 
 If CSIMigrationvSphere feature gate is enabled, user should not upgrade to Kubernetes v1.22. vSphere CSI Driver does not support Kubernetes v1.22 yet because it uses v1beta1 CRD APIs. Support for v1.22 will be added at a later release. Check the following document for supported Kubernetes releases for a given [vSphere CSI Driver version](https://vsphere-csi-driver.sigs.k8s.io/compatiblity_matrix.html#compatibility-matrix-for-vsphere-csi-driver).
 
+### Workloads that saturate nodes with pods may see pods that fail due to node admission
+
+1.22 addressed a long-standing issue in the Kubelet where terminating pods were [vulnerable to race conditions](https://github.com/kubernetes/kubernetes/pull/102344) that led to early shutdown, resource leaks, or long delays in actually completing pod shutdown. As a consequence of this change, the Kubelet now correctly takes the resources of both running and terminating pods into account when deciding whether to accept new pods, since terminating pods are still holding on to those resources. This stricter handling may surface to end users as pod rejections when newly created pods are scheduled to mostly full nodes where other terminating pods still hold the resources the new pods need. The most likely error is a pod set to the `Failed` phase with the reason `OutOfCpu` or `OutOfMemory`, but any resource on the node that has a fixed limit (including persistent volume counts on cloud nodes, exclusive CPU cores, or unique hardware devices) could trigger the failure. While this behavior is correct, it reduces the throughput of pod execution and creates user-visible warnings - [future versions of Kubernetes will minimize the likelihood that users see pod failures due to this issue](https://github.com/kubernetes/kubernetes/issues/106884). In general, any automation that creates pods [must take Kubelet rejections into account](https://kubernetes.io/docs/concepts/scheduling-eviction/#pod-disruption) and should be designed to retry and back off where necessary.
+
 ## Urgent Upgrade Notes
 
 ### (No, really, you MUST read this before you upgrade)
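
The release note above advises that automation creating pods should tolerate Kubelet admission rejections and retry with backoff. The sketch below illustrates one way such automation might do that with client-go; it is not part of the changelog change itself, and the package name, the `CreateWithAdmissionRetry` helper, the retry cap, and the polling approach are illustrative assumptions rather than a pattern prescribed by Kubernetes.

```go
// Hypothetical sketch: retry pod creation when the Kubelet rejects a pod at
// admission (phase=Failed, reason beginning with "OutOf"), giving terminating
// pods on the node time to release the resources they still hold.
package podretry

import (
	"context"
	"fmt"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// CreateWithAdmissionRetry creates the pod and, if the node rejects it at
// admission, deletes the failed pod and recreates it with exponential backoff.
// A production controller would use an informer instead of polling and would
// also surface events/metrics; this only illustrates the retry shape.
func CreateWithAdmissionRetry(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	backoff := 2 * time.Second
	for attempt := 0; attempt < 5; attempt++ {
		created, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
		if err != nil {
			return fmt.Errorf("creating pod: %w", err)
		}

		// Give the scheduler and Kubelet a moment, then check whether the pod
		// was rejected at node admission.
		time.Sleep(backoff)
		got, err := client.CoreV1().Pods(created.Namespace).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return fmt.Errorf("checking pod status: %w", err)
		}
		if got.Status.Phase == corev1.PodFailed && strings.HasPrefix(got.Status.Reason, "OutOf") {
			// Rejected (e.g. OutOfCpu / OutOfMemory): clean up the failed pod
			// and try again after a longer wait so terminating pods can finish.
			if err := client.CoreV1().Pods(got.Namespace).Delete(ctx, got.Name, metav1.DeleteOptions{}); err != nil {
				return fmt.Errorf("deleting rejected pod: %w", err)
			}
			backoff *= 2
			continue
		}
		return nil // admitted, or still pending; let the normal lifecycle proceed
	}
	return fmt.Errorf("pod %s/%s was repeatedly rejected by node admission", pod.Namespace, pod.Name)
}
```

Higher-level workload controllers such as ReplicaSets and Jobs generally replace failed pods on their own, so a retry loop like this mostly matters for automation that creates bare pods directly against the API.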