We have an e2e test which tries to ensure device plugin assignments to pods are kept
across node reboots. And this test has been permafailing for many weeks at the
time of writing (xref: #128443).
The problem is that closer inspection reveals the test was well intentioned, but
puzzling:
The test runs a pod, then restarts the kubelet, then _expects the pod to
end up in admission failure_ and yet _ensures the device assignment is
kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97
A reader can legitimately wonder whether this means the device will be kept busy forever.
Luckily, this is not the case. The test, however, embodied the kubelet behavior
at the time, which was in turn caused by #103979:
The device manager used to record the last pod which attempted admission and
forcibly add it to the list of active pods. The retention logic had room for
exactly one pod: the last one which attempted admission.
This retention prevented the cleanup code
(see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549
compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549)
from clearing the registration, so the device was still (mis)reported as
allocated to the failed pod.
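Roughly, the old pattern looked like this (a minimal, self-contained sketch with
simplified, made-up names and types, not the actual kubelet code):

    package main

    import "fmt"

    // deviceManager is a simplified stand-in for the kubelet device manager.
    type deviceManager struct {
        // devices currently recorded as assigned, keyed by pod UID
        podDevices map[string][]string
        // the single, last pod which attempted admission ("forced retention")
        pendingAdmissionPodUID string
    }

    // cleanup releases the devices of pods that are no longer active, but it
    // always keeps the pending-admission pod, even if its admission failed.
    func (m *deviceManager) cleanup(activePodUIDs map[string]bool) {
        for uid := range m.podDevices {
            if activePodUIDs[uid] || uid == m.pendingAdmissionPodUID {
                continue // retained
            }
            delete(m.podDevices, uid)
        }
    }

    func main() {
        m := &deviceManager{
            podDevices:             map[string][]string{"failed-pod": {"dev0"}},
            pendingAdmissionPodUID: "failed-pod",
        }
        // "failed-pod" is not active, yet its assignment survives the cleanup.
        m.cleanup(map[string]bool{})
        fmt.Println(m.podDevices) // map[failed-pod:[dev0]]
    }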
This fact was in turn leveraged by the test in question:
the test uses the podresources API to learn about the device assignment,
and, because of the chain of events above, the pod failed admission yet
was still reported as owning the device.
What happened, however, was that the next pod attempting admission would
replace the previous pod in the device manager data, so the previous
pod was no longer forcibly added to the active list and its
assignment was correctly cleared once the cleanup code ran.
The cleanup code runs, among other things, every time the device
manager is asked to allocate devices and every time the podresources API
queries the device assignment.
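For reference, this is roughly how the device assignment can be queried through
the podresources API, which is what the test relies on (the socket path is the
default one and may differ on a given setup; this is a sketch, not the test code):

    package main

    import (
        "context"
        "fmt"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
    )

    func main() {
        // Assumption: the default kubelet pod-resources socket; adjust if needed.
        const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

        conn, err := grpc.Dial(socket, grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            panic(err)
        }
        defer conn.Close()
        client := podresourcesv1.NewPodResourcesListerClient(conn)

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // List reports the per-pod, per-container device assignment as the
        // device manager currently sees it.
        resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
        if err != nil {
            panic(err)
        }
        for _, pod := range resp.GetPodResources() {
            for _, cnt := range pod.GetContainers() {
                for _, dev := range cnt.GetDevices() {
                    fmt.Printf("%s/%s [%s]: %s -> %v\n",
                        pod.GetNamespace(), pod.GetName(), cnt.GetName(),
                        dev.GetResourceName(), dev.GetDeviceIds())
                }
            }
        }
    }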
Later, in PR https://github.com/kubernetes/kubernetes/pull/120661,
the forced retention logic was removed from all the resource managers,
thus also from the device manager, and this is what caused the permafailure.
Because of all the above, it should be evident that the e2e test was
actually enforcing a very specific behavior that was not really working
as intended, and which was also quite puzzling for users.
The best we can do is to fix the test to record and ensure that
pods which failed admission _do not_ retain their device assignment.
Unfortunately, we _cannot_ guarantee the desirable property that
pods which went running retain their device assignment across node reboots.
In the kubelet restart flow, all pods race to be admitted. There is no
order enforced between device plugin pods and application pods.
The property only holds if an application pod is lucky enough to _lose_
the race with both the device plugin (which must go running before the
app pod does) and _also_ with the kubelet (which needs to mark the devices
healthy before the app pod attempts admission).
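The kind of check the fixed test can perform looks roughly like this (a
hypothetical helper over a podresources List response, not the actual test code):

    package sketch

    import (
        "fmt"

        podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
    )

    // assertNoDevicesForPod is a hypothetical helper: given a podresources List
    // response, verify that a pod which failed admission is not reported as
    // owning any device.
    func assertNoDevicesForPod(resp *podresourcesv1.ListPodResourcesResponse, ns, name string) error {
        for _, pod := range resp.GetPodResources() {
            if pod.GetNamespace() != ns || pod.GetName() != name {
                continue
            }
            for _, cnt := range pod.GetContainers() {
                if len(cnt.GetDevices()) > 0 {
                    return fmt.Errorf("pod %s/%s failed admission but container %q still owns devices",
                        ns, name, cnt.GetName())
                }
            }
        }
        return nil
    }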
Signed-off-by: Francesco Romani <fromani@redhat.com>
Previously, ValidateNodeSelector did not check that labels are valid. Now it
does for resource.k8s.io, regardless of whether an object was already created with
invalid labels in an earlier Kubernetes release. Theoretically this is a
breaking change and could cause problems during an upgrade, but that is highly
unlikely in practice.
In contrast to node affinity, DRA does not ignore parse errors
(i.e. it uses NewNodeSelector, not NewLazyErrorNodeSelector), so invalid labels
would have been found instead of being silently ignored.
Even if some object has invalid labels, this only affects an alpha -> beta
upgrade which isn't guaranteed to work seamlessly.
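For illustration, a minimal sketch of the NewNodeSelector vs.
NewLazyErrorNodeSelector difference mentioned above, using the helpers in
k8s.io/component-helpers/scheduling/corev1/nodeaffinity (the invalid label key
is made up; details of the helpers are from memory, so treat this as a sketch):

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
        "k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
    )

    func main() {
        // A selector with a deliberately invalid label key.
        sel := &v1.NodeSelector{
            NodeSelectorTerms: []v1.NodeSelectorTerm{{
                MatchExpressions: []v1.NodeSelectorRequirement{{
                    Key:      "not a valid label key!",
                    Operator: v1.NodeSelectorOpExists,
                }},
            }},
        }

        // DRA-style handling: the parse error surfaces immediately.
        if _, err := nodeaffinity.NewNodeSelector(sel); err != nil {
            fmt.Println("NewNodeSelector:", err)
        }

        // Node-affinity-style handling: construction always succeeds; the
        // invalid term is skipped and its error only shows up at Match time.
        lazy := nodeaffinity.NewLazyErrorNodeSelector(sel)
        matched, err := lazy.Match(&v1.Node{})
        fmt.Println("LazyErrorNodeSelector:", matched, err)
    }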
Pruning of tests to the top-level test was added for jobs like
pull-kubernetes-unit which run many tests. For other, more focused jobs like
scheduler-perf benchmarking it would be nice to keep the more detailed
information, in particular because it includes the duration per test case.
The "disabled by label filter" message for benchmarks printed the pointer to
the filter string, not the filter string itself. This mistake gets avoided and
the code becomes simpler when not using pointers.
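Illustrative only (variable names are made up), this is the classic Go pattern
behind such a message:

    package main

    import "fmt"

    func main() {
        labelFilter := "perf" // made-up value, for illustration
        filterPtr := &labelFilter

        // Bug pattern: formatting the pointer prints an address, e.g. 0xc000014070.
        fmt.Printf("disabled by label filter %v\n", filterPtr)

        // Dereference it (or better, don't pass a pointer around at all).
        fmt.Printf("disabled by label filter %v\n", *filterPtr)
    }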
After removing a pod in the port-forward test we wait for an error from the POST
request. Since the POST doesn't have a timeout, it hangs indefinitely and
instead we hit the DefaultPodDeletionTimeout. To make sure the POST
fails, this adds a timeout to ensure we'll always get the expected
error rather than nil.
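A minimal sketch of the idea (URL and payload are placeholders, not the actual
test code): giving the HTTP client a timeout means a hung POST returns an error
instead of blocking until some outer timeout fires.

    package main

    import (
        "fmt"
        "net/http"
        "strings"
        "time"
    )

    func main() {
        client := &http.Client{Timeout: 10 * time.Second}

        resp, err := client.Post("http://127.0.0.1:8080/data", "text/plain", strings.NewReader("ping"))
        if err != nil {
            // With the timeout in place, a hung connection surfaces here as an
            // error instead of blocking indefinitely.
            fmt.Println("POST failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }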
Signed-off-by: Maciej Szulik <soltysh@gmail.com>