Previously, ValidateNodeSelector did not check that labels are valid. Now it
does for resource.k8s.io, regardless of whether an object was already created
with invalid labels in an earlier Kubernetes release. Theoretically this is a
breaking change and could cause problems during an upgrade, but that is highly
unlikely in practice.
In contrast to node affinity, DRA does not ignore parse errors
(= uses NewNodeSelector, not NewLazyErrorNodeSelector), so invalid labels would
have been found instead of being silently ignored.
Even if some object has invalid labels, this only affects an alpha -> beta
upgrade, which isn't guaranteed to work seamlessly.
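As an illustration of eager validation that reports every invalid label value
instead of silently ignoring it, here is a minimal stdlib-only sketch. It is
not the ValidateNodeSelector code: validateLabelValues and labelValueRE are
hypothetical names, and the pattern only approximates the documented label
value syntax (at most 63 characters, empty or alphanumeric at both ends, with
'-', '_' and '.' allowed in between).

```go
package main

import (
	"fmt"
	"regexp"
)

// maxLabelValueLen is the documented length limit for a label value.
const maxLabelValueLen = 63

// labelValueRE is an assumption for this sketch: empty, or alphanumeric at
// both ends with '-', '_' and '.' allowed in between.
var labelValueRE = regexp.MustCompile(`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$`)

// validateLabelValues (hypothetical helper) checks every value eagerly and
// returns one error per invalid value, so nothing is silently skipped.
func validateLabelValues(values []string) []error {
	var errs []error
	for _, v := range values {
		if len(v) > maxLabelValueLen {
			errs = append(errs, fmt.Errorf("label value %q: longer than %d characters", v, maxLabelValueLen))
			continue
		}
		if !labelValueRE.MatchString(v) {
			errs = append(errs, fmt.Errorf("label value %q: invalid characters", v))
		}
	}
	return errs
}

func main() {
	for _, err := range validateLabelValues([]string{"zone-a", "", "bad value!", "-x"}) {
		fmt.Println(err)
	}
}
```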
Note that these tests will now take more time to run, as they rely on scale up and scale down to prepare the test case and restore the cluster state.
Remove all the GKE- and GCE-specific tests, including:
- GPUs
- volumes (no way to provision volumes without provider specific
infrastructure)
- scale up/down from/to 0
- tests checking what happens after breaking nodes (no way to simulate
temporary network failure without provider assumptions)
Remove the scalability tests that were neither run nor maintained.
Update the autoscaler version that is used by the tests.
Update the autoscaler status parsing logic so that the tests pass with
the newer version of the autoscaler.
We have had a zero-flake policy for a long time now (> 1 year), see https://github.com/kubernetes/community/pull/7538. However, there are some places that still tolerate flakes and retry.
Flakes do not help, to the point that when we have to make a hard decision they create more uncertainty.
No matter the cause, we should always be able to deal with flakes:
- if the software or algorithm is racy, we need to work on making it deterministic
- if it is deterministic, the test must be deterministic
- if the test is deterministic but depends on the environment, then we work on making the environment deterministic
Pruning of tests to the top-level test was added for jobs like
pull-kubernetes-unit which run many tests. For other, more focused jobs like
scheduler-perf benchmarking it would be nice to keep the more detailed
information, in particular because it includes the duration per test case.
The "disabled by label filter" message for benchmarks printed the pointer to
the filter string, not the filter string itself. This mistake gets avoided and
the code becomes simpler when not using pointers.
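The pointer mistake is easy to reproduce in isolation. The following sketch
uses hypothetical helper names (describeSkipPtr, describeSkip) and a made-up
filter value; it only shows how formatting a *string differs from formatting
the string itself and is not the benchmark code.

```go
package main

import "fmt"

// describeSkipPtr reproduces the mistake: the *string itself is formatted,
// so the message contains a memory address instead of the filter text.
func describeSkipPtr(filter *string) string {
	return fmt.Sprintf("disabled by label filter %v", filter)
}

// describeSkip takes the filter by value, so the text itself is printed.
func describeSkip(filter string) string {
	return fmt.Sprintf("disabled by label filter %q", filter)
}

func main() {
	filter := "performance" // hypothetical filter value
	fmt.Println(describeSkipPtr(&filter))
	fmt.Println(describeSkip(filter))
}
```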
After removing a pod in the port-forward test, we wait for an error from the
POST request. Since the POST doesn't have a timeout, it hangs indefinitely, so
instead we hit DefaultPodDeletionTimeout. To make sure the POST fails, this
adds a timeout so that we always get the expected error rather than nil.
Signed-off-by: Maciej Szulik <soltysh@gmail.com>