The device_plugin_tests have not run successfully in a very long time,
initially being marked flaky and then eventually becoming stale.
The gpu_device_plugin_tests have been used to test the same behaviour,
but are incredibly high maintenance due to external changes in behaviour
from GCP/Nvidia that we have no control over.
This commit takes the existing device plugin tests, makes them look more
like the GPU tests, and removes the cases that have been unsupported for
a long time (namely restarting containers while the plugin is
unavailable).
It also removes the GPU plugin tests, as we do not get more signal by
using real devices here.
In a multi-NUMA-node environment, the kernel splits hugepages allocated through the
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages file equally between NUMA nodes.
That makes it harder to predict where several pods will start, because the number
of hugepages on each NUMA node depends on the number of NUMA nodes in the environment.
The memory manager test now allocates hugepages on a specific NUMA node to make
the test more predictable in multi-NUMA-node environments.
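
As an illustration only (not the code in this commit), per-node allocation can
be sketched by writing to the node-specific sysfs file instead of the global
one. The helper names below are hypothetical, and the write needs root on a
machine that actually has the given NUMA node:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// nodeHugepagesPath returns the sysfs file that controls how many hugepages
// of the given size (in kB) are reserved on a single NUMA node.
// Hypothetical helper, named for this sketch only.
func nodeHugepagesPath(numaNode, pageSizeKB int) string {
	return fmt.Sprintf(
		"/sys/devices/system/node/node%d/hugepages/hugepages-%dkB/nr_hugepages",
		numaNode, pageSizeKB)
}

// setNodeHugepages asks the kernel to reserve count hugepages on one NUMA
// node. It requires root and fails if the node or page size does not exist.
func setNodeHugepages(numaNode, pageSizeKB, count int) error {
	data := []byte(strconv.Itoa(count))
	return os.WriteFile(nodeHugepagesPath(numaNode, pageSizeKB), data, 0o644)
}

func main() {
	// Only print the target path here; calling setNodeHugepages would
	// change the state of the machine.
	fmt.Println(nodeHugepagesPath(0, 2048))
}
```

Writing to the per-node file keeps the resulting layout independent of how
many NUMA nodes the machine has, which is the property the test relies on.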
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
Let's wait for the local node (aka the kubelet)
to be ready before querying podresources again,
to avoid false negatives.
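
A minimal sketch of the wait-then-query idea, assuming a plain polling loop
rather than the e2e framework helpers the test really uses. pollReady and the
nodeReady closure are hypothetical stand-ins:

```go
package main

import (
	"fmt"
	"time"
)

// pollReady calls ready up to attempts times, sleeping interval between
// calls, and reports whether ready ever returned true.
// Hypothetical helper, named for this sketch only.
func pollReady(ready func() bool, attempts int, interval time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if ready() {
			return true
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	// Stand-in for a real check of the node's Ready condition:
	// it reports ready on the third call.
	calls := 0
	nodeReady := func() bool {
		calls++
		return calls >= 3
	}

	if !pollReady(nodeReady, 10, 10*time.Millisecond) {
		fmt.Println("node never became ready, skipping the podresources query")
		return
	}
	fmt.Println("node ready, safe to query podresources")
}
```

Querying only after the readiness check passes avoids reading podresources
state from a kubelet that is still restarting.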
Co-authored-by: Artyom Lukianov <alukiano@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
DKC is being removed and we don't want it to continue flaking the rest
of our tests. Let's disable them when DKC is disabled rather than hard
failing. This fits more in line with our other E2Es, and reduces the
maintenance load in test-infra.
We need to make sure the system state is completely cleaned up
before we transition from one test to another, to avoid
messing up the shared node state.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Since commit 42dd01aa3f the cpuRequest is in millicores, hence
we need to translate it to exclusive CPUs properly
when verifying the resource allocation.
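
For illustration, the translation can be sketched as below. This assumes the
static CPU manager policy, where only whole-core requests from Guaranteed pods
get exclusive CPUs; exclusiveCPUs is a hypothetical name, not the helper used
in the test:

```go
package main

import "fmt"

// exclusiveCPUs converts a CPU request in millicores into the number of
// exclusive CPUs expected for it. Fractional or non-positive requests map
// to 0, since they are served from the shared pool under the assumed policy.
// Hypothetical helper, named for this sketch only.
func exclusiveCPUs(milliCPU int64) int {
	if milliCPU <= 0 || milliCPU%1000 != 0 {
		return 0
	}
	return int(milliCPU / 1000)
}

func main() {
	for _, m := range []int64{500, 1000, 1500, 2000} {
		fmt.Printf("%dm -> %d exclusive CPUs\n", m, exclusiveCPUs(m))
	}
}
```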
Signed-off-by: Francesco Romani <fromani@redhat.com>
The intent is to make the code more readable, with no intended
changes in behaviour. It should now be more explicit
why the code is checking some values.
Signed-off-by: Francesco Romani <fromani@redhat.com>
* Bump the pod status and node status update timeouts to avoid flakes
* Add a small delay after dbus restart to ensure dbus has enough time to
start up before the shutdown signal is sent
* Change check of pod being terminated by graceful shutdown. Previously,
the pod phase was checked to see if it was `Failed` and the pod reason
string matched. That logic no longer holds after the 1.22 graceful node
shutdown change in PR #102344, which stopped putting pods into a failed
phase. Instead, the test now checks
that containers are not ready, and the pod status message and reason
are set appropriately.
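
The new check can be sketched roughly as follows. The types are simplified,
hypothetical stand-ins for the Kubernetes API types, and the expected reason
and message are passed in as parameters because their exact values are not
stated here:

```go
package main

import "fmt"

// containerStatus and podStatus are simplified stand-ins for the real
// API types, defined only for this sketch.
type containerStatus struct {
	Ready bool
}

type podStatus struct {
	Reason     string
	Message    string
	Containers []containerStatus
}

// terminatedByShutdown reports whether a pod looks like it was terminated by
// graceful node shutdown: no container is ready, and the status reason and
// message match the expected values. It does not look at the pod phase.
func terminatedByShutdown(s podStatus, wantReason, wantMessage string) bool {
	for _, c := range s.Containers {
		if c.Ready {
			return false
		}
	}
	return s.Reason == wantReason && s.Message == wantMessage
}

func main() {
	// Placeholder strings, not the values the kubelet actually sets.
	const reason, message = "ExampleShutdownReason", "example shutdown message"

	s := podStatus{
		Reason:     reason,
		Message:    message,
		Containers: []containerStatus{{Ready: false}},
	}
	fmt.Println(terminatedByShutdown(s, reason, message))
}
```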
Signed-off-by: David Porter <david@porter.me>
This commit forces Kubelet Configuration files to always be generated
and, when possible, uses the kubeletconfig file that has been provided
by the test orchestrator.
This commit enables the remote runner to provide a KubeletConfiguration
file to the test suite when uploading it to a remote host. The test
runner will then use this configuration to run the Kubelet with the
provided config.
Add an e2e test to exercise the checkpoint recovery flow.
This means we need to actually create an old (V1, pre-1.20) checkpoint,
but if we do it only in the e2e test, it's still fine.
Signed-off-by: Francesco Romani <fromani@redhat.com>
* Clean up FeatureGate skippers
* Perform changes requested by review
* Some more review-related changes
* Rename skipper functions to make code more readable
* Add utilfeature back in