In older versions of Kubernetes (at least pre-0.19, it's the earliest
this test will run unmodified on), Pods that depended on devices could be
restarted after the device plugin had been removed. Currently however,
this isn't possible, as during ContainerManager.GetResources(), we
attempt to DeviceManager.GetDeviceRunContainerOptions() which fails as
there's no cached endpoint information for the plugin type.
This commit therefore breaks apart the existing test into two:
- One active test that validates that assignments are maintained across
restarts
- One skipped test that validates the behaviour after GPUs have been
removed, in case we decide that this is a bug that should be fixed in
the future.
Prior to this change, the pod was not getting scheduled on the node as
we don't have a running scheduler in e2e_node. PodClient solves this
problem by manually assigning the pod to the node.
The current GPU installer was built in 2017, from source that no longer
exists in Kubernetes ([adding commit][1]. The image was built on 2017-06-13.
Unfortunately, this installer no longer appears to work. When debugging
on the same node type as used by test-infra, it failed to build the
driver as the kernel sha was no longer available.
This lead to needing to find a new way to install GPUs. The smallest
logical change was switching to [cos-gpu-installer][2]
. There is a newer version of this available on [googlesource][3] that
I have not yet tested as it's not clear what the state of the project
is, as I couldn't find docs outside of the source itself.
We install things to the same location as previously to avoid needing
extra downstream changes. There are a couple of weird issues here
however, like needing to run the container twice to correctly update the
LD Cache.
[1]: 1e77594958/cluster/gce/gci/nvidia-gpus/Dockerfile
[2]: https://github.com/GoogleCloudPlatform/cos-gpu-installer
[3]: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/
Add script to verify that net.ParseIP and net.ParseCIDR are
not being used.
Add another script to automatically replace those functions
for the ones forked in k8s.io/utils/net
This updates the k8s.io/util to pull in the fix for
https://github.com/kubernetes/kubernetes/issues/104452.
Commands run:
./hack/pin-dependency.sh k8s.io/utils v0.0.0-20210819203725-bdf08cb9a70a
./hack/update-vendor.sh
The Container Images for Windows Server 2022 have been published, and
we can start building test images using them, so we can start adding
jobs for them.
The image versions for the e2e test images have been bumped in a previous
commit, but haven't been promoted yet. We don't need to bump them here.
We're starting with windows-servercore-cache and busybox images, since
they are needed for the other images the most.
A previous added LD_FLAGS for the go binary compilation, but it's not
defined for all images.
The pods using hostNetwork use the host network namespace, hence
they have to share it with the rest of the process and pods.
If several pods try to bind to the same port, the test will fail,
so we try to use a non common port, and run the different scenario
in the same test, so we only have to bind once and we avoid consuming
ports reducing the port collision risk.