Add unit test with a volume plugin that does not support SELinux. That
simulates a CSi driver whose spec.SELinuxMount is empty or false.
This requires a little refactoring, each unit test now has a flag if it
runs with a volume plugin that supports SELinux.
Reset SELinuxChangePolicy of Pods that have no SELinux label set to
Recursive. Kubelet cannot mount with `-o context=<label>`, if the label is
not known.
This fixes the e2e test error revealed by the previous commit - it changed the
e2e test to check for events when no events are expected and it found a
warning about a Pod with no label, but MountOption policy.
When a Pod reaches its final state (Succeeded or Failed), its volumes are
getting unmounted and therefore their SELinux mount option will not
conflict with any other pod.
Let the SELinux controller monitor "pod updated" events to see the pod is
finished
The comparison of SELinux labels in KCM tolerates missing fields - the
operating system is going to default them from its defaults, but in KCM we
don't know what the defaults are.
But the OS won't default the last component, "level", which includes also
categories. Make sure that labels with a level set conflicts with level "",
that's what will conflict on the OS too.
Once received job deletion event, it cleans the backoff records for that
job before enqueueing this job so that we can avoid a race condition
that the syncJob() may incorrect use stale backoff records for a newly created
job with same key.
Co-authored-by: Michal Wozniak <michalwozniak@google.com>
The timed worker queue actually can have nil entries in its map if the work was
kicked off immediately. This looks like an unnecessary special case (it would
be fine to call AfterFunc with a duration <= 0 and it would do the right
thing), but to avoid more sweeping changes the fix consists of documenting this
special behavior and adding a nil check.
Normally the scheduler shouldn't schedule when there is a taint, but perhaps it
didn't know yet.
The TestEviction/update test covered this, but only failed under the right
timing conditions. The new event handler test case covers it reliably.
Events get recorded in the apiserver asynchronously, so even if the test knows
that the event has been evicted because the pod is deleted, it still has to
also check for the event to be recorded.
This caused a flake in the "Consistently" check of events.
There was one error path that led to a "controller has shut down" log
message. Other errors caused different log entries or are so unlikely (event
handler registration failure!) that they weren't checked at all.
It's clearer to let Run return an error in all cases and then log the
"controller has shut down" error at the call site. This also enables tests to
mark themselves as failed, should that ever happen.
In tests it is sometimes unavoidable to use the Prometheus types directly,
for example when writing a custom gatherer which needs to normalize data
before testing it. device_taint_eviction_test.go does this to strip
out unpredictable data in a histogram.
With type aliases in a package that is explicitly meant for tests we
can avoid adding exceptions for such tests to the global exception list.
The controller is derived from the node taint eviction controller.
In contrast to that controller it tracks the UID of pods to prevent
deleting the wrong pod when it got replaced.
This is a verbatim copy of the current pkg/controller/taintseviction code,
revision fc268ecd09 (v1.33.0 plus one commit),
minus the TimedWorker helper.
The intent is to modify the code such that it enforces eviction of pods which
use tainted devices.