Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. Unlike other controllers, we
drop the replica set create error very late (in the queue handleErr)
in order to avoid changing the structure of the controller
substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large jobs,
we should wait for more evidence that is an issue before changing
that logic substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large daemon sets,
we should wait for more evidence that is an issue before changing
that logic substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large replica sets,
we should wait for more evidence that is an issue before changing
that logic substantially.
In some scenarios the service account and token controllers can
race with namespace deletion, causing a burst of errors as they
attempt to recreate secrets being deleted.
Instead, detect these errors and do not retry.
Clients should be able to identify when a namespace is being terminated and
take special action such as backing off or giving up. Add a helper for
getting the cause of an error and then add a special cause to the forbidden
error that namespace lifecycle admission returns. We can't change the forbidden
reason without potentially breaking older clients and so cause is the
appropriate tool.
Add `StatusCause` and `HasStatusCause` to the errors package to make checking
for causes simpler. Add `NamespaceTerminatingCause` to the v1 API as a constant.
All objects with graceful deletion allow multiple DELETE calls in the pending
state. Namespace is the one outlier, and the error here predates graceful
deletion and finalizers. We should make this behavior consistent with other
calls and merely indicate success and return the state of the object, the
same as if there were pending metadata finalizers.
Clients that previously checked for a conflict error during delete to know
that the server is already deleting will now no longer receive an error
(as if the object were rapidly deleted). There is a small chance that some
clients have error checking for this state, but a much larger chance that
clients that want to trigger a delete of the namespace do not handle this
error case.
Discovered in an e2e test that used the framework namespace and triggered
deletion of that ns itself, and then the AfterEach step in e2e failed
because the namespace was already pending deletion.
A preemption is a disruption event that should have a metric so that
the rate of preemption can be assessed. Nodes that are under heavy
preemption may have conflicting workloads or otherwise need attention.
A sudden burst of preemption on a cluster in steady state could
indicate pathological conditions within the scheduler or workload
controllers.