Remove dependencies on internal fieldmanager for admission things. This
is preparing for moving fieldmanager out, but the admission part will
stay here, so it can't depend directly on internal.
This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created pods without any
backoff. This is a regression because the controller used to
create pods with an exponential backoff delay before (10s, 20s, 40s ...).
The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.
Note that the pod failure detection and new pod creation happen in the
same reconcile loop so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied on 2nd pod failure, which means
that the 3rd pod created 10s after the 2nd pod, 4th pod is created 20s
after the 3rd pod and so on.
This commit fixes a few bugs:
1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. Which means that this condition is also triggered each time the
finalizer for a failed pod is removed and `NumRequeues` is reset, which
results in a backoff of 0s.
2. Updates `updatePod` to only apply backoff when we see a particular
pod failed for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for finalizer removal of a failed pod.
3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
- `DefaultJobBackOff` is overwritten to 0 in certain unit tests,
which meant that `DefaultJobBackOff` was considered to be 0,
effectively not running any meaningful checks.
- `JobsReadyPods` was not enabled for test cases that ran tests
which required the feature gate to be enabled.
- The check for expected and actual backoff had incorrect
calculations.
Modify unwrap error utility to make it work with go1.20
This version of Go introduces a new layer of wrapping via
a new error type. The commit accounts for that while being
compatible with go1.19
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
Bumping version to include changes that
better handle TLS errors. Bump nescessary
to prepare for when the version of Go is
bumped to 1.20
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
Marking the pods not ready on a node requires looping over them and
updating each pod's status one at a time. This is performed serially,
and can take a while if we're processing each node serially as well.
Since the time is spent waiting on io, there's an opportunity to go
faster by processing multiple nodes concurrently. This change modifies
the loop to process nodes in parallel, using the same number of workers
as doNodeProcessingPassWorker.
This change also introduces histogram metrics to better observe
monitorNodeHealth.
* Improving the output of tests in case of error
* Better error message
Also, the condition in the second case was reversed
* Fixing 2 tests whose condition was inverted
* Again I got the conditions wrong
* Sorry for the confusion
* Improved error messages on failures