For now:
- shim FORCE_HOST_GO to GOTOOLCHAIN=local
- treat a GOTOOLCHAIN that is set and != auto like FORCE_HOST_GO
- otherwise set GOTOOLCHAIN=go${GO_VERSION} and fall back to gimme if necessary
TODO: set toolchain statements in go.mod files and keep them in sync
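A minimal sketch of that selection logic in Go (illustrative only: the
actual change lives in the shell helpers under hack/, and
chooseGoToolchain is a made-up name):

// chooseGoToolchain mirrors the selection rules listed above
// (hypothetical helper, not the real implementation).
func chooseGoToolchain(forceHostGo bool, gotoolchain, goVersion string) string {
    switch {
    case forceHostGo:
        // FORCE_HOST_GO is shimmed to GOTOOLCHAIN=local.
        return "local"
    case gotoolchain != "" && gotoolchain != "auto":
        // An explicit, non-auto GOTOOLCHAIN wins, same as FORCE_HOST_GO.
        return gotoolchain
    default:
        // Pin the desired version; fall back to gimme if that
        // toolchain cannot be used.
        return "go" + goVersion
    }
}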
The status error was embedded inside the new error constructed by
WaitForPodsResponding's get function, but not wrapped. Therefore
`apierrors.IsServiceUnavailable(err)` didn't find it and returned false -> no
retries.
Wrapping the error fixes this, and Gomega's formatting of the wrapped error remains useful:
import (
    "fmt"

    "github.com/onsi/gomega/format"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

err := &apierrors.StatusError{}
err.ErrStatus.Code = 503
err.ErrStatus.Message = "temporary failure"
err2 := fmt.Errorf("Controller %s: failed to Get from replica pod %s:\n%w\nPod status:\n%s",
    "foo", "bar",
    err, "some status")
fmt.Println(format.Object(err2, 1))
fmt.Println(apierrors.IsServiceUnavailable(err2))
=>
<*fmt.wrapError | 0xc000139340>:
Controller foo: failed to Get from replica pod bar:
temporary failure
Pod status:
some status
{
    msg: "Controller foo: failed to Get from replica pod bar:\ntemporary failure\nPod status:\nsome status",
    err: <*errors.StatusError | 0xc0001a01e0>{
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "",
            Message: "temporary failure",
            Reason: "",
            Details: nil,
            Code: 503,
        },
    },
}
true
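The fix boils down to the formatting verb used when get constructs that
error, presumably from a non-wrapping verb like %v to %w; a sketch, not
the literal diff (controllerName, pod, podStatus are illustrative names):

// Before: %v flattens the status error into text, the error chain is lost.
return fmt.Errorf("Controller %s: failed to Get from replica pod %s:\n%v\nPod status:\n%s",
    controllerName, pod.Name, err, podStatus)

// After: %w wraps it, so apierrors.IsServiceUnavailable can unwrap and match.
return fmt.Errorf("Controller %s: failed to Get from replica pod %s:\n%w\nPod status:\n%s",
    controllerName, pod.Name, err, podStatus)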
This fixes a test flake:
[sig-node] DRA [Feature:DynamicResourceAllocation] multiple nodes reallocation [It] works
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:552
[FAILED] number of deallocations
Expected
<int64>: 2
to equal
<int64>: 1
In [It] at: /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:651 @ 09/05/23 14:01:54.652
This can be reproduced locally with
stress -p 10 go test ./test/e2e -args -ginkgo.focus=DynamicResourceAllocation.*reallocation.works -ginkgo.no-color -v=4 -ginkgo.v
Log output showed that the sequence of events leading to this was:
- claim gets allocated because of selected node
- a different node has to be used, so PostFilter sets
claim.status.deallocationRequested
- the driver deallocates
- before the scheduler can react and select a different node,
the driver allocates *again* for the original node
- the scheduler asks for deallocation again
- the driver deallocates again (causing the test failure)
- eventually the pod runs
The fix is to disable further allocations first by removing the selected node
and only then to start deallocating, as in the sketch below.
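A minimal sketch of that ordering, assuming the resource.k8s.io/v1alpha2
API and client of this release (error handling trimmed, the function
name is illustrative):

import (
    "context"

    resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deallocateSafely first makes sure the driver cannot allocate for the
// old node again, then triggers deallocation.
func deallocateSafely(ctx context.Context, cs kubernetes.Interface,
    schedulingCtx *resourcev1alpha2.PodSchedulingContext,
    claim *resourcev1alpha2.ResourceClaim) error {
    // Step 1: remove the selected node; without it there is nothing
    // the driver could allocate for.
    schedulingCtx = schedulingCtx.DeepCopy()
    schedulingCtx.Spec.SelectedNode = ""
    if _, err := cs.ResourceV1alpha2().PodSchedulingContexts(schedulingCtx.Namespace).
        Update(ctx, schedulingCtx, metav1.UpdateOptions{}); err != nil {
        return err
    }

    // Step 2: only now ask the driver to deallocate.
    claim = claim.DeepCopy()
    claim.Status.DeallocationRequested = true
    _, err := cs.ResourceV1alpha2().ResourceClaims(claim.Namespace).
        UpdateStatus(ctx, claim, metav1.UpdateOptions{})
    return err
}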
When a plugin was registered as "unschedulable" in some previous scheduling
attempt, it kept that attribute for the pod forever. When that plugin then
later failed with an error that requires backoff, the pod was incorrectly
moved to the "unschedulable" queue, where it got stuck until the periodic
flushing, because there was no event that the plugin was waiting for.
Here's an example where that happened:
framework.go:1280: E0831 20:03:47.184243] Reserve/DynamicResources: Plugin failed err="Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" node="scheduler-perf-dra-7l2v2" plugin="DynamicResources" pod="test/test-dragxd5c"
schedule_one.go:1001: E0831 20:03:47.184345] Error scheduling pod; retrying err="running Reserve plugin \"DynamicResources\": Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" pod="test/test-dragxd5c"
...
scheduling_queue.go:745: I0831 20:03:47.198968] Pod moved to an internal scheduling queue pod="test/test-dragxd5c" event="ScheduleAttemptFailure" queue="Unschedulable" schedulingCycle=9576 hint="QueueSkip"
Pop still needs the information about unschedulable plugins to update the
UnschedulableReason metric, but it can reset that information before
returning the PodInfo for the next scheduling attempt; see the sketch below.
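A minimal sketch of that Pop behavior, assuming the PriorityQueue
internals of this release (simplified, not the actual diff;
UnschedulablePlugins is the sets.Set[string] on framework.QueuedPodInfo):

func (p *PriorityQueue) Pop() (*framework.QueuedPodInfo, error) {
    obj, err := p.activeQ.Pop()
    if err != nil {
        return nil, err
    }
    pInfo := obj.(*framework.QueuedPodInfo)

    // The metric update still needs the plugins recorded by the
    // previous scheduling attempt.
    for plugin := range pInfo.UnschedulablePlugins {
        metrics.UnschedulableReason(plugin, pInfo.Pod.Spec.SchedulerName).Dec()
    }

    // Reset before handing the pod to the next attempt, so a stale
    // "unschedulable" entry cannot misroute the pod when a plugin
    // fails with an error this time.
    pInfo.UnschedulablePlugins.Clear()
    return pInfo, nil
}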