And also in the terminating-namespace log output. This makes it
easier to track down drain-blocking pods, without having to hunt
around in earlier logs for 'evicting pod ...' messages. Before this
change, caller logs might look like:
evicting pod {namespace}/{name}
...
error when waiting for pod "{name}" terminating: global timeout reached: 20s
With this change, they will look like:
evicting pod {namespace}/{name}
...
error when waiting for pod "{name}" in namespace "{namespace}" to terminate: global timeout reached: 20s
There is a test in wait.sh integration suite which is checking the
given timeout value(passed by user) is equal to actual happened timeout value.
However, this test rarely gets `no matching resources found` error and
causes flakyiness. The reason is we are running wait command, immediately
after applying deployment. In reality, timeout test does not care about
deployment, since it is testing the timeout by passing invalid configurations.
But we need this deployment to not get `no matching resources found` error.
That's why, this PR adds deployment assertion before executing wait command.
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).
Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with
sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)
This may be unnecessary in some cases, but it's not wrong.
Instead of pod responses being printed to the log each time polling fails, we
get a consolidated failure message with all unexpected pod responses if (and
only if) the check times out or a progress report gets produced.
This renames PodsResponding to WaitForPodsResponding for the sake of
consistency and adds a timeout parameter. That is necessary because some other
users of NewProxyResponseChecker used a much lower timeout (2min vs. 15min).
Besides simplifying some code, it also makes it easier to rewrite
ProxyResponseChecker because it only gets used in WaitForPodsResponding.
WaitForPodToDisappear was always called such that it listed all pods, which
made it less efficient than trying to get just the one pod it was checking for.
Being able to customize the poll interval in practice wasn't useful, therefore
it can be replaced with WaitForPodNotFoundInNamespace.
WaitForPods is now a generic function which lists pods and then checks the pods
that it found against some provided condition. A parameter determines how many
pods must be found resp. match the condition for the check to succeed.
The code becomes simpler (78 insertions, 91 deletions), easier to read (all
code entirely inside WaitForPodsRunningReady, no need to declare and later
overwrite variables) and possibly more correct (if all API calls failed,
the resulting error was ignored when allowedNotReadyPods > 0).
None of the users of the functions passed anything other than nil or an empty
map and the implementation ignore the parameter - it seems like a candidate for
simplification.
When a Gomega failure is converted to an error, the stack at the time when the
failure occurs may be useful: error wrapping provides some bread crumbs that
can be followed to determine where the failure really occurred, but error
wrapping may be missing or ambiguous.
To provide the additional information, a FailureError now includes a full stack
backtrace. The backtrace intentionally makes no attempt to exclude framework
functions besides the gomega support itself because helpers like
e2e/framework/pod may be relevant.
That backtrace is not included in the failure message for the sake of
brevity. Instead, it gets logged as part of the test's output.
gomega.Eventually provides better progress reports: instead of filling up the
log with rather useless one-line messages that are not enough to to understand
the current state, it integrates with Gingko's progress reporting (SIGUSR1,
--poll-progress-after) and then dumps the same complete failure message as
after a timeout. That makes it possible to understand why progress isn't
getting made without having to wait for the timeout.
The other advantage is that the failure message for some unexpected pod state
becomes more readable: instead of encapsulating it as "observed object" inside
an error, it directly gets rendered by gomega.
Calling gomega.Expect/Eventually/Consistently deep inside a helper call chain
has several challenges:
- the stack offset must be tracked correctly, otherwise the callstack
for the failure starts at some helper code, which is often not informative
- augmenting the failure message with additional information from each
caller implies that each caller must pass down a string and/or format
string plus arguments
Both challenges can be solved by returning errors:
- the stacktrace is taken at that level where the error is
treated as a failure instead of passing back an error, i.e.
inside the It callback
- traditional error wrapping can add additional information, if
desirable
What was missing was some easy way to generate an error via a gomega
assertion. The new infrastructure achieves that by mirroring the
Gomega/Assertion/AsyncAssertion interfaces with errors as return values instead
of calling a fail handler.
It is intentionally less flexible than the gomega APIs:
- A context must be passed to Eventually/Consistently as first
parameter because that is needed for proper timeout handling.
- No additional text can be added to the failure through this
API because error wrapping is meant to be used for this.
- No need to adjust the callstack offset because no backtrace
is recorded when a failure occurs.
To avoid the useless "unexpected error" log message when passing back a gomega
failure, ExpectNoError gets extended to recognize such errors and then skips
the logging.
Calling WaitForPodTerminatedInNamespace after testFlexVolume is useless because
the client pod that it waits for always gets deleted by testVolumeClient:
0fcc3dbd55/test/e2e/framework/volume/fixtures.go (L541-L546)
Worse, because WaitForPodTerminatedInNamespace treats "not found" as "must keep
polling", these two tests always kept waiting for 5 minutes:
Kubernetes e2e suite: [It] [sig-storage] Flexvolumes should be mountable
when non-attachable 6m4s
The only reason why these tests passed is that WaitForPodTerminatedInNamespace
used to return the "not found" API error. That is not guaranteed and about to
change.