would fllake .04% of the time on my machine.
In tests waiting for objects to be reconciled, would erroneously treat the "Not Found" case as an error rather than waiting a bit.
also add some more context to test errors to improve debuggability
Currently, there are some unit tests that are failing on Windows due to
various reasons:
- paths not properly joined (filepath.Join should be used).
- Proxy Mode IPVS not supported on Windows.
- DeadlineExceeded can occur when trying to read data from an UDP
socket. This can be used to detect whether the port was closed or not.
- In Windows, with long file name support enabled, file names can have
up to 32,767 characters. In this case, the error
windows.ERROR_FILENAME_EXCED_RANGE will be encountered instead.
- files not closed, which means that they cannot be removed / renamed.
- time.Now() is not as precise on Windows, which means that 2
consecutive calls may return the same timestamp.
- path.Base() will return the same path. filepath.Base() should be used
instead.
- path.Join() will always join the paths with a / instead of the OS
specific separator. filepath.Join() should be used instead.
Currently, when running node e2e it's not possible to use the ginkgo `--repeat`
flag to run the test suite multiple times. This is useful when debugging tests
and ensuring they are not flaky by re-running them several times. Currently if
using `--repeat` ginkgo flag, the 2nd run of the test will fail due to kubelet
not starting with message like:
```
Failed to start transient service unit: Unit kubelet-20221020T040841.service already exists.
```
This is because during the test startup, kubelet is started as a transient unit
file via `systemd-run`. The unit is started with the `--remain-after-exit` flag
to ensure that the unit will remain even if the kubelet is restarted. The test
suite currently uses `systemd kill` command to stop kubelet. This works fine for
stopping the kubelet, but on the second run, when `systemd-run` is used to start
systemd unit again it will fail because the unit already exists. This is because
`systemd kill` will not delete the systemd unit, only send SIGTERM signal to it.
To fix this, add `unitName` as a field to the `server` struct. When
kubelet server is constructed, set the unit name. As part of e2e test
termination, in `E2EServices.Stop()``, stop the kubelet systemd unit. By
stopping the kubelet systemd unit, systemd will delete the systemd
transient unit, allowing it to be created and started again in a
subsequent e2e run.
Signed-off-by: David Porter <david@porter.me>
The original intention was to address "frustration of end users running the e2e
suite is that they take a significant amount of time and it is difficult to
gauge progress".
But Ginkgo's output is different now than it was in Kubernetes 1.19. If users
want to see progress, then "ginkgo --progress" might provide enough
information.
Printing to os.Stdout doesn't work as intended anyway when output redirection
is enabled (the default for parallel runs) and causes these JSON snippets to
appear as "show stdout" for each failed test in a Prow job, which is
distracting.
Tests should accept a context from Ginkgo and pass it through to all functions
which may block for a longer period of time. In particular all Kubernetes API
calls through client-go should use that context. Then if a timeout occurs,
the test returns immediately because everything that it could block on will
return.
Cleanup code then needs to run in a separate Ginkgo node, typically
DeferCleanup, which ensures that it gets a separate context which has not timed
out yet.