In the `should correctly account for terminated pods after restart`, the
test first creates a set of `restartNever` pods, followed by a set of
`restartAlways` pods. Both the `restartNever` and `restartAlways` pods
request an entire CPU. As a result, the `restartAlways` pods will not be
admitted, if the `restartNever` pods did not terminate yet.
Depending on the timing/how fast the pods terminate, the test can pass
sometimes fail which results in flakes. To de-flake the test, the test
should wait until the `restartNever` pods enter a terminal `Succeeded`
phase, before creating the `restartAlways` pods.
To do this, generalize the function `waitForPods` to accept a pod
condition (`testutils.PodRunningReadyOrSucceeded`, or
`testutils.PodSucceeded`). Also introduce a new "Succeeded" pod
condition, so the test can explicitly wait until the pods enter the
Succeeded phase.
Signed-off-by: David Porter <david@porter.me>
Most parameters can be passed to both the CLI and the suite, but some
(for example, --ginkgo.slow-spec-threshold) had no effect when only
passed to the suite.
would fllake .04% of the time on my machine.
In tests waiting for objects to be reconciled, would erroneously treat the "Not Found" case as an error rather than waiting a bit.
also add some more context to test errors to improve debuggability
Currently, there are some unit tests that are failing on Windows due to
various reasons:
- paths not properly joined (filepath.Join should be used).
- Proxy Mode IPVS not supported on Windows.
- DeadlineExceeded can occur when trying to read data from an UDP
socket. This can be used to detect whether the port was closed or not.
- In Windows, with long file name support enabled, file names can have
up to 32,767 characters. In this case, the error
windows.ERROR_FILENAME_EXCED_RANGE will be encountered instead.
- files not closed, which means that they cannot be removed / renamed.
- time.Now() is not as precise on Windows, which means that 2
consecutive calls may return the same timestamp.
- path.Base() will return the same path. filepath.Base() should be used
instead.
- path.Join() will always join the paths with a / instead of the OS
specific separator. filepath.Join() should be used instead.
Currently, when running node e2e it's not possible to use the ginkgo `--repeat`
flag to run the test suite multiple times. This is useful when debugging tests
and ensuring they are not flaky by re-running them several times. Currently if
using `--repeat` ginkgo flag, the 2nd run of the test will fail due to kubelet
not starting with message like:
```
Failed to start transient service unit: Unit kubelet-20221020T040841.service already exists.
```
This is because during the test startup, kubelet is started as a transient unit
file via `systemd-run`. The unit is started with the `--remain-after-exit` flag
to ensure that the unit will remain even if the kubelet is restarted. The test
suite currently uses `systemd kill` command to stop kubelet. This works fine for
stopping the kubelet, but on the second run, when `systemd-run` is used to start
systemd unit again it will fail because the unit already exists. This is because
`systemd kill` will not delete the systemd unit, only send SIGTERM signal to it.
To fix this, add `unitName` as a field to the `server` struct. When
kubelet server is constructed, set the unit name. As part of e2e test
termination, in `E2EServices.Stop()``, stop the kubelet systemd unit. By
stopping the kubelet systemd unit, systemd will delete the systemd
transient unit, allowing it to be created and started again in a
subsequent e2e run.
Signed-off-by: David Porter <david@porter.me>