<!-- END MUNGE: UNVERSIONED_WARNING -->

# Flaky tests

Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the
tests in various combinations, even a small percentage of flakes results in a
lot of pain for people waiting for their PRs to merge.

Therefore, it's very important that we write tests defensively. Situations that
"almost never happen" happen with some regularity when run thousands of times in
resource-constrained environments. Since flakes can often be quite hard to
reproduce while still being common enough to block merges occasionally, it's
additionally important that the test logs be useful for narrowing down exactly
what caused the failure.

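For example, a test that waits for a condition can log what it observes on
every poll, so that when it eventually flakes, the failure output shows the
state leading up to it rather than a bare timeout. A minimal sketch of the
pattern (the `getPodPhase` helper is hypothetical, standing in for a real API
call):

```
package example

import (
	"testing"
	"time"
)

// getPodPhase is a hypothetical stand-in for a real API call.
var getPodPhase = func(name string) (string, error) { return "Running", nil }

// TestPodStartup polls defensively and logs every observation, so a rare
// failure leaves enough context in the logs to narrow down the cause.
func TestPodStartup(t *testing.T) {
	deadline := time.Now().Add(30 * time.Second)
	for attempt := 1; ; attempt++ {
		phase, err := getPodPhase("test-pod")
		// Record each poll, not just the final failure line.
		t.Logf("attempt %d: phase=%q err=%v", attempt, phase, err)
		if err == nil && phase == "Running" {
			return
		}
		if time.Now().After(deadline) {
			t.Fatalf("pod never reached Running; last phase=%q err=%v", phase, err)
		}
		time.Sleep(time.Second)
	}
}
```
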
Note that flakes can occur in unit tests, integration tests, or end-to-end
tests, but probably occur most commonly in end-to-end tests.

## Filing issues for flaky tests

Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.

1. Search for the test name. If you find an open issue and you're 90% sure the
   flake is exactly the same, add a comment instead of making a new issue.
2. If you make a new issue, you should title it with the test name, prefixed by
   "e2e/unit/integration flake:" (whichever is appropriate).
3. Reference any old issues you found in step one.
4. Paste, in block quotes, the entire log of the individual failing test, not
   just the failure line.
5. Link to durable storage with the rest of the logs. This means (for all the
   tests that Google runs) the GCS link is mandatory! The Jenkins test result
   link is nice but strictly optional: not only does it expire more quickly,
   it's not accessible to non-Googlers.

## Expectations when a flaky test is assigned to you

Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help
from anyone in fixing these, but due to the severity of the problem when merges
are blocked, we need reasonably quick turn-around time on test flakes. Therefore
we have the following guidelines:

1. If a flaky test is assigned to you, it's more important than anything else
   you're doing unless you can get a special dispensation (in which case it will
   be reassigned). If you have too many flaky tests assigned to you, or you
   have such a dispensation, then it's *still* your responsibility to find new
   owners (this may just mean giving stuff back to the relevant Team or SIG Lead).
2. You should make a reasonable effort to reproduce it. Somewhere between an
   hour and half a day of concentrated effort is "reasonable". It is perfectly
   reasonable to ask for help!
3. If you can reproduce it (or it's obvious from the logs what happened), you
   should then be able to fix it, or in the case where someone is clearly more
   qualified to fix it, reassign it with very clear instructions.
4. If you can't reproduce it: __don't just close it!__ Every time a flake comes
   back, at least 2 hours of merge time is wasted. So we need to make monotonic
   progress towards narrowing it down every time a flake occurs. If you can't
   figure it out from the logs, add log messages that would have helped you
   figure it out (see the sketch below).

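For instance, if a test times out waiting on concurrent work, dumping every
goroutine's stack in the failure path usually shows exactly what was stuck the
next time the flake fires. A minimal sketch (the `awaitDone` helper is
hypothetical; the `done` channel stands in for whatever the test waits on):

```
package example

import (
	"runtime"
	"testing"
	"time"
)

// awaitDone fails with a full goroutine dump on timeout, so a flake that
// can't be reproduced locally still leaves a trail in the CI logs.
func awaitDone(t *testing.T, done <-chan struct{}) {
	select {
	case <-done:
	case <-time.After(30 * time.Second):
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true: include all goroutines
		t.Fatalf("timed out waiting; goroutine dump:\n%s", buf[:n])
	}
}
```
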
# Reproducing unit test flakes

Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).

Just install it:

```
$ go install golang.org/x/tools/cmd/stress
```

Then build your test binary:

```
$ godep go test -c -race
```

Then run it under stress:

```
$ stress ./package.test -test.run=FlakyTest
```

It runs the command and writes output to `/tmp/gostress-*` files when it fails.
It periodically reports with run counts. Be careful with tests that use the
`net/http/httptest` package; they could exhaust the available ports on your
system!

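Each `httptest` server binds a fresh ephemeral port, which is why thousands of
back-to-back runs under `stress` can run the machine out of ports even when
every test is well behaved. Closing servers promptly at least keeps the count
bounded; a minimal sketch of the usual pattern:

```
package example

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestHandler(t *testing.T) {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	// Close the server so its listener is released as soon as the test ends;
	// leaking it makes port exhaustion under stress far more likely.
	defer ts.Close()

	resp, err := http.Get(ts.URL)
	if err != nil {
		t.Fatalf("GET %s: %v", ts.URL, err)
	}
	resp.Body.Close()
}
```
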
# Hunting flaky unit tests in Kubernetes

Sometimes unit tests are flaky. This means that due to (usually) race conditions, they will occasionally fail, even though most of the time they pass.

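A contrived sketch of the pattern (hypothetical, for illustration): the first
test below sleeps for a duration that is "usually long enough", so it passes
most of the time but can fail on a loaded machine; the second synchronizes
properly and is deterministic. Running either under `go test -race` flags the
unsynchronized access immediately.

```
package example

import (
	"sync"
	"testing"
	"time"
)

// TestCounterFlaky is the buggy pattern: sleeping "long enough" races the
// goroutine, so under load the write may not have happened yet.
func TestCounterFlaky(t *testing.T) {
	counter := 0
	go func() { counter++ }() // unsynchronized write: data race
	time.Sleep(10 * time.Millisecond)
	if counter != 1 {
		t.Errorf("counter = %d, want 1", counter)
	}
}

// TestCounterFixed waits for the goroutine, making the ordering
// deterministic; the flake disappears.
func TestCounterFixed(t *testing.T) {
	counter := 0
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		counter++
	}()
	wg.Wait()
	if counter != 1 {
		t.Errorf("counter = %d, want 1", counter)
	}
}
```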