<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.4/docs/devel/flaky-tests.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Flaky tests

Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the
tests in various combinations, even a small percentage of flakes results in a
lot of pain for people waiting for their PRs to merge.

Therefore, it's very important that we write tests defensively. Situations that
"almost never happen" happen with some regularity when tests are run thousands
of times in resource-constrained environments. Since flakes can often be quite
hard to reproduce while still being common enough to block merges occasionally,
it's also important that the test logs be useful for narrowing down exactly
what caused the failure.

Note that flakes can occur in unit tests, integration tests, or end-to-end
tests, but probably occur most commonly in end-to-end tests.

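
One common way to make a check more defensive, offered here as an illustration rather than as an established Kubernetes convention, is to poll for a condition up to a deadline instead of sleeping for a fixed time, and to log enough to explain a timeout. The sketch below assumes a POSIX-like shell; `wait_until` is a hypothetical helper name, and the path in the usage comment is made up.

```sh
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Polls COMMAND about once per second until it succeeds or the deadline passes.
# Progress messages go to stderr so they end up in the test log.
wait_until() {
  timeout_s=$1
  shift
  deadline=$(( $(date +%s) + timeout_s ))
  attempt=1
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "wait_until: gave up after ${attempt} attempts (${timeout_s}s): $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep 1
  done
  echo "wait_until: succeeded on attempt ${attempt}: $*" >&2
}

# Hypothetical usage: wait up to 30 seconds for a marker file to appear.
#   wait_until 30 test -f /tmp/ready
```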
## Filing issues for flaky tests

Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.

1. Search for the test name. If you find an open issue and you're 90% sure the
   flake is exactly the same, add a comment instead of making a new issue.
2. If you make a new issue, you should title it with the test name, prefixed by
   "e2e/unit/integration flake:" (whichever is appropriate).
3. Reference any old issues you found in step one. Also, make a comment in the
   old issue referencing your new issue, because people monitoring only their
   email do not see the backlinks GitHub adds. Alternatively, tag the person or
   people who most recently worked on it.
4. Paste, in block quotes, the entire log of the individual failing test, not
   just the failure line.
5. Link to durable storage with the rest of the logs. This means (for all the
   tests that Google runs) the GCS link is mandatory! The Jenkins test result
   link is nice but strictly optional: not only does it expire more quickly,
   it's not accessible to non-Googlers.

## Finding filed flaky test cases

Find flaky test issues on GitHub under the [kind/flake issue label][flake].
A significant number of flaky tests are reported on a regular basis, and P2
flakes are under-investigated. Fixing flakes is a quick way to gain expertise
and community goodwill.

[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake

## Expectations when a flaky test is assigned to you

Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help
from anyone in fixing these, but due to the severity of the problem when merges
are blocked, we need reasonably quick turn-around time on test flakes. Therefore
we have the following guidelines:

1. If a flaky test is assigned to you, it's more important than anything else
   you're doing unless you can get a special dispensation (in which case it will
   be reassigned). If you have too many flaky tests assigned to you, or you
   have such a dispensation, then it's *still* your responsibility to find new
   owners (this may just mean giving stuff back to the relevant Team or SIG Lead).
2. You should make a reasonable effort to reproduce it. Somewhere between an
   hour and half a day of concentrated effort is "reasonable". It is perfectly
   reasonable to ask for help!
3. If you can reproduce it (or it's obvious from the logs what happened), you
   should then be able to fix it, or in the case where someone is clearly more
   qualified to fix it, reassign it with very clear instructions.
4. PRs that fix or help debug flakes may have the P0 priority set to get them
   through the merge queue as fast as possible.
5. Once you have made a change that you believe fixes a flake, the conservative
   approach is to keep the issue for the flake open and see if it manifests
   again after the change is merged.
6. If you can't reproduce a flake: __don't just close it!__ Every time a flake comes
   back, at least 2 hours of merge time is wasted. So we need to make monotonic
   progress towards narrowing it down every time a flake occurs. If you can't
   figure it out from the logs, add log messages that would have helped you
   figure it out. If you make changes to make a flake more reproducible, please
   link your pull request to the flake you're working on.
7. If a flake has been open, could not be reproduced, and has not manifested in
   3 months, it is reasonable to close the flake issue with a note saying
   why.

# Reproducing unit test flakes

Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).

First, install it (recent Go toolchains require a version suffix such as
`@latest`):

```
$ go install golang.org/x/tools/cmd/stress@latest
```

Then build your test binary:

```
$ go test -c -race
```

Then run it under stress:

```
$ stress ./package.test -test.run=FlakyTest
```

It runs the command repeatedly and writes the output to `/tmp/gostress-*` files
whenever a run fails. It periodically reports run counts. Be careful with tests
that use the `net/http/httptest` package; they could exhaust the available
ports on your system!

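
If installing `stress` is not an option, a plain shell loop gives a rough, sequential approximation. This is only a sketch: `count_failures` is a hypothetical helper and not part of `stress`, and it runs the command one invocation at a time rather than in parallel. It keeps the log of each failing run so the failure can be examined afterwards.

```sh
# count_failures RUNS COMMAND [ARGS...]
# Runs COMMAND RUNS times in sequence and prints the number of failing runs.
# Logs of failing runs are kept in a fresh temporary directory.
count_failures() {
  runs=$1
  shift
  log_dir=$(mktemp -d)
  failures=0
  i=1
  while [ "$i" -le "$runs" ]; do
    log="${log_dir}/run-${i}.log"
    if "$@" >"$log" 2>&1; then
      rm -f "$log"   # passing runs are not interesting
    else
      failures=$(( failures + 1 ))
      echo "run ${i} failed, log kept at ${log}" >&2
    fi
    i=$(( i + 1 ))
  done
  echo "$failures"
}

# Usage with the test binary built above:
#   count_failures 100 ./package.test -test.run=FlakyTest
```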
# Hunting flaky unit tests in Kubernetes

Sometimes unit tests are flaky. This means that, usually because of race
conditions, they occasionally fail even though they pass most of the time.

We have a goal of 99.9% flake-free tests. This means that a test should flake
at most once in one thousand runs.

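
To get a feel for how many runs are needed to catch a flake: if each run fails independently with probability p, the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The sketch below evaluates that expression with `awk`. The function name is hypothetical, and independence between runs is an assumption that real flakes may not satisfy.

```sh
# flake_detection_probability RATE RUNS
# Prints the probability of seeing at least one failure in RUNS runs,
# assuming every run fails independently with probability RATE.
flake_detection_probability() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", 1 - (1 - p) ^ n }'
}

# A test sitting exactly at the 99.9% goal, run 1000 times:
flake_detection_probability 0.001 1000   # prints 0.632
```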
Running a test 1000 times on your own machine can be tedious and time-consuming.
Fortunately, there is a better way to achieve this using Kubernetes.

_Note: these instructions are mildly hacky for now; they will get better as we
get run-once semantics and logging._

There is a testing image `brendanburns/flake` up on Docker Hub. We will use
this image to test our fix.

Create a replication controller with the following config:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: flakecontroller
spec:
  replicas: 24
  template:
    metadata:
      labels:
        name: flake
    spec:
      containers:
      - name: flake
        image: brendanburns/flake
        env:
        - name: TEST_PACKAGE
          value: pkg/tools
        - name: REPO_SPEC
          value: https://github.com/kubernetes/kubernetes
```

Note that we omit the labels and the selector fields of the replication
controller, because they will be populated from the labels field of the pod
template by default.

Assuming you saved the config as `controller.yaml`, create the replication
controller:

```sh
kubectl create -f ./controller.yaml
```

This will spin up 24 instances of the test. They will run to completion, then
exit, and the kubelet will restart them, accumulating more and more runs of the
test.

You can examine the recent runs of the test by calling `docker ps -a` and
looking for tasks that exited with non-zero exit codes. Unfortunately,
`docker ps -a` only keeps around the exit status of the last 15-20 containers
with the same image, so you have to check them frequently.

You can use this script to automate checking for failures, assuming your cluster
is running on GCE and has four nodes:

```sh
# Start from an empty output file.
: > output.txt
for i in {1..4}; do
  echo "Checking kubernetes-node-${i}"
  echo "kubernetes-node-${i}:" >> output.txt
  gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
done
# Match any exit code that does not start with 0, including multi-digit codes such as 137.
grep "Exited ([^0]" output.txt
```

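
When checking repeatedly, a count per node can be easier to compare over time than raw `docker ps -a` lines. The function below is a sketch with a hypothetical name: it reads `docker ps -a` output on stdin and counts containers whose status shows an exit code that does not start with 0. It assumes the status column has the form `Exited (N) ...`.

```sh
# count_nonzero_exits
# Reads `docker ps -a` output on stdin and prints how many containers
# exited with a non-zero code. `grep -c` returns status 1 when the count
# is zero, so `|| true` keeps the function's own status at 0.
count_nonzero_exits() {
  grep -c 'Exited ([^0]' || true
}

# Usage against one node of the cluster described above:
#   gcloud compute ssh "kubernetes-node-1" --command="sudo docker ps -a" | count_nonzero_exits
```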
Eventually you will have sufficient runs for your purposes. At that point you
can delete the replication controller by running:

```sh
kubectl delete replicationcontroller flakecontroller
```

If you do a final check for flakes with `docker ps -a`, ignore tasks that
exited -1, since that's what happens when you stop the replication controller.

Happy flake hunting!