<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.3/docs/devel/flaky-tests.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Flaky tests

Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the
tests in various combinations, even a small percentage of flakes results in a
lot of pain for people waiting for their PRs to merge.

Therefore, it's very important that we write tests defensively. Situations that
"almost never happen" happen with some regularity when run thousands of times in
resource-constrained environments. Since flakes can often be quite hard to
reproduce while still being common enough to block merges occasionally, it's
additionally important that the test logs be useful for narrowing down exactly
what caused the failure.
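
For example (a minimal sketch, not taken from the Kubernetes test suite; the
"worker" goroutine just stands in for whatever asynchronous component a real
test exercises), a defensive test polls with a generous deadline instead of
sleeping for a fixed interval, and puts the observed state into the failure
message so the flake can be debugged from the logs alone:

```go
package example

import (
	"sync/atomic"
	"testing"
	"time"
)

func TestWorkerFinishes(t *testing.T) {
	var done int32
	go func() {
		time.Sleep(100 * time.Millisecond) // simulated asynchronous work
		atomic.StoreInt32(&done, 1)
	}()

	// Poll with a deadline rather than sleeping for a "long enough" fixed
	// interval; fixed sleeps are a classic source of flakes on loaded CI
	// machines.
	deadline := time.Now().Add(10 * time.Second)
	for atomic.LoadInt32(&done) == 0 {
		if time.Now().After(deadline) {
			// Include enough state in the message to debug from logs alone.
			t.Fatalf("worker did not finish within 10s; done=%d", atomic.LoadInt32(&done))
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```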

Note that flakes can occur in unit tests, integration tests, or end-to-end
tests, but probably occur most commonly in end-to-end tests.

## Filing issues for flaky tests

Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.

1. Search for the test name. If you find an open issue and you're 90% sure the
   flake is exactly the same, add a comment instead of making a new issue.
2. If you make a new issue, you should title it with the test name, prefixed by
   "e2e/unit/integration flake:" (whichever is appropriate).
3. Reference any old issues you found in step one. Also, make a comment in the
   old issue referencing your new issue, because people monitoring only their
   email do not see the backlinks GitHub adds. Alternatively, tag the person or
   people who most recently worked on it.
4. Paste, in block quotes, the entire log of the individual failing test, not
   just the failure line.
5. Link to durable storage with the rest of the logs. This means (for all the
   tests that Google runs) the GCS link is mandatory! The Jenkins test result
   link is nice but strictly optional: not only does it expire more quickly,
   it's also not accessible to non-Googlers.

## Expectations when a flaky test is assigned to you

Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help
from anyone in fixing these, but due to the severity of the problem when merges
are blocked, we need reasonably quick turn-around time on test flakes. Therefore
we have the following guidelines:

1. If a flaky test is assigned to you, it's more important than anything else
   you're doing unless you can get a special dispensation (in which case it will
   be reassigned). If you have too many flaky tests assigned to you, or you
   have such a dispensation, then it's *still* your responsibility to find new
   owners (this may just mean giving stuff back to the relevant team or SIG lead).
2. You should make a reasonable effort to reproduce it. Somewhere between an
   hour and half a day of concentrated effort is "reasonable". It is perfectly
   reasonable to ask for help!
3. If you can reproduce it (or it's obvious from the logs what happened), you
   should then be able to fix it, or, in the case where someone is clearly more
   qualified to fix it, reassign it with very clear instructions.
4. PRs that fix or help debug flakes may have the P0 priority set to get them
   through the merge queue as fast as possible.
5. Once you have made a change that you believe fixes a flake, it is prudent to
   keep the issue for the flake open and watch whether it manifests again after
   the change is merged.
6. If you can't reproduce a flake: __don't just close it!__ Every time a flake comes
   back, at least 2 hours of merge time is wasted. So we need to make monotonic
   progress towards narrowing it down every time a flake occurs. If you can't
   figure it out from the logs, add log messages that would have helped you figure
   it out (see the sketch after this list). If you make changes to make a flake
   more reproducible, please link your pull request to the flake you're working on.
7. If a flake has been open, could not be reproduced, and has not manifested in
   3 months, it is reasonable to close the flake issue with a note saying why.
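
When a flake surfaces only as a timeout with no further detail, one useful kind
of log message is a dump of all goroutine stacks taken just before the test
gives up. A minimal sketch of such a helper (hypothetical, not an existing
Kubernetes utility) might look like:

```go
package example

import (
	"runtime"
	"testing"
)

// dumpGoroutines logs the stacks of every goroutine. Calling it right before
// failing a test that timed out leaves enough evidence in the CI logs to
// narrow the flake down after the fact, even if it never reproduces locally.
func dumpGoroutines(t *testing.T) {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true = include all goroutines
	t.Logf("goroutine dump:\n%s", buf[:n])
}
```

Calling `dumpGoroutines(t)` immediately before the `t.Fatalf` in a timed-out
test often shows which goroutine is stuck and where.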

## Reproducing unit test flakes

Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).

First, install it:

```
$ go install golang.org/x/tools/cmd/stress
```

Then build your test binary:

```
$ go test -c -race
```

Then run it under stress:

```
$ stress ./package.test -test.run=FlakyTest
```

It runs the command repeatedly and writes output to `/tmp/gostress-*` files when
it fails. It periodically reports run counts. Be careful with tests that use the
`net/http/httptest` package; they could exhaust the available ports on your
system!
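
To see the kind of failure `stress` surfaces, here is a deliberately flaky test
(illustrative only, not from the Kubernetes tree). It encodes the "almost never
happens" assumption that 4ms of work always beats a 5ms timer; run it under
`stress` as above and the rare slow run shows up within minutes rather than in
someone's merge queue:

```go
package example

import (
	"testing"
	"time"
)

func TestRacyTimeout(t *testing.T) {
	done := make(chan struct{})
	go func() {
		time.Sleep(4 * time.Millisecond) // stand-in for real work
		close(done)
	}()

	// On an idle machine the work nearly always finishes first; under
	// scheduling pressure the timer occasionally wins and the test flakes.
	select {
	case <-done:
	case <-time.After(5 * time.Millisecond):
		t.Fatal("work did not finish before the 5ms timeout")
	}
}
```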

## Hunting flaky unit tests in Kubernetes

Sometimes unit tests are flaky. This means that due to (usually) race conditions, they will occasionally fail, even though most of the time they pass.

We have a goal of 99.9% flake-free tests. This means that a test may fail at most once in every one thousand runs.

Running a test 1000 times on your own machine can be tedious and time-consuming. Fortunately, there is a better way to achieve this using Kubernetes.

_Note: these instructions are mildly hacky for now; as we get run-once semantics and better logging they will improve._

There is a testing image `brendanburns/flake` on Docker Hub. We will use this image to test our fix.

Create a replication controller with the following config:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: flakecontroller
spec:
  replicas: 24
  template:
    metadata:
      labels:
        name: flake
    spec:
      containers:
      - name: flake
        image: brendanburns/flake
        env:
        - name: TEST_PACKAGE
          value: pkg/tools
        - name: REPO_SPEC
          value: https://github.com/kubernetes/kubernetes
```

Note that we omit the labels and the selector fields of the replication controller, because they will be populated from the labels field of the pod template by default.

```sh
kubectl create -f ./controller.yaml
```

This will spin up 24 instances of the test. They will run to completion, then exit, and the kubelet will restart them, accumulating more and more runs of the test.
You can examine the recent runs of the test by calling `docker ps -a` and looking for containers that exited with non-zero exit codes. Unfortunately, `docker ps -a` only keeps around the exit status of the last 15-20 containers with the same image, so you have to check them frequently.
You can use this script to automate checking for failures, assuming your cluster is running on GCE and has four nodes:

```sh
echo "" > output.txt
for i in {1..4}; do
  echo "Checking kubernetes-node-${i}"
  echo "kubernetes-node-${i}:" >> output.txt
  gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
done
grep "Exited ([^0])" output.txt
```

Eventually you will have sufficient runs for your purposes. At that point you can delete the replication controller by running:

```sh
kubectl delete replicationcontroller flakecontroller
```

If you do a final check for flakes with `docker ps -a`, ignore containers that exited with status -1, since that's what happens when you stop the replication controller.

Happy flake hunting!