From 4c6c484900a966e2ea17fc3ebd646b7d4e783084 Mon Sep 17 00:00:00 2001 From: Brendan Burns Date: Tue, 26 Aug 2014 20:54:19 -0700 Subject: [PATCH] Add initial docs for flake hunting. --- README.md | 3 +++ docs/flaky-tests.md | 52 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 55 insertions(+) create mode 100644 docs/flaky-tests.md diff --git a/README.md b/README.md index 197d9bd344c..7150c0dbc2f 100644 --- a/README.md +++ b/README.md @@ -151,6 +151,9 @@ cd kubernetes hack/e2e-test.sh ``` +### Testing out flaky tests +[Instructions here](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/flaky-tests.md) + ### Add/Update dependencies Kubernetes uses [godep](https://github.com/tools/godep) to manage dependencies. To add or update a package, please follow the instructions on [godep's document](https://github.com/tools/godep). diff --git a/docs/flaky-tests.md b/docs/flaky-tests.md new file mode 100644 index 00000000000..d2cc8fadaf4 --- /dev/null +++ b/docs/flaky-tests.md @@ -0,0 +1,52 @@ +# Hunting flaky tests in Kubernetes +Sometimes unit tests are flaky. This means that due to (usually) race conditions, they will occasionally fail, even though most of the time they pass. + +We have a goal of 99.9% flake free tests. This means that there is only one flake in one thousand runs of a test. + +Running a test 1000 times on your own machine can be tedious and time consuming. Fortunately, there is a better way to achieve this using Kubernetes. + +_Note: these instructions are mildly hacky for now, as we get run once semantics and logging they will get better_ + +There is a testing image ```brendanburns/flake``` up on the docker hub. We will use this image to test our fix. + +Create a replication controller with the following config: +```yaml +id: flakeController +desiredState: + replicas: 24 + replicaSelector: + name: flake + podTemplate: + desiredState: + manifest: + version: v1beta1 + id: "" + volumes: [] + containers: + - name: flake + image: brendanburns/flake + env: + - name: TEST_PACKAGE + value: pkg/tools + - name: REPO_SPEC + value: https://github.com/GoogleCloudPlatform/kubernetes + restartpolicy: {} + labels: + name: flake +labels: + name: flake +``` + +```./cluster/kubecfg.sh -c controller.yaml create replicaControllers``` + +This will spin up 100 instances of the test. They will run to completion, then exit, the kubelet will restart them, eventually you will have sufficient +runs for your purposes, and you can stop the replication controller: + +```sh +./cluster/kubecfg.sh stop flakeController +./cluster/kubecfg.sh rm flakeController +``` + +Now examine the machines with ```docker ps -a``` and look for tasks that exited with non-zero exit codes (ignore those that exited -1, since that's what happens when you stop the replica controller) + +Happy flake hunting!