From d3d71df9433a5339fc8bc2590bb356c694010282 Mon Sep 17 00:00:00 2001
From: Alex Robinson
Date: Wed, 11 Feb 2015 12:16:16 -0800
Subject: [PATCH] Fix bad config in flaky test documentation and add script to
 help check for flakes.

---
 docs/devel/flaky-tests.md | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/docs/devel/flaky-tests.md b/docs/devel/flaky-tests.md
index ccd32afbaf4..e352e11097b 100644
--- a/docs/devel/flaky-tests.md
+++ b/docs/devel/flaky-tests.md
@@ -11,7 +11,7 @@ There is a testing image ```brendanburns/flake``` up on the docker hub. We will
 
 Create a replication controller with the following config:
 ```yaml
-id: flakeController
+id: flakecontroller
 kind: ReplicationController
 apiVersion: v1beta1
 desiredState:
@@ -41,14 +41,26 @@ labels:
 ```
 
 ```./cluster/kubectl.sh create -f controller.yaml```
 
-This will spin up 100 instances of the test. They will run to completion, then exit, the kubelet will restart them, eventually you will have sufficient
-runs for your purposes, and you can stop the replication controller by setting the ```replicas``` field to 0 and then running:
+This will spin up 24 instances of the test. They will run to completion, then exit, and the kubelet will restart them, accumulating more and more runs of the test.
+You can examine the recent runs of the test by calling ```docker ps -a``` and looking for tasks that exited with non-zero exit codes. Unfortunately, docker ps -a only keeps around the exit status of the last 15-20 containers with the same image, so you have to check them frequently.
+You can use this script to automate checking for failures, assuming your cluster is running on GCE and has four nodes:
 
 ```sh
-./cluster/kubectl.sh update -f controller.yaml
-./cluster/kubectl.sh delete -f controller.yaml
+echo "" > output.txt
+for i in {1..4}; do
+  echo "Checking kubernetes-minion-${i}"
+  echo "kubernetes-minion-${i}:" >> output.txt
+  gcloud compute ssh "kubernetes-minion-${i}" --command="sudo docker ps -a" >> output.txt
+done
+grep "Exited ([^0])" output.txt
 ```
 
-Now examine the machines with ```docker ps -a``` and look for tasks that exited with non-zero exit codes (ignore those that exited -1, since that's what happens when you stop the replica controller)
+Eventually you will have sufficient runs for your purposes. At that point you can stop and delete the replication controller by running:
+
+```sh
+./cluster/kubectl.sh stop replicationcontroller flakecontroller
+```
+
+If you do a final check for flakes with ```docker ps -a```, ignore tasks that exited -1, since that's what happens when you stop the replication controller. Happy flake hunting!
 
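Note for reviewers: the hunks above only quote fragments of controller.yaml. For orientation, a hypothetical sketch of what a full v1beta1 config in this shape could look like; everything beyond the ```id```/```kind```/```apiVersion```/```desiredState``` lines visible in the first hunk (replica selector, pod template, container image, labels) is an assumption for illustration, not part of this patch:

```yaml
# Hypothetical controller.yaml sketch. Only the first four keys are
# confirmed by the patch hunks; the rest is illustrative.
id: flakecontroller
kind: ReplicationController
apiVersion: v1beta1
desiredState:
  replicas: 24              # matches "spin up 24 instances" in the new text
  replicaSelector:
    name: flake
  podTemplate:
    desiredState:
      manifest:
        version: v1beta1
        id: flake
        containers:
          - name: flake
            image: brendanburns/flake   # the test image named in the doc
    labels:
      name: flake
labels:
  name: flake
```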
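One more note on the grep pattern in the added script: ```Exited ([^0])``` matches any status whose first character inside the parentheses is not ```0```, so it also catches the ```Exited (-1)``` entries the new closing paragraph says to ignore. A quick sketch against fabricated ```docker ps -a```-style lines (container IDs and timings are made up) shows this:

```sh
# Fake "docker ps -a" output lines to illustrate what the pattern matches.
cat > sample.txt <<'EOF'
abc123  brendanburns/flake  Exited (0) 2 minutes ago
def456  brendanburns/flake  Exited (1) 5 minutes ago
ghi789  brendanburns/flake  Exited (-1) 1 minute ago
EOF

# Prints the (1) and (-1) lines, but not the clean (0) exit.
grep "Exited ([^0])" sample.txt
```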