From d3d71df9433a5339fc8bc2590bb356c694010282 Mon Sep 17 00:00:00 2001
From: Alex Robinson
Date: Wed, 11 Feb 2015 12:16:16 -0800
Subject: [PATCH] Fix bad config in flaky test documentation and add script to
 help check for flakes.

---
 docs/devel/flaky-tests.md | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/docs/devel/flaky-tests.md b/docs/devel/flaky-tests.md
index ccd32afbaf4..e352e11097b 100644
--- a/docs/devel/flaky-tests.md
+++ b/docs/devel/flaky-tests.md
@@ -11,7 +11,7 @@ There is a testing image ```brendanburns/flake``` up on the docker hub. We will
 
 Create a replication controller with the following config:
 ```yaml
-id: flakeController
+id: flakecontroller
 kind: ReplicationController
 apiVersion: v1beta1
 desiredState:
@@ -41,14 +41,26 @@ labels:
 ```
 
 ```./cluster/kubectl.sh create -f controller.yaml```
 
-This will spin up 100 instances of the test. They will run to completion, then exit, the kubelet will restart them, eventually you will have sufficient
-runs for your purposes, and you can stop the replication controller by setting the ```replicas``` field to 0 and then running:
+This will spin up 24 instances of the test. They will run to completion, then exit, and the kubelet will restart them, accumulating more and more runs of the test.
+You can examine the recent runs of the test by calling ```docker ps -a``` and looking for tasks that exited with non-zero exit codes. Unfortunately, docker ps -a only keeps around the exit status of the last 15-20 containers with the same image, so you have to check them frequently.
+You can use this script to automate checking for failures, assuming your cluster is running on GCE and has four nodes:
 
 ```sh
-./cluster/kubectl.sh update -f controller.yaml
-./cluster/kubectl.sh delete -f controller.yaml
+echo "" > output.txt
+for i in {1..4}; do
+  echo "Checking kubernetes-minion-${i}"
+  echo "kubernetes-minion-${i}:" >> output.txt
+  gcloud compute ssh "kubernetes-minion-${i}" --command="sudo docker ps -a" >> output.txt
+done
+grep "Exited ([^0])" output.txt
 ```
 
-Now examine the machines with ```docker ps -a``` and look for tasks that exited with non-zero exit codes (ignore those that exited -1, since that's what happens when you stop the replica controller)
+Eventually you will have sufficient runs for your purposes. At that point you can stop and delete the replication controller by running:
+
+```sh
+./cluster/kubectl.sh stop replicationcontroller flakecontroller
+```
+
+If you do a final check for flakes with ```docker ps -a```, ignore tasks that exited -1, since that's what happens when you stop the replication controller. Happy flake hunting!
 
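Note for reviewers: the hunks above only quote fragments of controller.yaml. For orientation, a hypothetical sketch of what a full v1beta1 config in this shape could look like; everything beyond the ```id```/```kind```/```apiVersion```/```desiredState``` lines visible in the first hunk (replica selector, pod template, container image, labels) is an assumption for illustration, not part of this patch:

```yaml
# Hypothetical controller.yaml sketch. Only the first four keys are
# confirmed by the patch hunks; the rest is illustrative.
id: flakecontroller
kind: ReplicationController
apiVersion: v1beta1
desiredState:
  replicas: 24              # matches "spin up 24 instances" in the new text
  replicaSelector:
    name: flake
  podTemplate:
    desiredState:
      manifest:
        version: v1beta1
        id: flake
        containers:
          - name: flake
            image: brendanburns/flake   # the test image named in the doc
    labels:
      name: flake
labels:
  name: flake
```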
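One more note on the grep pattern in the added script: ```Exited ([^0])``` matches any status whose first character inside the parentheses is not ```0```, so it also catches the ```Exited (-1)``` entries the new closing paragraph says to ignore. A quick sketch against fabricated ```docker ps -a```-style lines (container IDs and timings are made up) shows this:

```sh
# Fake "docker ps -a" output lines to illustrate what the pattern matches.
cat > sample.txt <<'EOF'
abc123  brendanburns/flake  Exited (0) 2 minutes ago
def456  brendanburns/flake  Exited (1) 5 minutes ago
ghi789  brendanburns/flake  Exited (-1) 1 minute ago
EOF

# Prints the (1) and (-1) lines, but not the clean (0) exit.
grep "Exited ([^0])" sample.txt
```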