From a20e44a6ac6e64e565bad01ac3079264bee03d4d Mon Sep 17 00:00:00 2001
From: Eric Tune
Date: Wed, 21 Jan 2015 11:47:32 -0800
Subject: [PATCH 1/5] Availability and multi-cluster documentation.

---
 docs/availability.md | 123 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 docs/availability.md

diff --git a/docs/availability.md b/docs/availability.md
new file mode 100644
index 00000000000..8a92688964a
--- /dev/null
+++ b/docs/availability.md
@@ -0,0 +1,123 @@
+# Availability
+
+This document collects advice on reasoning about and provisioning for high-availability when using Kubernetes clusters.
+
+## Failure modes
+
+This is an incomplete list of things that could go wrong, and how to deal with it.
+
+Root causes:
+  - VM(s) shutdown
+  - network partition within cluster, or between cluster and users.
+  - crashes in Kubernetes software
+  - data loss or unavailability from storage
+  - operator error misconfigures kubernetes software or application software.
+
+Specific scenarios:
+  - Apiserver VM shutdown or apiserver crashing
+    - Results
+      - unable to stop, update, or start new pods, services, replication controllers
+      - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
+    - Mitigations
+      - Use cloud provider best practices for improving availability of a VM, such as automatic restart and reliable
+        storage for writeable state (GCE PD or AWS EBS volume).
+      - High-availability (replicated) APIserver is a planned feature for Kubernetes.  Will tolerate one or more
+        simultaneous apiserver failures.
+      - Multiple independent clusters will tolerate failure of all apiservers in one cluster.
+  - Apiserver backing storage lost
+    - Results
+      - apiserver should fail to come up.
+      - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying.
+      - manual recovery or recreation of apiserver state necessary before apiserver is restarted.
+    - Mitigations
+      - High-availability (replicated) APIserver is a planned feature for Kubernetes.  Each apiserver has independent
+        storage.  Etcd will recover from loss of one member.  Risk of total data loss greatly reduced.
+      - snapshot PD/EBS-volume periodically
+  - Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
+    - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
+    - in future, these will be replicated as well and may not be co-located
+    - they do not have their own persistent state
+  - Node (thing that runs kubelet and kube-proxy and pods) shutdown
+    - Results
+      - pods on that Node stop running
+    - Mitigations
+      - replication controller should be used to restart copy of the pod elsewhere
+      - service should be used to hide changes in the pod IP address after restart
+      - applications (containers) should tolerate unexpected restarts
+  - Kubelet software fault
+    - Results
+      - crashing kubelet cannot start new pods on the node
+      - kubelet might delete the pods or not
+      - node marked unhealthy
+      - replication controllers start new pods elsewhere
+    - Mitigations
+      - same as for Node shutdown case
+  - Cluster operator error
+    - Results:
+      - loss of pods, services, etc
+      - loss of apiserver backing store
+      - users unable to read API
+      - etc
+    - Mitigations
+      - run additional cluster(s) and do not make changes to all at once.
+      - snapshot apiserver PD/EBS-volume periodically
+
+## Choosing Multiple Kubernetes Clusters
+
+You may want to set up multiple kubernetes clusters, both to
+ to have clusters in different regions to be nearer to your users; and to tolerate failures and/or invasive maintenance.
+
+### Scope of a single cluster
+
+On IaaS providers such as Google Compute Engine or Amazon Web Services, a VM exists in a
+[zone](https://cloud.google.com/compute/docs/zones) or [availability
+zone](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).
+We suggest that all the VMs in a Kubernetes cluster should be in the same availability zone, because:
+  - compared to having a single global Kubernetes cluster, there are fewer single points of failure
+  - compared to a cluster that spans availability zones, it is easier to reason about the availability properties of a
+    single-zone cluster.
+  - when the Kubernetes developers are designing the system (e.g. making assumptions about latency, bandwidth, or
+    correlated failures) they are assuming all the machines are in a single data center, or otherwise closely connected.
+
+It is okay to have multiple clusters per availability zone, though on balance we think fewer is better.
+Reasons to prefer fewer clusters are:
+  - improved bin packing of Pods in some cases with more nodes in one cluster.
+  - reduced operational overhead, though advanatage diminished as ops tooling and processes matures.
+  - reduced costs for per-cluster CPU, Memory, and Disk needs (apiserver etc...); though small as a percentage
+    of overall cluster cost for medium to large clusters.
+Reasons you might want multiple clusters:
+  - strict security policies requiring isolation of one class of work from another (but, see Partitioning Clusters
+    below).
+  - test clusters to canary new Kubernetes releases or other cluster software.
+
+### Selecting the right number of clusters
+The selection of the number of kubernetes clusters may be a relatively static choice, only revisited occasionally.
+By contrast, the number of nodes in a cluster and the number of pods in a service may change frequently according to
+load and growth.
+
+To pick the number of clusters, first, decide which regions you need to be in to have adequate latency to all your end users, for services that will run
+on Kubernetes (if you use a Content Distribution Network, the latency requirements for the CDN-hosted content need not
+be considered).  For example, a company with a global customer base might decide to have clusters in US, EU, AP, and SA regions.  That is the minimum number of
+Kubernetes clusters.  Call this `R`
+
+Second, decide how many clusters should be able to be unavailable at the same time, in order to meet your availability
+goals.  If you are not sure, then 1 is a good number.  Call this `U`.  Reasons for unavailability include:
+ - IaaS provider unavailable
+ - cluster operator error
+ - Kubernetes software fault
+
+If you are able and willing to fail over to a different region than some customers in the event of a cluster failure,
+then you need R + U clusters.  If you want to ensure low latency for all users in the event of a cluster failure, you
+need to have R*U clusters (U in each of R regions).  In either case, put each cluster in a different zone.
+
+Finally, if any of your clusters would need to be larger than the maximum number of nodes for a Kubernetes cluster, then
+you may need even more clusters.  Our roadmap (
+https://github.com/GoogleCloudPlatform/kubernetes/blob/24e59de06e4da61f5dafd4cd84c9340a2c0d112f/docs/roadmap.md)
+calls for maximum 100 node clusters at v1.0 and maximum 1000 node clusters in the middle of 2015.
+
+## Working with multiple clusters
+
+When you have multiple clusters, you would typically copies of a given service in each cluster and put each of those
+service instances behind a load balancer (AWS Elastic Load Balancer, GCE Forwarding Rule or HTTP Load Balancer), so that
+failures of a single cluster are not visible to end users.
+

From df60a2466bb7642fe2a97dc1b87f111da950dead Mon Sep 17 00:00:00 2001
From: Eric Tune
Date: Wed, 21 Jan 2015 13:14:36 -0800
Subject: [PATCH 2/5] Reorganize mitigations.

---
 docs/availability.md | 53 ++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/docs/availability.md b/docs/availability.md
index 8a92688964a..e940f45a7e2 100644
--- a/docs/availability.md
+++ b/docs/availability.md
@@ -4,13 +4,13 @@ This document collects advice on reasoning about and provisioning for high-avail
 
 ## Failure modes
 
-This is an incomplete list of things that could go wrong, and how to deal with it.
+This is an incomplete list of things that could go wrong, and how to deal with them.
 
 Root causes:
   - VM(s) shutdown
   - network partition within cluster, or between cluster and users.
   - crashes in Kubernetes software
-  - data loss or unavailability from storage
+  - data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume).
   - operator error misconfigures kubernetes software or application software.
 
 Specific scenarios:
@@ -18,21 +18,11 @@ Specific scenarios:
     - Results
       - unable to stop, update, or start new pods, services, replication controllers
       - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
-    - Mitigations
-      - Use cloud provider best practices for improving availability of a VM, such as automatic restart and reliable
-        storage for writeable state (GCE PD or AWS EBS volume).
-      - High-availability (replicated) APIserver is a planned feature for Kubernetes.  Will tolerate one or more
-        simultaneous apiserver failures.
-      - Multiple independent clusters will tolerate failure of all apiservers in one cluster.
   - Apiserver backing storage lost
     - Results
       - apiserver should fail to come up.
      - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying.
      - manual recovery or recreation of apiserver state necessary before apiserver is restarted.
-    - Mitigations
-      - High-availability (replicated) APIserver is a planned feature for Kubernetes.  Each apiserver has independent
-        storage.  Etcd will recover from loss of one member.  Risk of total data loss greatly reduced.
-      - snapshot PD/EBS-volume periodically
   - Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
     - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
     - in future, these will be replicated as well and may not be co-located
     - they do not have their own persistent state
@@ -40,27 +30,48 @@ Specific scenarios:
   - Node (thing that runs kubelet and kube-proxy and pods) shutdown
     - Results
       - pods on that Node stop running
-    - Mitigations
-      - replication controller should be used to restart copy of the pod elsewhere
-      - service should be used to hide changes in the pod IP address after restart
-      - applications (containers) should tolerate unexpected restarts
   - Kubelet software fault
     - Results
      - crashing kubelet cannot start new pods on the node
      - kubelet might delete the pods or not
      - node marked unhealthy
      - replication controllers start new pods elsewhere
-    - Mitigations
-      - same as for Node shutdown case
   - Cluster operator error
     - Results:
      - loss of pods, services, etc
      - loss of apiserver backing store
      - users unable to read API
      - etc
-    - Mitigations
-      - run additional cluster(s) and do not make changes to all at once.
-      - snapshot apiserver PD/EBS-volume periodically
+
+Mitigations:
+- Action: Use the IaaS provider's automatic VM restarting feature for IaaS VMs.
+  - Mitigates: Apiserver VM shutdown or apiserver crashing
+  - Mitigates: Supporting services VM shutdown or crashes
+
+- Action: Use the IaaS provider's reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd.
+  - Mitigates: Apiserver backing storage lost
+
+- Action: Use the Replicated APIserver feature (when complete; this feature is planned but not yet implemented)
+  - Mitigates: Apiserver VM shutdown or apiserver crashing
+    - Will tolerate one or more simultaneous apiserver failures.
+  - Mitigates: Apiserver backing storage lost
+    - Each apiserver has independent storage.  Etcd will recover from loss of one member.  Risk of total data loss greatly reduced.
+
+- Action: Snapshot apiserver PDs/EBS-volumes periodically
+  - Mitigates: Apiserver backing storage lost
+  - Mitigates: Some cases of operator error
+  - Mitigates: Some cases of Kubernetes software fault
+
+- Action: Use replication controllers and services in front of pods
+  - Mitigates: Node shutdown
+  - Mitigates: Kubelet software fault
+
+- Action: Design applications (containers) to tolerate unexpected restarts
+  - Mitigates: Node shutdown
+  - Mitigates: Kubelet software fault
+
+- Action: Run multiple independent clusters (and avoid making risky changes to all clusters at once)
+  - Mitigates: Everything listed above.
 
 ## Choosing Multiple Kubernetes Clusters
 

From 144d19086fc74e957ed338948d4cc780f86d11b6 Mon Sep 17 00:00:00 2001
From: Eric Tune
Date: Wed, 21 Jan 2015 13:18:42 -0800
Subject: [PATCH 3/5] Fix.

---
 docs/availability.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/availability.md b/docs/availability.md
index e940f45a7e2..d9b16de0862 100644
--- a/docs/availability.md
+++ b/docs/availability.md
@@ -76,7 +76,7 @@ Mitigations:
 ## Choosing Multiple Kubernetes Clusters
 
 You may want to set up multiple kubernetes clusters, both to
- to have clusters in different regions to be nearer to your users; and to tolerate failures and/or invasive maintenance.
+have clusters in different regions to be nearer to your users; and to tolerate failures and/or invasive maintenance.
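+
+For example, one way to use multiple clusters during invasive maintenance is to apply a risky change to a single
+cluster first and stop if that cluster does not stay healthy.  The sketch below is illustrative only;
+`upgrade_cluster` and `cluster_is_healthy` are hypothetical stand-ins for whatever upgrade and health-check
+procedures you already use.
+
+```python
+# Minimal sketch: roll a risky change out one cluster at a time so that a bad
+# change never reaches every cluster at once.
+def rolling_cluster_update(clusters, upgrade_cluster, cluster_is_healthy):
+    """clusters is an ordered list of cluster names; the two callables are user-supplied."""
+    for name in clusters:
+        upgrade_cluster(name)  # e.g. push new config or a new Kubernetes release
+        if not cluster_is_healthy(name):
+            raise RuntimeError("halting rollout: %s is unhealthy after the change" % name)
+```
+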
 ### Scope of a single cluster
 
 On IaaS providers such as Google Compute Engine or Amazon Web Services, a VM exists in a
 [zone](https://cloud.google.com/compute/docs/zones) or [availability
 zone](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).
 We suggest that all the VMs in a Kubernetes cluster should be in the same availability zone, because:
   - compared to having a single global Kubernetes cluster, there are fewer single points of failure
   - compared to a cluster that spans availability zones, it is easier to reason about the availability properties of a
     single-zone cluster.
   - when the Kubernetes developers are designing the system (e.g. making assumptions about latency, bandwidth, or
     correlated failures) they are assuming all the machines are in a single data center, or otherwise closely connected.
 
 It is okay to have multiple clusters per availability zone, though on balance we think fewer is better.
 Reasons to prefer fewer clusters are:
   - improved bin packing of Pods in some cases with more nodes in one cluster.
-  - reduced operational overhead, though advanatage diminished as ops tooling and processes matures.
-  - reduced costs for per-cluster CPU, Memory, and Disk needs (apiserver etc...); though small as a percentage
-    of overall cluster cost for medium to large clusters.
-Reasons you might want multiple clusters:
+  - reduced operational overhead (though the advantage is diminished as ops tooling and processes matures).
+  - reduced costs for per-cluster fixed resource costs, e.g. apiserver VMs (but small as a percentage
+    of overall cluster cost for medium to large clusters).
+
+Reasons to have multiple clusters include:
   - strict security policies requiring isolation of one class of work from another (but, see Partitioning Clusters
     below).
   - test clusters to canary new Kubernetes releases or other cluster software.

From 845f0e9dd1cf8d4bfdc0b7bd0988b545e827a8c8 Mon Sep 17 00:00:00 2001
From: Eric Tune
Date: Wed, 21 Jan 2015 13:26:40 -0800
Subject: [PATCH 4/5] Fix.

---
 docs/availability.md | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/docs/availability.md b/docs/availability.md
index d9b16de0862..b8085268cef 100644
--- a/docs/availability.md
+++ b/docs/availability.md
@@ -109,20 +109,17 @@ load and growth.
 To pick the number of clusters, first, decide which regions you need to be in to have adequate latency to all your end users, for services that will run
 on Kubernetes (if you use a Content Distribution Network, the latency requirements for the CDN-hosted content need not
-be considered).  For example, a company with a global customer base might decide to have clusters in US, EU, AP, and SA regions.  That is the minimum number of
-Kubernetes clusters.  Call this `R`
+be considered). Legal issues might influence this as well. For example, a company with a global customer base might decide to have clusters in US, EU, AP, and SA regions.
+Call the number of regions to be in `R`.
 
-Second, decide how many clusters should be able to be unavailable at the same time, in order to meet your availability
-goals.  If you are not sure, then 1 is a good number.  Call this `U`.  Reasons for unavailability include:
- - IaaS provider unavailable
- - cluster operator error
- - Kubernetes software fault
+Second, decide how many clusters should be able to be unavailable at the same time, while keeping your overall service
+available.  Call the number that can be unavailable `U`.  If you are not sure, then 1 is a fine choice.
 
-If you are able and willing to fail over to a different region than some customers in the event of a cluster failure,
-then you need R + U clusters.  If you want to ensure low latency for all users in the event of a cluster failure, you
-need to have R*U clusters (U in each of R regions).  In either case, put each cluster in a different zone.
+If it is allowable for load balancing to direct traffic to any region in the event of a cluster failure,
+then you need `R + U` clusters.  If it is not (e.g. you want to ensure low latency for all users in the event of a
+cluster failure), then you need to have `R * U` clusters (`U` in each of `R` regions).  In any case, try to put each cluster in a different zone.
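+
+The sketch below simply restates the rule of thumb above in code, with made-up numbers, to make the two cases
+concrete.
+
+```python
+def clusters_needed(R, U, cross_region_failover_ok):
+    """R = regions you must serve, U = clusters that may be unavailable at once."""
+    # Rule of thumb from the text above: R + U if any region may absorb
+    # failed-over traffic, otherwise U clusters in each of the R regions.
+    return R + U if cross_region_failover_ok else R * U
+
+# e.g. R = 4 regions, U = 2 simultaneously unavailable clusters
+assert clusters_needed(4, 2, cross_region_failover_ok=True) == 6
+assert clusters_needed(4, 2, cross_region_failover_ok=False) == 8
+```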
 
-Finally, if any of your clusters would need to be larger than the maximum number of nodes for a Kubernetes cluster, then
+Finally, if any of your clusters would need more than the maximum recommended number of nodes for a Kubernetes cluster, then
 you may need even more clusters.  Our roadmap (
 https://github.com/GoogleCloudPlatform/kubernetes/blob/24e59de06e4da61f5dafd4cd84c9340a2c0d112f/docs/roadmap.md)
 calls for maximum 100 node clusters at v1.0 and maximum 1000 node clusters in the middle of 2015.

From cb6d23b186dfa17284affb6bd10d03e4b076f339 Mon Sep 17 00:00:00 2001
From: Eric Tune
Date: Wed, 21 Jan 2015 13:28:35 -0800
Subject: [PATCH 5/5] Fix.

---
 docs/availability.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/availability.md b/docs/availability.md
index b8085268cef..3f8db2b6ebf 100644
--- a/docs/availability.md
+++ b/docs/availability.md
@@ -126,7 +126,7 @@ calls for maximum 100 node clusters at v1.0 and maximum 1000 node clusters in th
 
 ## Working with multiple clusters
 
-When you have multiple clusters, you would typically copies of a given service in each cluster and put each of those
+When you have multiple clusters, you would typically create services with the same config in each cluster and put each of those
 service instances behind a load balancer (AWS Elastic Load Balancer, GCE Forwarding Rule or HTTP Load Balancer), so that
 failures of a single cluster are not visible to end users.
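+
+For example, you might periodically probe each per-cluster endpoint and keep only the healthy ones in the load
+balancer's target set.  The sketch below is illustrative only: the endpoint URLs and the `/healthz` path are
+assumptions, and pushing the result into your load balancer is specific to your provider.
+
+```python
+# Minimal sketch: probe each per-cluster service endpoint and keep only the
+# healthy ones as load balancer targets.
+import urllib.request
+
+def healthy_endpoints(endpoints, timeout_seconds=2):
+    healthy = []
+    for url in endpoints:
+        try:
+            with urllib.request.urlopen(url + "/healthz", timeout=timeout_seconds) as resp:
+                if resp.status == 200:
+                    healthy.append(url)
+        except OSError:
+            pass  # unreachable or erroring cluster: leave it out of the target set
+    return healthy
+
+# e.g. healthy_endpoints(["http://us.example.com", "http://eu.example.com"])
+```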