From 2e5f27d940ca1395a447170cb5f2b1e365710d12 Mon Sep 17 00:00:00 2001
From: Filip Grzadkowski
Date: Wed, 27 Jul 2016 03:29:11 +0200
Subject: [PATCH] Design for automated HA master deployment

---
 docs/design/ha_master.md | 265 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 265 insertions(+)
 create mode 100644 docs/design/ha_master.md

diff --git a/docs/design/ha_master.md b/docs/design/ha_master.md
new file mode 100644
index 00000000000..6f2d91d7417
--- /dev/null
+++ b/docs/design/ha_master.md
@@ -0,0 +1,265 @@

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).

--

# Automated HA master deployment

**Authors:** filipg@, jsz@

# Introduction

We want to allow users to easily replicate kubernetes masters to get a highly available cluster,
initially using `kube-up.sh` and `kube-down.sh`.

This document describes the technical design of this feature. It assumes that we are using the
aforementioned scripts for cluster deployment. All of the ideas described in the following sections
should be easy to implement on GCE, AWS and other cloud providers.

It is a non-goal to design a specific setup for bare-metal environments, which
might be very different.

# Overview

In a cluster with a replicated master, we will have N VMs, each running the regular master components
such as apiserver, etcd, scheduler or controller manager. These components will interact in the
following way:
* All etcd replicas will be clustered together and will use leader election
  and quorum mechanisms to agree on the state. All of these mechanisms are integral
  parts of etcd and we will only have to configure them properly.
* All apiserver replicas will work independently, talking to the etcd on
  127.0.0.1 (i.e. the local etcd replica), which will, if needed, forward requests to the current etcd leader
  (as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)).
* We will introduce provider-specific solutions to load balance traffic between master replicas
  (see the `load balancing` section).
* Controller manager, scheduler & cluster autoscaler will use a lease mechanism and
  only a single instance will be the active master. All others will wait in standby mode.
* All add-on managers will work independently and each of them will try to keep add-ons in sync.

# Detailed design

## Components

### etcd

```
Note: This design for etcd clustering is quite pet-set like - each etcd
replica has its name which is explicitly used in etcd configuration etc. In
the medium-term future we would like to have the ability to run masters as part
of an autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove
replicas automatically. This is pretty tricky and this design does not cover
it. It will be covered in a separate doc.
```

All etcd instances will be clustered together and one of them will be the elected leader.
In order to commit any change, a quorum of the cluster will have to confirm it. Etcd will be
configured in such a way that all writes and reads go through the leader (requests
will be forwarded by the local etcd server in a way that is invisible to the user). This will
affect latency for all operations, but latency should not increase by much more than the network
latency between master replicas (latency between GCE zones within a region is < 10ms).

Currently etcd exposes its port only on the localhost interface. In order to allow clustering
and inter-VM communication we will also have to use the public interface. To secure this
communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)).

When generating the command line for etcd we will always assume it is part of a cluster
(initially of size 1) and list all existing kubernetes master replicas.
Based on that, we will set the following flags:
* `-initial-cluster` - the list of all hostnames/DNS names for master replicas (including the new one)
* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one):
  * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty
  * `existing` if there is more than one replica, i.e.
the list of existing master replicas is non-empty.

This will allow us to have exactly the same logic for HA and non-HA masters. The list of DNS names for VMs
with master replicas will be generated in the `kube-up.sh` script and passed as an env variable
`INITIAL_ETCD_CLUSTER`.

### apiservers

All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact
the etcd replica running on the same VM. If needed, such requests will be forwarded by the etcd server to the
etcd leader. This functionality is completely hidden from the client (the apiserver
in our case).

The caching mechanism implemented in the apiserver will not be affected by
replicating the master because:
* GET requests go directly to etcd
* LIST requests go either directly to etcd or to a cache populated via watch
  (depending on the ResourceVersion in ListOptions). In the second scenario,
  after a PUT/POST request, changes might not be visible in the LIST response.
  This is however not worse than it is with the current single master.
* WATCH does not give any guarantees on when a change will be delivered.

#### load balancing

With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud
providers have different capabilities and limitations, we will not try to find a lowest common
denominator that will work everywhere. Instead we will document various options and apply different
solutions for different deployments. Below we list possible approaches:

1. `Managed DNS` - the user needs to specify a domain name during cluster creation. DNS entries will be managed
automatically by the deployment tool, which will be integrated with solutions like Route53 (AWS)
or Google Cloud DNS (GCP). For load balancing we will have two options:
   1.1. create an L4 load balancer in front of all apiservers and update the DNS name appropriately
   1.2. use the round-robin DNS technique to access all apiservers directly
2.
`Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries
will be manually managed by the user. We will provide detailed documentation for the entries we
expect.
3. [GCP only] `Promote master IP` - in GCP, when we create the first master replica, we generate a static
external IP address that is later assigned to the master VM. When creating additional replicas we
will create a load balancer in front of them and reassign the aforementioned IP to point to the load balancer
instead of a single master. When removing the second-to-last replica we will reverse this operation (assign
the IP address to the remaining master VM and delete the load balancer). That way the user will not have to
provide a domain name and all client configurations will keep working.

This will also impact `kubelet <-> master` communication, which should go through the load-balanced
endpoint as well. Depending on the method chosen above, we will configure the kubelet appropriately.

#### `kubernetes` service

Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
list of IP addresses for all apiservers. As it uses the command line flag
`--apiserver-count`, it is not very dynamic and would require restarting all
masters to change the number of master replicas.

To allow dynamic changes to the number of apiservers in the cluster, we will
introduce a `ConfigMap` in the `kube-system` namespace that will keep an expiration
time for each apiserver (keyed by IP). Each apiserver will do three things:

1. periodically update the expiration time for its own IP address
2. remove all stale IP addresses from the endpoints list
3. add its own IP address if it is not on the list yet.

That way we will not only solve the problem of a dynamically changing number
of apiservers in the cluster, but also the problem of non-responsive apiservers
that should be removed from the `kubernetes` service endpoints list.

#### Certificates

Certificate generation will work as today.
In particular, on GCE, we will
generate it for the public IP used to access the cluster (see the `load balancing`
section) and the local IP of the master replica VM.

That means that with multiple master replicas and a load balancer in front
of them, accessing one of the replicas directly (using its ephemeral public
IP) will not work on GCE without appropriate flags:

- `kubectl --insecure-skip-tls-verify=true`
- `curl --insecure`
- `wget --no-check-certificate`

For other deployment tools and providers the details of certificate generation
may be different, but it must be possible to access the cluster using either
the main cluster endpoint (DNS name or IP address) or the internal service called
`kubernetes` that points directly to the apiservers.

### controller manager, scheduler & cluster autoscaler

Controller manager and scheduler will by default use a lease mechanism to choose an active instance
among all masters. Only one instance will be performing any operations;
all others will be waiting in standby mode.

We will use the same configuration in non-replicated mode to simplify deployment scripts.

### add-on manager

All add-on managers will work independently. Each of them will observe the current state of
add-ons and will try to sync it with the files on disk. As a result, due to races, a single add-on
can be updated multiple times in a row after upgrading the master. Long-term we should fix this
by using a mechanism similar to the one used by the controller manager or scheduler. However, currently
the add-on manager is just a bash script and adding a master election mechanism would not be easy.

## Adding replica

Command to add a new replica on GCE using the kube-up script:

```
KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh
```

The pseudo-code for adding a new master replica using managed DNS and a load balancer is the following:

```
1. If there is no load balancer for this cluster:
   1.
Create a load balancer using an ephemeral IP address
   2. Add the existing apiserver to the load balancer
   3. Wait until the load balancer is working, i.e. all data is propagated; on GCE this can take up to 20 min (sic!)
   4. Update DNS to point to the load balancer.
2. Clone the existing master (create a new VM with the same configuration), including
   all env variables (certificates, IP ranges etc), with the exception of
   `INITIAL_ETCD_CLUSTER`.
3. SSH to an existing master and run the following command to extend the etcd cluster
   with the new instance (placeholders in angle brackets have to be substituted):
   `curl <existing-master>:4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://<new-master>:2380"]}'`
4. Add the IP address of the new apiserver to the load balancer.
```

A simplified algorithm for adding a new master replica and promoting the master IP to the load balancer
is identical to the one using DNS, with different steps to set up the load balancer:

```
1. If there is no load balancer for this cluster:
   1. Unassign the IP from the existing master replica
   2. Create a load balancer using the static IP reclaimed in the previous step
   3. Add the existing apiserver to the load balancer
   4. Wait until the load balancer is working, i.e. all data is propagated; on GCE this can take up to 20 min (sic!)
...
```

## Deleting replica

Command to delete one replica on GCE using the kube-down script:

```
KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
```

The pseudo-code for deleting an existing replica of the master is the following:

```
1. Remove the replica IP address from the load balancer or DNS configuration
2. SSH to one of the remaining masters and run the following command to remove the replica
   from the etcd cluster (`<id>` is the etcd member id of the replica):
   `curl etcd-0:4001/v2/members/<id> -XDELETE -L`
3. Delete the replica VM
4. If the load balancer has only a single target instance left, delete the load balancer
5. Update DNS to point to the remaining master replica, or [on GCE] assign the static IP back to the master VM.
```

## Upgrades

Upgrading a replicated master will be possible by upgrading the replicas one by one using existing tools
(e.g. upgrade.sh for GCE). This will work out of the box because:
* Requests from nodes will be correctly served by either a new or an old master, because the apiserver is backward compatible.
* Requests from the scheduler (and controllers) go to the local apiserver via the localhost interface, so both components
will be in the same version.
* The apiserver talks only to the local etcd replica, which will be in a compatible version.
* We assume we will introduce this setup after we upgrade to etcd v3, so we don't need to cover upgrading the database.


[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/ha_master.md?pixel)]()
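## Appendix: endpoint reconciliation sketch

The three-step reconciliation loop described in the `kubernetes` service section can be sketched as
follows. This is a minimal illustrative sketch only, not real Kubernetes code: the class and method
names are hypothetical, and a plain dict of `IP -> expiration timestamp` stands in for the
`ConfigMap` in the `kube-system` namespace.

```python
import time


class EndpointReconciler:
    """Sketch of TTL-based reconciliation of `kubernetes` service endpoints.

    Hypothetical names; a real implementation would read/write the ConfigMap
    and the endpoints object through the apiserver instead of a local dict.
    """

    def __init__(self, own_ip, ttl_seconds=30, clock=time.time):
        self.own_ip = own_ip
        self.ttl = ttl_seconds
        self.clock = clock

    def reconcile(self, leases):
        """Run one reconciliation pass over `leases` (IP -> expiration time).

        Returns the resulting list of live apiserver IPs.
        """
        now = self.clock()
        # 1. Periodically update the expiration time for our own IP address;
        #    this also covers step 3 (add our own IP if not on the list yet).
        leases[self.own_ip] = now + self.ttl
        # 2. Remove all stale IP addresses, e.g. apiservers that crashed and
        #    stopped refreshing their lease.
        for ip in [ip for ip, exp in leases.items() if exp <= now]:
            del leases[ip]
        return sorted(leases)
```

With a fixed clock at t=1000 and a peer whose lease expired at t=990, one pass drops the stale peer
while keeping fresh peers and re-registering the local apiserver's own IP.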