# Automated HA master deployment

**Author:** filipg@, jsz@

# Introduction

We want to allow users to easily replicate Kubernetes masters to have a highly available cluster,
initially using `kube-up.sh` and `kube-down.sh`.

This document describes the technical design of this feature. It assumes that we are using the
aforementioned scripts for cluster deployment. All of the ideas described in the following sections
should be easy to implement on GCE, AWS and other cloud providers.

It is a non-goal to design a specific setup for a bare-metal environment, which
might be very different.

# Overview

In a cluster with a replicated master we will have N VMs, each running the regular master components
such as apiserver, etcd, scheduler or controller manager. These components will interact in the
following way:
* All etcd replicas will be clustered together and will use the master election
  and quorum mechanisms to agree on the state. All of these mechanisms are integral
  parts of etcd and we will only have to configure them properly.
* All apiserver replicas will work independently, talking to the etcd on
  127.0.0.1 (i.e. the local etcd replica), which will forward requests to the current etcd master
  if needed (as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)).
* We will introduce provider-specific solutions to load balance traffic between master replicas
  (see the `load balancing` section).
* Controller manager, scheduler & cluster autoscaler will use a lease mechanism and
  only a single instance will be the active master. All others will be waiting in standby mode.
* All add-on managers will work independently and each of them will try to keep add-ons in sync.

# Detailed design

## Components

### etcd

```
Note: This design for etcd clustering is quite pet-set like - each etcd
replica has its name which is explicitly used in the etcd configuration etc. In
the medium-term future we would like to have the ability to run masters as part of an
autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove replicas
automatically. This is pretty tricky and this design does not cover it.
It will be covered in a separate doc.
```

All etcd instances will be clustered together and one of them will be the elected master.
In order to commit any change, a quorum of the cluster will have to confirm it. Etcd will be
configured in such a way that all writes and reads go through the master (requests
will be forwarded by the local etcd server, so this is invisible to the user). This will
affect latency for all operations, but latency should not increase by much more than the network
latency between master replicas (latency between GCE zones within a region is < 10ms).

Currently etcd exposes its port only on the localhost interface. In order to allow clustering
and inter-VM communication we will also have to use the public interface. To secure this
communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)).

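As a minimal sketch (the certificate file paths and the `<this_replica>` placeholder are
assumptions, not the names our scripts will necessarily use), securing the peer traffic boils
down to passing the standard etcd TLS flags:

```
# Hypothetical flags securing etcd peer (inter-VM) traffic; file paths are placeholders.
etcd \
  -peer-client-cert-auth=true \
  -peer-trusted-ca-file /etc/srv/kubernetes/etcd-ca.crt \
  -peer-cert-file /etc/srv/kubernetes/etcd-peer.crt \
  -peer-key-file /etc/srv/kubernetes/etcd-peer.key \
  -listen-peer-urls https://0.0.0.0:2380 \
  -initial-advertise-peer-urls https://<this_replica>:2380
```
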
When generating the command line for etcd we will always assume it is part of a cluster
(initially of size 1) and list all existing Kubernetes master replicas.
Based on that, we will set the following flags:
* `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one)
* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one):
  * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty
  * `existing` if there is more than one replica, i.e. the list of existing master replicas is non-empty.

This will allow us to have exactly the same logic for HA and non-HA masters. The list of DNS names for VMs
with master replicas will be generated in the `kube-up.sh` script and passed as an env variable
`INITIAL_ETCD_CLUSTER`.

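For example, assuming two hypothetical replicas named `kubernetes-master-a` and
`kubernetes-master-b` (names, URLs and ports are placeholders, not what the scripts necessarily
generate), the flags for the second replica might look roughly like this:

```
# Illustrative only: the second replica joins the existing single-member cluster.
etcd \
  -name kubernetes-master-b \
  -initial-cluster "kubernetes-master-a=https://kubernetes-master-a:2380,kubernetes-master-b=https://kubernetes-master-b:2380" \
  -initial-cluster-state existing \
  -initial-advertise-peer-urls https://kubernetes-master-b:2380 \
  -listen-peer-urls https://0.0.0.0:2380 \
  -listen-client-urls http://127.0.0.1:2379,http://127.0.0.1:4001 \
  -advertise-client-urls http://127.0.0.1:2379
```
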
### apiservers

All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact
the etcd replica running on the same VM. If needed, such requests will be forwarded by the etcd server to the
etcd leader. This functionality is completely hidden from the client (the apiserver
in our case).

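In other words, the etcd endpoint configured on each apiserver does not change when replicas are
added; a minimal sketch of the relevant flag (the client port simply matches whatever the local
etcd serves):

```
# Each apiserver talks only to the etcd replica on its own VM.
kube-apiserver --etcd-servers=http://127.0.0.1:2379  # remaining flags as in the single-master setup
```
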
The caching mechanism implemented in the apiserver will not be affected by
replicating the master because:
* GET requests go directly to etcd
* LIST requests go either directly to etcd or to a cache populated via watch
  (depending on the ResourceVersion in ListOptions; see the example below). In the second scenario,
  after a PUT/POST request, changes might not be visible in the LIST response.
  This is however not worse than it is with the current single master.
* WATCH does not give any guarantees about when a change will be delivered.

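As a hedged illustration of the two LIST paths (`<master>` is a placeholder and authentication
flags are omitted), the watch-populated cache is used when the client explicitly allows a
possibly stale read via `resourceVersion=0`:

```
# Served from the watch-populated cache; may lag slightly behind etcd.
curl -k "https://<master>/api/v1/pods?resourceVersion=0"
# No resourceVersion: the read goes through to etcd.
curl -k "https://<master>/api/v1/pods"
```
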
#### load balancing

With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud
providers have different capabilities and limitations, we will not try to find a common lowest
denominator that will work everywhere. Instead we will document various options and apply different
solutions for different deployments. Below we list possible approaches:

1. `Managed DNS` - the user needs to specify a domain name during cluster creation. DNS entries will be managed
automatically by the deployment tool, which will be integrated with solutions like Route53 (AWS)
or Google Cloud DNS (GCP). For load balancing we will have two options:
  1.1. create an L4 load balancer in front of all apiservers and update the DNS name appropriately
  1.2. use the round-robin DNS technique to access all apiservers directly
2. `Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries
will be manually managed by the user. We will provide detailed documentation for the entries we
expect.
3. [GCP only] `Promote master IP` - in GCP, when we create the first master replica, we generate a static
external IP address that is later assigned to the master VM. When creating additional replicas we
will create a load balancer in front of them and reassign the aforementioned IP to point to the load balancer
instead of a single master (see the sketch after this list). When removing the second-to-last replica we will
reverse this operation (assign the IP address to the remaining master VM and delete the load balancer).
That way the user will not have to provide a domain name and all client configurations will keep working.

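To make option 3 more concrete, a rough sketch of the GCP operations for promoting the master IP
(resource names, region and zone are placeholders, and the exact `gcloud` flags used by the
deployment scripts may differ):

```
# Illustrative only; instance/pool names, region and zone are placeholders.
# 1. Detach the static IP from the existing master VM.
gcloud compute instances delete-access-config kubernetes-master \
  --access-config-name "external-nat" --zone us-central1-b
# 2. Create a target pool containing all master replicas.
gcloud compute target-pools create kubernetes-master-pool --region us-central1
gcloud compute target-pools add-instances kubernetes-master-pool \
  --instances kubernetes-master,kubernetes-master-b --instances-zone us-central1-b
# 3. Point a forwarding rule using the reclaimed static IP at the pool.
gcloud compute forwarding-rules create kubernetes-master-rule \
  --region us-central1 --ports 443 --address <static-master-ip> \
  --target-pool kubernetes-master-pool
```
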
This also impacts `kubelet <-> master` communication, which should go through the load-balanced
endpoint as well. Whichever method is chosen, we will use it to configure the kubelet
accordingly.

#### `kubernetes` service

Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
list of IP addresses for all apiservers. As it uses a command line flag
`--apiserver-count`, it is not very dynamic and changing the number of master replicas
would require restarting all masters.

To allow dynamic changes to the number of apiservers in the cluster, we will
introduce a `ConfigMap` in the `kube-system` namespace that will keep an expiration
time for each apiserver (keyed by IP). Each apiserver will do three things:

1. periodically update the expiration time for its own IP address
2. remove all stale IP addresses from the endpoints list
3. add its own IP address if it is not on the list yet.

That way we will not only solve the problem of a dynamically changing number
of apiservers in the cluster, but also the problem of non-responsive apiservers
that should be removed from the `kubernetes` service endpoints list.

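The bookkeeping itself will live inside the apiserver; the sketch below only illustrates the idea
with `kubectl`, and the ConfigMap name `apiserver-leases`, the key layout, the example IP and the
TTL are assumptions made purely for this example:

```
# Illustration only: ConfigMap name, key layout, example IP and TTL are hypothetical.
# 1. Refresh the lease for this apiserver's IP (here: 60s TTL).
EXPIRES=$(date -u -d "+60 seconds" +%Y-%m-%dT%H:%M:%SZ)
kubectl -n kube-system patch configmap apiserver-leases --type merge \
  -p "{\"data\":{\"10.0.0.2\":\"${EXPIRES}\"}}"
# 2. Read all leases; entries with a timestamp in the past are stale and their IPs
#    should be dropped from the kubernetes service endpoints.
kubectl -n kube-system get configmap apiserver-leases -o jsonpath='{.data}'
# 3. Make sure this apiserver's own IP is present in the endpoints object.
kubectl -n default get endpoints kubernetes -o yaml
```
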
#### Certificates

Certificate generation will work as it does today. In particular, on GCE, we will
generate the certificate for the public IP used to access the cluster (see the `load balancing`
section) and the local IP of the master replica VM.

That means that with multiple master replicas and a load balancer in front
of them, accessing one of the replicas directly (using its ephemeral public
IP) will not work on GCE without appropriate flags:

- `kubectl --insecure-skip-tls-verify=true`
- `curl --insecure`
- `wget --no-check-certificate`

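One way to see which names and IPs a replica's serving certificate actually covers (and hence why
direct access via the ephemeral IP fails verification) is to inspect its Subject Alternative Names:

```
# Inspect the certificate presented by a master replica (<master-replica-ip> is a placeholder).
echo | openssl s_client -connect <master-replica-ip>:443 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
```
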
For other deployment tools and providers the details of certificate generation
may be different, but it must be possible to access the cluster by using either
the main cluster endpoint (DNS name or IP address) or the internal service called
`kubernetes` that points directly to the apiservers.

### controller manager, scheduler & cluster autoscaler

Controller manager and scheduler will by default use a lease mechanism to choose an active instance
among all masters. Only one instance will be performing any operations.
All others will be waiting in standby mode.

We will use the same configuration in non-replicated mode to simplify deployment scripts.

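Concretely, this lease mechanism is the existing leader-election support in both components; a
minimal sketch of the relevant flag (all other flags unchanged):

```
# Lease-based election: only the instance holding the lease acts, the rest stand by.
kube-controller-manager --leader-elect=true   # plus the usual flags
kube-scheduler --leader-elect=true            # plus the usual flags
```
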
### add-on manager

All add-on managers will work independently. Each of them will observe the current state of
add-ons and will try to sync it with the files on disk. As a result, due to races, a single add-on
can be updated multiple times in a row after upgrading the master. Long-term we should fix this
by using a similar mechanism as the controller manager or scheduler. However, currently the add-on
manager is just a bash script and adding a master election mechanism would not be easy.

## Adding replica

Command to add a new replica on GCE using the kube-up script:

```
KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh
```

Pseudo-code for adding a new master replica using managed DNS and a load balancer is the following:

```
1. If there is no load balancer for this cluster:
  1. Create load balancer using ephemeral IP address
  2. Add existing apiserver to the load balancer
  3. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
  4. Update DNS to point to the load balancer.
2. Clone existing master (create a new VM with the same configuration) including
   all env variables (certificates, IP ranges etc), with the exception of
   `INITIAL_ETCD_CLUSTER`.
3. SSH to an existing master and run the following command to extend etcd cluster
   with the new instance:
   `curl <existing_master>:4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://<new_master>:2380"]}'`
4. Add IP address of the new apiserver to the load balancer.
```

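After step 3, the new peer should show up in the member list reported by any replica, which is a
convenient sanity check before step 4:

```
# List etcd cluster members via the same v2 members API used in step 3.
curl <existing_master>:4001/v2/members
```
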
A simplified algorithm for adding a new master replica and promoting the master IP to the load balancer
is identical to the one using DNS, with a different step to set up the load balancer:

```
1. If there is no load balancer for this cluster:
  1. Unassign IP from the existing master replica
  2. Create load balancer using static IP reclaimed in the previous step
  3. Add existing apiserver to the load balancer
  4. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
...
```

## Deleting replica

Command to delete one replica on GCE using the kube-down script:

```
KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
```

Pseudo-code for deleting an existing master replica is the following:

```
1. Remove replica IP address from the load balancer or DNS configuration
2. SSH to one of the remaining masters and run the following command to remove the replica from the cluster:
  `curl etcd-0:4001/v2/members/<id> -XDELETE -L`
3. Delete replica VM
4. If load balancer has only a single target instance, then delete load balancer
5. Update DNS to point to the remaining master replica, or [on GCE] assign static IP back to the master VM.
```

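The member `<id>` used in step 2 can be read from the same members API before issuing the DELETE
(`<remaining_master>` is a placeholder for any remaining replica):

```
# Look up the id of the member being removed, then delete it.
curl <remaining_master>:4001/v2/members
curl <remaining_master>:4001/v2/members/<id> -XDELETE -L
```
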
## Upgrades

Upgrading a replicated master will be possible by upgrading the replicas one by one using existing tools
(e.g. upgrade.sh for GCE). This will work out of the box because:
* Requests from nodes will be correctly served by either a new or an old master because the apiserver is backward compatible.
* Requests from the scheduler (and controllers) go to the local apiserver via the localhost interface, so both components
will be in the same version.
* The apiserver talks only to the local etcd replica, which will be in a compatible version.
* We assume we will introduce this setup after we upgrade to etcd v3, so we don't need to cover upgrading the database.

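For example, on GCE the per-replica upgrade might be driven by the existing script roughly as
follows (the flag and version shown are assumptions for illustration; consult the script itself
for the exact interface):

```
# Illustrative only: upgrade the master components, one replica at a time.
cluster/gce/upgrade.sh -M v1.4.0
```
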