<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.2/docs/proposals/federation-lite.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes Multi-AZ Clusters

## (a.k.a. "Ubernetes-Lite")

## Introduction

Full Ubernetes will offer sophisticated federation between multiple Kubernetes
clusters, offering true high availability, support for multiple providers,
cloud-bursting, multiple-region support, etc.  However, many users have
expressed a desire for a "reasonably" highly available cluster that runs in
multiple zones on GCE or availability zones on AWS, and can tolerate the
failure of a single zone without the complexity of running multiple clusters.

Ubernetes-Lite aims to deliver exactly that functionality: to run a single
Kubernetes cluster in multiple zones.  It will attempt to make reasonable
scheduling decisions, in particular so that a replication controller's pods
are spread across zones, and it will try to be aware of constraints - for
example, that a volume cannot be mounted on a node in a different zone.

Ubernetes-Lite is deliberately limited in scope; for many advanced functions
the answer will be "use Ubernetes (full)".  For example, multiple-region
support is not in scope.  Routing affinity (e.g. so that a webserver will
prefer to talk to a backend service in the same zone) is similarly not in
scope.

## Design

These are the main requirements:

1. kube-up must allow bringing up a cluster that spans multiple zones.
1. Pods in a replication controller should attempt to spread across zones.
1. Pods which require volumes should not be scheduled onto nodes in a
different zone.
1. Load-balanced services should work reasonably.

### kube-up support

kube-up support for multiple zones will initially be considered
advanced/experimental functionality, so the interface will not be
particularly user-friendly at first.  As we design the evolution of kube-up,
we will make multiple zones better supported.

For the initial implementation, kube-up must be run multiple times, once for
each zone.  The first kube-up will take place as normal, but then for each
additional zone the user must run kube-up again, specifying
`KUBE_USE_EXISTING_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`.  This
will then create additional nodes in a different zone, but will register them
with the existing master.

### Zone spreading

This will be implemented by modifying the existing scheduler priority function
`SelectorSpread`.  Currently this priority function aims to put pods in an RC
on different hosts; it will be extended to spread first across zones, and then
across hosts.

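The intent can be sketched as follows (an illustrative sketch in Go, not the
actual `SelectorSpread` code; the types and the weighting are assumptions made
for the example):

```go
// Sketch of two-level spreading: when scoring a candidate node, prefer
// zones - and then hosts - that already run fewer pods from the same RC.
// The real logic belongs in the SelectorSpread priority function; the
// weighting here is an assumption for illustration.
package main

import "fmt"

const zoneWeight = 2 // assumed: zone spreading outweighs host spreading

// score returns a priority for a candidate node; higher is better.
// podsInZone and podsOnHost count this RC's pods already placed there.
func score(podsInZone, podsOnHost, maxPods int) int {
	return zoneWeight*(maxPods-podsInZone) + (maxPods - podsOnHost)
}

func main() {
	// An empty node in an empty zone beats an empty node in a busy zone.
	fmt.Println(score(0, 0, 10)) // empty zone, empty host: 30
	fmt.Println(score(3, 0, 10)) // busy zone, empty host:  24
}
```
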
So that the scheduler does not need to call out to the cloud provider on every
scheduling decision, we must somehow record the zone information for each
node.  The implementation of this will be described in the implementation
section.

Note that zone spreading is 'best effort'; zones are just one of the factors
in making scheduling decisions, and thus it is not guaranteed that pods will
spread evenly across zones.  However, this is likely desirable: if a zone is
overloaded or failing, we still want to schedule the requested number of pods.

### Volume affinity

Most cloud providers (at least GCE and AWS) cannot attach their persistent
volumes across zones.  Thus when a pod is being scheduled, if there is a
volume attached, that will dictate the zone.  This will be implemented using a
new scheduler predicate (a hard constraint): `VolumeZonePredicate`.

When `VolumeZonePredicate` observes a pod scheduling request that includes a
volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude
any nodes not in that zone.

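A minimal sketch of that hard constraint, assuming simplified types (the real
predicate will read the zone from information attached to the volume, as
described under Implementation):

```go
// Sketch: a node is feasible only if it is in the same zone as every
// zone-specific volume the pod uses.  Types are simplified stand-ins.
package main

import "fmt"

type volume struct {
	name string
	zone string // "" means the volume is not zone-specific
}

// fits reports whether a node in nodeZone can host a pod using volumes.
func fits(nodeZone string, volumes []volume) bool {
	for _, v := range volumes {
		if v.zone != "" && v.zone != nodeZone {
			return false // hard constraint: wrong zone excludes the node
		}
	}
	return true
}

func main() {
	vols := []volume{{name: "data", zone: "us-central1-a"}}
	fmt.Println(fits("us-central1-a", vols)) // true
	fmt.Println(fits("us-central1-b", vols)) // false - the volume pins the zone
}
```
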
Again, to avoid the scheduler calling out to the cloud provider, this will
rely on information attached to the volumes.  This means that only
PersistentVolumeClaims will be supported, because direct mounts do not have a
place to attach zone information.  PersistentVolumes will then include zone
information where volumes are zone-specific.

### Load-balanced services should operate reasonably

For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each
service of type LoadBalancer.  The native cloud load-balancers on both AWS &
GCE are region-level, and support load-balancing across instances in multiple
zones (in the same region).  For both clouds, the behaviour of the native
cloud load-balancer is reasonable in the face of failures (indeed, this is why
clouds provide load-balancing as a primitive).

For Ubernetes-Lite we will therefore simply rely on the native cloud provider
load balancer behaviour, and we do not anticipate substantial code changes.

One notable shortcoming here is that load-balanced traffic still goes through
kube-proxy controlled routing, and kube-proxy does not (currently) favor
targeting a pod running on the same instance or even the same zone.  This will
likely produce a lot of unnecessary cross-zone traffic (which is likely slower
and more expensive).  This might be sufficiently low-hanging fruit that we
choose to address it in kube-proxy / Ubernetes-Lite, but this can be addressed
after the initial Ubernetes-Lite implementation.

## Implementation

The main implementation points are:

1. how to attach zone information to Nodes and PersistentVolumes
1. how nodes get zone information
1. how volumes get zone information

### Attaching zone information

We must attach zone information to Nodes and PersistentVolumes, and possibly
to other resources in future.  There are two obvious alternatives: we can use
labels/annotations, or we can extend the schema to include the information.

For the initial implementation, we propose to use labels.  The reasoning is:

1. It is considerably easier to implement.
1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone`
and `failure-domain.alpha.kubernetes.io/region` for the two pieces of
information we need.  By putting this under the `kubernetes.io` namespace
there is no risk of collision, and by putting it under `alpha.kubernetes.io`
we clearly mark this as an experimental feature (the sketch after this list
shows the keys in use).
1. We do not yet know whether these labels will be sufficient for all
environments, nor which entities will require zone information.  Labels give
us more flexibility here.
1. Because the labels are reserved, we can move to schema-defined fields in
future using our cross-version mapping techniques.

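For concreteness, the reserved keys and an example of a labeled node (a sketch
in Go; the two label keys are from this proposal, while the zone and region
values are made up):

```go
// The two reserved, experimental label keys from this proposal.
package main

import "fmt"

const (
	labelZone   = "failure-domain.alpha.kubernetes.io/zone"
	labelRegion = "failure-domain.alpha.kubernetes.io/region"
)

func main() {
	// Example labels for a node in AWS us-east-1a (values illustrative).
	nodeLabels := map[string]string{
		labelZone:   "us-east-1a",
		labelRegion: "us-east-1",
	}
	fmt.Println(nodeLabels)
}
```
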
### Node labeling

We do not want to require an administrator to manually label nodes.  We
instead modify the kubelet to include the appropriate labels when it registers
itself.  The information is easily obtained by the kubelet from the cloud
provider.

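A sketch of that registration path, using a simplified stand-in for the
cloud-provider interface (the types and function names here are illustrative,
not the kubelet's actual code):

```go
// Sketch: at registration time the kubelet asks the cloud provider for
// its zone and attaches the reserved labels to its Node object.
// All types and methods here are simplified stand-ins.
package main

import "fmt"

type zoneInfo struct{ region, failureDomain string }

// cloudProvider is a stand-in for the real cloud-provider interface.
type cloudProvider interface {
	GetZone() (zoneInfo, error)
}

type node struct{ labels map[string]string }

// labelNode fills in the reserved zone/region labels from the cloud provider.
func labelNode(n *node, cloud cloudProvider) error {
	z, err := cloud.GetZone()
	if err != nil {
		return err
	}
	n.labels["failure-domain.alpha.kubernetes.io/zone"] = z.failureDomain
	n.labels["failure-domain.alpha.kubernetes.io/region"] = z.region
	return nil
}

type fakeCloud struct{}

func (fakeCloud) GetZone() (zoneInfo, error) {
	return zoneInfo{region: "us-east-1", failureDomain: "us-east-1a"}, nil
}

func main() {
	n := &node{labels: map[string]string{}}
	if err := labelNode(n, fakeCloud{}); err != nil {
		fmt.Println("labeling failed:", err)
		return
	}
	fmt.Println(n.labels)
}
```
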
### Volume labeling

As with nodes, we do not want to require an administrator to manually label
volumes.  We will create an admission controller `PersistentVolumeLabel`.
`PersistentVolumeLabel` will intercept requests to create PersistentVolumes,
and will label them appropriately by calling into the cloud provider.

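A sketch of that admission flow, again with simplified stand-in types (the
`lookupVolumeZone` helper is hypothetical; the real controller would query the
cloud provider):

```go
// Sketch: the PersistentVolumeLabel admission controller intercepts
// PersistentVolume creation and attaches zone/region labels obtained
// from the cloud provider.  Types and helpers are simplified stand-ins.
package main

import "fmt"

type persistentVolume struct {
	name   string
	labels map[string]string
}

// lookupVolumeZone is a stand-in for the cloud-provider call that maps a
// volume to its zone and region.
func lookupVolumeZone(volumeName string) (zone, region string, err error) {
	return "us-central1-a", "us-central1", nil // canned answer for the sketch
}

// admit mutates a PersistentVolume creation request, labeling the volume.
func admit(pv *persistentVolume) error {
	zone, region, err := lookupVolumeZone(pv.name)
	if err != nil {
		return err
	}
	if pv.labels == nil {
		pv.labels = map[string]string{}
	}
	pv.labels["failure-domain.alpha.kubernetes.io/zone"] = zone
	pv.labels["failure-domain.alpha.kubernetes.io/region"] = region
	return nil
}

func main() {
	pv := &persistentVolume{name: "pv-1"}
	if err := admit(pv); err != nil {
		fmt.Println("admission failed:", err)
		return
	}
	fmt.Println(pv.labels)
}
```
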
## AWS Specific Considerations

The AWS implementation here is fairly straightforward.  The AWS API is
region-wide, meaning that a single call will find instances and volumes in all
zones.  In addition, instance ids and volume ids are unique per-region (and
hence also per-zone).  I believe they are actually globally unique, but I do
not know if this is guaranteed; in any case we only need global uniqueness if
we are to span regions, which will not be supported by Ubernetes-Lite (to do
that correctly requires an Ubernetes-Full type approach).

## GCE Specific Considerations

The GCE implementation is more complicated than the AWS implementation because
GCE APIs are zone-scoped.  To perform an operation, we must perform one REST
call per zone and combine the results, unless we can determine in advance that
an operation references a particular zone.  For many operations we can make
that determination, but in some cases - such as listing all instances - we
must combine results from calls in all relevant zones.

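The fan-out pattern looks roughly like this (a sketch; `listInZone` stands in
for the zone-scoped GCE REST call and returns canned data here):

```go
// Sketch: because GCE APIs are zone-scoped, listing all instances means
// issuing one call per zone and merging the results.
package main

import "fmt"

// listInZone is a stand-in for the per-zone GCE REST call.
func listInZone(zone string) ([]string, error) {
	// Canned data standing in for a zone-scoped GCE API response.
	data := map[string][]string{
		"us-central1-a": {"node-a1", "node-a2"},
		"us-central1-b": {"node-b1"},
	}
	return data[zone], nil
}

// listAllInstances fans out across the cluster's zones and combines results.
func listAllInstances(zones []string) ([]string, error) {
	var all []string
	for _, zone := range zones {
		instances, err := listInZone(zone)
		if err != nil {
			return nil, err
		}
		all = append(all, instances...)
	}
	return all, nil
}

func main() {
	instances, _ := listAllInstances([]string{"us-central1-a", "us-central1-b"})
	fmt.Println(instances) // [node-a1 node-a2 node-b1]
}
```
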
A further complexity is that GCE volume names are scoped per-zone, not
per-region.  Thus it is permitted to have two volumes both named `myvolume` in
two different GCE zones.  (Instance names are currently unique per-region, and
thus are not a problem for Ubernetes-Lite.)

The volume scoping leads to a (small) behavioural change for Ubernetes-Lite on
GCE.  If you had two volumes both named `myvolume` in two different GCE zones,
this would not be ambiguous when Kubernetes is operating only in a single
zone.  But, if Ubernetes-Lite is operating in multiple zones, `myvolume` is no
longer sufficient to specify a volume uniquely.  Worse, the fact that a volume
happens to be unambiguous at a particular time is no guarantee that it will
continue to be unambiguous in future, because a volume with the same name
could subsequently be created in a second zone.  While perhaps unlikely in
practice, we cannot automatically enable Ubernetes-Lite for GCE users if this
then causes volume mounts to stop working.

This suggests that (at least on GCE), Ubernetes-Lite must be optional (i.e.
there must be a feature-flag).  It may be that we can make this feature
semi-automatic in future, by detecting whether nodes are running in multiple
zones, but it seems likely that kube-up could instead simply set this flag.

For the initial implementation, creating volumes with identical names will
yield undefined results.  Later, we may add some way to specify the zone for a
volume (and possibly require that volumes have their zone specified when
running with Ubernetes-Lite).  We could add a new `zone` field to the
PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
name for the volume name (`<name>.<zone>`).

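If the dotted-name option were chosen, disambiguating a volume reference could
look roughly like this (purely illustrative; neither option has been decided):

```go
// Sketch: parsing a DNS-style dotted volume name of the form <name>.<zone>,
// one of the two options mentioned above.
package main

import (
	"fmt"
	"strings"
)

// splitVolumeName returns the volume name and, if present, its zone.
func splitVolumeName(qualified string) (name, zone string) {
	if i := strings.LastIndex(qualified, "."); i >= 0 {
		return qualified[:i], qualified[i+1:]
	}
	return qualified, "" // no zone: ambiguous in a multi-zone cluster
}

func main() {
	name, zone := splitVolumeName("myvolume.us-central1-b")
	fmt.Println(name, zone) // myvolume us-central1-b
}
```
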
Initially therefore, the GCE changes will be to:

1. change kube-up to support creation of a cluster in multiple zones
1. pass a flag enabling Ubernetes-Lite with kube-up
1. change the Kubernetes cloud provider to iterate through relevant zones
when resolving items
1. tag GCE PD volumes with the appropriate zone information
