Merge pull request #19257 from justinsb/doc_ubernetes_lite
Auto commit by PR queue bot
commit cb80be10f7

docs/proposals/federation-lite.md (new file, 230 lines)

# Kubernetes Multi-AZ Clusters

## (a.k.a. "Ubernetes-Lite")

## Introduction

Full Ubernetes will offer sophisticated federation between multiple Kubernetes
clusters, offering true high-availability, multiple provider support &
cloud-bursting, multiple region support, etc. However, many users have
expressed a desire for a "reasonably" highly-available cluster that runs in
multiple zones on GCE or availability zones in AWS, and can tolerate the failure
of a single zone without the complexity of running multiple clusters.

Ubernetes-Lite aims to deliver exactly that functionality: to run a single
Kubernetes cluster in multiple zones. It will attempt to make reasonable
scheduling decisions, in particular so that a replication controller's pods are
spread across zones, and it will try to be aware of constraints - for example
that a volume cannot be mounted on a node in a different zone.

Ubernetes-Lite is deliberately limited in scope; for many advanced functions
the answer will be "use Ubernetes (full)". For example, multiple-region
support is not in scope. Routing affinity (e.g. so that a webserver will
prefer to talk to a backend service in the same zone) is similarly not in
scope.

## Design

These are the main requirements:

1. kube-up must allow bringing up a cluster that spans multiple zones.
1. pods in a replication controller should attempt to spread across zones.
1. pods which require volumes should not be scheduled onto nodes in a different zone.
1. load-balanced services should work reasonably.

### kube-up support

kube-up support for multiple zones will initially be considered
advanced/experimental functionality, so the interface is not initially going to
be particularly user-friendly. As we design the evolution of kube-up, we will
make multiple zones better supported.

For the initial implementation, kube-up must be run multiple times, once for
each zone. The first kube-up will take place as normal, but then for each
additional zone the user must run kube-up again, specifying
`KUBE_SHARE_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will then
create additional nodes in a different zone, but will register them with the
existing master.

### Zone spreading

This will be implemented by modifying the existing scheduler priority function
`SelectorSpread`. Currently this priority function aims to put pods in an RC
on different hosts, but it will be extended first to spread across zones, and
then to spread across hosts.

So that the scheduler does not need to call out to the cloud provider on every
scheduling decision, we must somehow record the zone information for each node.
The implementation of this will be described in the implementation section.

Note that zone spreading is 'best effort'; zones are just one of the factors
in making scheduling decisions, and thus it is not guaranteed that pods will
spread evenly across zones. However, this is likely desirable: if a zone is
overloaded or failing, we still want to schedule the requested number of pods.

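For illustration, here is a minimal Go sketch of the kind of zone-aware spreading the extended priority function could compute. The types and scoring weights are simplified stand-ins, not the actual `SelectorSpread` code; it only assumes nodes carry the zone label described in the implementation section.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the scheduler's view of a node;
// the real SelectorSpread priority operates on the Kubernetes API types.
type Node struct {
	Name   string
	Labels map[string]string // includes the zone label once nodes are labeled
}

const zoneLabel = "failure-domain.alpha.kubernetes.io/zone"

// zoneSpreadScore gives higher scores (0-10) to nodes in zones that currently
// run fewer pods from the same replication controller, so the RC's pods spread
// across zones first and across hosts within a zone second.
func zoneSpreadScore(node Node, podsPerZone, podsPerNode map[string]int, maxCount int) int {
	if maxCount == 0 {
		maxCount = 1
	}
	zone := node.Labels[zoneLabel]
	// Fewer matching pods in the node's zone => higher zone score.
	zoneScore := 10 * (maxCount - podsPerZone[zone]) / maxCount
	// Fewer matching pods on the node itself => higher node score.
	nodeScore := 10 * (maxCount - podsPerNode[node.Name]) / maxCount
	// Weight zone spreading more heavily than host spreading (illustrative ratio).
	return (2*zoneScore + nodeScore) / 3
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{zoneLabel: "us-central1-a"}},
		{Name: "node-b", Labels: map[string]string{zoneLabel: "us-central1-b"}},
	}
	podsPerZone := map[string]int{"us-central1-a": 3, "us-central1-b": 1}
	podsPerNode := map[string]int{"node-a": 2, "node-b": 1}
	for _, n := range nodes {
		fmt.Println(n.Name, zoneSpreadScore(n, podsPerZone, podsPerNode, 3))
	}
}
```
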
### Volume affinity

Most cloud providers (at least GCE and AWS) cannot attach their persistent
volumes across zones. Thus when a pod is being scheduled, if there is a volume
attached, that will dictate the zone. This will be implemented using a new
scheduler predicate (a hard constraint): `VolumeZonePredicate`.

When `VolumeZonePredicate` observes a pod scheduling request that includes a
volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any
nodes not in that zone.

Again, to avoid the scheduler calling out to the cloud provider, this will rely
on information attached to the volumes. This means that this will only support
PersistentVolumeClaims, because direct mounts do not have a place to attach
zone information. PersistentVolumes will then include zone information where
volumes are zone-specific.

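A minimal sketch of the predicate's logic, using simplified stand-in types rather than the real scheduler and API types; it assumes the pod's volumes and the node both carry the zone information described above.

```go
package main

import "fmt"

// Hypothetical, simplified types: the real predicate would read the zone label
// from the PersistentVolume bound to each PersistentVolumeClaim the pod uses.
type Volume struct {
	Name string
	Zone string // empty for volumes that are not zone-specific
}

type PodSpec struct {
	Volumes []Volume
}

const zoneLabel = "failure-domain.alpha.kubernetes.io/zone"

// volumeZonePredicate returns true if the node may run the pod: every
// zone-specific volume the pod uses must live in the node's zone.
func volumeZonePredicate(pod PodSpec, nodeLabels map[string]string) bool {
	nodeZone := nodeLabels[zoneLabel]
	for _, v := range pod.Volumes {
		if v.Zone != "" && v.Zone != nodeZone {
			return false // hard constraint: exclude nodes in other zones
		}
	}
	return true
}

func main() {
	pod := PodSpec{Volumes: []Volume{{Name: "myvolume", Zone: "us-central1-b"}}}
	nodeA := map[string]string{zoneLabel: "us-central1-a"}
	nodeB := map[string]string{zoneLabel: "us-central1-b"}
	fmt.Println(volumeZonePredicate(pod, nodeA), volumeZonePredicate(pod, nodeB))
}
```
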
### Load-balanced services should operate reasonably

For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each
service of type LoadBalancer. The native cloud load-balancers on both AWS &
GCE are region-level, and support load-balancing across instances in multiple
zones (in the same region). For both clouds, the behaviour of the native cloud
load-balancer is reasonable in the face of failures (indeed, this is why clouds
provide load-balancing as a primitive).

For Ubernetes-Lite we will therefore simply rely on the native cloud provider
load balancer behaviour, and we do not anticipate substantial code changes.

One notable shortcoming here is that load-balanced traffic still goes through
kube-proxy controlled routing, and kube-proxy does not (currently) favor
targeting a pod running on the same instance or even the same zone. This will
likely produce a lot of unnecessary cross-zone traffic (which is likely slower
and more expensive). This might be sufficiently low-hanging fruit that we
choose to address it in kube-proxy / Ubernetes-Lite, but this can be addressed
after the initial Ubernetes-Lite implementation.


## Implementation

The main implementation points are:

1. how to attach zone information to Nodes and PersistentVolumes
1. how nodes get zone information
1. how volumes get zone information

### Attaching zone information

We must attach zone information to Nodes and PersistentVolumes, and possibly to
other resources in future. There are two obvious alternatives: we can use
labels/annotations, or we can extend the schema to include the information.

For the initial implementation, we propose to use labels. The reasoning is:

1. It is considerably easier to implement.
1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and
`failure-domain.alpha.kubernetes.io/region` for the two pieces of information
we need. By putting this under the `kubernetes.io` namespace there is no risk
of collision, and by putting it under `alpha.kubernetes.io` we clearly mark
this as an experimental feature. (An illustrative sketch of these labels
follows this list.)
1. We do not yet know whether these labels will be sufficient for all
environments, nor which entities will require zone information. Labels give us
more flexibility here.
1. Because the labels are reserved, we can move to schema-defined fields in
future using our cross-version mapping techniques.

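As a small illustration (not code from this proposal), the reserved keys could be carried as constants and applied as ordinary labels; the zone and region values shown are GCE-style examples.

```go
package main

import "fmt"

// The two reserved label keys proposed above.
const (
	ZoneLabel   = "failure-domain.alpha.kubernetes.io/zone"
	RegionLabel = "failure-domain.alpha.kubernetes.io/region"
)

func main() {
	// A Node (or PersistentVolume) in GCE zone us-central1-a would carry:
	labels := map[string]string{
		ZoneLabel:   "us-central1-a",
		RegionLabel: "us-central1",
	}
	fmt.Println(labels)
}
```
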
### Node labeling

We do not want to require an administrator to manually label nodes. We instead
modify the kubelet to include the appropriate labels when it registers itself.
The information is easily obtained by the kubelet from the cloud provider.

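A hedged sketch of what that registration-time labeling could look like; the `zones` interface and `fakeGCE` type below are hypothetical stand-ins for the kubelet's configured cloud provider, not the real cloud provider API.

```go
package main

import "fmt"

// Hypothetical interface standing in for the cloud provider's zone lookup.
type zones interface {
	GetZone() (region, zone string, err error)
}

type fakeGCE struct{}

func (fakeGCE) GetZone() (string, string, error) { return "us-central1", "us-central1-a", nil }

// nodeLabels builds the labels the kubelet would attach to its Node object
// when it registers itself with the API server.
func nodeLabels(cloud zones) (map[string]string, error) {
	region, zone, err := cloud.GetZone()
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"failure-domain.alpha.kubernetes.io/region": region,
		"failure-domain.alpha.kubernetes.io/zone":   zone,
	}, nil
}

func main() {
	labels, _ := nodeLabels(fakeGCE{})
	fmt.Println(labels)
}
```
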
### Volume labeling

As with nodes, we do not want to require an administrator to manually label
volumes. We will create an admission controller `PersistentVolumeLabel`.
`PersistentVolumeLabel` will intercept requests to create PersistentVolumes,
and will label them appropriately by calling in to the cloud provider.

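A rough sketch of the admission-time flow, with simplified stand-in types (`PersistentVolume`, `volumeZoneLookup`) rather than the real API-server admission interfaces.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins; a real admission controller would work
// with the API server's admission attributes and the PersistentVolume API type.
type PersistentVolume struct {
	Name      string
	GCEPDName string // name of the backing GCE PD, if any
	Labels    map[string]string
}

type volumeZoneLookup interface {
	GetVolumeZone(volumeName string) (region, zone string, err error)
}

// admitPersistentVolume shows what a PersistentVolumeLabel-style controller
// would do on a create request: look up the volume's zone in the cloud
// provider and attach the zone/region labels before the object is stored.
func admitPersistentVolume(pv *PersistentVolume, cloud volumeZoneLookup) error {
	if pv.GCEPDName == "" {
		return nil // not a zone-specific cloud volume
	}
	region, zone, err := cloud.GetVolumeZone(pv.GCEPDName)
	if err != nil {
		return err
	}
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	pv.Labels["failure-domain.alpha.kubernetes.io/region"] = region
	pv.Labels["failure-domain.alpha.kubernetes.io/zone"] = zone
	return nil
}

type fakeCloud struct{}

func (fakeCloud) GetVolumeZone(string) (string, string, error) {
	return "us-central1", "us-central1-b", nil
}

func main() {
	pv := &PersistentVolume{Name: "pv-1", GCEPDName: "myvolume"}
	_ = admitPersistentVolume(pv, fakeCloud{})
	fmt.Println(pv.Labels)
}
```
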
## AWS Specific Considerations

The AWS implementation here is fairly straightforward. The AWS API is
region-wide, meaning that a single call will find instances and volumes in all
zones. In addition, instance ids and volume ids are unique per-region (and
hence also per-zone). I believe they are actually globally unique, but I do
not know if this is guaranteed; in any case we only need global uniqueness if
we are to span regions, which will not be supported by Ubernetes-Lite (to do
that correctly requires an Ubernetes-Full type approach).

## GCE Specific Considerations

The GCE implementation is more complicated than the AWS implementation because
GCE APIs are zone-scoped. To perform an operation, we must perform one REST
call per zone and combine the results, unless we can determine in advance that
an operation references a particular zone. For many operations we can make
that determination, but in some cases - such as listing all instances - we
must combine results from calls in all relevant zones.

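For example, a list operation that cannot be pinned to a single zone could be assembled like this (an illustrative helper, not the actual GCE cloud provider code; `listInZone` stands in for a zone-scoped GCE API call).

```go
package main

import "fmt"

// listAllInstances illustrates the pattern: issue one zone-scoped call per
// relevant zone and merge the results into a single region-wide view.
func listAllInstances(zones []string, listInZone func(zone string) ([]string, error)) ([]string, error) {
	var all []string
	for _, zone := range zones {
		instances, err := listInZone(zone)
		if err != nil {
			return nil, err
		}
		all = append(all, instances...)
	}
	return all, nil
}

func main() {
	// Stand-in for a per-zone GCE API call.
	fakeList := func(zone string) ([]string, error) {
		return []string{"node-1-" + zone, "node-2-" + zone}, nil
	}
	instances, _ := listAllInstances([]string{"us-central1-a", "us-central1-b"}, fakeList)
	fmt.Println(instances)
}
```
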
A further complexity is that GCE volume names are scoped per-zone, not
per-region. Thus it is permitted to have two volumes both named `myvolume` in
two different GCE zones. (Instance names are currently unique per-region, and
thus are not a problem for Ubernetes-Lite.)

The volume scoping leads to a (small) behavioural change for Ubernetes-Lite on
GCE. If you had two volumes both named `myvolume` in two different GCE zones,
this would not be ambiguous when Kubernetes is operating only in a single zone.
But if Ubernetes-Lite is operating in multiple zones, `myvolume` is no longer
sufficient to specify a volume uniquely. Worse, the fact that a volume happens
to be unambiguous at a particular time is no guarantee that it will continue to
be unambiguous in future, because a volume with the same name could
subsequently be created in a second zone. While perhaps unlikely in practice,
we cannot automatically enable Ubernetes-Lite for GCE users if this then causes
volume mounts to stop working.

This suggests that (at least on GCE) Ubernetes-Lite must be optional (i.e.
there must be a feature flag). It may be that we can make this feature
semi-automatic in future, by detecting whether nodes are running in multiple
zones, but it seems likely that kube-up could instead simply set this flag.

For the initial implementation, creating volumes with identical names will
yield undefined results. Later, we may add some way to specify the zone for a
volume (and possibly require that volumes have their zone specified when
running with Ubernetes-Lite). We could add a new `zone` field to the
PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
name for the volume name (`<name>.<zone>`).

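If the dotted-name option were chosen (it is only one of the alternatives above, not a decision of this proposal), the parsing could be as simple as the following sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// splitVolumeName illustrates the DNS-style dotted-name option discussed
// above: "myvolume.us-central1-b" -> ("myvolume", "us-central1-b").
// A name with no dot is treated as having no explicit zone.
func splitVolumeName(name string) (volume, zone string) {
	if i := strings.Index(name, "."); i >= 0 {
		return name[:i], name[i+1:]
	}
	return name, ""
}

func main() {
	fmt.Println(splitVolumeName("myvolume.us-central1-b"))
	fmt.Println(splitVolumeName("myvolume"))
}
```
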
Initially, therefore, the GCE changes will be to:

1. change kube-up to support creation of a cluster in multiple zones
1. pass a flag enabling Ubernetes-Lite with kube-up
1. change the Kubernetes cloud provider to iterate through relevant zones when resolving items
1. tag GCE PD volumes with the appropriate zone information