# Kubernetes Multi-AZ Clusters

## (a.k.a. "Ubernetes-Lite")

## Introduction

Full Ubernetes will offer sophisticated federation between multiple Kubernetes
clusters, offering true high-availability, multiple provider support &
cloud-bursting, multiple region support, etc. However, many users have
expressed a desire for a "reasonably" highly-available cluster that runs in
multiple zones on GCE or availability zones in AWS, and can tolerate the
failure of a single zone without the complexity of running multiple clusters.

Ubernetes-Lite aims to deliver exactly that functionality: to run a single
Kubernetes cluster in multiple zones. It will attempt to make reasonable
scheduling decisions, in particular so that a replication controller's pods are
spread across zones, and it will try to be aware of constraints - for example
that a volume cannot be mounted on a node in a different zone.

Ubernetes-Lite is deliberately limited in scope; for many advanced functions
the answer will be "use Ubernetes (full)". For example, multiple-region
support is not in scope. Routing affinity (e.g. so that a webserver will
prefer to talk to a backend service in the same zone) is similarly not in
scope.

## Design

These are the main requirements:

1. kube-up must allow bringing up a cluster that spans multiple zones.
1. Pods in a replication controller should attempt to spread across zones.
1. Pods which require volumes should not be scheduled onto nodes in a different zone.
1. Load-balanced services should work reasonably.

### kube-up support

kube-up support for multiple zones will initially be considered
advanced/experimental functionality, so the interface is not initially going to
be particularly user-friendly. As we design the evolution of kube-up, we will
make multiple zones better supported.

For the initial implementation, kube-up must be run multiple times, once for
each zone. The first kube-up runs as normal, but for each additional zone the
user must run kube-up again, specifying `KUBE_SHARE_MASTER=true` and
`KUBE_SUBNET_CIDR=172.20.x.0/24`. This will create additional nodes in a
different zone, but will register them with the existing master.

### Zone spreading

This will be implemented by modifying the existing scheduler priority function
`SelectorSpread`. Currently this priority function aims to put pods in an RC
on different hosts; it will be extended to spread first across zones, and then
across hosts.

So that the scheduler does not need to call out to the cloud provider on every
scheduling decision, we must somehow record the zone information for each node.
The implementation of this is described in the implementation section below.

Note that zone spreading is 'best effort'; zones are just one of the factors in
making scheduling decisions, and thus it is not guaranteed that pods will
spread evenly across zones. However, this is likely desirable: if a zone is
overloaded or failing, we still want to schedule the requested number of pods.

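To make the intended two-level behaviour concrete, here is a minimal,
self-contained sketch of the kind of scoring this implies. It is not the actual
`SelectorSpread` code; the `candidate` type, the zone weighting, and the zone
names are assumptions made purely for illustration.

```go
package main

import "fmt"

// candidate is an illustrative stand-in for a node being scored, carrying the
// number of pods from the same replication controller already on the node and
// already in the node's zone (as recorded by the
// failure-domain.alpha.kubernetes.io/zone label).
type candidate struct {
	nodeName   string
	zone       string
	podsOnNode int
	podsInZone int
}

// spreadScore returns a 0-10 priority where higher means "prefer this node".
// The zone term is weighted more heavily than the node term, so pods spread
// first across zones and then across hosts within a zone. The weighting here
// is an assumption for the sketch, not a decided value.
func spreadScore(c candidate, maxOnNode, maxInZone int) int {
	const zoneWeight = 2.0
	nodeScore, zoneScore := 10.0, 10.0
	if maxOnNode > 0 {
		nodeScore = 10.0 * float64(maxOnNode-c.podsOnNode) / float64(maxOnNode)
	}
	if maxInZone > 0 {
		zoneScore = 10.0 * float64(maxInZone-c.podsInZone) / float64(maxInZone)
	}
	return int((zoneWeight*zoneScore + nodeScore) / (zoneWeight + 1.0))
}

func main() {
	// Three candidate nodes for the next pod of an RC that already has pods
	// placed: node-a's zone is the most crowded, so it scores lowest.
	candidates := []candidate{
		{"node-a", "us-central1-a", 2, 3},
		{"node-b", "us-central1-b", 1, 1},
		{"node-c", "us-central1-b", 0, 1},
	}
	for _, c := range candidates {
		fmt.Printf("%s (zone %s): score %d\n", c.nodeName, c.zone, spreadScore(c, 2, 3))
	}
}
```

With a weighting like this, a node in a less-loaded zone wins even when a
more-loaded zone has an individually emptier node, which matches the "spread
across zones first, then across hosts" intent described above.
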
### Volume affinity

Most cloud providers (at least GCE and AWS) cannot attach their persistent
volumes across zones. Thus when a pod is being scheduled, if there is a volume
attached, that volume will dictate the zone. This will be implemented using a
new scheduler predicate (a hard constraint): `VolumeZonePredicate`.

When `VolumeZonePredicate` observes a pod scheduling request that includes a
volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any
nodes not in that zone.

Again, to avoid the scheduler calling out to the cloud provider, this will rely
on information attached to the volumes. This means that this will only support
PersistentVolumeClaims, because direct mounts do not have a place to attach
zone information. PersistentVolumes will then include zone information where
volumes are zone-specific.

### Load-balanced services should operate reasonably

For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each
service of type LoadBalancer. The native cloud load-balancers on both AWS &
GCE are region-level, and support load-balancing across instances in multiple
zones (in the same region). For both clouds, the behaviour of the native cloud
load-balancer is reasonable in the face of failures (indeed, this is why clouds
provide load-balancing as a primitive).

For Ubernetes-Lite we will therefore simply rely on the native cloud provider
load balancer behaviour, and we do not anticipate substantial code changes.

One notable shortcoming here is that load-balanced traffic still goes through
kube-proxy controlled routing, and kube-proxy does not (currently) favor
targeting a pod running on the same instance or even in the same zone. This
will likely produce a lot of unnecessary cross-zone traffic (which is likely
slower and more expensive). This might be sufficiently low-hanging fruit that
we choose to address it in kube-proxy / Ubernetes-Lite, but it can be addressed
after the initial Ubernetes-Lite implementation.

## Implementation

The main implementation points are:

1. how to attach zone information to Nodes and PersistentVolumes
1. how nodes get zone information
1. how volumes get zone information

### Attaching zone information

We must attach zone information to Nodes and PersistentVolumes, and possibly to
other resources in future. There are two obvious alternatives: we can use
labels/annotations, or we can extend the schema to include the information.

For the initial implementation, we propose to use labels. The reasoning is:

1. It is considerably easier to implement.
1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and
`failure-domain.alpha.kubernetes.io/region` for the two pieces of information
we need. By putting these under the `kubernetes.io` namespace there is no risk
of collision, and by putting them under `alpha.kubernetes.io` we clearly mark
this as an experimental feature.
1. We do not yet know whether these labels will be sufficient for all
environments, nor which entities will require zone information. Labels give us
more flexibility here.
1. Because the labels are reserved, we can move to schema-defined fields in
future using our cross-version mapping techniques.

### Node labeling

We do not want to require an administrator to manually label nodes. We instead
modify the kubelet to include the appropriate labels when it registers itself.
The information is easily obtained by the kubelet from the cloud provider.

### Volume labeling

As with nodes, we do not want to require an administrator to manually label
volumes. We will create an admission controller, `PersistentVolumeLabel`.
`PersistentVolumeLabel` will intercept requests to create PersistentVolumes,
and will label them appropriately by calling in to the cloud provider.

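As a concrete illustration of the intended behaviour, here is a minimal,
self-contained sketch of what `PersistentVolumeLabel` would do on a create
request. It deliberately does not use the real admission plugin API; the types,
the canned cloud lookup, and the volume names are stand-ins for illustration
only.

```go
package main

import "fmt"

// persistentVolume is an illustrative stand-in for the real PersistentVolume
// object; only the fields needed for the sketch are included.
type persistentVolume struct {
	Name     string
	VolumeID string
	Labels   map[string]string
}

// volumeLocation is the information we would ask the cloud provider for.
type volumeLocation struct {
	Zone   string
	Region string
}

// lookupVolumeLocation stands in for a cloud provider call (e.g. describing an
// EBS volume or a GCE PD) that reports where a volume lives. It returns a
// canned answer so the sketch is runnable.
func lookupVolumeLocation(volumeID string) volumeLocation {
	return volumeLocation{Zone: "us-east-1a", Region: "us-east-1"}
}

// admitPersistentVolume mimics the admission step: on create, label the volume
// with its zone and region so that the scheduler never has to call the cloud
// provider itself.
func admitPersistentVolume(pv *persistentVolume) {
	loc := lookupVolumeLocation(pv.VolumeID)
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	pv.Labels["failure-domain.alpha.kubernetes.io/zone"] = loc.Zone
	pv.Labels["failure-domain.alpha.kubernetes.io/region"] = loc.Region
}

func main() {
	pv := &persistentVolume{Name: "pv-1", VolumeID: "vol-0abc"}
	admitPersistentVolume(pv)
	fmt.Println(pv.Labels)
}
```

The same two reserved label keys are what the kubelet would set on its Node
object at registration time, as described in the node labeling section above.
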
## AWS Specific Considerations

The AWS implementation here is fairly straightforward. The AWS API is
region-wide, meaning that a single call will find instances and volumes in all
zones. In addition, instance ids and volume ids are unique per-region (and
hence also per-zone). I believe they are actually globally unique, but I do
not know if this is guaranteed; in any case we only need global uniqueness if
we are to span regions, which will not be supported by Ubernetes-Lite (to do
that correctly requires an Ubernetes-Full type approach).

## GCE Specific Considerations

The GCE implementation is more complicated than the AWS implementation because
GCE APIs are zone-scoped. To perform an operation, we must perform one REST
call per zone and combine the results, unless we can determine in advance that
an operation references a particular zone. For many operations we can make
that determination, but in some cases - such as listing all instances - we must
combine results from calls in all relevant zones.

A further complexity is that GCE volume names are scoped per-zone, not
per-region. Thus it is permitted to have two volumes both named `myvolume` in
two different GCE zones. (Instance names are currently unique per-region, and
thus are not a problem for Ubernetes-Lite.)

The volume scoping leads to a (small) behavioural change for Ubernetes-Lite on
GCE. If you had two volumes both named `myvolume` in two different GCE zones,
this would not be ambiguous when Kubernetes is operating only in a single zone.
But, if Ubernetes-Lite is operating in multiple zones, `myvolume` is no longer
sufficient to specify a volume uniquely. Worse, the fact that a volume happens
to be unambiguous at a particular time is no guarantee that it will continue to
be unambiguous in future, because a volume with the same name could
subsequently be created in a second zone. While perhaps unlikely in practice,
we cannot automatically enable Ubernetes-Lite for GCE users if this then causes
volume mounts to stop working.

This suggests that (at least on GCE) Ubernetes-Lite must be optional (i.e.
there must be a feature-flag). It may be that we can make this feature
semi-automatic in future, by detecting whether nodes are running in multiple
zones, but it seems likely that kube-up could instead simply set this flag.

For the initial implementation, creating volumes with identical names will
yield undefined results. Later, we may add some way to specify the zone for a
volume (and possibly require that volumes have their zone specified when
running with Ubernetes-Lite). We could add a new `zone` field to the
PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
name for the volume name.

Initially therefore, the GCE changes will be to:

1. change kube-up to support creation of a cluster in multiple zones
1. pass a flag enabling Ubernetes-Lite with kube-up
1. change the Kubernetes cloud provider to iterate through relevant zones when
resolving items (see the sketch below)
1. tag GCE PD volumes with the appropriate zone information

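To illustrate the zone-iteration pattern referenced in the list above, here is
a minimal, self-contained sketch: per-zone results are combined when a call
cannot be narrowed to a single zone, and a volume name found in more than one
zone is treated as ambiguous rather than silently resolved. The `fakeGCE` type,
its method names, and the zone names are stand-ins for illustration; this is
not the real GCE cloud provider code.

```go
package main

import "fmt"

// fakeGCE stands in for the zone-scoped GCE API: every call must name a zone.
type fakeGCE struct {
	instances map[string][]string // zone -> instance names
	disks     map[string][]string // zone -> disk names
}

func (f fakeGCE) listInstances(zone string) []string { return f.instances[zone] }

func (f fakeGCE) diskExists(zone, name string) bool {
	for _, d := range f.disks[zone] {
		if d == name {
			return true
		}
	}
	return false
}

// allInstances combines per-zone results into a region-wide view, which is
// what the cloud provider must do when a call cannot be narrowed to one zone.
func allInstances(gce fakeGCE, zones []string) []string {
	var out []string
	for _, z := range zones {
		out = append(out, gce.listInstances(z)...)
	}
	return out
}

// findDiskZone resolves a disk name across zones. Because GCE disk names are
// only unique per zone, a name found in more than one zone is ambiguous and is
// reported as an error rather than silently picking a zone.
func findDiskZone(gce fakeGCE, zones []string, name string) (string, error) {
	var found []string
	for _, z := range zones {
		if gce.diskExists(z, name) {
			found = append(found, z)
		}
	}
	switch len(found) {
	case 0:
		return "", fmt.Errorf("disk %q not found in any zone", name)
	case 1:
		return found[0], nil
	default:
		return "", fmt.Errorf("disk %q is ambiguous: exists in zones %v", name, found)
	}
}

func main() {
	zones := []string{"us-central1-a", "us-central1-b"}
	gce := fakeGCE{
		instances: map[string][]string{"us-central1-a": {"node-1"}, "us-central1-b": {"node-2"}},
		disks:     map[string][]string{"us-central1-a": {"myvolume"}, "us-central1-b": {"myvolume"}},
	}
	fmt.Println(allInstances(gce, zones)) // combined view across both zones
	_, err := findDiskZone(gce, zones, "myvolume")
	fmt.Println(err) // myvolume exists in both zones, so it is ambiguous
}
```

Whether the ambiguous case should error or pick a zone is exactly the behaviour
left undefined in the initial implementation; the sketch errors simply to make
the ambiguity visible.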