Merge pull request #19257 from justinsb/doc_ubernetes_lite
Auto commit by PR queue bot
commit cb80be10f7

docs/proposals/federation-lite.md (new file, 230 lines)

# Kubernetes Multi-AZ Clusters

## (a.k.a. "Ubernetes-Lite")

## Introduction

Full Ubernetes will offer sophisticated federation between multiple Kubernetes
clusters, offering true high-availability, multiple provider support &
cloud-bursting, multiple region support, etc. However, many users have
expressed a desire for a "reasonably" highly-available cluster that runs in
multiple zones on GCE or availability zones in AWS, and can tolerate the failure
of a single zone without the complexity of running multiple clusters.

Ubernetes-Lite aims to deliver exactly that functionality: to run a single
Kubernetes cluster in multiple zones. It will attempt to make reasonable
scheduling decisions, in particular so that a replication controller's pods are
spread across zones, and it will try to be aware of constraints - for example
that a volume cannot be mounted on a node in a different zone.

Ubernetes-Lite is deliberately limited in scope; for many advanced functions
the answer will be "use Ubernetes (full)". For example, multiple-region
support is not in scope. Routing affinity (e.g. so that a webserver will
prefer to talk to a backend service in the same zone) is similarly not in
scope.

## Design

These are the main requirements:

1. kube-up must allow bringing up a cluster that spans multiple zones.
1. pods in a replication controller should attempt to spread across zones.
1. pods which require volumes should not be scheduled onto nodes in a different zone.
1. load-balanced services should work reasonably.

### kube-up support

kube-up support for multiple zones will initially be considered
advanced/experimental functionality, so the interface is not initially going to
be particularly user-friendly. As we design the evolution of kube-up, we will
make multiple zones better supported.

For the initial implementation, kube-up must be run multiple times, once for
each zone. The first kube-up will take place as normal, but then for each
additional zone the user must run kube-up again, specifying
`KUBE_SHARE_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will then
create additional nodes in a different zone, but will register them with the
existing master.

### Zone spreading

This will be implemented by modifying the existing scheduler priority function
`SelectorSpread`. Currently this priority function aims to put pods in an RC
on different hosts, but it will be extended first to spread across zones, and
then to spread across hosts.

So that the scheduler does not need to call out to the cloud provider on every
scheduling decision, we must somehow record the zone information for each node.
The implementation of this will be described in the implementation section.

Note that zone spreading is 'best effort'; zones are just one of the factors
in making scheduling decisions, and thus it is not guaranteed that pods will
spread evenly across zones. However, this is likely desirable: if a zone is
overloaded or failing, we still want to schedule the requested number of pods.

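For illustration, here is a minimal Go sketch of the kind of zone-aware spreading the extended priority function could compute. The types and scoring weights are simplified stand-ins, not the actual `SelectorSpread` code; it only assumes nodes carry the zone label described in the implementation section.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the scheduler's view of a node;
// the real SelectorSpread priority operates on the Kubernetes API types.
type Node struct {
	Name   string
	Labels map[string]string // includes the zone label once nodes are labeled
}

const zoneLabel = "failure-domain.alpha.kubernetes.io/zone"

// zoneSpreadScore gives higher scores (0-10) to nodes in zones that currently
// run fewer pods from the same replication controller, so the RC's pods spread
// across zones first and across hosts within a zone second.
func zoneSpreadScore(node Node, podsPerZone, podsPerNode map[string]int, maxCount int) int {
	if maxCount == 0 {
		maxCount = 1
	}
	zone := node.Labels[zoneLabel]
	// Fewer matching pods in the node's zone => higher zone score.
	zoneScore := 10 * (maxCount - podsPerZone[zone]) / maxCount
	// Fewer matching pods on the node itself => higher node score.
	nodeScore := 10 * (maxCount - podsPerNode[node.Name]) / maxCount
	// Weight zone spreading more heavily than host spreading (illustrative ratio).
	return (2*zoneScore + nodeScore) / 3
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{zoneLabel: "us-central1-a"}},
		{Name: "node-b", Labels: map[string]string{zoneLabel: "us-central1-b"}},
	}
	podsPerZone := map[string]int{"us-central1-a": 3, "us-central1-b": 1}
	podsPerNode := map[string]int{"node-a": 2, "node-b": 1}
	for _, n := range nodes {
		fmt.Println(n.Name, zoneSpreadScore(n, podsPerZone, podsPerNode, 3))
	}
}
```
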
### Volume affinity

Most cloud providers (at least GCE and AWS) cannot attach their persistent
volumes across zones. Thus when a pod is being scheduled, if there is a volume
attached, that will dictate the zone. This will be implemented using a new
scheduler predicate (a hard constraint): `VolumeZonePredicate`.

When `VolumeZonePredicate` observes a pod scheduling request that includes a
volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any
nodes not in that zone.

Again, to avoid the scheduler calling out to the cloud provider, this will rely
on information attached to the volumes. This means that this will only support
PersistentVolumeClaims, because direct mounts do not have a place to attach
zone information. PersistentVolumes will then include zone information where
volumes are zone-specific.

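A minimal sketch of the predicate's logic, using simplified stand-in types rather than the real scheduler and API types; it assumes the pod's volumes and the node both carry the zone information described above.

```go
package main

import "fmt"

// Hypothetical, simplified types: the real predicate would read the zone label
// from the PersistentVolume bound to each PersistentVolumeClaim the pod uses.
type Volume struct {
	Name string
	Zone string // empty for volumes that are not zone-specific
}

type PodSpec struct {
	Volumes []Volume
}

const zoneLabel = "failure-domain.alpha.kubernetes.io/zone"

// volumeZonePredicate returns true if the node may run the pod: every
// zone-specific volume the pod uses must live in the node's zone.
func volumeZonePredicate(pod PodSpec, nodeLabels map[string]string) bool {
	nodeZone := nodeLabels[zoneLabel]
	for _, v := range pod.Volumes {
		if v.Zone != "" && v.Zone != nodeZone {
			return false // hard constraint: exclude nodes in other zones
		}
	}
	return true
}

func main() {
	pod := PodSpec{Volumes: []Volume{{Name: "myvolume", Zone: "us-central1-b"}}}
	nodeA := map[string]string{zoneLabel: "us-central1-a"}
	nodeB := map[string]string{zoneLabel: "us-central1-b"}
	fmt.Println(volumeZonePredicate(pod, nodeA), volumeZonePredicate(pod, nodeB))
}
```
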
### Load-balanced services should operate reasonably

For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each
service of type LoadBalancer. The native cloud load-balancers on both AWS &
GCE are region-level, and support load-balancing across instances in multiple
zones (in the same region). For both clouds, the behaviour of the native cloud
load-balancer is reasonable in the face of failures (indeed, this is why clouds
provide load-balancing as a primitive).

For Ubernetes-Lite we will therefore simply rely on the native cloud provider
load balancer behaviour, and we do not anticipate substantial code changes.

One notable shortcoming here is that load-balanced traffic still goes through
kube-proxy controlled routing, and kube-proxy does not (currently) favor
targeting a pod running on the same instance or even the same zone. This will
likely produce a lot of unnecessary cross-zone traffic (which is likely slower
and more expensive). This might be sufficiently low-hanging fruit that we
choose to address it in kube-proxy / Ubernetes-Lite, but this can be addressed
after the initial Ubernetes-Lite implementation.


## Implementation

The main implementation points are:

1. how to attach zone information to Nodes and PersistentVolumes
1. how nodes get zone information
1. how volumes get zone information

### Attaching zone information

We must attach zone information to Nodes and PersistentVolumes, and possibly to
other resources in future. There are two obvious alternatives: we can use
labels/annotations, or we can extend the schema to include the information.

For the initial implementation, we propose to use labels. The reasoning is:

1. It is considerably easier to implement.
1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and
`failure-domain.alpha.kubernetes.io/region` for the two pieces of information
we need. By putting this under the `kubernetes.io` namespace there is no risk
of collision, and by putting it under `alpha.kubernetes.io` we clearly mark
this as an experimental feature. (An illustrative sketch of these labels
follows this list.)
1. We do not yet know whether these labels will be sufficient for all
environments, nor which entities will require zone information. Labels give us
more flexibility here.
1. Because the labels are reserved, we can move to schema-defined fields in
future using our cross-version mapping techniques.

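As a small illustration (not code from this proposal), the reserved keys could be carried as constants and applied as ordinary labels; the zone and region values shown are GCE-style examples.

```go
package main

import "fmt"

// The two reserved label keys proposed above.
const (
	ZoneLabel   = "failure-domain.alpha.kubernetes.io/zone"
	RegionLabel = "failure-domain.alpha.kubernetes.io/region"
)

func main() {
	// A Node (or PersistentVolume) in GCE zone us-central1-a would carry:
	labels := map[string]string{
		ZoneLabel:   "us-central1-a",
		RegionLabel: "us-central1",
	}
	fmt.Println(labels)
}
```
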
### Node labeling

We do not want to require an administrator to manually label nodes. We instead
modify the kubelet to include the appropriate labels when it registers itself.
The information is easily obtained by the kubelet from the cloud provider.

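A hedged sketch of what that registration-time labeling could look like; the `zones` interface and `fakeGCE` type below are hypothetical stand-ins for the kubelet's configured cloud provider, not the real cloud provider API.

```go
package main

import "fmt"

// Hypothetical interface standing in for the cloud provider's zone lookup.
type zones interface {
	GetZone() (region, zone string, err error)
}

type fakeGCE struct{}

func (fakeGCE) GetZone() (string, string, error) { return "us-central1", "us-central1-a", nil }

// nodeLabels builds the labels the kubelet would attach to its Node object
// when it registers itself with the API server.
func nodeLabels(cloud zones) (map[string]string, error) {
	region, zone, err := cloud.GetZone()
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"failure-domain.alpha.kubernetes.io/region": region,
		"failure-domain.alpha.kubernetes.io/zone":   zone,
	}, nil
}

func main() {
	labels, _ := nodeLabels(fakeGCE{})
	fmt.Println(labels)
}
```
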
### Volume labeling

As with nodes, we do not want to require an administrator to manually label
volumes. We will create an admission controller `PersistentVolumeLabel`.
`PersistentVolumeLabel` will intercept requests to create PersistentVolumes,
and will label them appropriately by calling in to the cloud provider.

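A rough sketch of the admission-time flow, with simplified stand-in types (`PersistentVolume`, `volumeZoneLookup`) rather than the real API-server admission interfaces.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins; a real admission controller would work
// with the API server's admission attributes and the PersistentVolume API type.
type PersistentVolume struct {
	Name      string
	GCEPDName string // name of the backing GCE PD, if any
	Labels    map[string]string
}

type volumeZoneLookup interface {
	GetVolumeZone(volumeName string) (region, zone string, err error)
}

// admitPersistentVolume shows what a PersistentVolumeLabel-style controller
// would do on a create request: look up the volume's zone in the cloud
// provider and attach the zone/region labels before the object is stored.
func admitPersistentVolume(pv *PersistentVolume, cloud volumeZoneLookup) error {
	if pv.GCEPDName == "" {
		return nil // not a zone-specific cloud volume
	}
	region, zone, err := cloud.GetVolumeZone(pv.GCEPDName)
	if err != nil {
		return err
	}
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	pv.Labels["failure-domain.alpha.kubernetes.io/region"] = region
	pv.Labels["failure-domain.alpha.kubernetes.io/zone"] = zone
	return nil
}

type fakeCloud struct{}

func (fakeCloud) GetVolumeZone(string) (string, string, error) {
	return "us-central1", "us-central1-b", nil
}

func main() {
	pv := &PersistentVolume{Name: "pv-1", GCEPDName: "myvolume"}
	_ = admitPersistentVolume(pv, fakeCloud{})
	fmt.Println(pv.Labels)
}
```
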
## AWS Specific Considerations

The AWS implementation here is fairly straightforward. The AWS API is
region-wide, meaning that a single call will find instances and volumes in all
zones. In addition, instance ids and volume ids are unique per-region (and
hence also per-zone). I believe they are actually globally unique, but I do
not know if this is guaranteed; in any case we only need global uniqueness if
we are to span regions, which will not be supported by Ubernetes-Lite (to do
that correctly requires an Ubernetes-Full type approach).

## GCE Specific Considerations

The GCE implementation is more complicated than the AWS implementation because
GCE APIs are zone-scoped. To perform an operation, we must perform one REST
call per zone and combine the results, unless we can determine in advance that
an operation references a particular zone. For many operations we can make
that determination, but in some cases - such as listing all instances - we
must combine results from calls in all relevant zones.

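For example, a list operation that cannot be pinned to a single zone could be assembled like this (an illustrative helper, not the actual GCE cloud provider code; `listInZone` stands in for a zone-scoped GCE API call).

```go
package main

import "fmt"

// listAllInstances illustrates the pattern: issue one zone-scoped call per
// relevant zone and merge the results into a single region-wide view.
func listAllInstances(zones []string, listInZone func(zone string) ([]string, error)) ([]string, error) {
	var all []string
	for _, zone := range zones {
		instances, err := listInZone(zone)
		if err != nil {
			return nil, err
		}
		all = append(all, instances...)
	}
	return all, nil
}

func main() {
	// Stand-in for a per-zone GCE API call.
	fakeList := func(zone string) ([]string, error) {
		return []string{"node-1-" + zone, "node-2-" + zone}, nil
	}
	instances, _ := listAllInstances([]string{"us-central1-a", "us-central1-b"}, fakeList)
	fmt.Println(instances)
}
```
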
A further complexity is that GCE volume names are scoped per-zone, not
per-region. Thus it is permitted to have two volumes both named `myvolume` in
two different GCE zones. (Instance names are currently unique per-region, and
thus are not a problem for Ubernetes-Lite.)

The volume scoping leads to a (small) behavioural change for Ubernetes-Lite on
GCE. If you had two volumes both named `myvolume` in two different GCE zones,
this would not be ambiguous when Kubernetes is operating only in a single zone.
But if Ubernetes-Lite is operating in multiple zones, `myvolume` is no longer
sufficient to specify a volume uniquely. Worse, the fact that a volume happens
to be unambiguous at a particular time is no guarantee that it will continue to
be unambiguous in future, because a volume with the same name could
subsequently be created in a second zone. While perhaps unlikely in practice,
we cannot automatically enable Ubernetes-Lite for GCE users if this then causes
volume mounts to stop working.

This suggests that (at least on GCE) Ubernetes-Lite must be optional (i.e.
there must be a feature flag). It may be that we can make this feature
semi-automatic in future, by detecting whether nodes are running in multiple
zones, but it seems likely that kube-up could instead simply set this flag.

For the initial implementation, creating volumes with identical names will
yield undefined results. Later, we may add some way to specify the zone for a
volume (and possibly require that volumes have their zone specified when
running with Ubernetes-Lite). We could add a new `zone` field to the
PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
name for the volume name (`<name>.<zone>`).

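If the dotted-name option were chosen (it is only one of the alternatives above, not a decision of this proposal), the parsing could be as simple as the following sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// splitVolumeName illustrates the DNS-style dotted-name option discussed
// above: "myvolume.us-central1-b" -> ("myvolume", "us-central1-b").
// A name with no dot is treated as having no explicit zone.
func splitVolumeName(name string) (volume, zone string) {
	if i := strings.Index(name, "."); i >= 0 {
		return name[:i], name[i+1:]
	}
	return name, ""
}

func main() {
	fmt.Println(splitVolumeName("myvolume.us-central1-b"))
	fmt.Println(splitVolumeName("myvolume"))
}
```
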
Initially, therefore, the GCE changes will be to:

1. change kube-up to support creation of a cluster in multiple zones
1. pass a flag enabling Ubernetes-Lite with kube-up
1. change the Kubernetes cloud provider to iterate through relevant zones when resolving items
1. tag GCE PD volumes with the appropriate zone information