# Kubernetes Cluster Federation

## (a.k.a. "Ubernetes")

## Requirements Analysis and Product Proposal

## _by Quinton Hoole ([quinton@google.com](mailto:quinton@google.com))_

_Initial revision: 2015-03-05_
_Last updated: 2015-08-20_
This doc: [tinyurl.com/ubernetesv2](http://tinyurl.com/ubernetesv2)
Original slides: [tinyurl.com/ubernetes-slides](http://tinyurl.com/ubernetes-slides)
Updated slides: [tinyurl.com/ubernetes-whereto](http://tinyurl.com/ubernetes-whereto)

## Introduction

Today, each Kubernetes cluster is a relatively self-contained unit,
which typically runs in a single "on-premise" data centre or single
availability zone of a cloud provider (Google's GCE, Amazon's AWS,
etc).

Several current and potential Kubernetes users and customers have
expressed a keen interest in tying together ("federating") multiple
clusters in some sensible way in order to enable the following kinds
of use cases (intentionally vague):

1. _"Preferentially run my workloads in my on-premise cluster(s), but
   automatically overflow to my cloud-hosted cluster(s) if I run out
   of on-premise capacity"_.
1. _"Most of my workloads should run in my preferred cloud-hosted
   cluster(s), but some are privacy-sensitive, and should be
   automatically diverted to run in my secure, on-premise
   cluster(s)"_.
1. _"I want to avoid vendor lock-in, so I want my workloads to run
   across multiple cloud providers all the time.  I change my set of
   such cloud providers, and my pricing contracts with them,
   periodically"_.
1. _"I want to be immune to any single data centre or cloud
   availability zone outage, so I want to spread my service across
   multiple such zones (and ideally even across multiple cloud
   providers)."_

The above use cases are by necessity left imprecisely defined.  The
rest of this document explores these use cases and their implications
in further detail, and compares a few alternative high level
approaches to addressing them.  The idea of cluster federation has
informally become known as _"Ubernetes"_.

## Summary/TL;DR

Four primary customer-driven use cases are explored in more detail.
The two highest priority ones relate to High Availability and
Application Portability (between cloud providers, and between
on-premise and cloud providers).

Four primary federation primitives are identified (location affinity,
cross-cluster scheduling, service discovery and application
migration).  Fortunately not all four of these primitives are required
for each primary use case, so incremental development is feasible.

## What exactly is a Kubernetes Cluster?

A central design concept in Kubernetes is that of a _cluster_. While
loosely speaking, a cluster can be thought of as running in a single
data center, or cloud provider availability zone, a more precise
definition is that each cluster provides:

1. a single Kubernetes API entry point,
1. a consistent, cluster-wide resource naming scheme,
1. a scheduling/container placement domain,
1. a service network routing domain, and
1. an authentication and authorization model.

The above in turn imply the need for a relatively performant, reliable
and cheap network within each cluster.

There is also assumed to be some degree of failure correlation across
a cluster, i.e. whole clusters are expected to fail, at least
occasionally (due to cluster-wide power and network failures, natural
disasters etc). Clusters are often relatively homogeneous in that all
compute nodes are typically provided by a single cloud provider or
hardware vendor, and connected by a common, unified network fabric.
But these are not hard requirements of Kubernetes.

Other classes of Kubernetes deployments than the one sketched above
are technically feasible, but come with some challenges of their own,
and are not yet common or explicitly supported.

More specifically, having a Kubernetes cluster span multiple
well-connected availability zones within a single geographical region
(e.g. US North East, UK, Japan etc) is worthy of further
consideration, in particular because it potentially addresses
some of these requirements.

## What use cases require Cluster Federation?

Let's name a few concrete use cases to aid the discussion:

## 1. Capacity Overflow

_"I want to preferentially run my workloads in my on-premise cluster(s), but automatically "overflow" to my cloud-hosted cluster(s) when I run out of on-premise capacity."_

This idea is known in some circles as "[cloudbursting](http://searchcloudcomputing.techtarget.com/definition/cloud-bursting)".

**Clarifying questions:** What is the unit of overflow?  Individual
  pods? Probably not always.  Replication controllers and their
  associated sets of pods?  Groups of replication controllers
  (a.k.a. distributed applications)?  How are persistent disks
  overflowed?  Can the "overflowed" pods communicate with their
  brethren and sistren pods and services in the other cluster(s)?
  Presumably yes, at higher cost and latency, provided that they use
  external service discovery. Is "overflow" enabled only when creating
  new workloads/replication controllers, or are existing workloads
  dynamically migrated between clusters based on fluctuating available
  capacity?  If so, what is the desired behaviour, and how is it
  achieved?  How, if at all, does this relate to quota enforcement
  (e.g. if we run out of on-premise capacity, can all or only some
  quotas transfer to other, potentially more expensive off-premise
  capacity?)

It seems that most of this boils down to:

1. **location affinity** (pods relative to each other, and to other
   stateful services like persistent storage - how is this expressed
   and enforced?)
1. **cross-cluster scheduling** (given location affinity constraints
   and other scheduling policy, which resources are assigned to which
   clusters, and by what?)
1. **cross-cluster service discovery** (how do pods in one cluster
   discover and communicate with pods in another cluster?)
1. **cross-cluster migration** (how do compute and storage resources,
   and the distributed applications to which they belong, move from
   one cluster to another?)
1. **cross-cluster load-balancing** (how is user traffic directed
   to an appropriate cluster?)
1. **cross-cluster monitoring and auditing** (a.k.a. Unified Visibility)

## 2. Sensitive Workloads

_"I want most of my workloads to run in my preferred cloud-hosted
cluster(s), but some are privacy-sensitive, and should be
automatically diverted to run in my secure, on-premise cluster(s). The
list of privacy-sensitive workloads changes over time, and they're
subject to external auditing."_

**Clarifying questions:**

1. What kinds of rules determine which workloads go where?
   1. Is there in fact a requirement to have these rules be
      declaratively expressed and automatically enforced, or is it
      acceptable/better to have users manually select where to run
      their workloads when starting them?
   1. Is a static mapping from container (or more typically,
      replication controller) to cluster maintained and enforced?
   1. If so, is it only enforced on startup, or are things migrated
      between clusters when the mappings change?

This starts to look quite similar to "1. Capacity Overflow", and again
seems to boil down to:

1. location affinity
1. cross-cluster scheduling
1. cross-cluster service discovery
1. cross-cluster migration
1. cross-cluster monitoring and auditing
1. cross-cluster load balancing

## 3. Vendor lock-in avoidance

_"My CTO wants us to avoid vendor lock-in, so she wants our workloads
to run across multiple cloud providers at all times.  She changes our
set of preferred cloud providers and pricing contracts with them
periodically, and doesn't want to have to communicate and manually
enforce these policy changes across the organization every time this
happens.  She wants it centrally and automatically enforced, monitored
and audited."_

**Clarifying questions:**

1. How does this relate to the other use cases (high availability,
   capacity overflow etc), which may themselves span multiple vendors?
   It's probably not strictly speaking a separate use case, but it's
   brought up so often as a requirement that it's worth calling out
   explicitly.
1. Is a useful intermediate step to make it as simple as possible to
   migrate an application from one vendor to another in a one-off fashion?

Again, I think that this can probably be reformulated as a Capacity
Overflow problem - the fundamental principles seem to be the same or
substantially similar to those above.

## 4. "High Availability"
 | 
						|
 | 
						|
_"I want to be immune to any single data centre or cloud availability
 | 
						|
zone outage, so I want to spread my service across multiple such zones
 | 
						|
(and ideally even across multiple cloud providers), and have my
 | 
						|
service remain available even if one of the availability zones or
 | 
						|
cloud providers "goes down"_.
 | 
						|
 | 
						|
It seems useful to split this into multiple sets of sub use cases:
 | 
						|
 | 
						|
1. Multiple availability zones within a single cloud provider (across
 | 
						|
   which feature sets like private networks, load balancing,
 | 
						|
   persistent disks, data snapshots etc are typically consistent and
 | 
						|
   explicitly designed to inter-operate).
 | 
						|
   1. within the same geographical region (e.g. metro) within which network
 | 
						|
   is fast and cheap enough to be almost analogous to a single data
 | 
						|
   center.
 | 
						|
   1. across multiple geographical regions, where high network cost and
 | 
						|
   poor network performance may be prohibitive.
 | 
						|
1. Multiple cloud providers (typically with inconsistent feature sets,
 | 
						|
   more limited interoperability, and typically no cheap inter-cluster
 | 
						|
   networking described above).
 | 
						|
 | 
						|
The single cloud provider case might be easier to implement (although
 | 
						|
the multi-cloud provider implementation should just work for a single
 | 
						|
cloud provider).  Propose high-level design catering for both, with
 | 
						|
initial implementation targeting single cloud provider only.
 | 
						|
 | 
						|
**Clarifying questions:**

**How does global external service discovery work?** In the steady
  state, which external clients connect to which clusters?  GeoDNS or
  similar?  What is the tolerable failover latency if a cluster goes
  down?  Maybe something like (to make up some numbers, notwithstanding
  some buggy DNS resolvers, TTLs, caches etc) ~3 minutes for ~90% of
  clients to re-issue DNS lookups and reconnect to a new cluster when
  their home cluster fails is good enough for most Kubernetes users
  (or at least way better than the status quo), given that these sorts
  of failure only happen a small number of times a year?

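To sanity-check that "~3 minutes for ~90% of clients" figure, here is a
back-of-the-envelope sketch in Go.  It assumes, simplistically, that
well-behaved client resolvers cached the DNS record at uniformly random
times and therefore expire uniformly within one TTL of the failure;
buggy resolvers and caches that ignore TTLs are not modelled.

```go
// Back-of-the-envelope model of DNS-based failover latency, assuming
// client-side caches expire uniformly within one TTL of the failure and
// clients re-resolve and reconnect as soon as their cache expires.
package main

import "fmt"

// fractionRecovered returns the fraction of clients that have re-resolved
// DNS (and can therefore reconnect to a surviving cluster) t seconds after
// their home cluster fails, for a record with the given TTL in seconds.
func fractionRecovered(t, ttl float64) float64 {
	if t >= ttl {
		return 1.0
	}
	return t / ttl
}

func main() {
	const ttl = 180.0 // a 3-minute TTL, chosen purely for illustration
	for _, t := range []float64{30, 60, 120, 162, 180} {
		fmt.Printf("after %3.0fs: ~%2.0f%% of clients recovered\n",
			t, 100*fractionRecovered(t, ttl))
	}
	// With a 180s TTL, ~90% of well-behaved clients have recovered after
	// ~162s, i.e. roughly the "~3 minutes for ~90% of clients" figure above.
}
```
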
**How does dynamic load balancing across clusters work, if at all?**
  One simple starting point might be "it doesn't".  i.e. if a service
  in a cluster is deemed to be "up", it receives as much traffic as is
  generated "nearby" (even if it overloads).  If the service is deemed
  to "be down" in a given cluster, "all" nearby traffic is redirected
  to some other cluster within some number of seconds (failover could
  be automatic or manual).  Failover is essentially binary.  An
  improvement would be to detect when a service in a cluster reaches
  maximum serving capacity, and dynamically divert additional traffic
  to other clusters.  But how exactly does all of this work, and how
  much of it is provided by Kubernetes, as opposed to something else
  bolted on top (e.g. external monitoring and manipulation of GeoDNS)?

**How does this tie in with auto-scaling of services?** More
  specifically, if I run my service across _n_ clusters globally, and
  one (or more) of them fails, how do I ensure that the remaining _n-1_
  clusters have enough capacity to serve the additional, failed-over
  traffic?  Either:

1. I constantly over-provision all clusters by 1/(n-1) (potentially
   expensive; see the arithmetic sketch below), or
1. I "manually" (or automatically) update my replica count configurations in the
   remaining clusters by 1/(n-1) when the failure occurs, and Kubernetes
   takes care of the rest for me, or
1. Auto-scaling in the remaining clusters takes
   care of it for me automagically as the additional failed-over
   traffic arrives (with some latency).  Note that this implies that
   the cloud provider keeps the necessary resources on hand to
   accommodate such auto-scaling (e.g. via something similar to AWS reserved
   and spot instances).

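The headroom arithmetic behind options 1 and 2 is worth making concrete
(plain arithmetic only, nothing Kubernetes-specific): with _n_ clusters
each serving 1/n of total traffic, the survivors of a single cluster
failure must each absorb an extra 1/(n-1) of their normal load.

```go
// Minimal arithmetic sketch of the over-provisioning needed to survive the
// loss of one cluster out of n, assuming traffic is spread evenly.
package main

import "fmt"

func main() {
	for _, n := range []int{2, 3, 5, 10} {
		normal := 1.0 / float64(n)          // fraction of total traffic per cluster, steady state
		afterFailure := 1.0 / float64(n-1)  // fraction each survivor carries after one failure
		headroom := afterFailure/normal - 1 // extra capacity needed, relative to normal load
		fmt.Printf("n=%2d: steady state %.1f%%, after failure %.1f%%, headroom needed %.0f%%\n",
			n, 100*normal, 100*afterFailure, 100*headroom)
	}
	// e.g. with n=3 clusters, each normally serves ~33% of traffic; if one
	// fails, the remaining two must each serve 50%, i.e. 50% (1/(n-1)) more
	// than their normal load.
}
```
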
Up to this point, this use case ("Unavailability Zones") seems
materially different from all the others above.  It does not require
dynamic cross-cluster service migration (we assume that the service is
already running in more than one cluster when the failure occurs).
Nor does it necessarily involve cross-cluster service discovery or
location affinity.  As a result, I propose that we address this use
case somewhat independently of the others (although I strongly suspect
that it will become substantially easier once we've solved the others).

All of the above (regarding "Unavailability Zones") refers primarily
to already-running user-facing services, and minimizing the impact on
end users of those services becoming unavailable in a given cluster.
What about the people and systems that deploy Kubernetes services
(devops etc)?  Should they be automatically shielded from the impact
of the cluster outage, i.e. have their new resource creation requests
automatically diverted to another cluster during the outage?  While
this specific requirement seems non-critical (manual fail-over seems
relatively non-arduous, ignoring the user-facing issues above), it
smells a lot like the first three use cases listed above ("Capacity
Overflow, Sensitive Workloads, Vendor lock-in..."), so if we address
those, we probably get this one free of charge.

## Core Challenges of Cluster Federation

A few common challenges fall out of most of the use cases considered
above, namely:

## Location Affinity

Can the pods comprising a single distributed application be
partitioned across more than one cluster? More generally, how far
apart, in network terms, can a given client and server within a
distributed application reasonably be?  A server need not necessarily
be a pod, but could instead be a persistent disk housing data, or some
other stateful network service.  What is tolerable is typically
application-dependent, primarily influenced by network bandwidth
consumption, latency requirements and cost sensitivity.

For simplicity, let's assume that all Kubernetes distributed
applications fall into one of three categories with respect to relative
location affinity:

1. **"Strictly Coupled"**: Those applications that strictly cannot be
 | 
						|
   partitioned between clusters.  They simply fail if they are
 | 
						|
   partitioned.  When scheduled, all pods _must_ be scheduled to the
 | 
						|
   same cluster.  To move them, we need to shut the whole distributed
 | 
						|
   application down (all pods) in one cluster, possibly move some
 | 
						|
   data, and then bring the up all of the pods in another cluster.  To
 | 
						|
   avoid downtime, we might bring up the replacement cluster and
 | 
						|
   divert traffic there before turning down the original, but the
 | 
						|
   principle is much the same.  In some cases moving the data might be
 | 
						|
   prohibitively expensive or time-consuming, in which case these
 | 
						|
   applications may be effectively _immovable_.
 | 
						|
1. **"Strictly Decoupled"**: Those applications that can be
 | 
						|
   indefinitely partitioned across more than one cluster, to no
 | 
						|
   disadvantage.  An embarrassingly parallel YouTube porn detector,
 | 
						|
   where each pod repeatedly dequeues a video URL from a remote work
 | 
						|
   queue, downloads and chews on the video for a few hours, and
 | 
						|
   arrives at a binary verdict, might be one such example. The pods
 | 
						|
   derive no benefit from being close to each other, or anything else
 | 
						|
   (other than the source of YouTube videos, which is assumed to be
 | 
						|
   equally remote from all clusters in this example).  Each pod can be
 | 
						|
   scheduled independently, in any cluster, and moved at any time.
 | 
						|
1. **"Preferentially Coupled"**: Somewhere between Coupled and
 | 
						|
   Decoupled.  These applications prefer to have all of their pods
 | 
						|
   located in the same cluster (e.g. for failure correlation, network
 | 
						|
   latency or bandwidth cost reasons), but can tolerate being
 | 
						|
   partitioned for "short" periods of time (for example while
 | 
						|
   migrating the application from one cluster to another). Most small
 | 
						|
   to medium sized LAMP stacks with not-very-strict latency goals
 | 
						|
   probably fall into this category (provided that they use sane
 | 
						|
   service discovery and reconnect-on-fail, which they need to do
 | 
						|
   anyway to run effectively, even in a single Kubernetes cluster).
 | 
						|
 | 
						|
From a fault isolation point of view, there are also opposites of the
above.  For example a master database and its slave replica might
need to be in different availability zones.  We'll refer to this as
anti-affinity, although it is largely outside the scope of this
document.

Note that there is somewhat of a continuum with respect to network
cost and quality between any two nodes, ranging from two nodes on the
same L2 network segment (lowest latency and cost, highest bandwidth)
to two nodes on different continents (highest latency and cost, lowest
bandwidth). One interesting point on that continuum relates to
multiple availability zones within a well-connected metro or region
and single cloud provider.  Despite being in different data centers,
or areas within a mega data center, the network in this case is often
very fast and effectively free or very cheap. For the purposes of this
network location affinity discussion, this case is considered
analogous to a single availability zone. Furthermore, if a given
application doesn't fit cleanly into one of the above, shoe-horn it
into the best fit, defaulting to the "Strictly Coupled and Immovable"
bucket if you're not sure.

And then there's what I'll call _absolute_ location affinity.  Some
applications are required to run in bounded geographical or network
topology locations.  The reasons for this are typically
political/legislative (data privacy laws etc), or driven by network
proximity to consumers (or data providers) of the application ("most
of our users are in Western Europe, U.S. West Coast" etc).

**Proposal:** First tackle Strictly Decoupled applications (which can
  be trivially scheduled, partitioned or moved, one pod at a time).
  Then tackle Preferentially Coupled applications (which must be
  scheduled in totality in a single cluster, and can be moved, but
  only in their entirety, and necessarily within some bounded time).
  Leave Strictly Coupled applications to be manually moved between
  clusters as required for the foreseeable future.

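To make the taxonomy above concrete, here is a minimal Go sketch of how
the three relative location affinity classes might be represented and
attached to an application.  The type, the annotation key and the
helper function are purely illustrative assumptions; none of them exist
in the Kubernetes API today.

```go
// A purely illustrative sketch (not an existing Kubernetes API) of the three
// relative location affinity classes discussed above.
package main

import "fmt"

type LocationAffinity string

const (
	StrictlyCoupled       LocationAffinity = "strictly-coupled"       // all pods must run in one cluster
	PreferentiallyCoupled LocationAffinity = "preferentially-coupled" // prefer one cluster; tolerate brief partitioning
	StrictlyDecoupled     LocationAffinity = "strictly-decoupled"     // pods may be spread across clusters freely
)

// MaySpanClusters reports whether a steady-state placement is allowed to
// span more than one cluster for the given affinity class.
func MaySpanClusters(a LocationAffinity) bool {
	return a == StrictlyDecoupled
}

func main() {
	// A hypothetical annotation on an application's replication controller.
	affinity := PreferentiallyCoupled
	fmt.Printf("federation.example/location-affinity: %s (may span clusters: %v)\n",
		affinity, MaySpanClusters(affinity))
}
```
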
## Cross-cluster service discovery

I propose having pods use standard discovery methods used by external
clients of Kubernetes applications (i.e. DNS).  DNS might resolve to a
public endpoint in the local or a remote cluster. Other than for Strictly
Coupled applications, software should be largely oblivious of which of
the two occurs.

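As a rough illustration of what this means for application code, the
following Go sketch shows a client that discovers a federated service
purely via DNS.  The hostname `my-service.example.com` is hypothetical;
the point is simply that the client neither knows nor cares whether the
address returned is in its own cluster or a remote one.

```go
// Minimal sketch of DNS-based cross-cluster discovery from a pod's point of
// view. The service name below is a hypothetical externally-resolvable name,
// not a record that Kubernetes provides today.
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	const host = "my-service.example.com" // hypothetical federated service name

	// The pod simply resolves the name; DNS may return a public endpoint in
	// the local cluster or in a remote one.
	addrs, err := net.LookupHost(host)
	if err != nil {
		log.Fatalf("lookup %s: %v", host, err)
	}
	fmt.Println("resolved to:", addrs)

	// Connect to whichever endpoint was returned; on failure, a well-behaved
	// client re-resolves and retries, which is also what makes DNS-based
	// cross-cluster failover work.
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(addrs[0], "443"), 5*time.Second)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```
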
_Aside:_ How do we avoid "tromboning" through an external VIP when DNS
resolves to a public IP on the local cluster?  Strictly speaking this
would be an optimization for some cases, and probably only matters to
high-bandwidth, low-latency communications.  We could potentially
eliminate the trombone with some kube-proxy magic if necessary. More
detail to be added here, but feel free to shoot down the basic DNS
idea in the meantime.  In addition, some applications rely on private
networking between clusters for security (e.g. AWS VPC or more
generally VPN).  It should not be necessary to forsake this in
order to use Ubernetes, for example by being forced to use public
connectivity between clusters.

## Cross-cluster Scheduling

This is closely related to location affinity above, and also discussed
there.  The basic idea is that some controller, logically outside of
the basic Kubernetes control plane of the clusters in question, needs
to be able to:

1. Receive "global" resource creation requests.
 | 
						|
1. Make policy-based decisions as to which cluster(s) should be used
 | 
						|
   to fulfill each given resource request. In a simple case, the
 | 
						|
   request is just redirected to one cluster.  In a more complex case,
 | 
						|
   the request is "demultiplexed" into multiple sub-requests, each to
 | 
						|
   a different cluster. Knowledge of the (albeit approximate)
 | 
						|
   available capacity in each cluster will be required by the
 | 
						|
   controller to sanely split the request.  Similarly, knowledge of
 | 
						|
   the properties of the application (Location Affinity class --
 | 
						|
   Strictly Coupled, Strictly Decoupled etc, privacy class etc) will
 | 
						|
   be required.  It is also conceivable that knowledge of service
 | 
						|
   SLAs and monitoring thereof might provide an input into
 | 
						|
   scheduling/placement algorithms.
 | 
						|
1. Multiplex the responses from the individual clusters into an
 | 
						|
   aggregate response.
 | 
						|
 | 
						|
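Here is a minimal sketch of that policy decision step, under assumed
types (nothing here corresponds to a real Ubernetes or Kubernetes API):
given a requested replica count, whether the application is coupled
(must land in a single cluster) or decoupled (may be spread), and
approximate spare capacity per cluster, produce a per-cluster split of
the request.

```go
// Illustrative sketch of splitting a "global" replica request across
// clusters, based on the application's coupling and approximate spare
// capacity per cluster.
package main

import "fmt"

type Cluster struct {
	Name          string
	SpareReplicas int // approximate spare capacity, in replicas of this pod size
}

// place returns replica counts per cluster. Coupled applications go to a
// single cluster with enough room; decoupled ones are spread greedily.
func place(replicas int, coupled bool, clusters []Cluster) (map[string]int, error) {
	out := map[string]int{}
	if coupled {
		for _, c := range clusters {
			if c.SpareReplicas >= replicas {
				out[c.Name] = replicas
				return out, nil
			}
		}
		return nil, fmt.Errorf("no single cluster can hold %d replicas", replicas)
	}
	remaining := replicas
	for _, c := range clusters {
		if remaining == 0 {
			break
		}
		n := c.SpareReplicas
		if n > remaining {
			n = remaining
		}
		if n > 0 {
			out[c.Name] = n
			remaining -= n
		}
	}
	if remaining > 0 {
		return nil, fmt.Errorf("insufficient aggregate capacity: %d replicas unplaced", remaining)
	}
	return out, nil
}

func main() {
	clusters := []Cluster{{"on-prem", 4}, {"cloud-a", 10}}
	fmt.Println(place(6, true, clusters))  // must fit in one cluster -> all 6 to cloud-a
	fmt.Println(place(6, false, clusters)) // may be split -> on-prem: 4, cloud-a: 2
}
```
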
There is of course a lot of detail still missing from this section,
including discussion of:

1. admission control
1. initial placement of instances of a new service vs scheduling new
   instances of an existing service in response to auto-scaling
1. rescheduling pods due to failure (the response might be different
   depending on whether it's the failure of a node, a rack, or a whole AZ)
1. data placement relative to compute capacity, etc.

## Cross-cluster Migration

Again this is closely related to location affinity discussed above,
and is in some sense an extension of Cross-cluster Scheduling. When
certain events occur, it becomes necessary or desirable for the
cluster federation system to proactively move distributed applications
(either in part or in whole) from one cluster to another. Examples of
such events include:

1. A low capacity event in a cluster (or a cluster failure).
1. A change of scheduling policy ("we no longer use cloud provider X").
1. A change of resource pricing ("cloud provider Y dropped their
   prices - let's migrate there").

Strictly Decoupled applications can be trivially moved, in part or in
whole, one pod at a time, to one or more clusters (within applicable
policy constraints, for example "PrivateCloudOnly").

For Preferentially Coupled applications, the federation system must
first locate a single cluster with sufficient capacity to accommodate
the entire application, then reserve that capacity, and incrementally
move the application, one (or more) resources at a time, over to the
new cluster, within some bounded time period (and possibly within a
predefined "maintenance" window).  Strictly Coupled applications (with
the exception of those deemed completely immovable) require the
federation system to do the following (sketched in code below):

1. start up an entire replica application in the destination cluster
1. copy persistent data to the new application instance (possibly
   before starting pods)
1. switch user traffic across
1. tear down the original application instance

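A minimal sketch of that four-step sequence follows, with the
per-cluster operations reduced to hypothetical stubs (none of them
correspond to a real Kubernetes client library); the point is only the
ordering and the abort-on-first-failure behaviour.

```go
// Purely illustrative sketch of the four-step migration of a Strictly
// Coupled application; the per-cluster operations are hypothetical stubs,
// not calls into any real Kubernetes client library.
package main

import "fmt"

// cluster stands in for whatever per-cluster client the federation control
// plane would use.
type cluster struct{ name string }

func (c cluster) createApplication(app string) error {
	fmt.Printf("[%s] start replica of %s\n", c.name, app)
	return nil
}

func (c cluster) copyDataFrom(src cluster, app string) error {
	fmt.Printf("[%s] copy %s data from %s\n", c.name, app, src.name)
	return nil
}

func (c cluster) switchTrafficTo(dst cluster, app string) error {
	fmt.Printf("[%s] divert %s traffic to %s\n", c.name, app, dst.name)
	return nil
}

func (c cluster) tearDownApplication(app string) error {
	fmt.Printf("[%s] tear down %s\n", c.name, app)
	return nil
}

// migrateStrictlyCoupled performs the four steps in order, aborting on the
// first failure (leaving any partially created destination state for a
// higher-level controller or operator to clean up).
func migrateStrictlyCoupled(app string, src, dst cluster) error {
	if err := dst.createApplication(app); err != nil { // 1. replica app in destination
		return err
	}
	if err := dst.copyDataFrom(src, app); err != nil { // 2. copy persistent data
		return err
	}
	if err := src.switchTrafficTo(dst, app); err != nil { // 3. switch user traffic
		return err
	}
	return src.tearDownApplication(app) // 4. tear down the original
}

func main() {
	_ = migrateStrictlyCoupled("billing-db", cluster{"on-prem"}, cluster{"cloud-a"})
}
```
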
It is proposed that support for automated migration of Strictly
Coupled applications be deferred to a later date.

## Other Requirements

These are often left implicit by customers, but are worth calling out explicitly:

1. Software failure isolation between Kubernetes clusters should be
   retained as far as is practically possible.  The federation system
   should not materially increase the failure correlation across
   clusters.  For this reason the federation control plane software
   should ideally be completely independent of the Kubernetes cluster
   control software, and look just like any other Kubernetes API
   client, with no special treatment.  If the federation control plane
   software fails catastrophically, the underlying Kubernetes clusters
   should remain independently usable.
1. Unified monitoring, alerting and auditing across federated Kubernetes clusters.
1. Unified authentication, authorization and quota management across
   clusters (this is in direct conflict with failure isolation above,
   so there are some tough trade-offs to be made here).

## Proposed High-Level Architectures

Two distinct potential architectural approaches have emerged from discussions
thus far:

1. An explicitly decoupled and hierarchical architecture, where the
   Federation Control Plane sits logically above a set of independent
   Kubernetes clusters, each of which is (potentially) unaware of the
   other clusters, and of the Federation Control Plane itself (other
   than to the extent that it is an API client much like any other).
   One possible example of this general architecture is illustrated
   below, and will be referred to as the "Decoupled, Hierarchical"
   approach.
1. A more monolithic architecture, where a single instance of the
   Kubernetes control plane itself manages a single logical cluster
   composed of nodes in multiple availability zones and cloud
   providers.

A very brief, non-exhaustive list of pros and cons of the two
approaches follows.  (In the interest of full disclosure, the author
prefers the Decoupled Hierarchical model for the reasons stated below.)

1. **Failure isolation:** The Decoupled Hierarchical approach provides
   better failure isolation than the Monolithic approach, as each
   underlying Kubernetes cluster, and the Federation Control Plane,
   can operate and fail completely independently of each other.  In
   particular, their software and configurations can be updated
   independently. Such updates are, in our experience, the primary
   cause of control-plane failures, in general.
1. **Failure probability:** The Decoupled Hierarchical model incorporates
   numerically more independent pieces of software and configuration
   than the Monolithic one. But the complexity of each of these
   decoupled pieces is arguably better contained in the Decoupled
   model (per standard arguments for modular rather than monolithic
   software design).  Which of the two models presents higher
   aggregate complexity and consequent failure probability remains
   somewhat of an open question.
1. **Scalability:** Conceptually the Decoupled Hierarchical model wins
   here, as each underlying Kubernetes cluster can be scaled
   completely independently w.r.t. scheduling, node state management,
   monitoring, network connectivity etc. It is even potentially
   feasible to stack "Ubernetes" federated clusters (i.e. create
   federations of federations) should scalability of the independent
   Federation Control Plane become an issue (although the author does
   not envision this being a problem worth solving in the short
   term).
1. **Code complexity:** I think that an argument can be made both ways
   here. It depends on whether you prefer to weave the logic for
   handling nodes in multiple availability zones and cloud providers
   within a single logical cluster into the existing Kubernetes
   control plane code base (which was explicitly not designed for
   this), or separate it into a decoupled Federation system (with
   possible code sharing between the two via shared libraries).  The
   author prefers the latter because it:
   1. Promotes better code modularity and interface design.
   1. Allows the code bases of Kubernetes and the Federation system to
      progress largely independently (different sets of developers,
      different release schedules etc).
1. **Administration complexity:** Again, I think that this could be argued
   both ways.  Superficially it would seem that administration of a
   single Monolithic multi-zone cluster might be simpler by virtue of
   being only "one thing to manage"; however, in practice each of the
   underlying availability zones (and possibly cloud providers) has
   its own capacity, pricing, hardware platforms, and possibly
   bureaucratic boundaries (e.g. "our EMEA IT department manages those
   European clusters").  So explicitly allowing for (but not
   mandating) completely independent administration of each
   underlying Kubernetes cluster, and the Federation system itself,
   in the Decoupled Hierarchical model seems to have real practical
   benefits that outweigh the superficial simplicity of the
   Monolithic model.
1. **Application development and deployment complexity:** It's not clear
   to me that there is any significant difference between the two
   models in this regard.  Presumably the API exposed by the two
   different architectures would look very similar, as would the
   behavior of the deployed applications.  It has even been suggested
   to write the code in such a way that it could be run in either
   configuration.  It's not clear that this makes sense in practice
   though.
1. **Control plane cost overhead:** There is a minimum per-cluster
   overhead -- two possibly virtual machines, or more for redundant HA
   deployments.  For deployments of very small Kubernetes
   clusters with the Decoupled Hierarchical approach, this cost can
   become significant.

### The Decoupled, Hierarchical Approach - Illustrated

_(Figure: the Federation Control Plane sitting logically above a set of
independent, mutually unaware Kubernetes clusters.)_

## Ubernetes API

It is proposed that this look a lot like the existing Kubernetes API
but be explicitly multi-cluster.

+ Clusters become first class objects, which can be registered,
  listed, described, deregistered etc via the API.
+ Compute resources can be explicitly requested in specific clusters,
  or automatically scheduled to the "best" cluster by Ubernetes (by a
  pluggable Policy Engine).
+ There is a federated equivalent of a replication controller type (or
  perhaps a [deployment](deployment.md)),
  which is multicluster-aware, and delegates to cluster-specific
  replication controllers/deployments as required (e.g. a federated RC for n
  replicas might simply spawn multiple replication controllers in
  different clusters to do the hard work; see the sketch below).

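A purely hypothetical sketch of what those objects might look like as Go
types follows; none of these types or fields are real Ubernetes or
Kubernetes APIs, they simply restate the bullets above in code.

```go
// Hypothetical sketch of federated API objects: a registrable Cluster and a
// multicluster-aware federated replication controller.
package main

import "fmt"

// Cluster is a first-class, registrable object in the federation API.
type Cluster struct {
	Name        string
	APIEndpoint string
	Credentials string // reference to a secret, elided here
}

// FederatedReplicationController asks for a total replica count and lets the
// Policy Engine (or an explicit per-cluster override) decide the split.
type FederatedReplicationController struct {
	Name            string
	TotalReplicas   int
	PodTemplateName string         // stand-in for a full pod template
	ClusterReplicas map[string]int // filled in by the Policy Engine
}

func main() {
	fed := FederatedReplicationController{
		Name:            "frontend",
		TotalReplicas:   9,
		PodTemplateName: "frontend-v1",
		// One possible outcome of scheduling across three registered clusters:
		ClusterReplicas: map[string]int{"us-east": 3, "eu-west": 3, "asia-east": 3},
	}
	for cluster, n := range fed.ClusterReplicas {
		// In a real system this would create or update an ordinary
		// ReplicationController in each underlying cluster via its API endpoint.
		fmt.Printf("cluster %s: ensure ReplicationController %q with %d replicas\n",
			cluster, fed.Name, n)
	}
}
```
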
## Policy Engine and Migration/Replication Controllers

The Policy Engine decides which parts of each application go into each
cluster at any point in time, and stores this desired state in the
Desired Federation State store (an etcd or
similar). Migration/Replication Controllers reconcile this against the
desired states stored in the underlying Kubernetes clusters (by
watching both, and creating or updating the underlying Replication
Controllers and related Services accordingly).

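A minimal sketch of that reconciliation loop follows, with the Desired
Federation State store and the per-cluster API clients reduced to
in-memory maps and print statements (purely illustrative; no real etcd
or Kubernetes client APIs are used).

```go
// Purely illustrative sketch of the Migration/Replication Controller
// reconciliation described above: compare desired federation state with what
// each underlying cluster currently has, and correct any drift.
package main

import "fmt"

// state maps cluster name -> controller name -> replica count.
type state map[string]map[string]int

func reconcile(desired, observed state) {
	for cluster, controllers := range desired {
		for name, want := range controllers {
			have := observed[cluster][name]
			if have == want {
				continue // already converged, nothing to do
			}
			// In a real controller this would be a create/update call against
			// that cluster's API server, not a print statement.
			fmt.Printf("cluster %s: set %s replicas %d -> %d\n", cluster, name, have, want)
		}
	}
}

func main() {
	desired := state{
		"us-east": {"frontend": 3},
		"eu-west": {"frontend": 3},
	}
	observed := state{
		"us-east": {"frontend": 3},
		"eu-west": {"frontend": 1}, // drifted, e.g. after a partial outage
	}
	reconcile(desired, observed)
}
```
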
## Authentication and Authorization

This should ideally be delegated to some external auth system, shared
by the underlying clusters, to avoid duplication and inconsistency.
Either that, or we end up with multilevel auth.  Local, read-only,
eventually consistent auth slaves in each cluster and in Ubernetes
could potentially cache auth, to mitigate an SPOF auth system.

## Data consistency, failure and availability characteristics

The services comprising the Ubernetes Control Plane have to run
somewhere.  Several options exist here:

* For high availability Ubernetes deployments, these services may run
  in either:
  * a dedicated Kubernetes cluster, not co-located in the same
    availability zone with any of the federated clusters (for fault
    isolation reasons).  If that cluster/availability zone, and hence
    the Federation system, fails catastrophically, the underlying pods
    and applications continue to run correctly, albeit temporarily
    without the Federation system.
  * across multiple Kubernetes availability zones, probably with
    some sort of cross-AZ quorum-based store.  This provides
    theoretically higher availability, at the cost of some
    complexity related to data consistency across multiple
    availability zones.
* For simpler, less highly available deployments, just co-locate the
  Federation control plane in/on/with one of the underlying
  Kubernetes clusters.  The downside of this approach is that if
  that specific cluster fails, all automated failover and scaling
  logic which relies on the federation system will also be
  unavailable at the same time (i.e. precisely when it is needed).
  But if one of the other federated clusters fails, everything
  should work just fine.

There is some further thinking to be done around the data consistency
model upon which the Federation system is based, and its impact
on the detailed semantics, failure and availability
characteristics of the system.

## Proposed Next Steps

Identify concrete applications of each use case and configure a proof
of concept service that exercises the use case.  For example, cluster
failure tolerance seems popular, so set up an Apache frontend with
replicas in each of three availability zones with either an Amazon Elastic
Load Balancer or Google Cloud Load Balancer pointing at them. What
does the ZooKeeper config look like for N=3 across 3 AZs -- and how
does each replica find the other replicas and how do clients find
their primary ZooKeeper replica? And now how do I do a shared, highly
available Redis database?  Use a few common specific use cases like
this to flesh out the detailed API and semantics of Ubernetes.