Cloning docs for 0.19.0

release-0.19.0/docs/proposals/autoscaling.md

## Abstract

Auto-scaling is a data-driven feature that allows users to increase or decrease capacity as needed by controlling the
number of pods deployed within the system automatically.

## Motivation

Applications experience peaks and valleys in usage. In order to respond to increases and decreases in load, administrators
scale their applications by adding computing resources. In the cloud computing environment this can be
done automatically based on statistical analysis and thresholds.

### Goals

* Provide a concrete proposal for implementing auto-scaling pods within Kubernetes
* Implementation proposal should be in line with current discussions in existing issues:
    * Scale verb - [1629](https://github.com/GoogleCloudPlatform/kubernetes/issues/1629)
    * Config conflicts - [Config](https://github.com/GoogleCloudPlatform/kubernetes/blob/c7cb991987193d4ca33544137a5cb7d0292cf7df/docs/config.md#automated-re-configuration-processes)
    * Rolling updates - [1353](https://github.com/GoogleCloudPlatform/kubernetes/issues/1353)
    * Multiple scalable types - [1624](https://github.com/GoogleCloudPlatform/kubernetes/issues/1624)

## Constraints and Assumptions

* This proposal is for horizontal scaling only. Vertical scaling will be handled in [issue 2072](https://github.com/GoogleCloudPlatform/kubernetes/issues/2072)
* `ReplicationControllers` will not know about the auto-scaler; they are the target of the auto-scaler. The `ReplicationController` responsibilities are
constrained to only ensuring that the desired number of pods are operational per the [Replication Controller Design](http://docs.k8s.io/replication-controller.md#responsibilities-of-the-replication-controller)
* Auto-scalers will be loosely coupled with data gathering components in order to allow a wide variety of input sources
* Auto-scalable resources will support a scale verb ([1629](https://github.com/GoogleCloudPlatform/kubernetes/issues/1629))
such that the auto-scaler does not directly manipulate the underlying resource.
* Initially, most thresholds will be set by application administrators. It should be possible for an auto-scaler to be
written later that sets thresholds automatically based on past behavior (CPU used vs incoming requests).
* The auto-scaler must be aware of user defined actions so it does not override them unintentionally (for instance, someone
explicitly setting the replica count to 0 should mean that the auto-scaler does not try to scale the application up)
* It should be possible to write and deploy a custom auto-scaler without modifying existing auto-scalers
* Auto-scalers must be able to monitor multiple replication controllers while only targeting a single scalable
object (for now a `ReplicationController`, but in the future it could be a job or any resource that implements scale)

## Use Cases

### Scaling based on traffic

The current, most obvious, use case is scaling an application based on network traffic like requests per second. Most
applications will expose one or more network endpoints for clients to connect to. Many of those endpoints will be load
balanced or situated behind a proxy - the data from those proxies and load balancers can be used to estimate
client-to-server traffic for applications. This is the primary, but not sole, source of data for making decisions.

Within Kubernetes a [kube proxy](http://docs.k8s.io/services.md#ips-and-vips)
running on each node directs service requests to the underlying implementation.

While the proxy provides internal inter-pod connections, there will be L3 and L7 proxies and load balancers that manage
traffic to backends. OpenShift, for instance, adds a "route" resource for defining external to internal traffic flow.
The "routers" are HAProxy or Apache load balancers that aggregate many different services and pods and can serve as a
data source for the number of backends.

### Scaling based on predictive analysis

Scaling may also occur based on predictions of system state like anticipated load, historical data, etc. Hand in hand
with scaling based on traffic, predictive analysis may be used to determine anticipated system load and scale the application automatically.

### Scaling based on arbitrary data

Administrators may wish to scale the application based on any number of arbitrary data points such as job execution time or
duration of active sessions. There are any number of reasons an administrator may wish to increase or decrease capacity, which
means the auto-scaler must be a configurable, extensible component.

## Specification

In order to facilitate talking about auto-scaling the following definitions are used:

* `ReplicationController` - the first building block of auto scaling. Pods are deployed and scaled by a `ReplicationController`.
* kube proxy - the proxy handles internal inter-pod traffic; an example of a data source to drive an auto-scaler
* L3/L7 proxies - a routing layer handling outside-to-inside traffic requests; an example of a data source to drive an auto-scaler
* auto-scaler - scales replicas up and down by using the `scale` endpoint provided by scalable resources (`ReplicationController`)

### Auto-Scaler

The Auto-Scaler is a state reconciler responsible for checking data against configured scaling thresholds
and calling the `scale` endpoint to change the number of replicas. The scaler will
use a client/cache implementation to receive watch data from the data aggregators and respond to them by
scaling the application. Auto-scalers are created and defined like other resources via REST endpoints and belong to a
namespace, just as a `ReplicationController` or `Service` does.

Since an auto-scaler is a durable object it is best represented as a resource.
```go
// The auto-scaler interface.
type AutoScalerInterface interface {
    // ScaleApplication adjusts a resource's replica count by calling the scale endpoint.
    // Args to this are based on what the endpoint
    // can support. See https://github.com/GoogleCloudPlatform/kubernetes/issues/1629
    ScaleApplication(num int) error
}

type AutoScaler struct {
    // common construct
    TypeMeta
    // common construct
    ObjectMeta

    // Spec defines the configuration options that drive the behavior for this auto-scaler
    Spec AutoScalerSpec

    // Status defines the current status of this auto-scaler.
    Status AutoScalerStatus
}

type AutoScalerSpec struct {
    // AutoScaleThresholds holds a collection of AutoScaleThresholds that drive the auto scaler
    AutoScaleThresholds []AutoScaleThreshold

    // Enabled turns auto scaling on or off
    Enabled bool

    // MaxAutoScaleCount defines the max replicas that the auto scaler can use.
    // This value must be greater than 0 and >= MinAutoScaleCount
    MaxAutoScaleCount int

    // MinAutoScaleCount defines the minimum number of replicas that the auto scaler can reduce to;
    // 0 means that the application is allowed to idle
    MinAutoScaleCount int

    // TargetSelector provides the scalable target(s). Right now this is a ReplicationController;
    // in the future it could be a job or any resource that implements scale.
    TargetSelector map[string]string

    // MonitorSelector defines a set of capacity that the auto-scaler is monitoring
    // (replication controllers). Monitored objects are used by thresholds to examine
    // statistics. Example: get statistic X for object Y to see if threshold is passed
    MonitorSelector map[string]string
}

type AutoScalerStatus struct {
    // TODO: open for discussion on what meaningful information can be reported in the status.
    // The status may return the replica count here but we may want more information,
    // such as whether the count reflects a threshold being passed.
}


// AutoScaleThresholdInterface abstracts the data analysis from the auto-scaler.
// Example: scale by 1 (Increment) when RequestsPerSecond (Type) passes
// comparison (Comparison) of 50 (Value) for 30 seconds (Duration).
type AutoScaleThresholdInterface interface {
    // ShouldScale is called by the auto-scaler to determine if this threshold is met or not.
    ShouldScale() bool
}


// AutoScaleThreshold is a single statistic used to drive the auto-scaler in scaling decisions.
type AutoScaleThreshold struct {
    // Type is the type of threshold being used, intention or value
    Type AutoScaleThresholdType

    // ValueConfig holds the config for value based thresholds
    ValueConfig AutoScaleValueThresholdConfig

    // IntentionConfig holds the config for intention based thresholds
    IntentionConfig AutoScaleIntentionThresholdConfig
}

// AutoScaleIntentionThresholdConfig holds configuration for intention based thresholds.
// An intention based threshold defines no increment; the scaler will adjust by 1 accordingly
// and maintain once the intention is reached. Also, no selector is defined; the intention
// should dictate the selector used for statistics. The same goes for duration, although we
// may want a configurable duration later so intentions are more customizable.
type AutoScaleIntentionThresholdConfig struct {
    // Intent is the lexicon of what intention is requested
    Intent AutoScaleIntentionType

    // Value is intention dependent in terms of above, below, equal and represents
    // the value to check against
    Value float64
}

// AutoScaleValueThresholdConfig holds configuration for value based thresholds.
type AutoScaleValueThresholdConfig struct {
    // Increment determines how the auto-scaler should scale up or down (positive number to
    // scale up based on this threshold, negative number to scale down by this threshold)
    Increment int
    // Selector represents the retrieval mechanism for a statistic value from statistics
    // storage. Once statistics are better defined the retrieval mechanism may change.
    // Ultimately, the selector returns a representation of a statistic that can be
    // compared against the threshold value.
    Selector map[string]string
    // Duration is the time lapse after which this threshold is considered passed
    Duration time.Duration
    // Value is the number at which, after the duration is passed, this threshold is considered
    // to be triggered
    Value float64
    // Comparison component to be applied to the value.
    Comparison string
}

// AutoScaleThresholdType is either intention based or value based.
type AutoScaleThresholdType string

// AutoScaleIntentionType is a lexicon for intentions such as "cpu-utilization",
// "max-rps-per-endpoint".
type AutoScaleIntentionType string
```

#### Boundary Definitions
The `AutoScaleThreshold` definitions provide the boundaries for the auto-scaler. By defining comparisons that form a range,
along with positive and negative increments, you may define bi-directional scaling. For example, the upper bound may be
specified as "when requests per second rise above 50 for 30 seconds scale the application up by 1" and a lower bound may
be specified as "when requests per second fall below 25 for 30 seconds scale the application down by 1 (implemented by using -1)".
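
As a concrete illustration of the example above, the sketch below expresses that bi-directional boundary with the
proposed types. It is illustrative only: the `"value"` threshold type constant, the `"statistic"` selector key and the
`">"`/`"<"` comparison strings are assumptions, since this proposal does not yet define those lexicons.

```go
// A hypothetical pair of value based thresholds forming the boundary described above:
// scale up by 1 when requests per second stays above 50 for 30 seconds, and
// scale down by 1 when it stays below 25 for 30 seconds.
var requestBoundary = []AutoScaleThreshold{
    {
        Type: "value", // assumed spelling of the value based threshold type
        ValueConfig: AutoScaleValueThresholdConfig{
            Increment:  1,
            Selector:   map[string]string{"statistic": "requests-per-second"},
            Duration:   30 * time.Second,
            Value:      50,
            Comparison: ">",
        },
    },
    {
        Type: "value",
        ValueConfig: AutoScaleValueThresholdConfig{
            Increment:  -1,
            Selector:   map[string]string{"statistic": "requests-per-second"},
            Duration:   30 * time.Second,
            Value:      25,
            Comparison: "<",
        },
    },
}
```
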
### Data Aggregator

This section has intentionally been left empty. I will defer to folks who have more experience gathering and analyzing
time series statistics.

Data aggregation is opaque to the auto-scaler resource. The auto-scaler is configured to use `AutoScaleThresholds`
that know how to work with the underlying data in order to know if an application must be scaled up or down. Data aggregation
must feed a common data structure to ease the development of `AutoScaleThreshold`s, but it does not matter to the
auto-scaler whether this occurs in a push or pull implementation, whether or not the data is stored at a granular level,
or what algorithm is used to determine the final statistics value. Ultimately, the auto-scaler only requires that a statistic
resolves to a value that can be checked against a configured threshold.

Of note: if the statistics gathering mechanisms can be initialized with a registry, other components storing statistics can
potentially piggyback on this registry.
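
To make that contract concrete, here is a minimal sketch (not part of the proposed API) of how a value based threshold
might implement `AutoScaleThresholdInterface`. The `StatisticsClient` interface is hypothetical and simply stands in for
whatever the data aggregation layer ends up exposing.

```go
// StatisticsClient is a hypothetical handle to the data aggregation layer. It resolves
// a selector to the statistic's current value and how long that value has satisfied
// the configured comparison.
type StatisticsClient interface {
    Resolve(selector map[string]string) (value float64, held time.Duration, err error)
}

// valueThreshold is a sketch of a value based threshold check.
type valueThreshold struct {
    config AutoScaleValueThresholdConfig
    stats  StatisticsClient
}

// ShouldScale reports whether the configured statistic has passed the threshold
// value for at least the configured duration.
func (t *valueThreshold) ShouldScale() bool {
    value, held, err := t.stats.Resolve(t.config.Selector)
    if err != nil {
        return false // fail closed: no data, no scaling decision
    }
    passed := false
    switch t.config.Comparison {
    case ">":
        passed = value > t.config.Value
    case "<":
        passed = value < t.config.Value
    }
    return passed && held >= t.config.Duration
}
```
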
### Multi-target Scaling Policy
If multiple scalable targets satisfy the `TargetSelector` criteria, the auto-scaler should be configurable as to which
target(s) are scaled. To begin with, if multiple targets are found, the auto-scaler will scale the largest target up
or down as appropriate. In the future this may be made more configurable.
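
A sketch of the "scale the largest target" rule, assuming a hypothetical helper view of the replication controllers
matching `TargetSelector` together with their current replica counts:

```go
// scalableTarget is a hypothetical view of one replication controller matching TargetSelector.
type scalableTarget struct {
    Name     string
    Replicas int
}

// pickLargestTarget returns the matching target with the most replicas, which is the
// target the auto-scaler adjusts when several targets satisfy TargetSelector.
func pickLargestTarget(targets []scalableTarget) (scalableTarget, bool) {
    if len(targets) == 0 {
        return scalableTarget{}, false
    }
    largest := targets[0]
    for _, t := range targets[1:] {
        if t.Replicas > largest.Replicas {
            largest = t
        }
    }
    return largest, true
}
```
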
### Interactions with a deployment

In a deployment it is likely that multiple replication controllers must be monitored. For instance, in a [rolling deployment](http://docs.k8s.io/replication-controller.md#rolling-updates)
there will be multiple replication controllers, with one scaling up and another scaling down. This means that an
auto-scaler must be aware of the entire set of capacity that backs a service so it does not fight with the deployer. `AutoScalerSpec.MonitorSelector`
is what provides this ability. By using a selector that spans the entire service the auto-scaler can monitor capacity
of multiple replication controllers and check that capacity against the `AutoScalerSpec.MaxAutoScaleCount` and
`AutoScalerSpec.MinAutoScaleCount` while still only targeting a specific set of `ReplicationController`s with `TargetSelector`.

In the course of a deployment it is up to the deployment orchestration to decide how to manage the labels
on the replication controllers if it needs to ensure that only specific replication controllers are targeted by
the auto-scaler. By default, the auto-scaler will scale the largest replication controller that meets the target label
selector criteria.

During deployment orchestration the auto-scaler may be making decisions to scale its target up or down. In order to prevent
the scaler from fighting with a deployment process that is scaling one replication controller up and scaling another one
down, the deployment process must assume that the current replica count may be changed by objects other than itself and
account for this in the scale up or down process. Therefore, the deployment process may no longer target an exact number
of instances to be deployed. It must be satisfied that the replica count for the deployment meets or exceeds the number
of requested instances.

Auto-scaling down in a deployment scenario is a special case. In order for the deployment to complete successfully the
deployment orchestration must ensure that the desired number of instances that are supposed to be deployed has been met.
If the auto-scaler is trying to scale the application down (due to no traffic, or other statistics) then the deployment
process and auto-scaler are fighting to increase and decrease the count of the targeted replication controller. In order
to prevent this, deployment orchestration should notify the auto-scaler that a deployment is occurring. This will
temporarily disable negative decrement thresholds until the deployment process is completed. It is more important for
an auto-scaler to be able to grow capacity during a deployment than to shrink the number of instances precisely.

release-0.19.0/docs/proposals/federation-high-level-arch.png (binary image, 31 KiB)

release-0.19.0/docs/proposals/federation.md

# Kubernetes Cluster Federation
## (a.k.a. "Ubernetes")

## Requirements Analysis and Product Proposal

## _by Quinton Hoole ([quinton@google.com](mailto:quinton@google.com))_
_Initial revision: 2015-03-05_
_Last updated: 2015-03-09_
This doc: [tinyurl.com/ubernetes](http://tinyurl.com/ubernetes)
Slides: [tinyurl.com/ubernetes-slides](http://tinyurl.com/ubernetes-slides)

## Introduction

Today, each Kubernetes cluster is a relatively self-contained unit,
which typically runs in a single "on-premise" data centre or single
availability zone of a cloud provider (Google's GCE, Amazon's AWS,
etc).

Several current and potential Kubernetes users and customers have
expressed a keen interest in tying together ("federating") multiple
clusters in some sensible way in order to enable the following kinds
of use cases (intentionally vague):

1. _"Preferentially run my workloads in my on-premise cluster(s), but
   automatically overflow to my cloud-hosted cluster(s) if I run out
   of on-premise capacity"_.
1. _"Most of my workloads should run in my preferred cloud-hosted
   cluster(s), but some are privacy-sensitive, and should be
   automatically diverted to run in my secure, on-premise
   cluster(s)"_.
1. _"I want to avoid vendor lock-in, so I want my workloads to run
   across multiple cloud providers all the time. I change my set of
   such cloud providers, and my pricing contracts with them,
   periodically"_.
1. _"I want to be immune to any single data centre or cloud
   availability zone outage, so I want to spread my service across
   multiple such zones (and ideally even across multiple cloud
   providers)."_

The above use cases are by necessity left imprecisely defined. The
rest of this document explores these use cases and their implications
in further detail, and compares a few alternative high level
approaches to addressing them. The idea of cluster federation has
informally become known as _"Ubernetes"_.

## Summary/TL;DR

TBD

## What exactly is a Kubernetes Cluster?

A central design concept in Kubernetes is that of a _cluster_. While
loosely speaking, a cluster can be thought of as running in a single
data center, or cloud provider availability zone, a more precise
definition is that each cluster provides:

1. a single Kubernetes API entry point,
1. a consistent, cluster-wide resource naming scheme,
1. a scheduling/container placement domain,
1. a service network routing domain,
1. (in future) an authentication and authorization model,
1. ....

The above in turn imply the need for a relatively performant, reliable
and cheap network within each cluster.

There is also assumed to be some degree of failure correlation across
a cluster, i.e. whole clusters are expected to fail, at least
occasionally (due to cluster-wide power and network failures, natural
disasters etc). Clusters are often relatively homogeneous in that all
compute nodes are typically provided by a single cloud provider or
hardware vendor, and connected by a common, unified network fabric.
But these are not hard requirements of Kubernetes.

Other classes of Kubernetes deployments than the one sketched above
are technically feasible, but come with some challenges of their own,
and are not yet common or explicitly supported.

More specifically, having a Kubernetes cluster span multiple
well-connected availability zones within a single geographical region
(e.g. US North East, UK, Japan etc) is worthy of further
consideration, in particular because it potentially addresses
some of these requirements.

## What use cases require Cluster Federation?

Let's name a few concrete use cases to aid the discussion:
## 1. Capacity Overflow

_"I want to preferentially run my workloads in my on-premise cluster(s), but automatically "overflow" to my cloud-hosted cluster(s) when I run out of on-premise capacity."_

This idea is known in some circles as "[cloudbursting](http://searchcloudcomputing.techtarget.com/definition/cloud-bursting)".

**Clarifying questions:** What is the unit of overflow? Individual
pods? Probably not always. Replication controllers and their
associated sets of pods? Groups of replication controllers
(a.k.a. distributed applications)? How are persistent disks
overflowed? Can the "overflowed" pods communicate with their
brethren and sistren pods and services in the other cluster(s)?
Presumably yes, at higher cost and latency, provided that they use
external service discovery. Is "overflow" enabled only when creating
new workloads/replication controllers, or are existing workloads
dynamically migrated between clusters based on fluctuating available
capacity? If so, what is the desired behaviour, and how is it
achieved? How, if at all, does this relate to quota enforcement
(e.g. if we run out of on-premise capacity, can all or only some
quotas transfer to other, potentially more expensive off-premise
capacity?)

It seems that most of this boils down to:

1. **location affinity** (pods relative to each other, and to other
   stateful services like persistent storage - how is this expressed
   and enforced?)
1. **cross-cluster scheduling** (given location affinity constraints
   and other scheduling policy, which resources are assigned to which
   clusters, and by what?)
1. **cross-cluster service discovery** (how do pods in one cluster
   discover and communicate with pods in another cluster?)
1. **cross-cluster migration** (how do compute and storage resources,
   and the distributed applications to which they belong, move from
   one cluster to another)

## 2. Sensitive Workloads

_"I want most of my workloads to run in my preferred cloud-hosted
cluster(s), but some are privacy-sensitive, and should be
automatically diverted to run in my secure, on-premise cluster(s). The
list of privacy-sensitive workloads changes over time, and they're
subject to external auditing."_

**Clarifying questions:** What kinds of rules determine which
workloads go where? Is a static mapping from container (or more
typically, replication controller) to cluster maintained and
enforced? If so, is it only enforced on startup, or are things
migrated between clusters when the mappings change? This starts to
look quite similar to "1. Capacity Overflow", and again seems to
boil down to:

1. location affinity
1. cross-cluster scheduling
1. cross-cluster service discovery
1. cross-cluster migration
with the possible addition of:

+ cross-cluster monitoring and auditing (which is conveniently deemed
  to be outside the scope of this document, for the time being at
  least)

## 3. Vendor lock-in avoidance

_"My CTO wants us to avoid vendor lock-in, so she wants our workloads
to run across multiple cloud providers at all times. She changes our
set of preferred cloud providers and pricing contracts with them
periodically, and doesn't want to have to communicate and manually
enforce these policy changes across the organization every time this
happens. She wants it centrally and automatically enforced, monitored
and audited."_

**Clarifying questions:** Again, I think that this can potentially be
reformulated as a Capacity Overflow problem - the fundamental
principles seem to be the same or substantially similar to those
above.

## 4. "Unavailability Zones"

_"I want to be immune to any single data centre or cloud availability
zone outage, so I want to spread my service across multiple such zones
(and ideally even across multiple cloud providers), and have my
service remain available even if one of the availability zones or
cloud providers "goes down"_.

It seems useful to split this into two sub use cases:

1. Multiple availability zones within a single cloud provider (across
   which feature sets like private networks, load balancing,
   persistent disks, data snapshots etc are typically consistent and
   explicitly designed to inter-operate).
1. Multiple cloud providers (typically with inconsistent feature sets
   and more limited interoperability).

The single cloud provider case might be easier to implement (although
the multi-cloud provider implementation should just work for a single
cloud provider). Propose a high-level design catering for both, with
the initial implementation targeting a single cloud provider only.

**Clarifying questions:**
**How does global external service discovery work?** In the steady
state, which external clients connect to which clusters? GeoDNS or
similar? What is the tolerable failover latency if a cluster goes
down? Making up some numbers (and notwithstanding some buggy DNS
resolvers, TTLs, caches etc), maybe ~3 minutes for ~90% of clients to
re-issue DNS lookups and reconnect to a new cluster when their home
cluster fails is good enough for most Kubernetes users (or at least
way better than the status quo), given that these sorts of failures
only happen a small number of times a year?

**How does dynamic load balancing across clusters work, if at all?**
One simple starting point might be "it doesn't". i.e. if a service
in a cluster is deemed to be "up", it receives as much traffic as is
generated "nearby" (even if it overloads). If the service is deemed
to "be down" in a given cluster, "all" nearby traffic is redirected
to some other cluster within some number of seconds (failover could
be automatic or manual). Failover is essentially binary. An
improvement would be to detect when a service in a cluster reaches
maximum serving capacity, and dynamically divert additional traffic
to other clusters. But how exactly does all of this work, and how
much of it is provided by Kubernetes, as opposed to something else
bolted on top (e.g. external monitoring and manipulation of GeoDNS)?

**How does this tie in with auto-scaling of services?** More
specifically, if I run my service across _n_ clusters globally, and
one (or more) of them fail, how do I ensure that the remaining _n-1_
clusters have enough capacity to serve the additional, failed-over
traffic? Either:

1. I constantly over-provision all clusters by 1/n (potentially expensive), or
1. I "manually" update my replica count configurations in the
   remaining clusters by 1/n when the failure occurs, and Kubernetes
   takes care of the rest for me, or
1. Auto-scaling (not yet available) in the remaining clusters takes
   care of it for me automagically as the additional failed-over
   traffic arrives (with some latency).
1. I manually specify "additional resources to be provisioned" per
   remaining cluster, possibly proportional to both the remaining functioning resources
   and the unavailable resources in the failed cluster(s).
   (All the benefits of over-provisioning, without expensive idle resources.)

Doing nothing (i.e. forcing users to choose between 1 and 2 on their
own) is probably an OK starting point. Kubernetes autoscaling can get
us to 3 at some later date.
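
To make the trade-off in option 1 above concrete, here is a back-of-the-envelope sketch (not part of the proposal). It
assumes traffic is spread evenly across clusters and that we only need to survive the loss of any single cluster; the
function name is mine.

```go
// requiredReplicasPerCluster returns how many replicas each of `clusters` clusters
// must be provisioned with so that the remaining clusters can still serve the full
// `totalReplicas` worth of traffic after any single cluster fails.
func requiredReplicasPerCluster(totalReplicas, clusters int) int {
    if clusters <= 1 {
        return totalReplicas // a single cluster must carry everything
    }
    // Survive the loss of one cluster: the other (clusters - 1) must cover the total.
    perCluster := totalReplicas / (clusters - 1)
    if totalReplicas%(clusters-1) != 0 {
        perCluster++ // round up rather than under-provision
    }
    return perCluster
}
```

For example, 300 total replicas spread across 3 clusters requires 150 per cluster (rather than 100), i.e. 50% permanent
head-room in each cluster, whereas across 6 clusters it requires only 60 per cluster, i.e. 20% head-room. Option 1 trades
this permanently idle capacity for instant failover.
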
Up to this point, this use case ("Unavailability Zones") seems materially different from all the others above. It does not require dynamic cross-cluster service migration (we assume that the service is already running in more than one cluster when the failure occurs). Nor does it necessarily involve cross-cluster service discovery or location affinity. As a result, I propose that we address this use case somewhat independently of the others (although I strongly suspect that it will become substantially easier once we've solved the others).

All of the above (regarding "Unavailability Zones") refers primarily
to already-running user-facing services, and minimizing the impact on
end users of those services becoming unavailable in a given cluster.
What about the people and systems that deploy Kubernetes services
(devops etc)? Should they be automatically shielded from the impact
of the cluster outage? i.e. have their new resource creation requests
automatically diverted to another cluster during the outage? While
this specific requirement seems non-critical (manual fail-over seems
relatively non-arduous, ignoring the user-facing issues above), it
smells a lot like the first three use cases listed above ("Capacity
Overflow, Sensitive Services, Vendor lock-in..."), so if we address
those, we probably get this one free of charge.

## Core Challenges of Cluster Federation

As we saw above, a few common challenges fall out of most of the use
cases considered above, namely:

## Location Affinity

Can the pods comprising a single distributed application be
partitioned across more than one cluster? More generally, how far
apart, in network terms, can a given client and server within a
distributed application reasonably be? A server need not necessarily
be a pod, but could instead be a persistent disk housing data, or some
other stateful network service. What is tolerable is typically
application-dependent, primarily influenced by network bandwidth
consumption, latency requirements and cost sensitivity.

For simplicity, let's assume that all Kubernetes distributed
applications fall into one of three categories with respect to relative
location affinity:

1. **"Strictly Coupled"**: Those applications that strictly cannot be
   partitioned between clusters. They simply fail if they are
   partitioned. When scheduled, all pods _must_ be scheduled to the
   same cluster. To move them, we need to shut the whole distributed
   application down (all pods) in one cluster, possibly move some
   data, and then bring up all of the pods in another cluster. To
   avoid downtime, we might bring up the replacement cluster and
   divert traffic there before turning down the original, but the
   principle is much the same. In some cases moving the data might be
   prohibitively expensive or time-consuming, in which case these
   applications may be effectively _immovable_.
1. **"Strictly Decoupled"**: Those applications that can be
   indefinitely partitioned across more than one cluster, to no
   disadvantage. An embarrassingly parallel YouTube porn detector,
   where each pod repeatedly dequeues a video URL from a remote work
   queue, downloads and chews on the video for a few hours, and
   arrives at a binary verdict, might be one such example. The pods
   derive no benefit from being close to each other, or anything else
   (other than the source of YouTube videos, which is assumed to be
   equally remote from all clusters in this example). Each pod can be
   scheduled independently, in any cluster, and moved at any time.
1. **"Preferentially Coupled"**: Somewhere between Coupled and Decoupled. These applications prefer to have all of their pods located in the same cluster (e.g. for failure correlation, network latency or bandwidth cost reasons), but can tolerate being partitioned for "short" periods of time (for example while migrating the application from one cluster to another). Most small to medium sized LAMP stacks with not-very-strict latency goals probably fall into this category (provided that they use sane service discovery and reconnect-on-fail, which they need to do anyway to run effectively, even in a single Kubernetes cluster).

And then there's what I'll call _absolute_ location affinity. Some
applications are required to run in bounded geographical or network
topology locations. The reasons for this are typically
political/legislative (data privacy laws etc), or driven by network
proximity to consumers (or data providers) of the application ("most
of our users are in Western Europe, U.S. West Coast" etc).

**Proposal:** First tackle Strictly Decoupled applications (which can
be trivially scheduled, partitioned or moved, one pod at a time).
Then tackle Preferentially Coupled applications (which must be
scheduled in totality in a single cluster, and can be moved, but
only in their totality, and necessarily within some bounded time).
Leave Strictly Coupled applications to be manually moved between
clusters as required for the foreseeable future.

## Cross-cluster service discovery

I propose having pods use the standard discovery methods used by external clients of Kubernetes applications (i.e. DNS). DNS might resolve to a public endpoint in the local or a remote cluster. Other than for Strictly Coupled applications, software should be largely oblivious of which of the two occurs.
_Aside:_ How do we avoid "tromboning" through an external VIP when DNS
resolves to a public IP on the local cluster? Strictly speaking this
would be an optimization, and probably only matters to high bandwidth,
low latency communications. We could potentially eliminate the
trombone with some kube-proxy magic if necessary. More detail to be
added here, but feel free to shoot down the basic DNS idea in the
meantime.
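
A minimal sketch of what such "largely oblivious" client code could look like under this DNS-based approach. The service
name and port are hypothetical, the `net` and `time` imports are assumed, and resolution may land on the local or a
remote cluster's public endpoint either way:

```go
// dialService resolves a service by its DNS name and connects to whichever endpoint
// DNS returns - which may live in the local cluster or in a remote one. Callers are
// expected to reconnect on failure, at which point DNS may steer them to a different
// (e.g. still healthy) cluster.
func dialService(host, port string) (net.Conn, error) {
    addrs, err := net.LookupHost(host) // e.g. "my-service.example.com" (hypothetical)
    if err != nil {
        return nil, err
    }
    var lastErr error
    for _, addr := range addrs {
        conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, port), 5*time.Second)
        if err == nil {
            return conn, nil
        }
        lastErr = err
    }
    return nil, lastErr
}
```
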
## Cross-cluster Scheduling

This is closely related to location affinity above, and also discussed
there. The basic idea is that some controller, logically outside of
the basic Kubernetes control plane of the clusters in question, needs
to be able to do the following (a rough sketch of such a controller
appears after the list):

1. Receive "global" resource creation requests.
1. Make policy-based decisions as to which cluster(s) should be used
   to fulfill each given resource request. In a simple case, the
   request is just redirected to one cluster. In a more complex case,
   the request is "demultiplexed" into multiple sub-requests, each to
   a different cluster. Knowledge of the (albeit approximate)
   available capacity in each cluster will be required by the
   controller to sanely split the request. Similarly, knowledge of
   the properties of the application (Location Affinity class --
   Strictly Coupled, Strictly Decoupled etc, privacy class etc) will
   be required.
1. Multiplex the responses from the individual clusters into an
   aggregate response.
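
The sketch below shows the demultiplex/multiplex shape of such a controller. Every name here is hypothetical
(`FederationScheduler`, `PolicyEngine`, `ClusterAPI`, `ResourceRequest` and `ResourceResponse` are all placeholders for
things this proposal has not yet defined); only the control flow is the point.

```go
// ResourceRequest and ResourceResponse are placeholders for a "global" resource
// creation request and a per-cluster result; ClusterAPI is a placeholder for a
// per-cluster Kubernetes API client.

// PolicyEngine splits a global request into per-cluster sub-requests, using knowledge
// of approximate free capacity per cluster and of the application's location affinity
// and privacy classes.
type PolicyEngine interface {
    Split(req ResourceRequest) (map[string]ResourceRequest, error) // cluster name -> sub-request
}

// FederationScheduler is a hypothetical controller living outside the per-cluster
// control planes.
type FederationScheduler struct {
    policy   PolicyEngine
    clusters map[string]ClusterAPI
}

// Schedule receives a global request, demultiplexes it across clusters according to
// policy, and multiplexes the per-cluster results into one aggregate response.
func (s *FederationScheduler) Schedule(req ResourceRequest) ([]ResourceResponse, error) {
    subRequests, err := s.policy.Split(req)
    if err != nil {
        return nil, err
    }
    var responses []ResourceResponse
    for clusterName, subRequest := range subRequests {
        resp, err := s.clusters[clusterName].Create(subRequest)
        if err != nil {
            return nil, err
        }
        responses = append(responses, resp)
    }
    return responses, nil
}
```
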
## Cross-cluster Migration

Again this is closely related to location affinity discussed above,
and is in some sense an extension of Cross-cluster Scheduling. When
certain events occur, it becomes necessary or desirable for the
cluster federation system to proactively move distributed applications
(either in part or in whole) from one cluster to another. Examples of
such events include:

1. A low capacity event in a cluster (or a cluster failure).
1. A change of scheduling policy ("we no longer use cloud provider X").
1. A change of resource pricing ("cloud provider Y dropped their prices - let's migrate there").

Strictly Decoupled applications can be trivially moved, in part or in whole, one pod at a time, to one or more clusters.
For Preferentially Coupled applications, the federation system must first locate a single cluster with sufficient capacity to accommodate the entire application, then reserve that capacity, and incrementally move the application, one (or more) resources at a time, over to the new cluster, within some bounded time period (and possibly within a predefined "maintenance" window).
Strictly Coupled applications (with the exception of those deemed
completely immovable) require the federation system to:

1. start up an entire replica application in the destination cluster
1. copy persistent data to the new application instance
1. switch traffic across
1. tear down the original application instance

It is proposed that support for automated migration of Strictly Coupled applications be
deferred to a later date.
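
The Strictly Coupled case maps naturally onto a sequential controller routine. A sketch, in which every method on the
hypothetical `Migrator` interface stands in for machinery this proposal does not yet define:

```go
// Migrator is a hypothetical interface over the per-step machinery needed to move a
// Strictly Coupled application between clusters.
type Migrator interface {
    StartReplica(app, toCluster string) error  // bring up a full copy of the application
    CopyData(app, toCluster string) error      // copy persistent data to the new instance
    SwitchTraffic(app, toCluster string) error // repoint traffic at the new instance
    TearDown(app, fromCluster string) error    // remove the original instance
}

// migrateStrictlyCoupled runs the four steps listed above in order, stopping at the
// first failure so that the original instance keeps serving traffic.
func migrateStrictlyCoupled(m Migrator, app, from, to string) error {
    if err := m.StartReplica(app, to); err != nil {
        return err
    }
    if err := m.CopyData(app, to); err != nil {
        return err
    }
    if err := m.SwitchTraffic(app, to); err != nil {
        return err
    }
    return m.TearDown(app, from)
}
```
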
## Other Requirements

These are often left implicit by customers, but are worth calling out explicitly:

1. Software failure isolation between Kubernetes clusters should be
   retained as far as is practically possible. The federation system
   should not materially increase the failure correlation across
   clusters. For this reason the federation system should ideally be
   completely independent of the Kubernetes cluster control software,
   and look just like any other Kubernetes API client, with no special
   treatment. If the federation system fails catastrophically, the
   underlying Kubernetes clusters should remain independently usable.
1. Unified monitoring, alerting and auditing across federated Kubernetes clusters.
1. Unified authentication, authorization and quota management across
   clusters (this is in direct conflict with failure isolation above,
   so there are some tough trade-offs to be made here).

## Proposed High-Level Architecture

TBD: All very hand-wavey still, but some initial thoughts to get the conversation going...



## Ubernetes API

This looks a lot like the existing Kubernetes API, but is explicitly multi-cluster (a rough sketch of the additional
types appears after the list below).

+ Clusters become first class objects, which can be registered, listed, described, deregistered etc via the API.
+ Compute resources can be explicitly requested in specific clusters, or automatically scheduled to the "best" cluster by Ubernetes (by a pluggable Policy Engine).
+ There is a federated equivalent of a replication controller type, which is multicluster-aware, and delegates to cluster-specific replication controllers as required (e.g. a federated RC for n replicas might simply spawn multiple replication controllers in different clusters to do the hard work).
+ These federated replication controllers (and in fact all the
  services comprising the Ubernetes Control Plane) have to run
  somewhere. For high availability Ubernetes deployments, these
  services may run in a dedicated Kubernetes cluster, not physically
  co-located with any of the federated clusters. But for simpler
  deployments, they may be run in one of the federated clusters (but
  when that cluster goes down, Ubernetes is down, obviously).
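
A rough sketch of what the two additions above might look like as API types, modeled on the `ReplicationController`-style
objects used elsewhere in this repository's proposals; all names and fields here are illustrative only:

```go
// Cluster is a first class Ubernetes object representing one registered Kubernetes
// cluster (illustrative only).
type Cluster struct {
    TypeMeta
    ObjectMeta

    // Endpoint is the API entry point of the federated cluster.
    Endpoint string
    // Credentials names the secret used to authenticate to that cluster.
    Credentials string
}

// FederatedReplicationControllerSpec is a multicluster-aware analogue of a replication
// controller spec: Ubernetes satisfies Replicas by creating and resizing ordinary
// replication controllers in the selected clusters.
type FederatedReplicationControllerSpec struct {
    // Replicas is the total number of replicas desired across all clusters.
    Replicas int
    // ClusterSelector restricts which registered clusters may be used; an empty
    // selector leaves the choice entirely to the Policy Engine.
    ClusterSelector map[string]string
}
```
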
## Policy Engine and Migration/Replication Controllers

The Policy Engine decides which parts of each application go into each
cluster at any point in time, and stores this desired state in the
Desired Federation State store (an etcd or
similar). Migration/Replication Controllers reconcile this against the
desired states stored in the underlying Kubernetes clusters (by
watching both, and creating or updating the underlying Replication
Controllers and related Services accordingly).
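
The reconciliation described above is the same control-loop pattern used by the per-cluster controllers. A sketch, with
hypothetical `FederationStateStore` and `ClusterClient` interfaces standing in for the Desired Federation State store
and the per-cluster API clients (watch-driven invocation is omitted):

```go
// FederationStateStore is a hypothetical view of the Desired Federation State store.
type FederationStateStore interface {
    // DesiredPlacement returns the desired replica count per cluster for one application.
    DesiredPlacement(app string) (map[string]int, error)
}

// ClusterClient is a hypothetical client for one underlying Kubernetes cluster.
type ClusterClient interface {
    ReplicaCount(app string) (int, error)
    SetReplicaCount(app string, replicas int) error
}

// reconcileOnce compares the desired federation state for one application with what
// each underlying cluster is actually running, and creates or resizes the underlying
// replication controllers to close the gap.
func reconcileOnce(store FederationStateStore, clusters map[string]ClusterClient, app string) error {
    desired, err := store.DesiredPlacement(app)
    if err != nil {
        return err
    }
    for clusterName, want := range desired {
        cluster := clusters[clusterName]
        have, err := cluster.ReplicaCount(app)
        if err != nil {
            return err
        }
        if have != want {
            if err := cluster.SetReplicaCount(app, want); err != nil {
                return err
            }
        }
    }
    return nil
}
```
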
## Authentication and Authorization

This should ideally be delegated to some external auth system, shared
by the underlying clusters, to avoid duplication and inconsistency.
Either that, or we end up with multilevel auth. Local read-only,
eventually consistent auth slaves in each cluster and in Ubernetes
could potentially cache auth, to mitigate an SPOF auth system.

## Proposed Next Steps

Identify concrete applications of each use case and configure a proof
of concept service that exercises the use case. For example, cluster
failure tolerance seems popular, so set up an Apache frontend with
replicas in each of three availability zones with either an Amazon Elastic
Load Balancer or Google Cloud Load Balancer pointing at them. What
does the ZooKeeper config look like for N=3 across 3 AZs -- and how
does each replica find the other replicas, and how do clients find
their primary ZooKeeper replica? And now how do I do a shared, highly
available Redis database?

release-0.19.0/docs/proposals/high-availability.md

# High Availability of Scheduling and Controller Components in Kubernetes
This document serves as a proposal for high availability of the scheduler and controller components in Kubernetes. This proposal is intended to provide a simple high availability API for Kubernetes components, with the potential to extend to services running on Kubernetes. Those services would be subject to their own constraints.

## Design Options
For a complete reference see [this](https://www.ibm.com/developerworks/community/blogs/RohitShetty/entry/high_availability_cold_warm_hot?lang=en)

1. Hot Standby: In this scenario, data and state are shared between the two components such that an immediate failure in one component causes the standby daemon to take over exactly where the failed component had left off. This would be an ideal solution for Kubernetes; however, it poses a series of challenges in the case of controllers where component state is cached locally and not persisted in a transactional way to a storage facility. This would also introduce additional load on the apiserver, which is not desirable. As a result, we are **NOT** planning on this approach at this time.

2. **Warm Standby**: In this scenario there is only one active component acting as the master and additional components running but not providing service or responding to requests. Data and state are not shared between the active and standby components. When a failure occurs, the standby component that becomes the master must determine the current state of the system before resuming functionality. This is the approach that this proposal will leverage.

3. Active-Active (Load Balanced): Clients can simply load-balance across any number of servers that are currently running. Their general availability can be continuously updated, or published, such that load balancing only occurs across active participants. This aspect of HA is outside of the scope of *this* proposal because there is already a partial implementation in the apiserver.

## Design Discussion Notes on Leader Election
Implementation References:
* [zookeeper](http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection)
* [etcd](https://groups.google.com/forum/#!topic/etcd-dev/EbAa4fjypb4)
* [initialPOC](https://github.com/rrati/etcd-ha)

In HA, the apiserver will provide an API for sets of replicated clients to do master election: acquire the lease, renew the lease, and release the lease. This API is component agnostic, so a client will need to provide the component type and the lease duration when attempting to become master. The lease duration should be tuned per component. The apiserver will attempt to create a key in etcd based on the component type that contains the client's hostname/IP and port information. This key will be created with a TTL from the lease duration provided in the request. Failure to create this key means there is already a master of that component type, and the error from etcd will propagate to the client. Successfully creating the key means the client making the request is the master. Only the current master can renew the lease. When renewing the lease, the apiserver will update the existing key with a new TTL. The location in etcd for the HA keys is TBD.

The first component to request leadership will become the master. All other components of that type will fail until the current leader releases the lease, or fails to renew the lease within the expiration time. On startup, all components should attempt to become master. The component that succeeds becomes the master, and should perform all functions of that component. The components that fail to become the master should not perform any tasks, sleep for their lease duration, and then attempt to become the master again. A clean shutdown of the leader will cause a release of the lease and a new master will be elected.

The component that becomes master should create a thread to manage the lease. This thread should be created with a channel that the main process can use to release the master lease. The master should release the lease in cases of an unrecoverable error and clean shutdown. Otherwise, this process will renew the lease and sleep, waiting for the next renewal time or notification to release the lease. If there is a failure to renew the lease, this process should force the entire component to exit. Daemon exit is meant to prevent potential split-brain conditions. Daemon restart is implied in this scenario, by either the init system (systemd) or possible watchdog processes. (See Design Discussion Notes)
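
A sketch of the client-side loop described above, written against a hypothetical `HAClient` wrapper for the proposed
apiserver lease API (the real endpoint names and signatures are still to be defined; the `time` and `os` imports are
assumed):

```go
// HAClient is a hypothetical wrapper around the proposed apiserver lease API.
type HAClient interface {
    AcquireLease(component, hostPort string, duration time.Duration) error
    RenewLease(component, hostPort string, duration time.Duration) error
    ReleaseLease(component, hostPort string) error
}

// runLeaseLoop keeps trying to become master for the given component type and, once it
// succeeds, renews the lease until asked to stop or until a renewal fails. A failed
// renewal forces the whole component to exit to avoid split-brain conditions.
func runLeaseLoop(c HAClient, component, hostPort string, lease time.Duration, release <-chan struct{}) {
    for {
        // Not the master yet: sleep one lease duration between attempts.
        if err := c.AcquireLease(component, hostPort, lease); err != nil {
            time.Sleep(lease)
            continue
        }
        // We are the master: renew at half the lease duration until told to release.
        for {
            select {
            case <-release:
                c.ReleaseLease(component, hostPort)
                return
            case <-time.After(lease / 2):
                if err := c.RenewLease(component, hostPort, lease); err != nil {
                    os.Exit(1) // renewal failed; exit immediately to prevent split brain
                }
            }
        }
    }
}
```
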
## Options added to components with HA functionality
Some command line options would be added to components that can do HA:

* Lease Duration - How long a component can be master

## Design Discussion Notes
Some components may run numerous threads in order to perform tasks in parallel. Upon losing master status, such components should exit instantly instead of attempting to gracefully shut down such threads. This is to ensure that, in the case there's some propagation delay in informing the threads they should stop, the lame-duck threads won't interfere with the new master. The component should exit with an exit code indicating that the component is not the master. Since all components will be run by systemd or some other monitoring system, this will just result in a restart.

There is a short window after a new master acquires the lease, during which data from the old master might be committed. This is because there is currently no way to condition a write on its source being the master. Having the daemons exit shortens this window but does not eliminate it. A proper solution for this problem will be addressed at a later date. The proposed solution is as follows (a brief sketch of the write path appears after the list):

1. This requires transaction support in etcd (which is already planned - see [coreos/etcd#2675](https://github.com/coreos/etcd/pull/2675))

2. The entry in etcd that is tracking the lease for a given component (the "current master" entry) would have as its value the host:port of the lease-holder (as described earlier) and a sequence number. The sequence number is incremented whenever a new master gets the lease.

3. The master replica is aware of the latest sequence number.

4. Whenever the master replica sends a mutating operation to the API server, it includes the sequence number.

5. When the API server makes the corresponding write to etcd, it includes it in a transaction that does a compare-and-swap on the "current master" entry (old value == new value == host:port and sequence number from the replica that sent the mutating operation). This basically guarantees that if we elect a new master, all transactions coming from the old master will fail. You can think of this as the master attaching a "precondition" of its belief about who is the latest master.
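
To illustrate the intent of steps 2-5, here is a sketch of the write path using an entirely hypothetical transactional
store interface; the real mechanism depends on the planned etcd transaction support referenced above (the `errors`
import is assumed):

```go
// masterRecord mirrors the proposed "current master" entry: the lease-holder's
// host:port plus a sequence number that increments on every change of master.
type masterRecord struct {
    HostPort string
    Sequence int64
}

// TxnStore is a hypothetical transactional view of the underlying store.
type TxnStore interface {
    // WriteIf applies the write only if the "current master" entry still equals
    // expected (a compare-and-swap), and reports whether it did.
    WriteIf(expected masterRecord, key, value string) (bool, error)
}

// mutateAsMaster performs a mutating write on behalf of a replica that believes it is
// the master with the given sequence number. If another replica has since become the
// master (and bumped the sequence number), the compare fails and the write is rejected,
// which is exactly the guarantee steps 2-5 are after.
func mutateAsMaster(store TxnStore, believed masterRecord, key, value string) error {
    ok, err := store.WriteIf(believed, key, value)
    if err != nil {
        return err
    }
    if !ok {
        return errors.New("stale master: write rejected by compare-and-swap")
    }
    return nil
}
```
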
## Open Questions:
* Is there a desire to keep track of all nodes for a specific component type?