# Federated ReplicaSets: Requirements & Design Document
This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion.
Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com)
Based on discussions with
Quinton Hoole [quinton@google.com](mailto:quinton@google.com) and Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com)
## Overview
### Summary & Vision
When running a global application on a federation of Kubernetes
clusters, the owner currently has to start it in multiple clusters and
ensure that there are enough application replicas running both
locally in each of the clusters (so that, for example, users are
handled by a nearby cluster, with low latency) and globally (so that
there is always enough capacity to handle all traffic). If one of the
clusters has issues or doesn't have enough capacity to run the given set of
replicas, the replicas should be automatically moved to some other
cluster to keep the application responsive.
In single-cluster Kubernetes there is the concept of a ReplicaSet that
manages the replicas locally. We want to expand this concept to the
federation level.
### Goals
+ Win large enterprise customers who want to easily run applications
across multiple clusters
+ Create a reference controller implementation to facilitate bringing
other Kubernetes concepts to Federated Kubernetes.
## Glossary
Federation Cluster - a cluster that is a member of the federation.

Local ReplicaSet (LRS) - a ReplicaSet defined and running in a cluster
that is a member of the federation.

Federated ReplicaSet (FRS) - a ReplicaSet defined and running inside the Federated K8S server.

Federated ReplicaSet Controller (FRSC) - a controller running inside
the Federated K8S server that controls FRS.
## User Experience
### Critical User Journeys
+ [CUJ1] User wants to create a ReplicaSet in each of the federation
  clusters. They create a definition of a Federated ReplicaSet on the
  federated master and (local) ReplicaSets are automatically created
  in each of the federation clusters. The number of replicas in each
  of the Local ReplicaSets is (perhaps indirectly) configurable by
  the user.
+ [CUJ2] When the current number of replicas in a cluster drops below
  the desired number and new replicas cannot be scheduled there, they
  should be started in some other cluster.
### Features Enabling Critical User Journeys
Feature #1 -> CUJ1: A component which looks for newly created Federated ReplicaSets and
creates the appropriate Local ReplicaSet definitions in the federated
clusters.

Feature #2 -> CUJ2: A component that checks how many replicas are actually running in each
of the subclusters and whether that number matches the
FederatedReplicaSet preferences (by default spread replicas evenly
across the clusters, but custom preferences are allowed - see
below). If it doesn't, and the situation is unlikely to improve soon,
then the replicas should be moved to other subclusters.
### API and CLI
All interaction with a FederatedReplicaSet will be done by issuing
kubectl commands pointed at the Federated Master API Server. All the
commands would behave in a similar way as on a regular master;
however, in the next versions (1.5+) some of the commands may give
slightly different output. For example, kubectl describe on a federated
replica set should also give some information about the subclusters.
Moreover, for safety, some defaults will be different. For example, for
kubectl delete federatedreplicaset, cascade will be set to false.

FederatedReplicaSet will have the same object schema as the local ReplicaSet
(although it will be accessible in a different part of the
API). Scheduling preferences (how many replicas in which cluster) will
be passed as annotations.
### FederatedReplicaSet preferences
The preferences are expressed by the following structure, passed as
serialized JSON inside an annotation.
```
type FederatedReplicaSetPreferences struct {
  // If set to true then already scheduled and running replicas may be moved to other clusters
  // in order to bring the cluster replica sets towards the desired state. Otherwise, if set to
  // false, up and running replicas will not be moved.
  Rebalance bool `json:"rebalance,omitempty"`

  // Map from cluster name to preferences for that cluster. It is assumed that if a cluster
  // doesn't have a matching entry then it should not have a local replica set. A cluster matches
  // the "*" entry if there is no entry with its real name.
  Clusters map[string]ClusterReplicaSetPreferences `json:"clusters,omitempty"`
}

// Preferences regarding the number of replicas assigned to a cluster replica set within a federated replica set.
type ClusterReplicaSetPreferences struct {
  // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default.
  MinReplicas int64 `json:"minReplicas,omitempty"`

  // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default).
  MaxReplicas *int64 `json:"maxReplicas,omitempty"`

  // A number expressing the preference for putting an additional replica into this Local ReplicaSet. 0 by default.
  Weight int64 `json:"weight,omitempty"`
}
```
How this works in practice:
**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config:
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{Weight: 1},
  },
}
```
Example:
+ Clusters A,B,C, all have capacity.
Replica layout: A=16 B=17 C=17.
+ Clusters A,B,C; C has capacity for only 6 replicas.
Replica layout: A=22 B=22 C=6
+ Clusters A,B,C. B and C are offline:
Replica layout: A=50
**Scenario 2**. I want to have only 2 replicas in each of the clusters.
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{MaxReplicas: 2, Weight: 1},
  },
}
```
Or
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{MinReplicas: 2, Weight: 0},
  },
}
```
Or
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{MinReplicas: 2, MaxReplicas: 2},
  },
}
```
The global target is 50 replicas; however, if there are 3 clusters, only 6 replicas will be running.
**Scenario 3**. I want to have 20 replicas in each of 3 clusters.
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{MinReplicas: 20, Weight: 0},
  },
}
```
The global target is 50 replicas, but the clusters together ask for 60, so some clusters will have fewer replicas.
Replica layout: A=20 B=20 C=10.
**Scenario 4**. I want to have an equal number of replicas in clusters A, B and C, but don't put more than 20 replicas into cluster C.
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{Weight: 1},
    "C": ClusterReplicaSetPreferences{MaxReplicas: 20, Weight: 1},
  },
}
```
Example:
+ All have capacity.
Replica layout: A=16 B=17 C=17.
+ B is offline/has no capacity
Replica layout: A=30 B=0 C=20
+ A and B are offline:
Replica layout: C=20
**Scenario 5**. I want to run my application in cluster A; however, if there is trouble, the FRS can also use clusters B and C, equally.
```
FederatedReplicaSetPreferences{
  Clusters: map[string]ClusterReplicaSetPreferences{
    "A": ClusterReplicaSetPreferences{Weight: 1000000},
    "B": ClusterReplicaSetPreferences{Weight: 1},
    "C": ClusterReplicaSetPreferences{Weight: 1},
  },
}
```
Example:
+ All have capacity.
Replica layout: A=50 B=0 C=0.
+ A has capacity for only 40 replicas
Replica layout: A=40 B=5 C=5
**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters.
```
FederatedReplicaSetPreferences{
  Clusters: map[string]ClusterReplicaSetPreferences{
    "A": ClusterReplicaSetPreferences{Weight: 2},
    "B": ClusterReplicaSetPreferences{Weight: 1},
    "C": ClusterReplicaSetPreferences{Weight: 1},
  },
}
```
**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there
are already some replicas, please do not move them. Config:
```
FederatedReplicaSetPreferences{
  Rebalance: false,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{Weight: 1},
  },
}
```
Example:
+ Clusters A,B,C, all have capacity, but A already has 20 replicas
Replica layout: A=20 B=15 C=15.
+ Clusters A,B,C; C has capacity for only 6 replicas and A already has 20 replicas.
Replica layout: A=22 B=22 C=6
+ Clusters A,B,C; C has capacity for only 6 replicas and A already has 30 replicas.
Replica layout: A=30 B=14 C=6
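The preference snippets above are illustrative Go composite literals; on the wire the structure is serialized to JSON and attached to the FederatedReplicaSet as an annotation. Below is a minimal sketch of producing such an annotation for Scenario 4. The annotation key used here is an assumption of this sketch, not something fixed by this document.
```
package main

import (
  "encoding/json"
  "fmt"
)

// Types as defined earlier in this document.
type FederatedReplicaSetPreferences struct {
  Rebalance bool                                    `json:"rebalance,omitempty"`
  Clusters  map[string]ClusterReplicaSetPreferences `json:"clusters,omitempty"`
}

type ClusterReplicaSetPreferences struct {
  MinReplicas int64  `json:"minReplicas,omitempty"`
  MaxReplicas *int64 `json:"maxReplicas,omitempty"`
  Weight      int64  `json:"weight,omitempty"`
}

// prefsAnnotationKey is a hypothetical key name chosen for this example.
const prefsAnnotationKey = "federation.kubernetes.io/replica-set-preferences"

func main() {
  maxC := int64(20)
  prefs := FederatedReplicaSetPreferences{
    Rebalance: true,
    Clusters: map[string]ClusterReplicaSetPreferences{
      "*": {Weight: 1},
      "C": {MaxReplicas: &maxC, Weight: 1}, // Scenario 4: cap cluster C at 20 replicas.
    },
  }
  raw, err := json.Marshal(prefs)
  if err != nil {
    panic(err)
  }
  // The resulting JSON would be stored under the annotation key on the FRS object.
  fmt.Printf("%s: %s\n", prefsAnnotationKey, raw)
}
```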
## The Idea
A new federated controller, the Federated ReplicaSet Controller (FRSC),
will be created inside the federated controller manager. The key
elements of the idea are enumerated below:

+ [I0] It is considered OK to have a slightly higher number of replicas
  globally for some time.
+ [I1] FRSC starts an informer on FederatedReplicaSets that listens
  for FRS being created, updated or deleted. On each create/update the
  scheduling code is run to calculate where to put the replicas. The
  default behavior is to start the same number of replicas in each of
  the clusters. While creating Local ReplicaSets (LRS) the following
  errors/issues can occur:
  + [E1] The master rejects the LRS creation (for a known or unknown
    reason). In this case another attempt to create the LRS should be
    made in 1m or so. This action can be tied with [I5]. Until the LRS
    is created the situation is the same as [E5]. If this happens
    multiple times, all replicas due to this cluster should be moved
    elsewhere and later moved back once the LRS is created.
  + [E2] An LRS with the same name but a different configuration already
    exists. The LRS is then overwritten and an appropriate event is
    created to explain what happened. Pods under the control of the
    old LRS are left intact and the new LRS may adopt them if they
    match the selector.
  + [E3] The LRS is new but pods matching the selector already exist.
    The pods are adopted by the LRS (if not owned by some other
    RS). However, they may have a different image, configuration
    etc., just like with a regular LRS.
+ [I2] For each of the clusters FRSC starts a store and an informer on
  LRS that listen for status updates. These status changes are only
  interesting in case of trouble. Otherwise it is assumed that the LRS
  runs trouble-free and there is always the right number of pods
  created, though possibly not scheduled.
  + [E4] The LRS is manually deleted from the local cluster. In this
    case a new LRS should be created. It is the same case as [E1].
    Any pods that were left behind won't be killed and will be adopted
    after the LRS is recreated.
  + [E5] The LRS fails to create (not necessarily schedule) the desired
    number of pods due to master troubles, admission control etc.
    This should be considered the same situation as replicas being
    unable to schedule (see [I4]).
  + [E6] It is impossible to tell whether an informer lost its connection
    with a remote cluster or has some other synchronization problem, so
    this should be handled by the cluster liveness probe and cluster
    deletion [I6].
+ [I3] For each of the clusters FRSC starts a store and an informer to
  monitor whether the created pods are eventually scheduled and what
  the current number of correctly running, ready pods is. Errors:
  + [E7] It is impossible to tell whether an informer lost its connection
    with a remote cluster or has some other synchronization problem, so
    this should be handled by the cluster liveness probe and cluster
    deletion [I6].
+ [I4] It is assumed that an unscheduled pod is a normal situation
  and can last up to X min if there is huge traffic on the
  cluster. However, if the replicas are not scheduled within that time
  then FRSC should consider moving most of the unscheduled replicas
  elsewhere. For that purpose FRSC will maintain a data structure
  where for each FRS-controlled LRS we store the list of pods belonging
  to that LRS along with their current status and status change
  timestamp (see the sketch after this list).
+ [I5] If a new cluster is added to the federation then it doesn't
  have an LRS yet and the situation is equal to [E1]/[E4].
+ [I6] If a cluster is removed from the federation then the situation
  is equal to multiple [E4]. It is assumed that if the connection with
  a cluster is lost completely then the cluster is removed from the
  cluster list (or marked accordingly), so [E6] and [E7] don't need to
  be handled.
+ [I7] All ToBeChecked FRS are browsed every 1 min (configurable),
  checked against the current list of clusters, and all missing LRS
  are created. This is executed in combination with [I8].
+ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min
  (configurable) to check whether some replica move between clusters
  is needed or not.
  + FRSC never moves replicas to an LRS that has unscheduled or
    non-running pods, or pods that failed to be created.
  + When FRSC notices that a number of pods have been unscheduled,
    not running, or not even created in one LRS for more than Y minutes,
    it takes most of them away from that LRS, leaving a couple still
    waiting so that once they are scheduled FRSC will know that it is OK
    to put some more replicas into that cluster.
+ [I9] An FRS becomes ToBeChecked if:
  + it is newly created,
  + some replica set inside changed its status,
  + some pods inside a cluster changed their status,
  + some cluster is added or deleted.

An FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough).
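The per-LRS pod bookkeeping described in [I4] could be kept in a structure like the one sketched below. All identifiers here are illustrative assumptions rather than part of any agreed API; the point is that the last observed pod state and its transition time are enough for FRSC to tell how long replicas have been stuck unscheduled.
```
import (
  "sync"
  "time"
)

// PodStatus is the subset of pod state FRSC needs for rescheduling decisions.
type PodStatus struct {
  Scheduled  bool      // the pod has been bound to a node
  Running    bool      // the pod is running and ready
  LastChange time.Time // when Scheduled/Running last changed
}

// lrsKey identifies a Local ReplicaSet within the federation.
type lrsKey struct {
  Cluster   string // federation cluster name
  Namespace string
  Name      string
}

// PodTracker is updated by the per-cluster pod informers ([I3]) and read when
// deciding whether replicas should be moved ([I4]).
type PodTracker struct {
  mu   sync.Mutex
  pods map[lrsKey]map[string]PodStatus // LRS -> pod name -> last observed status
}

// UnscheduledFor returns how many pods of the given LRS have been waiting for a
// node longer than the given threshold (the "X min" from [I4]).
func (t *PodTracker) UnscheduledFor(key lrsKey, threshold time.Duration, now time.Time) int {
  t.mu.Lock()
  defer t.mu.Unlock()
  count := 0
  for _, s := range t.pods[key] {
    if !s.Scheduled && now.Sub(s.LastChange) > threshold {
      count++
    }
  }
  return count
}
```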
## (Re)scheduling algorithm
To calculate the (re)scheduling moves for a given FRS:
1. For each cluster FRSC calculates the number of replicas that are placed
(not necessarily up and running) in the cluster and the number of replicas that
failed to be scheduled. Cluster capacity is the difference between the
placed replicas and those that failed to be scheduled.
2. Order all clusters by their weight and a hash of the cluster name, so that every time
we process the same replica set we process the clusters in the same order.
Include the federated replica set name in the cluster name hash so that different
replica sets get slightly different orderings; that way not all replica sets of size 1
end up on the same cluster.
3. Assign the minimum preferred number of replicas to each of the clusters, as long
as there are enough replicas and capacity.
4. If rebalance = false, assign the previously present replicas to the clusters and
remember the number of extra replicas added this way (ER), again only as long as
there are enough replicas and capacity.
5. Distribute the remaining replicas with regard to weights and cluster capacity.
In multiple iterations calculate how many of the replicas should end up in each cluster.
For each of the clusters cap the number of assigned replicas by the maximum number of
replicas and the cluster capacity. If some extra replicas were added to a cluster in step
4, don't actually add new replicas there but balance them against the ER from step 4.
A sketch of this distribution is given after this list.
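The following sketch illustrates steps 2 and 5 (deterministic ordering plus iterative, weight-proportional, capacity-capped assignment). It deliberately omits the MinReplicas/MaxReplicas handling of step 3 and the rebalance bookkeeping of step 4, and every identifier in it is illustrative rather than taken from an actual implementation.
```
import (
  "hash/fnv"
  "sort"
)

// clusterInfo is an illustrative placeholder, not an API type.
type clusterInfo struct {
  Name     string
  Weight   int64
  Capacity int64 // capacity estimate from step 1; consumed as replicas are assigned
}

// orderClusters sorts clusters by weight (descending) and then by a hash that mixes
// in the FRS name, so different replica sets see slightly different orderings (step 2).
func orderClusters(frsName string, clusters []clusterInfo) {
  hash := func(cluster string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(frsName + "/" + cluster))
    return h.Sum32()
  }
  sort.Slice(clusters, func(i, j int) bool {
    if clusters[i].Weight != clusters[j].Weight {
      return clusters[i].Weight > clusters[j].Weight
    }
    return hash(clusters[i].Name) < hash(clusters[j].Name)
  })
}

// distribute spreads replicas across clusters proportionally to their weights without
// exceeding any cluster's capacity (step 5).
func distribute(frsName string, replicas int64, clusters []clusterInfo) map[string]int64 {
  orderClusters(frsName, clusters)
  result := make(map[string]int64)
  for replicas > 0 {
    var totalWeight int64
    for _, c := range clusters {
      if c.Capacity > 0 {
        totalWeight += c.Weight
      }
    }
    if totalWeight == 0 {
      break // no cluster with remaining capacity wants more replicas
    }
    var assigned int64
    for i := range clusters {
      c := &clusters[i]
      if c.Capacity == 0 || c.Weight == 0 {
        continue
      }
      share := replicas * c.Weight / totalWeight
      if share == 0 {
        share = 1 // always make progress, one replica at a time if needed
      }
      if share > c.Capacity {
        share = c.Capacity
      }
      if share > replicas-assigned {
        share = replicas - assigned
      }
      result[c.Name] += share
      c.Capacity -= share
      assigned += share
      if assigned == replicas {
        break
      }
    }
    if assigned == 0 {
      break
    }
    replicas -= assigned
  }
  return result
}
```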
## Goroutines layout
+ [GR1] Involved in the FRS informer (see [I1]). Whenever an FRS is
  created or updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE
  with delay 0.
+ [GR2_1...GR2_N] Involved in the informers/stores on LRS (see [I2]).
  On all changes the FRS is put on FRS_TO_CHECK_QUEUE with delay 1 min.
+ [GR3_1...GR3_N] Involved in the informers/stores on Pods (see [I3]
  and [I4]). They maintain the status store so that for each of the LRS
  we know in O(1) time the number of pods that are actually running and
  ready. They also put the corresponding FRS on FRS_TO_CHECK_QUEUE with
  delay 1 min.
+ [GR4] Involved in the cluster informer (see [I5] and [I6]). It puts
  all FRS on FRS_TO_CHECK_QUEUE with delay 0.
+ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on
  FRS_CHANNEL after the given delay (and remove them from
  FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to
  FRS_TO_CHECK_QUEUE the delays are compared and updated so that the
  shorter delay is used (see the sketch after this list).
+ [GR6] Contains a select loop that listens on FRS_CHANNEL. Whenever
  an FRS is received it is put on a work queue. The work queue has no
  delay and makes sure that a single replica set is processed by only
  one goroutine.
+ [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck on
  the FRS. Multiple replica sets can be processed in parallel, but two
  goroutines cannot process the same FRS at the same time.
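Below is a plain-Go sketch of the FRS_TO_CHECK_QUEUE / FRS_CHANNEL plumbing used by GR5_* and GR6. A real controller would more likely reuse an existing delaying work queue; the types, the one-second tick, and all names here are assumptions made purely for illustration.
```
import (
  "sync"
  "time"
)

// frsRef identifies a FederatedReplicaSet by namespace and name.
type frsRef struct{ Namespace, Name string }

// delayQueue models FRS_TO_CHECK_QUEUE: an FRS becomes due after its delay, and
// re-adding an already queued FRS keeps whichever deadline is earlier (GR5_*).
type delayQueue struct {
  mu    sync.Mutex
  dueAt map[frsRef]time.Time
}

func newDelayQueue() *delayQueue {
  return &delayQueue{dueAt: make(map[frsRef]time.Time)}
}

// Add enqueues ref with the given delay (GR1, GR2_*, GR3_* and GR4 all call this).
func (q *delayQueue) Add(ref frsRef, delay time.Duration) {
  q.mu.Lock()
  defer q.mu.Unlock()
  due := time.Now().Add(delay)
  if existing, ok := q.dueAt[ref]; !ok || due.Before(existing) {
    q.dueAt[ref] = due
  }
}

// Run periodically moves due FRSs onto out (FRS_CHANNEL); GR6 reads from out and
// hands each FRS to the work queue that guarantees single-goroutine processing.
func (q *delayQueue) Run(out chan<- frsRef, stop <-chan struct{}) {
  ticker := time.NewTicker(time.Second)
  defer ticker.Stop()
  for {
    select {
    case <-stop:
      return
    case now := <-ticker.C:
      q.mu.Lock()
      var due []frsRef
      for ref, t := range q.dueAt {
        if !t.After(now) {
          due = append(due, ref)
          delete(q.dueAt, ref)
        }
      }
      q.mu.Unlock()
      for _, ref := range due {
        out <- ref
      }
    }
  }
}
```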
## Func DoFrsCheck
The function performs [I7] and [I8]. It is assumed that it runs on a
single thread/goroutine per FRS, so the same FRS is never checked and
evaluated on many goroutines at once (however, if needed, the function
can be parallelized across different FRS). It takes data only from the
stores maintained by GR2_* and GR3_*. External communication is only
required to:

+ Create an LRS. If an LRS doesn't exist it is created after the
  rescheduling, when we know how many replicas it should have.
+ Update LRS replica targets.

If the FRS is not in the desired state then it is put on
FRS_TO_CHECK_QUEUE with a delay of 1 min (possibly increasing).
## Monitoring and status reporting
FRSC should expose a number of metrics from its run, like:

+ FRSC -> LRS communication latency
+ Total time spent in the various parts of DoFrsCheck

FRSC should also expose the status of an FRS as an annotation on the FRS and
as events.
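For illustration only, the two example metrics above could be exposed roughly as follows. The metric names, labels, and the choice of the Prometheus Go client are assumptions of this sketch, not decisions made by this document.
```
import (
  "net/http"

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names below are invented for this example.
var (
  lrsRequestLatency = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
      Name: "frsc_lrs_request_latency_seconds",
      Help: "Latency of FRSC -> member cluster LRS create/update calls.",
    },
    []string{"cluster"},
  )
  frsCheckDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
      Name: "frsc_do_frs_check_duration_seconds",
      Help: "Time spent in the individual phases of DoFrsCheck.",
    },
    []string{"phase"},
  )
)

// serveMetrics registers the collectors and serves them on /metrics.
func serveMetrics(addr string) error {
  prometheus.MustRegister(lrsRequestLatency, frsCheckDuration)
  http.Handle("/metrics", promhttp.Handler())
  return http.ListenAndServe(addr, nil)
}
```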
## Workflow
Here is the sequence of tasks that need to be done in order for a
typical FRS to be split into a number of LRSs and created in
the underlying federated clusters.

Note a: the reason this workflow is helpful at this phase is that we
can create PRs for every one or two steps and start the development
accordingly.

Note b: we assume that the federation is already in place and the
federated clusters have been added to the federation.
Step 1. The client sends an RS create request to the
federation-apiserver.

Step 2. The federation-apiserver persists an FRS into the federation etcd.

Note c: the federation-apiserver populates the clusterid field in the FRS
before persisting it into the federation etcd.

Step 3. The federation-level "informer" in FRSC watches the federation
etcd for new/modified FRSs with an empty clusterid or a clusterid equal
to the federation ID, and when one is detected it calls the scheduling code.

Step 4. The scheduling code decides how many replicas each target cluster
should run and the corresponding LRSs are produced for those clusters.

Note d: the scheduler populates the clusterid field in each LRS with the
ID of its target cluster.

Note e: at this point let us assume that it only does an even
distribution, i.e., equal weights for all of the underlying clusters.

Step 5. As soon as the scheduler function returns control to FRSC,
FRSC starts a number of cluster-level "informer"s, one per target
cluster, to watch changes in every target cluster's etcd regarding the
posted LRSs; if any violation of the scheduled number of replicas is
detected, the scheduling code is called again for rescheduling.