# Federated ReplicaSets: Requirements & Design Document
This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion.
Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com)
Based on discussions with
Quinton Hoole [quinton@google.com](mailto:quinton@google.com) and Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com)
## Overview
### Summary & Vision
When running a global application on a federation of Kubernetes
clusters, the owner currently has to start it in multiple clusters and
ensure that there are enough application replicas running both
locally in each of the clusters (so that, for example, users are
handled by a nearby cluster, with low latency) and globally (so that
there is always enough capacity to handle all traffic). If one of the
clusters has issues or doesn't have enough capacity to run the given set of
replicas, the replicas should be automatically moved to some other
cluster to keep the application responsive.
In single-cluster Kubernetes there is the concept of a ReplicaSet that
manages the replicas locally. We want to expand this concept to the
federation level.
### Goals
+ Win large enterprise customers who want to easily run applications
across multiple clusters
+ Create a reference controller implementation to facilitate bringing
other Kubernetes concepts to Federated Kubernetes.
## Glossary
Federation Cluster - a cluster that is a member of the federation.

Local ReplicaSet (LRS) - a ReplicaSet defined and running in a cluster
that is a member of the federation.

Federated ReplicaSet (FRS) - a ReplicaSet defined and running inside the Federated K8S server.

Federated ReplicaSet Controller (FRSC) - a controller running inside
the Federated K8S server that controls FRS.
## User Experience
### Critical User Journeys
+ [CUJ1] User wants to create a ReplicaSet in each of the federation
  clusters. They create a definition of a Federated ReplicaSet on the
  federated master and (local) ReplicaSets are automatically created
  in each of the federation clusters. The number of replicas in each
  of the Local ReplicaSets is (perhaps indirectly) configurable by
  the user.
+ [CUJ2] When the current number of replicas in a cluster drops below
  the desired number and new replicas cannot be scheduled there, they
  should be started in some other cluster.
### Features Enabling Critical User Journeys
Feature #1 -> CUJ1: A component which looks for newly created Federated ReplicaSets and
creates the appropriate Local ReplicaSet definitions in the federated
clusters.

Feature #2 -> CUJ2: A component that checks how many replicas are actually running in each
of the subclusters and whether that number matches the
FederatedReplicaSet preferences (by default spread replicas evenly
across the clusters, but custom preferences are allowed - see
below). If it doesn't, and the situation is unlikely to improve soon,
then the replicas should be moved to other subclusters.
### API and CLI
All interaction with a FederatedReplicaSet will be done by issuing
kubectl commands pointed at the Federated Master API Server. All the
commands would behave in a similar way as on a regular master;
however, in the next versions (1.5+) some of the commands may give
slightly different output. For example, kubectl describe on a federated
replica set should also give some information about the subclusters.
Moreover, for safety, some defaults will be different. For example, for
kubectl delete federatedreplicaset, cascade will be set to false.

FederatedReplicaSet will have the same object schema as the local ReplicaSet
(although it will be accessible in a different part of the
API). Scheduling preferences (how many replicas in which cluster) will
be passed as annotations.
### FederatedReplicaSet preferences
The preferences are expressed by the following structure, passed as
serialized JSON inside an annotation.
```
type FederatedReplicaSetPreferences struct {
  // If set to true then already scheduled and running replicas may be moved to other clusters
  // in order to bring the cluster replica sets towards the desired state. Otherwise, if set to
  // false, up and running replicas will not be moved.
  Rebalance bool `json:"rebalance,omitempty"`

  // Map from cluster name to preferences for that cluster. It is assumed that if a cluster
  // doesn't have a matching entry then it should not have a local replica set. A cluster matches
  // the "*" entry if there is no entry with its real name.
  Clusters map[string]ClusterReplicaSetPreferences `json:"clusters,omitempty"`
}

// Preferences regarding the number of replicas assigned to a cluster replica set within a federated replica set.
type ClusterReplicaSetPreferences struct {
  // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default.
  MinReplicas int64 `json:"minReplicas,omitempty"`

  // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default).
  MaxReplicas *int64 `json:"maxReplicas,omitempty"`

  // A number expressing the preference for putting an additional replica into this Local ReplicaSet. 0 by default.
  Weight int64 `json:"weight,omitempty"`
}
```
How this works in practice:
**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config:
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{Weight: 1},
  },
}
```
Example:
+ Clusters A,B,C, all have capacity.
Replica layout: A=16 B=17 C=17.
+ Clusters A,B,C; C has capacity for only 6 replicas.
Replica layout: A=22 B=22 C=6
+ Clusters A,B,C. B and C are offline:
Replica layout: A=50
**Scenario 2**. I want to have only 2 replicas in each of the clusters.
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{MaxReplicas: 2, Weight: 1},
  },
}
```
Or
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{MinReplicas: 2, Weight: 0},
  },
}
```
Or
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{MinReplicas: 2, MaxReplicas: 2},
  },
}
```
The global target is 50 replicas; however, if there are 3 clusters, only 6 replicas will be running.
**Scenario 3**. I want to have 20 replicas in each of 3 clusters.
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{MinReplicas: 20, Weight: 0},
  },
}
```
The global target is 50 replicas, but the clusters together ask for 60, so some clusters will have fewer replicas.
Replica layout: A=20 B=20 C=10.
**Scenario 4**. I want to have an equal number of replicas in clusters A, B and C, but don't put more than 20 replicas into cluster C.
```
FederatedReplicaSetPreferences{
  Rebalance: true,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{Weight: 1},
    "C": ClusterReplicaSetPreferences{MaxReplicas: 20, Weight: 1},
  },
}
```
Example:
+ All have capacity.
Replica layout: A=16 B=17 C=17.
+ B is offline/has no capacity
Replica layout: A=30 B=0 C=20
+ A and B are offline:
Replica layout: C=20
**Scenario 5**. I want to run my application in cluster A; however, if there is trouble, the FRS can also use clusters B and C, equally.
```
FederatedReplicaSetPreferences{
  Clusters: map[string]ClusterReplicaSetPreferences{
    "A": ClusterReplicaSetPreferences{Weight: 1000000},
    "B": ClusterReplicaSetPreferences{Weight: 1},
    "C": ClusterReplicaSetPreferences{Weight: 1},
  },
}
```
Example:
+ All have capacity.
Replica layout: A=50 B=0 C=0.
+ A has capacity for only 40 replicas
Replica layout: A=40 B=5 C=5
**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters.
```
FederatedReplicaSetPreferences{
  Clusters: map[string]ClusterReplicaSetPreferences{
    "A": ClusterReplicaSetPreferences{Weight: 2},
    "B": ClusterReplicaSetPreferences{Weight: 1},
    "C": ClusterReplicaSetPreferences{Weight: 1},
  },
}
```
**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there
are already some replicas, please do not move them. Config:
```
FederatedReplicaSetPreferences{
  Rebalance: false,
  Clusters: map[string]ClusterReplicaSetPreferences{
    "*": ClusterReplicaSetPreferences{Weight: 1},
  },
}
```
Example:
+ Clusters A,B,C, all have capacity, but A already has 20 replicas
Replica layout: A=20 B=15 C=15.
+ Clusters A,B,C; C has capacity for only 6 replicas and A already has 20 replicas.
Replica layout: A=22 B=22 C=6
+ Clusters A,B,C; C has capacity for only 6 replicas and A already has 30 replicas.
Replica layout: A=30 B=14 C=6
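The preference snippets above are illustrative Go composite literals; on the wire the structure is serialized to JSON and attached to the FederatedReplicaSet as an annotation. Below is a minimal sketch of producing such an annotation for Scenario 4. The annotation key used here is an assumption of this sketch, not something fixed by this document.
```
package main

import (
  "encoding/json"
  "fmt"
)

// Types as defined earlier in this document.
type FederatedReplicaSetPreferences struct {
  Rebalance bool                                    `json:"rebalance,omitempty"`
  Clusters  map[string]ClusterReplicaSetPreferences `json:"clusters,omitempty"`
}

type ClusterReplicaSetPreferences struct {
  MinReplicas int64  `json:"minReplicas,omitempty"`
  MaxReplicas *int64 `json:"maxReplicas,omitempty"`
  Weight      int64  `json:"weight,omitempty"`
}

// prefsAnnotationKey is a hypothetical key name chosen for this example.
const prefsAnnotationKey = "federation.kubernetes.io/replica-set-preferences"

func main() {
  maxC := int64(20)
  prefs := FederatedReplicaSetPreferences{
    Rebalance: true,
    Clusters: map[string]ClusterReplicaSetPreferences{
      "*": {Weight: 1},
      "C": {MaxReplicas: &maxC, Weight: 1}, // Scenario 4: cap cluster C at 20 replicas.
    },
  }
  raw, err := json.Marshal(prefs)
  if err != nil {
    panic(err)
  }
  // The resulting JSON would be stored under the annotation key on the FRS object.
  fmt.Printf("%s: %s\n", prefsAnnotationKey, raw)
}
```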
## The Idea
A new federated controller, the Federated ReplicaSet Controller (FRSC),
will be created inside the federated controller manager. The key
elements of the idea are enumerated below:

+ [I0] It is considered OK to have a slightly higher number of replicas
  globally for some time.
+ [I1] FRSC starts an informer on FederatedReplicaSets that listens
  for FRS being created, updated or deleted. On each create/update the
  scheduling code is run to calculate where to put the replicas. The
  default behavior is to start the same number of replicas in each of
  the clusters. While creating Local ReplicaSets (LRS) the following
  errors/issues can occur:
  + [E1] The master rejects the LRS creation (for a known or unknown
    reason). In this case another attempt to create the LRS should be
    made in 1m or so. This action can be tied with [I5]. Until the LRS
    is created the situation is the same as [E5]. If this happens
    multiple times, all replicas due to this cluster should be moved
    elsewhere and later moved back once the LRS is created.
  + [E2] An LRS with the same name but a different configuration already
    exists. The LRS is then overwritten and an appropriate event is
    created to explain what happened. Pods under the control of the
    old LRS are left intact and the new LRS may adopt them if they
    match the selector.
  + [E3] The LRS is new but pods matching the selector already exist.
    The pods are adopted by the LRS (if not owned by some other
    RS). However, they may have a different image, configuration
    etc., just like with a regular LRS.
+ [I2] For each of the clusters FRSC starts a store and an informer on
  LRS that listen for status updates. These status changes are only
  interesting in case of trouble. Otherwise it is assumed that the LRS
  runs trouble-free and there is always the right number of pods
  created, though possibly not scheduled.
  + [E4] The LRS is manually deleted from the local cluster. In this
    case a new LRS should be created. It is the same case as [E1].
    Any pods that were left behind won't be killed and will be adopted
    after the LRS is recreated.
  + [E5] The LRS fails to create (not necessarily schedule) the desired
    number of pods due to master troubles, admission control etc.
    This should be considered the same situation as replicas being
    unable to schedule (see [I4]).
  + [E6] It is impossible to tell whether an informer lost its connection
    with a remote cluster or has some other synchronization problem, so
    this should be handled by the cluster liveness probe and cluster
    deletion [I6].
+ [I3] For each of the clusters FRSC starts a store and an informer to
  monitor whether the created pods are eventually scheduled and what
  the current number of correctly running, ready pods is. Errors:
  + [E7] It is impossible to tell whether an informer lost its connection
    with a remote cluster or has some other synchronization problem, so
    this should be handled by the cluster liveness probe and cluster
    deletion [I6].
+ [I4] It is assumed that an unscheduled pod is a normal situation
  and can last up to X min if there is huge traffic on the
  cluster. However, if the replicas are not scheduled within that time
  then FRSC should consider moving most of the unscheduled replicas
  elsewhere. For that purpose FRSC will maintain a data structure
  where for each FRS-controlled LRS we store the list of pods belonging
  to that LRS along with their current status and status change
  timestamp (see the sketch after this list).
+ [I5] If a new cluster is added to the federation then it doesn't
  have an LRS yet and the situation is equal to [E1]/[E4].
+ [I6] If a cluster is removed from the federation then the situation
  is equal to multiple [E4]. It is assumed that if the connection with
  a cluster is lost completely then the cluster is removed from the
  cluster list (or marked accordingly), so [E6] and [E7] don't need to
  be handled.
+ [I7] All ToBeChecked FRS are browsed every 1 min (configurable),
  checked against the current list of clusters, and all missing LRS
  are created. This is executed in combination with [I8].
+ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min
  (configurable) to check whether some replica move between clusters
  is needed or not.
  + FRSC never moves replicas to an LRS that has unscheduled or
    non-running pods, or pods that failed to be created.
  + When FRSC notices that a number of pods have been unscheduled,
    not running, or not even created in one LRS for more than Y minutes,
    it takes most of them away from that LRS, leaving a couple still
    waiting so that once they are scheduled FRSC will know that it is OK
    to put some more replicas into that cluster.
+ [I9] An FRS becomes ToBeChecked if:
  + it is newly created,
  + some replica set inside changed its status,
  + some pods inside a cluster changed their status,
  + some cluster is added or deleted.

An FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough).
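The per-LRS pod bookkeeping described in [I4] could be kept in a structure like the one sketched below. All identifiers here are illustrative assumptions rather than part of any agreed API; the point is that the last observed pod state and its transition time are enough for FRSC to tell how long replicas have been stuck unscheduled.
```
import (
  "sync"
  "time"
)

// PodStatus is the subset of pod state FRSC needs for rescheduling decisions.
type PodStatus struct {
  Scheduled  bool      // the pod has been bound to a node
  Running    bool      // the pod is running and ready
  LastChange time.Time // when Scheduled/Running last changed
}

// lrsKey identifies a Local ReplicaSet within the federation.
type lrsKey struct {
  Cluster   string // federation cluster name
  Namespace string
  Name      string
}

// PodTracker is updated by the per-cluster pod informers ([I3]) and read when
// deciding whether replicas should be moved ([I4]).
type PodTracker struct {
  mu   sync.Mutex
  pods map[lrsKey]map[string]PodStatus // LRS -> pod name -> last observed status
}

// UnscheduledFor returns how many pods of the given LRS have been waiting for a
// node longer than the given threshold (the "X min" from [I4]).
func (t *PodTracker) UnscheduledFor(key lrsKey, threshold time.Duration, now time.Time) int {
  t.mu.Lock()
  defer t.mu.Unlock()
  count := 0
  for _, s := range t.pods[key] {
    if !s.Scheduled && now.Sub(s.LastChange) > threshold {
      count++
    }
  }
  return count
}
```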
## (Re)scheduling algorithm
To calculate the (re)scheduling moves for a given FRS:
1. For each cluster FRSC calculates the number of replicas that are placed
(not necessarily up and running) in the cluster and the number of replicas that
failed to be scheduled. Cluster capacity is the difference between the
placed replicas and those that failed to be scheduled.
2. Order all clusters by their weight and a hash of the cluster name, so that every time
we process the same replica set we process the clusters in the same order.
Include the federated replica set name in the cluster name hash so that different
replica sets get slightly different orderings; that way not all replica sets of size 1
end up on the same cluster.
3. Assign the minimum preferred number of replicas to each of the clusters, as long
as there are enough replicas and capacity.
4. If rebalance = false, assign the previously present replicas to the clusters and
remember the number of extra replicas added this way (ER), again only as long as
there are enough replicas and capacity.
5. Distribute the remaining replicas with regard to weights and cluster capacity.
In multiple iterations calculate how many of the replicas should end up in each cluster.
For each of the clusters cap the number of assigned replicas by the maximum number of
replicas and the cluster capacity. If some extra replicas were added to a cluster in step
4, don't actually add new replicas there but balance them against the ER from step 4.
A sketch of this distribution is given after this list.
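The following sketch illustrates steps 2 and 5 (deterministic ordering plus iterative, weight-proportional, capacity-capped assignment). It deliberately omits the MinReplicas/MaxReplicas handling of step 3 and the rebalance bookkeeping of step 4, and every identifier in it is illustrative rather than taken from an actual implementation.
```
import (
  "hash/fnv"
  "sort"
)

// clusterInfo is an illustrative placeholder, not an API type.
type clusterInfo struct {
  Name     string
  Weight   int64
  Capacity int64 // capacity estimate from step 1; consumed as replicas are assigned
}

// orderClusters sorts clusters by weight (descending) and then by a hash that mixes
// in the FRS name, so different replica sets see slightly different orderings (step 2).
func orderClusters(frsName string, clusters []clusterInfo) {
  hash := func(cluster string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(frsName + "/" + cluster))
    return h.Sum32()
  }
  sort.Slice(clusters, func(i, j int) bool {
    if clusters[i].Weight != clusters[j].Weight {
      return clusters[i].Weight > clusters[j].Weight
    }
    return hash(clusters[i].Name) < hash(clusters[j].Name)
  })
}

// distribute spreads replicas across clusters proportionally to their weights without
// exceeding any cluster's capacity (step 5).
func distribute(frsName string, replicas int64, clusters []clusterInfo) map[string]int64 {
  orderClusters(frsName, clusters)
  result := make(map[string]int64)
  for replicas > 0 {
    var totalWeight int64
    for _, c := range clusters {
      if c.Capacity > 0 {
        totalWeight += c.Weight
      }
    }
    if totalWeight == 0 {
      break // no cluster with remaining capacity wants more replicas
    }
    var assigned int64
    for i := range clusters {
      c := &clusters[i]
      if c.Capacity == 0 || c.Weight == 0 {
        continue
      }
      share := replicas * c.Weight / totalWeight
      if share == 0 {
        share = 1 // always make progress, one replica at a time if needed
      }
      if share > c.Capacity {
        share = c.Capacity
      }
      if share > replicas-assigned {
        share = replicas - assigned
      }
      result[c.Name] += share
      c.Capacity -= share
      assigned += share
      if assigned == replicas {
        break
      }
    }
    if assigned == 0 {
      break
    }
    replicas -= assigned
  }
  return result
}
```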
## Goroutines layout
+ [GR1] Involved in the FRS informer (see [I1]). Whenever an FRS is
  created or updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE
  with delay 0.
+ [GR2_1...GR2_N] Involved in the informers/stores on LRS (see [I2]).
  On all changes the FRS is put on FRS_TO_CHECK_QUEUE with delay 1 min.
+ [GR3_1...GR3_N] Involved in the informers/stores on Pods (see [I3]
  and [I4]). They maintain the status store so that for each of the LRS
  we know in O(1) time the number of pods that are actually running and
  ready. They also put the corresponding FRS on FRS_TO_CHECK_QUEUE with
  delay 1 min.
+ [GR4] Involved in the cluster informer (see [I5] and [I6]). It puts
  all FRS on FRS_TO_CHECK_QUEUE with delay 0.
+ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on
  FRS_CHANNEL after the given delay (and remove them from
  FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to
  FRS_TO_CHECK_QUEUE the delays are compared and updated so that the
  shorter delay is used (see the sketch after this list).
+ [GR6] Contains a select loop that listens on FRS_CHANNEL. Whenever
  an FRS is received it is put on a work queue. The work queue has no
  delay and makes sure that a single replica set is processed by only
  one goroutine.
+ [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck on
  the FRS. Multiple replica sets can be processed in parallel, but two
  goroutines cannot process the same FRS at the same time.
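Below is a plain-Go sketch of the FRS_TO_CHECK_QUEUE / FRS_CHANNEL plumbing used by GR5_* and GR6. A real controller would more likely reuse an existing delaying work queue; the types, the one-second tick, and all names here are assumptions made purely for illustration.
```
import (
  "sync"
  "time"
)

// frsRef identifies a FederatedReplicaSet by namespace and name.
type frsRef struct{ Namespace, Name string }

// delayQueue models FRS_TO_CHECK_QUEUE: an FRS becomes due after its delay, and
// re-adding an already queued FRS keeps whichever deadline is earlier (GR5_*).
type delayQueue struct {
  mu    sync.Mutex
  dueAt map[frsRef]time.Time
}

func newDelayQueue() *delayQueue {
  return &delayQueue{dueAt: make(map[frsRef]time.Time)}
}

// Add enqueues ref with the given delay (GR1, GR2_*, GR3_* and GR4 all call this).
func (q *delayQueue) Add(ref frsRef, delay time.Duration) {
  q.mu.Lock()
  defer q.mu.Unlock()
  due := time.Now().Add(delay)
  if existing, ok := q.dueAt[ref]; !ok || due.Before(existing) {
    q.dueAt[ref] = due
  }
}

// Run periodically moves due FRSs onto out (FRS_CHANNEL); GR6 reads from out and
// hands each FRS to the work queue that guarantees single-goroutine processing.
func (q *delayQueue) Run(out chan<- frsRef, stop <-chan struct{}) {
  ticker := time.NewTicker(time.Second)
  defer ticker.Stop()
  for {
    select {
    case <-stop:
      return
    case now := <-ticker.C:
      q.mu.Lock()
      var due []frsRef
      for ref, t := range q.dueAt {
        if !t.After(now) {
          due = append(due, ref)
          delete(q.dueAt, ref)
        }
      }
      q.mu.Unlock()
      for _, ref := range due {
        out <- ref
      }
    }
  }
}
```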
## Func DoFrsCheck
The function performs [I7] and [I8]. It is assumed that it runs on a
single thread/goroutine per FRS, so the same FRS is never checked and
evaluated on many goroutines at once (however, if needed, the function
can be parallelized across different FRS). It takes data only from the
stores maintained by GR2_* and GR3_*. External communication is only
required to:

+ Create an LRS. If an LRS doesn't exist it is created after the
  rescheduling, when we know how many replicas it should have.
+ Update LRS replica targets.

If the FRS is not in the desired state then it is put on
FRS_TO_CHECK_QUEUE with a delay of 1 min (possibly increasing).
## Monitoring and status reporting
FRSC should expose a number of metrics from its run, like:

+ FRSC -> LRS communication latency
+ Total time spent in the various parts of DoFrsCheck

FRSC should also expose the status of an FRS as an annotation on the FRS and
as events.
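For illustration only, the two example metrics above could be exposed roughly as follows. The metric names, labels, and the choice of the Prometheus Go client are assumptions of this sketch, not decisions made by this document.
```
import (
  "net/http"

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names below are invented for this example.
var (
  lrsRequestLatency = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
      Name: "frsc_lrs_request_latency_seconds",
      Help: "Latency of FRSC -> member cluster LRS create/update calls.",
    },
    []string{"cluster"},
  )
  frsCheckDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
      Name: "frsc_do_frs_check_duration_seconds",
      Help: "Time spent in the individual phases of DoFrsCheck.",
    },
    []string{"phase"},
  )
)

// serveMetrics registers the collectors and serves them on /metrics.
func serveMetrics(addr string) error {
  prometheus.MustRegister(lrsRequestLatency, frsCheckDuration)
  http.Handle("/metrics", promhttp.Handler())
  return http.ListenAndServe(addr, nil)
}
```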
## Workflow
Here is the sequence of tasks that need to be done in order for a
typical FRS to be split into a number of LRSs and created in
the underlying federated clusters.

Note a: the reason this workflow is helpful at this phase is that we
can create PRs for every one or two steps and start the development
accordingly.

Note b: we assume that the federation is already in place and the
federated clusters have been added to the federation.
Step 1. The client sends an RS create request to the
federation-apiserver.

Step 2. The federation-apiserver persists an FRS into the federation etcd.

Note c: the federation-apiserver populates the clusterid field in the FRS
before persisting it into the federation etcd.

Step 3. The federation-level "informer" in FRSC watches the federation
etcd for new/modified FRSs with an empty clusterid or a clusterid equal
to the federation ID, and when one is detected it calls the scheduling code.

Step 4. The scheduling code decides how many replicas each target cluster
should run and the corresponding LRSs are produced for those clusters.

Note d: the scheduler populates the clusterid field in each LRS with the
ID of its target cluster.

Note e: at this point let us assume that it only does an even
distribution, i.e., equal weights for all of the underlying clusters.

Step 5. As soon as the scheduler function returns control to FRSC,
FRSC starts a number of cluster-level "informer"s, one per target
cluster, to watch changes in every target cluster's etcd regarding the
posted LRSs; if any violation of the scheduled number of replicas is
detected, the scheduling code is called again for rescheduling.