diff --git a/docs/design/federated-replicasets.md b/docs/design/federated-replicasets.md
new file mode 100644
index 00000000000..16db0379014
--- /dev/null
+++ b/docs/design/federated-replicasets.md
@@ -0,0 +1,542 @@
# Federated ReplicaSets

# Requirements & Design Document

This document is a markdown version converted from a working
[Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing).
Please refer to the original for extended commentary and discussion.

Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com)
Based on discussions with
Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com)

## Overview

### Summary & Vision

When running a global application on a federation of Kubernetes
clusters, the owner currently has to start it in multiple clusters and
ensure that enough application replicas are running both locally in
each of the clusters (so that, for example, users are handled by a
nearby cluster, with low latency) and globally (so that there is
always enough capacity to handle all traffic). If one of the clusters
has issues or doesn't have enough capacity to run the given set of
replicas, the replicas should be automatically moved to some other
cluster to keep the application responsive.

In single-cluster Kubernetes there is a concept of a ReplicaSet that
manages the replicas locally. We want to expand this concept to the
federation level.

### Goals

+ Win large enterprise customers who want to easily run applications
  across multiple clusters.
+ Create a reference controller implementation to facilitate bringing
  other Kubernetes concepts to Federated Kubernetes.

## Glossary

Federation Cluster - a cluster that is a member of the federation.

Local ReplicaSet (LRS) - a ReplicaSet defined and running on a cluster
that is a member of the federation.

Federated ReplicaSet (FRS) - a ReplicaSet defined and running inside of
the Federated K8S server.

Federated ReplicaSet Controller (FRSC) - a controller running inside
of the Federated K8S server that controls FRS.

## User Experience

### Critical User Journeys

+ [CUJ1] User wants to create a ReplicaSet in each of the federation
  clusters. They create a definition of a federated ReplicaSet on the
  federated master and (local) ReplicaSets are automatically created
  in each of the federation clusters. The number of replicas in each
  of the Local ReplicaSets is (perhaps indirectly) configurable by
  the user.
+ [CUJ2] When the current number of replicas in a cluster drops below
  the desired number and new replicas cannot be scheduled there, they
  should be started in some other cluster.

### Features Enabling Critical User Journeys

Feature #1 -> CUJ1:
A component which looks for newly created Federated ReplicaSets and
creates the appropriate Local ReplicaSet definitions in the federated
clusters.

Feature #2 -> CUJ2:
A component that checks how many replicas are actually running in each
of the subclusters and whether the number matches the
FederatedReplicaSet preferences (by default spread replicas evenly
across the clusters, but custom preferences are allowed - see
below). If it doesn't, and the situation is unlikely to improve soon,
then the replicas should be moved to other subclusters.

### API and CLI

All interaction with FederatedReplicaSet will be done by issuing
kubectl commands pointed at the Federated Master API Server. All the
commands will behave in a similar way as on the regular master;
however, in the next versions (1.5+) some of the commands may give
slightly different output. For example, kubectl describe on a federated
replica set should also give some information about the subclusters.

Moreover, for safety, some defaults will be different. For example, for
kubectl delete federatedreplicaset, cascade will be set to false.

FederatedReplicaSet will use the same object as the local ReplicaSet
(although it will be accessible in a different part of the
API). Scheduling preferences (how many replicas in which cluster) will
be passed as annotations, as sketched below.
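Since the preferences travel as serialized JSON in an annotation on an
otherwise ordinary ReplicaSet object, a minimal, self-contained Go sketch of
how a client might attach them is shown below. The types mirror the ones
defined in the next section; the annotation key used here is illustrative,
not a confirmed API name:

```
package main

import (
	"encoding/json"
	"fmt"
)

// Mirrors the types defined in the "FederatedReplicaSet preferences"
// section below.
type FederatedReplicaSetPreferences struct {
	Rebalance bool                                    `json:"rebalance,omitempty"`
	Clusters  map[string]ClusterReplicaSetPreferences `json:"clusters"`
}

type ClusterReplicaSetPreferences struct {
	MinReplicas int64  `json:"minReplicas,omitempty"`
	MaxReplicas *int64 `json:"maxReplicas,omitempty"`
	Weight      int64  `json:"weight"`
}

func main() {
	// Spread replicas evenly across all clusters (Scenario 1 below).
	prefs := FederatedReplicaSetPreferences{
		Rebalance: true,
		Clusters: map[string]ClusterReplicaSetPreferences{
			"*": {Weight: 1},
		},
	}
	data, err := json.Marshal(prefs)
	if err != nil {
		panic(err)
	}
	// The serialized preferences would be stored under an annotation key
	// on the FederatedReplicaSet; the exact key is an assumption here.
	annotations := map[string]string{
		"federation.kubernetes.io/replica-set-preferences": string(data),
	}
	fmt.Println(annotations)
}
```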
### FederatedReplicaSet preferences

The preferences are expressed by the following structure, passed as
serialized JSON inside annotations.

```
type FederatedReplicaSetPreferences struct {
	// If set to true then already scheduled and running replicas may be moved to other clusters
	// in order to bring cluster replica sets towards a desired state. Otherwise, if set to false,
	// up and running replicas will not be moved.
	Rebalance bool `json:"rebalance,omitempty"`

	// Map from cluster name to preferences for that cluster. It is assumed that if a cluster
	// doesn't have a matching entry then it should not have local replicas. The cluster matches
	// to "*" if there is no entry with the real cluster name.
	Clusters map[string]ClusterReplicaSetPreferences
}

// Preferences regarding number of replicas assigned to a cluster replica set within a federated replica set.
type ClusterReplicaSetPreferences struct {
	// Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default.
	MinReplicas int64 `json:"minReplicas,omitempty"`

	// Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default).
	MaxReplicas *int64 `json:"maxReplicas,omitempty"`

	// A number expressing the preference to put an additional replica to this Local ReplicaSet. 0 by default.
	Weight int64
}
```
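Because of the "*" wildcard rule in the Clusters map, a controller needs a
small lookup step before applying any per-cluster preferences. A minimal
sketch, assuming the types above (the helper name is illustrative, not part
of the design):

```
// preferencesForCluster returns the preferences entry for the named cluster,
// falling back to the "*" wildcard entry. The second return value is false
// when neither is present, i.e. the cluster should run no local replicas.
func preferencesForCluster(prefs FederatedReplicaSetPreferences, cluster string) (ClusterReplicaSetPreferences, bool) {
	if p, ok := prefs.Clusters[cluster]; ok {
		return p, true
	}
	p, ok := prefs.Clusters["*"]
	return p, ok
}
```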
How this works in practice:

**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config:

```
FederatedReplicaSetPreferences {
    Rebalance : true
    Clusters : map[string]ClusterReplicaSetPreferences {
        "*" : ClusterReplicaSetPreferences{ Weight: 1 }
    }
}
```

Example:

+ Clusters A,B,C, all have capacity.
  Replica layout: A=16 B=17 C=17.
+ Clusters A,B,C and C has capacity for 6 replicas.
  Replica layout: A=22 B=22 C=6
+ Clusters A,B,C. B and C are offline:
  Replica layout: A=50

**Scenario 2**. I want to have only 2 replicas in each of the clusters.

```
FederatedReplicaSetPreferences {
    Rebalance : true
    Clusters : map[string]ClusterReplicaSetPreferences {
        "*" : ClusterReplicaSetPreferences{ MaxReplicas: 2, Weight: 1 }
    }
}
```

Or

```
FederatedReplicaSetPreferences {
    Rebalance : true
    Clusters : map[string]ClusterReplicaSetPreferences {
        "*" : ClusterReplicaSetPreferences{ MinReplicas: 2, Weight: 0 }
    }
}
```

Or

```
FederatedReplicaSetPreferences {
    Rebalance : true
    Clusters : map[string]ClusterReplicaSetPreferences {
        "*" : ClusterReplicaSetPreferences{ MinReplicas: 2, MaxReplicas: 2 }
    }
}
```

There is a global target of 50; however, if there are 3 clusters there will
be only 6 replicas running.

**Scenario 3**. I want to have 20 replicas in each of 3 clusters.

```
FederatedReplicaSetPreferences {
    Rebalance : true
    Clusters : map[string]ClusterReplicaSetPreferences {
        "*" : ClusterReplicaSetPreferences{ MinReplicas: 20, Weight: 0 }
    }
}
```

There is a global target of 50; however, the clusters require 60, so some
clusters will have fewer replicas.
 Replica layout: A=20 B=20 C=10.

**Scenario 4**. I want to have an equal number of replicas in clusters A,B,C; however, don't put more than 20 replicas to cluster C.

```
FederatedReplicaSetPreferences {
    Rebalance : true
    Clusters : map[string]ClusterReplicaSetPreferences {
        "*" : ClusterReplicaSetPreferences{ Weight: 1 }
        "C" : ClusterReplicaSetPreferences{ MaxReplicas: 20, Weight: 1 }
    }
}
```

Example:

+ All have capacity.
  Replica layout: A=16 B=17 C=17.
+ B is offline/has no capacity.
  Replica layout: A=30 B=0 C=20
+ A and B are offline:
  Replica layout: C=20

**Scenario 5**. I want to run my application in cluster A; however, if there are troubles, FRS can also use clusters B and C, equally.

```
FederatedReplicaSetPreferences {
    Clusters : map[string]ClusterReplicaSetPreferences {
        "A" : ClusterReplicaSetPreferences{ Weight: 1000000 }
        "B" : ClusterReplicaSetPreferences{ Weight: 1 }
        "C" : ClusterReplicaSetPreferences{ Weight: 1 }
    }
}
```

Example:

+ All have capacity.
  Replica layout: A=50 B=0 C=0.
+ A has capacity for only 40 replicas.
  Replica layout: A=40 B=5 C=5

**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters.

```
FederatedReplicaSetPreferences {
    Clusters : map[string]ClusterReplicaSetPreferences {
        "A" : ClusterReplicaSetPreferences{ Weight: 2 }
        "B" : ClusterReplicaSetPreferences{ Weight: 1 }
        "C" : ClusterReplicaSetPreferences{ Weight: 1 }
    }
}
```

Example:

+ All have capacity.
  Replica layout (proportional to weights): A=25 B=13 C=12 (the B/C
  rounding depends on the hash-based cluster ordering described below).

**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there
are already some replicas, please do not move them. Config:

```
FederatedReplicaSetPreferences {
    Rebalance : false
    Clusters : map[string]ClusterReplicaSetPreferences {
        "*" : ClusterReplicaSetPreferences{ Weight: 1 }
    }
}
```

Example:

+ Clusters A,B,C, all have capacity, but A already has 20 replicas.
  Replica layout: A=20 B=15 C=15.
+ Clusters A,B,C and C has capacity for 6 replicas; A already has 20 replicas.
  Replica layout: A=22 B=22 C=6
+ Clusters A,B,C and C has capacity for 6 replicas; A already has 30 replicas.
  Replica layout: A=30 B=14 C=6
## The Idea

A new federated controller - Federated ReplicaSet Controller (FRSC) -
will be created inside the federated controller manager. The key
elements of the idea are enumerated below:

+ [I0] It is considered OK to have a slightly higher number of replicas
  globally for some time.

+ [I1] FRSC starts an informer on FederatedReplicaSets that listens
  for FRS being created, updated or deleted. On each create/update the
  scheduling code is started to calculate where to put the
  replicas. The default behavior is to start the same number of
  replicas in each of the clusters. While creating Local ReplicaSets
  (LRS), the following errors/issues can occur:

  + [E1] A master rejects LRS creation (for a known or unknown
    reason). In this case another attempt to create the LRS should be
    made in 1m or so. This action can be tied with [I5]. Until the LRS
    is created the situation is the same as [E5]. If this happens
    multiple times, all due replicas should be moved elsewhere and
    later moved back once the LRS is created.

  + [E2] An LRS with the same name but a different configuration
    already exists. The LRS is then overwritten and an appropriate
    event created to explain what happened. Pods under the control of
    the old LRS are left intact and the new LRS may adopt them if they
    match the selector.

  + [E3] The LRS is new but pods that match the selector exist. The
    pods are adopted by the LRS (if not owned by some other
    RS). However, they may have a different image, configuration,
    etc. - just like with a regular LRS.

+ [I2] For each of the clusters, FRSC starts a store and an informer on
  LRS that listen for status updates. These status changes are only
  interesting in case of trouble. Otherwise it is assumed that the LRS
  runs trouble-free and there is always the right number of pods
  created, though possibly not scheduled.

  + [E4] An LRS is manually deleted from the local cluster. In this
    case a new LRS should be created. This is the same case as
    [E1]. Any pods that were left behind won't be killed and will be
    adopted after the LRS is recreated.

  + [E5] An LRS fails to create (not necessarily schedule) the desired
    number of pods due to master troubles, admission control,
    etc. This should be considered the same situation as replicas
    being unable to schedule (see [I4]).

  + [E6] It is impossible to tell whether an informer lost its
    connection with a remote cluster or has another synchronization
    problem, so this should be handled by the cluster liveness probe
    and deletion [I6].

+ [I3] For each of the clusters, FRSC starts a store and an informer to
  monitor whether the created pods are eventually scheduled and what
  the current number of correctly running, ready pods is. Errors:

  + [E7] It is impossible to tell whether an informer lost its
    connection with a remote cluster or has another synchronization
    problem, so this should be handled by the cluster liveness probe
    and deletion [I6].

+ [I4] It is assumed that an unscheduled pod is a normal situation
  and can last up to X min if there is huge traffic on the
  cluster. However, if the replicas are not scheduled in that time then
  FRSC should consider moving most of the unscheduled replicas
  elsewhere. For that purpose FRSC will maintain a data structure
  where for each FRS-controlled LRS we store a list of pods belonging
  to that LRS, along with their current status and status change
  timestamp (see the sketch after this list).

+ [I5] If a new cluster is added to the federation then it doesn't
  have an LRS and the situation is equal to [E1]/[E4].

+ [I6] If a cluster is removed from the federation then the situation
  is equal to multiple [E4]. It is assumed that if the connection with
  a cluster is lost completely then the cluster is removed from the
  cluster list (or marked accordingly), so [E6] and [E7] don't need to
  be handled.

+ [I7] All ToBeChecked FRS are browsed every 1 min (configurable),
  checked against the current list of clusters, and all missing LRS
  are created. This is executed in combination with [I8].

+ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min
  (configurable) to check whether some replica move between clusters
  is needed or not.

+ FRSC never moves replicas to an LRS that has not-scheduled/not-running
  pods or that has pods that failed to be created.

  + When FRSC notices that a number of pods have not been
    scheduled/running, or have not even been created, in one LRS for
    more than Y minutes, it takes most of them away from that LRS,
    leaving a couple still waiting, so that once they are scheduled
    FRSC will know that it is OK to put some more replicas in that
    cluster.

+ [I9] FRS becomes ToBeChecked if:
  + It is newly created
  + Some replica set inside changed its status
  + Some pods inside a cluster changed their status
  + Some cluster is added or deleted

> FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough).
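The per-pod bookkeeping described in [I4] could look roughly like the sketch
below; the type and field names are assumptions for illustration, not part
of the design:

```
// Illustrative bookkeeping for [I4]; names are assumptions, not API.
type PodStatusRecord struct {
	// Current status of the pod, e.g. created-but-unscheduled,
	// scheduled, running-and-ready.
	Status string
	// When the pod last changed status; lets FRSC decide whether a pod
	// has been unscheduled for more than Y minutes.
	LastTransition time.Time
}

// For each FRS-controlled LRS (keyed by cluster name), the pods
// belonging to that LRS along with their current status.
type LRSPodTracker struct {
	PodsByCluster map[string]map[string]PodStatusRecord // cluster -> pod name -> record
}
```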
## (RE)Scheduling algorithm

To calculate the (re)scheduling moves for a given FRS:

1. For each cluster, FRSC calculates the number of replicas that are placed
(not necessarily up and running) in the cluster and the number of replicas
that failed to be scheduled. Cluster capacity is the difference between
the placed replicas and those that failed to be scheduled.

2. Order all clusters by their weight and a hash of the name, so that every
time we process the same replica set we process the clusters in the same
order. Include the federated replica set name in the cluster name hash so
that we get slightly different orderings for different replica sets, and
not all replica sets of size 1 end up on the same cluster.

3. Assign the minimum preferred number of replicas to each of the clusters,
if there are enough replicas and capacity.

4. If rebalance = false, assign the previously present replicas to the
clusters, and remember the number of extra replicas added (ER) - again,
only if there are enough replicas and capacity.

5. Distribute the remaining replicas with regard to weights and cluster
capacity. In multiple iterations, calculate how many of the replicas should
end up in each cluster. For each of the clusters, cap the number of assigned
replicas by the max number of replicas and the cluster capacity. If some
extra replicas were added to a cluster in step 4, don't actually add new
replicas but balance them against ER from step 4.
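Below is a simplified, self-contained sketch of steps 2, 3 and 5 for the
rebalance = true case (it skips step 4's ER bookkeeping). All names are
illustrative, and the one-replica-at-a-time weighted hand-out is a
deliberately crude version of the iterations described in step 5:

```
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Per-cluster inputs to the scheduler; names are illustrative.
type clusterInfo struct {
	name     string
	weight   int64
	min      int64
	max      *int64 // nil means unbounded
	capacity int64  // placed minus failed-to-schedule, per step 1
}

func hashName(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func distribute(frsName string, total int64, clusters []clusterInfo) map[string]int64 {
	// Step 2: order by weight, breaking ties with a hash that includes the
	// FRS name, so different replica sets get different orderings.
	sort.Slice(clusters, func(i, j int) bool {
		if clusters[i].weight != clusters[j].weight {
			return clusters[i].weight > clusters[j].weight
		}
		return hashName(frsName+clusters[i].name) < hashName(frsName+clusters[j].name)
	})

	assigned := make(map[string]int64)
	remaining := total

	// Cap a desired count by the cluster's max preference and capacity.
	capped := func(c clusterInfo, want int64) int64 {
		if c.max != nil && want > *c.max {
			want = *c.max
		}
		if want > c.capacity {
			want = c.capacity
		}
		return want
	}

	// Step 3: satisfy minimums first, while replicas and capacity last.
	for _, c := range clusters {
		n := capped(c, c.min)
		if n > remaining {
			n = remaining
		}
		assigned[c.name] = n
		remaining -= n
	}

	// Step 5 (simplified): repeatedly give one replica to the eligible
	// cluster whose assigned/weight ratio is lowest.
	for remaining > 0 {
		best := -1
		for i, c := range clusters {
			if c.weight <= 0 || capped(c, assigned[c.name]+1) <= assigned[c.name] {
				continue // no weight, or no headroom left
			}
			if best == -1 || assigned[c.name]*clusters[best].weight < assigned[clusters[best].name]*c.weight {
				best = i
			}
		}
		if best == -1 {
			break // no cluster can take more replicas
		}
		assigned[clusters[best].name]++
		remaining--
	}
	return assigned
}

func main() {
	// Scenario 1 shaped input: 50 replicas, equal weights, ample capacity.
	clusters := []clusterInfo{
		{name: "A", weight: 1, capacity: 100},
		{name: "B", weight: 1, capacity: 100},
		{name: "C", weight: 1, capacity: 100},
	}
	fmt.Println(distribute("my-frs", 50, clusters))
}
```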
## Goroutines layout

+ [GR1] Involved in the FRS informer (see [I1]). Whenever an FRS is
  created or updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE
  with delay 0.

+ [GR2_1...GR2_N] Involved in informers/stores on LRS (see [I2]). On
  all changes the FRS is put on FRS_TO_CHECK_QUEUE with delay 1 min.

+ [GR3_1...GR3_N] Involved in informers/stores on Pods (see [I3] and
  [I4]). They maintain the status store so that for each of the LRS we
  know the number of pods that are actually running and ready in O(1)
  time. They also put the corresponding FRS on FRS_TO_CHECK_QUEUE with
  delay 1 min.

+ [GR4] Involved in the cluster informer (see [I5] and [I6]). It puts
  all FRS on FRS_TO_CHECK_QUEUE with delay 0.

+ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on
  FRS_CHANNEL after the given delay (and remove them from
  FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to
  FRS_TO_CHECK_QUEUE the delays are compared and updated so that the
  shorter delay is used (see the sketch at the end of this document).

+ [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever an
  FRS is received it is put into a work queue. The work queue has no
  delay and makes sure that a single replica set is processed by only
  one goroutine.

+ [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck
  on the FRS. Multiple replica sets can be processed in parallel, but
  two goroutines cannot process the same FRS at the same time.


## Func DoFrsCheck

The function does [I7] and [I8]. It is assumed that it is run on a
single thread/goroutine, so the same FRS is never checked and evaluated
on multiple goroutines at once (however, if needed, the function can be
parallelized for different FRS). It takes data only from the stores
maintained by GR2_* and GR3_*. External communication is only required
to:

+ Create an LRS. If an LRS doesn't exist it is created after the
  rescheduling, when we know how many replicas it should have.

+ Update LRS replica targets.

If the FRS is not in the desired state then it is put on
FRS_TO_CHECK_QUEUE with delay 1 min (possibly increasing).

## Monitoring and status reporting

FRSC should expose a number of metrics from its run, like:

+ FRSC -> LRS communication latency
+ Total time spent in various elements of DoFrsCheck

FRSC should also expose the status of an FRS as an annotation on the
FRS and as events.

## Workflow

Here is the sequence of tasks that need to be done in order for a
typical FRS to be split into a number of LRSs and created in the
underlying federated clusters.

Note a: the reason the workflow is helpful at this phase is that for
every one or two steps we can create corresponding PRs to start the
development.

Note b: we assume that the federation is already in place and the
federated clusters are added to the federation.

Step 1. The client sends an RS create request to the
federation-apiserver.

Step 2. federation-apiserver persists an FRS into the federation etcd.

Note c: federation-apiserver populates the clusterid field in the FRS
before persisting it into the federation etcd.

Step 3. The federation-level "informer" in FRSC watches federation
etcd for new/modified FRSs, with an empty clusterid or a clusterid
equal to the federation ID, and if one is detected, it calls the
scheduling code.

Step 4.

Note d: the scheduler populates the clusterid field in the LRS with the
IDs of the target clusters.

Note e: at this point let us assume that it only does the even
distribution, i.e., equal weights for all of the underlying clusters.

Step 5. As soon as the scheduler function returns control to FRSC,
the FRSC starts a number of cluster-level "informer"s, one per target
cluster, to watch changes in every target cluster etcd regarding the
posted LRSs, and if any violation of the scheduled number of replicas
is detected the scheduling code is re-called for re-scheduling
purposes.
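As a closing illustration of [GR5_*] and [GR6], here is a toy, map-based
version of FRS_TO_CHECK_QUEUE in which re-adding an FRS keeps the shorter
delay. It is a sketch under simplifying assumptions (polling instead of
timers, FRS identified by name), not a proposed implementation:

```
package main

import (
	"fmt"
	"sync"
	"time"
)

// delayQueue is a toy FRS_TO_CHECK_QUEUE: each FRS name is delivered to
// the output channel after its delay. Re-adding an FRS that is already
// queued keeps the shorter of the two delays, as in [GR5_*].
type delayQueue struct {
	mu    sync.Mutex
	dueAt map[string]time.Time
	out   chan string // plays the role of FRS_CHANNEL
}

func newDelayQueue() *delayQueue {
	q := &delayQueue{dueAt: make(map[string]time.Time), out: make(chan string)}
	go q.run()
	return q
}

func (q *delayQueue) Add(frs string, delay time.Duration) {
	q.mu.Lock()
	defer q.mu.Unlock()
	due := time.Now().Add(delay)
	if existing, ok := q.dueAt[frs]; !ok || due.Before(existing) {
		q.dueAt[frs] = due // keep the shorter delay
	}
}

func (q *delayQueue) run() {
	for {
		time.Sleep(10 * time.Millisecond) // polling keeps the sketch short
		now := time.Now()
		q.mu.Lock()
		var ready []string
		for frs, due := range q.dueAt {
			if !due.After(now) {
				ready = append(ready, frs)
				delete(q.dueAt, frs)
			}
		}
		q.mu.Unlock()
		for _, frs := range ready {
			q.out <- frs
		}
	}
}

func main() {
	q := newDelayQueue()
	// [GR1] would add with delay 0 and [GR2_*]/[GR3_*] with 1 min; short
	// delays are used here so the example finishes quickly.
	q.Add("frs-a", 50*time.Millisecond)
	q.Add("frs-a", 10*time.Millisecond) // shorter delay wins
	q.Add("frs-b", 30*time.Millisecond)
	for i := 0; i < 2; i++ {
		fmt.Println("check:", <-q.out) // [GR6] would hand this to a work queue
	}
}
```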