From b5390a991e0be16b8deb45528813902b1a3daa35 Mon Sep 17 00:00:00 2001
From: Robert Rati
Date: Fri, 17 Apr 2015 14:28:25 -0400
Subject: [PATCH 1/6] Proposal for High Availability of Daemons #6993
---
 docs/proposals/high-availability.md | 34 +++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)
 create mode 100644 docs/proposals/high-availability.md
diff --git a/docs/proposals/high-availability.md b/docs/proposals/high-availability.md
new file mode 100644
index 00000000000..afd12a9f5e6
--- /dev/null
+++ b/docs/proposals/high-availability.md
@@ -0,0 +1,34 @@
+# High Availability of Daemons in Kubernetes
+This document serves as a proposal for high availability of the master daemons in kubernetes.
+
+## Design Options
+1. Hot Standby Daemons: In this scenario, data and state are shared between the two daemons such that an immediate failure in one daemon causes the standby daemon to take over exactly where the failed daemon had left off. This would be an ideal solution for kubernetes, however it poses a series of challenges in the case of controllers where daemon-state is cached locally and not persisted in a transactional way to a storage facility. As a result, we are **NOT** planning on this approach.
+
+2. **Cold Standby Daemons**: In this scenario there is only one active daemon acting as the master and additional daemons in a standby mode. Data and state are not shared between the active and standby daemons, so when a failure occurs the standby daemon that becomes the master must determine the current state of the system before resuming functionality.
+
+3. Stateless load-balanced Daemons: Stateless daemons, such as the apiserver, can simply load-balance across any number of servers that are currently running. Their general availability can be continuously updated, or published, such that load balancing only occurs across active participants. This aspect of HA is outside of the scope of *this* proposal because there is already a partial implementation in the apiserver.
+
+
+## Design Discussion Notes on Leader Election
+For a very simple example of proposed behavior see:
+* https://github.com/rrati/etcd-ha
+* go get github.com/rrati/etcd-ha
+
+In HA, the apiserver will be a gateway to etcd. It will provide an api for becoming master, updating the master lease, and releasing the lease. This api is daemon agnostic, so to become the master the client will need to provide the daemon type and the lease duration when attempting to become master. The apiserver will attempt to create a key in etcd based on the daemon type that contains the client's hostname/ip and port information. This key will be created with a ttl from the lease duration provided in the request. Failure to create this key means there is already a master of that daemon type, and the error from etcd will propagate to the client. Successfully creating the key means the client making the request is the master. When updating the lease, the apiserver will update the existing key with a new ttl. The location in etcd for the HA keys is TBD.
+
+Leader election is first come, first served. The first daemon of a specific type to request leadership will become the master. All other daemons of that type will fail until the current leader releases the lease or fails to update the lease within the expiration time. On startup, all daemons should attempt to become master. The daemon that succeeds is the master and should perform all functions of that daemon. The daemons that fail to become the master should not perform any tasks and sleep for their lease duration and then attempt to become the master again.
+
+The daemon that becomes master should create a Go routine to manage the lease. This process should be created with a channel that the main daemon process can use to release the master lease. Otherwise, this process will update the lease and sleep, waiting for the next update time or notification to release the lease. If there is a failure to update the lease, this process should force the entire daemon to exit. Daemon exit is meant to prevent potential split-brain conditions. Daemon restart is implied in this scenario, by either the init system (systemd), or possible watchdog processes. (See Design Discussion Notes)
+
+## Options added to daemons with HA functionality
+Some command line options would be added to daemons that can do HA:
+
+* Lease Duration - How long a daemon can be master
+
+* Number of Missed Lease Updates - How many updates can be missed before the lease as the master is lost
+
+## Design Discussion Notes on Scheduler/Controller
+Some daemons, such as the controller-manager, may fork numerous go routines to perform tasks in parallel. Trying to keep track of all these processes and shut them down cleanly is untenable. If a master daemon loses leadership then the whole daemon should exit with an exit code indicating that the daemon is not the master. The daemon should be restarted by a monitoring system, such as systemd, or a software watchdog.
+
+## Open Questions:
+* Is there a desire to keep track of all nodes for a specific daemon type?

From c3469dd23640c03a68056f06db9125e88b7c0e4b Mon Sep 17 00:00:00 2001
From: Robert Rati
Date: Tue, 21 Apr 2015 16:58:45 -0400
Subject: [PATCH 2/6] Updated HA proposal based upon comments. #6993
---
 docs/proposals/high-availability.md | 38 ++++++++++++++---------------
 1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/docs/proposals/high-availability.md b/docs/proposals/high-availability.md
index afd12a9f5e6..9a2367bd3c1 100644
--- a/docs/proposals/high-availability.md
+++ b/docs/proposals/high-availability.md
@@ -1,34 +1,34 @@
-# High Availability of Daemons in Kubernetes
-This document serves as a proposal for high availability of the master daemons in kubernetes.
+# High Availability of Scheduling and Controller Components in Kubernetes
+This document serves as a proposal for high availability of the scheduler and controller components in kubernetes. This proposal is intended to provide a simple High Availability api for kubernetes components only. Extensibility beyond that scope will be subject to other constraints.
 ## Design Options
-1. Hot Standby Daemons: In this scenario, data and state are shared between the two daemons such that an immediate failure in one daemon causes the standby daemon to take over exactly where the failed daemon had left off. This would be an ideal solution for kubernetes, however it poses a series of challenges in the case of controllers where daemon-state is cached locally and not persisted in a transactional way to a storage facility. As a result, we are **NOT** planning on this approach.
+For complete reference see [this](https://www.ibm.com/developerworks/community/blogs/RohitShetty/entry/high_availability_cold_warm_hot?lang=en)
-2. **Cold Standby Daemons**: In this scenario there is only one active daemon acting as the master and additional daemons in a standby mode. Data and state are not shared between the active and standby daemons, so when a failure occurs the standby daemon that becomes the master must determine the current state of the system before resuming functionality.
+1. Hot Standby: In this scenario, data and state are shared between the two components such that an immediate failure in one component causes the standby component to take over exactly where the failed component had left off. This would be an ideal solution for kubernetes, however it poses a series of challenges in the case of controllers where component-state is cached locally and not persisted in a transactional way to a storage facility. This would also introduce additional load on the apiserver, which is not desirable. As a result, we are **NOT** planning on this approach at this time.
-3. Stateless load-balanced Daemons: Stateless daemons, such as the apiserver, can simply load-balance across any number of servers that are currently running. Their general availability can be continuously updated, or published, such that load balancing only occurs across active participants. This aspect of HA is outside of the scope of *this* proposal because there is already a partial implementation in the apiserver.
+2. **Warm Standby**: In this scenario there is only one active component acting as the master and additional components running but not providing service or responding to requests. Data and state are not shared between the active and standby components. When a failure occurs, the standby component that becomes the master must determine the current state of the system before resuming functionality.
+3. Active-Active (Load Balanced): Components, such as the apiserver, can simply load-balance across any number of servers that are currently running. Their general availability can be continuously updated, or published, such that load balancing only occurs across active participants. This aspect of HA is outside of the scope of *this* proposal because there is already a partial implementation in the apiserver.
 ## Design Discussion Notes on Leader Election
-For a very simple example of proposed behavior see:
-* https://github.com/rrati/etcd-ha
-* go get github.com/rrati/etcd-ha
+Implementation References:
+* [zookeeper](http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection)
+* [etcd](https://groups.google.com/forum/#!topic/etcd-dev/EbAa4fjypb4)
+* [initialPOC](https://github.com/rrati/etcd-ha)
-In HA, the apiserver will be a gateway to etcd. It will provide an api for becoming master, updating the master lease, and releasing the lease. This api is daemon agnostic, so to become the master the client will need to provide the daemon type and the lease duration when attempting to become master. The apiserver will attempt to create a key in etcd based on the daemon type that contains the client's hostname/ip and port information. This key will be created with a ttl from the lease duration provided in the request. Failure to create this key means there is already a master of that daemon type, and the error from etcd will propagate to the client. Successfully creating the key means the client making the request is the master. When updating the lease, the apiserver will update the existing key with a new ttl. The location in etcd for the HA keys is TBD.
+In HA, the apiserver will provide an api for sets of replicated clients to do master election: become master, update the lease, and release the lease. This api is component agnostic, so a client will need to provide the component type and the lease duration when attempting to become master. The lease duration should be tuned per component. The apiserver will attempt to create a key in etcd based on the component type that contains the client's hostname/ip and port information. This key will be created with a ttl from the lease duration provided in the request. Failure to create this key means there is already a master of that component type, and the error from etcd will propagate to the client. Successfully creating the key means the client making the request is the master. When updating the lease, the apiserver will update the existing key with a new ttl. The location in etcd for the HA keys is TBD.
-Leader election is first come, first served. The first daemon of a specific type to request leadership will become the master. All other daemons of that type will fail until the current leader releases the lease or fails to update the lease within the expiration time. On startup, all daemons should attempt to become master. The daemon that succeeds is the master and should perform all functions of that daemon. The daemons that fail to become the master should not perform any tasks and sleep for their lease duration and then attempt to become the master again.
+The first component to request leadership will become the master. All other components of that type will fail until the current leader releases the lease, or fails to update the lease within the expiration time. On startup, all components should attempt to become master. The component that succeeds becomes the master, and should perform all functions of that component. The components that fail to become the master should not perform any tasks and sleep for their lease duration and then attempt to become the master again. A clean shutdown of the leader will cause a release of the lease and a new master will be elected.
-The daemon that becomes master should create a Go routine to manage the lease. This process should be created with a channel that the main daemon process can use to release the master lease. Otherwise, this process will update the lease and sleep, waiting for the next update time or notification to release the lease. If there is a failure to update the lease, this process should force the entire daemon to exit. Daemon exit is meant to prevent potential split-brain conditions. Daemon restart is implied in this scenario, by either the init system (systemd), or possible watchdog processes. (See Design Discussion Notes)
+The component that becomes master should create a thread to manage the lease. This thread should be created with a channel that the main process can use to release the master lease. The master should release the lease in cases of an unrecoverable error and clean shutdown. Otherwise, this process will update the lease and sleep, waiting for the next update time or notification to release the lease. If there is a failure to update the lease, this process should force the entire component to exit. Daemon exit is meant to prevent potential split-brain conditions. Daemon restart is implied in this scenario, by either the init system (systemd), or possible watchdog processes. (See Design Discussion Notes)
-## Options added to daemons with HA functionality
-Some command line options would be added to daemons that can do HA:
+## Options added to components with HA functionality
+Some command line options would be added to components that can do HA:
-* Lease Duration - How long a daemon can be master
+* Lease Duration - How long a component can be master
-* Number of Missed Lease Updates - How many updates can be missed before the lease as the master is lost
-
-## Design Discussion Notes on Scheduler/Controller
-Some daemons, such as the controller-manager, may fork numerous go routines to perform tasks in parallel. Trying to keep track of all these processes and shut them down cleanly is untenable. If a master daemon loses leadership then the whole daemon should exit with an exit code indicating that the daemon is not the master. The daemon should be restarted by a monitoring system, such as systemd, or a software watchdog.
+## Design Discussion Notes
+Some components may run numerous threads in order to perform tasks in parallel. Upon losing master status, such components should exit instantly instead of attempting to gracefully shut down such threads. This is to ensure that, in the case there's some propagation delay in informing the threads they should stop, the lame-duck threads won't interfere with the new master. The component should exit with an exit code indicating that the component is not the master. Since all components will be run by systemd or some other monitoring system, this will just result in a restart.
 ## Open Questions:
-* Is there a desire to keep track of all nodes for a specific daemon type?
+* Is there a desire to keep track of all nodes for a specific component type?

From 03ad9af16ac9a58b9f996c66fe2b14d9abd4a6b7 Mon Sep 17 00:00:00 2001
From: Robert Rati
Date: Wed, 22 Apr 2015 08:18:17 -0400
Subject: [PATCH 3/6] More updates based on feedback. #6993
---
 docs/proposals/high-availability.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/docs/proposals/high-availability.md b/docs/proposals/high-availability.md
index 9a2367bd3c1..2bfa6dc087e 100644
--- a/docs/proposals/high-availability.md
+++ b/docs/proposals/high-availability.md
@@ -1,5 +1,5 @@
 # High Availability of Scheduling and Controller Components in Kubernetes
-This document serves as a proposal for high availability of the scheduler and controller components in kubernetes. This proposal is intended to provide a simple High Availability api for kubernetes components only. Extensibility beyond that scope will be subject to other constraints.
+This document serves as a proposal for high availability of the scheduler and controller components in kubernetes. This proposal is intended to provide a simple High Availability api for kubernetes components with the potential to extend to services running on kubernetes. Those services would be subject to their own constraints.
 ## Design Options
 For complete reference see [this](https://www.ibm.com/developerworks/community/blogs/RohitShetty/entry/high_availability_cold_warm_hot?lang=en)
@@ -8,7 +8,7 @@ For complete reference see [this](https://www.ibm.com/developerworks/community/b
 2. **Warm Standby**: In this scenario there is only one active component acting as the master and additional components running but not providing service or responding to requests. Data and state are not shared between the active and standby components. When a failure occurs, the standby component that becomes the master must determine the current state of the system before resuming functionality.
-3. Active-Active (Load Balanced): Components, such as the apiserver, can simply load-balance across any number of servers that are currently running. Their general availability can be continuously updated, or published, such that load balancing only occurs across active participants. This aspect of HA is outside of the scope of *this* proposal because there is already a partial implementation in the apiserver.
+3. Active-Active (Load Balanced): Clients can simply load-balance across any number of servers that are currently running. Their general availability can be continuously updated, or published, such that load balancing only occurs across active participants. This aspect of HA is outside of the scope of *this* proposal because there is already a partial implementation in the apiserver.
 ## Design Discussion Notes on Leader Election
@@ -16,11 +16,11 @@ Implementation References:
 * [zookeeper](http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection)
 * [etcd](https://groups.google.com/forum/#!topic/etcd-dev/EbAa4fjypb4)
 * [initialPOC](https://github.com/rrati/etcd-ha)
-In HA, the apiserver will provide an api for sets of replicated clients to do master election: become master, update the lease, and release the lease. This api is component agnostic, so a client will need to provide the component type and the lease duration when attempting to become master. The lease duration should be tuned per component. The apiserver will attempt to create a key in etcd based on the component type that contains the client's hostname/ip and port information. This key will be created with a ttl from the lease duration provided in the request. Failure to create this key means there is already a master of that component type, and the error from etcd will propagate to the client. Successfully creating the key means the client making the request is the master. When updating the lease, the apiserver will update the existing key with a new ttl. The location in etcd for the HA keys is TBD.
+In HA, the apiserver will provide an api for sets of replicated clients to do master election: acquire the lease, renew the lease, and release the lease. This api is component agnostic, so a client will need to provide the component type and the lease duration when attempting to become master. The lease duration should be tuned per component. The apiserver will attempt to create a key in etcd based on the component type that contains the client's hostname/ip and port information. This key will be created with a ttl from the lease duration provided in the request. Failure to create this key means there is already a master of that component type, and the error from etcd will propagate to the client. Successfully creating the key means the client making the request is the master. Only the current master can renew the lease. When renewing the lease, the apiserver will update the existing key with a new ttl. The location in etcd for the HA keys is TBD.
-The first component to request leadership will become the master. All other components of that type will fail until the current leader releases the lease, or fails to update the lease within the expiration time. On startup, all components should attempt to become master. The component that succeeds becomes the master, and should perform all functions of that component. The components that fail to become the master should not perform any tasks and sleep for their lease duration and then attempt to become the master again. A clean shutdown of the leader will cause a release of the lease and a new master will be elected.
+The first component to request leadership will become the master. All other components of that type will fail until the current leader releases the lease, or fails to renew the lease within the expiration time. On startup, all components should attempt to become master. The component that succeeds becomes the master, and should perform all functions of that component. The components that fail to become the master should not perform any tasks and sleep for their lease duration and then attempt to become the master again. A clean shutdown of the leader will cause a release of the lease and a new master will be elected.
-The component that becomes master should create a thread to manage the lease. This thread should be created with a channel that the main process can use to release the master lease. The master should release the lease in cases of an unrecoverable error and clean shutdown. Otherwise, this process will update the lease and sleep, waiting for the next update time or notification to release the lease. If there is a failure to update the lease, this process should force the entire component to exit. Daemon exit is meant to prevent potential split-brain conditions. Daemon restart is implied in this scenario, by either the init system (systemd), or possible watchdog processes. (See Design Discussion Notes)
+The component that becomes master should create a thread to manage the lease. This thread should be created with a channel that the main process can use to release the master lease. The master should release the lease in cases of an unrecoverable error and clean shutdown. Otherwise, this process will renew the lease and sleep, waiting for the next renewal time or notification to release the lease. If there is a failure to renew the lease, this process should force the entire component to exit. Daemon exit is meant to prevent potential split-brain conditions. Daemon restart is implied in this scenario, by either the init system (systemd), or possible watchdog processes. (See Design Discussion Notes)
 ## Options added to components with HA functionality
 Some command line options would be added to components that can do HA:

From 4e6a3291217e7b4931bda3b07e6564dcb0bbd3c0 Mon Sep 17 00:00:00 2001
From: Robert Rati
Date: Fri, 24 Apr 2015 14:55:42 -0400
Subject: [PATCH 4/6] More updates from feedback. #6993
---
 docs/proposals/high-availability.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/docs/proposals/high-availability.md b/docs/proposals/high-availability.md
index 2bfa6dc087e..37a5eb09eb2 100644
--- a/docs/proposals/high-availability.md
+++ b/docs/proposals/high-availability.md
@@ -6,7 +6,7 @@ For complete reference see [this](https://www.ibm.com/developerworks/community/b
 1. Hot Standby: In this scenario, data and state are shared between the two components such that an immediate failure in one component causes the standby component to take over exactly where the failed component had left off. This would be an ideal solution for kubernetes, however it poses a series of challenges in the case of controllers where component-state is cached locally and not persisted in a transactional way to a storage facility. This would also introduce additional load on the apiserver, which is not desirable. As a result, we are **NOT** planning on this approach at this time.
-2. **Warm Standby**: In this scenario there is only one active component acting as the master and additional components running but not providing service or responding to requests. Data and state are not shared between the active and standby components. When a failure occurs, the standby component that becomes the master must determine the current state of the system before resuming functionality.
+2. **Warm Standby**: In this scenario there is only one active component acting as the master and additional components running but not providing service or responding to requests. Data and state are not shared between the active and standby components. When a failure occurs, the standby component that becomes the master must determine the current state of the system before resuming functionality. This is the approach that this proposal will leverage.
 3. Active-Active (Load Balanced): Clients can simply load-balance across any number of servers that are currently running. Their general availability can be continuously updated, or published, such that load balancing only occurs across active participants. This aspect of HA is outside of the scope of *this* proposal because there is already a partial implementation in the apiserver.
@@ -30,5 +30,11 @@ Some command line options would be added to components that can do HA:
 ## Design Discussion Notes
 Some components may run numerous threads in order to perform tasks in parallel. Upon losing master status, such components should exit instantly instead of attempting to gracefully shut down such threads. This is to ensure that, in the case there's some propagation delay in informing the threads they should stop, the lame-duck threads won't interfere with the new master. The component should exit with an exit code indicating that the component is not the master. Since all components will be run by systemd or some other monitoring system, this will just result in a restart.
+There is a short window for a split-brain condition because we cannot gate operations at the apiserver. Having the daemons exit shortens this window but does not eliminate it. A proper solution for this problem will be addressed at a later date. The proposed solution is:
+1. This requires transaction support in etcd (which is already planned - see [coreos/etcd#2675](https://github.com/coreos/etcd/pull/2675))
+2. Apart from the entry in etcd that is tracking the lease for a given component and is periodically refreshed, we introduce another entry (per component) that is changed only when the master is changing - let's call it the "current master" entry (we don't refresh it).
+3. The master replica is aware of a version of its "current master" etcd entry.
+4. Whenever a master replica is trying to write something, it also attaches a "precondition" for the version of its "current master" entry [the whole transaction cannot succeed if the version of the corresponding "current master" entry in etcd has changed]. This basically guarantees that if we elect the new master, all transactions coming from the old master will fail.
+
 ## Open Questions:
 * Is there a desire to keep track of all nodes for a specific component type?

From ed2f3c6b1f23ac8bb3c252b4e5c1d21e47c5b6a2 Mon Sep 17 00:00:00 2001
From: Robert Rati
Date: Wed, 29 Apr 2015 14:50:13 -0400
Subject: [PATCH 5/6] More updates based on feedback. #6993
---
 docs/proposals/high-availability.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/docs/proposals/high-availability.md b/docs/proposals/high-availability.md
index 37a5eb09eb2..04d206fdc2d 100644
--- a/docs/proposals/high-availability.md
+++ b/docs/proposals/high-availability.md
@@ -28,13 +28,14 @@ Some command line options would be added to components that can do HA:
 * Lease Duration - How long a component can be master
 ## Design Discussion Notes
-Some components may run numerous threads in order to perform tasks in parallel. Upon losing master status, such components should exit instantly instead of attempting to gracefully shut down such threads. This is to ensure that, in the case there's some propagation delay in informing the threads they should stop, the lame-duck threads won't interfere with the new master. The component should exit with an exit code indicating that the component is not the master. Since all components will be run by systemd or some other monitoring system, this will just result in a restart.
+Some components may run numerous threads in order to perform tasks in parallel. Upon losing master status, such components should exit instantly instead of attempting to gracefully shut down such threads. This is to ensure that, in the case there's some propagation delay in informing the threads they should stop, the lame-duck threads won't interfere with the new master. The component should exit with an exit code indicating that the component is not the master. Since all components will be run by systemd or some other monitoring system, this will just result in a restart.
-There is a short window for a split-brain condition because we cannot gate operations at the apiserver. Having the daemons exit shortens this window but does not eliminate it. A proper solution for this problem will be addressed at a later date. The proposed solution is:
+There is a short window after a new master acquires the lease, during which data from the old master might be committed. This is because there is currently no way to condition a write on its source being the master. Having the daemons exit shortens this window but does not eliminate it. A proper solution for this problem will be addressed at a later date. The proposed solution is:
 1. This requires transaction support in etcd (which is already planned - see [coreos/etcd#2675](https://github.com/coreos/etcd/pull/2675))
-2. Apart from the entry in etcd that is tracking the lease for a given component and is periodically refreshed, we introduce another entry (per component) that is changed only when the master is changing - let's call it the "current master" entry (we don't refresh it).
-3. The master replica is aware of a version of its "current master" etcd entry.
-4. Whenever a master replica is trying to write something, it also attaches a "precondition" for the version of its "current master" entry [the whole transaction cannot succeed if the version of the corresponding "current master" entry in etcd has changed]. This basically guarantees that if we elect the new master, all transactions coming from the old master will fail.
+2. The entry in etcd that is tracking the lease for a given component (the "current master" entry) would have as its value the host:port of the lease-holder (as described earlier) and a sequence number. The sequence number is incremented whenever a new master gets the lease.
+3. The master replica is aware of the latest sequence number.
+4. Whenever the master replica sends a mutating operation to the API server, it includes the sequence number.
+5. When the API server makes the corresponding write to etcd, it includes it in a transaction that does a compare-and-swap on the "current master" entry (old value == new value == host:port and sequence number from the replica that sent the mutating operation). This basically guarantees that if we elect the new master, all transactions coming from the old master will fail. You can think of this as the master attaching a "precondition" of its belief about who is the latest master.
 ## Open Questions:
 * Is there a desire to keep track of all nodes for a specific component type?

From 0beb72729e39b490396e7cdabe0639c2bf1ee69a Mon Sep 17 00:00:00 2001
From: Robert Rati
Date: Fri, 1 May 2015 10:02:56 -0400
Subject: [PATCH 6/6] Fixed list formatting in the design discussion notes. #6993
---
 docs/proposals/high-availability.md | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/docs/proposals/high-availability.md b/docs/proposals/high-availability.md
index 04d206fdc2d..647c95621a0 100644
--- a/docs/proposals/high-availability.md
+++ b/docs/proposals/high-availability.md
@@ -31,10 +31,15 @@ Some command line options would be added to components that can do HA:
 Some components may run numerous threads in order to perform tasks in parallel. Upon losing master status, such components should exit instantly instead of attempting to gracefully shut down such threads. This is to ensure that, in the case there's some propagation delay in informing the threads they should stop, the lame-duck threads won't interfere with the new master. The component should exit with an exit code indicating that the component is not the master. Since all components will be run by systemd or some other monitoring system, this will just result in a restart.
 There is a short window after a new master acquires the lease, during which data from the old master might be committed. This is because there is currently no way to condition a write on its source being the master. Having the daemons exit shortens this window but does not eliminate it. A proper solution for this problem will be addressed at a later date. The proposed solution is:
+
 1. This requires transaction support in etcd (which is already planned - see [coreos/etcd#2675](https://github.com/coreos/etcd/pull/2675))
+
 2. The entry in etcd that is tracking the lease for a given component (the "current master" entry) would have as its value the host:port of the lease-holder (as described earlier) and a sequence number. The sequence number is incremented whenever a new master gets the lease.
+
 3. The master replica is aware of the latest sequence number.
+
 4. Whenever the master replica sends a mutating operation to the API server, it includes the sequence number.
+
 5. When the API server makes the corresponding write to etcd, it includes it in a transaction that does a compare-and-swap on the "current master" entry (old value == new value == host:port and sequence number from the replica that sent the mutating operation). This basically guarantees that if we elect the new master, all transactions coming from the old master will fail. You can think of this as the master attaching a "precondition" of its belief about who is the latest master.
 ## Open Questions:
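
For readers following the warm-standby lease flow proposed in these patches (acquire the lease, renew it on an interval, release it on clean shutdown, exit immediately on renewal failure), the sketch below shows one possible shape of the client side. It is illustrative only: the `LeaseClient` interface, method names, package name, and exit code are hypothetical stand-ins for the apiserver lease api, whose real shape is left open by the proposal.

```go
// Illustrative sketch of the warm-standby loop described in the proposal.
// LeaseClient is a hypothetical wrapper around the proposed apiserver lease
// api (acquire / renew / release); none of these names are defined by the
// proposal or by any existing Kubernetes library.
package leaderelection

import (
	"log"
	"os"
	"time"
)

type LeaseClient interface {
	// Acquire tries to create the per-component key with a ttl; it fails if
	// another replica already holds the lease.
	Acquire(component, hostPort string, duration time.Duration) error
	// Renew updates the existing key with a fresh ttl; only the master may call it.
	Renew(component string, duration time.Duration) error
	// Release deletes the key so a new master can be elected immediately.
	Release(component string) error
}

const notMasterExitCode = 2 // hypothetical exit code meaning "not the master"

// RunWhenMaster blocks until this replica wins the lease, then runs work while
// a background goroutine renews the lease. Any renewal failure forces the
// whole process to exit so lame-duck work cannot race the new master; the init
// system (e.g. systemd) is expected to restart the component.
func RunWhenMaster(lc LeaseClient, component, hostPort string, leaseDuration time.Duration, work func(release chan<- struct{})) {
	for lc.Acquire(component, hostPort, leaseDuration) != nil {
		// Another replica is master: do nothing, sleep a lease period, retry.
		time.Sleep(leaseDuration)
	}

	release := make(chan struct{}) // the main process signals a clean shutdown here
	go func() {
		ticker := time.NewTicker(leaseDuration / 2) // renew well before the ttl expires
		defer ticker.Stop()
		for {
			select {
			case <-release:
				_ = lc.Release(component)
				os.Exit(0)
			case <-ticker.C:
				if err := lc.Renew(component, leaseDuration); err != nil {
					log.Printf("lost master lease: %v", err)
					os.Exit(notMasterExitCode) // exit instantly; no graceful thread shutdown
				}
			}
		}
	}()

	work(release)
}
```

A controller-manager-style component would call `RunWhenMaster` from its main function and treat the release channel as its only shutdown hook, matching the proposal's rule that lame-duck threads are never wound down gracefully.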
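The five-step split-brain guard sketched in patches 4-6 can also be illustrated in code. The snippet below uses an in-memory map and a mutex as a stand-in for the etcd transaction the proposal depends on (which was still pending upstream); `masterRecord`, `store`, and the method names are hypothetical, not part of the apiserver or etcd APIs.

```go
// Hypothetical sketch of the "current master" compare-and-swap precondition.
// A sync.Mutex plays the role of the etcd transaction; real code would issue a
// single transactional compare-and-swap against etcd instead.
package leaderelection

import (
	"errors"
	"sync"
)

var ErrStaleMaster = errors.New("write rejected: sender is no longer the current master")

// masterRecord mirrors the proposed "current master" entry: the lease-holder's
// host:port plus a sequence number bumped every time the master changes.
type masterRecord struct {
	HostPort string
	Seq      uint64
}

// store stands in for the apiserver's view of etcd.
type store struct {
	mu      sync.Mutex
	current masterRecord
	data    map[string]string
}

func newStore() *store {
	return &store{data: make(map[string]string)}
}

// PromoteMaster is invoked when a replica acquires the lease; incrementing Seq
// invalidates every in-flight write still carrying the old master's record.
func (s *store) PromoteMaster(hostPort string) masterRecord {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.current = masterRecord{HostPort: hostPort, Seq: s.current.Seq + 1}
	return s.current
}

// ApplyFromMaster commits a mutating operation only if the sender's belief
// about the current master still matches the stored record - the precondition
// described in step 5 above.
func (s *store) ApplyFromMaster(claimed masterRecord, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.current != claimed {
		return ErrStaleMaster // a newer master holds the lease; fail the whole write
	}
	s.data[key] = value
	return nil
}
```

With this shape, writes from a deposed master fail atomically even if its process has not yet noticed that lease renewal failed, which is exactly the window the proposal says daemon exit alone cannot close.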