mirror of
https://github.com/k3s-io/kubernetes.git
synced 2025-07-30 06:54:01 +00:00
More updates based on feedback. #6993
This commit is contained in:
parent
4e6a329121
commit
ed2f3c6b1f
@ -28,13 +28,14 @@ Some command line options would be added to components that can do HA:
|
||||
* Lease Duration - How long a component can be master
|
||||
|
||||
## Design Discussion Notes
|
||||
Some components may run numerous threads in order to perform tasks in parallel. Upon losing master status, such components should exit instantly instead of attempting to gracefully shut down such threads. This is to ensure that, in the case there's some propagation delay in informing the threads they should stop, the lame-duck threads won't interfere with the new master. The component should exit with an exit code indicating that the component is not the master. Since all components will be run by systemd or some other monitoring system, this will just result in a restart.
|
||||
Some components may run numerous threads in order to perform tasks in parallel. Upon losing master status, such components should exit instantly instead of attempting to gracefully shut down such threads. This is to ensure that, in the case there's some propagation delay in informing the threads they should stop, the lame-duck threads won't interfere with the new master. The component should exit with an exit code indicating that the component is not the master. Since all components will be run by systemd or some other monitoring system, this will just result in a restart.
|
||||
|
||||
There is a short window for a split brain condition because we cannot gate operations at the apiserver. Having the daemons exit shortens this window but does not eliminate it. A proper solution for this problem will be addressed at a later date. The proposed solution is:
|
||||
There is a short window after a new master acquires the lease, during which data from the old master might be committed. This is because there is currently no way to condition a write on its source being the master. Having the daemons exit shortens this window but does not eliminate it. A proper solution for this problem will be addressed at a later date. The proposed solution is:
|
||||
1. This requires transaction support in etcd (which is already planned - see [coreos/etcd#2675](https://github.com/coreos/etcd/pull/2675))
|
||||
2. Apart from the entry in etcd that is tracking the lease for a given component and is periodically refreshed, we introduce another entry (per component) that is changed only when the master is changing - let's call it "current master" entry (we don't refresh it).
|
||||
3. Master replica is aware of a version of its "current master" etcd entry.
|
||||
4. Whenever a master replica is trying to write something, it also attaches a "precondition" for the version of its "current master" entry [the whole transaction cannot succeed if the version of the corresponding "current master" entry in etcd has changed]. This basically guarantees that if we elect the new master, all transactions coming from the old master will fail.
|
||||
2. The entry in etcd that is tracking the lease for a given component (the "current master" entry) would have as its value the host:port of the lease-holder (as described earlier) and a sequence number. The sequence number is incremented whenever a new master gets the lease.
|
||||
3. Master replica is aware of the latest sequence number.
|
||||
4. Whenever master replica sends a mutating operation to the API server, it includes the sequence number.
|
||||
5. When the API server makes the corresponding write to etcd, it includes it in a transaction that does a compare-and-swap on the "current master" entry (old value == new value == host:port and sequence number from the replica that sent the mutating operation). This basically guarantees that if we elect the new master, all transactions coming from the old master will fail. You can think of this as the master attaching a "precondition" of its belief about who is the latest master.
|
||||
|
||||
## Open Questions:
|
||||
* Is there a desire to keep track of all nodes for a specific component type?
|
||||
|
Loading…
Reference in New Issue
Block a user