From 5d9e3615f05957d0747638ac5036e9ed794e82ef Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Tue, 13 Sep 2016 17:30:41 -0700 Subject: [PATCH 01/12] add a proposal for synchronous garbage collection --- .../synchronous-garbage-collection.md | 59 +++++++++++++++++++ 1 file changed, 59 insertions(+) create mode 100644 docs/proposals/synchronous-garbage-collection.md diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md new file mode 100644 index 00000000000..587fe42ab23 --- /dev/null +++ b/docs/proposals/synchronous-garbage-collection.md @@ -0,0 +1,59 @@ +# Overview + +Some users of the server-side garbage collection need to tell if the garbage collection is done. Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism to enable such use cases: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted by the garbage collector. + +Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 + +# Required code modification + +We need to make changes in the API, the API Server, and the garbage collector to support synchronous garbage collection. + +## API changes +```go +DeleteOptions { + … + // If SynchronousGarbageCollection is set, the object will not be deleted immediately. Instead, a GarbageCollectionInProgress finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted. + // SynchronousGarbageCollection is ignored if OrphanDependents is true or nil. + // SynchronousGarbageCollection default to false. + // SynchronousGarbageCollection is cascading, i.e., the object’s dependents will be deleted with the same SynchronousGarbageCollection. + SynchronousGarbageCollection *bool +} +``` + +We will introduce a new standard finalizer: const GCFinalizer string = “GarbageCollectionInProgress” + +## Components changes + +### API Server + +Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. + +* The option is ignored if DeleteOptions.OrphanDependents is true or nil. +* If the option is set, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. + +### Garbage Collector + +**Modifications to processEvent()** + +* Needs to handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `synchronousGC queue`. The object will be marked as “GC in progress” in GC’s internal owner-dependency relationship graph, `uidToNode`. +* Upon receiving the deletion event of an object, put its owner into the `synchronousGC queue`. This is to force the `GCFinalizer` (described next) to re-check if all dependents of the owner is deleted. + +**Addition of GCFinalizer() routine** + +* Pops an object from the `synchronousGC queue`. +* Ignores the object if it doesn’t exist in `uidToNode`, or if the object is not marked as “GC in progress” in `uidToNode`. +* To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). +* Checks if the object has dependents + * If not, send a PUT request to remove the GCFinalizer + * If so, then add all dependents to the dirtryQueue; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. + +**Modifications to processItem()** + +* Treat owner whose DeletionTimestamp is non-nil as “not exist”. +* When deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest. + +**Note** +We cannot have circular dependencies anymore. Otherwise SynchronousGC will enter a deadlock. + +## Unhandled cases +* If the GC observes the owning object with the GCFinalizer before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. From d67eb53db2c872e4f22c104a606c0e0daa4adb4a Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Tue, 13 Sep 2016 21:12:17 -0700 Subject: [PATCH 02/12] addressing comments --- .../synchronous-garbage-collection.md | 28 ++++++++++++++----- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 587fe42ab23..494365d4867 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -35,7 +35,9 @@ Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. **Modifications to processEvent()** -* Needs to handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `synchronousGC queue`. The object will be marked as “GC in progress” in GC’s internal owner-dependency relationship graph, `uidToNode`. +`processEvent()` manages GC’s internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to: + +* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `synchronousGC queue`. The object will be marked as “GC in progress” in `uidToNode`. * Upon receiving the deletion event of an object, put its owner into the `synchronousGC queue`. This is to force the `GCFinalizer` (described next) to re-check if all dependents of the owner is deleted. **Addition of GCFinalizer() routine** @@ -45,15 +47,27 @@ Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. * To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). * Checks if the object has dependents * If not, send a PUT request to remove the GCFinalizer - * If so, then add all dependents to the dirtryQueue; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. + * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. **Modifications to processItem()** -* Treat owner whose DeletionTimestamp is non-nil as “not exist”. -* When deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest. +`processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: -**Note** -We cannot have circular dependencies anymore. Otherwise SynchronousGC will enter a deadlock. +* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. +* when deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest. + +**Handling circular dependencies** + +SynchronousGC will enter a deadlock in the presence of circular dependencies. To break the deadlock, we need to timeout a GCFinalizer. To implement the timeout, GC adds an object that has a GCFinalizer into a [delaying queue](../../pkg/util/workqueue/delaying_queue.go) when it's observed, and removes the GCFinalizer from it when the time is up. The timeout value should be proportional to the number of dependents, including indirect ones. ## Unhandled cases -* If the GC observes the owning object with the GCFinalizer before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. +* If the GC observes the owning object with the GCFinalizer before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). + + +## Implications to existing clients + +**Namespce controller** can [handle finalizers](https://github.com/kubernetes/kubernetes/pull/32524), so it can properly delete a namespace if there is synchronous GC going on in the namespace. Also, we can convert namespace controller to use synchronous GC. + +We should be able to convert **kubectl delete** reapers to use synchronous GC. + +For other clients, they are able to work with synchronous GC as long as they can cope with finalizers in general. From b58746f603ac158865adaf13838e46c50e8f74bb Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Tue, 20 Sep 2016 11:22:05 -0700 Subject: [PATCH 03/12] more --- .../synchronous-garbage-collection.md | 37 +++++++++++++++---- 1 file changed, 29 insertions(+), 8 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 494365d4867..9f94fa6e887 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -1,6 +1,6 @@ # Overview -Some users of the server-side garbage collection need to tell if the garbage collection is done. Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism to enable such use cases: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted by the garbage collector. +Some users of the server-side garbage collection need to tell if the garbage collection is done ([example](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077)). Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism to enable such use cases: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector. Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 @@ -13,7 +13,7 @@ We need to make changes in the API, the API Server, and the garbage collector to DeleteOptions { … // If SynchronousGarbageCollection is set, the object will not be deleted immediately. Instead, a GarbageCollectionInProgress finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted. - // SynchronousGarbageCollection is ignored if OrphanDependents is true or nil. + // SynchronousGarbageCollection and OrphanDependents are exclusive. // SynchronousGarbageCollection default to false. // SynchronousGarbageCollection is cascading, i.e., the object’s dependents will be deleted with the same SynchronousGarbageCollection. SynchronousGarbageCollection *bool @@ -46,7 +46,7 @@ Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. * Ignores the object if it doesn’t exist in `uidToNode`, or if the object is not marked as “GC in progress” in `uidToNode`. * To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). * Checks if the object has dependents - * If not, send a PUT request to remove the GCFinalizer + * If not, send a PUT request to remove the `GCFinalizer` * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. **Modifications to processItem()** @@ -55,19 +55,40 @@ Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. * treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. * when deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest. +* if an object has multiple owners, some owners still exit while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of those owners, `processItem()` has to remove the ownerReferences pointing to them. **Handling circular dependencies** -SynchronousGC will enter a deadlock in the presence of circular dependencies. To break the deadlock, we need to timeout a GCFinalizer. To implement the timeout, GC adds an object that has a GCFinalizer into a [delaying queue](../../pkg/util/workqueue/delaying_queue.go) when it's observed, and removes the GCFinalizer from it when the time is up. The timeout value should be proportional to the number of dependents, including indirect ones. +SynchronousGC will enter a deadlock in the presence of circular dependencies. Here are two alternative approaches to break the deadlock: + +1. Timeout a `GCFinalizer `: To implement the timeout, GC adds an object that has a `GCFinalizer` into a [delaying queue](../../pkg/util/workqueue/delaying_queue.go) when it's observed, and removes the `GCFinalizer` from it when the time is up. The timeout value should be proportional to the number of dependents, including indirect ones. + +2. Lazily detecting circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it searches the internal owner-dependency relationship graph (`uidToNode`) to check if the object and any of its owner are in a circle where all objects have the `GCFinalizer`. If so, it removes the `GCFinzlier` from the object to break the circle. ## Unhandled cases -* If the GC observes the owning object with the GCFinalizer before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). +* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). ## Implications to existing clients -**Namespce controller** can [handle finalizers](https://github.com/kubernetes/kubernetes/pull/32524), so it can properly delete a namespace if there is synchronous GC going on in the namespace. Also, we can convert namespace controller to use synchronous GC. +Finalizer breaks an assumption that many Kubernetes components have: a deletion request with `grace period=0` will immediately remove the object from the key-value store. This is not true if an object has pending finalizers, the object will continue to exist, and currently the API server will not return an error in this case. -We should be able to convert **kubectl delete** reapers to use synchronous GC. +**Namespce controller** suffered from this [problem](https://github.com/kubernetes/kubernetes/issues/32519) and was fixed in [#32524](https://github.com/kubernetes/kubernetes/pull/32524) by retrying every 15s if there are objects with pending finalizers to be removed from the key-value store. Object with pending `GCFinalizer` might take arbitrary long time be deleted, so namespace deletion might time out. -For other clients, they are able to work with synchronous GC as long as they can cope with finalizers in general. +**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes if API server does not return an error, the pod is removed from the key-value store. Breaking the assumption will not break `kubelet` though, because the `pod` must have already been in the terminated `phase`, `kubelet` will not care to manage it. + +**Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. The `node controller` forcefully deletes pods before deleting the node ([code](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. Other components will take care of the pending finalizers. + +**Podgc** deletes terminated pods if there are too many of them in the cluster. `Podgc` should remove any pending finalizers to make sure the pods are deleted. + +**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment shouldn't adopt it, because the RS controller will scale up/down a RS that's being deleted. Hence, `deployment controller` needs to check if a RS is being deleted before adopting it. If the RS is being deleted, then the `deployment controller` should wait for the status of the RS showing 0 replicas, and then create a new RS. + +**Replication controller manager**, **Job controller**, and **ReplicaSet controller** send deletion request of pods. `kubelet` will drive these pods to the terminated phase, so the pods will be ignored by the controllers even if they keep existing in the key-value store because of pending finalizers. + +**Endpoints controller** deletes endpoints. It does not double check if the endpoint is gone, so + +One usage of the synchronous GC is to replace the **kubectl delete** reapers. Currently `kubectl delete` blocks until all dependents and the owner are deleted. To maintain this behavior, after switched to using synchronous GC, *kubectl delete* needs to poll on the removal of the owner object. + +## Security implications + +A user who are authorized to update one object can affect the synchronous GC behavior of another object. Specifically, even if a user is only authorized to update a pod, he can set another object as the pod's owner, and set a very long grace termination period for the pod, then he makes the synchronous GC of the owner takes long time. From 0298cd3a2567719c0106cb9eac800bdfe42b7aef Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Tue, 20 Sep 2016 22:25:20 -0700 Subject: [PATCH 04/12] addressing comments --- .../synchronous-garbage-collection.md | 68 ++++++++++++++----- 1 file changed, 50 insertions(+), 18 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 9f94fa6e887..110b41229fc 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -1,3 +1,32 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + # Overview Some users of the server-side garbage collection need to tell if the garbage collection is done ([example](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077)). Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism to enable such use cases: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector. @@ -9,6 +38,7 @@ Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 We need to make changes in the API, the API Server, and the garbage collector to support synchronous garbage collection. ## API changes + ```go DeleteOptions { … @@ -26,9 +56,9 @@ We will introduce a new standard finalizer: const GCFinalizer string = “Garbag ### API Server -Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. +Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. -* The option is ignored if DeleteOptions.OrphanDependents is true or nil. +* The option is ignored if DeleteOptions.OrphanDependents is true or nil. * If the option is set, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. ### Garbage Collector @@ -47,25 +77,22 @@ Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. * To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). * Checks if the object has dependents * If not, send a PUT request to remove the `GCFinalizer` - * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. + * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. **Modifications to processItem()** `processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: -* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. +* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. * when deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest. -* if an object has multiple owners, some owners still exit while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of those owners, `processItem()` has to remove the ownerReferences pointing to them. +* if an object has multiple owners, some owners still exit while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of those owners, `processItem()` has to remove the ownerReferences pointing to them. **Handling circular dependencies** -SynchronousGC will enter a deadlock in the presence of circular dependencies. Here are two alternative approaches to break the deadlock: - -1. Timeout a `GCFinalizer `: To implement the timeout, GC adds an object that has a `GCFinalizer` into a [delaying queue](../../pkg/util/workqueue/delaying_queue.go) when it's observed, and removes the `GCFinalizer` from it when the time is up. The timeout value should be proportional to the number of dependents, including indirect ones. - -2. Lazily detecting circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it searches the internal owner-dependency relationship graph (`uidToNode`) to check if the object and any of its owner are in a circle where all objects have the `GCFinalizer`. If so, it removes the `GCFinzlier` from the object to break the circle. +SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily detecting circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it searches the internal owner-dependency relationship graph (`uidToNode`) to check if the object and any of its owner are in a circle where all objects have the `GCFinalizer`. If so, it removes the `GCFinzlier` from the object to break the circle. ## Unhandled cases + * If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). @@ -73,22 +100,27 @@ SynchronousGC will enter a deadlock in the presence of circular dependencies. He Finalizer breaks an assumption that many Kubernetes components have: a deletion request with `grace period=0` will immediately remove the object from the key-value store. This is not true if an object has pending finalizers, the object will continue to exist, and currently the API server will not return an error in this case. -**Namespce controller** suffered from this [problem](https://github.com/kubernetes/kubernetes/issues/32519) and was fixed in [#32524](https://github.com/kubernetes/kubernetes/pull/32524) by retrying every 15s if there are objects with pending finalizers to be removed from the key-value store. Object with pending `GCFinalizer` might take arbitrary long time be deleted, so namespace deletion might time out. +**Namespace controller** suffered from this [problem](https://github.com/kubernetes/kubernetes/issues/32519) and was fixed in [#32524](https://github.com/kubernetes/kubernetes/pull/32524) by retrying every 15s if there are objects with pending finalizers to be removed from the key-value store. Object with pending `GCFinalizer` might take arbitrary long time be deleted, so namespace deletion might time out. -**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes if API server does not return an error, the pod is removed from the key-value store. Breaking the assumption will not break `kubelet` though, because the `pod` must have already been in the terminated `phase`, `kubelet` will not care to manage it. +**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](../../pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes that if the API server does not return an error, the pod is removed from the key-value store. Breaking the assumption will not break `kubelet` though, because the `pod` must have already been in the terminated `phase`, `kubelet` will not care to manage it. -**Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. The `node controller` forcefully deletes pods before deleting the node ([code](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. Other components will take care of the pending finalizers. +**Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. These pods will be deleted from the key-value store when the pending finalizers are removed. **Podgc** deletes terminated pods if there are too many of them in the cluster. `Podgc` should remove any pending finalizers to make sure the pods are deleted. -**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment shouldn't adopt it, because the RS controller will scale up/down a RS that's being deleted. Hence, `deployment controller` needs to check if a RS is being deleted before adopting it. If the RS is being deleted, then the `deployment controller` should wait for the status of the RS showing 0 replicas, and then create a new RS. +**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment shouldn't adopt it, because the RS controller will not scale up/down a RS that's being deleted. Hence, `deployment controller` needs to check if a RS is being deleted before adopting it. If the RS is being deleted, then the `deployment controller` should wait for the status of the RS to show 0 replicas to avoid creating extra pods, then create a new RS. -**Replication controller manager**, **Job controller**, and **ReplicaSet controller** send deletion request of pods. `kubelet` will drive these pods to the terminated phase, so the pods will be ignored by the controllers even if they keep existing in the key-value store because of pending finalizers. +**Replication controller manager**, **Job controller**, and **ReplicaSet controller** ignore pods in terminated phase, so pods with pending finalizers will not block these controllers. -**Endpoints controller** deletes endpoints. It does not double check if the endpoint is gone, so +**PetSet controller** will be blocked by a pod with pending finalizers, so Synchronous GC might slow down its progress. -One usage of the synchronous GC is to replace the **kubectl delete** reapers. Currently `kubectl delete` blocks until all dependents and the owner are deleted. To maintain this behavior, after switched to using synchronous GC, *kubectl delete* needs to poll on the removal of the owner object. +**kubectl**: synchronous GC can replace the **kubectl delete** reapers. Currently `kubectl delete` blocks until all dependents and the owner are deleted. To maintain this behavior, after switched to using synchronous GC, *kubectl delete* needs to poll on the removal of the owner object. ## Security implications -A user who are authorized to update one object can affect the synchronous GC behavior of another object. Specifically, even if a user is only authorized to update a pod, he can set another object as the pod's owner, and set a very long grace termination period for the pod, then he makes the synchronous GC of the owner takes long time. +A user who is authorized to update one object can affect the synchronous GC behavior of another object. Specifically, by setting an object as a pod's owner, and setting a very long grace termination period for the pod, a user can make the synchronous GC of the owner to take long time. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/synchronous-garbage-collection.md?pixel)]() + From 8d1039280a33cba7934c1f32cd2cfd8883b310a7 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Mon, 26 Sep 2016 09:54:10 -0700 Subject: [PATCH 05/12] addressing more comments --- .../synchronous-garbage-collection.md | 38 +++++++++++++------ 1 file changed, 26 insertions(+), 12 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 110b41229fc..13fba422390 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -29,7 +29,13 @@ Documentation for other releases can be found at # Overview -Some users of the server-side garbage collection need to tell if the garbage collection is done ([example](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077)). Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism to enable such use cases: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector. +Users of the server-side garbage collection need to determine if the garbage collection is done. For example: +* Currently `kubectl delete rc` blocks until all the pods are terminating. To convert to use server-side garbage collection, kubectl has to be able to determine if the garbage collection is done. +* [#19701](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077) is a use case where the user needs to wait for all service dependencies garbage collected and their names released, before she recreates the dependencies. + +We define the garbage collection as "done" when all the dependents are deleted from the key-value store, rather than merely in the terminating state. There are two reasons: *i)* for `Pod`s, the most usual garbage, only when they are deleted from the key-value store, we know kubelet has released their resources; *ii)* some users need to recreate objects with the same names, they need to wait for the old objects to be deleted from the key-value store. (This limitation is because we index objects by their names in the key-value store today.) + +Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism that allows user to determine if the garbage collection is done: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector. Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 @@ -42,23 +48,23 @@ We need to make changes in the API, the API Server, and the garbage collector to ```go DeleteOptions { … - // If SynchronousGarbageCollection is set, the object will not be deleted immediately. Instead, a GarbageCollectionInProgress finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted. + // If SynchronousGarbageCollection is set, the object will not be deleted immediately. Instead, a CollectingGarbage finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted. // SynchronousGarbageCollection and OrphanDependents are exclusive. - // SynchronousGarbageCollection default to false. + // SynchronousGarbageCollection defaults to false. // SynchronousGarbageCollection is cascading, i.e., the object’s dependents will be deleted with the same SynchronousGarbageCollection. SynchronousGarbageCollection *bool } ``` -We will introduce a new standard finalizer: const GCFinalizer string = “GarbageCollectionInProgress” +We will introduce a new standard finalizer: const GCFinalizer string = “CollectingGarbage” ## Components changes ### API Server -Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. +Delete() function needs to check the `DeleteOptions.SynchronousGarbageCollection`. -* The option is ignored if DeleteOptions.OrphanDependents is true or nil. +* The request is rejected with 400 if both `DeleteOptions.SynchronousGarbageCollection` and `DeleteOptions.OrphanDependents` are true (or default to true, see [DefaultGarbageCollectionPolicy](https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/registry/generic/registry/store.go#L500)). * If the option is set, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. ### Garbage Collector @@ -85,11 +91,15 @@ Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection. * treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. * when deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest. -* if an object has multiple owners, some owners still exit while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of those owners, `processItem()` has to remove the ownerReferences pointing to them. +* if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them. -**Handling circular dependencies** +## Handling circular dependencies -SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily detecting circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it searches the internal owner-dependency relationship graph (`uidToNode`) to check if the object and any of its owner are in a circle where all objects have the `GCFinalizer`. If so, it removes the `GCFinzlier` from the object to break the circle. +SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily breaking circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it removes the `GCFinalizer` from the object. + +Note that the approach is not rigorous and thus having false positives. For example, if a user first sends a SynchronousGC delete request for an object, then sends the delete request for its owner, then `processItem()` will be fooled to believe there is a circle. We expect user not to do this. We can make the circle detection more rigorous if needed. + +Circular dependencies are regarded as user error. If needed, we can add more guarantees to handle such cases later. ## Unhandled cases @@ -106,15 +116,19 @@ Finalizer breaks an assumption that many Kubernetes components have: a deletion **Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. These pods will be deleted from the key-value store when the pending finalizers are removed. -**Podgc** deletes terminated pods if there are too many of them in the cluster. `Podgc` should remove any pending finalizers to make sure the pods are deleted. +**Podgc** deletes terminated pods if there are too many of them in the cluster. We need to make sure finalizers on Pods are taken off quickly enough so that the progress of `Podgc` is not affected. -**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment shouldn't adopt it, because the RS controller will not scale up/down a RS that's being deleted. Hence, `deployment controller` needs to check if a RS is being deleted before adopting it. If the RS is being deleted, then the `deployment controller` should wait for the status of the RS to show 0 replicas to avoid creating extra pods, then create a new RS. +**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment should adopt it, take its pods into account, but shouldn't try to mutate it, because the RS controller will ignore a RS that's being deleted. Hence, `deployment controller` should wait for the RS to be deleted, and then create a new one. **Replication controller manager**, **Job controller**, and **ReplicaSet controller** ignore pods in terminated phase, so pods with pending finalizers will not block these controllers. **PetSet controller** will be blocked by a pod with pending finalizers, so Synchronous GC might slow down its progress. -**kubectl**: synchronous GC can replace the **kubectl delete** reapers. Currently `kubectl delete` blocks until all dependents and the owner are deleted. To maintain this behavior, after switched to using synchronous GC, *kubectl delete* needs to poll on the removal of the owner object. +**kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using the synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store. + +Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. + +Old kubectl is not affected, because `DeleteOptions.SynchronousGarbageCollection` defaults to false. ## Security implications From 7e2ebef2aa23545fd534d92b9bc45c45295e8c09 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Tue, 27 Sep 2016 16:26:39 -0700 Subject: [PATCH 06/12] API server makes generous assumption about DeleteOptions.OrphanDependents and DeleteOptions.SynchronousGarbageCollection; using only one queue; address kubectl backward compatibility. --- .../synchronous-garbage-collection.md | 33 ++++++++++--------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 13fba422390..b4124d28afd 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -33,7 +33,7 @@ Users of the server-side garbage collection need to determine if the garbage col * Currently `kubectl delete rc` blocks until all the pods are terminating. To convert to use server-side garbage collection, kubectl has to be able to determine if the garbage collection is done. * [#19701](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077) is a use case where the user needs to wait for all service dependencies garbage collected and their names released, before she recreates the dependencies. -We define the garbage collection as "done" when all the dependents are deleted from the key-value store, rather than merely in the terminating state. There are two reasons: *i)* for `Pod`s, the most usual garbage, only when they are deleted from the key-value store, we know kubelet has released their resources; *ii)* some users need to recreate objects with the same names, they need to wait for the old objects to be deleted from the key-value store. (This limitation is because we index objects by their names in the key-value store today.) +We define the garbage collection as "done" when all the dependents are deleted from the key-value store, rather than merely in the terminating state. There are two reasons: *i)* for `Pod`s, the most usual garbage, only when they are deleted from the key-value store, we know kubelet has released resources they occupy; *ii)* some users need to recreate objects with the same names, they need to wait for the old objects to be deleted from the key-value store. (This limitation is because we index objects by their names in the key-value store today.) Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism that allows user to determine if the garbage collection is done: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector. @@ -64,35 +64,34 @@ We will introduce a new standard finalizer: const GCFinalizer string = “Collec Delete() function needs to check the `DeleteOptions.SynchronousGarbageCollection`. -* The request is rejected with 400 if both `DeleteOptions.SynchronousGarbageCollection` and `DeleteOptions.OrphanDependents` are true (or default to true, see [DefaultGarbageCollectionPolicy](https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/registry/generic/registry/store.go#L500)). +* The request is rejected with 400 if both `DeleteOptions.SynchronousGarbageCollection` and `DeleteOptions.OrphanDependents` are true. +* If `DeleteOptions.SynchronousGarbageCollection` is explicitly set to true and `DeleteOptions.OrphanDependents` is nil, the API server will default `DeleteOptions.OrphanDependents` to false, regardless of the [default orphaning policy](https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/registry/generic/registry/store.go#L500) of the resource. * If the option is set, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. ### Garbage Collector **Modifications to processEvent()** -`processEvent()` manages GC’s internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to: +Currently `processEvent()` manages GC’s internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to: -* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `synchronousGC queue`. The object will be marked as “GC in progress” in `uidToNode`. -* Upon receiving the deletion event of an object, put its owner into the `synchronousGC queue`. This is to force the `GCFinalizer` (described next) to re-check if all dependents of the owner is deleted. - -**Addition of GCFinalizer() routine** - -* Pops an object from the `synchronousGC queue`. -* Ignores the object if it doesn’t exist in `uidToNode`, or if the object is not marked as “GC in progress” in `uidToNode`. -* To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). -* Checks if the object has dependents - * If not, send a PUT request to remove the `GCFinalizer` - * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. +* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `dirtyQueue`. The object will be marked as “GC in progress” in `uidToNode`. +* Upon receiving the deletion event of an object, put its owner into the `dirtyQueue` if the owner node is marked as "GC in progress". This is to force the `processItem()` (described next) to re-check if all dependents of the owner is deleted. **Modifications to processItem()** -`processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: +Currently `processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: * treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. * when deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest. * if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them. +In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially: + +* To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). +* Checks if the object has dependents + * If not, send a PUT request to remove the `GCFinalizer`; + * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. + ## Handling circular dependencies SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily breaking circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it removes the `GCFinalizer` from the object. @@ -128,7 +127,9 @@ Finalizer breaks an assumption that many Kubernetes components have: a deletion Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. -Old kubectl is not affected, because `DeleteOptions.SynchronousGarbageCollection` defaults to false. +To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds Synchronous GC is not supported by the master. + +Old kubectl is compatible with new master, because `DeleteOptions.SynchronousGarbageCollection` defaults to false. ## Security implications From 120363be72a2ae9203587cd9765dc936bbecd002 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Sun, 2 Oct 2016 22:41:07 -0700 Subject: [PATCH 07/12] add OwnerReference.BlockSynchronousGC --- .../synchronous-garbage-collection.md | 20 +++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index b4124d28afd..264102ddeb2 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -45,6 +45,7 @@ We need to make changes in the API, the API Server, and the garbage collector to ## API changes +**DeleteOptions** ```go DeleteOptions { … @@ -56,18 +57,33 @@ DeleteOptions { } ``` +**OwnerReference** +```go +OwnerReference { + ... + // Should the owner be deleted from the key-value store after this object is deleted during synchronous garbage collection. + // If the user creates the OwnerReference has delete permission of the owner, BlockSynchronousGC defaults to true; otherwise the field defaults to false, also 422 will be returned if the field is set to false. + BlockSynchronousGC *bool +} +``` + +The object will be garbage collected disregard for the value of `BlockSynchronousGC`, but if `BlockSynchronousGC` is false, then during a synchronous GC, the owner object can be deleted before this object is deleted. If the user who creates the ownerReference does not have the delete permission of the owner object, then he should not be able to affect the synchronous deletion of the owner, so this field must be set to false (forced by the API server). + +**Standard Finalizers** We will introduce a new standard finalizer: const GCFinalizer string = “CollectingGarbage” ## Components changes ### API Server -Delete() function needs to check the `DeleteOptions.SynchronousGarbageCollection`. +`Delete()` function needs to check the `DeleteOptions.SynchronousGarbageCollection`. * The request is rejected with 400 if both `DeleteOptions.SynchronousGarbageCollection` and `DeleteOptions.OrphanDependents` are true. * If `DeleteOptions.SynchronousGarbageCollection` is explicitly set to true and `DeleteOptions.OrphanDependents` is nil, the API server will default `DeleteOptions.OrphanDependents` to false, regardless of the [default orphaning policy](https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/registry/generic/registry/store.go#L500) of the resource. * If the option is set, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. +`validation.ValidateObjectMeta()` function needs to validate `OwnerReference.BlockSynchronousGC`. It needs to query the `Authorizer` to check if the user has delete permission of the owner object. + ### Garbage Collector **Modifications to processEvent()** @@ -88,7 +104,7 @@ Currently `processItem()` consumes the `dirtyQueue`, requests the API server to In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially: * To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). -* Checks if the object has dependents +* Checks if the object has dependents with `BlockSynchronousGC==true` * If not, send a PUT request to remove the `GCFinalizer`; * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. From 495a71e7098ccca3c3767d57c4fa31b2a35f3ad2 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Thu, 6 Oct 2016 21:37:09 -0700 Subject: [PATCH 08/12] adding alternative design --- .../synchronous-garbage-collection.md | 98 ++++++++++++++----- 1 file changed, 73 insertions(+), 25 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 264102ddeb2..557eac1f663 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -27,6 +27,29 @@ Documentation for other releases can be found at +**Table of Contents** + + +- [Overview](#overview) +- [Design I. Exposing synchronous garbage collection mode via DeleteOptions](#design-i-exposing-synchronous-garbage-collection-mode-via-deleteoptions) + - [API changes](#api-changes) + - [Components changes](#components-changes) + - [API Server](#api-server) + - [Garbage Collector](#garbage-collector) + - [Handling circular dependencies](#handling-circular-dependencies) + - [Unhandled cases](#unhandled-cases) + - [Implications to existing clients](#implications-to-existing-clients) + - [Security implications](#security-implications) +- [Design II: Exposing synchronous garbage collection mode via OwnerReferences](#design-ii-exposing-synchronous-garbage-collection-mode-via-ownerreferences) + - [API changes](#api-changes-1) + - [Components Changes](#components-changes-1) + - [API Server](#api-server-1) + - [Garbage Collector](#garbage-collector-1) + - [Controllers](#controllers) + - [Implications to existing clients](#implications-to-existing-clients-1) + + + # Overview Users of the server-side garbage collection need to determine if the garbage collection is done. For example: @@ -39,36 +62,25 @@ Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandle Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 -# Required code modification +# Design I. Exposing synchronous garbage collection mode via DeleteOptions We need to make changes in the API, the API Server, and the garbage collector to support synchronous garbage collection. ## API changes **DeleteOptions** + ```go DeleteOptions { … - // If SynchronousGarbageCollection is set, the object will not be deleted immediately. Instead, a CollectingGarbage finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted. - // SynchronousGarbageCollection and OrphanDependents are exclusive. - // SynchronousGarbageCollection defaults to false. - // SynchronousGarbageCollection is cascading, i.e., the object’s dependents will be deleted with the same SynchronousGarbageCollection. - SynchronousGarbageCollection *bool + // If DeleteAfterDependentsDeleted is set, the object will not be deleted immediately. Instead, a CollectingGarbage finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted. + // DeleteAfterDependentsDeleted and OrphanDependents are exclusive. + // DeleteAfterDependentsDeleted defaults to false. + // DeleteAfterDependentsDeleted is cascading, i.e., the object’s dependents will be deleted with the same DeleteAfterDependentsDeleted. + DeleteAfterDependentsDeleted *bool } ``` -**OwnerReference** -```go -OwnerReference { - ... - // Should the owner be deleted from the key-value store after this object is deleted during synchronous garbage collection. - // If the user creates the OwnerReference has delete permission of the owner, BlockSynchronousGC defaults to true; otherwise the field defaults to false, also 422 will be returned if the field is set to false. - BlockSynchronousGC *bool -} -``` - -The object will be garbage collected disregard for the value of `BlockSynchronousGC`, but if `BlockSynchronousGC` is false, then during a synchronous GC, the owner object can be deleted before this object is deleted. If the user who creates the ownerReference does not have the delete permission of the owner object, then he should not be able to affect the synchronous deletion of the owner, so this field must be set to false (forced by the API server). - **Standard Finalizers** We will introduce a new standard finalizer: const GCFinalizer string = “CollectingGarbage” @@ -76,14 +88,12 @@ We will introduce a new standard finalizer: const GCFinalizer string = “Collec ### API Server -`Delete()` function needs to check the `DeleteOptions.SynchronousGarbageCollection`. +`Delete()` function needs to check the `DeleteOptions.DeleteAfterDependentsDeleted`. -* The request is rejected with 400 if both `DeleteOptions.SynchronousGarbageCollection` and `DeleteOptions.OrphanDependents` are true. -* If `DeleteOptions.SynchronousGarbageCollection` is explicitly set to true and `DeleteOptions.OrphanDependents` is nil, the API server will default `DeleteOptions.OrphanDependents` to false, regardless of the [default orphaning policy](https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/registry/generic/registry/store.go#L500) of the resource. +* The request is rejected with 400 if both `DeleteOptions.DeleteAfterDependentsDeleted` and `DeleteOptions.OrphanDependents` are true. +* If `DeleteOptions.DeleteAfterDependentsDeleted` is explicitly set to true and `DeleteOptions.OrphanDependents` is nil, the API server will default `DeleteOptions.OrphanDependents` to false, regardless of the [default orphaning policy](https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/registry/generic/registry/store.go#L500) of the resource. * If the option is set, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. -`validation.ValidateObjectMeta()` function needs to validate `OwnerReference.BlockSynchronousGC`. It needs to query the `Authorizer` to check if the user has delete permission of the owner object. - ### Garbage Collector **Modifications to processEvent()** @@ -104,7 +114,7 @@ Currently `processItem()` consumes the `dirtyQueue`, requests the API server to In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially: * To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). -* Checks if the object has dependents with `BlockSynchronousGC==true` +* Checks if the object has dependents * If not, send a PUT request to remove the `GCFinalizer`; * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. @@ -145,12 +155,50 @@ Note that this **changes the behavior** of `kubectl delete`. The command will be To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds Synchronous GC is not supported by the master. -Old kubectl is compatible with new master, because `DeleteOptions.SynchronousGarbageCollection` defaults to false. +Old kubectl is compatible with new master, because `DeleteOptions.DeleteAfterDependentsDeleted` defaults to false. ## Security implications A user who is authorized to update one object can affect the synchronous GC behavior of another object. Specifically, by setting an object as a pod's owner, and setting a very long grace termination period for the pod, a user can make the synchronous GC of the owner to take long time. +# Design II: Exposing synchronous garbage collection mode via OwnerReferences + +Instead of letting the user who issues the delete request decide whether invoking synchronous garbage collection, this design leaves the decision to the creator of the ownerReferences. The benefit is that we can do proper permission check to mitigate the [security concern](#security-implications) in design I. + +## API changes + +```go +OwnerReference { + ... + // If true, the owner cannot be deleted from the key-value store until this reference is removed. + // Defaults to false. + // To set this field, a user needs "update" and "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned. + // If a user sets this field to true, she also needs to add the "CollectingGarbage" finalizer to the owner at any time before its deletion, otherwise its deletion will not be blocked. + BlockOwnerDeletion *bool +} +``` + +Note that setting `BlockOwnerDeletion` alone is not enough, user also needs to add the "CollectingGarbage" finalizer to the owner. Considering most ownerReferences are set by controllers (e.g., replicaset controller), we think the burden is acceptable. + +## Components Changes + +### API Server + +When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "update" and "delete" permission of the owner object. It returns 422 if the user does not have the permissions. + +### Garbage Collector + +Required changes are mostly the same as in Design I. One difference is that `processItem()` should check if any ownerReference pointing to the owner has `BlockOwnerDeletion==true`. If not, it sends a PUT request to remove the `GCFinalizer`. + +### Controllers + +To utilize the Synchronous Garbage Collection feature, controllers (e.g., replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods). They also need to add the "CollectingGarbage" finalizer to the owner object (e.g., replicaset). + +## Implications to existing clients + +The implications are mostly the same as Design I. One difference is that it causes behavior change of old version `kubectl delete`. Old version kubectl issues the delete request for the owner object after all dependent objects are terminating. An old version API server will delete the owner from the key-value store immediately, but a new version API server will keep the owner object around until all dependents are deleted. This can be solved by making API server remove the "CollectingGarbage" finalizer if the deletion request is issued by an old version kubectl. + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/synchronous-garbage-collection.md?pixel)]() From 9b77f7f97bd1963e9ca6cb1e8fdc9367d3539cd6 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Mon, 10 Oct 2016 20:27:35 -0700 Subject: [PATCH 09/12] Use DeleteOptions.GarbageCollectionPolicy and OwnerReference.BlockOwnerDeletion --- .../synchronous-garbage-collection.md | 160 +++++++++--------- 1 file changed, 78 insertions(+), 82 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 557eac1f663..672b4ef7279 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -31,22 +31,17 @@ Documentation for other releases can be found at - [Overview](#overview) -- [Design I. Exposing synchronous garbage collection mode via DeleteOptions](#design-i-exposing-synchronous-garbage-collection-mode-via-deleteoptions) - - [API changes](#api-changes) - - [Components changes](#components-changes) - - [API Server](#api-server) - - [Garbage Collector](#garbage-collector) - - [Handling circular dependencies](#handling-circular-dependencies) - - [Unhandled cases](#unhandled-cases) - - [Implications to existing clients](#implications-to-existing-clients) - - [Security implications](#security-implications) -- [Design II: Exposing synchronous garbage collection mode via OwnerReferences](#design-ii-exposing-synchronous-garbage-collection-mode-via-ownerreferences) - - [API changes](#api-changes-1) - - [Components Changes](#components-changes-1) - - [API Server](#api-server-1) - - [Garbage Collector](#garbage-collector-1) - - [Controllers](#controllers) - - [Implications to existing clients](#implications-to-existing-clients-1) +- [API Design](#api-design) + - [Standard Finalizers](#standard-finalizers) + - [OwnerReference](#ownerreference) + - [DeleteOptions](#deleteoptions) +- [Components changes](#components-changes) + - [API Server](#api-server) + - [Garbage Collector](#garbage-collector) + - [Controllers](#controllers) +- [Handling circular dependencies](#handling-circular-dependencies) +- [Unhandled cases](#unhandled-cases) +- [Implications to existing clients](#implications-to-existing-clients) @@ -62,39 +57,78 @@ Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandle Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 -# Design I. Exposing synchronous garbage collection mode via DeleteOptions +# API Design -We need to make changes in the API, the API Server, and the garbage collector to support synchronous garbage collection. +## Standard Finalizers -## API changes +We will introduce a new standard finalizer: -**DeleteOptions** +```go +const GCFinalizer string = “CollectingGarbage” +``` + +This finalizer indicates the object is terminating and is waiting for its dependents whose `OwnerReference.BlockOwnerDeletion` is true get deleted. + +## OwnerReference + +```go +OwnerReference { + ... + // If true, AND if the owner has the "CollectingGarbage" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed. + // Defaults to false. + // To set this field, a user needs "update" and "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned. + BlockOwnerDeletion *bool +} +``` + +The initial draft of the proposal did not include this field and it had a security loophole: a user who is only authorized to update one resource can set ownerReference to block the synchronous GC of other resources. Requiring users to explicitly set `BlockOwnerDeletion` allows the master to properly authorize the request. + +## DeleteOptions ```go DeleteOptions { … - // If DeleteAfterDependentsDeleted is set, the object will not be deleted immediately. Instead, a CollectingGarbage finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted. - // DeleteAfterDependentsDeleted and OrphanDependents are exclusive. - // DeleteAfterDependentsDeleted defaults to false. - // DeleteAfterDependentsDeleted is cascading, i.e., the object’s dependents will be deleted with the same DeleteAfterDependentsDeleted. - DeleteAfterDependentsDeleted *bool + // Whether and how garbage collection will be performed. + // Defaults to GarbageCollectionDefault + GarbageCollectionPolicy GarbageCollectionPolicy } + +type GarabgeCollectionPolicy string + +const ( + // Respects the existing garbage collection related finalizers on the object and the default garbage collection policy of the resource. + GarbageCollectionDefault GarbageCollectionPolicy = "Default" + // Orphans the dependents + GarbageCollectionOrphan GarbageCollectionPolicy = "Orphan" + // Deletes the object from the key-value store, the garbage collector will delete the dependents in the background. + GarbageCollectionAsynchronous GarbageCollectionPolicy = "AsynchronousGarbageCollection" + // The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store. + // API sever will put the "CollectingGarbage" finalizer on the object, and sets its deletionTimestamp. + // This policy is cascading, i.e., the dependents will be deleted with GarbageCollectionSynchronous. + GarbageCollectionSynchronous GarbageCollectionPolicy = "SynchronousGarbageCollection" +) ``` -**Standard Finalizers** -We will introduce a new standard finalizer: const GCFinalizer string = “CollectingGarbage” +`DeleteOptions.OrphanDependents *bool` will be removed. We decided not to add a `DeleteOptions.SynchronousGC *bool`, because together with `DeleteOptions.OrphanDependents` it results in 9 possible combinations and is thus confusing. This is a breaking change, so the new DeleteOptions will live in `api.k8s.io` group with version `v1`, as proposed in [#33900](https://github.com/kubernetes/kubernetes/pull/33900). -## Components changes +The conversion rules are described in the following table: -### API Server +| 1.5 | pre 1.4/1.4 | +|-------------------------------|--------------------------| +| GarbageCollectionDefault | OrphanDependents==nil | +| GarbageCollectionOrphan | *OrphanDependents==true | +| GarbageCollectionAsynchronous | *OrphanDependents==false | +| GarbageCollectionSynchronous | N/A | -`Delete()` function needs to check the `DeleteOptions.DeleteAfterDependentsDeleted`. +# Components changes -* The request is rejected with 400 if both `DeleteOptions.DeleteAfterDependentsDeleted` and `DeleteOptions.OrphanDependents` are true. -* If `DeleteOptions.DeleteAfterDependentsDeleted` is explicitly set to true and `DeleteOptions.OrphanDependents` is nil, the API server will default `DeleteOptions.OrphanDependents` to false, regardless of the [default orphaning policy](https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/registry/generic/registry/store.go#L500) of the resource. -* If the option is set, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. +## API Server -### Garbage Collector +`Delete()` function checks `DeleteOptions.GarbageCollectionPolicy`. If the policy is `GarbageCollectionSynchronous`, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. + +When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "update" and "delete" permission of the owner object. It returns 422 if the user does not have the permissions but intends to set `OwnerReference.BlockOwnerDeletion` to true. + +## Garbage Collector **Modifications to processEvent()** @@ -108,7 +142,7 @@ Currently `processEvent()` manages GC’s internal owner-dependency relationship Currently `processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: * treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. -* when deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest. +* when deleting dependents, if the owner's finalizers include `CollectingGarbage`, it should use the `GarbageCollectionSynchronous` as GC policy. * if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them. In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially: @@ -118,7 +152,11 @@ In addition, if an object popped from `dirtyQueue` is marked as "GC in progress" * If not, send a PUT request to remove the `GCFinalizer`; * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. -## Handling circular dependencies +## Controllers + +To utilize the Synchronous Garbage Collection feature, controllers (e.g., the replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods). + +# Handling circular dependencies SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily breaking circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it removes the `GCFinalizer` from the object. @@ -126,12 +164,11 @@ Note that the approach is not rigorous and thus having false positives. For exam Circular dependencies are regarded as user error. If needed, we can add more guarantees to handle such cases later. -## Unhandled cases +# Unhandled cases * If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). - -## Implications to existing clients +# Implications to existing clients Finalizer breaks an assumption that many Kubernetes components have: a deletion request with `grace period=0` will immediately remove the object from the key-value store. This is not true if an object has pending finalizers, the object will continue to exist, and currently the API server will not return an error in this case. @@ -155,50 +192,9 @@ Note that this **changes the behavior** of `kubectl delete`. The command will be To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds Synchronous GC is not supported by the master. -Old kubectl is compatible with new master, because `DeleteOptions.DeleteAfterDependentsDeleted` defaults to false. - -## Security implications - -A user who is authorized to update one object can affect the synchronous GC behavior of another object. Specifically, by setting an object as a pod's owner, and setting a very long grace termination period for the pod, a user can make the synchronous GC of the owner to take long time. - -# Design II: Exposing synchronous garbage collection mode via OwnerReferences - -Instead of letting the user who issues the delete request decide whether invoking synchronous garbage collection, this design leaves the decision to the creator of the ownerReferences. The benefit is that we can do proper permission check to mitigate the [security concern](#security-implications) in design I. - -## API changes - -```go -OwnerReference { - ... - // If true, the owner cannot be deleted from the key-value store until this reference is removed. - // Defaults to false. - // To set this field, a user needs "update" and "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned. - // If a user sets this field to true, she also needs to add the "CollectingGarbage" finalizer to the owner at any time before its deletion, otherwise its deletion will not be blocked. - BlockOwnerDeletion *bool -} -``` - -Note that setting `BlockOwnerDeletion` alone is not enough, user also needs to add the "CollectingGarbage" finalizer to the owner. Considering most ownerReferences are set by controllers (e.g., replicaset controller), we think the burden is acceptable. - -## Components Changes - -### API Server - -When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "update" and "delete" permission of the owner object. It returns 422 if the user does not have the permissions. - -### Garbage Collector - -Required changes are mostly the same as in Design I. One difference is that `processItem()` should check if any ownerReference pointing to the owner has `BlockOwnerDeletion==true`. If not, it sends a PUT request to remove the `GCFinalizer`. - -### Controllers - -To utilize the Synchronous Garbage Collection feature, controllers (e.g., replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods). They also need to add the "CollectingGarbage" finalizer to the owner object (e.g., replicaset). - -## Implications to existing clients - -The implications are mostly the same as Design I. One difference is that it causes behavior change of old version `kubectl delete`. Old version kubectl issues the delete request for the owner object after all dependent objects are terminating. An old version API server will delete the owner from the key-value store immediately, but a new version API server will keep the owner object around until all dependents are deleted. This can be solved by making API server remove the "CollectingGarbage" finalizer if the deletion request is issued by an old version kubectl. - +1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `GarbageCollectionAsynchronous` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same. +Pre 1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, so does the 1.4 `kubectl delete` for resources other than rc and rs. The option is going to be converted to `GarbageCollectionDefault` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master. [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/synchronous-garbage-collection.md?pixel)]() From 214bfecc6ff3892ab6899dedbdf2222cd0643b6a Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Tue, 11 Oct 2016 19:55:42 -0700 Subject: [PATCH 10/12] addressing comment; rename user facing "Garbage Collection" --- .../synchronous-garbage-collection.md | 42 +++++++++---------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 672b4ef7279..0969a59b057 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -64,7 +64,7 @@ Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 We will introduce a new standard finalizer: ```go -const GCFinalizer string = “CollectingGarbage” +const GCFinalizer string = “DeletingDependents” ``` This finalizer indicates the object is terminating and is waiting for its dependents whose `OwnerReference.BlockOwnerDeletion` is true get deleted. @@ -74,7 +74,7 @@ This finalizer indicates the object is terminating and is waiting for its depend ```go OwnerReference { ... - // If true, AND if the owner has the "CollectingGarbage" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed. + // If true, AND if the owner has the "DeletingDependents" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed. // Defaults to false. // To set this field, a user needs "update" and "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned. BlockOwnerDeletion *bool @@ -89,42 +89,42 @@ The initial draft of the proposal did not include this field and it had a securi DeleteOptions { … // Whether and how garbage collection will be performed. - // Defaults to GarbageCollectionDefault - GarbageCollectionPolicy GarbageCollectionPolicy + // Defaults to DefaultPropagationPolicy + DeletionPropagationPolicy *DeletionPropagationPolicy } -type GarabgeCollectionPolicy string +type DeletionPropagationPolicy string const ( // Respects the existing garbage collection related finalizers on the object and the default garbage collection policy of the resource. - GarbageCollectionDefault GarbageCollectionPolicy = "Default" + DefaultPropagationPolicy DeletionPropagationPolicy = "Default" // Orphans the dependents - GarbageCollectionOrphan GarbageCollectionPolicy = "Orphan" + OrphanDependents DeletionPropagationPolicy = "Orphan" // Deletes the object from the key-value store, the garbage collector will delete the dependents in the background. - GarbageCollectionAsynchronous GarbageCollectionPolicy = "AsynchronousGarbageCollection" + DeleteDependentsInBackground DeletionPropagationPolicy = "DeleteDependentsInBackground" // The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store. - // API sever will put the "CollectingGarbage" finalizer on the object, and sets its deletionTimestamp. + // API sever will put the "DeletingDependents" finalizer on the object, and sets its deletionTimestamp. // This policy is cascading, i.e., the dependents will be deleted with GarbageCollectionSynchronous. - GarbageCollectionSynchronous GarbageCollectionPolicy = "SynchronousGarbageCollection" + DeleteAfterDependentsAreDeleted DeletionPropagationPolicy = "DeleteAfterDependentsAreDeleted" ) ``` -`DeleteOptions.OrphanDependents *bool` will be removed. We decided not to add a `DeleteOptions.SynchronousGC *bool`, because together with `DeleteOptions.OrphanDependents` it results in 9 possible combinations and is thus confusing. This is a breaking change, so the new DeleteOptions will live in `api.k8s.io` group with version `v1`, as proposed in [#33900](https://github.com/kubernetes/kubernetes/pull/33900). +`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `DeletionPropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing. The conversion rules are described in the following table: -| 1.5 | pre 1.4/1.4 | -|-------------------------------|--------------------------| -| GarbageCollectionDefault | OrphanDependents==nil | -| GarbageCollectionOrphan | *OrphanDependents==true | -| GarbageCollectionAsynchronous | *OrphanDependents==false | -| GarbageCollectionSynchronous | N/A | +| 1.5 | pre 1.4/1.4 | +|----------------------------------|--------------------------| +| DefaultPropagationPolicy | OrphanDependents==nil | +| OrphanDependents | *OrphanDependents==true | +| DeleteDependentsInBackground | *OrphanDependents==false | +| DeleteAfterDependentsAreDeleted | N/A | # Components changes ## API Server -`Delete()` function checks `DeleteOptions.GarbageCollectionPolicy`. If the policy is `GarbageCollectionSynchronous`, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. +`Delete()` function checks `DeleteOptions.DeletionPropagationPolicy`. If the policy is `GarbageCollectionSynchronous`, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "update" and "delete" permission of the owner object. It returns 422 if the user does not have the permissions but intends to set `OwnerReference.BlockOwnerDeletion` to true. @@ -142,7 +142,7 @@ Currently `processEvent()` manages GC’s internal owner-dependency relationship Currently `processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: * treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. -* when deleting dependents, if the owner's finalizers include `CollectingGarbage`, it should use the `GarbageCollectionSynchronous` as GC policy. +* when deleting dependents, if the owner's finalizers include `DeletingDependents`, it should use the `GarbageCollectionSynchronous` as GC policy. * if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them. In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially: @@ -192,9 +192,9 @@ Note that this **changes the behavior** of `kubectl delete`. The command will be To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds Synchronous GC is not supported by the master. -1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `GarbageCollectionAsynchronous` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same. +1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `DeleteDependentsInBackground` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same. -Pre 1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, so does the 1.4 `kubectl delete` for resources other than rc and rs. The option is going to be converted to `GarbageCollectionDefault` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master. +Pre 1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, so does the 1.4 `kubectl delete` for resources other than rc and rs. The option is going to be converted to `DefaultPropagationPolicy` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master. [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/synchronous-garbage-collection.md?pixel)]() From 5880a6db2edaeb18f75bbafd37fb6c226bcb4014 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Wed, 12 Oct 2016 13:20:11 -0700 Subject: [PATCH 11/12] fixing nits; adding --wait to kubetl delete --- .../synchronous-garbage-collection.md | 50 ++++++++++--------- 1 file changed, 26 insertions(+), 24 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 0969a59b057..7a00d7cc938 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -76,7 +76,7 @@ OwnerReference { ... // If true, AND if the owner has the "DeletingDependents" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed. // Defaults to false. - // To set this field, a user needs "update" and "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned. + // To set this field, a user needs "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned. BlockOwnerDeletion *bool } ``` @@ -90,43 +90,45 @@ DeleteOptions { … // Whether and how garbage collection will be performed. // Defaults to DefaultPropagationPolicy - DeletionPropagationPolicy *DeletionPropagationPolicy + DeletePropagationPolicy *DeletePropagationPolicy } -type DeletionPropagationPolicy string +type DeletePropagationPolicy string const ( // Respects the existing garbage collection related finalizers on the object and the default garbage collection policy of the resource. - DefaultPropagationPolicy DeletionPropagationPolicy = "Default" + DefaultPropagationPolicy DeletePropagationPolicy = "Default" // Orphans the dependents - OrphanDependents DeletionPropagationPolicy = "Orphan" + OrphanDependents DeletePropagationPolicy = "Orphan" // Deletes the object from the key-value store, the garbage collector will delete the dependents in the background. - DeleteDependentsInBackground DeletionPropagationPolicy = "DeleteDependentsInBackground" + DeleteDependentsInBackground DeletePropagationPolicy = "DeleteDependentsInBackground" // The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store. // API sever will put the "DeletingDependents" finalizer on the object, and sets its deletionTimestamp. // This policy is cascading, i.e., the dependents will be deleted with GarbageCollectionSynchronous. - DeleteAfterDependentsAreDeleted DeletionPropagationPolicy = "DeleteAfterDependentsAreDeleted" + DeleteAfterBlockingDependentsAreDeleted DeletePropagationPolicy = "DeleteAfterBlockingDependentsAreDeleted" ) ``` -`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `DeletionPropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing. +The `DeleteAfterBlockingDependentsAreDeleted` policy represents the synchronous GC mode. + +`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `DeletePropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing. The conversion rules are described in the following table: -| 1.5 | pre 1.4/1.4 | -|----------------------------------|--------------------------| -| DefaultPropagationPolicy | OrphanDependents==nil | -| OrphanDependents | *OrphanDependents==true | -| DeleteDependentsInBackground | *OrphanDependents==false | -| DeleteAfterDependentsAreDeleted | N/A | +| 1.5 | pre 1.4/1.4 | +|------------------------------------------|--------------------------| +| DefaultPropagationPolicy | OrphanDependents==nil | +| OrphanDependents | *OrphanDependents==true | +| DeleteDependentsInBackground | *OrphanDependents==false | +| DeleteAfterBlockingDependentsAreDeleted | N/A | # Components changes ## API Server -`Delete()` function checks `DeleteOptions.DeletionPropagationPolicy`. If the policy is `GarbageCollectionSynchronous`, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. +`Delete()` function checks `DeleteOptions.DeletePropagationPolicy`. If the policy is `DeleteAfterBlockingDependentsAreDeleted`, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. -When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "update" and "delete" permission of the owner object. It returns 422 if the user does not have the permissions but intends to set `OwnerReference.BlockOwnerDeletion` to true. +When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "delete" permission of the owner object. It returns 422 if the user does not have the permissions but intends to set `OwnerReference.BlockOwnerDeletion` to true. ## Garbage Collector @@ -141,7 +143,7 @@ Currently `processEvent()` manages GC’s internal owner-dependency relationship Currently `processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: -* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store. +* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise synchronous GC will not progress because the owner keeps existing in the key-value store. * when deleting dependents, if the owner's finalizers include `DeletingDependents`, it should use the `GarbageCollectionSynchronous` as GC policy. * if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them. @@ -154,7 +156,7 @@ In addition, if an object popped from `dirtyQueue` is marked as "GC in progress" ## Controllers -To utilize the Synchronous Garbage Collection feature, controllers (e.g., the replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods). +To utilize the synchronous garbage collection feature, controllers (e.g., the replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods). # Handling circular dependencies @@ -166,7 +168,7 @@ Circular dependencies are regarded as user error. If needed, we can add more gua # Unhandled cases -* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). +* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, synchronous GC is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). # Implications to existing clients @@ -174,7 +176,7 @@ Finalizer breaks an assumption that many Kubernetes components have: a deletion **Namespace controller** suffered from this [problem](https://github.com/kubernetes/kubernetes/issues/32519) and was fixed in [#32524](https://github.com/kubernetes/kubernetes/pull/32524) by retrying every 15s if there are objects with pending finalizers to be removed from the key-value store. Object with pending `GCFinalizer` might take arbitrary long time be deleted, so namespace deletion might time out. -**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](../../pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes that if the API server does not return an error, the pod is removed from the key-value store. Breaking the assumption will not break `kubelet` though, because the `pod` must have already been in the terminated `phase`, `kubelet` will not care to manage it. +**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](../../pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes that if the API server does not return an error, the pod is removed from the key-value store. Breaking the assumption will not break `kubelet` though, because the `pod` must have already been in the terminated phase, `kubelet` will not care to manage it. **Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. These pods will be deleted from the key-value store when the pending finalizers are removed. @@ -184,13 +186,13 @@ Finalizer breaks an assumption that many Kubernetes components have: a deletion **Replication controller manager**, **Job controller**, and **ReplicaSet controller** ignore pods in terminated phase, so pods with pending finalizers will not block these controllers. -**PetSet controller** will be blocked by a pod with pending finalizers, so Synchronous GC might slow down its progress. +**PetSet controller** will be blocked by a pod with pending finalizers, so synchronous GC might slow down its progress. -**kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using the synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store. +**kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store. -Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. +Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. To allow kubectl user not waiting for the cleanup, we will add a `--wait` flag. It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `DeletePropagationPolicy=DeleteDependentsInBackground` and return immediately. -To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds Synchronous GC is not supported by the master. +To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds synchronous GC is not supported by the master. 1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `DeleteDependentsInBackground` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same. From fb81c1e5730d7e39f1f826140e2c215d372293fd Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Wed, 12 Oct 2016 14:22:42 -0700 Subject: [PATCH 12/12] change the field name to PropagationPolicy --- .../synchronous-garbage-collection.md | 35 ++++++++++--------- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index 7a00d7cc938..ef2108140e3 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -89,44 +89,45 @@ The initial draft of the proposal did not include this field and it had a securi DeleteOptions { … // Whether and how garbage collection will be performed. - // Defaults to DefaultPropagationPolicy - DeletePropagationPolicy *DeletePropagationPolicy + // Defaults to DeletePropagationDefault + // Either this field or OrphanDependents may be set, but not both. + PropagationPolicy *DeletePropagationPolicy } type DeletePropagationPolicy string const ( - // Respects the existing garbage collection related finalizers on the object and the default garbage collection policy of the resource. - DefaultPropagationPolicy DeletePropagationPolicy = "Default" + // The default depends on the existing finalizers on the object and the type of the object. + DeletePropagationDefault DeletePropagationPolicy = "DeletePropagationDefault" // Orphans the dependents - OrphanDependents DeletePropagationPolicy = "Orphan" + DeletePropagationOrphan DeletePropagationPolicy = "DeletePropagationOrphan" // Deletes the object from the key-value store, the garbage collector will delete the dependents in the background. - DeleteDependentsInBackground DeletePropagationPolicy = "DeleteDependentsInBackground" + DeletePropagationBackground DeletePropagationPolicy = "DeletePropagationBackground" // The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store. // API sever will put the "DeletingDependents" finalizer on the object, and sets its deletionTimestamp. // This policy is cascading, i.e., the dependents will be deleted with GarbageCollectionSynchronous. - DeleteAfterBlockingDependentsAreDeleted DeletePropagationPolicy = "DeleteAfterBlockingDependentsAreDeleted" + DeletePropagationForeground DeletePropagationPolicy = "DeletePropagationForeground" ) ``` -The `DeleteAfterBlockingDependentsAreDeleted` policy represents the synchronous GC mode. +The `DeletePropagationForeground` policy represents the synchronous GC mode. -`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `DeletePropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing. +`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `PropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing. The conversion rules are described in the following table: | 1.5 | pre 1.4/1.4 | |------------------------------------------|--------------------------| -| DefaultPropagationPolicy | OrphanDependents==nil | -| OrphanDependents | *OrphanDependents==true | -| DeleteDependentsInBackground | *OrphanDependents==false | -| DeleteAfterBlockingDependentsAreDeleted | N/A | +| DeletePropagationDefault | OrphanDependents==nil | +| DeletePropagationOrphan | *OrphanDependents==true | +| DeletePropagationBackground | *OrphanDependents==false | +| DeletePropagationForeground | N/A | # Components changes ## API Server -`Delete()` function checks `DeleteOptions.DeletePropagationPolicy`. If the policy is `DeleteAfterBlockingDependentsAreDeleted`, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`. +`Delete()` function checks `DeleteOptions.PropagationPolicy`. If the policy is `DeletePropagationForeground`, the API server will update the object instead of deleting it, add the "DeletingDependents" finalizer, remove the "OrphanDependents" finalizer if it's present, and set the `ObjectMeta.DeletionTimestamp`. When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "delete" permission of the owner object. It returns 422 if the user does not have the permissions but intends to set `OwnerReference.BlockOwnerDeletion` to true. @@ -190,13 +191,13 @@ Finalizer breaks an assumption that many Kubernetes components have: a deletion **kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store. -Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. To allow kubectl user not waiting for the cleanup, we will add a `--wait` flag. It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `DeletePropagationPolicy=DeleteDependentsInBackground` and return immediately. +Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. To allow kubectl user not waiting for the cleanup, we will add a `--wait` flag. It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `PropagationPolicy=DeletePropagationBackground` and return immediately. To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds synchronous GC is not supported by the master. -1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `DeleteDependentsInBackground` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same. +1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `DeletePropagationBackground` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same. -Pre 1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, so does the 1.4 `kubectl delete` for resources other than rc and rs. The option is going to be converted to `DefaultPropagationPolicy` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master. +Pre 1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, so does the 1.4 `kubectl delete` for resources other than rc and rs. The option is going to be converted to `DeletePropagationDefault` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master. [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/synchronous-garbage-collection.md?pixel)]()