diff --git a/CHANGELOG.md b/CHANGELOG.md
index d38902a2556..fce9fc111fe 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,8 @@
- [Downloads](#downloads)
- [Highlights](#highlights)
- [Known Issues and Important Steps before Upgrading](#known-issues-and-important-steps-before-upgrading)
+ - [ThirdPartyResource](#thirdpartyresource)
+ - [kubectl](#kubectl)
- [kubernetes Core Known Issues](#kubernetes-core-known-issues)
- [Docker runtime Known Issues](#docker-runtime-known-issues)
- [Rkt runtime Known Issues](#rkt-runtime-known-issues)
diff --git a/docs/proposals/rescheduling.md b/docs/proposals/rescheduling.md
new file mode 100644
index 00000000000..91289335b96
--- /dev/null
+++ b/docs/proposals/rescheduling.md
@@ -0,0 +1,522 @@
+PLEASE NOTE: This document applies to the HEAD of the source tree.
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+
+# Controlled Rescheduling in Kubernetes
+
+## Overview
+
+Although the Kubernetes scheduler(s) try to make good placement decisions for pods,
+conditions in the cluster change over time (e.g. jobs finish and new pods arrive; nodes
+are removed due to failures, planned maintenance, or auto-scale-down; nodes appear due
+to recovery after a failure, re-joining after maintenance, auto-scale-up, or the addition
+of new hardware to a bare-metal cluster), and schedulers are not omniscient (e.g. there
+are some interactions between pods, or between pods and nodes, that they cannot
+predict). As a result, the node initially selected for a pod may turn out to be a bad
+match, from the perspective of the pod and/or the cluster as a whole, at some point after
+the pod has started running.
+
+Today (Kubernetes version 1.2) once a pod is scheduled to a node, it never moves unless
+it terminates on its own, is deleted by the user, or experiences some unplanned event
+(e.g. the node where it is running dies). Thus in a cluster with long-running pods, the
+assignment of pods to nodes degrades over time, no matter how good an initial scheduling
+decision the scheduler makes. This observation motivates "controlled rescheduling," a
+mechanism by which Kubernetes will "move" already-running pods over time to improve their
+placement. Controlled rescheduling is the subject of this proposal.
+
+Note that the term "move" is not technically accurate -- the mechanism used is that
+Kubernetes will terminate a pod that is managed by a controller, and the controller will
+create a replacement pod that is then scheduled by the pod's scheduler. The terminated
+pod and replacement pod are completely separate pods, and no pod migration is
+implied. However, describing the process as "moving" the pod is approximately accurate
+and easier to understand, so we will use this terminology in the document.
+
+We use the term "rescheduling" to describe any action the system takes to move an
+already-running pod. The decision may be made and executed by any component; we will
+introduce the concept of a "rescheduler" component later, but it is not the only
+component that can do rescheduling.
+
+This proposal primarily focuses on the architecture and features/mechanisms used to
+achieve rescheduling, and only briefly discusses example policies. We expect that community
+experimentation will lead to a significantly better understanding of the range, potential,
+and limitations of rescheduling policies.
+
+## Example use cases
+
+Example use cases for rescheduling are
+
+* moving a running pod onto a node that better satisfies its scheduling criteria
+ * moving a pod onto an under-utilized node
+ * moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences
+* moving a running pod off of a node in anticipation of a known or speculated future event
+ * draining a node in preparation for maintenance, decommissioning, auto-scale-down, etc.
+ * "preempting" a running pod to make room for a pending pod to schedule
+ * proactively/speculatively making room for large and/or exclusive pods to facilitate
+ fast scheduling in the future (often called "defragmentation")
+ * (note that these last two cases are the only use cases where the first-order intent
+ is to move a pod specifically for the benefit of another pod)
+* moving a running pod off of a node from which it is receiving poor service
+ * anomalous crashlooping or other mysterious incompatibility between the pod and the node
+ * repeated out-of-resource killing (see #18724)
+ * repeated attempts by the scheduler to schedule the pod onto some node, but it is
+ rejected by Kubelet admission control due to incomplete scheduler knowledge
+ * poor performance due to interference from other containers on the node (CPU hogs,
+ cache thrashers, etc.) (note that in this case there is a choice of moving the victim
+ or the aggressor)
+
+## Some axes of the design space
+
+Among the key design decisions are
+
+* how does a pod specify its tolerance for these system-generated disruptions, and how
+ does the system enforce such disruption limits
+* for each use case, where is the decision made about when and which pods to reschedule
+ (controllers, schedulers, an entirely new component e.g. "rescheduler", etc.)
+* rescheduler design issues:
+ * how much does a rescheduler need to know about pods' schedulers' policies
+ * how does the rescheduler specify its rescheduling requests/decisions (e.g. just as an
+ eviction, an eviction with a hint about where to reschedule, or as an eviction paired
+ with a specific binding), and how does the system implement these requests
+ * does the rescheduler take into account the second-order effects of decisions
+ (e.g. whether an evicted pod will reschedule, or will cause a preemption when it
+ reschedules)
+ * does the rescheduler execute multi-step plans (e.g. evict two pods at the same time
+ with the intent of moving one into the space vacated by the other, or even more
+ complex plans)
+
+Additional musings on the rescheduling design space can be found [here](rescheduler.md).
+
+## Design proposal
+
+The key mechanisms and components of the proposed design are priority, preemption,
+disruption budgets, the `/evict` subresource, and the rescheduler.
+
+### Priority
+
+#### Motivation
+
+Just as it is useful to overcommit nodes to increase node-level utilization, it is useful
+to overcommit clusters to increase cluster-level utilization. Scheduling priority (which
+we abbreviate as *priority*), in combination with disruption budgets (described in the
+next section), allows Kubernetes to safely overcommit clusters much as QoS levels allow
+it to safely overcommit nodes.
+
+Today, cluster sharing among users, workload types, etc. is regulated via the
+[quota](../admin/resourcequota/README.md) mechanism. When allocating quota, a cluster
+administrator has two choices: (1) the sum of the quotas is less than or equal to the
+capacity of the cluster, or (2) the sum of the quotas is greater than the capacity of the
+cluster (that is, the cluster is overcommitted). (1) is likely to lead to cluster
+under-utilization, while (2) is unsafe in the sense that someone's pods may go pending
+indefinitely even though they are still within their quota. Priority makes cluster
+overcommitment (i.e. case (2)) safe by allowing users and/or administrators to identify
+which pods should be allowed to run, and which should go pending, when demand for cluster
+resources exceeds supply due to cluster overcommitment.
+
+Priority is also useful in some special-case scenarios, such as ensuring that system
+DaemonSets can always schedule and reschedule onto every node where they want to run
+(assuming they are given the highest priority), e.g. see #21767.
+
+#### Specifying priorities
+
+We propose to add a required `Priority` field to `PodSpec`. Its value type is string, and
+the cluster administrator defines a total ordering on these strings (for example
+`Critical`, `Normal`, `Preemptible`). We choose string instead of integer so that it is
+easy for an administrator to add new priority levels in between existing levels, to
+encourage thinking about priority in terms of user intent and avoid magic numbers, and to
+make the internal implementation more flexible.
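+
+To make this concrete, here is a minimal sketch of the field and an admission-time
+validity check; the `knownPriorities` ordering and all names here are illustrative
+assumptions, not a settled API:
+
+```go
+package main
+
+import "fmt"
+
+// Illustrative fragment of PodSpec with the proposed required Priority field.
+type PodSpec struct {
+    Priority string // e.g. "Critical", "Normal", "Preemptible"
+    // ... existing PodSpec fields elided ...
+}
+
+// knownPriorities is a hypothetical administrator-defined total ordering,
+// highest first. It could live as constants in the API or in a ConfigMap
+// consumed by all schedulers; rank 0 is the highest priority.
+var knownPriorities = map[string]int{
+    "Critical":    0,
+    "Normal":      1,
+    "Preemptible": 2,
+}
+
+// validatePriority is the kind of check the API server or an admission
+// controller could perform to catch misspelled priority names.
+func validatePriority(spec PodSpec) error {
+    if _, ok := knownPriorities[spec.Priority]; !ok {
+        return fmt.Errorf("unknown priority %q", spec.Priority)
+    }
+    return nil
+}
+
+// atLeast reports whether priority a is the same as or higher than b -- the
+// comparison a scheduler would use when deciding whether one pod may preempt another.
+func atLeast(a, b string) bool {
+    return knownPriorities[a] <= knownPriorities[b]
+}
+
+func main() {
+    fmt.Println(validatePriority(PodSpec{Priority: "Critcal"})) // unknown priority "Critcal"
+    fmt.Println(atLeast("Normal", "Preemptible"))               // true
+}
+```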
+
+When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's
+scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at
+the same or lower priority than P (subject to disruption budgets, see next section) from
+a node in order to make room for P, i.e. in order to make the scheduling predicates
+satisfied for P on that node. (Note that when we add cluster-level resources (#19080),
+it might be necessary to preempt from multiple nodes, but that scenario is outside the
+scope of this document.) The preempted pod(s) may or may not be able to reschedule. The
+net effect of this process is that when demand for cluster resources exceeds supply, the
+higher-priority pods will be able to run while the lower-priority pods will be forced to
+wait. The detailed mechanics of preemption are described in a later section.
+
+In addition to taking disruption budget into account, for equal-priority preemptions the
+scheduler will try to enforce fairness (across victim controllers, services, etc.).
+
+Priorities could be specified directly by users in the podTemplate, or assigned by an
+admission controller using properties of the pod. Either way, all schedulers must be
+configured to understand the same priorities (names and ordering). This could be done by
+making them constants in the API, or by using a ConfigMap to configure the schedulers with
+the information. The advantage of the former (at least making the names, if not the
+ordering, constants in the API) is that it allows the API server to do validation
+(e.g. to catch misspellings).
+
+In the future, which priorities are usable for a given namespace and pods with certain
+attributes may be configurable, similar to ResourceQuota, LimitRange, or security policy.
+
+Priority and resource QoS are independent.
+
+The priority we have described here might be used to prioritize the scheduling queue
+(i.e. the order in which a scheduler examines pods in its scheduling loop), but the two
+priority concepts do not have to be connected. It is somewhat logical to tie them
+together, since a higher priority generally indicates that a pod is more urgent to get
+running. Also, scheduling low-priority pods before high-priority pods might lead to
+avoidable preemptions if the high-priority pods end up preempting the low-priority pods
+that were just scheduled.
+
+TODO: Are priority and preemption global or namespace-relative? See
+[this discussion thread](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r55737389).
+
+#### Relationship of priority to quota
+
+Of course, if the decision of what priority to give a pod is solely up to the user, then
+users have no incentive to ever request any priority less than the maximum. Thus
+priority is intimately related to quota, in the sense that resource quotas must be
+allocated on a per-priority-level basis (X amount of RAM at priority A, Y amount of RAM
+at priority B, etc.). The "guarantee" that highest-priority pods will always be able to
+schedule can only be achieved if the sum of the quotas at the top priority level is less
+than or equal to the cluster capacity. This is analogous to QoS, where safety can only be
+achieved if the sum of the limits of the top QoS level ("Guaranteed") is less than or
+equal to the node capacity. In terms of incentives, an organization could "charge"
+an amount proportional to the priority of the resources.
+
+The topic of how to allocate quota at different priority levels to achieve a desired
+balance between utilization and probability of schedulability is an extremely complex
+topic that is outside the scope of this document. For example, resource fragmentation and
+RequiredDuringScheduling node and pod affinity and anti-affinity mean that even if the
+sum of the quotas at the top priority level is less than or equal to the total aggregate
+capacity of the cluster, some pods at the top priority level might still go pending. In
+general, priority provides a *probabilistic* guarantee of pod schedulability in the face
+of overcommitment, by allowing prioritization of which pods should be allowed to run
+when demand for cluster resources exceeds supply.
+
+### Disruption budget
+
+While priority can protect pods from one source of disruption (preemption by a
+lower-priority pod), *disruption budgets* limit disruptions from all Kubernetes-initiated
+causes, including preemption by an equal or higher-priority pod, or being evicted to
+achieve other rescheduling goals. In particular, each pod is optionally associated with a
+"disruption budget," a new API resource that limits Kubernetes-initiated terminations
+across a set of pods (e.g. the pods of a particular Service might all point to the same
+disruption budget object), regardless of cause. Initially we expect disruption budget
+(e.g. `DisruptionBudgetSpec`) to consist of
+
+* a rate limit on disruptions (preemption and other evictions) across the corresponding
+ set of pods, e.g. no more than one disruption per hour across the pods of a particular Service
+* a minimum number of pods that must be up simultaneously (sometimes called "shard
+ strength") (of course this can also be expressed as the inverse, i.e. the number of
+ pods of the collection that can be down simultaneously)
+
+The second item merits a bit more explanation. One use case is to specify a quorum size,
+e.g. to ensure that at least 3 replicas of a quorum-based service with 5 replicas are up
+at the same time. In practice, a service should ideally create enough replicas to survive
+at least one planned and one unplanned outage. So in our quorum example, we would specify
+that at least 4 replicas must be up at the same time; this allows for one intentional
+disruption (bringing the number of live replicas down from 5 to 4 and consuming one unit
+of shard strength budget) and one unplanned disruption (bringing the number of live
+replicas down from 4 to 3) while still maintaining a quorum. Shard strength is also
+useful for simpler replicated services; for example, you might not want more than 10% of
+your front-ends to be down at the same time, so as to avoid overloading the remaining
+replicas.
+
+Initially, disruption budgets will be specified by the user. Thus as with priority,
+disruption budgets need to be tied into quota, to prevent users from saying none of their
+pods can ever be disrupted. The exact way of expressing and enforcing this quota is TBD,
+though a simple starting point would be to have an admission controller assign a default
+disruption budget based on priority level (more liberal with decreasing priority).
+We also likely need a quota that applies to Kubernetes *components*, to limit the rate
+at which any one component is allowed to consume disruption budget.
+
+Of course there should also be a `DisruptionBudgetStatus` that indicates the current
+disruption rate that the collection of pods is experiencing, and the number of pods that
+are up.
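+
+As a rough illustration (all field names here are placeholders, not a settled API), the
+spec and status might look something like the sketch below, reusing the quorum example
+above:
+
+```go
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+// DisruptionBudgetSpec sketches the two proposed limits: a rate limit on
+// Kubernetes-initiated disruptions and a shard strength floor.
+type DisruptionBudgetSpec struct {
+    MaxDisruptions int           // e.g. at most 1 disruption...
+    Window         time.Duration // ...per hour, across the covered pods
+    MinAvailable   int           // minimum pods that must be up simultaneously
+}
+
+// DisruptionBudgetStatus reflects what the covered pods are experiencing.
+// A pod counts as disrupted as soon as its graceful termination period starts.
+type DisruptionBudgetStatus struct {
+    DisruptionsInWindow int // disruptions started within the current window
+    CurrentAvailable    int // covered pods currently up
+}
+
+// allows reports whether one more disruption would stay within budget; this
+// is essentially the check described for the `/evict` subresource below.
+func allows(spec DisruptionBudgetSpec, status DisruptionBudgetStatus) bool {
+    return status.DisruptionsInWindow+1 <= spec.MaxDisruptions &&
+        status.CurrentAvailable-1 >= spec.MinAvailable
+}
+
+func main() {
+    // 5-replica quorum service: allow one planned disruption (5 -> 4) while
+    // still tolerating one unplanned failure (4 -> 3) without losing quorum.
+    spec := DisruptionBudgetSpec{MaxDisruptions: 1, Window: time.Hour, MinAvailable: 4}
+    status := DisruptionBudgetStatus{DisruptionsInWindow: 0, CurrentAvailable: 5}
+    fmt.Println(allows(spec, status)) // true
+}
+```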
+
+For the purposes of disruption budget, a pod is considered to be disrupted as soon as its
+graceful termination period starts.
+
+A pod that is not covered by a disruption budget but is managed by a controller
+gets an implicit infinite disruption budget (though the system should try not to
+unduly victimize such pods). How a pod that is not managed by a controller is
+handled is TBD.
+
+TBD: In addition to `PodSpec`, where do we store the pointer to the disruption budget
+(in the podTemplate of the controller that manages the pod?)? Do we auto-generate a disruption
+budget (e.g. when instantiating a Service), or require the user to create it manually
+before they create a controller? Which objects should return the disruption budget object
+as part of the output on `kubectl get`, other than (obviously) `kubectl get` for the
+disruption budget itself?
+
+TODO: Clean up distinction between "down due to voluntary action taken by Kubernetes"
+and "down due to unplanned outage" in spec and status.
+
+For now, there is nothing to prevent clients from circumventing the disruption budget
+protections. Of course, clients that do this are not being "good citizens." In the next
+section we describe a mechanism that at least makes it easy for well-behaved clients to
+obey the disruption budgets.
+
+See #12611 for additional discussion of disruption budgets.
+
+### /evict subresource and PreferAvoidPods
+
+Although we could put the responsibility for checking and updating disruption budgets
+solely on the client, it is safer and more convenient if we implement that functionality
+in the API server. Thus we will introduce a new `/evict` subresource on pod. It is similar to
+today's "delete" on pod except
+
+ * It will be rejected if the deletion would violate a disruption budget. (See how
+ Deployment handles failure of /rollback for ideas on how clients could handle failure
+ of `/evict`.) There are two possible ways to implement this:
+
+ * For the initial implementation, this will be accomplished by the API server just
+ looking at the `DisruptionBudgetStatus` and seeing if the disruption would violate the
+ `DisruptionBudgetSpec`. In this approach, we assume a disruption budget controller
+ keeps the `DisruptionBudgetStatus` up-to-date by observing all pod deletions and
+ creations in the cluster, so that an approved disruption is quickly reflected in the
+ `DisruptionBudgetStatus`. Of course this approach does allow a race in which one or
+ more additional disruptions could be approved before the first one is reflected in the
+ `DisruptionBudgetStatus`.
+
+ * Thus a subsequent implementation will have the API server explicitly debit the
+ `DisruptionBudgetStatus` when it accepts an `/evict`. (There still needs to be a
+ controller, to keep the shard strength status up-to-date when replacement pods are
+ created after an eviction; the controller may also be necessary for the rate status
+ depending on how rate is represented, e.g. adding tokens to a bucket at a fixed rate.)
+ Once etcd supports multi-object transactions (etcd v3), the debit and pod deletion will
+ be placed in the same transaction.
+
+ * Note: For the purposes of disruption budget, a pod is considered to be disrupted as soon as its
+ graceful termination period starts (so when we say "delete" here we do not mean
+ "deleted from etcd" but rather "graceful termination period has started").
+
+ * It will allow clients to communicate additional parameters when they wish to delete a
+ pod. (In the absence of the `/evict` subresource, we would have to create a pod-specific
+ type analogous to `api.DeleteOptions`.)
+
+We will make `kubectl delete pod` use `/evict` by default, and require a command-line
+flag to delete the pod directly.
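+
+Below is a toy, self-contained sketch of the second implementation approach described
+above -- the API server checking and immediately debiting the budget when it accepts an
+`/evict`. All names are invented for illustration:
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+)
+
+// budget is a condensed stand-in for the DisruptionBudget sketch above.
+type budget struct {
+    maxDisruptions, disruptionsInWindow int
+    minAvailable, currentAvailable      int
+}
+
+// evict checks the budget and debits it immediately on acceptance, shrinking
+// (though not eliminating) the race window in which concurrent evictions
+// could be approved. With etcd v3 multi-object transactions, the debit and
+// the start of the pod's graceful termination period would commit atomically.
+func evict(pod string, b *budget) error {
+    if b.disruptionsInWindow+1 > b.maxDisruptions ||
+        b.currentAvailable-1 < b.minAvailable {
+        return errors.New("eviction would violate disruption budget")
+    }
+    b.disruptionsInWindow++
+    b.currentAvailable-- // the pod counts as disrupted from this moment
+    fmt.Printf("starting graceful termination of %s\n", pod)
+    return nil
+}
+
+func main() {
+    b := &budget{maxDisruptions: 2, minAvailable: 4, currentAvailable: 5}
+    fmt.Println(evict("web-1", b)) // accepted: 4 of 5 pods remain up
+    fmt.Println(evict("web-2", b)) // rejected: would drop below minAvailable
+}
+```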
+
+We will add to `NodeStatus` a bounded-sized list of signatures of pods that should avoid
+that node (provisionally called `PreferAvoidPods`). One of the pieces of information
+specified in the `/evict` subresource is whether the eviction should add the evicted
+pod's signature to the corresponding node's `PreferAvoidPods`. Initially the pod
+signature will be a
+[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648),
+i.e. a reference to the pod's controller. Controllers are responsible for garbage
+collecting, after some period of time, `PreferAvoidPods` entries that point to them, but the API
+server will also enforce a bounded size on the list. All schedulers will have a
+highest-weighted priority function that gives a node the worst priority if the pod it is
+scheduling appears in that node's `PreferAvoidPods` list. Thus appearing in
+`PreferAvoidPods` is similar to
+[RequiredDuringScheduling node anti-affinity](../../docs/user-guide/node-selection/README.md)
+but it takes precedence over all other priority criteria and is not explicitly listed in
+the `NodeAffinity` of the pod.
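+
+A minimal sketch of how a scheduler's highest-weighted priority function might consume
+`PreferAvoidPods`; the types and the 0-10 scoring scale are assumptions for illustration:
+
+```go
+package main
+
+import "fmt"
+
+// ControllerRef is the initial pod "signature": a reference to the pod's
+// controller. Entries are garbage collected by their controllers and the
+// list is bounded by the API server.
+type ControllerRef struct {
+    Kind, Namespace, Name string
+}
+
+// NodeStatus fragment with the proposed bounded-size avoidance list.
+type NodeStatus struct {
+    PreferAvoidPods []ControllerRef
+}
+
+// scorePreferAvoid is the kind of highest-weighted priority function every
+// scheduler would run: a node whose PreferAvoidPods contains the pod's
+// signature gets the worst possible score, overriding all other priority
+// criteria (scores here are 0-10, higher is better).
+func scorePreferAvoid(podRef ControllerRef, node NodeStatus) int {
+    for _, avoid := range node.PreferAvoidPods {
+        if avoid == podRef {
+            return 0 // worst score: steer the replacement pod elsewhere
+        }
+    }
+    return 10
+}
+
+func main() {
+    rc := ControllerRef{Kind: "ReplicationController", Namespace: "default", Name: "web"}
+    tainted := NodeStatus{PreferAvoidPods: []ControllerRef{rc}}
+    fmt.Println(scorePreferAvoid(rc, tainted))      // 0
+    fmt.Println(scorePreferAvoid(rc, NodeStatus{})) // 10
+}
+```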
+
+`PreferAvoidPods` is useful for the "moving a running pod off of a node from which it is
+receiving poor service" use case, as it reduces the chance that the replacement pod will
+end up on the same node (keep in mind that most of those cases are situations that the
+scheduler does not have explicit priority functions for; for example, it cannot know in
+advance that a pod will be starved). Also, though we do not intend to implement any such
+policies in the first version of the rescheduler, it is useful whenever the rescheduler evicts
+two pods A and B with the intention of moving A into the space vacated by B (it prevents
+B from rescheduling back into the space it vacated before A's scheduler has a chance to
+reschedule A there). Note that these two uses are subtly different; in the first
+case we want the avoidance to last a relatively long time, whereas in the second case we
+may only need it to last until A schedules.
+
+See #20699 for more discussion.
+
+### Preemption mechanics
+
+**NOTE: We expect a fuller design doc to be written on preemption before it is implemented.
+However, a sketch of some ideas is presented here, since preemption is closely related to the
+concepts discussed in this doc.**
+
+Pod schedulers will decide and enact preemptions, subject to the priority and disruption
+budget rules described earlier. (Though note that we currently do not have any mechanism
+to prevent schedulers from bypassing either the priority or disruption budget rules.)
+The scheduler does not concern itself with whether the evicted pod(s) can reschedule. The
+eviction(s) use(s) the `/evict` subresource so that it is subject to the disruption
+budget(s) of the victim(s), but it does not request to add the victim pod(s) to the
+nodes' `PreferAvoidPods`.
+
+Evicting the victim(s) and binding the pending pod (whose scheduling the evictions are
+intended to enable) are not transactional. We expect the scheduler to issue the operations
+in sequence, but it is still possible that another scheduler could schedule its pod in
+between the eviction(s) and the binding, or that the set of pods running on the node in
+question could change between the time the scheduler made its decision and the time it
+sent the operations to the API server, causing the eviction(s) to be insufficient to get
+the pending pod to schedule. In general there are a number of race conditions that cannot be
+avoided without (1) making the evictions and binding be part of a single transaction, and
+(2) making the binding preconditioned on a version number that is associated with the
+node and is incremented on every binding. We may or may not implement those mechanisms in
+the future.
+
+Given a choice between a node where scheduling a pod requires preemption and one where it
+does not, all other things being equal, a scheduler should choose the one where
+preemption is not required. (TBD: Also, if the selected node does require preemption, the
+scheduler should preempt lower-priority pods before higher-priority pods (e.g. if the
+scheduler needs to free up 4 GB of RAM, and the node has two 2 GB low-priority pods and
+one 4 GB high-priority pod, all of which have sufficient disruption budget, it should
+preempt the two low-priority pods). This is debatable, since all of the pods have
+sufficient disruption budget, but it is arguably better to err on the side of giving a
+better disruption SLO to higher-priority pods when possible.)
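+
+A toy sketch of that (debatable) victim-selection rule, using the example above and
+assuming every candidate already has sufficient disruption budget; all names are
+illustrative:
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// candidate is a hypothetical preemption victim on the chosen node.
+type candidate struct {
+    name     string
+    priority int // rank in the total order; lower number = higher priority
+    memGB    int
+}
+
+// pickVictims frees at least needGB by preempting lower-priority pods
+// before higher-priority ones.
+func pickVictims(pods []candidate, needGB int) []candidate {
+    sort.SliceStable(pods, func(i, j int) bool {
+        return pods[i].priority > pods[j].priority // lowest priority first
+    })
+    var victims []candidate
+    freed := 0
+    for _, p := range pods {
+        if freed >= needGB {
+            break
+        }
+        victims = append(victims, p)
+        freed += p.memGB
+    }
+    return victims
+}
+
+func main() {
+    pods := []candidate{
+        {"big-high", 0, 4},    // one 4 GB high-priority pod
+        {"small-low-a", 2, 2}, // two 2 GB low-priority pods
+        {"small-low-b", 2, 2},
+    }
+    // Frees the needed 4 GB using only the low-priority pods.
+    fmt.Println(pickVictims(pods, 4)) // [{small-low-a 2 2} {small-low-b 2 2}]
+}
+```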
+
+Preemption victims must be given their termination grace period. One possible sequence
+of events is
+
+1. The API server binds the preemptor to the node (i.e. sets `nodeName` on the
+preempting pod) and sets `deletionTimestamp` on the victims
+2. Kubelet sees that `deletionTimestamp` has been set on the victims; they enter their
+graceful termination period
+3. Kubelet sees the preempting pod. It runs the admission checks on the new pod
+assuming all pods that are in their graceful termination period are gone and that
+all pods that are in the waiting state (see (4)) are running.
+4. If (3) fails, then the new pod is rejected. If (3) passes, then Kubelet holds the
+new pod in a waiting state, and does not run it until the pod passes the
+admission checks using the set of actually running pods.
+
+Note that there are a lot of details to be figured out here; the above is just a very
+hand-wavy sketch of one general approach that might work.
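+
+To make step 3 concrete, here is a toy version of the modified admission check; the
+state names and the single-resource capacity model are assumptions for illustration only:
+
+```go
+package main
+
+import "fmt"
+
+// podState is a hypothetical view of the pods Kubelet tracks on a node.
+type podState int
+
+const (
+    running     podState = iota
+    terminating          // graceful termination period has started
+    waiting              // admitted preemptor, held until resources actually free
+)
+
+type pod struct {
+    name  string
+    state podState
+    memGB int
+}
+
+// admit sketches step 3: check the new pod against node capacity assuming
+// pods in their graceful termination period are gone and waiting pods are
+// already running.
+func admit(newPod pod, podsOnNode []pod, capacityGB int) bool {
+    used := 0
+    for _, p := range podsOnNode {
+        if p.state == running || p.state == waiting {
+            used += p.memGB
+        }
+        // terminating pods are treated as if they had already exited
+    }
+    return used+newPod.memGB <= capacityGB
+}
+
+func main() {
+    podsOnNode := []pod{
+        {name: "victim", state: terminating, memGB: 4},
+        {name: "other", state: running, memGB: 2},
+    }
+    // The 4 GB preemptor passes admission (and is held waiting) because the
+    // terminating victim's 4 GB is assumed free; it runs only once that is
+    // actually true.
+    fmt.Println(admit(pod{name: "preemptor", memGB: 4}, podsOnNode, 8)) // true
+}
+```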
+
+See #22212 for additional discussion.
+
+### Node drain
+
+Node drain will be handled by one or more components not described in this document. They
+will respect disruption budgets. Initially, we will just make `kubectl drain`
+respect disruption budgets. See #17393 for other discussion.
+
+### Rescheduler
+
+All rescheduling other than preemption and node drain will be decided and enacted by a
+new component called the *rescheduler*. It runs continuously in the background, looking
+for opportunities to move pods to better locations. It acts when the degree of
+improvement meets some threshold and is allowed by the pod's disruption budget. The
+action is eviction of a pod using the `/evict` subresource, with the pod's signature
+enqueued in the node's `PreferAvoidPods`. It does not force the pod to reschedule to any
+particular node. Thus it is really an "unscheduler"; only in combination with the evicted
+pod's scheduler, which schedules the replacement pod, do we get true "rescheduling." See
+the "Example use cases" section earlier for some example use cases.
+
+The rescheduler is a best-effort service that makes no guarantees about how quickly (or
+whether) it will resolve a suboptimal pod placement.
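+
+The overall shape of that background loop might look roughly like the sketch below; the
+evaluation function, the improvement metric, and the threshold are hypothetical
+placeholders:
+
+```go
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+// placement is a hypothetical evaluation of how much better a pod could do
+// elsewhere, according to the rescheduler's enabled policies.
+type placement struct {
+    pod         string
+    improvement float64 // 0 = current node is fine, 1 = dramatically better elsewhere
+    budgetOK    bool    // would an /evict currently fit within the pod's budget?
+}
+
+// threshold encodes aggressiveness: act only on particularly bad placements,
+// since a marginal improvement is rarely worth the disruption.
+const threshold = 0.8
+
+// reschedulerLoop is a toy version of the background loop: evaluate, evict
+// via the /evict subresource (adding the pod's signature to the node's
+// PreferAvoidPods), and let the pod's own scheduler place the replacement;
+// the rescheduler never binds pods itself.
+func reschedulerLoop(evaluate func() []placement, evict func(pod string)) {
+    for {
+        for _, p := range evaluate() {
+            if p.improvement >= threshold && p.budgetOK {
+                evict(p.pod)
+            }
+        }
+        time.Sleep(time.Minute) // best-effort: no latency guarantees
+    }
+}
+
+func main() {
+    evaluate := func() []placement {
+        return []placement{{pod: "web-7", improvement: 0.9, budgetOK: true}}
+    }
+    go reschedulerLoop(evaluate, func(pod string) { fmt.Println("evict", pod) })
+    time.Sleep(100 * time.Millisecond) // let one iteration run in this demo
+}
+```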
+
+The first version of the rescheduler will not take into consideration where or whether an
+evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the
+corresponding shard strength disruption budget indefinitely. By using the `/evict`
+subresource, the rescheduler ensures that an evicted pod has sufficient budget to go
+and stay pending. We expect future versions of the rescheduler may be
+linked with the "mandatory" predicate functions (currently, the ones that constitute the
+Kubelet admission criteria), and will only evict if the rescheduler determines that the
+pod can reschedule somewhere according to those criteria. (Note that this still does not
+guarantee that the pod actually will be able to reschedule, for at least two reasons: (1)
+the state of the cluster may change between the time the rescheduler evaluates it and
+when the evicted pod's scheduler tries to schedule the replacement pod, and (2) the
+evicted pod's scheduler may have additional predicate functions in addition to the
+mandatory ones).
+
+(Note: see [this comment](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r54527968)).
+
+The first version of the rescheduler will only implement two objectives: moving a pod
+onto an under-utilized node, and moving a pod onto a node that meets more of the pod's
+affinity/anti-affinity preferences than wherever it is currently running. (We assume that
+nodes that are intentionally under-utilized, e.g. because they are being drained, are
+marked unschedulable, thus the first objective will not cause the rescheduler to "fight"
+a system that is draining nodes.) We assume that all schedulers sufficiently weight the
+priority functions for affinity/anti-affinity and avoiding very packed nodes,
+otherwise evicted pods may not actually move onto a node that is better according to
+the criteria that caused them to be evicted. (But note that in all cases an evicted pod
+will move to a node that is better according to the totality of its scheduler's priority
+functions, except in the case where the node where it was already running was the only
+node where it can run.) As a general rule, the rescheduler should only act when it sees
+particularly bad situations, since (1) an eviction for a marginal improvement is likely
+not worth the disruption--just because there is sufficient budget for an eviction doesn't
+mean an eviction is painless to the application, and (2) rescheduling the pod might not
+actually mitigate the identified problem if it is minor enough that other scheduling
+factors dominate the decision of where the replacement pod is scheduled.
+
+We assume schedulers' priority functions are at least vaguely aligned with the
+rescheduler's policies; otherwise the rescheduler will never accomplish anything useful,
+given that it relies on the schedulers to actually reschedule the evicted pods. (Even if
+the rescheduler acted as a scheduler, explicitly rebinding evicted pods, we'd still want
+this to be true, to prevent the schedulers and rescheduler from "fighting" one another.)
+
+The rescheduler will be configured using ConfigMap; the cluster administrator can enable
+or disable policies and can tune the rescheduler's aggressiveness (aggressive means it
+will use a relatively low threshold for triggering an eviction and may consume a lot of
+disruption budget, while non-aggressive means it will use a relatively high threshold for
+triggering an eviction and will try to leave plenty of buffer in disruption budgets). The
+first version of the rescheduler will not be extensible or pluggable, since we want to
+keep the code simple while we gain experience with the overall concept. In the future, we
+anticipate a version that will be extensible and pluggable.
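+
+The configuration surface might look roughly like the following sketch; the field names
+and defaults are invented for illustration:
+
+```go
+package main
+
+import "fmt"
+
+// ReschedulerConfig sketches the knobs a ConfigMap might carry. A lower
+// EvictionThreshold means a more aggressive rescheduler: more evictions
+// and more disruption budget consumed.
+type ReschedulerConfig struct {
+    EnableUnderutilizedNodePolicy bool
+    EnableAffinityPolicy          bool
+    EvictionThreshold             float64 // act only above this improvement score
+    MaxBudgetFraction             float64 // leave buffer in disruption budgets
+}
+
+func main() {
+    // A non-aggressive default: high threshold, plenty of budget left over.
+    cfg := ReschedulerConfig{
+        EnableUnderutilizedNodePolicy: true,
+        EnableAffinityPolicy:          true,
+        EvictionThreshold:             0.8,
+        MaxBudgetFraction:             0.5,
+    }
+    fmt.Printf("%+v\n", cfg)
+}
+```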
+
+We might want some way to force the evicted pod to the front of the scheduler queue,
+independently of its priority.
+
+See #12140 for additional discussion.
+
+### Final comments
+
+In general, the design space for this topic is huge. This document describes some of the
+design considerations and proposes one particular initial implementation. We expect
+certain aspects of the design to be "permanent" (e.g. the notion and use of priorities,
+preemption, disruption budgets, and the `/evict` subresource) while others may change over time
+(e.g. the partitioning of functionality between schedulers, controllers, rescheduler,
+horizontal pod autoscaler, and cluster autoscaler; the policies the rescheduler implements;
+the factors the rescheduler takes into account when making decisions (e.g. knowledge of
+schedulers' predicate and priority functions, second-order effects like whether and where
+an evicted pod will be able to reschedule, etc.); the way the rescheduler enacts its
+decisions; and the complexity of the plans the rescheduler attempts to implement).
+
+## Implementation plan
+
+The highest-priority feature to implement is the rescheduler with the two use cases
+highlighted earlier: moving a pod onto an under-utilized node, and moving a pod onto a
+node that meets more of the pod's affinity/anti-affinity preferences. The former is
+useful to rebalance pods after cluster auto-scale-up, and the latter is useful for
+Ubernetes. This requires implementing disruption budgets and the `/evict` subresource,
+but not priority or preemption.
+
+Because the general topic of rescheduling is very speculative, we have intentionally
+proposed that the first version of the rescheduler be very simple -- it only uses eviction
+(with no attempt to guide the replacement pod to any particular node), doesn't know schedulers'
+predicate or priority functions, doesn't try to move two pods at the same time, and only
+implements two use cases. As alluded to in the previous subsection, we expect the design
+and implementation to evolve over time, and we encourage members of the community to
+experiment with more sophisticated policies and to report their results from using them
+on real workloads.
+
+## Alternative implementations
+
+TODO.
+
+## Additional references
+
+TODO.
+
+TODO: Add reference to this doc from docs/proposals/rescheduler.md
+