address issue #1488; clean up linewrap and some minor editing issues in the docs/design/* tree
Signed-off-by: mikebrow <brownwm@us.ibm.com>
@@ -54,7 +54,7 @@ ideas.
 * **High availability:** continuing to be available and work correctly
 even if some components are down or uncontactable. This typically
 involves multiple replicas of critical services, and a reliable way
-to find available replicas. Note that it's possible (but not
+to find available replicas. Note that it's possible (but not
 desirable) to have high
 availability properties (e.g. multiple replicas) in the absence of
 self-healing properties (e.g. if a replica fails, nothing replaces
@@ -109,11 +109,11 @@ ideas.
 
 ## Relative Priorities
 
-1. **(Possibly manual) recovery from catastrophic failures:** having a Kubernetes cluster, and all
-applications running inside it, disappear forever perhaps is the worst
-possible failure mode. So it is critical that we be able to
-recover the applications running inside a cluster from such
-failures in some well-bounded time period.
+1. **(Possibly manual) recovery from catastrophic failures:** having a
+Kubernetes cluster, and all applications running inside it, disappear forever
+is perhaps the worst possible failure mode. So it is critical that we be able to
+recover the applications running inside a cluster from such failures in some
+well-bounded time period.
 1. In theory a cluster can be recovered by replaying all API calls
 that have ever been executed against it, in order, but most
 often that state has been lost, and/or is scattered across
@@ -121,12 +121,12 @@ ideas.
 probably infeasible.
 1. In theory a cluster can also be recovered to some relatively
 recent non-corrupt backup/snapshot of the disk(s) backing the
-etcd cluster state. But we have no default consistent
+etcd cluster state. But we have no default consistent
 backup/snapshot, verification or restoration process. And we
 don't routinely test restoration, so even if we did routinely
 perform and verify backups, we have no hard evidence that we
 can in practice effectively recover from catastrophic cluster
-failure or data corruption by restoring from these backups. So
+failure or data corruption by restoring from these backups. So
 there's more work to be done here.
 1. **Self-healing:** Most major cloud providers provide the ability to
 easily and automatically replace failed virtual machines within a
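
The backup gap called out in the hunk above (no default consistent backup/snapshot, verification, or restoration process) is straightforward to prototype against etcd v3. Below is a minimal sketch, assuming the etcd v3 Go client (`go.etcd.io/etcd/client/v3`) and illustrative endpoints and paths; it only saves and size-checks a snapshot, whereas a credible process would also restore it into a scratch cluster and verify the result, as the document argues.

```go
// backup_sketch.go: a minimal sketch of routine, verified etcd backups.
// Assumes etcd v3 and its Go client; endpoint and file path are illustrative.
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.example.internal:2379"}, // assumed address
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("connect to etcd: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Stream a consistent snapshot of the etcd keyspace to local disk.
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		log.Fatalf("start snapshot: %v", err)
	}
	defer rc.Close()

	out, err := os.Create("/var/backups/etcd-snapshot.db")
	if err != nil {
		log.Fatalf("create backup file: %v", err)
	}
	defer out.Close()

	n, err := io.Copy(out, rc)
	if err != nil {
		log.Fatalf("write snapshot: %v", err)
	}

	// A trivial sanity check only; a real pipeline would also restore the
	// snapshot into a scratch cluster and verify it, as argued above.
	if n == 0 {
		log.Fatal("snapshot is empty")
	}
	fmt.Printf("wrote %d bytes of etcd snapshot\n", n)
}
```
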
@@ -144,7 +144,6 @@ ideas.
 addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
 or [backup and
 recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
-
 1. and boot disks are either:
 1. truly persistent (i.e. remote persistent disks), or
 1. reconstructible (e.g. using boot-from-snapshot,
@@ -157,7 +156,7 @@ ideas.
 quorum members). In environments where cloud-assisted automatic
 self-healing might be infeasible (e.g. on-premise bare-metal
 deployments), it also gives cluster administrators more time to
-respond (e.g. replace/repair failed machines) without incurring
+respond (e.g. replace/repair failed machines) without incurring
 system downtime.
 
 ## Design and Status (as of December 2015)
@@ -174,7 +173,7 @@ ideas.
 
 Multiple stateless, self-hosted, self-healing API servers behind an HA
 load balancer, built out by the default "kube-up" automation on GCE,
-AWS and basic bare metal (BBM). Note that the single-host approach of
+AWS and basic bare metal (BBM). Note that the single-host approach of
 having etcd listen only on localhost to ensure that only the API server can
 connect to it will no longer work, so alternative security will be
 needed in this regard (either using firewall rules, SSL certs, or
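
For the API-server row above, the HA load balancer (or any external monitor) needs a way to find healthy replicas; the API server's `/healthz` endpoint is the usual signal. A minimal probe sketch follows, with illustrative replica addresses and TLS verification disabled purely for brevity; a real check should trust the cluster CA and, depending on configuration, authenticate.

```go
// healthcheck_sketch.go: probe each API server replica's /healthz endpoint,
// the signal a load balancer or monitor can use to route around failed
// replicas. Replica addresses are illustrative.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	replicas := []string{
		"https://apiserver-0.example.internal:6443",
		"https://apiserver-1.example.internal:6443",
		"https://apiserver-2.example.internal:6443",
	}

	client := &http.Client{
		Timeout: 3 * time.Second,
		// For brevity only; a real probe should verify the cluster CA instead.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	for _, r := range replicas {
		resp, err := client.Get(r + "/healthz")
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", r, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: %d %s\n", r, resp.StatusCode, string(body))
	}
}
```
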
@@ -189,13 +188,13 @@ design doc.
 <td>
 
 No scripted self-healing or HA on GCE, AWS or basic bare metal
-currently exists in the OSS distro. To be clear, "no self healing"
+currently exists in the OSS distro. To be clear, "no self healing"
 means that even if multiple e.g. API servers are provisioned for HA
 purposes, if they fail, nothing replaces them, so eventually the
-system will fail. Self-healing and HA can be set up
+system will fail. Self-healing and HA can be set up
 manually by following documented instructions, but this is not
 currently an automated process, and it is not tested as part of
-continuous integration. So it's probably safest to assume that it
+continuous integration. So it's probably safest to assume that it
 doesn't actually work in practice.
 
 </td>
@@ -205,8 +204,8 @@ doesn't actually work in practise.
 <td>
 
 Multiple self-hosted, self healing warm standby stateless controller
-managers and schedulers with leader election and automatic failover of API server
-clients, automatically installed by default "kube-up" automation.
+managers and schedulers with leader election and automatic failover of API
+server clients, automatically installed by default "kube-up" automation.
 
 </td>
 <td>As above.</td>
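
The warm-standby pattern in this row (only one controller manager or scheduler active, the rest waiting to take over) can be illustrated with client-go's leader-election helper. This Lease-based mechanism postdates the 2015-era components described here, so treat it as a sketch of the idea rather than what "kube-up" installed; the lease name, namespace, and identity are illustrative.

```go
// leaderelection_sketch.go: warm-standby replicas contend for a Lease; only
// the elected leader runs the control loop, and a standby takes over if the
// leader stops renewing. Names and identities are illustrative.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	hostname, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader; running control loop")
				<-ctx.Done() // real controllers do their work here
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; exiting so a standby can take over")
			},
		},
	})
}
```
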
@@ -218,47 +217,49 @@ clients, automatically installed by default "kube-up" automation.
 Multiple (3-5) etcd quorum members behind a load balancer with session
 affinity (to prevent clients from being bounced from one to another).
 
-Regarding self-healing, if a node running etcd goes down, it is always necessary to do three
-things:
+Regarding self-healing, if a node running etcd goes down, it is always necessary
+to do three things:
 <ol>
 <li>allocate a new node (not necessary if running etcd as a pod, in
 which case specific measures are required to prevent user pods from
 interfering with system pods, for example using node selectors as
-described in <A HREF=")
-<li>start an etcd replica on that new node,
+described in <A HREF="),
+<li>start an etcd replica on that new node, and
 <li>have the new replica recover the etcd state.
 </ol>
 In the case of local disk (which fails in concert with the machine), the etcd
-state must be recovered from the other replicas. This is called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
-addition</A>.
-In the case of remote persistent disk, the etcd state can be recovered
-by attaching the remote persistent disk to the replacement node, thus
-the state is recoverable even if all other replicas are down.
+state must be recovered from the other replicas. This is called
+<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">
+dynamic member addition</A>.
+
+In the case of remote persistent disk, the etcd state can be recovered by
+attaching the remote persistent disk to the replacement node, thus the state is
+recoverable even if all other replicas are down.
 
 There are also significant performance differences between local disks and remote
-persistent disks. For example, the <A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained throughput
-local disks in GCE is approximatley 20x that of remote disks</A>.
+persistent disks. For example, the
+<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">
+sustained throughput of local disks in GCE is approximately 20x that of remote
+disks</A>.
 
-Hence we suggest that self-healing be provided by remotely mounted persistent disks in
-non-performance critical, single-zone cloud deployments. For
-performance critical installations, faster local SSD's should be used,
-in which case remounting on node failure is not an option, so
-<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md ">etcd runtime configuration</A>
-should be used to replace the failed machine. Similarly, for
-cross-zone self-healing, cloud persistent disks are zonal, so
-automatic
-<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
-is required. Similarly, basic bare metal deployments cannot generally
-rely on
-remote persistent disks, so the same approach applies there.
+Hence we suggest that self-healing be provided by remotely mounted persistent
+disks in non-performance-critical, single-zone cloud deployments. For
+performance-critical installations, faster local SSDs should be used, in which
+case remounting on node failure is not an option, so
+<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">
+etcd runtime configuration</A> should be used to replace the failed machine.
+Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so
+automatic <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">
+runtime configuration</A> is required. Likewise, basic bare metal deployments
+cannot generally rely on remote persistent disks, so the same approach applies
+there.
 </td>
 <td>
 <A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
-Somewhat vague instructions exist</A>
-on how to set some of this up manually in a self-hosted
-configuration. But automatic bootstrapping and self-healing is not
-described (and is not implemented for the non-PD cases). This all
-still needs to be automated and continuously tested.
+Somewhat vague instructions exist</A> on how to set some of this up manually in
+a self-hosted configuration. But automatic bootstrapping and self-healing is not
+described (and is not implemented for the non-PD cases). This all still needs to
+be automated and continuously tested.
 </td>
 </tr>
 </table>
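
The "dynamic member addition" step in the etcd row can be driven from the etcd v3 Go client. The sketch below removes a failed member and registers its replacement (endpoints, member name, and peer URL are illustrative); the replacement etcd process must then be started against the returned cluster membership with `--initial-cluster-state=existing`.

```go
// member_replace_sketch.go: replace a failed etcd member, i.e. the "remove the
// dead member, register the new node" half of dynamic member addition.
// Endpoints, names and URLs are illustrative.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.example.internal:2379", "https://etcd-1.example.internal:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("connect to etcd: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Find the failed member by name and drop it from the quorum.
	const failedName = "etcd-2" // illustrative
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatalf("list members: %v", err)
	}
	for _, m := range members.Members {
		if m.Name == failedName {
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				log.Fatalf("remove member %s: %v", failedName, err)
			}
		}
	}

	// Register the replacement node's peer URL; etcd returns the membership
	// the new replica should be started with.
	added, err := cli.MemberAdd(ctx, []string{"https://etcd-3.example.internal:2380"})
	if err != nil {
		log.Fatalf("add member: %v", err)
	}
	fmt.Printf("added member %x; cluster now has %d members\n", added.Member.ID, len(added.Members))
}
```
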
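The node-selector measure mentioned in the etcd row (keeping user pods from interfering with system pods when etcd runs as a pod) amounts to constraining where the pod may schedule. A hypothetical client-go sketch follows, with an illustrative label and image; production clusters typically combine this with taints and tolerations.

```go
// nodeselector_sketch.go: pin an etcd pod to dedicated nodes with a node
// selector. The label key/value and image are illustrative.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-member", Namespace: "kube-system"},
		Spec: corev1.PodSpec{
			// Only schedule onto nodes labeled for system workloads.
			NodeSelector: map[string]string{"dedicated": "master"},
			Containers: []corev1.Container{{
				Name:  "etcd",
				Image: "quay.io/coreos/etcd:v3.5.0", // illustrative
			}},
		},
	}

	if _, err := client.CoreV1().Pods("kube-system").Create(context.Background(), pod, metav1.CreateOptions{}); err != nil {
		log.Fatalf("create pod: %v", err)
	}
	log.Println("etcd pod created with node selector dedicated=master")
}
```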