Changes per reviews

This commit is contained in:
Justin Santa Barbara 2015-09-19 12:53:19 -04:00
parent e7ff4d2245
commit b3a4b1853d


<!-- END MUNGE: UNVERSIONED_WARNING -->
# Peeking under the hood of Kubernetes on AWS

This document provides high-level insight into how Kubernetes works on AWS and
maps to AWS objects. We assume that you are familiar with AWS.

We encourage you to use [kube-up](../getting-started-guides/aws.md) (or
[CloudFormation](../getting-started-guides/aws-coreos.md)) to create clusters
on AWS. We recommend that you avoid manual configuration, but are aware that
sometimes it's the only option.

Tip: You should open an issue and let us know what enhancements can be made to
the scripts to better suit your needs.

That said, it's also useful to know what's happening under the hood when
Kubernetes clusters are created on AWS. This can be particularly useful if
problems arise or in circumstances where the provided scripts are lacking and
you manually created or configured your cluster.

### Architecture overview

Kubernetes is a cluster of several machines that consists of a Kubernetes
master and a set number of nodes (previously known as 'minions') for which the
master is responsible. See the [Architecture](architecture.md) topic for more
details.

Other documents describe the general architecture of Kubernetes (all nodes run
Docker; the kubelet agent runs on each node and launches containers; the
kube-proxy relays traffic between the nodes, etc.).
By default on AWS:

* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently
  modern kernel that pairs well with Docker and doesn't require a reboot. (The
  default SSH user is `ubuntu` for this and other Ubuntu images.)
* By default we run aufs over ext4 as the filesystem / container storage on
  the nodes (mostly because this is what GCE uses).

You can override these defaults by passing different environment variables to
kube-up.
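
For example, a minimal sketch of overriding some of these defaults when
invoking kube-up. The values shown are placeholders; check the defaults under
`cluster/aws/` in your release for the authoritative list of supported
variables.

```sh
# Hedged example: bring up a cluster on AWS with a few defaults overridden.
export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=us-west-2a   # availability zone (default shown)
export NUM_MINIONS=4              # number of non-master nodes
export MINION_SIZE=t2.small       # EC2 instance type for the nodes
export DOCKER_STORAGE=btrfs       # see the Storage section below
cluster/kube-up.sh
```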

### Storage

AWS supports persistent volumes by using [Elastic Block Store
(EBS)](../user-guide/volumes.md#awselasticblockstore). These can then be
attached to pods that should store persistent data (e.g. if you're running a
database).
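
As a hedged sketch of how a pod consumes an EBS-backed volume. The volume ID,
pod name, and image below are placeholders, and the EBS volume must already
exist in the same availability zone as the node.

```sh
# Create a pod that mounts a pre-existing EBS volume; all names are placeholders.
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ebs-example
spec:
  containers:
  - name: web
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /usr/share/nginx/html
  volumes:
  - name: data
    awsElasticBlockStore:
      volumeID: vol-0123456789abcdef0   # must be in the node's AZ
      fsType: ext4
EOF
```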

By default, nodes in AWS use [instance
storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)
unless you create pods with persistent volumes
([EBS](../user-guide/volumes.md#awselasticblockstore)). In general, Kubernetes
containers do not have persistent storage unless you attach a persistent
volume, and so nodes on AWS use instance storage. Instance storage is cheaper,
often faster, and historically more reliable. This does mean that you should
pick an instance type that has sufficient instance storage, unless you can
make do with whatever space is left on your root partition.

Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to
track its state. But, similar to the nodes, containers are mostly run against
instance storage, except that we repoint some important data onto the
persistent volume.

The default storage driver for Docker images is aufs. Passing the environment
variable `DOCKER_STORAGE=btrfs` is also a good choice for a filesystem. btrfs
is relatively reliable with Docker and has improved its reliability with
modern kernels. It can easily span multiple volumes, which is particularly
useful when we are using an instance type with multiple ephemeral instance
disks.
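
If you want to confirm which storage driver a node actually ended up with, one
hedged approach (assuming you can SSH to the node as the default `ubuntu`
user) is:

```sh
# Check Docker's storage driver on a node; <node-address> is a placeholder.
ssh ubuntu@<node-address> 'sudo docker info | grep "Storage Driver"'
# Expected output looks like: Storage Driver: aufs   (or btrfs)
```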

### AutoScaling

Nodes (except for the master) are run in an
[AutoScalingGroup](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled
([#11935](http://issues.k8s.io/11935)). Instead, the auto-scaling group means
that AWS will relaunch any non-master nodes that are terminated.
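
A quick way to inspect the group that kube-up created. The group name shown
assumes the default `KUBE_AWS_INSTANCE_PREFIX` (i.e. `kubernetes-minion-group`,
see the AWS Objects section below).

```sh
# Show the node auto-scaling group and the instances it currently manages.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names kubernetes-minion-group
```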

We do not currently run the master in an AutoScalingGroup, but we should
([#11934](http://issues.k8s.io/11934)).

### Networking

Kubernetes uses an IP-per-pod model. This means that a node, which runs many
pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced
routing support to implement this: each node is assigned a /24 CIDR for its
pods, and that CIDR is then configured to route to the node's instance in the
VPC routing table.

It is also possible to use overlay networking on AWS, but that is not the
default configuration of the kube-up script.
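
To illustrate the routing model, this is roughly the kind of route that ends
up in the VPC route table for each node. The route table ID, pod CIDR, and
instance ID are hypothetical, and the master normally creates these routes for
you.

```sh
# Route a node's pod CIDR to that node's EC2 instance in the VPC route table.
aws ec2 create-route \
  --route-table-id rtb-0123abcd \
  --destination-cidr-block 10.244.1.0/24 \
  --instance-id i-0123456789abcdef0
```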

### NodePort and LoadBalancing

Kubernetes on AWS integrates with [Elastic Load Balancing
(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html).
When you create a service with `Type=LoadBalancer`, Kubernetes (the
kube-controller-manager) will create an ELB, create a security group for the
ELB which allows access on the service ports, attach all the nodes to the ELB,
and modify the security group for the nodes to allow traffic from the ELB to
the nodes. This traffic reaches kube-proxy where it is then forwarded to the
pods.

ELB has some restrictions: it requires that all nodes listen on a single port,
and it acts as a forwarding proxy (i.e. the source IP is not preserved). To
work with these restrictions, in Kubernetes, [LoadBalancer
services](../user-guide/services.md#type-loadbalancer) are exposed as
[NodePort services](../user-guide/services.md#type-nodeport). Then kube-proxy
listens externally on the cluster-wide port that's assigned to NodePort
services and forwards traffic to the corresponding pods. So ELB is configured
to proxy traffic on the public port (e.g. port 80) to the NodePort that is
assigned to the service (e.g. 31234). Any incoming traffic sent to the
NodePort (e.g. port 31234) is recognized by kube-proxy and then sent to the
correct pods for that service.
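
For example, a hedged sketch of a `Type=LoadBalancer` service; the name,
selector, and ports are placeholders. Kubernetes allocates the NodePort and
provisions the ELB for you.

```sh
# Create a LoadBalancer service; Kubernetes wires the ELB to the allocated NodePort.
kubectl create -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: my-web
spec:
  type: LoadBalancer
  selector:
    app: my-web
  ports:
  - port: 80          # public port on the ELB
    targetPort: 8080  # container port on the pods
EOF
# The allocated NodePort and the ELB hostname are visible via:
kubectl describe service my-web
```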

Note that we do not automatically open NodePort services in the AWS firewall
(although we do open LoadBalancer services). This is because we expect that
NodePort services are more of a building block for things like inter-cluster
services or for LoadBalancer. To consume a NodePort service externally, you
will likely have to open the port in the node security group
(`kubernetes-minion-<clusterid>`).
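
A hedged sketch of opening a NodePort (31234 is just an example) in the node
security group. The cluster-id and source CIDR are placeholders, and for a VPC
security group you need the group ID rather than the group name.

```sh
# Look up the node security group by name, then allow external traffic to a NodePort.
GROUP_ID=$(aws ec2 describe-security-groups \
  --filters Name=group-name,Values=kubernetes-minion-mycluster \
  --query 'SecurityGroups[0].GroupId' --output text)
aws ec2 authorize-security-group-ingress \
  --group-id "$GROUP_ID" \
  --protocol tcp --port 31234 --cidr 0.0.0.0/0
```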

### Identity and Access Management (IAM)

kube-up sets up two IAM roles, one for the master called
[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
and one for the non-master nodes called
[kubernetes-minion](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).

The master is responsible for creating ELBs and configuring them, as well as
setting up advanced VPC routing. Currently it has blanket permissions on EC2,
along with rights to create and destroy ELBs.

The (non-master) nodes do not need a lot of access to the AWS APIs. They need
to download a distribution file, and then they are responsible for attaching
and detaching EBS volumes from themselves.

The (non-master) node policy is relatively minimal. The master policy is
probably overly permissive. The security-conscious may want to lock down the
IAM policies further ([#11936](http://issues.k8s.io/11936)).

We should make it easier to extend IAM permissions and also ensure that they
are correctly configured ([#14226](http://issues.k8s.io/14226)).
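
If you want to review what was actually granted, a sketch using the AWS CLI.
The role and inline policy names assume the kube-up defaults; adjust them if
yours differ.

```sh
# Show the inline policies attached to the master and node IAM roles.
aws iam get-role-policy --role-name kubernetes-master --policy-name kubernetes-master
aws iam get-role-policy --role-name kubernetes-minion --policy-name kubernetes-minion
```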

### Tagging

All AWS resources are tagged with a tag named "KubernetesCluster", with a
value that is the unique cluster-id. This tag is used to identify a particular
'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
Resources are considered to belong to the same cluster if and only if they
have the same value in the tag named "KubernetesCluster". (The kube-up script
is not configured to create multiple clusters in the same VPC by default, but
it is possible to create another cluster in the same VPC.)

Within the AWS cloud provider logic, we filter requests to the AWS APIs to
match resources with our cluster tag. By filtering the requests, we ensure
that we see only our own AWS objects.

Important: If you choose not to use kube-up, you must pick a unique cluster-id
value, and ensure that all AWS resources have a tag with
`Name=KubernetesCluster,Value=<clusterid>`.
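
A hedged sketch of tagging manually created resources. The resource IDs and
cluster-id are placeholders; note that the AWS CLI `create-tags` syntax uses
`Key=` rather than `Name=`.

```sh
# Tag manually created resources so the AWS cloud provider treats them as part
# of the cluster named "mycluster".
aws ec2 create-tags \
  --resources vpc-0123abcd subnet-0456efgh rtb-089abcde \
  --tags Key=KubernetesCluster,Value=mycluster
```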

### AWS Objects

The kube-up script does a number of things in AWS:

* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes
  distribution and the salt scripts into it. They are made world-readable and
  the HTTP URLs are passed to instances; this is how Kubernetes code gets onto
  the machines.
* Creates two IAM profiles based on templates in `cluster/aws/templates/iam`:
  * `kubernetes-master` is used by the master node.
  * `kubernetes-minion` is used by non-master nodes.
* Creates an AWS SSH key named `kubernetes-<fingerprint>`. Fingerprint here is
  the OpenSSH key fingerprint, so that multiple users can run the script with
  different keys and their keys will not collide (with near-certainty). It
  will use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will
  create one there. (With the default Ubuntu images, if you have to SSH in:
  the user is `ubuntu` and that user can `sudo`.)
* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and
  enables the `dns-support` and `dns-hostnames` options.
* Creates an internet gateway for the VPC.
* Creates a route table for the VPC, with the internet gateway as the default
  route.
* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
  (defaults to us-west-2a). Currently, each Kubernetes cluster runs in a
  single AZ on AWS. There are two philosophies in discussion on how to achieve
  High Availability (HA):
  * cluster-per-AZ: An independent cluster for each AZ, where each cluster is
    entirely separate.
  * cross-AZ-clusters: A single cluster spans multiple AZs.
  The debate is open here: cluster-per-AZ is generally seen as more robust,
  but cross-AZ-clusters are more convenient.
* Associates the subnet to the route table.
* Creates security groups for the master node (`kubernetes-master-<clusterid>`)
  and the non-master nodes (`kubernetes-minion-<clusterid>`).
* Configures security groups so that masters and nodes can communicate. This
  includes intercommunication between masters and nodes, opening SSH publicly
  for both masters and nodes, and opening port 443 on the master for the HTTPS
  API endpoint.
* Creates an EBS volume for the master node of size `MASTER_DISK_SIZE` and
  type `MASTER_DISK_TYPE`.
* Launches a master node with a fixed IP address (172.20.0.9) that is also
  configured for the security group and all the necessary IAM credentials. An
  instance script is used to pass vital configuration information to Salt.
  Note: The hope is that over time we can reduce the amount of configuration
  information that must be passed in this way.
* Once the instance is up, it attaches the EBS volume and sets up a manual
  routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
  10.246.0.0/24).
* For auto-scaling, it creates a launch configuration and an auto-scaling
  group for the (non-master) nodes. The name for both is
  `<KUBE_AWS_INSTANCE_PREFIX>-minion-group`; the default name is
  `kubernetes-minion-group`. The auto-scaling group has a min and max size
  that are both set to `NUM_MINIONS`. You can change the size of the
  auto-scaling group to add or remove nodes directly through the AWS API or
  Console (see the sketch after this list). Each node self-configures: it
  comes up; runs Salt with the stored configuration; connects to the master;
  and is assigned an internal CIDR; the master then configures the route-table
  with the assigned CIDR. The kube-up script performs a health-check on the
  nodes, but it's a self-check that is not required.
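
As mentioned in the list above, a hedged sketch of resizing the node group
with the AWS CLI. The group name assumes the default
`KUBE_AWS_INSTANCE_PREFIX`.

```sh
# Grow the cluster to 6 nodes; new instances self-configure via Salt and
# register with the master. Raise the max first if 6 exceeds the current max.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name kubernetes-minion-group \
  --max-size 6
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name kubernetes-minion-group \
  --desired-capacity 6
```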

If attempting this configuration manually, I highly recommend following along
with the kube-up script, and being sure to tag everything with a
`KubernetesCluster` tag with a value that is unique to your cluster. Over
time, the hope is to simplify this by having Kubernetes take on more node
configuration, and even potentially remove Salt altogether.

### Manual infrastructure creation

While this work is not yet complete, advanced users might choose to manually
configure certain AWS objects while still making use of the kube-up script (to
configure Salt, for example). These objects can currently be manually created
(see the sketch after this list):

* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
* Set the `VPC_ID` environment variable to reuse an existing VPC.
* Set the `SUBNET_ID` environment variable to reuse an existing subnet.
* If your route table has a matching `KubernetesCluster` tag, it will be
  reused.
* If your security groups are appropriately named, they will be reused.
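
For example, a sketch of pointing kube-up at pre-existing objects; the bucket
name and IDs below are placeholders.

```sh
# Reuse existing AWS objects when bringing up the cluster.
export KUBERNETES_PROVIDER=aws
export AWS_S3_BUCKET=my-existing-kubernetes-bucket   # existing S3 bucket
export VPC_ID=vpc-0123abcd                           # existing VPC
export SUBNET_ID=subnet-0456efgh                     # existing subnet
cluster/kube-up.sh
```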

Currently there is no way to do the following with kube-up:

* Use an existing AWS SSH key with an arbitrary name.
* Override the IAM credentials in a sensible way
  ([#14226](http://issues.k8s.io/14226)).
* Use different security group permissions.
* Configure your own auto-scaling groups.

If any of the above items apply to your situation, open an issue to request an
enhancement to the kube-up script. You should provide a complete description
of the use-case, including all the details around what you want to accomplish.

### Instance boot

The instance boot procedure is currently pretty complicated, primarily because
we must marshal configuration from Bash to Salt via the AWS instance script.

When the kube-up script launches instances, it builds an instance startup
script which includes some configuration options passed to kube-up, and
concatenates some of the scripts found in the cluster/aws/templates directory.
These scripts are responsible for mounting and formatting volumes, downloading
Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually
install Kubernetes.
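
If you want to see exactly what was passed to a particular instance, a hedged
sketch using the AWS CLI. The instance ID is a placeholder, and the user data
comes back base64-encoded.

```sh
# Dump the startup script (user data) that kube-up generated for an instance.
aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute userData \
  --query 'UserData.Value' --output text | base64 --decode
```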