From e7ff4d2245f1161fec807916d4f0c442eaa90850 Mon Sep 17 00:00:00 2001
From: Justin Santa Barbara
Date: Tue, 28 Jul 2015 14:18:50 -0400
Subject: [PATCH] AWS "under the hood" document

Document how we implement kubernetes on AWS, so that configuration tools
other than kube-up can have a reference for what they should do, and
generally to help developers get up to speed.
---
 docs/design/aws_under_the_hood.md | 271 ++++++++++++++++++++++++++++++
 1 file changed, 271 insertions(+)
 create mode 100644 docs/design/aws_under_the_hood.md

diff --git a/docs/design/aws_under_the_hood.md b/docs/design/aws_under_the_hood.md
new file mode 100644
index 00000000000..eece5dfb7ff
--- /dev/null
+++ b/docs/design/aws_under_the_hood.md
@@ -0,0 +1,271 @@

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

The latest 1.0.x release of this document can be found
[here](http://releases.k8s.io/release-1.0/docs/design/aws_under_the_hood.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).

--

## Peeking under the hood of kubernetes on AWS

We encourage you to use kube-up (or CloudFormation) to create a cluster. But
it is useful to know what is being created: out of curiosity, to understand any
problems that may arise, or because you have to create things manually when the
scripts are unsuitable for some reason. We don't recommend manual configuration
(please file an issue and let us know what's missing if there's something you
need), but sometimes it is the only option.

This document describes how kubernetes on AWS maps to AWS objects.
Familiarity with AWS is assumed.

### Top-level

Kubernetes consists of a single master node and a collection of minion nodes.
Other documents describe the general architecture of Kubernetes (all nodes run
Docker; the kubelet agent runs on each node and launches containers; the
kube-proxy relays traffic between the nodes, etc.).

By default on AWS:

* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently
  modern kernel to give a good experience with Docker, and it doesn't require a
  reboot. (The default SSH user is `ubuntu` for this and other Ubuntu images.)
* We run aufs over ext4 as the filesystem / container storage on the
  nodes (mostly because this is what GCE uses).

These defaults can be changed by passing different environment variables to
kube-up.

### Storage

AWS supports persistent volumes via EBS. These can be attached to
pods that should store persistent data (e.g. if you're running a database).

Minions do not otherwise have persistent volumes. In general, kubernetes
containers do not have persistent storage unless you attach a persistent
volume, and so minions on AWS use instance storage. Instance storage is
cheaper, often faster, and historically more reliable. This does mean that you
should pick an instance type that has sufficient instance storage, unless you
can make do with whatever space is left on your root partition.

The master _does_ have a persistent volume attached to it. Containers are
mostly run against instance storage, just like on the minions, except that we
repoint some important data onto the persistent volume.

By default we use aufs over ext4. `DOCKER_STORAGE=btrfs` is also a good choice
for a filesystem: it is relatively reliable with Docker, and btrfs itself is
much more reliable than it used to be with modern kernels. It can easily span
multiple volumes, which is particularly useful when we are using an instance
type with multiple ephemeral instance disks.

### AutoScaling

We run the minions in an AutoScalingGroup. Currently auto-scaling (e.g. based
on CPU) is not actually enabled (#11935). Instead, the auto-scaling group
means that AWS will relaunch any minions that are terminated.

We do not currently run the master in an AutoScalingGroup, but we should
(#11934).

### Networking

Kubernetes uses an IP-per-pod model. This means that a node, which runs many
pods, must have many IPs. The way we implement this on AWS is to use VPCs and
the advanced routing support that they allow. Each node is assigned a /24 CIDR
for its pods, and this CIDR is configured to route to that instance in the VPC
routing table.
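For illustration, such a per-node route could be created with the AWS CLI
roughly as follows. This is only a sketch: the master makes the equivalent API
call itself, and the route table ID, instance ID, and CIDR below are
placeholders, not values kube-up will produce.

```sh
# Route a node's pod CIDR to that node's EC2 instance in the VPC route table.
# All IDs and the CIDR are example values.
aws ec2 create-route \
  --route-table-id rtb-0example0 \
  --destination-cidr-block 10.244.1.0/24 \
  --instance-id i-0example0
```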
It is also possible to use overlay networking on AWS, but the default kube-up
configuration does not.

### NodePort & LoadBalancing

Kubernetes on AWS integrates with ELB. When you create a service with
Type=LoadBalancer, kubernetes (the kube-controller-manager) will create an ELB,
create a security group for the ELB which allows access on the service ports,
attach all the minions to the ELB, and modify the security group for the
minions to allow traffic from the ELB to the minions. This traffic reaches
kube-proxy, which then forwards it to the pods.

ELB requires that all minions listen on a single port, and it acts as a layer-7
forwarding proxy (i.e. the source IP is not preserved). It is therefore not
trivial for kube-proxy to recognize the traffic. So, LoadBalancer services are
also exposed as NodePort services. For NodePort services, a cluster-wide port
is assigned by kubernetes to the service, and kube-proxy listens externally on
that port on every minion and forwards traffic to the pods. So for a
load-balanced service, ELB is configured to proxy traffic on the public port
(e.g. port 80) to the NodePort assigned to the service (e.g. 31234); kube-proxy
recognizes traffic coming to the NodePort by the inbound port number, and
sends it to the correct pods for the service.

Note that we do not automatically open NodePort services in the AWS firewall
(although we do open LoadBalancer services). This is because we expect that
NodePort services are more of a building block for things like inter-cluster
services or for LoadBalancer. To consume a NodePort service externally, you
will likely have to open the port in the minion security group
(`kubernetes-minion-`).

### IAM

kube-up sets up two IAM roles, one for the master called
[kubernetes-master](cluster/aws/templates/iam/kubernetes-master-policy.json)
and one for the minions called
[kubernetes-minion](cluster/aws/templates/iam/kubernetes-minion-policy.json).

The master is responsible for creating ELBs and configuring them, as well as
setting up advanced VPC routing. Currently it has blanket permissions on EC2,
along with rights to create and destroy ELBs.

The minion does not need a lot of access to the AWS APIs. It needs to download
a distribution file, and then it is responsible for attaching and detaching EBS
volumes to itself.

The minion policy is relatively minimal. The master policy is probably overly
permissive. The security conscious may want to lock down the IAM policies
further (#11936).

We should make it easier to extend IAM permissions and also ensure that they
are correctly configured (#???).

### Tagging

All AWS resources are tagged with a tag named "KubernetesCluster". This tag is
used to identify a particular 'instance' of Kubernetes, even if two clusters
are deployed into the same VPC. (The script doesn't do this by default, but it
can be done.)

Within the AWS cloud provider logic, we filter requests to the AWS APIs to
match resources with our cluster tag, so we only see our own AWS objects.

If you choose not to use kube-up, you must tag everything with a
KubernetesCluster tag with a unique per-cluster value.
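As a hedged illustration of that tagging requirement, manually created
resources could be tagged, and then listed per cluster, roughly as follows;
the resource IDs and the cluster value `mycluster` are placeholders, not names
kube-up chooses for you.

```sh
# Tag manually created resources so the cloud provider will recognize them.
aws ec2 create-tags \
  --resources vpc-0example0 subnet-0example0 \
  --tags Key=KubernetesCluster,Value=mycluster

# List the instances that belong to this cluster, filtering on the same tag.
aws ec2 describe-instances \
  --filters "Name=tag:KubernetesCluster,Values=mycluster"
```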
## AWS Objects

The kube-up script does a number of things in AWS:

* Creates an S3 bucket (`AWS_S3_BUCKET`) and copies the kubernetes distribution
  and the salt scripts into it. They are made world-readable and the HTTP URLs
are passed to instances; this is how kubernetes code gets onto the machines.
* Creates two IAM profiles based on templates in `cluster/aws/templates/iam`.
  `kubernetes-master` is used by the master node; `kubernetes-minion` is used
by minion nodes.
* Creates an AWS SSH key named `kubernetes-`. The fingerprint here is
  the OpenSSH key fingerprint, so that multiple users can run the script with
different keys and their keys will not collide (with near-certainty). It will
use an existing key if one is found at `AWS_SSH_KEY`; otherwise it will create
one there. (With the default Ubuntu images, if you have to SSH in, the user is
`ubuntu` and that user can `sudo`.)
* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16), and
  enables the `dns-support` and `dns-hostnames` options.
* Creates an internet gateway for the VPC.
* Creates a route table for the VPC, with the internet gateway as the default
  route.
* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
  (defaults to us-west-2a). Currently kubernetes runs in a single AZ; there
are two philosophies on how to achieve HA: cluster-per-AZ and
cross-AZ-clusters. cluster-per-AZ says you should have an independent cluster
for each AZ; they are entirely separate. cross-AZ-clusters allows a single
cluster to span multiple AZs. The debate is open here: cluster-per-AZ is more
robust but cross-AZ-clusters are more convenient. For now though, each AWS
kubernetes cluster lives in one AZ.
* Associates the subnet with the route table.
* Creates security groups for the master node (`kubernetes-master-`)
  and the minion nodes (`kubernetes-minion-`).
* Configures the security groups so that masters and minions can
  intercommunicate, opens SSH to the world on both master and minions, and
opens port 443 to the world on the master (for the HTTPS API endpoint).
* Creates an EBS volume for the master node of size `MASTER_DISK_SIZE` and type
  `MASTER_DISK_TYPE`.
* Launches a master node with a fixed IP address (172.20.0.9), with the
  security group, IAM credentials, etc. An instance startup script is used to
pass vital configuration information to Salt. The hope is that over time we
can reduce the amount of configuration information that must be passed in this
way.
* Once the instance is up, it attaches the EBS volume and sets up a manual
  routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
10.246.0.0/24).
* Creates an auto-scaling launch-configuration and group for the minions. The
  name for both is `-minion-group`, which defaults to
`kubernetes-minion-group`. The auto-scaling group has min and max size both set
to `NUM_MINIONS`. You can change the size of the auto-scaling group to add or
remove minions (directly through the AWS API/Console). The minion nodes
self-configure: they come up and run Salt with the stored configuration; they
connect to the master and are assigned an internal CIDR; the master then
configures the route table with the minion CIDR. The script does health-check
the minions, but this is a self-check; it is not required.

If attempting this configuration manually, we highly recommend following along
with the kube-up script, being sure to tag everything with a
`KubernetesCluster`=`` tag. Also, passing the right configuration
options to Salt when not using the script is tricky: the plan here is to
simplify this by having Kubernetes take on more node configuration, and even
potentially remove Salt altogether.
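For reference, a cluster using some of the configuration options mentioned
above might be brought up roughly like this. This is only a sketch: the values
are examples, and it assumes the usual `KUBERNETES_PROVIDER` variable is what
selects the AWS provider for kube-up.

```sh
# Example values only; omit any of these to accept the defaults.
export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=us-west-2a
export NUM_MINIONS=4
export MASTER_DISK_SIZE=20
export MASTER_DISK_TYPE=gp2
export AWS_S3_BUCKET=my-kubernetes-artifacts

cluster/kube-up.sh
```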
## Manual infrastructure creation

While this work is not yet complete, advanced users may choose to create (some)
AWS objects themselves, and still make use of the kube-up script (to configure
Salt, for example):

* `AWS_S3_BUCKET` will use an existing S3 bucket.
* `VPC_ID` will reuse an existing VPC.
* `SUBNET_ID` will reuse an existing subnet.
* If your route table is tagged with the correct `KubernetesCluster`, it will
  be reused.
* If your security groups are appropriately named, they will be reused.

Currently there is no way to do the following with kube-up. If these affect
you, please open an issue with a description of what you're trying to do (your
use-case) and we'll see what we can do:

* Use an existing AWS SSH key with an arbitrary name.
* Override the IAM credentials in a sensible way (but this is in progress).
* Use different security group permissions.
* Configure your own auto-scaling groups.

## Instance boot

The instance boot procedure is currently pretty complicated, primarily because
we must marshal configuration from Bash to Salt via the instance startup
script. As we move more post-boot configuration out of Salt and into
Kubernetes, we will hopefully be able to simplify this.

When the kube-up script launches instances, it builds an instance startup
script which includes some configuration options passed to kube-up, and
concatenates some of the scripts found in the `cluster/aws/templates`
directory. These scripts are responsible for mounting and formatting volumes,
downloading Salt & Kubernetes from the S3 bucket, and then triggering Salt to
actually install Kubernetes.
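If you need to see exactly what configuration a booted instance received, one
possible approach (a sketch, with a placeholder instance ID) is to pull the
user data back out of EC2 and decode it:

```sh
# Fetch and decode the instance startup script (user data) for inspection.
aws ec2 describe-instance-attribute \
  --instance-id i-0example0 \
  --attribute userData \
  --output text \
  --query UserData.Value | base64 --decode
```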