*Note: this document applies to the HEAD of the source tree. If you are using
a released version of Kubernetes, you should refer to the docs that go with
that version. The latest 1.0.x release of this document can be found
[here](http://releases.k8s.io/release-1.0/docs/design/aws_under_the_hood.md).
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).*

# Peeking under the hood of Kubernetes on AWS

This document provides high-level insight into how Kubernetes works on AWS and
maps to AWS objects. We assume that you are familiar with AWS.

We encourage you to use [kube-up](../getting-started-guides/aws.md) to create
clusters on AWS. We recommend that you avoid manual configuration, but are
aware that sometimes it's the only option.

Tip: You should open an issue and let us know what enhancements can be made to
the scripts to better suit your needs.

That said, it's also useful to know what's happening under the hood when
Kubernetes clusters are created on AWS. This can be particularly useful if
problems arise, or in circumstances where the provided scripts are lacking and
you have manually created or configured your cluster.

**Table of contents:**
 * [Architecture overview](#architecture-overview)
 * [Storage](#storage)
 * [Auto Scaling group](#auto-scaling-group)
 * [Networking](#networking)
 * [NodePort and LoadBalancer services](#nodeport-and-loadbalancer-services)
 * [Identity and access management (IAM)](#identity-and-access-management-iam)
 * [Tagging](#tagging)
 * [AWS objects](#aws-objects)
 * [Manual infrastructure creation](#manual-infrastructure-creation)
 * [Instance boot](#instance-boot)

### Architecture overview

A Kubernetes cluster consists of a Kubernetes master and a set of nodes
(previously known as 'minions') for which the master is responsible. See the
[Architecture](architecture.md) topic for more details.

By default on AWS:

* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently
  modern kernel that pairs well with Docker and doesn't require a
  reboot. (The default SSH user is `ubuntu` for this and other Ubuntu images.)
* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly
  because this is what Google Compute Engine uses).

You can override these defaults by passing different environment variables to
kube-up.

### Storage

AWS supports persistent volumes by using [Elastic Block Store
(EBS)](../user-guide/volumes.md#awselasticblockstore). These can then be
attached to pods that should store persistent data (e.g. if you're running a
database).

By default, nodes in AWS use [instance
storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)
unless you create pods with persistent volumes
[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes
containers do not have persistent storage unless you attach a persistent
volume, so nodes on AWS use instance storage by default. Instance storage is
cheaper, often faster, and historically more reliable. Unless you can make do
with whatever space is left on your root partition, you must choose an
instance type that provides you with sufficient instance storage for your
needs.

Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to
track its state. Similar to nodes, containers are mostly run against instance
storage, except that we repoint some important data onto the persistent
volume.
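For illustration, here is a minimal sketch of a pod that stores its data on an
EBS persistent volume rather than instance storage. The pod name, image, mount
path, and volume ID below are placeholders; the EBS volume must already exist
in the same availability zone as the node.

```bash
# Sketch only: create a pod backed by an existing EBS volume.
# The pod name, image, and volume ID are placeholders.
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: ebs-example
spec:
  containers:
  - name: db
    image: mysql
    volumeMounts:
    - name: data
      mountPath: /var/lib/mysql   # persistent data lives on EBS, not instance storage
  volumes:
  - name: data
    awsElasticBlockStore:
      volumeID: vol-0123456789abcdef0   # an EBS volume you created beforehand
      fsType: ext4
EOF
```

If the pod is rescheduled onto another node, the volume is detached and
reattached there, so the data survives node replacement.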
The default storage driver for Docker images is aufs. Specifying btrfs (by
passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a
good choice for a filesystem: btrfs is relatively reliable with Docker, and
its reliability has improved with modern kernels. It can easily span multiple
volumes, which is particularly useful when we are using an instance type with
multiple ephemeral instance disks.

### Auto Scaling group

Nodes (but not the master) are run in an
[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled
([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means
that AWS will relaunch any nodes that are terminated.

We do not currently run the master in an Auto Scaling group, but we should
([#11934](http://issues.k8s.io/11934)).

### Networking

Kubernetes uses an IP-per-pod model. This means that a node, which runs many
pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced
routing support, so each node is assigned a /24 CIDR from which its pods draw
their IPs. The assigned CIDR is then configured to route to that instance in
the VPC routing table.

It is also possible to use overlay networking on AWS, but that is not the
default configuration of the kube-up script.

### NodePort and LoadBalancer services

Kubernetes on AWS integrates with [Elastic Load Balancing
(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html).
When you create a service with `Type=LoadBalancer`, Kubernetes (the
kube-controller-manager) will create an ELB, create a security group for the
ELB which allows access on the service ports, attach all the nodes to the ELB,
and modify the security group for the nodes to allow traffic from the ELB to
the nodes. This traffic reaches kube-proxy, which then forwards it to the
pods.

ELB has some restrictions:

* it requires that all nodes listen on a single port,
* it acts as a forwarding proxy (i.e. the source IP is not preserved).

To work within these restrictions, in Kubernetes, [LoadBalancer
services](../user-guide/services.md#type-loadbalancer) are exposed as
[NodePort services](../user-guide/services.md#type-nodeport). kube-proxy then
listens externally on the cluster-wide port that's assigned to NodePort
services and forwards traffic to the corresponding pods.

For example, if we configure a service of `Type=LoadBalancer` with a public
port of 80:

* Kubernetes will assign a NodePort to the service (e.g. 31234).
* The ELB is configured to proxy traffic on the public port 80 to the NodePort
  that is assigned to the service (31234).
* Any incoming traffic that the ELB forwards to the NodePort (e.g. port 31234)
  is recognized by kube-proxy and sent to the correct pods for that service.

Note that we do not automatically open NodePort services in the AWS firewall
(although we do open LoadBalancer services). This is because we expect that
NodePort services are more of a building block for things like inter-cluster
services or for LoadBalancer. To consume a NodePort service externally, you
will likely have to open the port in the node security group
(`kubernetes-minion-<clusterid>`).
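To make the flow above concrete, here is a sketch that creates a LoadBalancer
service on public port 80 and then inspects the NodePort that Kubernetes
assigned. The service name and selector are placeholders.

```bash
# Sketch only: expose pods through an ELB. Name and selector are placeholders.
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  ports:
  - port: 80          # the public port on the ELB
  selector:
    app: web          # pods labeled app=web receive the traffic
EOF

# The assigned NodePort (e.g. 31234) and the ELB hostname appear here:
kubectl describe service web
```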
### Identity and Access Management (IAM)

kube-up sets up two IAM roles, one for the master called
[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
and one for the nodes called
[kubernetes-minion](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).

The master is responsible for creating ELBs and configuring them, as well as
setting up advanced VPC routing. Currently it has blanket permissions on EC2,
along with rights to create and destroy ELBs.

The nodes do not need a lot of access to the AWS APIs. They need to download
a distribution file, and each node is responsible for attaching and detaching
EBS volumes from itself.

The node policy is relatively minimal. The master policy is probably overly
permissive. The security conscious may want to lock down the IAM policies
further ([#11936](http://issues.k8s.io/11936)).

We should make it easier to extend IAM permissions and also ensure that they
are correctly configured ([#14226](http://issues.k8s.io/14226)).

### Tagging

All AWS resources are tagged with a tag named "KubernetesCluster", with a
value that is the unique cluster-id. This tag is used to identify a particular
'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
Resources are considered to belong to the same cluster if and only if they
have the same value in the tag named "KubernetesCluster". (The kube-up script
is not configured to create multiple clusters in the same VPC by default, but
it is possible to create another cluster in the same VPC.)

Within the AWS cloud provider logic, we filter requests to the AWS APIs to
match resources with our cluster tag. By filtering the requests, we ensure
that we see only our own AWS objects.

Important: If you choose not to use kube-up, you must pick a unique cluster-id
value, and ensure that all AWS resources have a tag with
`Name=KubernetesCluster,Value=<clusterid>`.
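If you are creating resources outside of kube-up, tagging can be done with the
AWS CLI. A sketch follows; the instance ID and cluster-id are placeholders.

```bash
# Sketch only: tag a manually created resource so the cloud provider
# recognizes it as part of the cluster. The ID and cluster-id are placeholders.
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=KubernetesCluster,Value=mycluster
```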
### AWS objects

The kube-up script does a number of things in AWS:

* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes
  distribution and the salt scripts into it. They are made world-readable and
  the HTTP URLs are passed to instances; this is how Kubernetes code gets onto
  the machines.
* Creates two IAM profiles based on templates in
  [cluster/aws/templates/iam](../../cluster/aws/templates/iam/):
  * `kubernetes-master` is used by the master.
  * `kubernetes-minion` is used by nodes.
* Creates an AWS SSH key named `kubernetes-<fingerprint>`. The fingerprint
  here is the OpenSSH key fingerprint, so that multiple users can run the
  script with different keys and their keys will not collide (with
  near-certainty). It will use an existing key if one is found at
  `AWS_SSH_KEY`, otherwise it will create one there. (With the default Ubuntu
  images, if you have to SSH in: the user is `ubuntu` and that user can
  `sudo`.)
* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and
  enables the `dns-support` and `dns-hostnames` options.
* Creates an internet gateway for the VPC.
* Creates a route table for the VPC, with the internet gateway as the default
  route.
* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
  (defaults to us-west-2a). Currently, each Kubernetes cluster runs in a
  single AZ on AWS. There are two philosophies in discussion on how to achieve
  high availability (HA):
  * cluster-per-AZ: An independent cluster for each AZ, where each cluster
    is entirely separate.
  * cross-AZ-clusters: A single cluster spans multiple AZs.

  The debate is open here: cluster-per-AZ is discussed as more robust, but
  cross-AZ-clusters are more convenient.
* Associates the subnet with the route table.
* Creates security groups for the master (`kubernetes-master-<clusterid>`)
  and the nodes (`kubernetes-minion-<clusterid>`).
* Configures the security groups so that masters and nodes can communicate.
  This includes intercommunication between masters and nodes, opening SSH
  publicly for both masters and nodes, and opening port 443 on the master for
  the HTTPS API endpoints.
* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type
  `MASTER_DISK_TYPE`.
* Launches a master with a fixed IP address (172.20.0.9) that is also
  configured for the security group and all the necessary IAM credentials. An
  instance script is used to pass vital configuration information to Salt.
  Note: The hope is that over time we can reduce the amount of configuration
  information that must be passed in this way.
* Once the instance is up, it attaches the EBS volume and sets up a manual
  routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
  10.246.0.0/24).
* For the nodes, it creates an auto-scaling launch configuration and group.
  The name for both is `<KUBE_AWS_INSTANCE_PREFIX>-minion-group`; the default
  name is kubernetes-minion-group. The auto-scaling group has a min and max
  size that are both set to `NUM_MINIONS`. You can change the size of the
  auto-scaling group to add or remove nodes from within the AWS API or
  console. Each node self-configures: it comes up, runs Salt with the stored
  configuration, connects to the master, and is assigned an internal CIDR;
  the master then configures the route table with the assigned CIDR. The
  kube-up script performs a health check on the nodes, but it's a self-check
  that is not required.

If attempting this configuration manually, we highly recommend following along
with the kube-up script, and being sure to tag everything with a tag with name
`KubernetesCluster` and value set to a unique cluster-id. Also, passing the
right configuration options to Salt when not using the script is tricky: the
plan here is to simplify this by having Kubernetes take on more node
configuration, and even potentially remove Salt altogether.

### Manual infrastructure creation

While this work is not yet complete, advanced users might choose to manually
create certain AWS objects while still making use of the kube-up script (to
configure Salt, for example). These objects can currently be created manually,
as sketched after this list:

* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
* Set the `VPC_ID` environment variable to reuse an existing VPC.
* Set the `SUBNET_ID` environment variable to reuse an existing subnet.
* If your route table has a matching `KubernetesCluster` tag, it will
  be reused.
* If your security groups are appropriately named, they will be reused.
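A sketch of what this looks like in practice; the bucket name, VPC ID, and
subnet ID below are placeholders for objects you created yourself.

```bash
# Sketch only: reuse pre-created AWS objects when bringing up a cluster.
# The bucket name and IDs are placeholders.
export AWS_S3_BUCKET=my-kubernetes-artifacts
export VPC_ID=vpc-0123456789abcdef0
export SUBNET_ID=subnet-0123456789abcdef0

# kube-up picks up the variables above from the environment.
KUBERNETES_PROVIDER=aws cluster/kube-up.sh
```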
Currently there is no way to do the following with kube-up:

* Use an existing AWS SSH key with an arbitrary name.
* Override the IAM credentials in a sensible way
  ([#14226](http://issues.k8s.io/14226)).
* Use different security group permissions.
* Configure your own auto-scaling groups.

If any of the above items apply to your situation, open an issue to request an
enhancement to the kube-up script. You should provide a complete description
of the use-case, including all the details around what you want to accomplish.

### Instance boot

The instance boot procedure is currently rather complicated, primarily because
we must marshal configuration from Bash to Salt via the AWS instance script.
As we move more post-boot configuration out of Salt and into Kubernetes, we
will hopefully be able to simplify this.

When the kube-up script launches instances, it builds an instance startup
script which includes some configuration options passed to kube-up, and
concatenates some of the scripts found in the `cluster/aws/templates`
directory. These scripts are responsible for mounting and formatting volumes,
downloading Salt and Kubernetes from the S3 bucket, and then triggering Salt
to actually install Kubernetes.
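A simplified, hypothetical sketch of that assembly step follows. The template
names and the `SALT_MASTER`/`MASTER_INTERNAL_IP` variables are illustrative
(the real script sets many more variables and the template names vary by
version); the point is only that Bash variables are baked into a script that
becomes the instance's user data.

```bash
# Sketch only: roughly how kube-up assembles a node's startup script.
(
  echo '#! /bin/bash'
  echo "SALT_MASTER='${MASTER_INTERNAL_IP}'"   # config marshalled from Bash into the script
  cat cluster/aws/templates/format-disks.sh    # mount and format ephemeral volumes
  cat cluster/aws/templates/salt-minion.sh     # download Salt and hand off installation to it
) > node-user-data.sh

# The resulting file is passed as the instance's user data at launch,
# so it runs on first boot.
```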