From 426346c7e33083a159c6589ffe210e42967e2a11 Mon Sep 17 00:00:00 2001
From: Justin Santa Barbara
Date: Mon, 19 Oct 2015 13:55:43 -0400
Subject: [PATCH] More fixes based on commments

---
 docs/design/aws_under_the_hood.md | 119 +++++++++++++++++-------------
 1 file changed, 66 insertions(+), 53 deletions(-)

diff --git a/docs/design/aws_under_the_hood.md b/docs/design/aws_under_the_hood.md
index ac9efe558fa..845964f2193 100644
--- a/docs/design/aws_under_the_hood.md
+++ b/docs/design/aws_under_the_hood.md
@@ -49,6 +49,18 @@ Kubernetes clusters are created on AWS.
 This can be particularly useful if problems arise or in circumstances where the
 provided scripts are lacking and you manually created or configured your
 cluster.
 
+**Table of contents:**
+  * [Architecture overview](#architecture-overview)
+  * [Storage](#storage)
+  * [Auto Scaling group](#auto-scaling-group)
+  * [Networking](#networking)
+  * [NodePort and LoadBalancing services](#nodeport-and-loadbalancing-services)
+  * [Identity and access management (IAM)](#identity-and-access-management-iam)
+  * [Tagging](#tagging)
+  * [AWS objects](#aws-objects)
+  * [Manual infrastructure creation](#manual-infrastructure-creation)
+  * [Instance boot](#instance-boot)
+
 ### Architecture overview
 
 Kubernetes is a cluster of several machines that consists of a Kubernetes
@@ -56,17 +68,13 @@ master and a set number of nodes (previously known as 'minions') for which the
 master is responsible. See the [Architecture](architecture.md) topic for
 more details.
 
-Other documents describe the general architecture of Kubernetes (all nodes run
-Docker; the kubelet agent runs on each node and launches containers; the
-kube-proxy relays traffic between the nodes etc).
-
 By default on AWS:
 
 * Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently
 modern kernel that pairs well with Docker and doesn't require a reboot. (The
 default SSH user is `ubuntu` for this and other Ubuntu images.)
-* By default we run aufs over ext4 as the filesystem / container storage on the
-  nodes (mostly because this is what GCE uses).
+* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly
+  because this is what Google Compute Engine uses).
 
 You can override these defaults by passing different environment variables to
 kube-up.
 
@@ -82,12 +90,12 @@ unless you create pods with persistent volumes
 [(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes
 containers do not have persistent storage unless you attach a persistent
 volume, and so nodes on AWS use instance storage. Instance storage is cheaper,
-often faster, and historically more reliable. This does mean that you should
-pick an instance type that has sufficient instance storage, unless you can make
-do with whatever space is left on your root partition.
+often faster, and historically more reliable. Unless you can make do with whatever
+space is left on your root partition, you must choose an instance type that provides
+you with sufficient instance storage for your needs.
 
 Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to track
-its state but similar to the nodes, containers are mostly run against instance
+its state. Similar to nodes, containers are mostly run against instance
 storage, except that we repoint some important data onto the persistent volume.
 
 The default storage driver for Docker images is aufs. Specifying btrfs (by passing the environment
@@ -96,12 +104,12 @@ is relatively reliable with Docker and has improved its reliability with modern
 kernels. It can easily span multiple volumes, which is particularly useful when
 we are using an instance type with multiple ephemeral instance disks.
 
-### AutoScaling
+### Auto Scaling group
 
 Nodes (but not the master) are run in an
-[AutoScalingGroup](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
+[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
 on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled
-([#11935](http://issues.k8s.io/11935)). Instead, the auto-scaling group means
+([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means
 that AWS will relaunch any nodes that are terminated.
 
 We do not currently run the master in an AutoScalingGroup, but we should
@@ -111,14 +119,13 @@ We do not currently run the master in an AutoScalingGroup, but we should
 
 Kubernetes uses an IP-per-pod model. This means that a node, which runs many
 pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced
-routing support so each pod is assigned a /24 CIDR. Each pod is assigned a /24
-CIDR; the assigned CIDR is then configured to route to an instance in the VPC
-routing table.
+routing support so each pod is assigned a /24 CIDR. The assigned CIDR is then
+configured to route to an instance in the VPC routing table.
 
-It is also possible to use overlay networking on AWS, but that is not the
+It is also possible to use overlay networking on AWS, but that is not the
 default configuration of the kube-up script.
 
-### NodePort and LoadBalancing
+### NodePort and LoadBalancing services
 
 Kubernetes on AWS integrates with [Elastic Load Balancing
 (ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html).
@@ -129,17 +136,23 @@ and modify the security group for the nodes to allow traffic from the ELB to
 the nodes. This traffic reaches kube-proxy where it is then forwarded to the
 pods.
 
-ELB has some restrictions: it requires that all nodes listen on a single port,
-and it acts as a forwarding proxy (i.e. the source IP is not preserved). To
-work with these restrictions, in Kubernetes, [LoadBalancer
-services](../user-guide/services.html#type-loadbalancer) are exposed as
+ELB has some restrictions:
+* it requires that all nodes listen on a single port,
+* it acts as a forwarding proxy (i.e. the source IP is not preserved).
+
+To work with these restrictions, in Kubernetes, [LoadBalancer
+services](../user-guide/services.md#type-loadbalancer) are exposed as
 [NodePort services](../user-guide/services.md#type-nodeport). Then kube-proxy
 listens externally on the cluster-wide port that's assigned to
-NodePort services and forwards traffic to the corresponding pods. So ELB is
-configured to proxy traffic on the public port (e.g. port 80) to the NodePort
-that is assigned to the service (e.g. 31234). Any in-coming traffic sent to
-the NodePort (e.g. port 31234) is recognized by kube-proxy and then sent to the
-correct pods for that service.
+NodePort services and forwards traffic to the corresponding pods.
+
+So for example, if we configure a service of type LoadBalancer with a
+public port of 80:
+* Kubernetes will assign a NodePort to the service (e.g. 31234).
+* ELB is configured to proxy traffic on the public port 80 to the NodePort
+  that is assigned to the service (31234).
+* Then any incoming traffic that ELB forwards to the NodePort (e.g. port 31234)
+  is recognized by kube-proxy and sent to the correct pods for that service.
 
 Note that we do not automatically open NodePort services in the AWS firewall
 (although we do open LoadBalancer services). This is because we expect that
@@ -188,31 +201,31 @@ Important: If you choose not to use kube-up, you must pick a unique cluster-id
 value, and ensure that all AWS resources have a tag with
 `Name=KubernetesCluster,Value=<cluster-id>`.
 
-### AWS Objects
+### AWS objects
 
 The kube-up script does a number of things in AWS:
 
 * Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes
 distribution and the salt scripts into it. They are made world-readable and the HTTP URLs
-are passed to instances; this is how Kubernetes code gets onto the machines.
+  are passed to instances; this is how Kubernetes code gets onto the machines.
 * Creates two IAM profiles based on templates in
 [cluster/aws/templates/iam](../../cluster/aws/templates/iam/):
-  * `kubernetes-master` is used by the master
+  * `kubernetes-master` is used by the master.
   * `kubernetes-minion` is used by nodes.
 * Creates an AWS SSH key named `kubernetes-<fingerprint>`. Fingerprint here is
 the OpenSSH key fingerprint, so that multiple users can run the script with
-different keys and their keys will not collide (with near-certainty). It will
-use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create
-one there. (With the default ubuntu images, if you have to SSH in: the user is
-`ubuntu` and that user can `sudo`)
+  different keys and their keys will not collide (with near-certainty). It will
+  use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create
+  one there. (With the default Ubuntu images, if you have to SSH in: the user is
+  `ubuntu` and that user can `sudo`).
 * Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and
 enables the `dns-support` and `dns-hostnames` options.
 * Creates an internet gateway for the VPC.
 * Creates a route table for the VPC, with the internet gateway as the default
-  route
+  route.
 * Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
 (defaults to us-west-2a). Currently, each Kubernetes cluster runs in a
-single AZ on AWS. Although, there are two philosophies in discussion on how to
-achieve High Availability (HA):
+  single AZ on AWS. However, there are two philosophies under discussion on how
+  to achieve High Availability (HA):
   * cluster-per-AZ: An independent cluster for each AZ, where each cluster is
 entirely separate.
   * cross-AZ-clusters: A single cluster spans multiple AZs.
@@ -220,31 +233,31 @@ The debate is open here, where cluster-per-AZ is discussed as more robust but
 cross-AZ-clusters are more convenient.
 * Associates the subnet to the route table
 * Creates security groups for the master (`kubernetes-master-<cluster-id>`)
-  and the nodes (`kubernetes-minion-<cluster-id>`)
+  and the nodes (`kubernetes-minion-<cluster-id>`).
 * Configures security groups so that masters and nodes can communicate. This
 includes intercommunication between masters and nodes, opening SSH publicly
-for both masters and nodes, and opening port 443 on the master for the HTTPS
-API endpoints.
+  for both masters and nodes, and opening port 443 on the master for the HTTPS
+  API endpoints.
 * Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type
-  `MASTER_DISK_TYPE`
+  `MASTER_DISK_TYPE`.
 * Launches a master with a fixed IP address (172.20.0.9) that is also
 configured for the security group and all the necessary IAM credentials. An
-instance script is used to pass vital configuration information to Salt. Note:
-The hope is that over time we can reduce the amount of configuration
-information that must be passed in this way.
+  instance script is used to pass vital configuration information to Salt. Note:
+  The hope is that over time we can reduce the amount of configuration
+  information that must be passed in this way.
 * Once the instance is up, it attaches the EBS volume and sets up a manual
 routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
-10.246.0.0/24)
+  10.246.0.0/24).
 * For auto-scaling, it creates a launch configuration and group for the nodes.
 The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default
-name is kubernetes-minion-group. The auto-scaling group has a min and max size
-that are both set to NUM_MINIONS. You can change the size of the auto-scaling
-group to add or remove the total number of nodes from within the AWS API or
-Console. Each nodes self-configures, meaning that they come up; run Salt with
-the stored configuration; connect to the master; are assigned an internal CIDR;
-and then the master configures the route-table with the assigned CIDR. The
-kube-up script performs a health-check on the nodes but it's a self-check that
-is not required.
+  name is kubernetes-minion-group. The auto-scaling group has a min and max size
+  that are both set to NUM_MINIONS. You can change the size of the auto-scaling
+  group to add or remove the total number of nodes from within the AWS API or
+  Console. Each node self-configures, meaning that it comes up; runs Salt with
+  the stored configuration; connects to the master; is assigned an internal CIDR;
+  and then the master configures the route-table with the assigned CIDR. The
+  kube-up script performs a health-check on the nodes but it's a self-check that
+  is not required.
 
 If attempting this configuration manually, I highly recommend following along
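The patched document says the AWS defaults (AMI, storage, zone, node count, master disk) can be overridden by passing different environment variables to kube-up. Here is a minimal sketch of doing that from Python; the `cluster/kube-up.sh` path, the `KUBERNETES_PROVIDER` switch, and the concrete values are assumptions for illustration and are not taken from the patch.

```python
# Minimal sketch: run kube-up with some of the defaults described in the
# document overridden via environment variables. Path, provider switch, and
# values are assumptions, not part of the patch.
import os
import subprocess

env = os.environ.copy()
env.update({
    "KUBERNETES_PROVIDER": "aws",   # assumed provider selector for kube-up
    "KUBE_AWS_ZONE": "us-west-2a",  # AZ named in the document
    "NUM_MINIONS": "4",             # node count; value chosen for illustration
    "MASTER_DISK_SIZE": "20",       # master EBS volume size (unit assumed)
})

# Launch the cluster bring-up script with the overridden environment.
subprocess.run(["cluster/kube-up.sh"], env=env, check=True)
```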
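To make the LoadBalancer-to-NodePort flow in the patched text concrete, here is a minimal sketch using the official Kubernetes Python client (assumed to be installed and pointed at a working kubeconfig; the service name and selector labels are placeholders). It creates a service of type LoadBalancer on public port 80 and prints the NodePort the cluster assigns, which is the port ELB forwards to on every node.

```python
# Minimal sketch: create a LoadBalancer service and observe the assigned NodePort.
# Assumes the `kubernetes` Python client and a reachable cluster; names are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="my-web"),      # hypothetical service name
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "my-web"},                    # hypothetical pod label
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)

created = v1.create_namespaced_service(namespace="default", body=service)

# Kubernetes picks a NodePort (e.g. 31234); ELB proxies public port 80 to that
# NodePort on the nodes, and kube-proxy forwards it to the matching pods.
for p in created.spec.ports:
    print("public port", p.port, "-> node port", p.node_port)
```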
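For the manual-infrastructure path, the patched text requires picking a unique cluster-id and tagging every AWS resource with `Name=KubernetesCluster` and that value. A minimal sketch of applying such a tag with boto3, assuming AWS credentials are configured; the region, resource IDs, and cluster-id below are placeholders.

```python
# Minimal sketch: tag manually created AWS resources with the KubernetesCluster
# tag the document requires. Region, IDs, and cluster-id are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

cluster_id = "mycluster"                       # hypothetical cluster-id value
resources = [
    "vpc-0123456789abcdef0",                   # hypothetical VPC ID
    "subnet-0123456789abcdef0",                # hypothetical subnet ID
]

ec2.create_tags(
    Resources=resources,
    Tags=[{"Key": "KubernetesCluster", "Value": cluster_id}],
)
```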
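The patched text also notes that the total number of nodes can be changed by resizing the Auto Scaling group through the AWS API or Console. A minimal sketch of the same operation with boto3, assuming the default `kubernetes-minion-group` name and a region chosen only for illustration.

```python
# Minimal sketch: resize the node Auto Scaling group outside of kube-up.
# Group name assumes the default <KUBE_AWS_INSTANCE_PREFIX>-minion-group.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="kubernetes-minion-group",
    MinSize=3,            # kube-up sets min and max to NUM_MINIONS
    MaxSize=3,
    DesiredCapacity=3,    # the new total node count
)
```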