- [GPU support](#gpu-support)
  - [Objective](#objective)
  - [Background](#background)
  - [Detailed discussion](#detailed-discussion)
    - [Inventory](#inventory)
    - [Scheduling](#scheduling)
    - [The runtime](#the-runtime)
      - [NVIDIA support](#nvidia-support)
    - [Event flow](#event-flow)
    - [Too complex for now: nvidia-docker](#too-complex-for-now-nvidia-docker)
  - [Implementation plan](#implementation-plan)
    - [V0](#v0)
      - [Scheduling](#scheduling-1)
      - [Runtime](#runtime)
      - [Other](#other)
  - [Future work](#future-work)
    - [V1](#v1)
    - [V2](#v2)
    - [V3](#v3)
    - [Undetermined](#undetermined)
  - [Security considerations](#security-considerations)

# GPU support

Author: @therc

Date: Apr 2016

Status: Design in progress, early implementation of requirements

## Objective

Users should be able to request GPU resources for their workloads, as easily as
for CPU or memory. Kubernetes should keep an inventory of machines with GPU
hardware, schedule containers on appropriate nodes and set up the container
environment with everything necessary to access the GPU. All of this should
eventually be supported for clusters on either bare metal or cloud providers.

## Background

An increasing number of workloads, such as machine learning and seismic survey
processing, benefit from offloading computations to graphics hardware. While
not as tuned as traditional, dedicated high-performance computing systems such
as MPI clusters, a Kubernetes cluster can still be a great environment for
organizations that also need to run a variety of additional, "classic"
workloads, such as databases, web serving, etc.

GPU support is hard to provide comprehensively and will thus take time to tame
completely, because:

- different vendors expose the hardware to users in different ways
- some vendors require fairly tight coupling between the kernel driver
controlling the GPU and the libraries/applications that access the hardware
- it adds more resource types (whole GPUs, GPU cores, GPU memory)
- it can introduce new security pitfalls
- for systems with multiple GPUs, affinity matters, similarly to NUMA
considerations for CPUs
- running GPU code in containers is still a relatively novel idea

## Detailed discussion

Currently, this document focuses mostly on the basic use case: run GPU code on
AWS `g2.2xlarge` EC2 machine instances using Docker. This scenario is narrow
enough that it does not require large amounts of generic code yet. GCE doesn't
support GPUs at all; bare metal systems throw a lot of extra variables into the
mix.

Later sections will outline future work to support a broader set of hardware,
environments and container runtimes.

### Inventory

Before any scheduling can occur, we need to know what's available out there. In
v0, the kubelet will not perform real hardware detection; capacity will be
hardcoded based on a flag, `--experimental-nvidia-gpu`. This will result in the
user-defined resource `alpha.kubernetes.io/nvidia-gpu` being reported in
`NodeCapacity` and `NodeAllocatable`, as well as in a node label.
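
To make the flag's effect concrete, here is a minimal sketch of the
kubelet-side reporting, written against the in-tree `pkg/api` types; the helper
name and the label value are placeholders, not settled API:

```go
package kubelet

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/api/resource"
)

// ResourceNvidiaGPU is the alpha resource name proposed in this document.
const ResourceNvidiaGPU api.ResourceName = "alpha.kubernetes.io/nvidia-gpu"

// setNvidiaGPUCapacity sketches how the kubelet could advertise a single whole
// GPU when --experimental-nvidia-gpu is set. The function name and the label
// value are illustrative assumptions.
func setNvidiaGPUCapacity(node *api.Node, enabled bool) {
	if !enabled {
		return
	}
	if node.Status.Capacity == nil {
		node.Status.Capacity = api.ResourceList{}
	}
	if node.Status.Allocatable == nil {
		node.Status.Allocatable = api.ResourceList{}
	}
	one := *resource.NewQuantity(1, resource.DecimalSI) // v0: one whole device
	node.Status.Capacity[ResourceNvidiaGPU] = one
	node.Status.Allocatable[ResourceNvidiaGPU] = one
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels[string(ResourceNvidiaGPU)] = "1"
}
```
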


### Scheduling

GPUs will be visible as first-class resources. In v0, we'll only assign whole
devices; sharing among multiple pods is left to future implementations. It's
probable that GPUs will exacerbate the need for [a rescheduler](rescheduler.md)
or pod priorities, especially if the nodes in a cluster are not homogeneous.
Consider these two cases:

> Only half of the machines have a GPU and they're all busy with other
workloads. The other half of the cluster is doing very little work. A GPU
workload arrives, but it can't be scheduled: the devices are sitting idle on
nodes that are running something else, while the lightly loaded nodes lack the
hardware.

> Some or all of the machines have two graphics cards each. A number of jobs
get scheduled, requesting one device per pod. The scheduler puts them all on
different machines, spreading the load, perhaps by design. Then a new job comes
in that requires two devices per pod, but it can't be scheduled anywhere,
because the most any node has to offer is one unused device.
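
Whatever the eventual answer to those two cases, v0 only needs a whole-device
fit check. The sketch below illustrates the idea with made-up names; the real
check would plug into the scheduler's existing resource-fit predicates:

```go
package predicates

// podFitsNvidiaGPU is a sketch of a whole-device fit check for v0; the name
// and signature are illustrative, not the scheduler's actual predicate API.
// requested is the number of whole GPUs the pod asks for, inUse is what other
// pods on the node already claim, allocatable is the node's reported total.
func podFitsNvidiaGPU(requested, inUse, allocatable int64) bool {
	if requested == 0 {
		return true // non-GPU pods are unaffected
	}
	return inUse+requested <= allocatable
}
```
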
### The runtime

Once we know where to run the container, it's time to set up its environment.
At a minimum, we'll need to map the host device(s) into the container. Because
each manufacturer exposes different device nodes (`/dev/ati/card0`,
`/dev/nvidia0`, but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`),
some of the logic needs to be hardware-specific, mapping from a logical device
to the list of device nodes necessary for software to talk to it.

Support binaries and libraries are often versioned along with the kernel
module, so there should be further hooks to project those under `/bin` and some
kind of `/lib` before the application is started. This can be done for Docker
with the use of a versioned [Docker
volume](https://docs.docker.com/engine/userguide/containers/dockervolumes/) or
with upcoming Kubernetes-specific hooks such as init containers and volume
containers. In v0, images are expected to bundle everything they need.

#### NVIDIA support

The first implementation and testing ground will be for NVIDIA devices, by far
the most common setup.

In v0, the `--experimental-nvidia-gpu` flag will also result in the host
devices (limited to those required to drive the first card, `nvidia0`) being
mapped into the container by the dockertools library.
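
As a concrete illustration, the hardware-specific expansion for NVIDIA could
look like the sketch below, which uses the `DeviceMapping` type from Docker's
engine-api; the helper itself and the cgroup permission string are assumptions:

```go
package dockertools

import "github.com/docker/engine-api/types/container"

// nvidiaDeviceMappings sketches the expansion from a logical GPU to the device
// nodes Docker must map. Only the engine-api DeviceMapping type is real; the
// helper and the permission string are illustrative.
func nvidiaDeviceMappings() []container.DeviceMapping {
	// nvidiactl and nvidia-uvm are always required; nvidia0 is the first card.
	paths := []string{"/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"}
	mappings := make([]container.DeviceMapping, 0, len(paths))
	for _, path := range paths {
		mappings = append(mappings, container.DeviceMapping{
			PathOnHost:        path,
			PathInContainer:   path,
			CgroupPermissions: "mrw", // mknod, read, write
		})
	}
	return mappings
}
```

The resulting list would then end up in the container's `Resources.Devices`, as
described in the event flow below.
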
### Event flow

This is what happens before and after a user schedules a GPU pod.

1. Administrator installs a number of Kubernetes nodes with GPUs. The correct
kernel modules and device nodes under `/dev/` are present.

1. Administrator makes sure the latest CUDA/driver versions are installed.

1. Administrator enables `--experimental-nvidia-gpu` on kubelets.

1. Kubelets update node status with information about the GPU device, in
addition to cAdvisor's usual data about CPU/memory/disk.

1. User creates a Docker image compiling their application for CUDA, bundling
the necessary libraries. For now, we ignore any versioning requirements that
the image declares through labels based on [NVIDIA's
conventions](https://github.com/NVIDIA/nvidia-docker/blob/64510511e3fd0d00168eb076623854b0fcf1507d/tools/src/nvidia-docker/utils.go#L13).

1. User creates a pod using the image, requiring
`alpha.kubernetes.io/nvidia-gpu: 1` (see the sketch after this list).

1. Scheduler picks a node for the pod.

1. The kubelet notices the GPU requirement and maps the three devices. In
Docker's engine-api, this means it'll add them to the `Resources.Devices` list.

1. Docker runs the container to completion.

1. The scheduler notices that the device is available again.
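
For step 6 above, the request could look like the following sketch, written
against the in-tree API types; the pod name and image are made up, and whether
the amount belongs in limits, requests or both is still to be decided:

```go
package example

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/api/resource"
)

// gpuPod sketches the pod from step 6: one container asking for a whole GPU.
// Only the resource name comes from this proposal; everything else is an
// illustrative example.
func gpuPod() *api.Pod {
	return &api.Pod{
		ObjectMeta: api.ObjectMeta{Name: "cuda-job"},
		Spec: api.PodSpec{
			Containers: []api.Container{{
				Name:  "main",
				Image: "example.com/cuda-app:latest", // hypothetical image
				Resources: api.ResourceRequirements{
					Limits: api.ResourceList{
						"alpha.kubernetes.io/nvidia-gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
}
```
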
### Too complex for now: nvidia-docker

For v0, we discussed the [nvidia-docker
plugin](https://github.com/NVIDIA/nvidia-docker) at length, but decided to
leave it aside initially. The plugin is an officially supported solution that
would avoid a lot of new low-level code, as it takes care of functionality such
as:

- creating a Docker volume with binaries such as `nvidia-smi` and shared
libraries
- providing HTTP endpoints that monitoring tools can use to collect GPU metrics
- abstracting details such as `/dev` entry names for each device, as well as
control devices like `nvidiactl`

The `nvidia-docker` wrapper also verifies that the CUDA version required by a
given image is supported by the host drivers, through inspection of well-known
image labels, if present. We should try to provide equivalent checks, either
for CUDA or OpenCL.

This is current sample output from `nvidia-docker-plugin`, wrapped for
readability:

    $ curl -s localhost:3476/docker/cli
    --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
    --volume-driver=nvidia-docker
    --volume=nvidia_driver_352.68:/usr/local/nvidia:ro

It runs as a daemon listening for HTTP requests on port 3476. The endpoint
above returns flags that need to be added to the Docker command line in order
to expose GPUs to the containers. There are optional URL arguments to request
specific devices if more than one is present on the system, as well as specific
versions of the support software. An obvious improvement would be an additional
endpoint for JSON output.

The unresolved question is whether `nvidia-docker-plugin` would run standalone
as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes
resource API) or whether the relevant code from its `nvidia` package should be
linked directly into the kubelet. A partial list of tradeoffs:

|                     | External binary | Linked in |
|---------------------|-----------------|-----------|
| Use of cgo          | Confined to the binary | Linked into the kubelet, but with lazy binding |
| Expandability       | Limited if we run the plugin, increased if the library is used to build a Kubernetes-tailored daemon | Can reuse the `nvidia` library as we prefer |
| Bloat               | None | Larger kubelet, even for systems without GPUs |
| Reliability         | Need to handle the binary disappearing at any time | Fewer headaches |
| (Un)Marshalling     | Need to talk over JSON | None |
| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags |
| Releases            | Potentially on its own schedule | Tied to Kubernetes' |

## Implementation plan

### V0

The first two tracks can progress in parallel.

#### Scheduling

1. Define the new resource `alpha.kubernetes.io/nvidia-gpu` in
`pkg/api/types.go` and co.
1. Plug the resource into the feasibility checks used by the kubelet, the
scheduler and `schedulercache`. Maybe gated behind a flag?
1. Plug the resource into `resource_helpers.go`.
1. Plug the resource into the `LimitRanger`.

#### Runtime

1. Add a kubelet config parameter to enable the resource.
1. Make the kubelet's `setNodeStatusMachineInfo` report the resource.
1. Add a `Devices` list to `container.RunContainerOptions`.
1. Use it from `DockerManager`'s `runContainer`.
1. Do the same for rkt (stretch goal).
1. When a pod requests a GPU, add the devices to the container options.

#### Other

1. Add the new resource to `kubectl describe` output. Optional for non-GPU
users?
1. Administrator documentation, with sample scripts.
1. User documentation.

## Future work

Above all, we need to collect feedback from real users and use that to set
priorities for any of the items below.

### V1

- Perform real detection of the installed hardware
- Figure out a standard way to avoid bundling shared libraries in images
- Support fractional resources, so multiple pods can share the same GPU
- Support bare metal setups
- Report resource usage

### V2

- Support multiple GPUs with resource hierarchies and affinities
- Support versioning of resources (e.g. "CUDA v7.5+")
- Build resource plugins into the kubelet?
- Support other device vendors
- Support Azure?
- Support rkt?

### V3

- Support OpenCL (so images can be device-agnostic)

### Undetermined

It makes sense to turn the output of this project (external resource plugins,
etc.) into a more generic abstraction at some point.

## Security considerations

There should be knobs for the cluster administrator to only allow certain users
or roles to schedule GPU workloads. Overcommitting or sharing the same device
across different pods is not considered safe. It should be possible to
segregate such GPU-sharing pods by user, namespace or a combination thereof.
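
One possible shape for such a knob, sketched as a plain helper rather than any
existing admission or quota API (all names below are hypothetical):

```go
package gpupolicy

import "fmt"

// validateGPURequest is a purely illustrative sketch of a per-namespace
// allow-list for GPU requests; it does not correspond to existing Kubernetes
// admission or quota code.
func validateGPURequest(namespace string, gpusRequested int64, allowedNamespaces map[string]bool) error {
	if gpusRequested > 0 && !allowedNamespaces[namespace] {
		return fmt.Errorf("namespace %q may not request alpha.kubernetes.io/nvidia-gpu", namespace)
	}
	return nil
}
```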