<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.4/docs/proposals/gpu-support.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN MUNGE: GENERATED_TOC -->

- [GPU support](#gpu-support)
  - [Objective](#objective)
  - [Background](#background)
  - [Detailed discussion](#detailed-discussion)
    - [Inventory](#inventory)
    - [Scheduling](#scheduling)
    - [The runtime](#the-runtime)
      - [NVIDIA support](#nvidia-support)
    - [Event flow](#event-flow)
    - [Too complex for now: nvidia-docker](#too-complex-for-now-nvidia-docker)
  - [Implementation plan](#implementation-plan)
    - [V0](#v0)
      - [Scheduling](#scheduling-1)
      - [Runtime](#runtime)
      - [Other](#other)
  - [Future work](#future-work)
    - [V1](#v1)
    - [V2](#v2)
    - [V3](#v3)
    - [Undetermined](#undetermined)
  - [Security considerations](#security-considerations)

<!-- END MUNGE: GENERATED_TOC -->

# GPU support

Author: @therc

Date: Apr 2016

Status: Design in progress, early implementation of requirements

## Objective

Users should be able to request GPU resources for their workloads as easily as
they do for CPU or memory. Kubernetes should keep an inventory of machines with
GPU hardware, schedule containers on appropriate nodes and set up the container
environment with everything necessary to access the GPU. All of this should
eventually be supported for clusters on either bare metal or cloud providers.

## Background

An increasing number of workloads, such as machine learning and seismic survey
processing, benefit from offloading computations to graphics hardware. While not
as tuned as traditional, dedicated high-performance computing systems such as
MPI, a Kubernetes cluster can still be a great environment for organizations
that also need to run a variety of "classic" workloads, such as databases, web
serving, etc.

GPU support is hard to provide comprehensively and will thus take time to tame
completely, because

- different vendors expose the hardware to users in different ways
- some vendors require fairly tight coupling between the kernel driver
controlling the GPU and the libraries/applications that access the hardware
- it adds more resource types (whole GPUs, GPU cores, GPU memory)
- it can introduce new security pitfalls
- for systems with multiple GPUs, affinity matters, similarly to NUMA
considerations for CPUs
- running GPU code in containers is still a relatively novel idea

## Detailed discussion

Currently, this document is mostly focused on the basic use case: run GPU code
on AWS `g2.2xlarge` EC2 machine instances using Docker. It constitutes a narrow
enough scenario that it does not require large amounts of generic code yet. GCE
doesn't support GPUs at all; bare metal systems throw a lot of extra variables
into the mix.

Later sections will outline future work to support a broader set of hardware,
environments and container runtimes.

### Inventory

Before any scheduling can occur, we need to know what's available out there. In
v0, the kubelet won't detect actual hardware; it will report a hardcoded
capacity based on a flag, `--experimental-nvidia-gpu`. This will result in the
user-defined resource `alpha.kubernetes.io/nvidia-gpu` being reported for
`NodeCapacity` and `NodeAllocatable`, as well as exposed as a node label.

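To make the flag's effect concrete, here is a minimal sketch, in Go, of how the
kubelet could publish that hardcoded capacity. The helper name is a hypothetical
illustration, not the actual kubelet code, and the node label is omitted.

    // Illustrative sketch only: publish the hardcoded v0 GPU capacity when
    // --experimental-nvidia-gpu is enabled. setNvidiaGPUCapacity is a
    // hypothetical helper, not an existing kubelet function.
    package kubelet

    import (
        "k8s.io/kubernetes/pkg/api"
        "k8s.io/kubernetes/pkg/api/resource"
    )

    const resourceNvidiaGPU = api.ResourceName("alpha.kubernetes.io/nvidia-gpu")

    func setNvidiaGPUCapacity(node *api.Node, experimentalNvidiaGPU bool) {
        if !experimentalNvidiaGPU {
            return
        }
        if node.Status.Capacity == nil {
            node.Status.Capacity = api.ResourceList{}
        }
        if node.Status.Allocatable == nil {
            node.Status.Allocatable = api.ResourceList{}
        }
        // v0: exactly one whole device, no real hardware detection.
        one := *resource.NewQuantity(1, resource.DecimalSI)
        node.Status.Capacity[resourceNvidiaGPU] = one
        node.Status.Allocatable[resourceNvidiaGPU] = one
        // The corresponding node label is omitted here for brevity.
    }
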
### Scheduling

GPUs will be visible as first-class resources. In v0, we'll only assign whole
devices; sharing among multiple pods is left to future implementations. It's
probable that GPUs will exacerbate the need for [a rescheduler](rescheduler.md)
or pod priorities, especially if the nodes in a cluster are not homogeneous.
Consider these two cases:

> Only half of the machines have a GPU and they're all busy with other
> workloads. The other half of the cluster is doing very little work. A GPU
> workload comes in, but it can't schedule, because the devices are sitting idle
> on nodes that are running something else, while the lightly loaded nodes lack
> the hardware.

> Some or all of the machines have two graphics cards each. A number of jobs get
> scheduled, requesting one device per pod. The scheduler puts them all on
> different machines, spreading the load, perhaps by design. Then a new job comes
> in, requiring two devices per pod, but it can't schedule anywhere, because the
> most we can find is one unused device per node.

### The runtime

Once we know where to run the container, it's time to set up its environment. At
a minimum, we'll need to map the host device(s) into the container. Because each
manufacturer exposes different device nodes (`/dev/ati/card0`, `/dev/nvidia0`,
but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`), some of the logic
needs to be hardware-specific, mapping from a logical device to a list of device
nodes necessary for software to talk to it.

Support binaries and libraries are often versioned along with the kernel module,
so there should be further hooks to project those under `/bin` and some kind of
`/lib` before the application is started. This can be done for Docker with the
use of a versioned [Docker
volume](https://docs.docker.com/engine/tutorials/dockervolumes/) or
with upcoming Kubernetes-specific hooks such as init containers and volume
containers. In v0, images are expected to bundle everything they need.

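As an illustration of that hardware-specific mapping, here is a small Go sketch
for the NVIDIA case; the function name is hypothetical, but the device nodes are
the ones listed above.

    // Illustrative only: map a logical NVIDIA GPU index to every host device
    // node a container needs in order to drive it.
    package gpu

    import "fmt"

    func nvidiaDeviceNodes(index int) []string {
        return []string{
            fmt.Sprintf("/dev/nvidia%d", index), // the card itself
            "/dev/nvidiactl",                    // control node, shared by all cards
            "/dev/nvidia-uvm",                   // unified virtual memory node
        }
    }
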
#### NVIDIA support

The first implementation and testing ground will be for NVIDIA devices, by far
the most common setup.

In v0, the `--experimental-nvidia-gpu` flag will also result in the host devices
(limited to those required to drive the first card, `nvidia0`) being mapped into
the container by the dockertools library.

### Event flow

This is what happens before and after a user schedules a GPU pod.

1. Administrator installs a number of Kubernetes nodes with GPUs. The correct
kernel modules and device nodes under `/dev/` are present.

1. Administrator makes sure the latest CUDA/driver versions are installed.

1. Administrator enables `--experimental-nvidia-gpu` on the kubelets.

1. Kubelets update node status with information about the GPU device, in addition
to cAdvisor's usual data about CPU/memory/disk.

1. User creates a Docker image compiling their application for CUDA, bundling
the necessary libraries. We ignore any versioning requirements expressed in the
image through labels based on [NVIDIA's
conventions](https://github.com/NVIDIA/nvidia-docker/blob/64510511e3fd0d00168eb076623854b0fcf1507d/tools/src/nvidia-docker/utils.go#L13).

1. User creates a pod using the image, requiring
`alpha.kubernetes.io/nvidia-gpu: 1` (a sample sketch follows this list).

1. Scheduler picks a node for the pod.

1. The kubelet notices the GPU requirement and maps the three devices. In
Docker's engine-api, this means it'll add them to the `Resources.Devices` list.

1. Docker runs the container to completion.

1. The scheduler notices that the device is available again.

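To make step 6 concrete, this is a minimal sketch of such a pod expressed with
the Kubernetes Go API types; the pod name and image are hypothetical, and only
the GPU limit matters here.

    // Illustrative only: a pod requesting one whole GPU via the v0 resource name.
    package main

    import (
        "fmt"

        "k8s.io/kubernetes/pkg/api"
        "k8s.io/kubernetes/pkg/api/resource"
    )

    func main() {
        pod := api.Pod{
            ObjectMeta: api.ObjectMeta{Name: "cuda-job"}, // hypothetical name
            Spec: api.PodSpec{
                Containers: []api.Container{{
                    Name:  "cuda-app",
                    Image: "example.com/cuda-app:v1", // hypothetical image bundling its CUDA libraries
                    Resources: api.ResourceRequirements{
                        Limits: api.ResourceList{
                            // One whole device, per the v0 "whole GPUs only" rule.
                            "alpha.kubernetes.io/nvidia-gpu": resource.MustParse("1"),
                        },
                    },
                }},
            },
        }
        fmt.Println(pod.Spec.Containers[0].Resources.Limits)
    }
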
### Too complex for now: nvidia-docker

For v0, we discussed the [nvidia-docker
plugin](https://github.com/NVIDIA/nvidia-docker) at length, but decided to leave
it aside initially. The plugin is an officially supported solution that would
spare us a lot of new low-level code, as it takes care of functionality such as:

- creating a Docker volume with binaries such as `nvidia-smi` and shared
libraries
- providing HTTP endpoints that monitoring tools can use to collect GPU metrics
- abstracting details such as `/dev` entry names for each device, as well as
control ones like `nvidiactl`

The `nvidia-docker` wrapper also verifies that the CUDA version required by a
given image is supported by the host drivers, through inspection of well-known
image labels, if present. We should try to provide equivalent checks, either
for CUDA or OpenCL.

This is the current sample output from `nvidia-docker-plugin`, wrapped for
readability:

    $ curl -s localhost:3476/docker/cli
    --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
    --volume-driver=nvidia-docker
    --volume=nvidia_driver_352.68:/usr/local/nvidia:ro

It runs as a daemon listening for HTTP requests on port 3476. The endpoint above
returns flags that need to be added to the Docker command line in order to
expose GPUs to the containers. There are optional URL arguments to request
specific devices if more than one is present on the system, as well as specific
versions of the support software. An obvious improvement is an additional
endpoint for JSON output.

The unresolved question is whether `nvidia-docker-plugin` would run standalone
as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes
resource API) or whether the relevant code from its `nvidia` package should be
linked directly into the kubelet. A partial list of tradeoffs:

|                     | External binary                                                                                      | Linked in                                                     |
|---------------------|------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| Use of cgo          | Confined to the binary                                                                                 | Linked into the kubelet, but with lazy binding                 |
| Expandability       | Limited if we run the plugin as-is; increased if the library is used to build a Kubernetes-tailored daemon | Can reuse the `nvidia` library as we prefer                    |
| Bloat               | None                                                                                                   | Larger kubelet, even for systems without GPUs                  |
| Reliability         | Need to handle the binary disappearing at any time                                                     | Fewer headaches                                                |
| (Un)Marshalling     | Need to talk over JSON                                                                                 | None                                                           |
| Administration cost | One more daemon to install, configure and monitor                                                      | No extra work required, other than perhaps configuring flags   |
| Releases            | Potentially on its own schedule                                                                        | Tied to Kubernetes'                                            |

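For the "external binary" column, this is roughly what consuming the plugin over
HTTP could look like from Go, using the same endpoint as the `curl` example
above; error handling is kept to a bare minimum and the snippet assumes the
plugin is running on its default port.

    // Illustrative only: fetch the extra `docker run` flags from a locally
    // running nvidia-docker-plugin.
    package main

    import (
        "fmt"
        "io/ioutil"
        "log"
        "net/http"
        "strings"
    )

    func main() {
        resp, err := http.Get("http://localhost:3476/docker/cli")
        if err != nil {
            log.Fatalf("nvidia-docker-plugin not reachable: %v", err)
        }
        defer resp.Body.Close()

        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            log.Fatal(err)
        }
        // The response is a whitespace-separated list of --device/--volume flags.
        for _, flag := range strings.Fields(string(body)) {
            fmt.Println(flag)
        }
    }
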
## Implementation plan

### V0

The first two tracks can progress in parallel.

#### Scheduling

1. Define the new resource `alpha.kubernetes.io/nvidia-gpu` in `pkg/api/types.go`
and co.
1. Plug the resource into the feasibility checks used by the kubelet, scheduler
and schedulercache (a simplified sketch follows this list). Maybe gated behind a
flag?
1. Plug the resource into `resource_helpers.go`
1. Plug the resource into the limitranger

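The following Go sketch shows the shape of item 2 rather than the actual
scheduler code; the struct and function below are simplified stand-ins for the
real types used by the kubelet, scheduler and schedulercache.

    // Illustrative only: a resource-fit check extended with the new GPU resource.
    package predicates

    type resources struct {
        milliCPU  int64
        memory    int64
        nvidiaGPU int64
    }

    // fitsOnNode reports whether a pod's request still fits on a node. GPUs are
    // handed out as whole devices in v0, so a plain integer comparison suffices.
    func fitsOnNode(request, used, allocatable resources) bool {
        if used.milliCPU+request.milliCPU > allocatable.milliCPU {
            return false
        }
        if used.memory+request.memory > allocatable.memory {
            return false
        }
        if used.nvidiaGPU+request.nvidiaGPU > allocatable.nvidiaGPU {
            return false
        }
        return true
    }
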
#### Runtime

1. Add a kubelet config parameter to enable the resource
1. Make the kubelet's `setNodeStatusMachineInfo` report the resource
1. Add a `Devices` list to `container.RunContainerOptions` (see the sketch after
this list)
1. Use it from `DockerManager`'s `runContainer`
1. Do the same for rkt (stretch goal)
1. When a pod requests a GPU, add the devices to the container options

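A sketch of the plumbing in items 3 and 4: a `Devices` list carried on the
runtime options and its conversion into the engine-api type Docker expects. The
`DeviceInfo` type and function name are simplified stand-ins, not the final API.

    // Illustrative only: carry device mappings from the kubelet's container
    // options into Docker's HostConfig.Resources.Devices.
    package dockertools

    import dockercontainer "github.com/docker/engine-api/types/container"

    // DeviceInfo is a simplified stand-in for the entries a Devices list on
    // RunContainerOptions could carry.
    type DeviceInfo struct {
        PathOnHost      string
        PathInContainer string
        Permissions     string // e.g. "mrw"
    }

    func toDockerDevices(devices []DeviceInfo) []dockercontainer.DeviceMapping {
        out := make([]dockercontainer.DeviceMapping, 0, len(devices))
        for _, d := range devices {
            out = append(out, dockercontainer.DeviceMapping{
                PathOnHost:        d.PathOnHost,
                PathInContainer:   d.PathInContainer,
                CgroupPermissions: d.Permissions,
            })
        }
        return out
    }
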
#### Other

1. Add new resource to `kubectl describe` output. Optional for non-GPU users?
1. Administrator documentation, with sample scripts
1. User documentation

## Future work

Above all, we need to collect feedback from real users and use that to set
priorities for any of the items below.

### V1

- Perform real detection of the installed hardware
- Figure out a standard way to avoid bundling shared libraries in images
- Support fractional resources so multiple pods can share the same GPU
- Support bare metal setups
- Report resource usage

### V2

- Support multiple GPUs with resource hierarchies and affinities
- Support versioning of resources (e.g. "CUDA v7.5+")
- Build resource plugins into the kubelet?
- Support other device vendors
- Support Azure?
- Support rkt?

### V3

- Support OpenCL (so images can be device-agnostic)

### Undetermined

It makes sense to turn the output of this project (external resource plugins,
etc.) into a more generic abstraction at some point.

## Security considerations

There should be knobs for the cluster administrator to only allow certain users
or roles to schedule GPU workloads. Overcommitting or sharing the same device
across different pods is not considered safe. It should be possible to segregate
such GPU-sharing pods by user, namespace or a combination thereof.