Kubernetes Submit Queue 6de28fab7d Merge pull request #42942 from vishh/gpu-cont-fix
Automatic merge from submit-queue (batch tested with PRs 42942, 42935)

[Bug] Handle container restarts and avoid using runtime pod cache while allocating GPUs

Fixes #42412

**Background**
Support for multiple GPUs is an experimental feature in v1.6. 
Container restarts were handled incorrectly which resulted in stranding of GPUs
Kubelet is incorrectly using runtime cache to track running pods which can result in race conditions (as it did in other parts of kubelet). This can result in same GPU being assigned to multiple pods.

**What does this PR do**
This PR tracks assignment of GPUs to containers and returns pre-allocated GPUs instead of (incorrectly) allocating new GPUs.
GPU manager is updated to consume a list of active pods derived from apiserver cache instead of runtime cache.
Node e2e has been extended to validate this failure scenario.

**Risk**
Minimal/None since support for GPUs is an experimental feature that is turned off by default. The code is also isolated to GPU manager in kubelet.

**Workarounds**
In the absence of this PR, users can mitigate the original issue by setting `RestartPolicyNever`  in their pods.
There is no workaround for the race condition caused by using the runtime cache though.
Hence it is worth including this fix in v1.6.0.

cc @jianzhangbjz @seelam @kubernetes/sig-node-pr-reviews 

Replaces #42560
2017-03-14 10:19:17 -07:00
2017-03-09 07:34:56 +00:00
2017-03-09 07:34:56 +00:00
2017-02-21 14:35:55 -08:00

Kubernetes

Submit Queue Widget GoDoc Widget

Introduction

Kubernetes is an open source system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. Kubernetes is hosted by the Cloud Native Computing Foundation (CNCF).

Kubernetes builds upon a decade and a half of experience at Google running production workloads at scale using a system called Borg, combined with best-of-breed ideas and practices from the community.


Are you ...

Code of Conduct

The Kubernetes community abides by the CNCF code of conduct. Here is an excerpt:

As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

Community

Do you want to help shape the evolution of technologies that are container-packaged, dynamically-scheduled and microservices-oriented?. If you are a company, you should consider joining the CNCF. For details about who's involved in CNCF and how Kubernetes plays a role, read the announcement. For general information about our community, see the website community page.

Contribute

If you're interested in being a contributor and want to get involved in developing Kubernetes, get started with this reading:

You will then most certainly gain a lot from joining a SIG, attending the regular hangouts as well as the community meeting.

If you have an idea for a new feature, see the Kubernetes Features repository for a list of features that are coming in new releases as well as details on how to propose one.

Building Kubernetes for the impatient

If you want to build Kubernetes right away there are two options:

$ go get -d k8s.io/kubernetes
$ cd $GOPATH/src/k8s.io/kubernetes
$ make
$ git clone https://github.com/kubernetes/kubernetes
$ cd kubernetes
$ make quick-release

If you are less impatient, head over to the developer's documentation.

Support

While there are many different channels that you can use to get hold of us (Slack, Stack Overflow, Issues, Forums/Mailing lists), you can help make sure that we are efficient in getting you the help that you need.

If you need support, start with the troubleshooting guide and work your way through the process that we've outlined.

That said, if you have questions, reach out to us one way or another. We don't bite!

Analytics

Description
Production-Grade Container Scheduling and Management
Readme Apache-2.0 1.3 GiB
Languages
Go 97%
Shell 2.6%
PowerShell 0.2%