Patrick Ohly 75fd186e7a DRA scheduler: fix one root cause of double device allocation
DRA depends on the assume cache having invoked all event handlers before
Assume() returns, because DRA maintains state that is relevant for scheduling
through those event handlers.

This log snippet shows how this went wrong during PreBind:

    dynamicresources.go:1150: I0115 10:35:29.264437] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636"
    dra_manager.go:198: I0115 10:35:29.264448] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 version="287"
    dynamicresources.go:1157: I0115 10:35:29.264463] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636" allocation=<
    ...
    allocateddevices.go:189: I0115 10:35:29.267315] scheduler: Observed device allocation device="testdra-all-usesallresources-kqjpj.driver/worker-1/worker-1-device-096" claim="testdra-all-usesallresources-kqjpj/claim-0091"

- goroutine #1: UpdateStatus result delivered via informer.
  AssumeCache updates cache, pushes event A, emitEvents pulls event A from queue.
  *Not* done with delivering it yet!
- goroutine #2: AssumeCache.Assume called. Updates cache, pushes event B, emits it.
  Old and new claim have allocation, so no "Observed device allocation".
- goroutine #3: Schedules next pod, without considering device as allocated (not in the log snippet).
- goroutine #1: Finally delivers event A: "Observed device allocation", but too late.

Also, events are delivered out-of-order.

The fix is to let emitEvents when called by Assume wait for a potentially
running emitEvents in some other goroutine, thus ensuring that an event pulled
out of the queue by that other goroutine got delivered before Assume itself
checks the queue one more time and then returns.

The time window were things go wrong is small. An E2E test covering this only
flaked rarely, and only in the CI. An integration test (separate commit) with
higher number of pods finally made it possible to reproduce locally. It also
uncovered a second race (fix in separate commit).

The unit test fails without the fix:

    === RUN   TestAssumeConcurrency
        assume_cache_test.go:311: FATAL ERROR:
            	Assume should have blocked and didn't.
    --- FAIL: TestAssumeConcurrency (0.00s)
2026-01-27 14:44:03 +01:00
2025-11-01 06:05:27 +02:00
2025-03-08 15:21:58 -08:00
2025-02-26 11:27:07 +01:00
2025-03-08 15:21:58 -08:00
2025-02-12 11:04:11 +02:00
2024-12-23 00:15:17 +03:30

Kubernetes (K8s)

CII Best Practices Go Report Card GitHub release (latest SemVer)


Kubernetes, also known as K8s, is an open source system for managing containerized applications across multiple hosts. It provides basic mechanisms for the deployment, maintenance, and scaling of applications.

Kubernetes builds upon a decade and a half of experience at Google running production workloads at scale using a system called Borg, combined with best-of-breed ideas and practices from the community.

Kubernetes is hosted by the Cloud Native Computing Foundation (CNCF). If your company wants to help shape the evolution of technologies that are container-packaged, dynamically scheduled, and microservices-oriented, consider joining the CNCF. For details about who's involved and how Kubernetes plays a role, read the CNCF announcement.


To start using K8s

See our documentation on kubernetes.io.

Take a free course on Scalable Microservices with Kubernetes.

To use Kubernetes code as a library in other applications, see the list of published components. Use of the k8s.io/kubernetes module or k8s.io/kubernetes/... packages as libraries is not supported.

To start developing K8s

The community repository hosts all information about building Kubernetes from source, how to contribute code and documentation, who to contact about what, etc.

If you want to build Kubernetes right away there are two options:

You have a working Go environment.
git clone https://github.com/kubernetes/kubernetes
cd kubernetes
make
You have a working Docker environment.
git clone https://github.com/kubernetes/kubernetes
cd kubernetes
make quick-release

For the full story, head over to the developer's documentation.

Support

If you need support, start with the troubleshooting guide, and work your way through the process that we've outlined.

That said, if you have questions, reach out to us one way or another.

Community Meetings

The Calendar has the list of all the meetings in the Kubernetes community in a single location.

Adopters

The User Case Studies website has real-world use cases of organizations across industries that are deploying/migrating to Kubernetes.

Governance

Kubernetes project is governed by a framework of principles, values, policies and processes to help our community and constituents towards our shared goals.

The Kubernetes Community is the launching point for learning about how we organize ourselves.

The Kubernetes Steering community repo is used by the Kubernetes Steering Committee, which oversees governance of the Kubernetes project.

Roadmap

The Kubernetes Enhancements repo provides information about Kubernetes releases, as well as feature tracking and backlogs.

Description
Production-Grade Container Scheduling and Management
Readme Apache-2.0 1.4 GiB
Languages
Go 97%
Shell 2.6%
PowerShell 0.2%