Compare commits

21 Commits

Author SHA1 Message Date
Zvonko Kaiser
99f32de1e5 kata-deploy: Update RuntimeClass PodOverhead
Align the podOverhead with the default_memory updated
in the previous commit.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
6a853a9684 gpu: Bump NVRC
A new NVRC release is available; include it in the next
Kata release.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>

Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
8ff5d164c6 runtime: make CDI annotation vendor-agnostic with lookup table
Replace hardcoded NVIDIA vendor ID (0x10de) and class (0x030) checks
with a vendor-agnostic lookup table (cdiDeviceKind) that maps PCI
vendor/class pairs to CDI device kinds. This makes it straightforward
to add support for new device types by adding entries to the table.

Refactor siblingAnnotation to resolve device BDFs once upfront and
reuse them for both CDI type detection and sibling matching, eliminating
redundant sysfs reads. Devices not in the lookup table (e.g. NVSwitches)
are skipped with errNoSiblingFound, while known device types that fail
to match a sibling produce a hard error.

Consolidate the hot-plug and cold-plug device loops into a single loop
over extracted container paths, removing duplicated filtering logic.

Export GetPCIDeviceProperty from the device drivers package to allow
vendor/class lookup from sysfs in the container annotation path.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
d4c21f50b5 gpu: Bump default memory to 8G for GPU runtimes
We need enough initial memory to prepare more complex
platforms like HGX H100 or HGX B200 systems.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
5c9683f006 gpu: Remove devtmpfs.mount=0
With the newest NVRC release this is solved and does
not need to be overridden.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
d22c314e91 gpu: Increase dial_timeout=1200
For cold-plug when running with nerdctl, the timeouts in the config
are used. Increase the dial_timeout (e.g. for CreateSandbox) to match
create_container_timeout.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
7fe84c8038 gpu: HGX Rootfs Fixes
Various smaller fixes to enable HGX systems.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Joji Mekkattuparamban
1fd66db271 nvidia-gpu: add missing libraries to rootfs
Added the missing packages to the nvidia rootfs.

Fixes #12534

Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
2026-03-13 16:24:32 -07:00
Dan Mihai
9332b75c04 Merge pull request #12661 from stevenhorsman/runtime-go-1.25.8
runtime: bump go.mod version
2026-03-13 14:06:08 -07:00
Manuel Huber
4a7022d2f4 tests: nvidia: call genpolicy auth for all tests
Call the setup_genpolicy_registry_auth in run_kubernetes_nv_tests.sh.
Authenticate before exercising any tests.

Recently, we have seen UnauthorizedError messages for the CUDA
vectorAdd image. While this image is not gated behind authentication,
rate limiting may be a possible issue.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-13 09:03:01 -07:00
Zvonko Kaiser
4c450a5b01 Merge pull request #12648 from manuelh-dev/mahuber/trustee-upgrade
versions: bump trustee to latest version
2026-03-12 14:09:15 -04:00
Steve Horsman
7d2e18575c Merge pull request #12343 from zvonkok/release-model
doc: Release model update
2026-03-12 14:44:51 +00:00
Zvonko Kaiser
7f662662cf lint: Fix 80 char column size
Make markdownlint happy

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-12 12:03:29 +00:00
Zvonko Kaiser
6e03a95730 doc: Update Release Process
Add how Kata is doing the rolling release.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-12 12:03:29 +00:00
stevenhorsman
f25fa6ab25 runtime: bump go.mod version
Update the runtime's go.mod go version to 1.25.8 to
keep in sync with versions.yaml

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-12 08:53:40 +00:00
Steve Horsman
a29eb3751a Merge pull request #12517 from kata-containers/osv-scanner-bump-2.3.3
workflows: Bump OSV scanner
2026-03-12 08:48:52 +00:00
stevenhorsman
064a960aaa workflows: Bump OSV scanner
Bump to the latest version to pick up bug fixes

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-12 07:00:11 +00:00
Steve Horsman
f41edcb4c0 Merge pull request #12653 from kata-containers/dependabot/cargo/src/tools/agent-ctl/quinn-proto-0.11.14
build(deps): bump quinn-proto from 0.11.8 to 0.11.14 in /src/tools/agent-ctl
2026-03-12 06:53:59 +00:00
Manuel Huber
0926c92aa0 versions: bump trustee to latest version
Ingest various recent fixes and dependency updates.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-11 14:45:42 -07:00
Manuel Huber
8162d15b46 nvidia: fix invalid CTK reference
Use proper reference from versions yaml structure.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-11 12:49:29 -07:00
dependabot[bot]
d366d103cc build(deps): bump quinn-proto in /src/tools/agent-ctl
Bumps [quinn-proto](https://github.com/quinn-rs/quinn) from 0.11.8 to 0.11.14.
- [Release notes](https://github.com/quinn-rs/quinn/releases)
- [Commits](https://github.com/quinn-rs/quinn/compare/quinn-proto-0.11.8...quinn-proto-0.11.14)

---
updated-dependencies:
- dependency-name: quinn-proto
  dependency-version: 0.11.14
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-11 16:04:34 +00:00
18 changed files with 352 additions and 237 deletions

View File

@@ -19,23 +19,25 @@ permissions: {}
jobs:
scan-scheduled:
name: Scan of whole repo
permissions:
actions: read # # Required to upload SARIF file to CodeQL
contents: read # Read commit contents
security-events: write # Require writing security events to upload SARIF file to security tab
if: ${{ github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
uses: "google/osv-scanner-action/.github/workflows/osv-scanner-reusable.yml@b00f71e051ddddc6e46a193c31c8c0bf283bf9e6" # v2.1.0
uses: "google/osv-scanner-action/.github/workflows/osv-scanner-reusable.yml@8ae4be80636b94886b3c271caad730985ce0611c" # v2.3.3
with:
scan-args: |-
-r
./
scan-pr:
name: Scan of just PR code
permissions:
actions: read # Required to upload SARIF file to CodeQL
contents: read # Read commit contents
security-events: write # Require writing security events to upload SARIF file to security tab
if: ${{ github.event_name == 'pull_request' }}
uses: "google/osv-scanner-action/.github/workflows/osv-scanner-reusable-pr.yml@b00f71e051ddddc6e46a193c31c8c0bf283bf9e6" # v2.1.0
uses: "google/osv-scanner-action/.github/workflows/osv-scanner-reusable-pr.yml@8ae4be80636b94886b3c271caad730985ce0611c" # v2.3.3
with:
# Example of specifying custom arguments
scan-args: |-

View File

@@ -1,57 +1,64 @@
# How to do a Kata Containers Release
This document lists the tasks required to create a Kata Release.
## Requirements
- GitHub permissions to run workflows.
## Versioning
## Release Model
The Kata Containers project uses [semantic versioning](http://semver.org/) for all releases.
Semantic versions are comprised of three fields in the form:
Kata Containers follows a rolling release model with monthly snapshots.
New features, bug fixes, and improvements are continuously integrated into
`main`. Each month, a snapshot is tagged as a new `MINOR` release.
```
MAJOR.MINOR.PATCH
```
### Versioning
When `MINOR` increases, the new release adds **new features** but *without changing the existing behavior*.
Releases use the `MAJOR.MINOR.PATCH` scheme. Monthly snapshots increment
`MINOR`; `PATCH` is typically `0`. Major releases are rare (years apart) and
signal significant architectural changes that may require updates to container
managers (Containerd, CRI-O) or other infrastructure. Breaking changes in
`MINOR` releases are avoided where possible, but may occasionally occur as
features are deprecated or removed.
When `MAJOR` increases, the new release adds **new features, bug fixes, or
both** and which **changes the behavior from the previous release** (incompatible with previous releases).
### No Stable Branches
A major release will also likely require a change of the container manager version used,
-for example Containerd or CRI-O. Please refer to the release notes for further details.
**Important** : the Kata Containers project doesn't have stable branches (see
[this issue](https://github.com/kata-containers/kata-containers/issues/9064) for details).
Bug fixes are released as part of `MINOR` or `MAJOR` releases only. `PATCH` is always `0`.
The Kata Containers project does not maintain stable branches (see
[#9064](https://github.com/kata-containers/kata-containers/issues/9064)).
Bug fixes land on `main` and ship in the next monthly snapshot rather than
being backported. Downstream projects that need extended support or compliance
certifications should select a monthly snapshot as their stable base and manage
their own validation and patch backporting from there.
## Release Process
### Bump the `VERSION` and `Chart.yaml` file
When the `kata-containers/kata-containers` repository is ready for a new release,
first create a PR to set the release in the [`VERSION`](./../VERSION) file and update the
`version` and `appVersion` in the
[`Chart.yaml`](./../tools/packaging/kata-deploy/helm-chart/kata-deploy/Chart.yaml) file and
have it merged.
When the `kata-containers/kata-containers` repository is ready for a new
release, first create a PR to set the release in the [`VERSION`](./../VERSION)
file and update the `version` and `appVersion` in the
[`Chart.yaml`](./../tools/packaging/kata-deploy/helm-chart/kata-deploy/Chart.yaml)
file and have it merged.
### Lock the `main` branch
In order to prevent any PRs getting merged during the release process, and slowing the release
process down, by impacting the payload caches, we have recently trailed setting the `main`
branch to read only whilst the release action runs.
In order to prevent any PRs getting merged during the release process, and
slowing the release process down, by impacting the payload caches, we have
recently trialed setting the `main` branch to read only whilst the release
action runs.
> [!NOTE]
> Admin permission is needed to complete this task.
### Wait for the `VERSION` bump PR payload publish to complete
To reduce the chance of need to re-run the release workflow, check the
[CI | Publish Kata Containers payload](https://github.com/kata-containers/kata-containers/actions/workflows/payload-after-push.yaml)
To reduce the chance of need to re-run the release workflow, check the [CI |
Publish Kata Containers
payload](https://github.com/kata-containers/kata-containers/actions/workflows/payload-after-push.yaml)
once the `VERSION` PR bump has merged to check that the assets build correctly
and are cached, so that the release process can just download these artifacts
rather than needing to build them all, which takes time and can reveal errors in infra.
rather than needing to build them all, which takes time and can reveal errors in
infra.
### Check GitHub Actions
@@ -63,11 +70,10 @@ release artifacts.
> [!NOTE]
> Write permissions to trigger the action.
The action is manually triggered and is responsible for generating a new
release (including a new tag), pushing those to the
`kata-containers/kata-containers` repository. The new release is initially
created as a draft. It is promoted to an official release when the whole
workflow has completed successfully.
The action is manually triggered and is responsible for generating a new release
(including a new tag), pushing those to the `kata-containers/kata-containers`
repository. The new release is initially created as a draft. It is promoted to
an official release when the whole workflow has completed successfully.
Check the [actions status
page](https://github.com/kata-containers/kata-containers/actions) to verify all
@@ -75,12 +81,13 @@ steps in the actions workflow have completed successfully. On success, a static
tarball containing Kata release artifacts will be uploaded to the [Release
page](https://github.com/kata-containers/kata-containers/releases).
If the workflow fails because of some external environmental causes, e.g. network
timeout, simply re-run the failed jobs until they eventually succeed.
If the workflow fails because of some external environmental causes, e.g.
network timeout, simply re-run the failed jobs until they eventually succeed.
If for some reason you need to cancel the workflow or re-run it entirely, go first
to the [Release page](https://github.com/kata-containers/kata-containers/releases) and
delete the draft release from the previous run.
If for some reason you need to cancel the workflow or re-run it entirely, go
first to the [Release
page](https://github.com/kata-containers/kata-containers/releases) and delete
the draft release from the previous run.
### Unlock the `main` branch
@@ -90,9 +97,8 @@ an admin to do it.
### Improve the release notes
Release notes are auto-generated by the GitHub CLI tool used as part of our
release workflow. However, some manual tweaking may still be necessary in
order to highlight the most important features and bug fixes in a specific
release.
release workflow. However, some manual tweaking may still be necessary in order
to highlight the most important features and bug fixes in a specific release.
With this in mind, please, poke @channel on #kata-dev and people who worked on
the release will be able to contribute to that.
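
As a rough illustration of the versioning rule in the updated document (monthly snapshots increment `MINOR`, and `PATCH` is typically `0`), the following Go sketch computes the next snapshot version from the current one. The helper `nextMonthlySnapshot` and the version `3.27.0` are hypothetical and chosen for illustration only; the release workflow itself is not shown in this diff and may derive the version differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextMonthlySnapshot is a hypothetical helper: given MAJOR.MINOR.PATCH it
// returns the version of the next monthly snapshot, assuming MINOR is
// incremented and PATCH is reset to 0.
func nextMonthlySnapshot(current string) (string, error) {
	parts := strings.Split(current, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", current)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return "", fmt.Errorf("invalid MAJOR in %q: %w", current, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return "", fmt.Errorf("invalid MINOR in %q: %w", current, err)
	}
	// PATCH is validated but not carried over.
	if _, err := strconv.Atoi(parts[2]); err != nil {
		return "", fmt.Errorf("invalid PATCH in %q: %w", current, err)
	}
	return fmt.Sprintf("%d.%d.0", major, minor+1), nil
}

func main() {
	next, err := nextMonthlySnapshot("3.27.0") // example input, not a real release
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("next snapshot:", next)
}
```
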

View File

@@ -483,16 +483,12 @@ ifneq (,$(QEMUCMD))
KERNELPATH_CONFIDENTIAL_NV = $(KERNELDIR)/$(KERNELNAME_CONFIDENTIAL_NV)
DEFAULTVCPUS_NV = 1
DEFAULTMEMORY_NV = 2048
DEFAULTMEMORY_NV = 8192
DEFAULTTIMEOUT_NV = 1200
DEFAULTVFIOPORT_NV = root-port
DEFAULTPCIEROOTPORT_NV = 8
# Disable the devtmpfs mount in guest. NVRC does this, and later kata-agent
# attempts this as well in a non-failing manner. Otherwise, NVRC fails when
# using an image and /dev is already mounted.
KERNELPARAMS_NV = "cgroup_no_v1=all"
KERNELPARAMS_NV += "devtmpfs.mount=0"
KERNELPARAMS_NV += "pci=realloc"
KERNELPARAMS_NV += "pci=nocrs"
KERNELPARAMS_NV += "pci=assign-busses"

View File

@@ -599,7 +599,7 @@ debug_console_enabled = false
# Agent connection dialing timeout value in seconds
# (default: 90)
dial_timeout = 90
dial_timeout = @DEFAULTTIMEOUT_NV@
[runtime]
# If enabled, the runtime will log additional debug messages to the

View File

@@ -576,7 +576,7 @@ debug_console_enabled = false
# Agent connection dialing timeout value in seconds
# (default: 90)
dial_timeout = 90
dial_timeout = @DEFAULTTIMEOUT_NV@
[runtime]
# If enabled, the runtime will log additional debug messages to the

View File

@@ -578,7 +578,7 @@ debug_console_enabled = false
# Agent connection dialing timeout value in seconds
# (default: 90)
dial_timeout = 90
dial_timeout = @DEFAULTTIMEOUT_NV@
[runtime]
# If enabled, the runtime will log additional debug messages to the

View File

@@ -1,7 +1,7 @@
module github.com/kata-containers/kata-containers/src/runtime
// Keep in sync with version in versions.yaml
go 1.25.7
go 1.25.8
// WARNING: Do NOT use `replace` directives as those break dependabot:
// https://github.com/kata-containers/kata-containers/issues/11020

View File

@@ -72,7 +72,7 @@ func IsPCIeDevice(bdf string) bool {
}
// read from /sys/bus/pci/devices/xxx/property
func getPCIDeviceProperty(bdf string, property PCISysFsProperty) string {
func GetPCIDeviceProperty(bdf string, property PCISysFsProperty) string {
if len(strings.Split(bdf, ":")) == 2 {
bdf = PCIDomain + ":" + bdf
}
@@ -220,9 +220,9 @@ func GetDeviceFromVFIODev(device config.DeviceInfo) ([]*config.VFIODev, error) {
return nil, err
}
vendorID := getPCIDeviceProperty(deviceBDF, PCISysFsDevicesVendor)
deviceID := getPCIDeviceProperty(deviceBDF, PCISysFsDevicesDevice)
pciClass := getPCIDeviceProperty(deviceBDF, PCISysFsDevicesClass)
vendorID := GetPCIDeviceProperty(deviceBDF, PCISysFsDevicesVendor)
deviceID := GetPCIDeviceProperty(deviceBDF, PCISysFsDevicesDevice)
pciClass := GetPCIDeviceProperty(deviceBDF, PCISysFsDevicesClass)
i, err := extractIndex(device.HostPath)
if err != nil {
@@ -276,7 +276,7 @@ func GetAllVFIODevicesFromIOMMUGroup(device config.DeviceInfo) ([]*config.VFIODe
switch vfioDeviceType {
case config.VFIOPCIDeviceNormalType, config.VFIOPCIDeviceMediatedType:
// This is vfio-pci and vfio-mdev specific
pciClass := getPCIDeviceProperty(deviceBDF, PCISysFsDevicesClass)
pciClass := GetPCIDeviceProperty(deviceBDF, PCISysFsDevicesClass)
// We need to ignore Host or PCI Bridges that are in the same IOMMU group as the
// passed-through devices. One CANNOT pass-through a PCI bridge or Host bridge.
// Class 0x0604 is PCI bridge, 0x0600 is Host bridge
@@ -288,8 +288,8 @@ func GetAllVFIODevicesFromIOMMUGroup(device config.DeviceInfo) ([]*config.VFIODe
continue
}
// Fetch the PCI Vendor ID and Device ID
vendorID := getPCIDeviceProperty(deviceBDF, PCISysFsDevicesVendor)
deviceID := getPCIDeviceProperty(deviceBDF, PCISysFsDevicesDevice)
vendorID := GetPCIDeviceProperty(deviceBDF, PCISysFsDevicesVendor)
deviceID := GetPCIDeviceProperty(deviceBDF, PCISysFsDevicesDevice)
// Do not directly assign to `vfio` -- need to access field still
vfio = config.VFIODev{

View File

@@ -7,6 +7,7 @@ package virtcontainers
import (
"context"
"errors"
"fmt"
"io"
"os"
@@ -1135,7 +1136,9 @@ func (c *Container) createDevices(ctx context.Context, contConfig *ContainerConf
// If we're hot-plugging this will be a no-op because at this stage
// no devices are attached to the root-port or switch-port
c.annotateContainerWithVFIOMetadata(vfioColdPlugDevices)
if err := c.annotateContainerWithVFIOMetadata(vfioColdPlugDevices); err != nil {
return err
}
return nil
}
@@ -1194,11 +1197,40 @@ func sortContainerVFIODevices(devices []config.DeviceInfo) []config.DeviceInfo {
return vfioDevices
}
// errNoSiblingFound is returned by siblingAnnotation when the VFIO device is
// not of a supported CDI device type, i.e. it has no entry in the cdiDeviceKind
// table (e.g. NVSwitches). Callers should treat this as a non-fatal "device not
// applicable" condition rather than a sibling-matching failure.
var errNoSiblingFound = fmt.Errorf("no suitable sibling found")
// cdiDeviceKey identifies a device type by vendor ID and PCI class prefix.
type cdiDeviceKey struct {
VendorID string
ClassPrefix string
}
// cdiDeviceKind maps known device types to their CDI annotation kind.
var cdiDeviceKind = map[cdiDeviceKey]string{
{VendorID: "0x10de", ClassPrefix: "0x030"}: "nvidia.com/gpu",
}
// cdiKindForDevice returns the CDI kind for a given vendor ID and PCI class,
// or empty string and false if the device is not recognized.
func cdiKindForDevice(vendorID, class string) (string, bool) {
for key, kind := range cdiDeviceKind {
if vendorID == key.VendorID && strings.Contains(class, key.ClassPrefix) {
return kind, true
}
}
return "", false
}
type DeviceRelation struct {
Bus string
Path string
Index int
BDF string
Bus string
Path string
Index int
BDF string
CDIKind string
}
// Depending on the HW we might need to inject metadata into the container
@@ -1223,15 +1255,13 @@ func (c *Container) annotateContainerWithVFIOMetadata(devices interface{}) error
// so lets first iterate over all root-port devices and then
// switch-port devices no special handling for bridge-port (PCI)
for _, dev := range config.PCIeDevicesPerPort["root-port"] {
// For the NV GPU we need special handling let's use only those
if dev.VendorID == "0x10de" && strings.Contains(dev.Class, "0x030") {
siblings = append(siblings, DeviceRelation{Bus: dev.Bus, Path: dev.HostPath, BDF: dev.BDF})
if kind, ok := cdiKindForDevice(dev.VendorID, dev.Class); ok {
siblings = append(siblings, DeviceRelation{Bus: dev.Bus, Path: dev.HostPath, BDF: dev.BDF, CDIKind: kind})
}
}
for _, dev := range config.PCIeDevicesPerPort["switch-port"] {
// For the NV GPU we need special handling let's use only those
if dev.VendorID == "0x10de" && strings.Contains(dev.Class, "0x030") {
siblings = append(siblings, DeviceRelation{Bus: dev.Bus, Path: dev.HostPath, BDF: dev.BDF})
if kind, ok := cdiKindForDevice(dev.VendorID, dev.Class); ok {
siblings = append(siblings, DeviceRelation{Bus: dev.Bus, Path: dev.HostPath, BDF: dev.BDF, CDIKind: kind})
}
}
// We need to sort the VFIO devices by bus to get the correct
@@ -1244,48 +1274,53 @@ func (c *Container) annotateContainerWithVFIOMetadata(devices interface{}) error
siblings[i].Index = i
}
// Now that we have the index lets connect the /dev/vfio/<num>
// to the correct index
if devices, ok := devices.([]ContainerDevice); ok {
for _, dev := range devices {
if dev.ContainerPath == "/dev/vfio/vfio" {
c.Logger().Infof("skipping /dev/vfio/vfio for vfio_mode=guest-kernel")
continue
}
err := c.siblingAnnotation(dev.ContainerPath, siblings)
if err != nil {
return err
}
// Collect container paths from either hot-plug or cold-plug devices
var containerPaths []string
if devs, ok := devices.([]ContainerDevice); ok {
for _, dev := range devs {
containerPaths = append(containerPaths, dev.ContainerPath)
}
}
if devs, ok := devices.([]config.DeviceInfo); ok {
for _, dev := range devs {
containerPaths = append(containerPaths, dev.ContainerPath)
}
}
if devices, ok := devices.([]config.DeviceInfo); ok {
for _, dev := range devices {
if dev.ContainerPath == "/dev/vfio/vfio" {
c.Logger().Infof("skipping /dev/vfio/vfio for vfio_mode=guest-kernel")
// Now that we have the index lets connect the /dev/vfio/<num>
// to the correct index
for _, devPath := range containerPaths {
if !strings.HasPrefix(devPath, "/dev/vfio") {
c.Logger().Infof("skipping guest annotations for non-VFIO device %q", devPath)
continue
}
if devPath == "/dev/vfio/vfio" {
c.Logger().Infof("skipping /dev/vfio/vfio for vfio_mode=guest-kernel")
continue
}
if err := c.siblingAnnotation(devPath, siblings); err != nil {
if errors.Is(err, errNoSiblingFound) {
c.Logger().Infof("no CDI annotation for device %s (not a known CDI device type)", devPath)
continue
}
err := c.siblingAnnotation(dev.ContainerPath, siblings)
if err != nil {
return err
}
return err
}
}
}
return nil
}
// createCDIAnnotation adds a container annotation mapping a VFIO device to a GPU index.
// createCDIAnnotation adds a container annotation mapping a VFIO device to a device index.
//
// devPath is the path to the VFIO device, which can be in the format
// "/dev/vfio/<num>" or "/dev/vfio/devices/vfio<num>". The function extracts
// the device number from the path and creates an annotation with the key
// "cdi.k8s.io/vfio<num>" and the value "nvidia.com/gpu=<index>", where
// <num> is the device number and <index> is the provided GPU index.
// "cdi.k8s.io/vfio<num>" and the value "<cdiKind>=<index>", where
// <cdiKind> is the CDI device kind (e.g. "nvidia.com/gpu"),
// <num> is the device number and <index> is the provided device index.
// The annotation is stored in c.config.CustomSpec.Annotations.
func (c *Container) createCDIAnnotation(devPath string, index int) {
func (c *Container) createCDIAnnotation(devPath string, index int, cdiKind string) {
// We have here either /dev/vfio/<num> or /dev/vfio/devices/vfio<num>
baseName := filepath.Base(devPath)
vfioNum := baseName
@@ -1294,66 +1329,68 @@ func (c *Container) createCDIAnnotation(devPath string, index int) {
vfioNum = strings.TrimPrefix(baseName, "vfio")
}
annoKey := fmt.Sprintf("cdi.k8s.io/vfio%s", vfioNum)
annoValue := fmt.Sprintf("nvidia.com/gpu=%d", index)
annoValue := fmt.Sprintf("%s=%d", cdiKind, index)
if c.config.CustomSpec.Annotations == nil {
c.config.CustomSpec.Annotations = make(map[string]string)
}
c.config.CustomSpec.Annotations[annoKey] = annoValue
c.Logger().Infof("annotated container with %s: %s", annoKey, annoValue)
}
func (c *Container) siblingAnnotation(devPath string, siblings []DeviceRelation) error {
for _, sibling := range siblings {
if sibling.Path == devPath {
c.createCDIAnnotation(devPath, sibling.Index)
return nil
// Resolve the device's BDFs once upfront. This serves two purposes:
// 1. Determine if the device is a known CDI type (if not, skip it)
// 2. Reuse the BDFs for sibling matching without redundant sysfs reads
isKnownCDIDevice := false
var devBDFs []string
if strings.HasPrefix(filepath.Base(devPath), "vfio") {
// IOMMUFD device (/dev/vfio/devices/vfio<NUM>): single device per char dev
major, minor, err := deviceUtils.GetMajorMinorFromDevPath(devPath)
if err != nil {
return err
}
// If the sandbox has cold-plugged an IOMMUFD device and if the
// device-plugins sends us a /dev/vfio/<NUM> device we need to
// check if the IOMMUFD device and the VFIO device are the same
// We have the sibling.BDF we now need to extract the BDF of the
// devPath that is either /dev/vfio/<NUM> or
// /dev/vfio/devices/vfio<NUM>
if strings.HasPrefix(filepath.Base(devPath), "vfio") {
// IOMMUFD device format (/dev/vfio/devices/vfio<NUM>), extract BDF from sysfs
major, minor, err := deviceUtils.GetMajorMinorFromDevPath(devPath)
if err != nil {
return err
}
iommufdBDF, err := deviceUtils.GetBDFFromVFIODev(major, minor)
if err != nil {
return err
}
if sibling.BDF == iommufdBDF {
c.createCDIAnnotation(devPath, sibling.Index)
// exit handling IOMMUFD device
return nil
}
bdf, err := deviceUtils.GetBDFFromVFIODev(major, minor)
if err != nil {
return err
}
// Legacy VFIO group device (/dev/vfio/<GROUP_NUM>), extract BDF from sysfs
devBDFs = []string{bdf}
vendorID := deviceUtils.GetPCIDeviceProperty(bdf, deviceUtils.PCISysFsDevicesVendor)
class := deviceUtils.GetPCIDeviceProperty(bdf, deviceUtils.PCISysFsDevicesClass)
_, isKnownCDIDevice = cdiKindForDevice(vendorID, class)
} else {
// Legacy VFIO group (/dev/vfio/<GROUP>): may contain multiple devices
vfioGroup := filepath.Base(devPath)
iommuDevicesPath := filepath.Join(config.SysIOMMUGroupPath, vfioGroup, "devices")
deviceFiles, err := os.ReadDir(iommuDevicesPath)
if err != nil {
return err
}
vfioBDFs := make([]string, 0)
for _, deviceFile := range deviceFiles {
// Get bdf of device eg 0000:00:1c.0
deviceBDF, _, _, err := deviceUtils.GetVFIODetails(deviceFile.Name(), iommuDevicesPath)
if err != nil {
return err
}
vfioBDFs = append(vfioBDFs, deviceBDF)
devBDFs = append(devBDFs, deviceBDF)
if !isKnownCDIDevice {
vendorID := deviceUtils.GetPCIDeviceProperty(deviceBDF, deviceUtils.PCISysFsDevicesVendor)
class := deviceUtils.GetPCIDeviceProperty(deviceBDF, deviceUtils.PCISysFsDevicesClass)
if _, ok := cdiKindForDevice(vendorID, class); ok {
isKnownCDIDevice = true
}
}
}
if slices.Contains(vfioBDFs, sibling.BDF) {
c.createCDIAnnotation(devPath, sibling.Index)
// exit handling legacy VFIO device
}
if !isKnownCDIDevice {
return fmt.Errorf("device %s: %w", devPath, errNoSiblingFound)
}
for _, sibling := range siblings {
if sibling.Path == devPath || slices.Contains(devBDFs, sibling.BDF) {
c.createCDIAnnotation(devPath, sibling.Index, sibling.CDIKind)
return nil
}
}
return fmt.Errorf("failed to match device %s with any cold-plugged GPU device by path or BDF; no suitable sibling found", devPath)
return fmt.Errorf("device %s is a known CDI device type but failed to match any sibling by path or BDF", devPath)
}
// create creates and starts a container inside a Sandbox. It has to be
@@ -1382,7 +1419,9 @@ func (c *Container) create(ctx context.Context) (err error) {
return
}
c.annotateContainerWithVFIOMetadata(c.devices)
if err := c.annotateContainerWithVFIOMetadata(c.devices); err != nil {
return fmt.Errorf("annotating VFIO devices: %w", err)
}
// Deduce additional system mount info that should be handled by the agent
// inside the VM
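
To make the annotation format from `createCDIAnnotation` concrete, here is a small standalone sketch of the key and value construction for both device path forms. The function `cdiAnnotation` is a hypothetical, side-effect-free variant written for illustration: the method in the diff stores the result in `c.config.CustomSpec.Annotations` and guards the prefix trimming with a condition that is elided in the hunk, whereas this sketch trims unconditionally (which is a no-op when the prefix is absent).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// cdiAnnotation builds the annotation key and value for a VFIO device path.
// devPath is either /dev/vfio/<num> or /dev/vfio/devices/vfio<num>.
func cdiAnnotation(devPath string, index int, cdiKind string) (key, value string) {
	base := filepath.Base(devPath)
	// For the IOMMUFD form the base name is "vfio<num>"; strip the prefix so
	// both forms yield the bare device number.
	num := strings.TrimPrefix(base, "vfio")
	key = fmt.Sprintf("cdi.k8s.io/vfio%s", num)
	value = fmt.Sprintf("%s=%d", cdiKind, index)
	return key, value
}

func main() {
	for _, p := range []string{"/dev/vfio/17", "/dev/vfio/devices/vfio3"} {
		k, v := cdiAnnotation(p, 0, "nvidia.com/gpu")
		fmt.Printf("%s -> %s: %s\n", p, k, v)
	}
}
```
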

View File

@@ -841,7 +841,6 @@ func (q *qemu) createPCIeTopology(qemuConfig *govmmQemu.Config, hypervisorConfig
// /dev/vfio/devices/vfio0
// (1) Check if we have the new IOMMUFD or old container based VFIO
if strings.HasPrefix(dev.HostPath, pkgDevice.IommufdDevPath) {
q.Logger().Infof("### IOMMUFD Path: %s", dev.HostPath)
vfioDevices, err = drivers.GetDeviceFromVFIODev(dev)
if err != nil {
return fmt.Errorf("Cannot get VFIO device from IOMMUFD with device: %v err: %v", dev, err)

View File

@@ -80,7 +80,7 @@ version = "0.7.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5a824f2aa7e75a0c98c5a504fceb80649e9c35265d44525b5f94de4771a395cd"
dependencies = [
"getrandom",
"getrandom 0.2.15",
"once_cell",
"version_check",
]
@@ -414,7 +414,7 @@ dependencies = [
"bitflags 2.6.0",
"cexpr",
"clang-sys",
"itertools 0.10.5",
"itertools 0.11.0",
"log",
"prettyplease",
"proc-macro2",
@@ -974,7 +974,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0dc92fb57ca44df6db8059111ab3af99a63d5d0f8375d9972e319a379c6bab76"
dependencies = [
"generic-array",
"rand_core",
"rand_core 0.6.4",
"subtle",
"zeroize",
]
@@ -986,7 +986,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1bfb12502f3fc46cca1bb51ac28df9d618d813cdc3d2f25b9fe775a34af26bb3"
dependencies = [
"generic-array",
"rand_core",
"rand_core 0.6.4",
"typenum",
]
@@ -1375,7 +1375,7 @@ checksum = "4a3daa8e81a3963a60642bcc1f90a670680bd4a77535faa384e9d1c79d620871"
dependencies = [
"curve25519-dalek",
"ed25519",
"rand_core",
"rand_core 0.6.4",
"serde",
"sha2 0.10.9",
"subtle",
@@ -1403,7 +1403,7 @@ dependencies = [
"hkdf",
"pem-rfc7468",
"pkcs8",
"rand_core",
"rand_core 0.6.4",
"sec1",
"subtle",
"zeroize",
@@ -1485,7 +1485,7 @@ checksum = "fe5e43d0f78a42ad591453aedb1d7ae631ce7ee445c7643691055a9ed8d3b01c"
dependencies = [
"log",
"once_cell",
"rand",
"rand 0.8.5",
]
[[package]]
@@ -1503,7 +1503,7 @@ version = "0.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ded41244b729663b1e574f1b4fb731469f69f79c17667b5d776b16cda0479449"
dependencies = [
"rand_core",
"rand_core 0.6.4",
"subtle",
]
@@ -1705,6 +1705,20 @@ dependencies = [
"wasm-bindgen",
]
[[package]]
name = "getrandom"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "899def5c37c4fd7b2664648c28120ecec138e4d395b459e5ca34f9cce2dd77fd"
dependencies = [
"cfg-if 1.0.4",
"js-sys",
"libc",
"r-efi",
"wasip2",
"wasm-bindgen",
]
[[package]]
name = "getset"
version = "0.1.6"
@@ -1755,7 +1769,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0f9ef7462f7c099f518d754361858f86d8a07af53ba9af0fe635bbccb151a63"
dependencies = [
"ff",
"rand_core",
"rand_core 0.6.4",
"subtle",
]
@@ -2091,7 +2105,7 @@ dependencies = [
"qapi",
"qapi-qmp",
"qapi-spec",
"rand",
"rand 0.8.5",
"rust-ini",
"safe-path 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
"seccompiler",
@@ -2418,10 +2432,11 @@ dependencies = [
[[package]]
name = "js-sys"
version = "0.3.70"
version = "0.3.91"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1868808506b929d7b0cfa8f75951347aa71bb21144b7791bae35d9bccfcfe37a"
checksum = "b49715b7073f385ba4bc528e5747d02e66cb39c6146efb66b781f131f0fb399c"
dependencies = [
"once_cell",
"wasm-bindgen",
]
@@ -2460,7 +2475,7 @@ dependencies = [
"oci-spec",
"protobuf",
"protocols",
"rand",
"rand 0.8.5",
"safe-path 0.1.0",
"serde",
"serde_json",
@@ -2485,7 +2500,7 @@ dependencies = [
"nix 0.26.4",
"oci-spec",
"pci-ids",
"rand",
"rand 0.8.5",
"runtime-spec",
"serde",
"serde_json",
@@ -2590,7 +2605,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4979f22fdb869068da03c9f7528f8297c6fd2606bc3a4affe42e6a823fdb8da4"
dependencies = [
"cfg-if 1.0.4",
"windows-targets 0.48.0",
"windows-targets 0.52.6",
]
[[package]]
@@ -2686,6 +2701,12 @@ dependencies = [
"libc",
]
[[package]]
name = "lru-slab"
version = "0.1.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "112b39cec0b298b6c1999fee3e31427f74f676e4cb9879ed1a121b43661a4154"
[[package]]
name = "matchit"
version = "0.8.4"
@@ -2895,7 +2916,7 @@ dependencies = [
"num-integer",
"num-iter",
"num-traits",
"rand",
"rand 0.8.5",
"smallvec",
"zeroize",
]
@@ -2954,9 +2975,9 @@ checksum = "51e219e79014df21a225b1860a479e2dcd7cbd9130f4defd4bd0e191ea31d67d"
dependencies = [
"base64 0.22.1",
"chrono",
"getrandom",
"getrandom 0.2.15",
"http 1.1.0",
"rand",
"rand 0.8.5",
"reqwest",
"serde",
"serde_json",
@@ -3100,7 +3121,7 @@ dependencies = [
"oauth2",
"p256",
"p384",
"rand",
"rand 0.8.5",
"rsa",
"serde",
"serde-value",
@@ -3173,7 +3194,7 @@ dependencies = [
"ecdsa",
"elliptic-curve",
"primeorder",
"rand_core",
"rand_core 0.6.4",
"sha2 0.10.9",
]
@@ -3207,7 +3228,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "346f04948ba92c43e8469c1ee6736c7563d71012b17d40745260fe106aac2166"
dependencies = [
"base64ct",
"rand_core",
"rand_core 0.6.4",
"subtle",
]
@@ -3330,7 +3351,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3c80231409c20246a13fddb31776fb942c38553c51e871f8cbd687a4cfb5843d"
dependencies = [
"phf_shared",
"rand",
"rand 0.8.5",
]
[[package]]
@@ -3408,7 +3429,7 @@ checksum = "f950b2377845cebe5cf8b5165cb3cc1a5e0fa5cfa3e1f7f55707d8fd82e0a7b7"
dependencies = [
"der",
"pkcs5",
"rand_core",
"rand_core 0.6.4",
"spki",
]
@@ -3617,7 +3638,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "27c6023962132f4b30eb4c172c91ce92d933da334c59c23cddee82358ddafb0b"
dependencies = [
"anyhow",
"itertools 0.10.5",
"itertools 0.11.0",
"proc-macro2",
"quote",
"syn 2.0.87",
@@ -3806,19 +3827,23 @@ dependencies = [
[[package]]
name = "quinn-proto"
version = "0.11.8"
version = "0.11.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fadfaed2cd7f389d0161bb73eeb07b7b78f8691047a6f3e73caaeae55310a4a6"
checksum = "434b42fec591c96ef50e21e886936e66d3cc3f737104fdb9b737c40ffb94c098"
dependencies = [
"bytes",
"rand",
"getrandom 0.3.4",
"lru-slab",
"rand 0.9.2",
"ring",
"rustc-hash 2.1.1",
"rustls",
"rustls-pki-types",
"slab",
"thiserror 1.0.40",
"thiserror 2.0.12",
"tinyvec",
"tracing",
"web-time",
]
[[package]]
@@ -3843,6 +3868,12 @@ dependencies = [
"proc-macro2",
]
[[package]]
name = "r-efi"
version = "5.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f"
[[package]]
name = "radium"
version = "0.7.0"
@@ -3856,8 +3887,18 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404"
dependencies = [
"libc",
"rand_chacha",
"rand_core",
"rand_chacha 0.3.1",
"rand_core 0.6.4",
]
[[package]]
name = "rand"
version = "0.9.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6db2770f06117d490610c7488547d543617b21bfa07796d7a12f6f1bd53850d1"
dependencies = [
"rand_chacha 0.9.0",
"rand_core 0.9.5",
]
[[package]]
@@ -3867,7 +3908,17 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88"
dependencies = [
"ppv-lite86",
"rand_core",
"rand_core 0.6.4",
]
[[package]]
name = "rand_chacha"
version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb"
dependencies = [
"ppv-lite86",
"rand_core 0.9.5",
]
[[package]]
@@ -3876,7 +3927,16 @@ version = "0.6.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c"
dependencies = [
"getrandom",
"getrandom 0.2.15",
]
[[package]]
name = "rand_core"
version = "0.9.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "76afc826de14238e6e8c374ddcc1fa19e374fd8dd986b0d2af0d02377261d83c"
dependencies = [
"getrandom 0.3.4",
]
[[package]]
@@ -3912,7 +3972,7 @@ version = "0.4.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b033d837a7cf162d7993aded9304e30a83213c648b6e389db233191f891e5c2b"
dependencies = [
"getrandom",
"getrandom 0.2.15",
"redox_syscall 0.2.16",
"thiserror 1.0.40",
]
@@ -4040,7 +4100,7 @@ checksum = "a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7"
dependencies = [
"cc",
"cfg-if 1.0.4",
"getrandom",
"getrandom 0.2.15",
"libc",
"untrusted 0.9.0",
"windows-sys 0.52.0",
@@ -4097,7 +4157,7 @@ dependencies = [
"num-traits",
"pkcs1",
"pkcs8",
"rand_core",
"rand_core 0.6.4",
"signature",
"spki",
"subtle",
@@ -4133,7 +4193,7 @@ dependencies = [
"borsh",
"bytes",
"num-traits",
"rand",
"rand 0.8.5",
"rkyv",
"serde",
"serde_json",
@@ -4234,6 +4294,7 @@ version = "1.12.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "229a4a4c221013e7e1f1a043678c5cc39fe5171437c88fb47151a21e6f5b5c79"
dependencies = [
"web-time",
"zeroize",
]
@@ -4423,7 +4484,7 @@ dependencies = [
"ed25519",
"ed25519-dalek",
"flate2",
"getrandom",
"getrandom 0.2.15",
"hkdf",
"idea",
"idna",
@@ -4437,8 +4498,8 @@ dependencies = [
"p256",
"p384",
"p521",
"rand",
"rand_core",
"rand 0.8.5",
"rand_core 0.6.4",
"regex",
"regex-syntax",
"ripemd",
@@ -4697,7 +4758,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "77549399552de45a898a580c1b41d445bf730df867cc44e6c0233bbc4b8329de"
dependencies = [
"digest 0.10.7",
"rand_core",
"rand_core 0.6.4",
]
[[package]]
@@ -4726,7 +4787,7 @@ dependencies = [
"pem",
"pkcs1",
"pkcs8",
"rand",
"rand 0.8.5",
"regex",
"reqwest",
"rsa",
@@ -5056,7 +5117,7 @@ version = "0.1.0"
dependencies = [
"anyhow",
"kata-types",
"rand",
"rand 0.8.5",
]
[[package]]
@@ -5702,28 +5763,24 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9c8d87e72b64a3b4db28d11ce29237c246188f4f51057d65a7eab63b7987e423"
[[package]]
name = "wasm-bindgen"
version = "0.2.93"
name = "wasip2"
version = "1.0.2+wasi-0.2.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a82edfc16a6c469f5f44dc7b571814045d60404b55a0ee849f9bcfa2e63dd9b5"
checksum = "9517f9239f02c069db75e65f174b3da828fe5f5b945c4dd26bd25d89c03ebcf5"
dependencies = [
"cfg-if 1.0.4",
"once_cell",
"wasm-bindgen-macro",
"wit-bindgen",
]
[[package]]
name = "wasm-bindgen-backend"
version = "0.2.93"
name = "wasm-bindgen"
version = "0.2.114"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9de396da306523044d3302746f1208fa71d7532227f15e347e2d93e4145dd77b"
checksum = "6532f9a5c1ece3798cb1c2cfdba640b9b3ba884f5db45973a6f442510a87d38e"
dependencies = [
"bumpalo",
"log",
"cfg-if 1.0.4",
"once_cell",
"proc-macro2",
"quote",
"syn 2.0.87",
"rustversion",
"wasm-bindgen-macro",
"wasm-bindgen-shared",
]
@@ -5741,9 +5798,9 @@ dependencies = [
[[package]]
name = "wasm-bindgen-macro"
version = "0.2.93"
version = "0.2.114"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "585c4c91a46b072c92e908d99cb1dcdf95c5218eeb6f3bf1efa991ee7a68cccf"
checksum = "18a2d50fcf105fb33bb15f00e7a77b772945a2ee45dcf454961fd843e74c18e6"
dependencies = [
"quote",
"wasm-bindgen-macro-support",
@@ -5751,22 +5808,25 @@ dependencies = [
[[package]]
name = "wasm-bindgen-macro-support"
version = "0.2.93"
version = "0.2.114"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "afc340c74d9005395cf9dd098506f7f44e38f2b4a21c6aaacf9a105ea5e1e836"
checksum = "03ce4caeaac547cdf713d280eda22a730824dd11e6b8c3ca9e42247b25c631e3"
dependencies = [
"bumpalo",
"proc-macro2",
"quote",
"syn 2.0.87",
"wasm-bindgen-backend",
"wasm-bindgen-shared",
]
[[package]]
name = "wasm-bindgen-shared"
version = "0.2.93"
version = "0.2.114"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c62a0a307cb4a311d3a07867860911ca130c3494e8c2719593806c08bc5d0484"
checksum = "75a326b8c223ee17883a4251907455a2431acc2791c98c26279376490c378c16"
dependencies = [
"unicode-ident",
]
[[package]]
name = "wasm-streams"
@@ -5791,6 +5851,16 @@ dependencies = [
"wasm-bindgen",
]
[[package]]
name = "web-time"
version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb"
dependencies = [
"js-sys",
"wasm-bindgen",
]
[[package]]
name = "webpki-roots"
version = "0.26.6"
@@ -6163,6 +6233,12 @@ dependencies = [
"memchr",
]
[[package]]
name = "wit-bindgen"
version = "0.51.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d7249219f66ced02969388cf2bb044a09756a083d0fab1e566056b04d9fbcaa5"
[[package]]
name = "writeable"
version = "0.6.1"
@@ -6185,7 +6261,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c7e468321c81fb07fa7f4c636c3972b9100f0346e5b6a9f2bd0603a52f7ed277"
dependencies = [
"curve25519-dalek",
"rand_core",
"rand_core 0.6.4",
"zeroize",
]

View File

@@ -113,27 +113,6 @@ setup_langchain_flow() {
[[ "$(pip show beautifulsoup4 2>/dev/null | awk '/^Version:/{print $2}')" = "4.13.4" ]] || pip install beautifulsoup4==4.13.4
}
# Create Docker config for genpolicy so it can authenticate to nvcr.io when
# pulling image manifests (avoids "UnauthorizedError" from genpolicy's registry pull).
# Genpolicy (src/tools/genpolicy) uses docker_credential::get_credential() in
# src/tools/genpolicy/src/registry.rs build_auth(). The docker_credential crate
# reads config from DOCKER_CONFIG (directory) + "/config.json", so we set
# DOCKER_CONFIG to a directory containing config.json with nvcr.io auth.
setup_genpolicy_registry_auth() {
if [[ -z "${NGC_API_KEY:-}" ]]; then
return
fi
local auth_dir
auth_dir="${BATS_SUITE_TMPDIR}/.docker-genpolicy"
mkdir -p "${auth_dir}"
# Docker config format: auths -> registry -> auth (base64 of "user:password")
echo -n "{\"auths\":{\"nvcr.io\":{\"username\":\"\$oauthtoken\",\"password\":\"${NGC_API_KEY}\",\"auth\":\"$(echo -n "\$oauthtoken:${NGC_API_KEY}" | base64 -w0)\"}}}" \
> "${auth_dir}/config.json"
export DOCKER_CONFIG="${auth_dir}"
# REGISTRY_AUTH_FILE (containers-auth.json format) is the same structure for auths
export REGISTRY_AUTH_FILE="${auth_dir}/config.json"
}
# Create initdata TOML file for genpolicy with CDH configuration.
# This file is used by genpolicy via --initdata-path. Genpolicy will add the
# generated policy.rego to it and set it as the cc_init_data annotation.
@@ -243,9 +222,6 @@ setup_file() {
add_requests_to_policy_settings "${policy_settings_dir}" "ReadStreamRequest"
if [ "${TEE}" = "true" ]; then
# So genpolicy can pull nvcr.io image manifests when generating policy (avoids UnauthorizedError).
setup_genpolicy_registry_auth
setup_kbs_credentials
# Overwrite the empty default-initdata.toml with our CDH configuration.
# This must happen AFTER create_tmp_policy_settings_dir() copies the empty

View File

@@ -51,6 +51,27 @@ kernel_params = "${new_params}"
EOF
}
# Create Docker config for genpolicy so it can authenticate to nvcr.io when
# pulling image manifests (avoids "UnauthorizedError" from genpolicy's registry pull).
# Genpolicy (src/tools/genpolicy) uses docker_credential::get_credential() in
# src/tools/genpolicy/src/registry.rs build_auth(). The docker_credential crate
# reads config from DOCKER_CONFIG (directory) + "/config.json", so we set
# DOCKER_CONFIG to a directory containing config.json with nvcr.io auth.
setup_genpolicy_registry_auth() {
if [[ -z "${NGC_API_KEY:-}" ]]; then
return
fi
local auth_dir
auth_dir="${kubernetes_dir}/.docker-genpolicy"
mkdir -p "${auth_dir}"
# Docker config format: auths -> registry -> auth (base64 of "user:password")
echo -n "{\"auths\":{\"nvcr.io\":{\"username\":\"\$oauthtoken\",\"password\":\"${NGC_API_KEY}\",\"auth\":\"$(echo -n "\$oauthtoken:${NGC_API_KEY}" | base64 -w0)\"}}}" \
> "${auth_dir}/config.json"
export DOCKER_CONFIG="${auth_dir}"
# REGISTRY_AUTH_FILE (containers-auth.json format) is the same structure for auths
export REGISTRY_AUTH_FILE="${auth_dir}/config.json"
}
cleanup() {
true
}
@@ -84,6 +105,9 @@ if [[ "${ENABLE_NVRC_TRACE:-true}" == "true" ]]; then
enable_nvrc_trace
fi
# So genpolicy can pull nvcr.io image manifests when generating policy (avoids UnauthorizedError).
setup_genpolicy_registry_auth
# Use common bats test runner with proper reporting
export BATS_TEST_FAIL_FAST="${K8S_TEST_FAIL_FAST}"
run_bats_tests "${kubernetes_dir}" K8S_TEST_NV
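
Note on the hunk above: setup_genpolicy_registry_auth builds a Docker-style config.json whose "auth" field is the base64 encoding of "user:password". The sketch below isolates that step so it can be run without a cluster. It is not code from this PR. The helper name write_registry_auth, its arguments, and the example key are hypothetical, and the sketch assumes the secret contains no characters that need JSON escaping.

```shell
#!/usr/bin/env bash
# Sketch only: write_registry_auth and its arguments are hypothetical names,
# not part of the PR. It mirrors the JSON layout used by
# setup_genpolicy_registry_auth: auths -> registry -> username/password/auth.

write_registry_auth() {
	local auth_dir="$1" registry="$2" user="$3" secret="$4"
	local token

	# "auth" is base64("user:password"). tr removes any line wrapping, which
	# avoids depending on the GNU-only "base64 -w0" flag used in the PR.
	token="$(printf '%s:%s' "${user}" "${secret}" | base64 | tr -d '\n')"

	mkdir -p "${auth_dir}"
	# Assumption: user and secret contain no quotes or backslashes, so they
	# can be embedded in the JSON string without escaping.
	printf '{"auths":{"%s":{"username":"%s","password":"%s","auth":"%s"}}}' \
		"${registry}" "${user}" "${secret}" "${token}" \
		> "${auth_dir}/config.json"

	# Not in the PR: restrict permissions, since the file holds a credential.
	chmod 600 "${auth_dir}/config.json"
}

# Usage with a throwaway directory and a dummy key.
demo_dir="$(mktemp -d)"
write_registry_auth "${demo_dir}" "nvcr.io" '$oauthtoken' "example-key"
cat "${demo_dir}/config.json"
```

Pointing DOCKER_CONFIG at the directory (and REGISTRY_AUTH_FILE at the file), as the PR does, is what makes the generated file visible to genpolicy. The chmod is an addition of this sketch and may be unnecessary in a short-lived CI workspace.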

View File

@@ -61,12 +61,12 @@ install_userspace_components() {
eval "${APT_INSTALL}" nvidia-imex nvidia-firmware \
libnvidia-cfg1 libnvidia-gl libnvidia-extra \
libnvidia-decode libnvidia-fbc1 libnvidia-encode \
libnvidia-nscq
libnvidia-nscq libnvidia-compute nvidia-settings
apt-mark hold nvidia-imex nvidia-firmware \
libnvidia-cfg1 libnvidia-gl libnvidia-extra \
libnvidia-decode libnvidia-fbc1 libnvidia-encode \
libnvidia-nscq
libnvidia-nscq libnvidia-compute nvidia-settings
}
setup_apt_repositories() {

View File

@@ -151,14 +151,8 @@ chisseled_nvswitch() {
cp -a "${stage_one}"/usr/share/nvidia/nvswitch usr/share/nvidia/.
libdir=usr/lib/"${machine_arch}"-linux-gnu
cp -a "${stage_one}/${libdir}"/libnvidia-nscq.so.* lib/"${machine_arch}"-linux-gnu/.
# Logs will be redirected to console(stderr)
# if the specified log file can't be opened or the path is empty.
# LOG_FILE_NAME=/var/log/fabricmanager.log -> setting to empty for stderr -> kmsg
sed -i 's|^LOG_FILE_NAME=.*|LOG_FILE_NAME=|' usr/share/nvidia/nvswitch/fabricmanager.cfg
# NVLINK SubnetManager dependencies
local nvlsm=usr/share/nvidia/nvlsm
mkdir -p "${nvlsm}"
@@ -166,6 +160,8 @@ chisseled_nvswitch() {
cp -a "${stage_one}"/opt/nvidia/nvlsm/lib/libgrpc_mgr.so lib/.
cp -a "${stage_one}"/opt/nvidia/nvlsm/sbin/nvlsm sbin/.
cp -a "${stage_one}/${nvlsm}"/*.conf "${nvlsm}"/.
# Redirect all the logs to syslog instead of logging to file
sed -i 's|^LOG_USE_SYSLOG=.*|LOG_USE_SYSLOG=1|' usr/share/nvidia/nvswitch/fabricmanager.cfg
}
chisseled_dcgm() {
@@ -211,9 +207,8 @@ chisseled_compute() {
cp -aL "${stage_one}/${libdir}"/ld-linux-* "${libdir}"/.
libdir=usr/lib/"${machine_arch}"-linux-gnu
cp -a "${stage_one}/${libdir}"/libnvidia-ml.so.* lib/"${machine_arch}"-linux-gnu/.
cp -a "${stage_one}/${libdir}"/libnv* lib/"${machine_arch}"-linux-gnu/.
cp -a "${stage_one}/${libdir}"/libcuda.so.* lib/"${machine_arch}"-linux-gnu/.
cp -a "${stage_one}/${libdir}"/libnvidia-cfg.so.* lib/"${machine_arch}"-linux-gnu/.
# basic GPU admin tools
cp -a "${stage_one}"/usr/bin/nvidia-persistenced bin/.
@@ -245,6 +240,8 @@ chisseled_init() {
usr/bin etc/modprobe.d etc/ssl/certs
ln -sf ../run var/run
ln -sf ../run var/log
ln -sf ../run var/cache
# Needed for various RUST static builds with LIBC=gnu
libdir=lib/"${machine_arch}"-linux-gnu

View File

@@ -96,9 +96,9 @@ scheduling:
"qemu-snp-runtime-rs" (dict "memory" "2048Mi" "cpu" "1.0")
"qemu-tdx" (dict "memory" "2048Mi" "cpu" "1.0")
"qemu-tdx-runtime-rs" (dict "memory" "2048Mi" "cpu" "1.0")
"qemu-nvidia-gpu" (dict "memory" "4096Mi" "cpu" "1.0")
"qemu-nvidia-gpu-snp" (dict "memory" "20480Mi" "cpu" "1.0")
"qemu-nvidia-gpu-tdx" (dict "memory" "20480Mi" "cpu" "1.0")
"qemu-nvidia-gpu" (dict "memory" "10240Mi" "cpu" "1.0")
"qemu-nvidia-gpu-snp" (dict "memory" "10240Mi" "cpu" "1.0")
"qemu-nvidia-gpu-tdx" (dict "memory" "10240Mi" "cpu" "1.0")
"qemu-cca" (dict "memory" "2048Mi" "cpu" "1.0")
"stratovirt" (dict "memory" "130Mi" "cpu" "250m")
"remote" (dict "memory" "120Mi" "cpu" "250m")

View File

@@ -362,7 +362,7 @@ get_latest_kernel_nvidia_artefact_and_builder_image_version() {
}
get_latest_ctk_version() {
echo $(get_from_kata_deps ".assets.kernel.nvidia.ctk.version")
echo $(get_from_kata_deps ".externals.nvidia.ctk.version")
}
#Install guest image

View File

@@ -234,7 +234,7 @@ externals:
nvrc:
# yamllint disable-line rule:line-length
desc: "The NVRC project provides a Rust binary that implements a simple init system for microVMs"
version: "v0.1.1"
version: "v0.1.3"
url: "https://github.com/NVIDIA/nvrc/releases/download/"
nvidia:
@@ -294,12 +294,12 @@ externals:
coco-trustee:
description: "Provides attestation and secret delivery components"
url: "https://github.com/confidential-containers/trustee"
version: "3b2356a52e0d8a58730a1977e235a7e7f2007b5e"
version: "f5cb8fc1b51b652fc24e2d6b8742cf417805352e"
# image / ita_image and image_tag / ita_image_tag must be in sync
image: "ghcr.io/confidential-containers/staged-images/kbs"
image_tag: "3b2356a52e0d8a58730a1977e235a7e7f2007b5e"
image_tag: "f5cb8fc1b51b652fc24e2d6b8742cf417805352e"
ita_image: "ghcr.io/confidential-containers/staged-images/kbs-ita-as"
ita_image_tag: "3b2356a52e0d8a58730a1977e235a7e7f2007b5e-x86_64"
ita_image_tag: "f5cb8fc1b51b652fc24e2d6b8742cf417805352e-x86_64"
toolchain: "1.90.0"
containerd: