Commit Graph

1986 Commits

Author SHA1 Message Date
Zvonko Kaiser
451dcb289a kernel: bump kata_config_version
We have kernel build changes bump the config version

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-01-14 14:55:20 +01:00
Zvonko Kaiser
34cde2637d gpu: build_image.sh use versions.yaml
We've done some bad file based driver determination,
now with versions.yaml there is a single source of truth.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-01-14 14:55:20 +01:00
Zvonko Kaiser
664a3af02b gpu: nvidia_chroot.sh update decoupling
Remove all the driver build instructions,
sicne those are now done in the kernel target.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-01-14 14:55:20 +01:00
Zvonko Kaiser
e9bb43ef01 gpu: deploy modules for kernel build
We need to package the build modules for the rootfs
to be able to consume it. We package the whole
/lib/modules/$(uname -r)  directory strip=2.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-01-14 14:55:20 +01:00
Zvonko Kaiser
d4962bafac gpu: Add NVIDA modules to build-kernel.sh
Checkout and build the kernel modules along
with the kernel to avoid the kernel rootfs dependency.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-01-14 14:30:31 +01:00
Zvonko Kaiser
c42f7501fd gpu: Remove building of Headers
Since we build along the kernel we do not need to
carry over the headers to the rootfs build.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-01-14 14:30:31 +01:00
Zvonko Kaiser
1f6cfb11b0 kernel: bugfix install yq
We actually never installed yq to the kernel build,
there are  some path that use yq but were never hit,
for the GPU use-case we need to read values from versions.yaml

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-01-14 14:30:31 +01:00
Fabiano Fidêncio
2acb94ef2d arm64: Do not use DAX with the rootfs image
Kernel 6.18.x has an issue with DAX, which is not yet fixed upstream:
```
[    0.737679] EXT4-fs (pmem0p1): mounted filesystem 79676804-7c8b-491a-b2a6-9bae3c72af70 ro with ordered data mode. Quota mode: disabled.
[    0.737891] VFS: Mounted root (ext4 filesystem) readonly on device 259:1.
[    0.739119] devtmpfs: mounted
[    0.739476] Freeing unused kernel memory: 1920K
[    0.740156] Run /sbin/init as init process
[    0.740229]   with arguments:
[    0.740286]     /sbin/init
[    0.740321]   with environment:
[    0.740369]     HOME=/
[    0.740400]     TERM=linux
[    0.743162] Unable to handle kernel paging request at virtual address fffffdffbf000008
[    0.743285] Mem abort info:
[    0.743316]   ESR = 0x0000000096000006
[    0.743371]   EC = 0x25: DABT (current EL), IL = 32 bits
[    0.743444]   SET = 0, FnV = 0
[    0.743489]   EA = 0, S1PTW = 0
[    0.743545]   FSC = 0x06: level 2 translation fault
[    0.743610] Data abort info:
[    0.743656]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[    0.743720]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    0.743785]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    0.743848] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000b9d17000
[    0.743931] [fffffdffbf000008] pgd=10000000bfa3d403, p4d=10000000bfa3d403, pud=1000000040bfe403, pmd=0000000000000000
[    0.744070] Internal error: Oops: 0000000096000006 [#1]  SMP
[    0.748888] CPU: 0 UID: 0 PID: 1 Comm: init Not tainted 6.18.4 #1 NONE
[    0.749421] pstate: 004000c5 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    0.749969] pc : dax_disassociate_entry.constprop.0+0x20/0x50
[    0.750444] lr : dax_insert_entry+0xcc/0x408
[    0.750802] sp : ffff80008000b9e0
[    0.751083] x29: ffff80008000b9e0 x28: 0000000000000000 x27: 0000000000000000
[    0.751682] x26: 0000000001963d01 x25: ffff0000004f7d90 x24: 0000000000000000
[    0.752264] x23: 0000000000000000 x22: ffff80008000bcc8 x21: 0000000000000011
[    0.752836] x20: ffff80008000ba90 x19: 0000000001963d01 x18: 0000000000000000
[    0.753407] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[    0.753970] x14: ffffbf3154b9ae70 x13: 0000000000000000 x12: ffffbf3154b9ae70
[    0.754548] x11: ffffffffffffffff x10: 0000000000000000 x9 : 0000000000000000
[    0.755122] x8 : 000000000000000d x7 : 000000000000001f x6 : 0000000000000000
[    0.755707] x5 : 0000000000000000 x4 : 0000000000000000 x3 : fffffdffc0000000
[    0.756287] x2 : 0000000000000008 x1 : 0000000040000000 x0 : fffffdffbf000000
[    0.756871] Call trace:
[    0.757107]  dax_disassociate_entry.constprop.0+0x20/0x50 (P)
[    0.757592]  dax_iomap_pte_fault+0x4fc/0x808
[    0.757951]  dax_iomap_fault+0x28/0x30
[    0.758258]  ext4_dax_huge_fault+0x80/0x2dc
[    0.758594]  ext4_dax_fault+0x10/0x3c
[    0.758892]  __do_fault+0x38/0x12c
[    0.759175]  __handle_mm_fault+0x530/0xcf0
[    0.759518]  handle_mm_fault+0xe4/0x230
[    0.759833]  do_page_fault+0x17c/0x4dc
[    0.760144]  do_translation_fault+0x30/0x38
[    0.760483]  do_mem_abort+0x40/0x8c
[    0.760771]  el0_ia+0x4c/0x170
[    0.761032]  el0t_64_sync_handler+0xd8/0xdc
[    0.761371]  el0t_64_sync+0x168/0x16c
[    0.761677] Code: f9453021 f2dfbfe3 cb813080 8b001860 (f9400401)
[    0.762168] ---[ end trace 0000000000000000 ]---
[    0.762550] note: init[1] exited with irqs disabled
[    0.762631] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
```

For now, we limit the rootfs that we ship to ARM64 to not use DAX, in
the future we'll re-enable it as soon as the patch lands on mainstream
kernel.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-14 11:46:40 +01:00
Fabiano Fidêncio
3ef99f4ee3 versions: Add specific nvidia kernel version
This is needed as the 580 driver doesn't build against 6.18.x, and the
590 driver is not yet fully working for our case, thus we stick to the
previous version that worked before.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-14 11:46:40 +01:00
Fabiano Fidêncio
cce5d4abf6 kernel: bump to v6.18.x (LTS)
Bump both the kernel and kernel-confidential versions from v6.12.x and
v6.16.x to v6.18.4, aligning with the new LTS release.

Kernel 6.18 introduced several configuration changes that required
updates to our kernel config fragments:

* CRYPTO_FIPS dependencies changed:
  - In 6.12: depended on !CRYPTO_MANAGER_DISABLE_TESTS
  - In 6.18: now depends on CRYPTO_SELFTESTS (which requires EXPERT)
  Added CONFIG_EXPERT=y and CONFIG_CRYPTO_SELFTESTS=y to crypto.conf
  to satisfy the new dependency chain.
  * CONFIG_EXPERT is a naughty one, as it disables / enables a bunch
    of things behind ones back, probably just to prove a point that
    it is for experts ;-) ... regardless, a reasonable amount of
    options had to be re-added in order to make sure anything ends
    up broken.

* Legacy iptables support:
  Kernel 6.18 requires explicit legacy xtables/iptables configs for
  IP_NF_* options. Added CONFIG_NETFILTER_XTABLES_LEGACY,
  CONFIG_IP_NF_IPTABLES_LEGACY, and CONFIG_IP6_NF_IPTABLES_LEGACY
  to netfilter.conf.

* Module signing dependencies:
  Added CONFIG_MODULES=y and other required dependencies to
  module_signing.conf to ensure MODULE_SIG can be properly enabled.

* Whitelist updates:
  - Added CONFIG_NF_CT_PROTO_DCCP (removed in 6.18+)
  - Added CONFIG_CRYPTO_SELFTESTS, CONFIG_NETFILTER_XTABLES_LEGACY,
    CONFIG_IP_NF_IPTABLES_LEGACY, CONFIG_IP6_NF_IPTABLES_LEGACY
    (added in 6.18+, not present in older kernels like 6.12)

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-14 11:46:40 +01:00
Fabiano Fidêncio
26dfcb627b tools: Build kubectl image
This image will be used by our helm charts to verify that a
kata-containers deployment is correct.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-12 15:48:44 +01:00
stevenhorsman
a0d96256f5 packaging: Fix tools permissions issue
In some builds we are seeing:
```
error: could not create temp file /opt/rustup/tmp/r2xu46kwuyc7k2kr_file: Permission denied (os error 13)
```
in the agent-ctl build, so try and port a fix from #12313 to the tools build
to try and resolve this.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-01-09 21:45:26 +01:00
Federico A. Corazza
787768fe9b kata-deploy: Fix extraction of the containerd major version
Fixes deploying kata-containers using k3s. The deploy script fails with /opt/kata-artifacts/scripts/kata-deploy.sh: line 397: [: too many arguments

Signed-off-by: Federico A. Corazza <git@facorazza.com>
2026-01-09 19:52:18 +01:00
Hyounggyu Choi
2962e14c10 virtiofsd: fix RUSTUP_HOME and CARGO_HOME permissions for non-root builds
The following error was observed during virtiofsd static build:

```
error: could not create temp file /opt/rustup/tmp/p44enysfaxwdbvw4_file:
Permission denied (os error 13)
```

This occurs because RUSTUP_HOME and CARGO_HOME were initialized by the
root user during `docker build`, but `cargo build` is executed as a
non-root user via 'docker run --user'.

Ensure these directories are writable by adjusting the permission after
the toolchain installation is complete.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-01-09 14:01:20 +01:00
Fabiano Fidêncio
f8318c0542 kata-deploy: Remove unused dependency
We're depending solely on toml_edit, thus we can safely remove the toml
dependency.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-08 18:58:11 +01:00
Fupan Li
b3546f3a68 Merge pull request #12282 from kata-containers/set-required-ci
Set several tests as required ci
2026-01-08 20:34:39 +08:00
Mikko Ylinen
e02e226431 packaging: build OVMF for Intel TDX again
OVMF build for Intel TDX (aka "TDVF") was disabled in favor of Ubuntu/
CentOS pre-upstream releases of Intel TDX.

See 4292c4c3b1.

It's time to re-enable the build and move runtime configurations to
use it (the latter will be done in a later commit).

This is a partial revert of 4292c4c3b with the following changes:
- Stop calling OVMF for Intel TDX "TDVF" and follow the naming distros
use for TDX enabled build: OVMF.inteltdx.fd.
- Single binary OVMF.inteltdx.fd is supported using -bios QEMU param.
- Secure Boot infrastructure is disabled since Kata does not support it.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-01-08 10:21:47 +01:00
Alex Lyn
e4451baa84 tests: Set run-nerdctl-tests with qemu-runtime-rs required
run-nerdctl-tests (qemu-runtime-rs)

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-01-08 09:56:50 +08:00
Alex Lyn
56a21c33a3 tests: Set stability tests with qemu-runtime-rs required
run-containerd-stability (active, qemu-runtime-rs)
run-containerd-stability (lts, qemu-runtime-rs)

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-01-08 09:56:50 +08:00
Alex Lyn
679e31d884 tests: Set run-nydus CIs as required
run-basic-amd64-tests / run-nydus

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-01-08 09:56:50 +08:00
Fabiano Fidêncio
c4194538e2 versions: Bump QEMU to v10.2.0
QEMU v10.2.0 was released on December 24th, 2025.

The experimental GPU SNP / TDX are also pointing to v10.2.0 release with
their gpu-{snp,tdx}-20260107 branch.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-07 12:30:55 +01:00
Mikko Ylinen
99bc0f49cc use-cases: drop Intel QuickAssist instructions
While the use-case of Intel QuickAssist (QAT) accelerated crypto
and/or compression with k8s and Kata Containers is still valid,
the setup instructions are outdated:

Starting with Intel Xeon Gen4 (Sapphire Rapids), QAT driver
stack moved to in-tree drivers without a separete SR-IOV VF
driver.

Drop all the setup instructions but keep the use-cases doc
for reference. Users wanting to enable the use-case, should consult
with Intel QAT Device plugins or Intel QAT DRA driver authors.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-01-02 12:14:04 +02:00
Alex Lyn
0b1a5c6e93 tests: Make the tests coco-dev job with coco-dev-runtime-rs required
The nontee job (run-k8s-tests-coco-nontee) for qemu-coco-dev-runtime-rs
is running well and it's time to make it required when the CI runs.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-12-23 09:54:52 +08:00
Fabiano Fidêncio
51e9b7e9d1 nydus-snapshotter: Bump to v0.15.10
As it brings a fix that most likely can workaround the containerd /
nydus-snapshotter databases desynchronization.

Reference: https://github.com/containerd/nydus-snapshotter/pull/700

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-18 18:41:09 +01:00
Fabiano Fidêncio
03297edd3a kata-deploy: rust: Add list verb for runtimeclasses RBAC
The Rust kata-deploy binary calls list_runtimeclasses() during NFD
setup, but the ClusterRole only granted get and patch permissions.

Add the list verb to the runtimeclasses resource permissions to fix
the RBAC error:
  runtimeclasses.node.k8s.io is forbidden: User
  \"system:serviceaccount:kube-system:kata-deploy-sa\" cannot list
  resource \"runtimeclasses\" in API group \"node.k8s.io\" at the
  cluster scope

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-18 18:31:52 +01:00
Fabiano Fidêncio
0e534fa7fe versions: Update virtiofsd to v1.13.3
Update virtiofsd to its latest release.

Here we also need to update the alpine version used by the builder as we
need a version of musl-dev new enough to have wrappers for pread2 and
pwrite2. As bumping, bump to the latest.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-18 00:51:08 +01:00
Fabiano Fidêncio
320f1ce2a3 versions: Bump experimental {tdx,snp} QEMU
Let's bump experimental {tdx,snp} QEMU to the tags created Today in the
Confidential Containers repo, which match with QEMU 10.2.0-rc3.

This bump is mostly for early testing what will become 10.2.0, which
will be bumped everywhere then.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-17 17:42:04 +01:00
Hyounggyu Choi
f1b4327dba Merge pull request #12247 from fidencio/topic/ci-store-the-tarballs-we-rely-on-on-gchr-follow-up
build: Fix GPG key for gperf & Pass PUSH_TO_REGISTRY and GH_TOKEN to Docker builds
2025-12-17 13:53:58 +01:00
Fabiano Fidêncio
98c5276546 helm: runtimeclasses: Match the kata-deploy rust deployment
There we ensure labels are added to better deal with ownership of the
runtimeclasses.  It's not strictly needed here as helm does take care of
the ownership, but also doesn't hurt to follow what seems to be a common
practice.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-17 09:57:02 +01:00
Fabiano Fidêncio
6130d7330f ci: Run a nightly job using the kata-deploy rust
Let's shamelessly duplicate the nightly job to have at least nightly
runs using the rust implementation of kata-deploy.

The reason for doing that is to be pragmatic, as pragmatic as possible,
and avoid switching away of the scripts before 3.24.0 release, while
still testing both ways till the switch happens.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-17 09:57:02 +01:00
Fabiano Fidêncio
fbc29f3f5e kata-deploy: helm: Adapt to the rust binary
Differently than the scripts, which are called as `bash -c ...`, the
kata-deploy rust binary must be invoked directly we do not even have
shell in its container.

For now, the rust version is used in the used image has the "-rust"
suffix, which will help us to have both ways being used / tested for a
little while.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-17 09:57:02 +01:00
Fabiano Fidêncio
9d88c6b1d7 kata-deploy: Oxidize the script
kata-deploy shell script is not THAT bad and, to be honest, it's quite
handy for quick hacks and quick changes.  However, it's been
increasingly becoming harder to maintain as it's grown its scope from a
testing tool to the proper project's front door, lacking unit tests, and
with an abundacy of complex regular expressions and bashisms to be able
to properly parse the environment variables it consumes.

Morever, the fact it is a Frankstein's monster glued together using
python packages, golang binaries, and a distro dependent container makes
the situation VERY HARD to use it from a distroless container (thus,
avoiding security issues), preventing further integration with
components that require a higher standard of security than we've been
requiring.

With everything said, with the help of Cursor (mostly on generating the
tests cases), here comes the oxidized version of the script, which runs
from a distroless container image.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-17 09:57:02 +01:00
Fabiano Fidêncio
c9cd79655d build: Pass PUSH_TO_REGISTRY and GH_TOKEN to Docker builds
The ORAS cache helper needs PUSH_TO_REGISTRY to be set to 'yes' to
push new artifacts to the cache. However, this environment variable
was not being passed to the Docker container during agent, tools, and
busybox builds.

Moreover, for ghcr.io authentication, add support for using GH_TOKEN and
GITHUB_ACTOR as fallbacks when explicit credentials
(ARTEFACT_REGISTRY_USERNAME/PASSWORD) are not provided.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-16 21:58:16 +01:00
Fabiano Fidêncio
b11cea3113 build: Fix GPG key for gperf
The GPG key used for gperf was incorrectly set to the busybox
maintainer's key (Denis Vlasenko) instead of the gperf maintainer's
key (Marcel Schaible).

Wrong key (busybox): C9E9416F76E610DBD09D040F47B70C55ACC9965B
                     Denis Vlasenko <vda.linux@googlemail.com>

Correct key (gperf): EDEB87A500CC0A211677FBFD93C08C88471097CD
                     Marcel Schaible <marcel.schaible@studium.fernuni-hagen.de>

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-16 21:58:16 +01:00
Fabiano Fidêncio
6e01ee6d47 helm: Provide kata-remote runtime class
kata-remote is a runtime class that cloud-api-adaptor relies on to work.

kata-remote by itself does nothing, and that's the reason it's disabled
by default. We're only adding it here so cloud-api-adaptor charts can
simply do something like `--set shims.remote.enabled=true`.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-16 21:57:49 +01:00
Fabiano Fidêncio
0a0fcbae4a gatekeeper: Adjust to kata-tools
A few jobs have been renamed as part of the kata-tools split.
Let's add them all here.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-16 18:22:40 +01:00
Fabiano Fidêncio
a2534e7bc8 kata-tools: Release as its own tarball
We're only releasing those for amd64 as that's the only architecture
we've been building the packages for.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-16 12:55:07 +01:00
Fabiano Fidêncio
6d2f393be4 build: Split tools build from the other artefacts build
Let's ensure we can create a specific "tools" tarball, which will help
those who only need to pull those either for testing or production
usage.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-16 12:55:07 +01:00
Ruoqing He
1872af7c5a ci: Install cmake before building runtime-rs
cmake is required for libz-sys to compile (which is required by nydus).

Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
2025-12-16 11:26:07 +01:00
Fabiano Fidêncio
1388a3acda packaging: Add ORAS cache for gperf and busybox tarballs
To protect against upstream download failures for gperf and busybox,
implement ORAS-based caching to GHCR.

This adds:
- download-with-oras-cache.sh: Core helper for downloading with cache
- populate-oras-tarball-cache.sh: Script to manually populate cache
- warn() function to lib.sh for consistency

Modified build scripts to:
- Try ORAS cache first (from ghcr.io/kata-containers/kata-containers)
- Fall back to upstream download on cache miss
- Automatically push to cache when PUSH_TO_REGISTRY=yes

The cache is automatically populated during CI builds, and parallel
architecture builds check for existing versions before pushing to avoid
race conditions.

Forks benefit from upstream cache but can override with their own:
ARTEFACT_REPOSITORY=myorg/kata make agent-tarball

Generated-By: Cursor IDE with Claude
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-15 22:04:21 +01:00
Fabiano Fidêncio
a25a53c860 kata-deploy: sa: Fix permissions for patching nodefeaturerules
I've seen this happening with the GPU SNP CI every now and then, but I
don't really understand how this was not caught by the TDX / SNP CI
themselves before.

In any case, the error seen is:
```
  Error from server (Forbidden): error when applying patch:
  {"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"nfd.k8s-sigs.io/v1alpha1\",\"kind\":\"NodeFeatureRule\",\"metadata\":{\"annotations\":{},\"name\":\"amd64-tee-keys\"},\"spec\":{\"rules\":[{\"extendedResources\":{\"sev-snp.amd.com/esids\":\"@cpu.security.sev.encrypted_state_ids\"},\"labels\":{\"amd.feature.node.kubernetes.io/snp\":\"true\"},\"matchFeatures\":[{\"feature\":\"cpu.security\",\"matchExpressions\":{\"sev.snp.enabled\":{\"op\":\"Exists\"}}}],\"name\":\"amd.sev-snp\"},{\"extendedResources\":{\"tdx.intel.com/keys\":\"@cpu.security.tdx.total_keys\"},\"labels\":{\"intel.feature.node.kubernetes.io/tdx\":\"true\"},\"matchFeatures\":[{\"feature\":\"cpu.security\",\"matchExpressions\":{\"tdx.enabled\":{\"op\":\"Exists\"}}}],\"name\":\"intel.tdx\"}]}}\n"}}}
  to:
  Resource: "nfd.k8s-sigs.io/v1alpha1, Resource=nodefeaturerules", GroupVersionKind: "nfd.k8s-sigs.io/v1alpha1, Kind=NodeFeatureRule"
  Name: "amd64-tee-keys", Namespace: ""
  for: "/opt/kata-artifacts/node-feature-rules/x86_64-tee-keys.yaml": error when patching "/opt/kata-artifacts/node-feature-rules/x86_64-tee-keys.yaml": nodefeaturerules.nfd.k8s-sigs.io "amd64-tee-keys" is forbidden: User "system:serviceaccount:kube-system:kata-deploy-sa" cannot patch resource "nodefeaturerules" in API group "nfd.k8s-sigs.io" at the cluster scope
```

And the fix is as simple as allowing patching and updating a
nodefeaturerule in our service account RBAC.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-15 12:01:20 +01:00
Alex Lyn
f4f61d5666 Merge pull request #12229 from fidencio/topic/kata-deploy-do-deprecations
kata-deploy: Remove deprecated features from 3.23.0
2025-12-15 19:00:07 +08:00
Hyounggyu Choi
b69da5f3ba gatekeeper: Make s390x e2e tests required again
Since the CI issue for s390x was resolved on Dec 5th,
the nightly test result has gone green for 10 consecutive days.
This commit puts the e2e tests for s390x again into the required job list.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-12-15 11:12:25 +01:00
Fabiano Fidêncio
ded6d1636f kata-deploy: Remove deprecated features from 3.23.0
Let's remove the deprecated features that were marked for removal
after Kata Containers 3.23.0:

kata-deploy.sh:
- Remove non-arch-specific variable fallbacks (SHIMS, DEFAULT_SHIM,
  SNAPSHOTTER_HANDLER_MAPPING, ALLOWED_HYPERVISOR_ANNOTATIONS,
  PULL_TYPE_MAPPING, EXPERIMENTAL_FORCE_GUEST_PULL). Each arch now
  has its own default value.
- Remove CREATE_RUNTIMECLASSES and CREATE_DEFAULT_RUNTIMECLASS
  variables and associated functions (create_runtimeclasses,
  delete_runtimeclasses, adjust_shim_for_nfd). RuntimeClasses are
  now managed by Helm chart, not the daemonset script.
- Unsupported architectures now fail with an error instead of
  falling back to non-arch-specific defaults.

Helm chart:
- Remove all deprecated env values (createRuntimeClasses,
  createDefaultRuntimeClass, debug, shims, shims_*, defaultShim,
  defaultShim_*, allowedHypervisorAnnotations, snapshotterHandlerMapping,
  snapshotterHandlerMapping_*, agentHttpsProxy, agentNoProxy,
  pullTypeMapping, pullTypeMapping_*, _experimentalSetupSnapshotter,
  _experimentalForceGuestPull, _experimentalForceGuestPull_*).
- Remove backward compatibility code from _helpers.tpl that checked
  for legacy env values.
- Remove legacy env.shims check from runtimeclasses.yaml.
- Remove CREATE_RUNTIMECLASSES and CREATE_DEFAULT_RUNTIMECLASS env
  vars from kata-deploy.yaml and post-delete-job.yaml.
- Update RBAC to only include runtimeclasses get/patch permissions
  (needed for NFD patching), removing create/delete/list/update/watch.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-13 16:32:00 +01:00
Fabiano Fidêncio
c7d0c270ee release: Bump version to 3.24.0
Bump VERSION and helm-chart versions

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-12 18:15:41 +01:00
Fabiano Fidêncio
5b6a2d25bc podOverhead: Reduce memory overhead for GPU runtime classes
Now that we've bumped to QEMU 10.2.0-rc1, we can take advantage of a fix
that's present there, which fixes the double memory allocation for the
cases where GPUs are being cold-plugged.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-06 00:16:43 +01:00
Fabiano Fidêncio
aaa67df4dd versions: Bump experimental {tdx,snp} QEMU
Let's bump experimental {tdx,snp} QEMU to the tags created Today in the
Confidential Containers repo, which match with QEMU 10.2.0-rc1.

This bump is specially beneficial for us, as we can get rid of QEMU's
double memory allocation when **cold plugging** a GPU.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-05 18:58:35 +01:00
Fabiano Fidêncio
923f97bc66 rootfs: Temporarily revert "gpu: Handle root_hash.txt correctly"
This reverts commit e4a13b9a4a, as it
caused some issues with the GPU workflows.

Reverting it is better, as it unblocks other PRs.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-12-05 11:47:37 +01:00
Steve Horsman
d27af53902 Merge pull request #12185 from stevenhorsman/runtime-rs-required-checks
ci: Add qemu-runtime-rs AKS tests to required
2025-12-05 10:43:25 +00:00
stevenhorsman
403de2161f version: Update golang to 1.24.11
Needed to fix:
```
Vulnerability #1: GO-2025-4155
    Excessive resource consumption when printing error string for host
    certificate validation in crypto/x509
  More info: https://pkg.go.dev/vuln/GO-2025-4155
  Standard library
    Found in: crypto/x509@go1.24.9
    Fixed in: crypto/x509@go1.24.11
    Vulnerable symbols found:
      #1: x509.HostnameError.Error
```

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-12-04 22:50:07 +01:00