Description: the GPU operator config for the NVIDIA Kata Containers device
plugin needs to be revised. The older one pertains to the vGPU/KubeVirt use case.
Signed-off-by: Rajat Chopra <rajatc@nvidia.com>
Fix two issues in kata-deploy-lifecycle.bats that caused failures on
k3s, k0s and rke2:
run_on_host():
- `kubectl run --rm -i` causes k3s/rke2 to inject session-recording
banners into stdout, polluting command output and breaking string
assertions. Replace with a create/wait/logs/delete sequence so only
the container's actual stdout is captured.
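The replacement sequence can be sketched roughly as below; the pod name, image, and command are illustrative placeholders, not the actual helper's values:

```shell
# create/wait/logs/delete: only the container's stdout ends up in $output,
# with no session-recording banners injected by k3s/rke2.
command -v kubectl >/dev/null 2>&1 || { echo "no kubectl; skipping"; exit 0; }

pod="run-on-host-$$"
kubectl run "$pod" --image=busybox --restart=Never -- sh -c "uname -r"
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded "pod/$pod" --timeout=60s
output=$(kubectl logs "$pod")
kubectl delete pod "$pod" --wait=false
printf '%s\n' "$output"
```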
"Artifacts are fully cleaned up after uninstall":
- After a CRI restart the kubelet may briefly report "Unknown" for the
container runtime version. Retry for up to 60s before asserting.
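The retry can be sketched as follows; the node variable and the 12x5s split of the 60s budget are illustrative:

```shell
# Poll for up to 60s until the kubelet reports a concrete runtime version
# again; right after a CRI restart it may transiently show "Unknown".
command -v kubectl >/dev/null 2>&1 || { echo "no kubectl; skipping"; exit 0; }

for _ in $(seq 1 12); do
    version=$(kubectl get node "$NODE" \
        -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}')
    [ "$version" != "Unknown" ] && break
    sleep 5
done
echo "runtime version: $version"
```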
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Now that kata-deploy has a proper readiness probe (/readyz returns 200
only after install completes), replace the ad-hoc wait strategies with
kubectl wait --for=condition=Ready on the kata-deploy pods.
Note: helm --wait is ineffective for single-node clusters with
maxUnavailable=1 (the DaemonSet is considered ready with 0 ready pods),
so the CI uses kubectl wait on the pod readiness condition directly.
gha-run-k8s-common.sh:
- Drop the waitForProcess polling loop for Running pods
- Drop the `sleep 60s` with its FIXME comment
- Add kubectl wait --for=condition=Ready instead
helm-deploy.bash:
- Drop the extra `kubectl rollout status` after helm
- Drop the `sleep 60`
- The existing --wait on the helm command now suffices
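The replacement wait amounts to something like the following; the namespace, label selector, and timeout are illustrative, not the exact values used by the scripts:

```shell
# Wait until every kata-deploy pod reports Ready, which, with the new
# /readyz probe, implies the install has actually completed on that node.
command -v kubectl >/dev/null 2>&1 || { echo "no kubectl; skipping"; exit 0; }

kubectl -n kube-system wait --timeout=10m \
    --for=condition=Ready pod -l name=kata-deploy
```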
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The kata-deploy DaemonSet pod had no Kubernetes health probes, so the
kubelet could not distinguish between "still installing" and "crashed",
and rolling updates would proceed to the next node before install
actually finished.
Add a lightweight HTTP health server (built on raw tokio TcpListener,
no new crate dependencies) that starts immediately in the install path:
/healthz — liveness: returns 200 as soon as the server binds
/readyz — readiness: returns 503 while installing, 200 after
install completes (artifacts extracted, CRI restarted,
node labeled)
Wire the Helm chart with startup, liveness, and readiness probes
(all individually toggleable). The startup probe allows up to 10
minutes for install to complete before the liveness probe takes over.
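The probe semantics can be sketched as a shell simulation; this is not the actual Rust/tokio server, and the flag file merely stands in for "install completed":

```shell
#!/bin/sh
# /healthz answers 200 from the moment the server can respond at all;
# /readyz flips from 503 to 200 once the install path marks completion.
ready_flag="${TMPDIR:-/tmp}/kata-deploy-ready.$$"

healthz() { echo 200; }                    # alive as soon as we can answer
readyz() {
    if [ -e "$ready_flag" ]; then echo 200; else echo 503; fi
}

echo "healthz during install: $(healthz)"  # 200
echo "readyz  during install: $(readyz)"   # 503 -> pod stays NotReady
touch "$ready_flag"                        # install finished
echo "readyz  after install:  $(readyz)"   # 200 -> pod becomes Ready
rm -f "$ready_flag"
```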
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
On s390x, QEMU uses the CCW bus instead of PCI. The network device
hotplug path was hardcoded to find a PCI slot, which fails with
"no free slots on PCI bridges" on s390x.
Add CCW support to `hotplug_network_device`: when running on a
native CCW bus, allocate a CCW subchannel address and use `devno`
instead of PCI `bus`/`addr`/`vectors`.
Additionally, after hotplugging a network device, the guest kernel
needs time to probe the CCW device before the network interface
appears. Add a retry loop (up to 10 attempts, 100ms apart) to
`handle_interfaces` so that `update_interface` succeeds once the
guest has created the link.
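The retry pattern above can be sketched like this; the real change is in the agent's Rust code, and here a flag file stands in for the guest-created link:

```shell
#!/bin/sh
# Retry a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
    attempts=$1 delay=$2; shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Simulate the guest probing the CCW device: the "link" appears after 300ms,
# and the 10 x 100ms retry loop picks it up.
flag=$(mktemp -u)
( sleep 0.3; touch "$flag" ) &
retry 10 0.1 test -e "$flag" && echo "interface ready"
rm -f "$flag"
```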
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
On nightly CI, run the NVIDIA GPU tests without setting nvrc.log=trace.
This gives us end-to-end test coverage that more closely matches how
users would actually run Kata Containers with NVIDIA GPUs, since trace
logging is not enabled by default in production.
NVRC trace logging remains enabled for PR runs, where the extra
verbosity is useful for debugging failures.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
As we're in the process of stabilising runtime-rs for the coming 4.0.0
release, we'd better start running as many tests as possible with it.
The TDX runtime-rs job is gated to nightly runs only (pr-number ==
"nightly") since we only have a single TDX machine and cannot afford
to run both qemu-tdx and qemu-tdx-runtime-rs on every PR.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The ITA_KEY secret was conditionally passed to TDX jobs for Intel
Trust Authority attestation, but it is no longer needed. Remove it
from all workflow files and the test helper export.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Architecture-specific release workflows were using the same concurrency
group when called from release.yaml, causing GitHub Actions to detect
a deadlock and cancel the builds.
Fix by appending architecture suffix to each workflow's concurrency
group, allowing parallel execution without conflicts.
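As a sketch, the per-architecture workflow's concurrency stanza now looks something like this (the group name and suffix are illustrative, not the exact values in the workflows):

```yaml
# Appending the arch suffix keeps each reusable workflow in its own
# concurrency group when called from release.yaml.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}-s390x
  cancel-in-progress: true
```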
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The codegen check ensures that generated files are up-to-date and
correspond to the tool versions used in CI. Requiring this check
prevents us from accidentally merging, e.g., proto changes without the
corresponding Rust/Go updates.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Apply the same test configs we use in the runtime-go config to the runtime-rs config.
These are:
- runtime.static_sandbox_resource_mgmt = true
- hypervisor.clh.valid_hypervisor_paths includes cloud-hypervisor-glibc
- hypervisor.clh.path = cloud-hypervisor-glibc
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Copy Fail" (CVE-2026-31431) is a high-severity local privilege escalation (LPE)
vulnerability found in the Linux kernel in April 2026, which affects all major
Linux distributions—including those using Long Term Support (LTS) kernels—released since 2017.
The bug allows an unprivileged user to gain root access, escape containers,
and reliably modify the in-memory page cache using a tiny 732-byte script.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Containerd 2.3.0 introduces config schema version 4 (see upstream
RELEASES.md and the version-4 server-plugin documentation). The default file
still uses the same split-CRI layout as version 3 (plugins under
io.containerd.cri.v1.runtime and io.containerd.cri.v1.images). Schema v4
mainly moves gRPC, TTRPC, debug, and metrics listener settings under
io.containerd.server.v1.*; kata-deploy does not edit those server tables except
for containerd log verbosity when DEBUG=true.
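A minimal illustrative fragment of the v4 layout, showing only the tables mentioned above (other keys omitted; treat the exact server table names as per upstream docs, not this sketch):

```toml
version = 4

# Same split-CRI layout as schema v3:
[plugins.'io.containerd.cri.v1.runtime']
[plugins.'io.containerd.cri.v1.images']

# gRPC, TTRPC, debug, and metrics listener settings move under the
# io.containerd.server.v1.* tables in v4; kata-deploy leaves them alone
# except for the containerd log verbosity when DEBUG=true.
```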
Fixes: #12936
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Now that all of the tools but agent-ctl (still WIP) are
in the root workspace, switch the default to that and add
an exception for agent-ctl as it's the odd one out.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
With the workspace unification we've bumped anyhow
from 1.0.31 to 1.0.102, so update the code to reflect that
the error type implements `Display` in the newer version.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add trace-forwarder to be a workspace member to simplify the
dependency management.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
During the zizmor refactoring I changed the names of two jobs
to make all the architectures match. I forgot to update required_tests,
and since it was a workflow-only change the PR didn't check this, so
update them now.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>