Add vendored Go code for all packages to the vendor tarball.
This should be enough for people who need vendored code, e.g.
for hermetic builds.
The repo only tracks four vendored Go code directories, but the
script considers all go.mod files across the repo for the
sake of simplicity. The impact on the size of the tarball
is less than 20 MB.
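The "consider every go.mod" approach can be sketched roughly as below. This is an illustration only, not the actual script: the function name `find_go_module_dirs` and the choice to skip `.git` and existing `vendor` trees are assumptions.

```python
import os


def find_go_module_dirs(repo_root):
    """Return every directory under repo_root that contains a go.mod file."""
    module_dirs = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Assumption: do not descend into VCS metadata or already-vendored
        # trees, since go.mod files found there are not modules of the repo.
        dirnames[:] = [d for d in dirnames if d not in {".git", "vendor"}]
        if "go.mod" in filenames:
            module_dirs.append(dirpath)
    return sorted(module_dirs)
```

Each returned directory is a place where `go mod vendor` could then be run before the resulting vendor trees are packed into the tarball.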
It is now possible to stop tracking vendored code in git and
to get rid of `make vendor`.
Signed-off-by: Greg Kurz <groug@kaod.org>
Ensures go.mod and go.sum files are kept up-to-date on PRs that modify
Go code, go modules, or the Go version in versions.yaml.
The workflow can also be run directly from the GitHub UI, in order
to check the tidiness of the target branch.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
This silences the following warning:
warning: `.../.cargo/config` is deprecated in favor of `config.toml`
|
= help: if you need to support cargo 1.38 or earlier, you can symlink `config` to `config.toml`
We don't need to support cargo 1.38 or earlier.
Signed-off-by: Greg Kurz <groug@kaod.org>
rootfs.sh stops passing a host GOPATH bind-mount into the inner
osbuilder docker run. Pass INSTALL_IN_GOPATH=false so
ci/install_yq.sh installs yq under /usr/local/bin in the container.
scripts/lib.sh resolves yq after sourcing install_yq.sh and fails
clearly if yq is still missing.
This avoids build issues on (managed) build hosts where HOME, for
example, resolves to /localhome/... while the image user record
still points at /home/... On those hosts the old flow could make
the daemon bind-mount a GOPATH path that does not exist or is not
writable on the host (e.g. mkdir or mount under /home/... denied).
Co-authored-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Upgrade netlink-packet-route and rtnetlink so IFLA_INET6_CONF matches the
kernel's 240-byte layout (DEVCONF_FORCE_FORWARDING). Adapt to API changes:
NeighbourAttribute::LinkLayerAddress and bool MulticastSnooping.
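To illustrate why the payload length matters, here is a minimal sketch that decodes an IFLA_INET6_CONF payload as an array of native-endian signed 32-bit values, so that a 240-byte payload yields 60 entries. The function name `parse_inet6_conf` is hypothetical, and this is not how netlink-packet-route implements it.

```python
import struct

# Each DEVCONF_* entry is one signed 32-bit integer in host byte order.
ENTRY_SIZE = struct.calcsize("=i")


def parse_inet6_conf(payload: bytes) -> list:
    """Decode the payload into a list of per-entry integer values."""
    if len(payload) % ENTRY_SIZE != 0:
        raise ValueError(f"payload length {len(payload)} is not a multiple of {ENTRY_SIZE}")
    count = len(payload) // ENTRY_SIZE
    return list(struct.unpack(f"={count}i", payload))
```

A decoder that hardcodes a shorter, older length would reject or truncate a 240-byte payload, which is the kind of mismatch the upgrade addresses.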
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The cron-job test workload was missing `runtimeClassName: kata`, which
meant the cron job was not actually being executed under the Kata
runtime, defeating the purpose of the test.
Set it explicitly, consistent with the sibling `job.yaml` workload.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The GPU Operator config for the NVIDIA Kata Containers device plugin
needs to be revised. The older config applies to the vGPU/KubeVirt use case.
Signed-off-by: Rajat Chopra <rajatc@nvidia.com>
Fix two issues in kata-deploy-lifecycle.bats that caused failures on
k3s, k0s and rke2:
run_on_host():
- `kubectl run --rm -i` causes k3s/rke2 to inject session-recording
banners into stdout, polluting command output and breaking string
assertions. Replace with a create/wait/logs/delete sequence so only
the container's actual stdout is captured.
"Artifacts are fully cleaned up after uninstall":
- After a CRI restart the kubelet may briefly report "Unknown" for the
container runtime version. Retry for up to 60s before asserting.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Now that kata-deploy has a proper readiness probe (/readyz returns 200
only after install completes), replace the ad-hoc wait strategies with
kubectl wait --for=condition=Ready on the kata-deploy pods.
Note: helm --wait is ineffective for single-node clusters with
maxUnavailable=1 (the DaemonSet is considered ready with 0 ready pods),
so the CI uses kubectl wait on the pod readiness condition directly.
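The arithmetic behind the note above can be sketched as follows. This is a simplified model of a rollout check that tolerates `maxUnavailable` pods, not Helm's actual code; the function name and parameters are assumptions.

```python
def daemonset_considered_ready(desired: int, ready: int, max_unavailable: int) -> bool:
    """Simplified check: ready once at least (desired - max_unavailable) pods are ready."""
    expected_ready = desired - max_unavailable
    return ready >= expected_ready


# Single-node cluster: one desired pod, maxUnavailable=1, nothing ready yet.
print(daemonset_considered_ready(desired=1, ready=0, max_unavailable=1))  # True
```

With one desired pod and `maxUnavailable=1`, the threshold is zero ready pods, so the check passes before install finishes. Waiting on each pod's Ready condition avoids that.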
gha-run-k8s-common.sh:
- Drop the waitForProcess polling loop for Running pods
- Drop the `sleep 60s` with its FIXME comment
- Add kubectl wait --for=condition=Ready instead
helm-deploy.bash:
- Drop the extra `kubectl rollout status` after helm
- Drop the `sleep 60`
- The existing --wait on the helm command now suffices
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The kata-deploy DaemonSet pod had no Kubernetes health probes, so the
kubelet could not distinguish between "still installing" and "crashed",
and rolling updates would proceed to the next node before install
actually finished.
Add a lightweight HTTP health server (built on raw tokio TcpListener,
no new crate dependencies) that starts immediately in the install path:
/healthz — liveness: returns 200 as soon as the server binds
/readyz — readiness: returns 503 while installing, 200 after
install completes (artifacts extracted, CRI restarted,
node labeled)
Wire the Helm chart with startup, liveness, and readiness probes
(all individually toggleable). The startup probe allows up to 10
minutes for install to complete before the liveness probe takes over.
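The split between the two endpoints can be sketched as below. The real server is built on a raw tokio TcpListener in Rust; this sketch swaps in Python's standard `http.server` purely for illustration, and `install_done` and `start_health_server` are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once the (hypothetical) install steps have all finished.
install_done = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status = 200  # liveness: the server is up and answering
        elif self.path == "/readyz":
            # readiness: 503 while installing, 200 once install completed
            status = 200 if install_done.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_health_server(port: int = 0) -> HTTPServer:
    """Bind immediately and serve in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The install path would call `start_health_server()` first and `install_done.set()` last, so the kubelet sees a live but not yet ready pod in between.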
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
On s390x, QEMU uses the CCW bus instead of PCI. The network device
hotplug path was hardcoded to find a PCI slot, which fails with
"no free slots on PCI bridges" on s390x.
Add CCW support to `hotplug_network_device`: when running on a
native CCW bus, allocate a CCW subchannel address and use `devno`
instead of PCI `bus`/`addr`/`vectors`.
Additionally, after hotplugging a network device, the guest kernel
needs time to probe the CCW device before the network interface
appears. Add a retry loop (up to 10 attempts, 100ms apart) to
`handle_interfaces` so that `update_interface` succeeds once the
guest has created the link.
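The retry behaviour described above (up to 10 attempts, 100ms apart) can be sketched as follows. This is not the agent's Rust code: the `retry` helper, the use of `LookupError` to stand in for "link not found yet", and the injectable `sleep` parameter are all assumptions made for illustration.

```python
import time


def retry(operation, attempts=10, delay=0.1, sleep=time.sleep):
    """Call operation until it succeeds, sleeping between failed attempts."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except LookupError as err:  # stand-in for "interface not found yet"
            last_error = err
            if attempt < attempts:
                sleep(delay)  # give the guest time to probe the device
    raise last_error
```

Bounding the loop keeps a genuinely missing interface from hanging the request: after the last attempt the original error is surfaced to the caller.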
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
On nightly CI, run the NVIDIA GPU tests without setting nvrc.log=trace.
This gives us end-to-end test coverage that more closely matches how
users would actually run Kata Containers with NVIDIA GPUs, since trace
logging is not enabled by default in production.
NVRC trace logging remains enabled for PR runs, where the extra
verbosity is useful for debugging failures.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
As we're in the process of stabilising runtime-rs for the coming 4.0.0
release, we had better start running as many tests as possible with it.
The TDX runtime-rs job is gated to nightly runs only (pr-number ==
"nightly") since we only have a single TDX machine and cannot afford
to run both qemu-tdx and qemu-tdx-runtime-rs on every PR.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The ITA_KEY secret was conditionally passed to TDX jobs for Intel
Trust Authority attestation, but it is no longer needed. Remove it
from all workflow files and the test helper export.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Architecture-specific release workflows were using the same concurrency
group when called from release.yaml, causing GitHub Actions to detect
a deadlock and cancel the builds.
Fix this by appending an architecture suffix to each workflow's concurrency
group, allowing parallel execution without conflicts.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The codegen check ensures that generated files are up-to-date and
correspond to the tool versions used in CI. Requiring this check
prevents us from accidentally merging, e.g., proto changes without the
corresponding Rust/Go updates.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Apply the same test configs we use in the runtime-go config to the runtime-rs config.
These are:
- runtime.static_sandbox_resource_mgmt = true
- hypervisor.clh.valid_hypervisor_paths includes cloud-hypervisor-glibc
- hypervisor.clh.path = cloud-hypervisor-glibc
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
"Copy Fail" (CVE-2026-31431) is a high-severity local privilege escalation (LPE)
vulnerability found in the Linux kernel in April 2026. It affects all major
Linux distributions released since 2017, including those using Long Term Support (LTS) kernels.
The bug allows an unprivileged user to gain root access, escape containers,
and modify the in-memory page cache reliably using a tiny 732-byte script.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Containerd 2.3.0 introduces config schema version 4 (see upstream
RELEASES.md and the version-4 server-plugin documentation). The default file
still uses the same split-CRI layout as version 3 (plugins under
io.containerd.cri.v1.runtime and io.containerd.cri.v1.images). Schema v4
mainly moves gRPC, TTRPC, debug, and metrics listener settings under
io.containerd.server.v1.*; kata-deploy does not edit those server tables except
for containerd log verbosity when DEBUG=true.
Fixes: #12936
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>