kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 22:50:54 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	8d2ecaabb5	versions: Bump QEMU to v11.0.0 For more details see QEMU's release notes: https://www.qemu.org/2026/04/22/qemu-11-0-0/ GPU experimental variants are also using v11.0.0 plus one patch to solve issues related to NUMA mapping. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	5d3e1e6396	kata-deploy: verify kata-runtime label remains stable on rke2/k3s The retry loop added in `efd468df3f` still allows the install to declare success while inside the kubelet's post-restart re-register window. On rke2/k3s, `systemctl restart rke2-agent` restarts both containerd and the kubelet, but `wait_till_node_is_ready` polls `.status.conditions[Ready]` every 2 s and returns on the first `True` observation it sees. By default the kubelet only publishes node status every ~10 s, so that first `True` is almost always the stale value from before the restart; the kubelet hasn't actually finished restarting yet. `label_node_with_retry` then applies the label, sleeps 1 s, reads back "true" (still stale, kubelet still down), and returns Ok. Install completes, `/readyz` flips to 200, helm releases its `--wait`, and the bats test starts, and only then does the kubelet finish coming up, re-register the node, and clobber the label with its cached set. The lifecycle test sees an empty `katacontainers.io/kata-runtime` and fails: # Node label katacontainers.io/kata-runtime: not ok 1 Kata artifacts are present on host after install A single-shot verification can't distinguish "still stale true" from "truly stable true after kubelet re-register". Replace it with a stability window: after (re)applying the label, require it to remain at the expected value for STABILITY_CHECKS=6 consecutive observations spaced CHECK_INTERVAL=2 s apart (≈ 12 s, comfortably more than the kubelet's status-update period). If the value ever drifts inside the window, re-apply and restart the stability counter. Bounded by MAX_APPLY_ATTEMPTS=12, so worst case is ~3 min; happy path adds ~12 s to install. Also add a short polling loop to the test's own label assertion as belt-and-suspenders for any leftover transient race, matching the existing retry pattern used for the container-runtime version check. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-22 11:53:18 +02:00
Fabiano Fidêncio	7536f2c616	Merge pull request #13055 from kata-containers/topic/kata-deploy-only-install-what-will-be-used kata-deploy: only install what will actually be used	2026-05-21 17:53:09 +02:00
Fabiano Fidêncio	efd468df3f	kata-deploy: retry node labeling after CRI restart On rke2/k3s a CRI restart also restarts the kubelet, which may briefly re-register the node with its cached label set and clobber the kata-runtime label that was just applied via the API. Replace the single label_node call with a retry loop that verifies the label value after setting it. If the label is missing or has the wrong value, it is re-applied (up to 10 attempts with 2 s back-off). This fixes a race condition that became more visible after the switch to individual tarball extraction, which made install take slightly longer and shifted the kubelet re-registration timing window. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	291e4d37be	kata-deploy: implement selective tarball extraction in installer Add zstd and tar as Rust dependencies and rewrite the artifact installation logic to extract only the component tarballs required by the enabled runtime classes. extract_component_tarballs reads shim-components.json to determine which kata-static-<name>.tar.zst files are needed for the selected shims and current architecture. Shared components (e.g. kernel, shim-v2-go) are listed by multiple shims and must only be unpacked once per install run. Deduplication is handled with an in-memory set passed through the call, avoiding any risk of stale on-disk state surviving across pod restarts. Within each tarball, opt/kata path prefixes are stripped and absolute symlink / hard-link targets are rewritten to point at the resolved installation directory, correctly handling MULTI_INSTALL_SUFFIX. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	9a0acc6c4c	kata-deploy: ship individual component tarballs; drop merged tarball Update the Dockerfile to copy each kata-static-<name>.tar.zst directly into the image alongside shim-components.json, replacing the old artifact-extractor stage that unpacked a single merged tarball. Update the publish-kata-deploy-payload and release CI workflows to download individual per-component artifacts instead of waiting for a merged tarball, and simplify kata-deploy-build-and-upload-payload.sh accordingly. The kata-deploy image build is no longer blocked on the merge step. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	87e55be4a3	kata-deploy: add shim-components.json component manifest Introduces the human-maintained shim-components.json that maps each runtime class to the list of kata-static-<name>.tar.zst component tarballs it needs per architecture. This is the source of truth read by the installer at deploy time to decide which tarballs to extract. Key design choices encoded here: - shim-v2-go vs shim-v2-rust: explicit per-shim, so a node running only Rust shims never extracts the Go shim binary. - virtiofsd and nydus are both listed for hypervisors that support configurable shared_fs (we cannot know which the user will choose). - fc/firecracker: no virtiofsd or nydus (devmapper only). - remote: only the shim binary (no local hypervisor artifacts). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	c87e327876	kata-deploy: split shim-v2 into shim-v2-go and shim-v2-rust Split the monolithic shim-v2 build target into separate shim-v2-go and shim-v2-rust targets in kata-deploy-binaries.sh, the local-build Makefile, and the four architecture CI workflows. The Go and Rust shims now each produce their own kata-static-<name>.tar.zst artifact, allowing downstream consumers to select only the shim variant they need. MEASURED_ROOTFS is set per-arch for the Rust job in CI. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
stevenhorsman	3f27052184	kata-deploy: always add HEAD commit SHA tag to all builds Previously, the commit SHA tag was only added for specific components (agent, agent-ctl) by setting artefact_tag in individual install functions. This was inconsistent and error-prone. Now, the HEAD commit SHA is always added as a tag for all builds in the central tagging logic. This ensures: - All components get tagged with the commit SHA - The correct HEAD commit is used (not the last commit that modified a specific path) - Simpler, more maintainable code The git command uses `git -C` to change to the repo directory before running git log, which correctly returns the HEAD commit SHA regardless of which files were modified in recent commits. Assisted-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-20 17:42:09 +01:00
stevenhorsman	76fc847c78	release: correct .cargo/config.toml reference in generate_vendor.sh The script was creating .cargo/config.toml but referencing .cargo/config in the vendor_dir_list, causing tar to fail with 'Cannot stat' error. Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-05-19 18:23:53 +01:00
stevenhorsman	a4cfe32157	release: Bump version to 3.31.0 Bump VERSION and helm-charts versions. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-19 15:32:50 +02:00
Fabiano Fidêncio	7c971f0c4c	Merge pull request #13069 from fidencio/topic/kata-deploy-prevent-eviction helm-chart: add priorityClassName to prevent kata-deploy eviction	2026-05-18 21:08:45 +02:00
Fabiano Fidêncio	5d40ba66ff	helm-chart: add priorityClassName to prevent kata-deploy eviction kata-deploy is a per-node infrastructure DaemonSet; if it gets evicted under node memory/CPU pressure the node loses its Kata runtime until the pod is rescheduled. Default to system-node-critical so the kubelet evicts lower-priority workloads first. The value is configurable via `priorityClassName` in values.yaml. Fixes: #13068 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-18 15:14:06 +02:00
stevenhorsman	e3a00a2ec2	kata-deploy: fix binary location for agent-ctl Moving agent-ctl into the root workspace moves the target directory, so update this target to be in root, not src/tools Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-18 09:47:15 +01:00
Manuel Huber	ed4233bf91	rootfs: cdh: Update CDH to new version Update CDH to a newer version and: - adjust the NVIDIA root filesystem build to reflect the change from using libcryptsetup to using the cryptsetup binary. - adjust image-pull test cases to conduct parallel write operations on the /dev/trusted_store backed guest image pull location since issue #12721 has been solved on CDH side. Fixes #12721 Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-13 20:20:45 +02:00
Fabiano Fidêncio	93e02944fa	image-builder/nvidia: skip DAX header for virtio-blk-pci images The DAX header (2 MiB of NVDIMM metadata + a duplicate MBR) is unconditionally prepended to every image by set_dax_header(). NVIDIA images use virtio-blk-pci with disable_image_nvdimm=true, so the kernel reads MBR #1 directly and never touches the DAX metadata -- it is dead weight. Add a SKIP_DAX_HEADER environment variable (default "no") that, when set to "yes", skips the DAX header entirely: - Removes the 2 MiB DAX overhead from image size calculations in both the erofs and ext4 paths - Skips the set_dax_header() call, avoiding compilation and execution of the nsdax tool - Passes the variable through to containerised builds Enable SKIP_DAX_HEADER=yes for both install_image_nvidia_gpu() and install_image_nvidia_gpu_confidential() in the build pipeline. All other image builds are unaffected (default remains "no"). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 17:18:05 +02:00
Fabiano Fidêncio	6b802a4e30	nvidia: switch GPU rootfs images to erofs Switch the NVIDIA GPU rootfs images (both standard and confidential) from ext4 to erofs (Enhanced Read-Only File System). Unlike ext4, which is a read-write filesystem mounted read-only by convention, erofs is structurally read-only -- no journal, no write metadata, no superblock write path. This eliminates accidental mutation and reduces the attack surface inside the guest VM, which is particularly important for confidential workloads using dm-verity. Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4 so non-NVIDIA configurations are unaffected. Update all six NVIDIA GPU configuration templates (base, SNP, TDX for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global @DEFROOTFSTYPE@. Export FS_TYPE=erofs in install_image_nvidia_gpu() and install_image_nvidia_gpu_confidential() so the build pipeline produces erofs images via the image builder. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 17:18:05 +02:00
Fabiano Fidêncio	341a0d366c	kata-deploy: Fix containerd debug level path for config schema v4 Containerd 2.3 (config schema v4) uses the top-level [debug] table for log level configuration, not plugins."io.containerd.server.v1.debug" as was the case in the RC builds. Update containerd_debug_level_toml_path() to use .debug.level for all schema versions, matching the released containerd behavior. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 12:02:24 +02:00
stevenhorsman	87664c608d	version: Bump to latest 6.18 kernel Pick up the latest kernel that fixes CVE-2026-43284 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-08 17:15:24 +01:00
Fabiano Fidêncio	0f3160276b	ci: k8s: skip no-op Helm uninstall on free runners In cleanup_kata_deploy, bail out early when no kata-deploy Helm release exists so baremetal-* pre-deploy cleanup on fresh clusters does not block on helm uninstall --wait (up to 10m). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	f5533950e6	kata-deploy: helm: cap container RSS via resources block Plumb a resources block into the kata-deploy DaemonSet container in the Helm chart so the cluster can size its memory footprint predictably. Defaults are sized from real /proc/<pid>/status numbers on an unpatched 3.30.0 build running on a ~220-vCPU GPU node: VmRSS: 9944 kB (~9.7 MiB) <- actual physical memory RssAnon: 2628 kB (~2.6 MiB) <- heap + dirty stack pages VmData: 464668 kB (~454 MiB) <- tokio multi-thread workers' reserved-but-untouched stacks Threads: 225 <- num_cpus()-driven worker pool That VmData number is the source of the original "kata-deploy is using 400 MB" reports: any monitoring layer that surfaces virtual data size, committed memory, or memory.usage_in_bytes on a kernel that includes mapped-but-untouched memory will happily reproduce ~400 MB even though only ~10 MiB is ever made resident. The earlier commits in this series (current_thread tokio, mimalloc, shared kube client, JSONPath removal, post-install re-exec) collapse VmData into the tens of MiB and drop the post-install resident set further. The defaults below are picked accordingly: requests: cpu: 25m # install is mostly I/O wait; the post-install # waiter is genuinely idle memory: 16Mi # ~2x headroom over the unpatched VmRSS we # measured, far more over the patched waiter Operators who hit OOMKilled on unusually large or churny clusters can override `resources` directly in their Helm values (or set it to {} to remove all requests and inherit cluster defaults). Fixes: https://github.com/kata-containers/kata-containers/discussions/12976 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	9e99b21ec5	kata-deploy: re-exec into a tiny post-install waiter After install completes the kata-deploy DaemonSet pod has nothing else to do for the rest of its lifetime — it just blocks on SIGTERM and then runs cleanup. Up to here, the install path has built up substantial peak heap (kube clients, deserialised Node/RuntimeClass objects, hyper + rustls TLS pools, parsed JSON / YAML), and on musl essentially none of that is ever returned to the kernel. Idling in the same process therefore pins the pod's RSS at the install peak indefinitely. Re-exec the binary into a hidden `internal-post-install-wait` action the moment install succeeds. execve(2) discards the entire address space, so the waiter starts up holding only the working set it actually needs (a config struct, the SIGTERM handler, and the health server). To avoid a probe-availability gap during the handover the install process clears FD_CLOEXEC on the health listener and passes the raw FD to the child via KATA_DEPLOY_HEALTH_FD. The child reattaches the FD as a tokio TcpListener and resumes serving /healthz and /readyz without ever closing the socket — the kubelet sees no failure. The detected container runtime is similarly threaded through KATA_DEPLOY_DETECTED_RUNTIME so the waiter doesn't have to re-query the apiserver. The new action is tagged `#[clap(hide = true)]` so `--help` doesn't expose it; users should never invoke it directly. Add the FD-inheritance helpers in health.rs: - prepare_listener_for_exec(): clears FD_CLOEXEC on a listener and returns its raw fd number. - listener_from_inherited_fd(): wraps an inherited fd back into a tokio::net::TcpListener (and re-sets FD_CLOEXEC so future host shellouts don't leak the socket). Fixes: https://github.com/kata-containers/kata-containers/discussions/12976 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	af03ab2228	kata-deploy: replace JSONPath node lookups with typed accessors The two pieces of node metadata kata-deploy actually reads are .status.nodeInfo.containerRuntimeVersion and a single label, both of which were being fetched through a homegrown JSONPath walker: - get_node_field() serialised the entire Node object back into a serde_json::Value tree on every call, - split_jsonpath() / get_jsonpath_value() then walked that tree by string key. Both the deep clone and the helpers themselves are unnecessary — kube's Node type is already strongly typed. Replace get_node_field() with two purpose-built accessors that read straight off the Node struct: - get_container_runtime_version(): pulls status.node_info.container_runtime_version with a clear error if the field isn't populated. - get_node_label(key): returns Option<String> directly from metadata.labels. Drop split_jsonpath, get_jsonpath_value, and their unit tests (which existed only to cover the JSONPath walker we no longer have). Update the three callers (config.rs, runtime/manager.rs, runtime/containerd.rs) to use the typed accessors. This removes the entire serde_json::Value clone-and-walk path from the hot read path and meaningfully cuts allocator churn during install. Fixes: https://github.com/kata-containers/kata-containers/discussions/12976 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	6cd842494c	kata-deploy: cap the tokio worker pool to 2 threads The default #[tokio::main] expands with flavor = "multi_thread" and worker_threads = num_cpus::get(). On a typical NVIDIA GPU node (200+ vCPUs) that allocates 200+ worker threads with ~2 MiB stacks each, which is the single largest contributor to the DaemonSet pod's VmData reservation: hundreds of MiB of address space mapped but never touched, easily reproducing the "kata-deploy is using ~400 MB" reports on any monitoring layer that surfaces VSZ / committed virtual memory. Switch to a fixed two-worker multi-thread runtime instead: #[tokio::main(flavor = "multi_thread", worker_threads = 2)] Two workers is exactly the right number for kata-deploy: - the install path is overwhelmingly I/O-bound and runs serially; one worker is enough to drive the install future itself, - install does shell out to `nsenter --target 1 systemctl restart containerd` (and friends) via the synchronous std::process::Command::output(), which wedges the worker thread it runs on for tens of seconds; the second worker keeps the spawned health-server task able to answer kubelet probes inside timeoutSeconds while the first is blocked. flavor = "current_thread" would be tighter still on stacks (~4 MiB saved) but is fundamentally unsafe here: with a single runtime thread, any blocking host_systemctl call freezes the health server too, the kubelet fails the readiness probe, and the pod is restarted long before install completes. The CI lifecycle test reliably reproduces this as a 15-minute timeout waiting for the kata-deploy DaemonSet pod to become Ready. Net result vs. upstream's num_cpus()-driven pool on a 200-vCPU node: ~200 fewer worker threads, ~400 MiB less VmData reservation, while keeping kubelet probes responsive across the entire install path. Add the "sync" tokio feature here too so subsequent commits in the series can use tokio::sync primitives (OnceCell) without another features bump. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	346119108e	kata-deploy: drop unused kube features The binary doesn't use kube::runtime (controllers, watchers, reflectors) or kube::derive (the CustomResource macro). Pulling them in only added transitive deps (kube-runtime, kube-derive, backon, educe, ahash, async-broadcast, ...) and inflated the binary's static data segment for no functional gain. Set default-features = false and select only what the binary actually calls into: the kube-client surface plus the rustls-tls backend that hyper-rustls already pulled in transitively. Behaviour is unchanged. Fixes: https://github.com/kata-containers/kata-containers/discussions/12976 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	1682b73e38	kata-deploy: Add qemu-nvidia-gpu-tdx-runtime-rs shim Register the new qemu-nvidia-gpu-tdx-runtime-rs shim across the kata-deploy stack so it is built, installed, and exposed as a RuntimeClass. This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the runtime-rs binary), SHIMS list, the qemu-tdx-experimental share name mapping, and the x86_64 default shim set. The Helm chart gets the new shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the RuntimeClass overhead definition in runtimeclasses.yaml. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	2280620cb9	kata-deploy: Add qemu-nvidia-gpu-snp-runtime-rs shim Register the new qemu-nvidia-gpu-snp-runtime-rs shim across the kata-deploy stack so it is built, installed, and exposed as a RuntimeClass. This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the runtime-rs binary), SHIMS list, the qemu-snp-experimental share name mapping, and the x86_64 default shim set. The Helm chart gets the new shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the RuntimeClass overhead definition in runtimeclasses.yaml. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	92a8cd56d1	kata-deploy: Add qemu-nvidia-gpu-runtime-rs shim Register the Rust NVIDIA GPU runtime as a kata-deploy shim so it gets installed and configured alongside the existing Go-based qemu-nvidia-gpu shim. Add qemu-nvidia-gpu-runtime-rs to the RUST_SHIMS list and the default enabled shims, create its RuntimeClass entry in the Helm chart, and include it in the try-kata-nvidia-gpu values overlay. The kata-deploy installer will now copy the runtime-rs configuration and create the containerd runtime entry for it. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	acfb9f9762	Merge pull request #12954 from zvonkok/modular-makefile build: remove gha-adjust-to-use-prebuilt-components.sh	2026-05-07 10:32:32 +02:00
Greg Kurz	b44e56d3db	runtime: Remove vendor directory Now shipped in the vendored code tarball. Drop the git tree status check since it isn't needed anymore. Also stop building with `-mod=vendor`. This requires exposing GOMODCACHE as suggested by Fabiano Fidêncio. Signed-off-by: Greg Kurz <groug@kaod.org>	2026-05-06 09:47:30 +02:00
Greg Kurz	aa9145a762	generate_vendor: Add go vendored code Add go vendored code for all packages to the vendor tarball. This should be enough for people who need vendored code, e.g. for hermetic builds. The repo only tracks 4 go vendored code directories but the script considers all go.mod files across the repo, for the sake of simplicity. The impact on the size of the tarball is less than 20 MB. It is now possible to stop tracking vendored code in git and to get rid of `make vendor`. Signed-off-by: Greg Kurz <groug@kaod.org>	2026-05-06 09:32:01 +02:00
Greg Kurz	6c3de068a4	generate_vendor: Adapt to modern cargo This is to silence: warning: `.../.cargo/config` is deprecated in favor of `config.toml` | = help: if you need to support cargo 1.38 or earlier, you can symlink `config` to `config.toml` We don't care about cargo 1.38 or earlier. Signed-off-by: Greg Kurz <groug@kaod.org>	2026-05-06 09:31:54 +02:00
Alex Tibbles	8d7246e29a	kernel: bump kernel versions other than dragonball Applies fix for CVE-2026-31431 for non-dragonball configurations on current LTS 6.18. Signed-Off-By: Alex Tibbles <alex@bleg.org>	2026-05-05 09:30:46 +02:00
Fabiano Fidêncio	27c3dfbb8c	Merge pull request #12943 from fidencio/topic/kata-deploy-add-http-health-probes kata-deploy: add HTTP health probes (healthz/readyz)	2026-05-05 09:30:17 +02:00
Fabiano Fidêncio	d9722ba4be	Merge pull request #12960 from microsoft/saul/update_mariner_test_configs kata-deploy: configure_mariner: update test configs	2026-05-04 18:26:41 +02:00
Fabiano Fidêncio	49396b7991	kata-deploy: add HTTP health probes (healthz/readyz) The kata-deploy DaemonSet pod had no Kubernetes health probes, so the kubelet could not distinguish between "still installing" and "crashed", and rolling updates would proceed to the next node before install actually finished. Add a lightweight HTTP health server (built on raw tokio TcpListener, no new crate dependencies) that starts immediately in the install path: /healthz — liveness: returns 200 as soon as the server binds /readyz — readiness: returns 503 while installing, 200 after install completes (artifacts extracted, CRI restarted, node labeled) Wire the Helm chart with startup, liveness, and readiness probes (all individually toggleable). The startup probe allows up to 10 minutes for install to complete before the liveness probe takes over. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-03 22:09:08 +02:00
stevenhorsman	a1a6a9a150	release: Bump version to 3.30.0 Bump VERSION and helm-charts versions. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-02 17:57:39 +01:00
Saul Paredes	cbb06545f7	kata-deploy: configure_mariner: also apply test config to runtime-rs Apply same test configs we use in runtime-go config to runtime-rs config. These are: - runtime.static_sandbox_resource_mgmt = true - hypervisor.clh.valid_hypervisor_paths includes cloud-hypervisor-glibc - hypervisor.clh.path = cloud-hypervisor-glibc Signed-off-by: Saul Paredes <saulparedes@microsoft.com>	2026-05-01 08:15:52 -07:00
Saul Paredes	564d381b79	kata-deploy: configure_mariner: correctly set static_sandbox_resource_mgmt static_sandbox_resource_mgmt is under the runtime config, not the hypervisor one. See `31f7438ecd/src/runtime/config/configuration-clh.toml.in (L439)` Signed-off-by: Saul Paredes <saulparedes@microsoft.com>	2026-05-01 08:15:52 -07:00
Zvonko Kaiser	803531dd9c	kernel: Bump Kernel Version "Copy Fail" (CVE-2026-31431) is a high-severity local privilege escalation (LPE) vulnerability found in the Linux kernel in April 2026, which affects all major Linux distributions, including those using Long Term Support (LTS) kernels, released since 2017. The bug allows an unprivileged user to gain root access, escape containers, and modify the in-memory page cache reliably using a tiny 732-byte script. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-01 14:21:49 +00:00
Fabiano Fidêncio	96b68e77a7	kata-deploy: support containerd config schema version 4 and newer Containerd 2.3.0 introduces config schema version 4 (see upstream RELEASES.md and the version-4 server-plugin documentation). The default file still uses the same split-CRI layout as version 3 (plugins under io.containerd.cri.v1.runtime and io.containerd.cri.v1.images). Schema v4 mainly moves gRPC, TTRPC, debug, and metrics listener settings under io.containerd.server.v1.*; kata-deploy does not edit those server tables except for containerd log verbosity when DEBUG=true. Fixes: #12936 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-30 16:23:43 +02:00
stevenhorsman	b61b3d2f20	kata-deploy: Update default tool binary location Now that all but agent-ctl (still WIP) of the tools are in the root workspace, switch the default to that and add the exception for agent-ctl as it's the odd one out. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-04-30 08:46:22 +01:00
Zvonko Kaiser	35dfb11fe4	build: replace prebuilt-components sed hack with DEPS= Mutating the Makefile in-place to strip prereqs was fragile and limited to one target per invocation. DEPS= skips deps declaratively and propagates through recursive make, so multi-target builds can opt out in one shot. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-04-30 00:48:46 +00:00
Zvonko Kaiser	54c514e249	build: allow overriding rootfs/boot tarball prereqs via DEPS Skipping prereq rebuilds is useful when artifacts are already staged from a prior run (CI splitting work across jobs, local iteration). Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-04-29 23:59:05 +00:00
stevenhorsman	9cae783f14	kata-deploy: fix binary location for trace-forwarder Moving the trace-forwarder into the root workspace moves the target directory, so update this target. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-04-29 13:27:09 +01:00
Aurélien Bombo	dc0f1795de	kata-deploy: remove useless unit tests These essentially merely test format!(), which is not our job. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-04-28 10:58:01 -05:00
Aurélien Bombo	cf6a91a104	runtime-rs/config: rename cloud-hypervisor to clh This aligns with the previous commit and runtime-go. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-04-28 10:58:01 -05:00
Aurélien Bombo	e4fbddb91a	ci: rename cloud-hypervisor to clh-runtime-rs This aligns with qemu-runtime-rs and makes more sense. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-04-28 10:58:01 -05:00
Fabiano Fidêncio	a5e1521727	kernel: bake in Mellanox MLX5 Ethernet support The MLX5 Ethernet driver is useful well beyond the DPU/SmartNIC use case (any guest sitting on top of a Mellanox/ConnectX NIC benefits from it), yet the existing config fragment lived under dpu/ and was only pulled in when the kernel was built with `-D nvidia`. Promote it to a first-class common fragment so every Kata kernel gets MLX5 Ethernet built in. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-28 11:02:39 +02:00
Steve Horsman	d5785b4eba	Merge pull request #12872 from stevenhorsman/bump-rust-to-1.93 Bump rust to 1.93	2026-04-27 09:01:00 +01:00
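
The stability window described in commit 5d3e1e6396 can be sketched roughly as follows. This is a minimal illustration, not the kata-deploy code: the function name `label_with_stability_window` and the `apply` / `read` closures are hypothetical stand-ins for the real label write and read-back against the API server. Only the three tunables (STABILITY_CHECKS, CHECK_INTERVAL, MAX_APPLY_ATTEMPTS) come from the commit message.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical sketch of a "stability window" check (not the kata-deploy source).
/// `apply` (re)applies the label; `read` returns the value currently observed.
/// Returns the number of apply attempts that were needed.
fn label_with_stability_window<A, R>(
    mut apply: A,
    mut read: R,
    expected: &str,
    stability_checks: u32,   // commit message: STABILITY_CHECKS=6
    check_interval: Duration, // commit message: CHECK_INTERVAL=2 s
    max_apply_attempts: u32, // commit message: MAX_APPLY_ATTEMPTS=12
) -> Result<u32, String>
where
    A: FnMut() -> Result<(), String>,
    R: FnMut() -> Result<Option<String>, String>,
{
    for attempt in 1..=max_apply_attempts {
        apply()?;

        // Count consecutive observations that match the expected value.
        let mut stable = 0;
        while stable < stability_checks {
            sleep(check_interval);
            match read()? {
                Some(v) if v == expected => stable += 1,
                // Missing or wrong value: the label drifted inside the
                // window, so re-apply and restart the counter.
                _ => break,
            }
        }

        if stable == stability_checks {
            return Ok(attempt);
        }
    }
    Err(format!(
        "label did not stabilise after {max_apply_attempts} apply attempts"
    ))
}
```

With the values from the commit message, a caller would pass `6`, `Duration::from_secs(2)` and `12`, so the happy path costs about 12 s. Unlike a single read-back, one stale `true` observation is not enough to declare success.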

1722 Commits