Commit Graph

18979 Commits

Author SHA1 Message Date
Alex Lyn
10bbd9c79d test: skip TestContainerMemoryUpdate for sandbox api
Temporarily skip the `TestContainerMemoryUpdate` test case
for sandbox api.

This test case is currently skipped in other VMMs (e.g.,
QEMU, Cloud-Hypervisor) due to known issues and environmental
stability concerns.
To maintain consistency across the project, we are skipping it
for sandbox as well.

A follow-up PR will be dedicated to addressing these issues and
properly enabling/refining this test case for all VMMs.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-14 14:29:31 +08:00
Alex Lyn
a0ff1bf34f versions: bump containerd version to 2.3
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
Alex Lyn
2123e1c4f4 tests: exclude TestContainerRestart from the cri-containerd test list
Creating a new container in the same sandbox VM after the previous
container has exited and been removed has never been supported by
kata-containers (neither with the go-based nor the rust-based runtime).
When the last container is removed the kata VM shuts down, so any
attempt to start a new container in the same sandbox fails.

This test exercises a use case kata does not currently support, and it
has never been part of the passing list for good reason. Mark it
explicitly excluded with a comment so it is clear this is a deliberate
omission rather than an oversight.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
Alex Lyn
4c0db58859 ci: Re-enable run-containerd-sandboxapi job
The job was disabled because TestImageLoad was failing when using the
shim sandboxer with runc due to a containerd bug (config.json not
being written to the bundle directory).

Now that check_daemon_setup uses podsandbox for the runc sanity check,
the root cause of the failure is worked around on our side and the job
can be re-enabled.

Also update the runner to ubuntu-24.04.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
Alex Lyn
061004dd23 tests: Use podsandbox sandboxer for the runc sanity check
The check_daemon_setup function verifies that containerd + runc are
functional before the real kata tests run. Using the shim sandboxer
for this runc check hits a known containerd bug where the OCI spec
is not populated before NewBundle is called, so config.json is never
written and containerd-shim-runc-v2 fails at startup.

See containerd/containerd#11640

The sandboxer choice is irrelevant for this sanity check, so use
podsandbox which works correctly with runc.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
Alex Lyn
a5ce872287 runtime-rs: Align sandbox status with CRI expectations
Update the sandbox status reporting to align with containerd/CRI
requirements. This commit addresses the `State Mapping` issue.

Previously, internal state strings were returned, which containerd
could not recognize, causing running sandboxes to be misinterpreted
as SANDBOX_NOTREADY. This maps internal states to CRI constants:
- Running -> SANDBOX_READY
- Init | Stopped -> SANDBOX_NOTREADY

These changes ensure the sandbox status is both accurately interpreted
and fully compliant with the expected interface.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
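The mapping above can be sketched as a small Rust match; the enum and the CRI state strings are simplified stand-ins for the real runtime-rs and CRI types:

```rust
// Illustrative state mapping: internal sandbox states translated into
// the CRI constants containerd expects, per the commit message above.

#[derive(Debug, PartialEq)]
enum SandboxState {
    Init,
    Running,
    Stopped,
}

fn to_cri_state(state: &SandboxState) -> &'static str {
    match state {
        SandboxState::Running => "SANDBOX_READY",
        // Anything not running (Init, Stopped) is reported as not ready.
        SandboxState::Init | SandboxState::Stopped => "SANDBOX_NOTREADY",
    }
}

fn main() {
    println!("{}", to_cri_state(&SandboxState::Running));
    println!("{}", to_cri_state(&SandboxState::Stopped));
}
```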
Alex Lyn
7eae9d0fd6 runtime-rs: Update sandbox status to include created_at field
Ensure the `created_at` timestamp is correctly propagated in
the sandbox status.

Although `created_at` is present in the `SandboxStatus` and
`SandboxStatusResponse` data structures, it was previously
omitted during the status transition.

This commit completes the implementation by passing the value
recorded during sandbox initialization.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
Alex Lyn
610f538c8e runtime-rs: Avoid shutting down sandbox on container exit
Prevent the sandbox from being prematurely shut down when a standard
workload container exits.

Previously, the shutdown logic incorrectly triggered a sandbox shutdown
whenever the container list became empty. This resulted in unintended
lifecycle termination for non-transient sandboxes.

This change refines the `need_shutdown_sandbox()` criteria in
`virt_container/src/container_manager/manager.rs` to only initiate a
shutdown under specific conditions:
- The shutdown request is explicit (`req.is_now`).
- The request targets the sandbox itself (`req.container_id ==
  self.sid`).

By removing the implicit dependency on the empty container list, we
ensure the sandbox remains active as expected after workload containers
finish execution.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
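A minimal sketch of the refined predicate, with simplified stand-in types; how the two listed conditions combine (OR'd here) is an assumption, the key point being that an empty container list alone no longer triggers shutdown:

```rust
// Hypothetical sketch of need_shutdown_sandbox() after this change.

struct ShutdownReq {
    is_now: bool,        // explicit, immediate shutdown request
    container_id: String,
}

#[allow(dead_code)]
struct ContainerManager {
    sid: String,             // sandbox id
    containers: Vec<String>, // deliberately NOT consulted below
}

impl ContainerManager {
    fn need_shutdown_sandbox(&self, req: &ShutdownReq) -> bool {
        // Shut down only on an explicit request, or when the request
        // targets the sandbox itself -- never merely because the
        // container list became empty.
        req.is_now || req.container_id == self.sid
    }
}

fn main() {
    let mgr = ContainerManager {
        sid: "sandbox-1".into(),
        containers: vec![],
    };
    let req = ShutdownReq {
        is_now: false,
        container_id: "ctr-1".into(),
    };
    // Empty container list, non-explicit request: sandbox stays up.
    println!("{}", mgr.need_shutdown_sandbox(&req)); // false
}
```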
Alex Lyn
c26aecd173 runtime-rs: Block WaitSandbox until sandbox exits
Rework sandbox waiting so the WaitSandbox path blocks on sandbox
lifetime rather than directly borrowing the hypervisor wait call.

Once stop has been observed, the cached exit result is returned to
later waiters. While the sandbox is still alive, waiters subscribe to
the internal stop notifier and sleep until shutdown or VM exit records
the final result.

Together with the preceding support commits, this keeps the overall
behaviour identical to the original WaitSandbox fix while making the
dependency chain explicit.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
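The wait-until-stop-then-return-cached-result pattern described in this series can be sketched synchronously with std primitives; the real runtime uses an async notifier channel, and `ExitInfo` and the field names here are illustrative:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Sketch: waiters block until teardown records the exit result; later
// waiters get the cached result immediately (the "exit_info" holder).

#[derive(Clone, Debug, PartialEq)]
struct ExitInfo {
    exit_code: i32,
}

#[derive(Default)]
struct SandboxInner {
    exit_info: Mutex<Option<ExitInfo>>, // cached final result
    stopped: Condvar,                   // stand-in for the stop notifier
}

impl SandboxInner {
    // Teardown records the final result once and wakes all waiters.
    fn record_exit(&self, info: ExitInfo) {
        *self.exit_info.lock().unwrap() = Some(info);
        self.stopped.notify_all();
    }

    // WaitSandbox path: return the cached result if stop was already
    // observed, otherwise sleep until teardown records it.
    fn wait_exit(&self) -> ExitInfo {
        let mut guard = self.exit_info.lock().unwrap();
        while guard.is_none() {
            guard = self.stopped.wait(guard).unwrap();
        }
        guard.clone().unwrap()
    }
}

fn main() {
    let sandbox = Arc::new(SandboxInner::default());
    let waiter = {
        let s = Arc::clone(&sandbox);
        thread::spawn(move || s.wait_exit())
    };
    thread::sleep(Duration::from_millis(50));
    sandbox.record_exit(ExitInfo { exit_code: 0 });
    // A waiter that subscribed before stop gets the recorded result...
    assert_eq!(waiter.join().unwrap(), ExitInfo { exit_code: 0 });
    // ...and a late waiter returns the cached result immediately.
    assert_eq!(sandbox.wait_exit(), ExitInfo { exit_code: 0 });
    println!("ok");
}
```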
Alex Lyn
f10836be85 runtime-rs: Add sandbox exit notifier in VirtSandbox
Add an internal exit_notify_tx channel to VirtSandbox and initialise
it in both the regular and restore constructors.

The later WaitSandbox rework needs a way to block until sandbox stop
has been observed without polling runtime state. This commit only
wires in the notifier so the follow-on behaviour change can subscribe
to a dedicated stop signal.

No WaitSandbox behaviour changes are made here yet.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
Alex Lyn
5ae7b2d8b5 runtime-rs: Introduce a cached sandbox exit information
Introduce an exit_info field in SandboxInner so sandbox teardown can
store a stable exit result in runtime state.

The follow-on WaitSandbox rework needs a place to keep the final
SandboxExitInfo after the sandbox has already stopped. Without that
cached result, later waiters would have no consistent value to return
once the original stop event has passed.

This change only adds the state holder. Behaviour changes follow in
later commits.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-13 11:58:34 +08:00
Aurélien Bombo
dcafae9645 Merge pull request #13032 from kata-containers/sprt/fix-virtiofsd-args
runtime-rs: align virtiofsd args on runtime-go
2026-05-12 19:55:54 -05:00
Dan Mihai
3799473041 Merge pull request #13010 from microsoft/danmihai1/label-references
genpolicy: support env variable values sourced from metadata.labels values
2026-05-12 15:41:11 -07:00
Aurélien Bombo
555b7738fe runtime-rs: align virtiofsd args on runtime-go
Runtime-go doesn't hardcode --sandbox none --seccomp none [1],
so mirror that in runtime-rs.

 [1]: 733ccb3254/src/runtime/virtcontainers/virtiofsd.go (L183)

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-05-12 12:51:32 -05:00
Greg Kurz
733ccb3254 Merge pull request #12996 from stevenhorsman/swap-agent-ctl-to-skopeo&umoci
agent-ctl: Swap rootfs bundle pull implementation
2026-05-12 19:12:27 +02:00
Zvonko Kaiser
7d25934fef Merge pull request #13019 from fidencio/topic/nvidia-rootfs-use-erofs-instead-of-ext4
nvidia: rootfs: (try to) use erofs for the image instead of ext4
2026-05-12 18:54:21 +02:00
Steve Horsman
2b329074f1 Merge pull request #13023 from manuelh-dev/mahuber/nim-journal-fix
tests: nvidia: avoid NIM journal dumps on success
2026-05-12 09:32:07 +01:00
Fabiano Fidêncio
ea5755572c Merge pull request #13026 from stevenhorsman/fix-fixed-datae-stale-issue
ci: correct environment variable syntax in stale issues workflow
2026-05-11 20:57:41 +02:00
stevenhorsman
37e7bf0773 ci: correct environment variable syntax in stale issues workflow
The stale issues workflow was using shell syntax ${AGE} instead of
GitHub Actions syntax ${{ env.AGE }} for the days-before-issue-stale
parameter. This prevented the workflow from correctly reading the
calculated AGE value.

Also added days-before-stale: -1 to disable default stale behavior
and ensure only issue-specific settings apply.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Assisted-By: IBM Bob
2026-05-11 09:31:36 +01:00
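The fix described above might look like this in the workflow YAML; the action and step layout are illustrative, the point being that `with:` inputs take GitHub Actions expressions, not shell parameter expansion:

```yaml
# Sketch of the corrected step (action version and layout assumed):
- uses: actions/stale@v9
  with:
    days-before-stale: -1                    # disable the global default
    days-before-issue-stale: ${{ env.AGE }}  # was: ${AGE} (never expanded)
```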
Manuel Huber
c265e4905f tests: nvidia: avoid NIM journal dumps on success
BATS_TEST_COMPLETED is per-test and remains empty in teardown_file.
Track file-level state so successful NIM runs skip the journal dump
while setup or test failures still include node diagnostics.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-10 09:10:01 -07:00
Fabiano Fidêncio
93e02944fa image-builder/nvidia: skip DAX header for virtio-blk-pci images
The DAX header (2 MiB of NVDIMM metadata + a duplicate MBR) is
unconditionally prepended to every image by set_dax_header(). NVIDIA
images use virtio-blk-pci with disable_image_nvdimm=true, so the
kernel reads MBR #1 directly and never touches the DAX metadata --
it is dead weight.

Add a SKIP_DAX_HEADER environment variable (default "no") that, when
set to "yes", skips the DAX header entirely:
- Removes the 2 MiB DAX overhead from image size calculations in
  both the erofs and ext4 paths
- Skips the set_dax_header() call, avoiding compilation and
  execution of the nsdax tool
- Passes the variable through to containerised builds

Enable SKIP_DAX_HEADER=yes for both install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() in the build pipeline. All
other image builds are unaffected (default remains "no").

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:18:05 +02:00
Fabiano Fidêncio
b72bb7243e image-builder: bump base image from Fedora 42 to 44
Fedora 42 reaches end-of-life in May 2026. Move the image-builder
container to Fedora 44, which is the current stable release.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:18:05 +02:00
Fabiano Fidêncio
6b802a4e30 nvidia: switch GPU rootfs images to erofs
Switch the NVIDIA GPU rootfs images (both standard and confidential)
from ext4 to erofs (Enhanced Read-Only File System).

Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path. This eliminates accidental
mutation and reduces the attack surface inside the guest VM, which
is particularly important for confidential workloads using dm-verity.

Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for
both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4
so non-NVIDIA configurations are unaffected.

Update all six NVIDIA GPU configuration templates (base, SNP, TDX
for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global
@DEFROOTFSTYPE@.

Export FS_TYPE=erofs in install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() so the build pipeline
produces erofs images via the image builder.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:18:05 +02:00
Fabiano Fidêncio
bfcd249f40 image-builder: add erofs dm-verity support and lz4hc compression
Add full dm-verity and measured rootfs support to
create_erofs_rootfs_image(), bringing it to parity with the ext4 path.

Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path.

This is a natural fit for dm-verity: erofs never attempts writes, so
verity never has to reject anything. With ext4, the kernel must skip
journal replay on verity-protected devices, which is a fragile
assumption.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:18:05 +02:00
Fabiano Fidêncio
d2e0555cf0 image-builder: refactor dm-verity setup into shared functions
Extract build_kernel_verity_params() and setup_verity() from the
inline block inside create_rootfs_image() into top-level functions.

This is a pure refactoring with no behavior change. The verity logic
is moved verbatim, with the only difference being that
build_kernel_verity_params() now takes the image path as an explicit
parameter instead of capturing it from the enclosing scope.

The extracted functions will be reused by create_erofs_rootfs_image()
in a subsequent commit to add dm-verity support for erofs images.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:18:05 +02:00
manuelh-dev
2ffd1538a2 Merge pull request #13021 from fidencio/topic/kata-deploy-log-level-containerd-version-4
kata-deploy: Fix containerd debug level path for config schema v4
2026-05-10 07:28:26 -07:00
Fabiano Fidêncio
341a0d366c kata-deploy: Fix containerd debug level path for config schema v4
Containerd 2.3 (config schema v4) uses the top-level [debug] table
for log level configuration, not plugins."io.containerd.server.v1.debug"
as was the case in the RC builds.

Update containerd_debug_level_toml_path() to use .debug.level for all
schema versions, matching the released containerd behavior.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 12:02:24 +02:00
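An illustrative fragment of the released config shape this commit targets; exact surrounding keys will vary per deployment:

```toml
# containerd config schema v4: the log level lives in the top-level
# [debug] table, so the TOML path queried is .debug.level.
version = 4

[debug]
  level = "debug"
```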
Fabiano Fidêncio
46b46589a6 Merge pull request #13020 from manuelh-dev/mahuber/nim-op-placement
tests: nvidia: place NIM service into namespace
2026-05-10 12:01:58 +02:00
Manuel Huber
1c081ff434 tests: nvidia: place NIM service into namespace
Place the NIM service into our test namespace. We are still observing
situations where, for reasons not yet understood, the NIM service
appears in the default namespace in our CI.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-10 07:36:23 +00:00
Fabiano Fidêncio
905303b6b0 Merge pull request #13013 from BbolroC/filter-vfio-gk-only-runtime-rs
runtime-rs: filter VFIO devices only in guest-kernel mode
2026-05-08 23:49:50 +02:00
Fabiano Fidêncio
a447a1fb03 Merge pull request #13015 from stevenhorsman/kernel-6.18.28-bump
version: Bump to latest 6.18 kernel
2026-05-08 21:12:50 +02:00
Fabiano Fidêncio
f7be57efe2 Merge pull request #13007 from manuelh-dev/mahuber/dbg-nim-svc
tests: nvidia: Wait for NIM operator pod and print
2026-05-08 20:58:51 +02:00
stevenhorsman
87664c608d version: Bump to latest 6.18 kernel
Pick up the latest kernel that fixes CVE-2026-43284

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-08 17:15:24 +01:00
Hyounggyu Choi
754707fe83 runtime-rs: filter VFIO devices only in guest-kernel mode
After #12857, the VFIO-AP hotplug test fails because runtime-rs
unconditionally removes all /dev/vfio/* devices from the OCI spec
before sending it to the kata agent. The agent then rejects
the container creation with:

```
Missing devices in OCI spec
```

Filter devices from the OCI spec conditionally based on the
vfio_mode configuration (e.g. guest-kernel). Also factor the
filtering logic out into a separate function and add unit tests.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-05-08 15:39:16 +02:00
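The conditional filtering can be sketched as follows; the enum, path check, and function shape are simplified stand-ins for the real runtime-rs code:

```rust
// Hypothetical sketch: strip /dev/vfio/* entries from the OCI spec only
// in guest-kernel mode, where the device is rebound to a native guest
// driver and the agent must not see the vfio nodes.

#[derive(PartialEq)]
enum VfioMode {
    GuestKernel,
    Vfio,
}

fn filter_vfio_devices(devices: Vec<String>, mode: &VfioMode) -> Vec<String> {
    match mode {
        VfioMode::GuestKernel => devices
            .into_iter()
            .filter(|d| !d.starts_with("/dev/vfio/"))
            .collect(),
        // In vfio mode the agent needs the entries; pass them through,
        // avoiding the "Missing devices in OCI spec" rejection.
        _ => devices,
    }
}

fn main() {
    let devs = vec!["/dev/vfio/5".to_string(), "/dev/null".to_string()];
    println!("{:?}", filter_vfio_devices(devs.clone(), &VfioMode::GuestKernel));
    println!("{:?}", filter_vfio_devices(devs, &VfioMode::Vfio));
}
```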
Fabiano Fidêncio
8e65e89ade Merge pull request #13011 from kata-containers/fix-warnings
runtime-rs: Fix warnings in rust runtime
2026-05-08 15:12:53 +02:00
Fabiano Fidêncio
a541827a7e Merge pull request #12984 from fidencio/topic/network-pair-use-name-for-lookup
runtime-rs: network: use provided name for virt interface lookup
2026-05-08 14:31:58 +02:00
Fabiano Fidêncio
09bbc70302 Merge pull request #13002 from manuelh-dev/mahuber/unrequire-nim-svc
gatekeeper: Unrequire NVIDIA GPU test (temporary)
2026-05-08 10:02:00 +02:00
Fabiano Fidêncio
2879619d07 Merge pull request #12981 from fidencio/topic/kata-deploy-reduce-memory-consumption
kata-deploy: reduce memory consumption
2026-05-08 09:51:47 +02:00
Alex Lyn
1441b2b84a runtime-rs: Fix warnings in rust runtime
Unformatted Rust code in the runtime, its libraries, and the agent
sources leaves uncommitted change files behind, which `cargo fmt --all`
surfaces immediately.

Let's reduce that noise.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-08 14:56:00 +08:00
Manuel Huber
714adec3f8 tests: nvidia: Wait for NIM operator pod and print
Wait for the NIM operator pod to run before deploying NIM services.
Add a temporary debug function to print resource placement into the
different namespaces. Remove this function again when the NIM tests
are stabilized.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-08 06:27:48 +00:00
Ubuntu
b95be5332a genpolicy: env variables from metadata.labels
Add basic genpolicy support for container environment variables sourced
from metadata.labels.

In this implementation, the relevant labels must be available as input
to the policy tool. This is slightly different from the way variables
sourced from metadata.annotations are treated by the tool: when the
relevant annotation is not available as input, the generated Policy
allows any value. Depending on metadata.labels use cases that we might
encounter maybe the labels will be handled the same way as the
annotations in the future.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-07 23:35:56 +00:00
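An illustrative pod fragment of the pattern this commit teaches genpolicy to handle, an env value sourced from a label via the downward API; the label name and value are examples:

```yaml
metadata:
  labels:
    app: my-app
spec:
  containers:
    - name: main
      env:
        - name: APP_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app']
```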
Dan Mihai
e71cf4d4ca genpolicy: call get_annotations() when/if needed
Call get_annotations() only when/if the annotations get used.

The new structure of the code fits better with the future calls to a
similar get_labels() function.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-07 22:55:54 +00:00
Dan Mihai
39b9c318e2 tests: k8s: merge two policy-pod test cases
One of these test cases was a subset of the other, so remove that
redundancy.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-07 22:39:23 +00:00
stevenhorsman
e92d954b51 agent-ctl: Swap rootfs bundle pull implementation
Switch the rootfs bundle pull implementation from image-rs to skopeo
and umoci, removing the long crate dependency tail that image-rs
brings.

Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-07 21:11:27 +01:00
Manuel Huber
edfb6f5716 gatekeeper: Unrequire NVIDIA GPU test (temporary)
Temporarily unrequire the NVIDIA GPU test. We are experiencing
situations in which two NIM service instances get deployed almost
at the same time into the kata-containers-k8s-tests namespace
(expected current context) and into the default namespace. This
causes the NIM operator to create two deployments in the two
namespaces and to then schedule two pods at the same time. This
usually causes the NIM pod in the default namespace to fail and to
linger.
We cannot yet explain why this does not happen in the TEE CI path,
or why it is happening at all.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-07 14:39:24 +02:00
Fabiano Fidêncio
8dde5f39b7 tests: dump kata-deploy pod describe+logs on install timeout
When kubectl wait times out the pod never reached Ready, so the
existing log collection (which runs after wait succeeds) produces
"-- No entries --" with zero useful information.

Capture kubectl describe and kubectl logs (including previous
container) immediately on timeout so the next CI run shows exactly
why the pod is stuck (ImagePullBackOff, OOMKilled, probe failures,
containerd restart hang, etc.).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
0f3160276b ci: k8s: skip no-op Helm uninstall on free runners
In cleanup_kata_deploy, bail out early when no kata-deploy Helm release
exists so baremetal-* pre-deploy cleanup on fresh clusters does not
block on helm uninstall --wait (up to 10m).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
f5533950e6 kata-deploy: helm: cap container RSS via resources block
Plumb a resources block into the kata-deploy DaemonSet container in
the Helm chart so the cluster can size its memory footprint
predictably.

Defaults are sized from real /proc/<pid>/status numbers on an
unpatched 3.30.0 build running on a ~220-vCPU GPU node:

  VmRSS:    9944 kB  (~9.7 MiB)   <- actual physical memory
  RssAnon:  2628 kB  (~2.6 MiB)   <- heap + dirty stack pages
  VmData: 464668 kB  (~454 MiB)   <- tokio multi-thread workers'
                                     reserved-but-untouched stacks
  Threads: 225                    <- num_cpus()-driven worker pool

That VmData number is the source of the original "kata-deploy is
using 400 MB" reports: any monitoring layer that surfaces virtual
data size, committed memory, or memory.usage_in_bytes on a kernel
that includes mapped-but-untouched memory will happily reproduce
~400 MB even though only ~10 MiB is ever made resident. The earlier
commits in this series (current_thread tokio, mimalloc, shared kube
client, JSONPath removal, post-install re-exec) collapse VmData into
the tens of MiB and drop the post-install resident set further.

The defaults below are picked accordingly:

  requests:
    cpu: 25m            # install is mostly I/O wait; the post-install
                        # waiter is genuinely idle
    memory: 16Mi        # ~2x headroom over the unpatched VmRSS we
                        # measured, far more over the patched waiter

Operators who hit OOMKilled on unusually large or churny clusters can
override `resources` directly in their Helm values (or set it to {}
to remove all requests and inherit cluster defaults).

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
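In Helm values form, the defaults quoted above would look like this (key layout illustrative; per the commit, `resources` can be overridden or set to `{}` to inherit cluster defaults):

```yaml
resources:
  requests:
    cpu: 25m
    memory: 16Mi
```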
Fabiano Fidêncio
9e99b21ec5 kata-deploy: re-exec into a tiny post-install waiter
After install completes the kata-deploy DaemonSet pod has nothing else
to do for the rest of its lifetime — it just blocks on SIGTERM and then
runs cleanup. Up to here, the install path has built up substantial
peak heap (kube clients, deserialised Node/RuntimeClass objects, hyper
+ rustls TLS pools, parsed JSON / YAML), and on musl essentially none
of that is ever returned to the kernel. Idling in the same process
therefore pins the pod's RSS at the install peak indefinitely.

Re-exec the binary into a hidden `internal-post-install-wait` action
the moment install succeeds. execve(2) discards the entire address
space, so the waiter starts up holding only the working set it actually
needs (a config struct, the SIGTERM handler, and the health server).

To avoid a probe-availability gap during the handover the install
process clears FD_CLOEXEC on the health listener and passes the raw
FD to the child via KATA_DEPLOY_HEALTH_FD. The child reattaches the
FD as a tokio TcpListener and resumes serving /healthz and /readyz
without ever closing the socket — the kubelet sees no failure.

The detected container runtime is similarly threaded through
KATA_DEPLOY_DETECTED_RUNTIME so the waiter doesn't have to re-query
the apiserver. The new action is tagged `#[clap(hide = true)]` so
`--help` doesn't expose it; users should never invoke it directly.

Add the FD-inheritance helpers in health.rs:

  - prepare_listener_for_exec(): clears FD_CLOEXEC on a listener and
    returns its raw fd number.
  - listener_from_inherited_fd(): wraps an inherited fd back into a
    tokio::net::TcpListener (and re-sets FD_CLOEXEC so future host
    shellouts don't leak the socket).

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
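The listener-handover idea can be sketched with std only: detach a live listener into a raw fd and reattach it later without the socket ever closing. The real helpers additionally clear and restore FD_CLOEXEC via fcntl(2) so the fd survives execve, and wrap the result in a tokio TcpListener; both steps are omitted here:

```rust
use std::net::{TcpListener, TcpStream};
use std::os::fd::{FromRawFd, IntoRawFd, RawFd};

// "Parent" side: give up the wrapper but keep the socket open.
fn prepare_listener_for_exec(listener: TcpListener) -> RawFd {
    // into_raw_fd() transfers ownership of the fd to the caller, so
    // dropping the TcpListener no longer closes the socket.
    listener.into_raw_fd()
}

// "Child" side: reattach an inherited fd as a listener.
fn listener_from_inherited_fd(fd: RawFd) -> TcpListener {
    // Safety: fd must be a valid, owned, listening socket.
    unsafe { TcpListener::from_raw_fd(fd) }
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let fd = prepare_listener_for_exec(listener);
    let inherited = listener_from_inherited_fd(fd);

    // The reattached listener still accepts on the original address,
    // so a probe endpoint would see no availability gap.
    let _client = TcpStream::connect(addr)?;
    let (_conn, peer) = inherited.accept()?;
    println!("accepted from {peer}");
    Ok(())
}
```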
Fabiano Fidêncio
af03ab2228 kata-deploy: replace JSONPath node lookups with typed accessors
The two pieces of node metadata kata-deploy actually reads are
.status.nodeInfo.containerRuntimeVersion and a single label, both of
which were being fetched through a homegrown JSONPath walker:

  - get_node_field() serialised the entire Node object back into a
    serde_json::Value tree on every call,
  - split_jsonpath() / get_jsonpath_value() then walked that tree by
    string key.

Both the deep clone and the helpers themselves are unnecessary — kube's
Node type is already strongly typed. Replace get_node_field() with two
purpose-built accessors that read straight off the Node struct:

  - get_container_runtime_version(): pulls
    status.node_info.container_runtime_version with a clear error if
    the field isn't populated.
  - get_node_label(key): returns Option<String> directly from
    metadata.labels.

Drop split_jsonpath, get_jsonpath_value, and their unit tests (which
existed only to cover the JSONPath walker we no longer have). Update
the three callers (config.rs, runtime/manager.rs, runtime/containerd.rs)
to use the typed accessors.

This removes the entire serde_json::Value clone-and-walk path from the
hot read path and meaningfully cuts allocator churn during install.

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
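A simplified sketch of the two typed accessors, using hand-rolled stand-ins for kube's Node type; the real accessors read the same fields off the generated k8s structs instead of walking a serde_json::Value tree:

```rust
use std::collections::BTreeMap;

// Minimal stand-ins for the strongly typed Node structure.
#[derive(Default)]
struct NodeInfo {
    container_runtime_version: String,
}

#[derive(Default)]
struct NodeStatus {
    node_info: Option<NodeInfo>,
}

#[derive(Default)]
struct Node {
    labels: BTreeMap<String, String>,
    status: Option<NodeStatus>,
}

// Straight field access with a clear error when status is unpopulated.
fn get_container_runtime_version(node: &Node) -> Result<String, String> {
    node.status
        .as_ref()
        .and_then(|s| s.node_info.as_ref())
        .map(|i| i.container_runtime_version.clone())
        .ok_or_else(|| "node status.nodeInfo not populated".to_string())
}

// Option<String> straight from metadata.labels, no JSONPath walk.
fn get_node_label(node: &Node, key: &str) -> Option<String> {
    node.labels.get(key).cloned()
}

fn main() {
    let mut node = Node::default();
    node.labels.insert("kata".into(), "true".into());
    node.status = Some(NodeStatus {
        node_info: Some(NodeInfo {
            container_runtime_version: "containerd://2.3.0".into(),
        }),
    });
    println!("{:?}", get_container_runtime_version(&node));
    println!("{:?}", get_node_label(&node, "kata"));
}
```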