Commit Graph

6101 Commits

Author SHA1 Message Date
RuoqingHe
cfc1836a31 Merge pull request #12672 from stevenhorsman/agent-security-fixes
agent: Bump tracing-subscriber
2026-03-20 17:37:16 +08:00
Steve Horsman
7ab6e11e10 Merge pull request #12678 from kata-containers/dependabot/go_modules/src/runtime/google.golang.org/grpc-1.79.3
build(deps): bump google.golang.org/grpc from 1.72.0 to 1.79.3 in /src/runtime
2026-03-20 08:49:35 +00:00
Steve Horsman
e475fb2116 Merge pull request #12680 from kata-containers/dependabot/go_modules/src/tools/csi-kata-directvolume/google.golang.org/grpc-1.79.3
build(deps): bump google.golang.org/grpc from 1.63.2 to 1.79.3 in /src/tools/csi-kata-directvolume
2026-03-20 08:49:27 +00:00
stevenhorsman
38a655487f vsock-exporter: Switch bincode for serde_json
bincode is not maintained, so switch to serde_json to
resolve RUSTSEC-2025-0141

Assisted-By: Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:45:17 +00:00
stevenhorsman
e1d7d5bef8 agent: Remove async-std
It's a dev-dependency that doesn't seem to be used, so
remove it and resolve RUSTSEC-2025-0052

Assisted-By: Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:45:17 +00:00
stevenhorsman
e4eda5e1d8 agent: Bump tracing-subscriber
- Bump tracing-subscriber to 0.3.20 to resolve RUSTSEC-2025-0055
- Switch deprecated `slog_info!` for `slog::info!`

Generated-By: Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:45:17 +00:00
stevenhorsman
d06dadd8ef docs: Spelling updates
Either fixing typos, or including program/repo name in
backticks

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:22:54 +00:00
dependabot[bot]
2f5415d8f5 build(deps): bump google.golang.org/grpc
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.63.2 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.63.2...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-19 10:03:45 +00:00
dependabot[bot]
3876a80208 build(deps): bump google.golang.org/grpc in /src/runtime
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.72.0 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.72.0...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-19 10:03:30 +00:00
stevenhorsman
2a4227e02e kata-ctl: Try fixing unused_assignement error
`allow(unused_assignments)` isn't working as it's
in macro generated code, so referencing the command
in the error, to use it

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-17 16:04:58 +00:00
stevenhorsman
ca7cdcd732 kata-ctl: Rewrite path_join test
This test was failing clippy by calling .unwrap() after
an .is_ok(), but after I looked at it, it seemed a bit messy,
so I split it up and tried rewriting it to make it more readable
IMHO.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-17 16:04:58 +00:00
stevenhorsman
501578cc5a agent: Remove non-idiomatic unwrap
Calling .unwrap() after an .is_some() check is considered non-idiomatic in
as it performs redundant work and makes the code more verbose.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-17 16:04:58 +00:00
Alex Lyn
833b72470c Merge pull request #12647 from sprt/gp-improve
genpolicy: Improve emptyDir storage options and mount point validation
2026-03-17 13:56:42 +08:00
Manuel Huber
660e3bb653 gpu: Obsolete the NVIDIA initrd build
As the NVIDIA stack has shifted to using an image for both the
confidential and non-confidential variants, we retire the initrd
build.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 21:29:58 -04:00
Manuel Huber
169f92ff09 agent: cdh: Update CDH and API
With the new CDH version, the secure_mount API changes.
Further, the new CDH version no longer uses the luks-encrypt-storage
script but utilizes libcryptsetup as well as mkfs.ext4 and dd. Hence, adapt
some of the CDH and Kata components build steps

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 09:43:17 -07:00
Zvonko Kaiser
6a853a9684 gpu: Bump NVRC
We have a new release add this one to the next
Kata release.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>

Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
8ff5d164c6 runtime: make CDI annotation vendor-agnostic with lookup table
Replace hardcoded NVIDIA vendor ID (0x10de) and class (0x030) checks
with a vendor-agnostic lookup table (cdiDeviceKind) that maps PCI
vendor/class pairs to CDI device kinds. This makes it straightforward
to add support for new device types by adding entries to the table.

Refactor siblingAnnotation to resolve device BDFs once upfront and
reuse them for both CDI type detection and sibling matching, eliminating
redundant sysfs reads. Devices not in the lookup table (e.g. NVSwitches)
are skipped with errNoSiblingFound, while known device types that fail
to match a sibling produce a hard error.

Consolidate the hot-plug and cold-plug device loops into a single loop
over extracted container paths, removing duplicated filtering logic.

Export GetPCIDeviceProperty from the device drivers package to allow
vendor/class lookup from sysfs in the container annotation path.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
d4c21f50b5 gpu: Bump default memory to 8G for GPU runtimes
We need enough inital memory to prepare more complex
platforms like HGX H100 or HGX B200 systems.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
5c9683f006 gpu: Remove devtmpfs.mount=0
With the newest NVRC release this is solved and does
not need to be overriden.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
d22c314e91 gpu: Increase dial_timeout=1200
For cold-plug when running with nerdctl the timeouts in the config
are being used, increase the dial_timeout (e.g. for CreateSandbox) to match
create_container_timeout.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
stevenhorsman
f25fa6ab25 runtime: bump go.mod version
Update the runtime's go.mod go version to 1.25.8 to
keep in sync with versions.yaml

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-12 08:53:40 +00:00
dependabot[bot]
d366d103cc build(deps): bump quinn-proto in /src/tools/agent-ctl
Bumps [quinn-proto](https://github.com/quinn-rs/quinn) from 0.11.8 to 0.11.14.
- [Release notes](https://github.com/quinn-rs/quinn/releases)
- [Commits](https://github.com/quinn-rs/quinn/compare/quinn-proto-0.11.8...quinn-proto-0.11.14)

---
updated-dependencies:
- dependency-name: quinn-proto
  dependency-version: 0.11.14
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-11 16:04:34 +00:00
Dan Mihai
04f180434e Merge pull request #12640 from burgerdev/genpolicy-workspace
genpolicy: add to Cargo workspace
2026-03-11 09:02:39 -07:00
Steve Horsman
ba0f5b98fe Merge pull request #12643 from stevenhorsman/bump-golang-to-1.25.8
versions: bump golang to 1.25.8
2026-03-11 08:53:21 +00:00
Markus Rudy
6643b258bb genpolicy: update oci-client to v0.16.1
The older version we used transitively depends on an unmaintained crate.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2026-03-11 09:30:48 +01:00
Markus Rudy
8dfeeea924 genpolicy: add to Cargo workspace
This commit adds the genpolicy utility to the root workspace. For now,
only dependencies that are already in the root workspace are consumed
from there, the genpolicy-specific ones should be added later.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2026-03-11 09:30:46 +01:00
Markus Rudy
fc4eaf8b66 runtime-rs: specify the subpackage to build
Before this change, `make test` for runtime-rs used to test all crates
in the root workspace (due to the `--all` flag). This was not intended
but happened to be mostly working. However, genpolicy needs additional
steps before it can build, so this behavior blocks adding genpolicy to
the root workspace.

The solution here is to only build the inteded packages. For the build
and run commands, this is the runtime-rs crate itself. For testing, we
need to include the sub-crates, too, which needs a bit of cargo metadata
scraping.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2026-03-11 09:28:24 +01:00
Aurélien Bombo
2a15cfc5ec genpolicy: Improve emptyDir storage options and mount point validation
These are two changes following a Copilot review on #10559:

1. Restore the p_storage.driver != "blk" check in allow_storage_options():
   - An early version of #10599 hardcoded p_storage.driver to "blk".
   - Hence that check needed to be removed to validate "blk" storage options.
   - The final version of #10599 hardcodes p_storage.driver to "" to
     account for both "blk" and "scsi", and checks storage options in
     allow_block_storage().
   - Hence that check should be restored to preserve the original behavior.

https://github.com/kata-containers/kata-containers/pull/10559#discussion_r2907646552

2. Don't use a regex to validate emptyDir storage mount points:
   - It's risky to use a regex to validate a path that has base64-encoded
     components.
   - We can infer the exact path anyway so the regex is redundant.

https://github.com/kata-containers/kata-containers/pull/10559#discussion_r2907646582

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-10 11:22:10 -05:00
Dan Mihai
f9a8eb6ecc genpolicy: allow_mount improvements for emptyDir
1. Reduce the complexity of the new allow_mount rules for emptyDir.

2. Reverse the order of the two allow_mount versions, as a hint to the
   rego engine that the first version is more often matching the input.

3. Remove `p_mount.source != ""` from mount_source_allows, because:
 - Policy rules typically test the values from input, not values read
   from Policy.
 - mount_source_allows is no longer called for emptyDir mounts after
   these changes, so p_mount.source is not empty.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-03-09 14:52:17 -05:00
Aurélien Bombo
a98e328359 tests: Add test for trusted ephemeral data storage
This tests the feature on CoCo machines.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-09 14:52:17 -05:00
Aurélien Bombo
9fe03fb170 genpolicy: Support trusted ephemeral data storage
* Introduces a new cluster_config setting encrypted_emptydir defaulting to true.
 * Adapts genpolicy for encrypted emptyDirs.

Crucially, the rules.rego change checks that the mount and the storage are
well-formed together:

 * i_storage.source matches a known regex.
 * i_storage.mount_point == $(spath)/BASE64(i_storage.source)
 * i_storage.mount_point == p_storage.mount_point
 * i_storage.mount_point == i_mount.source

Note that policy enforcement is necessary to prevent rogue device injection.
E.g. the agent could not blindly encrypt all block devices as some use cases
only need dm-verity.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-09 14:52:17 -05:00
Aurélien Bombo
eaa711617e agent: Support trusted ephemeral data storage
Handles block-based emptyDirs plugged via virtio-blk and virtio-scsi by
encrypting and formatting them.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-09 14:52:17 -05:00
Aurélien Bombo
a4fd32a29a runtime: Support trusted ephemeral data storage
* Introduces the `emptydir_mode` config flag to allow instructing the runtime
   to create a block device for emptyDir volumes.
 * The block device is created in the original emptyDir folder on the host
   so that Kubelet can monitors its disk usage and evict the pod if it exceeds
   its sizeLimit. This matches runc and virtio-fs.
 * The block device's disk image file is sparse to minimize host disk
   footprint.

Fixes: #10560

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-09 14:52:17 -05:00
Alex Lyn
fb743a304c runtime: Support plugging a disk as an image file
Some VMMs support plugging a disk as an image file instead of a block device,
so we adapt the runtime to support that.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Co-authored-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-09 14:52:17 -05:00
stevenhorsman
8ae0e36737 versions: bump golang to 1.25.8
Bump the builder image and versions to resolve CVEs:
- GO-2026-4601
- GO-2026-4602
- GO-2026-4603

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-09 09:10:01 +00:00
Alex Lyn
62b0f63e37 dragonball: Generate unique TAP names to avoid conflicts
The vhost-kern net unit test used a fixed TAP interface name
("test_vhosttap"). When tests run in parallel or a previous run
leaves the interface behind, TAP creation can fail with
EBUSY ("Resource busy"), making CI flaky.

Introduce a unique_tap_name() helper in the tests and use it to
generate a per-test TAP name (based on pid/thread/counter),
avoiding name collisions and stabilizing CI.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 17:33:40 +08:00
Alex Lyn
1c8c0089da dragonball: fix flaky signal_handler test using libc::raise
The signal_handler test was intermittently failing because it used
kill(pid, sig), which sends signals asynchronously to the process.
This created a race condition where the child thread could exit and
be joined before the signal was delivered or processed.

This fix including:
1. Replaces `kill` with `libc::raise` to ensure signals are delivered
   synchronously to the calling thread.
2. Reorders triggers to verify standard signals before installing
   seccomp filters.
3. Guarantees that metrics are incremented before the child thread
   terminates and is joined by the main thread.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 09:28:56 +08:00
Alex Lyn
d0718f6001 dragonball: Fix unnecessary parentheses around type
warning: unnecessary parentheses around type
   --> src/dragonball/dbs_legacy_devices/src/serial.rs:245:39
    |
245 |         let out: Arc<Mutex<Option<Box<(dyn std::io::Write + Send +
'static)>>>> =
    |                                       ^
^
    |
    = note: `#[warn(unused_parens)]` (part of `#[warn(unused)]`) on by
default
help: remove these parentheses

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 09:28:56 +08:00
Alex Lyn
b4161198ee dragonball: Remove unused imports variables in dbs_pci
Fix warnings of unused imports as below:
```
warning: unused imports: `DEVICE_ACKNOWLEDGE`, `DEVICE_DRIVER_OK`,
`DEVICE_DRIVER`, `DEVICE_FEATURES_OK`, and `DEVICE_INIT`
    --> src/dragonball/dbs_pci/src/virtio_pci.rs:1177:9
     |
1177 |         DEVICE_ACKNOWLEDGE, DEVICE_DRIVER, DEVICE_DRIVER_OK,
DEVICE_FEATURES_OK, DEVICE_INIT,
     |         ^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^
     |
     = note: `#[warn(unused_imports)]` (part of `#[warn(unused)]`) on by
default
```

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 09:28:56 +08:00
Alex Lyn
ca4e14086f runtime-rs: Fix warnings of unformatted codes
Fix warnings from unformattted codes.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 09:28:56 +08:00
Alex Lyn
ce800b7c37 dragonball: Fix flaky test_vhost_user_net_virtio_device_activate hang
The vhost-user-net tests could hang in CI because
VhostUserNet::new_server() blocks indefinitely on listener.accept()
when the slave fails to connect in time
(e.g. due to scheduler delays or flaky socket paths). This also caused
panics when connect_slave() returned None and the test unwrapped it.

Fix the tests by:
- using a `/tmp`, absolute, unique unix socket path per test run
  retrying slave connect with a deadline
- running new_server() in a separate thread and waiting via
  recv_timeout() to ensure the test never blocks indefinitely

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 09:28:56 +08:00
Alex Lyn
a988b10440 dragonball: Fix flaky test_vhost_user_net_virtio_device_normal hang
It aims to fix flaky test hang by implementing thread timeouts.

The `test_vhost_user_net_virtio_device_normal` was hanging in CI
when master/slave threads drifted.

This commit stabilizes the test by:
- Using `tempfile` and unique paths to ensure socket isolation.
- Adding a 5s deadline for slave connections to handle CI jitter.
- Running `new_server` in a separate thread with a `recv_timeout`
  to prevent the CI pipeline from deadlocking.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 09:28:56 +08:00
Alex Lyn
f36218d566 dragonball: Fix flaky test_inner_stream_timeout in inner backend
The `test_inner_stream_timeout` test case was prone to failure due to a
race condition between the main thread and the background handler. The
test relied on hardcoded `thread::sleep` durations, which could cause
the second read operation to time out (150ms window) before the main
thread performed its write (after a 300ms sleep) under high system load.

This commit stabilizes the test by:
1. Replacing fixed sleep durations with a `Condvar` and a `stage`
   variable to implement a deterministic state machine.
2. Synchronizing the threads so that the main thread only writes data
   after the background handler has confirmed it is ready or has
   completed its previous phase.
3. Ensuring the read timeout is explicitly managed between different
   validation stages to prevent accidental `TimedOut` errors.

This change eliminates the flakiness and ensures the test passes
consistently across different CIenvironments.

Fixes #12618

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 09:28:56 +08:00
Alex Lyn
c8a39ad28d dragonball: Fix flaky test_epoll_manager by improving synchronization
This commit aims to address issues of "Infinite loop in epoll_manager
tests" and improve stablity.

Root causes as below:
1. Using `handle_events(-1)` caused the worker thread to block forever
   if an event was missed or if the internal `kick()` signal was not
   accounted for correctly.
2. Relying on event counts was unreliable because internal signals could
   fluctuate the total count, causing the it to enter an infinite loop.
3. Using `EventSet::OUT` on an EventFd is often continuously ready,
   leading to non-deterministic trigger behavior.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-06 09:28:56 +08:00
Fabiano Fidêncio
2fff33cfa4 Merge pull request #12628 from stevenhorsman/agent-ctl-bump-aws-lc-rs
agent-ctl: Update aws-lc-rs
2026-03-05 20:52:03 +01:00
Fabiano Fidêncio
079fac1309 Merge pull request #12591 from fidencio/topic/kernel-add-mmio-back-to-the-unified-kernels
kernel: include mmio fragment in unified build for firecracker
2026-03-05 13:45:41 +01:00
stevenhorsman
c57f2be18e agent-ctl: Update aws-lc-rs
aws-lc has mutliple high severity CVEs:
- GHSA-vw5v-4f2q-w9xf
- GHSA-65p9-r9h6-22vj
- GHSA-hfpc-8r3f-gw53

so try and bump to the latest `aws-lc-rs` crate to pull in the available fixed versions

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-05 10:02:22 +00:00
Fabiano Fidêncio
8f35c31b30 Merge pull request #12542 from fidencio/topic/genpolicy-distribute-different-settings-rather-than-patching-for-ci
genpolicy: settings.d drop-ins and scenario example drop-ins
2026-03-05 07:37:30 +01:00
Fabiano Fidêncio
83dd7dcc75 runtimes: reject virtio-blk-mmio when confidential_guest is true
Virtio-mmio transport is not hardened for confidential computing (unlike
virtio-pci). Reject config that would use virtio-blk-mmio for rootfs/block
when confidential_guest is set, so CoCo guests only use virtio-blk-pci.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-04 21:41:27 +01:00
Fabiano Fidêncio
d40afe592c genpolicy: add settings drop-in directory and RFC 6902 JSON Patch support
Allow genpolicy -j to accept a directory instead of a single file.
When given a directory, genpolicy loads genpolicy-settings.json from it
and applies all genpolicy-settings.d/*.json files (sorted by name) as
RFC 6902 JSON Patches. This gives precise control over settings with
explicit operations (add, remove, replace, move, copy, test), including
array index manipulation and assertions.

Ship composable drop-in examples in drop-in-examples/:
- 10-* files set platform base settings (non-CoCo, AKS, CBL-Mariner)
- 20-* files overlay specific adjustments (OCI version, guest pull)
Users copy the combination they need into genpolicy-settings.d/.

Replace the old adapt_common_policy_settings_* jq-patching functions
in tests_common.sh with install_genpolicy_drop_ins(), which copies the
right combination of 10-* and 20-* drop-ins for the CI scenario.
Tests still generate 99-test-overrides.json on the fly for per-test
request/exec overrides.

Packaging installs 10-* and 20-* drop-ins from drop-in-examples/ into
the tarball; the default genpolicy-settings.d/ is left empty.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-04 20:13:21 +01:00