Commit Graph

19427 Commits

Author SHA1 Message Date
Steve Horsman
20bcff185f Merge pull request #13254 from kata-containers/dependabot/go_modules/src/runtime/go.mongodb.org/mongo-driver-1.17.7
build(deps): bump go.mongodb.org/mongo-driver from 1.14.0 to 1.17.7 in /src/runtime
2026-06-22 11:17:29 +01:00
Fabiano Fidêncio
f9682356ce Merge pull request #13216 from Apokleos/hotunplug-blk
runtime-rs: Add support for hot-unplugging block devices
2026-06-22 12:14:30 +02:00
Fabiano Fidêncio
337b600268 Merge pull request #13256 from fidencio/release/3.32.0
release: Bump version to 3.32.0
3.32.0
2026-06-22 10:33:25 +02:00
Alex Lyn
9550a323ac Merge pull request #13245 from kata-containers/unify-nix-version
Unify nix version
2026-06-22 15:25:10 +08:00
Alex Lyn
7aaa4e63d1 Merge pull request #13241 from PiotrProkop/exit-code
agent: report 128+signal as exit code for signal-terminated processes
2026-06-22 09:13:24 +08:00
Fabiano Fidêncio
dc70b93573 release: Bump version to 3.32.0
Bump VERSION and helm-charts versions.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-22 01:15:24 +02:00
PiotrProkop
c2d737c9d7 agent: report 128+signal as exit code for signal-terminated processes
When a container process is terminated by a signal, the agent's SIGCHLD
reaper stored the raw signal number as the process exit code. As a result
a process killed by SIGKILL(9) reported exit code 9 instead of the
conventional 137 (128+9).

Apply the standard shell convention of 128+signal_number so that
signal-terminated processes report the expected exit codes, e.g.
SIGKILL(9) -> 137, SIGTERM(15) -> 143, SIGINT(2) -> 130. This mimics
runc, which encodes wait-status exit codes the same way:
https://github.com/opencontainers/runc/blob/v1.4.3/libcontainer/utils/utils.go#L19

Both runc and this new Kata behaviour follow the conventional exit code
semantics documented at https://tldp.org/LDP/abs/html/exitcodes.html.

The conversion is factored into a small helper and covered by a unit
test. The runtime and shim already pass the exit code through unchanged,
so no further changes are needed for the corrected value to surface.
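
The conversion described above can be sketched as follows. This is a minimal illustration, not the agent's actual helper: `ProcessEnd` and `reported_exit_code` are hypothetical names introduced here.

```rust
/// How a child process ended, as a reaper might observe it (hypothetical type).
enum ProcessEnd {
    /// The process called exit() with this status.
    Exited(i32),
    /// The process was terminated by this signal number.
    Signaled(i32),
}

/// Map a process end state to the exit code reported upward.
/// Signal terminations follow the shell convention of 128 + signal number.
fn reported_exit_code(end: &ProcessEnd) -> i32 {
    match end {
        ProcessEnd::Exited(code) => *code,
        ProcessEnd::Signaled(signal) => 128 + *signal,
    }
}
```

With this mapping, `ProcessEnd::Signaled(9)` yields 137 and a plain `ProcessEnd::Exited(3)` stays 3.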

Fixes: signal-terminated containers reporting raw signal numbers

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 16:34:17 +02:00
Fabiano Fidêncio
374a867774 Merge pull request #13196 from microsoft/cameronbaird/upstream/runtime-go-clh-templating
runtime: Enable VM Templating Support for CLH
2026-06-21 16:31:19 +02:00
Alex Lyn
0a63aebea9 runtime-rs: Implement remove_device for block device hot removal
Replace the "Not yet implemented" stub in QemuInner::remove_device()
with a working implementation that calls hotunplug_device() to perform
the QMP-level device removal, then cleans up the internal devices list
via retain() to remove stale coldplug entries.
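
The list cleanup step might look roughly like this. `DeviceEntry` and `drop_removed_device` are hypothetical stand-ins; the real devices list holds richer device types.

```rust
/// Hypothetical entry in the hypervisor's internal devices list.
struct DeviceEntry {
    id: String,
}

/// Keep every entry except the one that was just hot-unplugged.
fn drop_removed_device(devices: &mut Vec<DeviceEntry>, removed_id: &str) {
    devices.retain(|entry| entry.id != removed_id);
}
```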

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-20 22:08:57 +08:00
Alex Lyn
d4212bcb74 runtime-rs: Add hotunplug_device dispatcher for device type routing
Introduce hotunplug_device() as the device-type dispatcher that routes
hot removal requests to the appropriate QMP method. Currently supports
Block and BlockModern device types, which are forwarded to
Qmp::hotunplug_block_device(). All other device types return an
explicit "unsupported" error.
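
A rough sketch of the routing idea, assuming a simplified `DeviceKind` enum (hypothetical; the real device type carries configuration data) and returning only the name of the method a request would be forwarded to:

```rust
/// Simplified device kinds for illustration; `Network` stands in for
/// every type that has no hot-unplug support yet.
#[derive(Debug)]
enum DeviceKind {
    Block,
    BlockModern,
    Network,
}

/// Route a hot removal request by device type.
fn hotunplug_route(kind: &DeviceKind) -> Result<&'static str, String> {
    match kind {
        DeviceKind::Block | DeviceKind::BlockModern => Ok("hotunplug_block_device"),
        other => Err(format!("hot-unplug unsupported for device type {other:?}")),
    }
}
```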

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-20 22:08:57 +08:00
Alex Lyn
281b6aa61a runtime-rs: Add hotunplug_block_device for block device hot removal
Implement QMP-level block device hot-unplug by issuing device_del to
remove the frontend device and blockdev_del to remove the backend
blockdev node. For virtio-blk-ccw on s390x, the CCW subchannel slot
is also released.

Since QMP device_del is asynchronous and only initiates the removal
request, introduce wait_for_device_deleted() to poll for the
DEVICE_DELETED event before tearing down the backend. This prevents
blockdev_del from failing with "Node is still in use".

If blockdev_del fails, the error is logged but CCW cleanup still
proceeds before the error is propagated, ensuring consistent
subchannel state.
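
The ordering described above can be sketched against a mock monitor. Everything here is an assumption made for illustration: the `Monitor` trait, `hotunplug_block`, and `RecordingMonitor` are hypothetical and do not mirror the runtime-rs QMP wrapper.

```rust
/// Hypothetical stand-in for a QMP connection; not the runtime-rs API.
trait Monitor {
    fn device_del(&mut self, device_id: &str) -> Result<(), String>;
    fn wait_for_device_deleted(&mut self, device_id: &str) -> Result<(), String>;
    fn blockdev_del(&mut self, node_name: &str) -> Result<(), String>;
    fn release_ccw_slot(&mut self, device_id: &str);
}

fn hotunplug_block<M: Monitor>(
    monitor: &mut M,
    device_id: &str,
    node_name: &str,
    is_ccw: bool,
) -> Result<(), String> {
    // device_del only requests removal, so wait until the device is gone.
    monitor.device_del(device_id)?;
    monitor.wait_for_device_deleted(device_id)?;

    // Remove the backend, but keep the outcome instead of returning early.
    let backend_result = monitor.blockdev_del(node_name);
    if let Err(err) = &backend_result {
        eprintln!("blockdev_del failed for {node_name}: {err}");
    }
    // CCW cleanup runs whether or not blockdev_del succeeded.
    if is_ccw {
        monitor.release_ccw_slot(device_id);
    }
    backend_result
}

/// Test double that records call order and can make blockdev_del fail.
#[derive(Default)]
struct RecordingMonitor {
    calls: Vec<String>,
    fail_blockdev_del: bool,
}

impl Monitor for RecordingMonitor {
    fn device_del(&mut self, _device_id: &str) -> Result<(), String> {
        self.calls.push("device_del".to_string());
        Ok(())
    }
    fn wait_for_device_deleted(&mut self, _device_id: &str) -> Result<(), String> {
        self.calls.push("wait_for_device_deleted".to_string());
        Ok(())
    }
    fn blockdev_del(&mut self, _node_name: &str) -> Result<(), String> {
        self.calls.push("blockdev_del".to_string());
        if self.fail_blockdev_del {
            Err("Node is still in use".to_string())
        } else {
            Ok(())
        }
    }
    fn release_ccw_slot(&mut self, _device_id: &str) {
        self.calls.push("release_ccw_slot".to_string());
    }
}
```

Returning `backend_result` only after the CCW step is what keeps the subchannel state consistent when the backend removal fails.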

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-20 22:08:57 +08:00
Alex Lyn
431720025c runtime-rs: Enhance hotplug_block_device error handling and rollback
Improve the reliability of block device hotplug by ensuring that
blockdev-add nodes are properly cleaned up when subsequent device_add
operations fail.

To address this, a new method, device_add_with_rollback, is introduced.
It performs device_add and cleans up the blockdev node when that fails.
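
The rollback pattern can be sketched generically with closures. `add_with_rollback` is a hypothetical helper written for this illustration; the real method works on QMP commands rather than closures.

```rust
/// Run `add`; if it fails, run `rollback` and hand the original error back.
/// In the hotplug case, `add` would stand for device_add and `rollback`
/// for removing the node created earlier by blockdev-add.
fn add_with_rollback<T, E>(
    add: impl FnOnce() -> Result<T, E>,
    rollback: impl FnOnce(),
) -> Result<T, E> {
    add().map_err(|err| {
        rollback();
        err
    })
}
```

The caller still sees the device_add error; the rollback only makes sure no orphaned backend node is left behind.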

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-20 22:08:57 +08:00
dependabot[bot]
399c863cd2 build(deps): bump go.mongodb.org/mongo-driver in /src/runtime
Bumps [go.mongodb.org/mongo-driver](https://github.com/mongodb/mongo-go-driver) from 1.14.0 to 1.17.7.
- [Release notes](https://github.com/mongodb/mongo-go-driver/releases)
- [Commits](https://github.com/mongodb/mongo-go-driver/compare/v1.14.0...v1.17.7)

---
updated-dependencies:
- dependency-name: go.mongodb.org/mongo-driver
  dependency-version: 1.17.7
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-20 10:22:56 +00:00
Cameron Baird
730307f32c factory: Default to normal sandbox boot path when factory init not done
Previously, when a k8s pod started with enable_template=true, the runtime:

1. Tried NewFactory with fetchOnly=true.
2. When that failed (because template.Fetch could not find the artifacts),
	retried with fetchOnly=false. This created a direct factory,
	which built the template from scratch
	(paying a full pod sandbox boot time)
	and then restored from it, so boot times
	were strictly worse on this path.

Now, even when enable_template=true, we don't try to force a direct factory.
Instead we just revert to the standard sandbox boot path.
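
The new decision can be summarized in a few lines. The runtime itself is written in Go; this Rust sketch only models the logic, and `BootPath` and `choose_boot_path` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum BootPath {
    /// Restore the sandbox from an existing VM template.
    RestoreFromTemplate,
    /// Boot the sandbox the normal way.
    StandardBoot,
}

/// `template_fetched` models whether the fetch-only factory found its artifacts.
fn choose_boot_path(enable_template: bool, template_fetched: bool) -> BootPath {
    if enable_template && template_fetched {
        BootPath::RestoreFromTemplate
    } else {
        // No direct factory is created as a fallback any more.
        BootPath::StandardBoot
    }
}
```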

Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
2026-06-19 18:00:02 +00:00
Cameron Baird
65a5f272f8 ci: Introduce tests for VM template factory
Add k8s-vm-templating-test.bats which exercises pod create
with the factory initialized on the target node.

Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
2026-06-19 18:00:02 +00:00
Cameron Baird
c0f9744225 runtime: Implement support for VM Template factory in clh
Add support for VM Template factory on the clh path.

In order to support snapshot/restore-based VM templating,
the following changes were needed:
1. For clh.go, implement SaveVM, PauseVM, restoreVM, ResumeVM
2. Remove initrd config check for VM Templating path. The
        root disk image (when using image mode) is created in memory
        and therefore captured in the VM snapshot.
3. Truncate the memory file to the size of the VM at factory VM
        create time. This allows CLH to use the memory file
        as the backing for the template VM memory, allowing O(1)
        snapshot times.
4. CLH uses memory zones as backing for its memory on the template paths
5. Update StartVM in CLH to use the restore path when template is
        configured and available
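
Step 3 amounts to sizing a file to match guest memory. The runtime does this in Go; the sketch below shows the same idea in Rust with a hypothetical `size_memory_file` helper and an arbitrary path.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Create (or open) the memory file and set its length to the VM memory size.
/// Returns the resulting file length in bytes.
fn size_memory_file(path: &Path, mem_mib: u64) -> io::Result<u64> {
    let file = OpenOptions::new().create(true).write(true).open(path)?;
    let bytes = mem_mib * 1024 * 1024;
    // set_len extends the file without writing data, so on most
    // filesystems the result is sparse.
    file.set_len(bytes)?;
    Ok(file.metadata()?.len())
}
```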

Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
2026-06-19 18:00:02 +00:00
stevenhorsman
d09d1959c2 libs: Update mem-agent to use nix workspace version
Now that the workspace version has been updated,
switch the mem-agent to pick up the new workspace version

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-19 03:49:16 -07:00
stevenhorsman
531877f28f deps: Upgrade nix crate from 0.26.4 to 0.31.3
Upgrade the nix crate across the workspace to version 0.30.1 to address
security vulnerabilities and adopt safer file descriptor handling patterns.

### Breaking Changes in nix 0.28.0

1. **File Descriptor Type Changes**
   - Functions now return `OwnedFd` instead of `RawFd` (i32)
   - Functions requiring file descriptors now expect types implementing `AsFd` trait
   - This provides RAII-based automatic cleanup and prevents fd leaks

2. **API Signature Changes**
   - `pipe()`, `pipe2()`, `openpty()` now return `OwnedFd` tuples
   - `socket()` returns `OwnedFd` instead of `RawFd`
   - `open()`, `memfd_create()` return `OwnedFd`
   - `setns()`, `write()`, `fcntl()` require `AsFd` trait
   - `madvise()` requires `NonNull<c_void>` instead of raw pointer
   - `bind()`, `listen()`, `connect()` require `AsFd` and `Backlog` type

3. **Module Feature Flags**
   - Modules now require explicit feature flags (mman, reboot, etc.)

### Additional Breaking Changes in nix 0.30.1

1. **symlinkat() API Change**
   - `dirfd` parameter now requires `AsFd` trait instead of `Option<RawFd>`
   - Use `BorrowedFd::borrow_raw(libc::AT_FDCWD)` for current directory

2. **Type Alias Deprecation**
   - `MemFdCreateFlag` renamed to `MFdFlags` for consistency

### Changes Made

**Workspace Configuration (Cargo.toml)**
- Updated nix to 0.30.1 with features: fs, mount, sched, process, ioctl,
  signal, socket, feature, user, hostname, term, event, mman, reboot

**File Descriptor Handling Patterns**
- Use `BorrowedFd::borrow_raw(raw_fd)` to wrap RawFd for AsFd requirements
- Use `.as_fd().as_raw_fd()` to extract raw fd without ownership transfer
- Use `.into_raw_fd()` only when ownership transfer is needed
- Use `NonNull::new().unwrap()` for madvise pointer conversion

**Deprecated API Replacements**
- `eventfd()` → `EventFd::from_value_and_flags()`
- `Errno::from_i32()` → `Errno::from_raw()`
- `listen(fd, backlog)` → `listen(&fd, Backlog::new(backlog).unwrap())`
- `MemFdCreateFlag` → `MFdFlags`
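
The file descriptor patterns listed above rely on types from the standard library (`std::os::fd`), so they can be illustrated without nix itself. This is a sketch for Unix targets; `raw_number` and `fd_roundtrip` are hypothetical names, and `raw_number` merely imitates the shape of an `AsFd`-taking signature.

```rust
use std::fs::File;
use std::io;
use std::os::fd::{AsFd, AsRawFd, BorrowedFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};

/// Accepts anything implementing AsFd, the way the newer signatures do.
fn raw_number<Fd: AsFd>(fd: Fd) -> RawFd {
    fd.as_fd().as_raw_fd()
}

fn fd_roundtrip() -> io::Result<(RawFd, RawFd, RawFd)> {
    let owned: OwnedFd = File::open("/dev/null")?.into();

    // Extract the raw number without giving up ownership.
    let peeked = owned.as_fd().as_raw_fd();

    // Wrap a RawFd for an AsFd-taking API. This is unsafe because the
    // caller must guarantee the fd stays open for the borrow's lifetime.
    let borrowed = unsafe { BorrowedFd::borrow_raw(peeked) };
    let via_borrow = raw_number(borrowed);

    // Transfer ownership out: nothing closes the fd automatically now.
    let released: RawFd = owned.into_raw_fd();

    // Re-wrap it so the fd is closed again when `reowned` is dropped.
    let reowned = unsafe { OwnedFd::from_raw_fd(released) };
    Ok((peeked, via_borrow, reowned.as_raw_fd()))
}
```

All three numbers refer to the same descriptor; only the ownership story differs between the patterns.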

Generated by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-19 03:49:16 -07:00
stevenhorsman
ac508b093d runtime-rs: Use workspace nix version
See if we can sync to use the workspace version for easier
dependency management

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-19 03:49:16 -07:00
stevenhorsman
2b8b09469d dragonball: Use workspace nix version
See if we can sync to use the workspace version for easier
dependency management

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-19 03:49:16 -07:00
stevenhorsman
b37b81bb75 lib: Use workspace nix version
We have a note in the workspace Cargo.toml that the version
there needs to be in sync with the libs versions, so just update
them to use the workspace version rather than manually managing this.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-19 03:49:16 -07:00
Steve Horsman
ec55a74969 Merge pull request #11485 from kata-containers/zvonkok-patch-1
Create SECURITY.md
2026-06-19 09:19:26 +01:00
Hyounggyu Choi
103b0b2cbc Merge pull request #13078 from SantoshMadhukar-K/improved-test-coverage
test: Improve test coverage for device handlers
2026-06-19 03:52:14 +02:00
SantoshMadhukar-K
736e07d18e test: Improve test coverage for device handlers
Add comprehensive test coverage for the device handler modules under
src/agent/src/device, including matcher behavior, edge cases, and
shared helper coverage across block, network, nvdimm, scsi, and vfio
device paths.

Assisted-by: IBM Bob

Signed-off-by: SantoshMadhukar-K <SantoshMadhukar.Khandyana@ibm.com>
2026-06-18 07:18:36 -07:00
stevenhorsman
4bbbcb813e doc: Create SECURITY.md
Explicit SECURITY.md that reflects Kata’s rolling-release model
(monthly cadence, no long-term branches) and sets clear expectations
for reporters and downstream users.
With SECURITY.md in place, we also need SECURITY_CONTACTS.

- Add alternative reporting method (email) for non-GitHub users
- Add section for downstream distributions and vendors with early notification details
- Clarify that timelines are independent objectives, not sequential steps
- Reorder disclosure process to emphasize patch releases are exceptions
- Update git tag command in version table (remove unnecessary pipe)
- Expand FAQ with downstream distribution and non-GitHub reporter questions
- Update timestamp to reflect current changes (2026-04-01)
- Update SECURITY_CONTACTS with email contact and downstream notification info
- Clarify CVE assignment process through GitHub

Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-18 14:23:52 +01:00
Steve Horsman
49ce886f20 Merge pull request #13242 from charludo/fix/runtime-rs-safe-path
runtime-rs: change `safe-path` dependency from crates.io to workspace
2026-06-18 11:39:19 +01:00
Charlotte Hartmann Paludo
b4be5fdcca runtime-rs: change safe-path dependency from crates.io to workspace
`safe-path` is resolved from the local workspace in all other workspace
member crates. This commit changes the dependency to a local one for
runtime-rs as well.

Signed-off-by: Charlotte Hartmann Paludo <git@charlotteharludo.com>
Co-authored-by: Markus Rudy <mr@edgeless.systems>
2026-06-18 06:32:06 +02:00
Steve Horsman
66e938e02d Merge pull request #13244 from BbolroC/use-ibm-actionspz-runners-for-publishing-jobs
GHA: Use IBM ActionsPZ runners for publish jobs on s390x
2026-06-17 15:45:20 +01:00
Hyounggyu Choi
308eb34af6 GHA: Use IBM ActionsPZ runners for publish jobs on s390x
Let's use the ActionsPZ runners for the following jobs:
- publish-kata-deploy-image-s390x
- publish-kata-monitor-image-s390x

to improve CI experiences.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-06-17 15:34:39 +02:00
Alex Lyn
47ac08b419 Merge pull request #13239 from Apokleos/remove-9p
runtime-rs: Remove unused msize_9p totally from configurations
2026-06-17 20:17:52 +08:00
Greg Kurz
f0f8233759 Merge pull request #13237 from gkurz/osbuilder-version
osbuilder: Simplify version fetching
2026-06-17 13:56:13 +02:00
Greg Kurz
c3d98fe323 osbuilder: Simplify version fetching
`tools/osbuilder/VERSION` points to the root `VERSION` file,
just like the code does. Use that file.

Signed-off-by: Greg Kurz <groug@kaod.org>
2026-06-17 10:08:23 +02:00
Alex Lyn
854eef0312 runtime-rs: Remove unused msize_9p totally from configurations
As virtio-9p is already deprecated, its msize_9p option should be
deprecated too. This commit removes the unused msize_9p.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-17 14:53:48 +08:00
Fabiano Fidêncio
0ddb2ee1f1 Merge pull request #13160 from LandonTClipp/kata_visible_devices
feat(agent): translate KATA_VISIBLE_DEVICES into CDI GPU requests
2026-06-16 19:10:35 +02:00
Fabiano Fidêncio
3ca5742338 Merge pull request #13129 from pmores/fix-default_memory_annontation
runtime-rs: fix default_memory annotation processing
2026-06-16 18:11:19 +02:00
Fabiano Fidêncio
3e98f925cf Merge pull request #13142 from davidweisse/dav/genpolicy-pod-resources
genpolicy: support pod-level resources
2026-06-16 15:31:50 +02:00
davidweisse
ac56ea21d8 genpolicy: support pod-level resources
Add support for resource requests and limits in the PodSpec.

Fixes #12816

Signed-off-by: davidweisse <98460960+davidweisse@users.noreply.github.com>
2026-06-16 15:30:22 +02:00
Fabiano Fidêncio
774e698aeb Merge pull request #12293 from Apokleos/graceful-errors
runtime-rs: make OOM watcher and signal handling lifecycle-aware
2026-06-16 15:02:54 +02:00
Fabiano Fidêncio
c76c82ce1c Merge pull request #13229 from hgowda-amd/skip-qos-tests-snp-tdx-runtime-rs
tests: skip Guaranteed QoS test for SNP/TDX runtime-rs
2026-06-16 14:02:51 +02:00
Fabiano Fidêncio
492d604daf Merge pull request #13214 from fidencio/topic/block-volume-readonly-propagation
runtime(-rs): Propagate host block device read-only flag to the VMM
2026-06-16 13:39:23 +02:00
Pavel Mores
9b31e06c20 runtime-rs: bump the byte-unit dependency version
The unit tests added by the previous commit exposed a malfunction of the
byte-unit crate on big-endian systems(*), causing s390x CI to fail.
Bump the dependency's version to include a fix.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2026-06-16 13:15:23 +02:00
Pavel Mores
5ba5046e97 runtime-rs: fix default_memory annotation processing
The annotation value is implicitly in MiB but when presented to the
byte-unit crate this is interpreted as bytes.  When a common value like
2048, meant to mean 2048 MiB but interpreted as 2048 B, is then converted
to MiB the result is zero which is less than the minimal allowable memory
and the runtime fails to launch.

This is fixed by detecting whether the annotation value contains
units.  If it doesn't, it is first converted to MiB, and the rest of
the processing then proceeds as before.

This way we allow for the implicit MiB units when no units are given, thus
keeping compatibility with existing go shim behaviour, while also allowing
for any legal units to be given as well.
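
The unit detection can be illustrated with a small hand-rolled parser. This is only a sketch: the actual fix relies on the byte-unit crate, while `memory_annotation_to_mib` below is a hypothetical function that understands just a handful of units.

```rust
/// Convert a default_memory annotation value to MiB.
/// A bare number is taken as MiB; a number with a unit is converted.
fn memory_annotation_to_mib(value: &str) -> Option<u64> {
    let value = value.trim();
    // Split at the first character that is not an ASCII digit.
    let split = value
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(value.len());
    let (number, unit) = value.split_at(split);
    let number: u64 = number.parse().ok()?;
    let bytes_per_unit: u64 = match unit.trim() {
        "" | "MiB" => 1 << 20, // no unit given: implicit MiB
        "B" => 1,
        "KiB" => 1 << 10,
        "GiB" => 1 << 30,
        _ => return None, // units outside this sketch
    };
    Some(number.checked_mul(bytes_per_unit)? >> 20)
}
```

Here `"2048"` and `"2GiB"` both come out as 2048 MiB, whereas `"2048B"` rounds down to 0, which is the value the old code produced for a bare `2048`.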

We take the opportunity to add some unit tests as well.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2026-06-16 13:15:23 +02:00
LandonTClipp
4a9da5d37a chore(docs): Add info on building and running custom artifacts
I created this over the course of testing my VISIBLE_CDI_DEVICES
changes. I think this will be useful to folks who don't understand the
right way to deploy custom artifacts.

Signed-off-by: LandonTClipp <lclipp@coreweave.com>
2026-06-16 11:44:09 +02:00
LandonTClipp
a1dd28cb52 feat(runtime): plumb VISIBLE_CDI_DEVICES through the Go runtime
Add a `visible_cdi_devices` TOML option to the Go runtime so the
agent.visible_cdi_devices=true kernel parameter is emitted to the guest
when enabled. Wire the option through the NVIDIA GPU configuration
templates and add tests verifying the kernel-params flow.

Signed-off-by: LandonTClipp <lclipp@coreweave.com>
2026-06-16 11:44:09 +02:00
LandonTClipp
b49eb577b2 feat(runtime-rs): expose visible_cdi_devices in config
Declare the `visible_cdi_devices` agent option (kernel param
agent.visible_cdi_devices) in kata-types so runtime-rs can opt into
emitting it to the guest, and expose it in the three NVIDIA GPU
configuration templates (qemu, qemu-snp, qemu-tdx) at runtime-rs/config/.

The agent consumes the corresponding VISIBLE_CDI_DEVICES env var to
drive CDI device requests.

Signed-off-by: LandonTClipp <lclipp@coreweave.com>
2026-06-16 11:44:09 +02:00
LandonTClipp
676fc90d0b feat(agent): translate VISIBLE_CDI_DEVICES into CDI device requests
Add an opt-in `visible_cdi_devices` agent option that lets a container
select which of the VM's CDI-known devices it sees via a
VISIBLE_CDI_DEVICES env var. The schema is `<cdi-kind>=<devices>`
(e.g. "nvidia.com/gpu=all", or "kata.com/gpu=0,1"), with multiple kinds
delimited by ':'.
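
A possible parser for this schema is sketched below. `cdi_requests_from_env` is a hypothetical name, and expanding each listed device into a `<cdi-kind>=<device>` string is an assumption about how the requests might be formed, not a description of the agent code.

```rust
/// Split a VISIBLE_CDI_DEVICES value into one request per device.
/// Kinds are separated by ':', devices within a kind by ','.
fn cdi_requests_from_env(value: &str) -> Result<Vec<String>, String> {
    let mut requests = Vec::new();
    for group in value.split(':') {
        let (kind, devices) = group
            .split_once('=')
            .ok_or_else(|| format!("expected <cdi-kind>=<devices>, got {group:?}"))?;
        if kind.is_empty() {
            return Err(format!("empty CDI kind in {group:?}"));
        }
        for device in devices.split(',') {
            if device.is_empty() {
                return Err(format!("empty device name in {group:?}"));
            }
            requests.push(format!("{kind}={device}"));
        }
    }
    Ok(requests)
}
```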

When enabled, the agent maps the value to CDI device requests and feeds
them through the existing CDI injection path, so device nodes, mounts,
env and createContainer hooks from the guest CDI spec (e.g.
/var/run/cdi/nvidia.yaml, generated by NVRC/nvidia-ctk) are applied.
The variable is intentionally distinct from NVIDIA_VISIBLE_DEVICES and
does not promise identical semantics.

If a requested kind is present in the guest CDI registry but the
specific device index is not, the agent fails fast rather than waiting
for the CDI-spec watch/timeout path. An entirely absent kind falls
through to the existing wait/timeout behavior.

Defaults to false; containers that don't set the env var are unaffected.

Signed-off-by: LandonTClipp <lclipp@coreweave.com>
2026-06-16 11:44:09 +02:00
Alex Lyn
8fc1a16225 runtime-rs: Make signal_process idempotent for exited init processes
Address the issue where signal_process returns an INTERNAL error when
the container's init process has already exited, and ensure teardown
is never aborted by signal failures.

Introduce is_no_such_process_error() to detect "no such process"
conditions (ESRCH/ENOENT codes or equivalent messages). When the init
process is already gone, treat it as success with an info log instead
of an error.

In stop_process(), never propagate signal failures. During sandbox
shutdown the agent connection is often already closed, causing
AgentConnectionClosed errors that bypass is_no_such_process_error().
If stop_process() aborts on such errors, cleanup_container() is skipped
and leftover mounts cause "Resource busy" failures in sandbox cleanup.
Restore "always proceed to cleanup" semantics: log the failure as a
warning, but never skip resource cleanup.

Resource cleanup must be best-effort and idempotent regardless of kill
outcome.
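
The "always proceed to cleanup" rule can be sketched with two closures standing in for the signal and cleanup steps. `stop_then_cleanup` is a hypothetical helper, not the runtime-rs stop_process().

```rust
/// Try to signal the process, then clean up regardless of the outcome.
fn stop_then_cleanup(
    send_signal: impl FnOnce() -> Result<(), String>,
    cleanup: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    if let Err(err) = send_signal() {
        // A failed kill is logged as a warning and never aborts teardown.
        eprintln!("warning: failed to signal process, continuing with cleanup: {err}");
    }
    cleanup()
}
```

Only the cleanup result reaches the caller, so a closed agent connection can no longer leave mounts behind.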

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-16 15:12:28 +08:00
Alex Lyn
44dd2b1f34 runtime-rs: Refine OOM watcher error reporting for sandbox teardown
This commit refines the error handling within the OOM watcher to
distinguish between genuine failures and errors that occur as a natural
consequence of sandbox shutdown via the helper is_normal_shutdown_error.
Previously, various connection-related errors during teardown were logged
as warnings, contributing to noisy logs.

To improve OOM error handling, the logic now differentiates between
"normal shutdown" errors (e.g., Connection reset by peer, broken pipe)
and actual OOM watcher failures.

This enhancement makes OOM event logs more informative and less prone to
clutter during normal sandbox termination.
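
A minimal sketch of such a classifier, assuming the errors surface as `std::io::Error` (the real watcher deals with agent RPC errors, and its helper may match on more conditions than the two named above):

```rust
use std::io::{Error, ErrorKind};

/// True for errors that are an expected side effect of sandbox shutdown.
fn is_normal_shutdown_error(err: &Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::ConnectionReset | ErrorKind::BrokenPipe
    )
}
```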

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-16 15:12:24 +08:00
Alex Lyn
3095bd379b runtime-rs: Introduce cancellation for OOM watcher during teardown
This commit introduces an explicit cancellation mechanism for the OOM
watcher loop within VirtSandbox. This addresses the issue where the
watcher continues to poll for OOM events even when the sandbox is being
stopped, leading to spurious "Connection reset by peer" errors.

Key changes:
(1) A CancellationToken is added to VirtSandbox to signal the watcher
loop when the sandbox is undergoing teardown.
(2) The OOM watcher loop in VirtSandbox::start() is now wrapped in a
tokio::select! statement. This allows it to concurrently listen for
two events:
- cancel_token.cancelled(): Triggered when the sandbox/VM is stopping.
- agent.get_oom_event(): The regular OOM event polling.
(3) In the sandbox stop/teardown path, cancel_token.cancel() is called
before stopping the VM. This ensures the OOM watcher loop exits cleanly
via the cancellation token, preventing the occurrence of ECONNRESET/EOF
errors on a closed channel.

This change improves the robustness of OOM event handling during sandbox
lifecycle management.
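
The shape of the fix can be shown with a std-only analogue. Note the substitution: instead of tokio::select! over a CancellationToken and agent.get_oom_event(), this sketch uses one channel carrying both OOM events and a cancel message. `WatcherMsg`, `run_watcher`, and `watcher_demo` are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

enum WatcherMsg {
    /// An OOM event for the given container id.
    OomEvent(String),
    /// Sent by the teardown path before the VM is stopped.
    Cancel,
}

/// Collect OOM events until cancelled or until the channel closes.
fn run_watcher(rx: Receiver<WatcherMsg>) -> Vec<String> {
    let mut seen = Vec::new();
    loop {
        match rx.recv() {
            Ok(WatcherMsg::OomEvent(id)) => seen.push(id),
            // Clean exit requested by teardown: no error to report.
            Ok(WatcherMsg::Cancel) => break,
            // All senders are gone; without the cancel message this is
            // the path that would surface as a connection error.
            Err(_) => break,
        }
    }
    seen
}

fn watcher_demo() -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    let watcher = thread::spawn(move || run_watcher(rx));
    tx.send(WatcherMsg::OomEvent("container-1".to_string())).unwrap();
    // Teardown cancels the watcher first, then would stop the VM.
    tx.send(WatcherMsg::Cancel).unwrap();
    watcher.join().unwrap()
}
```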

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-16 12:56:54 +08:00
Alex Lyn
0ffdc576d3 runtime-rs: Introduce a helper to check if process/container exists
Returns `true` if the error indicates that the target process/container
no longer exists.

This is used to determine if an operation, like signaling a process,
failed because the target is no longer available. The function checks
for standard OS error codes (`ESRCH`, `ENOENT`) and common error message
patterns.
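
Such a check might look like the sketch below, assuming the error is available as a `std::io::Error`. The errno constants are the Linux values; the message patterns are illustrative guesses rather than the list used by runtime-rs.

```rust
use std::io;

const ENOENT: i32 = 2; // Linux errno: no such file or directory
const ESRCH: i32 = 3; // Linux errno: no such process

/// True if the error suggests the target process/container is already gone.
fn is_no_such_process_error(err: &io::Error) -> bool {
    if matches!(err.raw_os_error(), Some(ESRCH) | Some(ENOENT)) {
        return true;
    }
    // Fall back to message matching for errors without an OS code.
    let message = err.to_string().to_ascii_lowercase();
    ["no such process", "not found"]
        .iter()
        .any(|pattern| message.contains(*pattern))
}
```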

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-16 12:56:54 +08:00