Commit Graph

18471 Commits

Author SHA1 Message Date
LandonTClipp
fd896e4e76 ci: Add kata-dictionary.txt to required_tests.yaml
This makes it so that changes to the kata-dictionary.txt file only trigger the
static checks to run.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-04-15 14:48:01 +01:00
LandonTClipp
56cdfa831f docs: Add annotation config to doc site
Adding the pod annotation config to the doc site. A symlink is created
at docs/pod-annotations.md that points to
how-to/how-to-set-sandbox-config-kata.md so that the URL for this file will be
created at `/pod-annotations`. Also adding brief contrbuting guidelines and
how-to's for running the documentation site locally for local previews.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-04-15 14:48:01 +01:00
Fabiano Fidêncio
57898de1fe Merge pull request #12845 from fidencio/topic/fix-signed-image-tests
tests: Update images used for signed tests
2026-04-15 14:47:58 +02:00
Fabiano Fidêncio
ba9a02897e genpolicy: make allowed cgroup v2 mount extras configurable
Newer kernels and containerd versions (>= 2.2.3) may add extra mount
options to /sys/fs/cgroup that genpolicy does not embed in the policy
(e.g. nsdelegate, memory_recursiveprot). This causes the Kata agent to
reject CreateContainerRequest with PERMISSION_DENIED because the
check_mount rules require an exact match.

Rather than hard-coding the allowed extras in Rego, make them
configurable via genpolicy-settings.json under
cluster_config.cgroup_mount_extras_allowed. The corresponding Rego rule
(check_mount 4) reads the list from policy_data.cluster_config and
allows only those named options beyond the policy-embedded set.

To support this, cluster_config is now included in PolicyData so that
it gets serialized into the Rego policy_data object at generation time.

This follows the established pattern of keeping site- and
version-specific tunables in genpolicy-settings.json so they can be
overridden via JSON-Patch drop-ins without touching the Rego source.

A policy test case is added to verify that the default allowed extras
(nsdelegate, memory_recursiveprot) are accepted and that unknown extras
are rejected.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-15 13:24:21 +02:00
Fabiano Fidêncio
d29b77e953 tests: Update images used for signed tests
I've updaed the images on the Confidential Containers side, in order to
add arm64 support, but I didn't realize it'd break tests not using
those.

Apologies!

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-15 12:11:37 +02:00
Fabiano Fidêncio
2d57b89857 Merge pull request #12805 from stevenhorsman/stale-bot-improvements
Stale bot improvements
2026-04-14 23:20:41 +02:00
Fabiano Fidêncio
672d3f2b0f workflows: Use docker buildx to build and push auth test image
skopeo copy with --override-arch fails with "authentication required"
during blob existence checks at the destination, regardless of how
credentials are provided (--dest-creds, --authfile, REGISTRY_AUTH_FILE).
This is a known issue with skopeo 1.13.x when copying from manifest
list sources.

Replace the skopeo/buildah approach with docker/build-push-action,
which is already proven in this repo (build-kubectl-image.yaml) and
handles multi-arch builds and Quay pushes reliably. The workflow now
builds a trivial FROM busybox image using buildx with QEMU emulation.

Fixes: b0abe5999 ("workflows: Add workflow to create auth registry test image")
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 22:44:30 +02:00
Fabiano Fidêncio
09ef32eaf1 Merge pull request #12827 from fidencio/topic/kata-deploy-custom-containerd-config
kata-deploy: Allow overriding containerd config path and file name
2026-04-14 22:23:33 +02:00
stevenhorsman
5ea30b33ae workflows: stale-issue: Increase operations-per-run
At a rate of default 30 per run, with over 1.5k issues, it will take
us over 50 days to do a pass of the issues we have, so increase
operations-per-run as suggested in the workflow by github to
reduce this. Based on the stats of the latest run, we are not too
close to hitting the API rate limit:
```
Github API rate used: 32
Github API rate remaining: 3693; reset at: Thu Apr 09 2026 10:23:31 GMT+0000 (Coordinated Universal Time)
```
so I think this should be okay.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-14 16:25:35 +01:00
stevenhorsman
a0359326e9 workflow: Bump stale action version
v9 is based on Node.js 20 which is deprecated, so update to the
latest to pick up a Node.js 24 version before Github removes Node 20

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-14 16:25:35 +01:00
Fabiano Fidêncio
0713b2d5d3 Merge pull request #12828 from kata-containers/dependabot/pip/docs/pillow-12.2.0
build(deps): bump pillow from 12.1.1 to 12.2.0 in /docs
2026-04-14 17:23:07 +02:00
Fabiano Fidêncio
661cfd7efa Merge pull request #12800 from kata-containers/dependabot/go_modules/src/runtime/go.opentelemetry.io/otel/sdk-1.43.0
build(deps): bump go.opentelemetry.io/otel/sdk from 1.40.0 to 1.43.0 in /src/runtime
2026-04-14 17:22:47 +02:00
dependabot[bot]
b54f02aa6c build(deps): bump pillow from 12.1.1 to 12.2.0 in /docs
Bumps [pillow](https://github.com/python-pillow/Pillow) from 12.1.1 to 12.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/12.1.1...12.2.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.2.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-14 14:40:14 +00:00
Steve Horsman
8289aaf0c7 Merge pull request #12831 from kata-containers/topic/ci-move-out-of-nodejs-20
ci: Update GitHub Actions to Node.js 24 compatible versions
2026-04-14 14:59:03 +01:00
Fabiano Fidêncio
c087eb92ec ci: Update GitHub Actions to Node.js 24 compatible versions
Node.js 20 is deprecated on GitHub Actions runners and will be
forced to Node.js 24 starting June 2nd, 2026. Update all affected
actions to versions that natively support Node.js 24:

- actions/upload-artifact: v4.6.2 -> v6.0.0
- actions/download-artifact: v4.3.0 -> v7.0.0
- docker/build-push-action: v5.4.0 -> v7.0.0
- docker/login-action: v3.4.0 -> v4.1.0
- docker/setup-buildx-action: v3.10.0 -> v4.0.0
- docker/setup-qemu-action: v3.6.0 -> v4.0.0
- geekyeggo/delete-artifact: v5.1.0 -> v6.0.0
- azure/login: v2.3.0 -> v3.0.0
- azure/setup-kubectl: v4.0.1 -> v5.0.0
- nick-fields/retry: v3.0.2 -> v4.0.0

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 15:48:45 +02:00
Fabiano Fidêncio
7e464f13a5 Merge pull request #12830 from fidencio/topic/workflows-create-auth-registry-image
workflows: Add workflow to create auth registry test image
2026-04-14 11:28:23 +02:00
Fabiano Fidêncio
b0abe59993 workflows: Add workflow to create auth registry test image
Add a manually-triggered workflow that builds and pushes a multi-arch
busybox-based image to quay.io/kata-containers/confidential-containers-auth
for use as an authenticated container image in CI tests.

The workflow uses skopeo to copy per-arch images and buildah to create
and push the multi-arch manifest.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 10:59:12 +02:00
Fabiano Fidêncio
b0a87880e7 Merge pull request #12826 from fidencio/topic/fix-concurrent-map-access-in-wait
runtime: Fix concurrent map read/write panic in Wait()
2026-04-14 08:48:52 +02:00
Fabiano Fidêncio
df1d02d3cf kata-deploy: Allow overriding containerd config path and file name
Add two new Helm values under `containerd`:
- `configDir`: overrides the host directory where the containerd
  config lives, taking precedence over the k8sDistribution-based
  auto-detection.
- `configFileName`: overrides the containerd config file name,
  propagated to the kata-deploy binary via the new
  CONTAINERD_CONFIG_FILE_NAME environment variable.

These are useful for non-standard containerd setups that don't match
any of the built-in k8sDistribution presets (k8s, k3s, rke2, k0s,
microk8s).

The config file name override only affects the default runtime branch
in get_containerd_paths(). The k0s/microk8s/k3s/rke2 branches are
left untouched since those runtimes have mandatory file naming
conventions.

Also fixes a spurious leading space in the k3s containerdConfPath
branch.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-13 22:31:55 +02:00
Fabiano Fidêncio
b17dd2a902 runtime: Fix concurrent map read/write panic in Wait()
Wait() was releasing s.mu immediately after getContainer(), then
calling getExec() — which reads c.execs — without holding any lock.
Concurrent Exec() or Delete() calls that write to c.execs under s.mu
triggered a "concurrent map read and map write" fatal panic.

Add a dedicated sync.RWMutex to the container struct that protects the
execs map. getExec() now acquires a read lock internally, and all
writes go through new setExec()/deleteExec() helpers that acquire the
write lock. This keeps the locking concern local to the map and avoids
complicating the s.mu usage in Wait().

Add a regression test (TestConcurrentExecAccess) that exercises
concurrent getExec reads against setExec/deleteExec writes; this
reliably reproduces the panic under the race detector without the fix.

Fixes: #12825

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-13 21:14:28 +02:00
Fabiano Fidêncio
4c567a9c05 ci: Reduce TEE test scope for PR runs
TEE hardware (TDX, SEV-SNP) is very limited in CI. Running the full
test suite on every PR consumes these resources unnecessarily, since
most tests exercises what is already exercised by the -coco-dev CIs.

Introduce a `tee-test-scope` workflow input (small/full) and a new
`baremetal-small-tee` K8S_TEST_HOST_TYPE that runs only the 12 tests
that are TEE-relevant: attestation tests (encrypted/authenticated/
signed image pull, confidential attestation) plus policy and trusted
ephemeral data storage tests.

PR runs default to "small" (12 tests), nightly runs use "full" (59
tests), and manual dispatch offers a dropdown to choose.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-13 20:26:46 +02:00
dependabot[bot]
b303600283 build(deps): bump go.opentelemetry.io/otel/sdk in /src/runtime
Bumps [go.opentelemetry.io/otel/sdk](https://github.com/open-telemetry/opentelemetry-go) from 1.40.0 to 1.43.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-go/compare/v1.40.0...v1.43.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/sdk
  dependency-version: 1.43.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-13 10:36:44 +00:00
Fabiano Fidêncio
bd6377a038 Merge pull request #12614 from manuelh-dev/mahuber/image-signing-nim
tests: nvidia: Enforce image signing for NIM test
2026-04-11 14:48:04 +02:00
Fabiano Fidêncio
5eb7844183 Merge pull request #12430 from stevenhorsman/cargo-deny-static-checks
static-checks: Rework cargo deny check
2026-04-11 12:05:53 +02:00
stevenhorsman
8be3a24112 ci: Update cargo-deny in gatekeeper
Update the name and move it to the static checks as we don't
need to ensure it's running for none code changes.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-11 08:46:32 +01:00
stevenhorsman
9448988783 workflow: Update cargo deny check
The cargo deny generated action doesn't seem to work
and seems unnecessarily complex, so try using
EmbarkStudios/cargo-deny-action instead

Fixes: #11218
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-11 08:46:32 +01:00
stevenhorsman
a0410e0d5c static-checks: Update cargo deny config
The previous config is not valid, so update it based on information
from https://embarkstudios.github.io/cargo-deny/checks/cfg.html

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-11 08:46:32 +01:00
stevenhorsman
a32c6fd9ff mem-agent: Add package metadata
Make the authors, edition and license be inherited from the
workspace

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-11 08:46:32 +01:00
stevenhorsman
5bcc006447 runtime-rs: Add missing license
The ch-config crate was missing a license

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-11 08:46:32 +01:00
Manuel Huber
7daeb78b67 tests: nvidia: Enforce image signing for NIM test
Validate container image signatures for the NIM test using NVIDIA's
public signing key.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-11 09:22:50 +02:00
Fabiano Fidêncio
3ce3644c3c Merge pull request #12807 from PiotrProkop/blk-sector-rust
runtime-rs: allow specifying logical/physical sector size for block devices
2026-04-11 00:42:45 +02:00
Fabiano Fidêncio
6f3c11aec4 Merge pull request #12808 from fidencio/topic/agent-allow-configuring-launch-process-timeout
agent: Make launch_process_timeout configurable
2026-04-11 00:36:01 +02:00
Fabiano Fidêncio
d4a042a155 Merge pull request #12813 from fitzthum/bump-gc-ma-sigs
Bump guest components to pickup additional signature support
2026-04-10 23:57:19 +02:00
Fabiano Fidêncio
78fa4c88e2 Merge pull request #12814 from fidencio/topic/nvidia-always-do-vcpu-pinning
runtime: Set `enable_vcpus_pinning = true` for NVIDIA configs
2026-04-10 23:47:44 +02:00
Fabiano Fidêncio
7244389ad4 runtime: Set enable_vcpus_pinning = true for NVIDIA configs
So we can have a better performance by default.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-10 16:41:34 +02:00
Fabiano Fidêncio
1d77c4e60f Merge pull request #12752 from LizZhang315/add-overheadEnabled
helm: add overheadEnabled switch for runtimeclass
2026-04-10 16:40:42 +02:00
Tobin Feldman-Fitzthum
ff26a6b876 versions: update image-rs to pickup signature fixes
The new version of image-rs supports more types of signed images. First,
we added supported for a few more key types. Second, we added support
for multi-arch images where the manifest digest is signed but the
individual arch manifest is not. These images are relatively common, so
let's pickup the fix asap.

Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2026-04-10 06:54:58 -07:00
Tobin Feldman-Fitzthum
2588a0e5a5 agent-ctl: bump image-rs version
I don't think agent-ctl will benefit from the new image-rs features, but
let's update it to be complete.

Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2026-04-10 06:52:53 -07:00
Fabiano Fidêncio
e8f34a2b26 agent: Update protocol
This is not related to this PR, but rather to #12734, which ended up not
running the `make src/agent generate-protocols`.

While here, let's also fix it.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-10 14:47:01 +02:00
Fabiano Fidêncio
36a2d8e7f2 agent: Make launch_process_timeout configurable
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).

When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.

Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.

The NVIDIA GPU configs set the new default to 15 seconds.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-10 14:47:01 +02:00
PiotrProkop
82de35c720 runtime-rs: allow specifying logical/physical sector size for block devices
Add two new configuration knobs that control the logical and physical
sector sizes advertised by virtio-blk devices to the guest:

  block_device_logical_sector_size  (config file)
  block_device_physical_sector_size (config file)

  io.katacontainers.config.hypervisor.blk_logical_sector_size  (annotation)
  io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation)

The annotation names are abbreviated relative to the config file keys
because Kubernetes enforces a 63-character limit on annotation name
segments, and the full names would exceed it.

Both settings default to 0 (let QEMU decide). When set, they are passed
as logical_block_size and physical_block_size in the QMP device_add
command during block device hotplug.

Setting logical_sector_size smaller then container filesystem
block size will cause EINVAL on mount. The physical_sector_size can
always be set independently.

Values must be 0 or a power of 2 in the range [512, 65536]; other
values are rejected with an error at sandbox creation time.

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2026-04-10 11:14:51 +02:00
Fabiano Fidêncio
fd6375d8d5 Merge pull request #12806 from kata-containers/topic/ci-run-runtime-rs-on-SNP
ci: Run qemu-snp-runtime-rs tests in the CI
2026-04-10 11:01:20 +02:00
LizZhang315
2312f67c9b helm: add overheadEnabled switch for runtimeclass
Add a global and per-shim configurable switch to enable/disable
the overhead section in generated RuntimeClasses. This allows users
to omit overhead when it's not needed or managed externally.

Priority: per-shim > global > default(true).

Signed-off-by: LizZhang315 <123134987@qq.com>
2026-04-10 10:26:11 +02:00
Fabiano Fidêncio
218077506b Merge pull request #12769 from RuoqingHe/runtime-rs-allow-install-on-riscv
runtime-rs: Allow installation on RISC-V platforms
2026-04-10 10:24:40 +02:00
Fabiano Fidêncio
dca89485f0 Merge pull request #12802 from stevenhorsman/bump-golang-1.25.9
versions: bump golang to 1.25.9
2026-04-10 06:50:35 +02:00
Fabiano Fidêncio
72fb41d33b kata-deploy: Symlink original config to per-shim runtime copy
Users were confused about which configuration file to edit because
kata-deploy copied the base config into a per-shim runtime directory
(runtimes/<shim>/) for config.d support, leaving the original file
in place untouched.  This made it look like the original was the
authoritative config, when in reality the runtime was loading the
copy from the per-shim directory.

Replace the original config file with a symlink pointing to the
per-shim runtime copy after the copy is made.  The runtime's
ResolvePath / EvalSymlinks follows the symlink and lands in the
per-shim directory, where it naturally finds config.d/ with all
drop-in fragments.  This makes it immediately obvious that the
real configuration lives in the per-shim directory and removes the
ambiguity about which file to inspect or modify.

During cleanup, the symlink at the original location is explicitly
removed before the runtime directory is deleted.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-09 17:16:40 +02:00
Steve Horsman
9e8069569e Merge pull request #12734 from Apokleos/rm-v9p-rs
runtime-rs: Remove virtio-9p Shared Filesystem Support
2026-04-09 16:15:55 +01:00
Fabiano Fidêncio
5e1ab0aa7d tests: Support runtime-rs QEMU cmdline format in attestation test
The k8s-confidential-attestation test extracts the QEMU command line
from journal logs to compute the SNP launch measurement. It only
matched the Go runtime's log format ("launching <path> with: [<args>]"),
but runtime-rs logs differently ("qemu args: <args>").

Handle both formats so the test works with qemu-snp-runtime-rs.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-09 16:35:08 +02:00
Fabiano Fidêncio
3b155ab0b1 ci: Run runtime-rs tests for SNP
As we're in the process to stabilise runtime-rs for the coming 4.0.0
release, we better start running as many tests as possible with that.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-09 16:35:08 +02:00
stevenhorsman
31f9a5461b versions: bump golang to 1.25.9
Bump the go version to resolve CVEs:
- GO-2026-4947
- GO-2026-4946
- GO-2026-4870
- GO-2026-4869
- GO-2026-4865
- GO-2026-4864

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-09 08:59:40 +01:00