Commit Graph

1082 Commits

Author SHA1 Message Date
Fabiano Fidêncio
68cc7f8e70 ci: remove unmaintained CoCo stability test workflows
The ci-coco-stability.yaml workflow has its weekly schedule
commented out with a note that the workload is not maintained.
Remove the entire chain: ci-coco-stability.yaml, ci-weekly.yaml,
run-kata-coco-stability-tests.yaml, and the kubernetes stability
test scripts that were only used through this path.

The local containerd stability tests (tests/stability/gha-run.sh)
remain as they are actively used by basic-ci workflows.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-23 08:46:12 +02:00
Fabiano Fidêncio
e0d98fafe3 ci: remove disabled run-cri-containerd-tests-arm64 job
This job in ci.yaml has been unconditionally disabled (if: false)
with no tracking issue or path to re-enablement.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-23 08:46:12 +02:00
Fabiano Fidêncio
c7e3f95883 tests: remove disabled tracing tests and CI job
The run-tracing job in basic-ci-amd64.yaml has been disabled
(if: false) due to issue #9763, with no path to re-enablement.
Remove the job definition and the backing
tests/functional/tracing/ directory.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-23 08:46:12 +02:00
Fabiano Fidêncio
8a93cf8f17 tests: remove disabled VFIO tests and CI job
The run-vfio job in basic-ci-amd64.yaml has been disabled
(if: false) due to issues #9764, #9851, and #9940, with no
path to re-enablement. Remove the job definition and the
backing tests/functional/vfio/ directory.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-23 08:46:12 +02:00
Fabiano Fidêncio
8e685f22c6 ci: remove orphan run-kata-deploy-tests-on-aks.yaml workflow
This reusable workflow (workflow_call) has no caller anywhere in
the repository, making it dead code.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-23 08:46:12 +02:00
Fabiano Fidêncio
b74f2c0a9c tests: remove metrics tests and workflow
The run-metrics.yaml workflow is a reusable workflow_call with no
caller in the repository, making it effectively dead code. Remove
the workflow, the entire tests/metrics/ directory (~586 files
including vendored Go for checkmetrics), and the "metrics"
self-hosted runner label from actionlint.yaml.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-23 08:46:12 +02:00
Saul Paredes
baf0f16804 ci: k8s-tests: test mariner and runtime-rs
Disable policy tests when using mariner and runtime-rs. These are not supported yet.

Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
2026-04-21 14:08:21 -07:00
Aurélien Bombo
d64fce3998 Revert "ci: k8s: Adjust timeout on free runners"
This reverts commit 8d6f1d6f34.
2026-04-20 15:36:35 -05:00
Fabiano Fidêncio
d6f0b15578 ci: erofs: restrict to runtime-rs only
The erofs snapshotter configuration is node-wide (a single containerd
drop-in) and cannot be split per runtime handler.  The Go runtime does
not support fsmerged EROFS — it rejects fsmeta.erofs mount sources with
"unsupported mount source" — so erofs is only usable with runtime-rs.

Drop qemu-coco-dev (Go) from the erofs CI matrix and add a check in
kata-deploy's configure_erofs_snapshotter() that inspects the
SNAPSHOTTER_HANDLER_MAPPING: if any Go shim is explicitly mapped to
erofs, emit a prominent warning and bail out with a clear error telling
the operator to fix the mapping.

Since all shims are now guaranteed to be runtime-rs when erofs is
active, remove the conditional is_rust_shim gating and always emit the
full erofs configuration (differ options, default_size,
max_unmerged_layers=1).
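
The mapping check described above can be sketched roughly as follows. This is only an illustration, not the kata-deploy code: it assumes the mapping is a comma-separated list of `shim:snapshotter` pairs and that runtime-rs shims can be recognised by a `-runtime-rs` suffix, and `go_shims_mapped_to_erofs` is a hypothetical helper name.

```python
def go_shims_mapped_to_erofs(mapping: str) -> list[str]:
    """Return shims mapped to erofs that do not look like runtime-rs shims.

    Assumptions (not confirmed by the commit): the mapping is a
    comma-separated list of "shim:snapshotter" pairs, and runtime-rs shims
    can be recognised by a "-runtime-rs" suffix.
    """
    offenders = []
    for entry in mapping.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty entries such as a trailing comma
        shim, _, snapshotter = entry.partition(":")
        if snapshotter.strip() == "erofs" and not shim.strip().endswith("-runtime-rs"):
            offenders.append(shim.strip())
    return offenders


# Hypothetical mapping value used only for this demonstration.
mapping = "qemu-coco-dev:erofs,qemu-coco-dev-runtime-rs:erofs,qemu:nydus"
bad = go_shims_mapped_to_erofs(mapping)
if bad:
    print(f"WARNING: erofs is not supported by Go shims: {', '.join(bad)}")
```

A real check would then exit with an error so the operator fixes the mapping before erofs is configured.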

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-19 13:24:31 +02:00
Fabiano Fidêncio
9c803d86a6 ci: erofs: Bump containerd to v2.3
To ensure we're using the latest released version of the project, as I
think we're missing patches on v2.2.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-19 13:24:31 +02:00
Fabiano Fidêncio
cdd09c3c65 ci: enable erofs tests with runtime-rs
Now that erofs snapshotter has added , let's make sure this is tested.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-19 13:24:31 +02:00
Fabiano Fidêncio
35e48fdfd1 ci: run qemu-coco-dev-runtime-rs tests on arm64
Add qemu-coco-dev-runtime-rs to the arm64 k8s test matrix so that the
CoCo non-TEE configuration is exercised on aarch64 runners.

Also enable auto-generated policy for qemu-coco-dev on aarch64 (matching
the existing x86_64 behavior) and register the new job as a required
gatekeeper check.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-18 00:48:13 +02:00
Fabiano Fidêncio
861f15cdc4 build: add arm64 coco-dev build dependencies
Build coco-guest-components, pause-image, and rootfs-image-confidential
for arm64, which are required by qemu-coco-dev-runtime-rs.

Enable MEASURED_ROOTFS on the arm64 shim-v2 build, add the aarch64 case
to install_kernel() so the default kernel is built as a unified kernel
(with confidential guest support, like x86_64), and adjust the kernel
install naming so only CCA builds get the -confidential suffix.

Also wire rootfs-image-confidential-tarball into the aarch64 local-build
Makefile.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-18 00:48:13 +02:00
Fabiano Fidêncio
e1f8b8e8b4 build: add arm64 tools build (genpolicy only)
The arm64 build workflow was missing the tools build entirely.
Add build-tools-asset and create-kata-tools-tarball jobs mirroring
the amd64 workflow so that genpolicy and the other tools are
available for coco-dev tests that need auto-generated policy.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-18 00:48:02 +02:00
Steve Horsman
1db12f8ccf Merge pull request #12812 from stevenhorsman/tee-test-refactor
ci: Refactor confidential TEE support
2026-04-17 11:12:13 +01:00
stevenhorsman
1dc57c6cef ci: increase stale issues workflow frequency
Update the stale issues workflow to run more frequently:
- Weekdays: Every 4 hours (6x per day) at 00:00, 06:00, 12:00, 18:00 UTC
- Weekends: Every hour (24x per day)

Previously it ran once daily at midnight UTC. This change reduces the time
it will take us to get through our backlog, particularly by increasing
the runs at the weekend, when there should be less other CI running
that it could impact through GH API rate limiting.

Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-16 20:50:38 +01:00
dependabot[bot]
c044403409 build(deps): bump tim-actions/wip-check from 1.0.0 to 1.1.0
Bumps [tim-actions/wip-check](https://github.com/tim-actions/wip-check) from 1.0.0 to 1.1.0.
- [Release notes](https://github.com/tim-actions/wip-check/releases)
- [Commits](1c2a1ca6c1...8c84f59872)

---
updated-dependencies:
- dependency-name: tim-actions/wip-check
  dependency-version: 1.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-16 10:48:41 +00:00
stevenhorsman
ff246f9538 ci: Remove deploy_snapshotter
Snapshotter deployment is a no-op now that
kata-deploy handles this, so clean up this code.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-16 09:21:04 +01:00
Fabiano Fidêncio
8d6f1d6f34 ci: k8s: Adjust timeout on free runners
I've seen several cases of the CLH tests just being killed due to the
60-minute timeout. Let's bump it to 75 and see how it goes.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-15 17:09:30 +02:00
Fabiano Fidêncio
2d57b89857 Merge pull request #12805 from stevenhorsman/stale-bot-improvements
Stale bot improvements
2026-04-14 23:20:41 +02:00
Fabiano Fidêncio
672d3f2b0f workflows: Use docker buildx to build and push auth test image
skopeo copy with --override-arch fails with "authentication required"
during blob existence checks at the destination, regardless of how
credentials are provided (--dest-creds, --authfile, REGISTRY_AUTH_FILE).
This is a known issue with skopeo 1.13.x when copying from manifest
list sources.

Replace the skopeo/buildah approach with docker/build-push-action,
which is already proven in this repo (build-kubectl-image.yaml) and
handles multi-arch builds and Quay pushes reliably. The workflow now
builds a trivial FROM busybox image using buildx with QEMU emulation.

Fixes: b0abe5999 ("workflows: Add workflow to create auth registry test image")
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 22:44:30 +02:00
stevenhorsman
5ea30b33ae workflows: stale-issue: Increase operations-per-run
At the default rate of 30 operations per run, with over 1.5k issues, it
will take us over 50 days to do a pass of the issues we have, so increase
operations-per-run, as suggested in the workflow by GitHub, to
reduce this. Based on the stats of the latest run, we are not too
close to hitting the API rate limit:
```
Github API rate used: 32
Github API rate remaining: 3693; reset at: Thu Apr 09 2026 10:23:31 GMT+0000 (Coordinated Universal Time)
```
so I think this should be okay.
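
The "over 50 days" estimate can be checked with a quick calculation. This is a minimal sketch, assuming exactly 1,500 issues, one workflow run per day, and one operation per issue; `days_for_full_pass` is a hypothetical helper and the value 300 is only an example, not the value chosen in this commit.

```python
import math


def days_for_full_pass(open_issues: int, operations_per_run: int, runs_per_day: int = 1) -> int:
    """Days needed for the stale action to touch every issue once."""
    runs_needed = math.ceil(open_issues / operations_per_run)
    return math.ceil(runs_needed / runs_per_day)


print(days_for_full_pass(1500, 30))   # default operations-per-run
print(days_for_full_pass(1500, 300))  # hypothetical higher value
```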

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-14 16:25:35 +01:00
stevenhorsman
a0359326e9 workflow: Bump stale action version
v9 is based on Node.js 20, which is deprecated, so update to the
latest to pick up a Node.js 24 version before GitHub removes Node 20.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-14 16:25:35 +01:00
Fabiano Fidêncio
c087eb92ec ci: Update GitHub Actions to Node.js 24 compatible versions
Node.js 20 is deprecated on GitHub Actions runners and will be
forced to Node.js 24 starting June 2nd, 2026. Update all affected
actions to versions that natively support Node.js 24:

- actions/upload-artifact: v4.6.2 -> v6.0.0
- actions/download-artifact: v4.3.0 -> v7.0.0
- docker/build-push-action: v5.4.0 -> v7.0.0
- docker/login-action: v3.4.0 -> v4.1.0
- docker/setup-buildx-action: v3.10.0 -> v4.0.0
- docker/setup-qemu-action: v3.6.0 -> v4.0.0
- geekyeggo/delete-artifact: v5.1.0 -> v6.0.0
- azure/login: v2.3.0 -> v3.0.0
- azure/setup-kubectl: v4.0.1 -> v5.0.0
- nick-fields/retry: v3.0.2 -> v4.0.0

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 15:48:45 +02:00
Fabiano Fidêncio
b0abe59993 workflows: Add workflow to create auth registry test image
Add a manually-triggered workflow that builds and pushes a multi-arch
busybox-based image to quay.io/kata-containers/confidential-containers-auth
for use as an authenticated container image in CI tests.

The workflow uses skopeo to copy per-arch images and buildah to create
and push the multi-arch manifest.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 10:59:12 +02:00
Fabiano Fidêncio
4c567a9c05 ci: Reduce TEE test scope for PR runs
TEE hardware (TDX, SEV-SNP) is very limited in CI. Running the full
test suite on every PR consumes these resources unnecessarily, since
most tests exercise what is already exercised by the -coco-dev CIs.

Introduce a `tee-test-scope` workflow input (small/full) and a new
`baremetal-small-tee` K8S_TEST_HOST_TYPE that runs only the 12 tests
that are TEE-relevant: attestation tests (encrypted/authenticated/
signed image pull, confidential attestation) plus policy and trusted
ephemeral data storage tests.

PR runs default to "small" (12 tests), nightly runs use "full" (59
tests), and manual dispatch offers a dropdown to choose.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-13 20:26:46 +02:00
stevenhorsman
9448988783 workflow: Update cargo deny check
The cargo deny generated action doesn't seem to work
and seems unnecessarily complex, so try using
EmbarkStudios/cargo-deny-action instead

Fixes: #11218
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-11 08:46:32 +01:00
Fabiano Fidêncio
3b155ab0b1 ci: Run runtime-rs tests for SNP
As we're in the process of stabilising runtime-rs for the coming 4.0.0
release, we had better start running as many tests as possible with it.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-09 16:35:08 +02:00
Fabiano Fidêncio
461907918d kata-deploy: pin nydus-snapshotter via versions.yaml
Resolve externals.nydus-snapshotter version and url in the Docker image build
with yq from the repo-root versions.yaml instead of Dockerfile ARG defaults.

Drop the redundant workflow that only enforced parity between those two sources.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-07 10:07:06 +08:00
Fabiano Fidêncio
ccfdf5e11b Merge pull request #12754 from llink5/fix/docker26-networking-9340
runtime: fix Docker 26+ networking by rescanning after Start
2026-04-03 13:15:38 +02:00
Alex Lyn
4a1c2b6620 Merge pull request #12309 from kata-containers/stale-issues-by-date
workflows: Create workflow to stale issues based on date
2026-04-03 09:31:34 +08:00
llink5
f7878cc385 runtime: fix Docker 26+ networking by rescanning after Start
Docker 26+ configures container networking (veth pair, IP addresses,
routes) after task creation rather than before. Kata's endpoint scan
runs during CreateSandbox, before the interfaces exist, resulting in
VMs starting without network connectivity (no -netdev passed to QEMU).

Add RescanNetwork() which runs asynchronously after the Start RPC.
It polls the network namespace until Docker's interfaces appear, then
hotplugs them to QEMU and informs the guest agent to configure them
inside the VM.

Additional fixes:
- mountinfo parser: find fs type dynamically instead of hardcoded
  field index, fixing parsing with optional mount tags (shared:,
  master:)
- IsDockerContainer: check CreateRuntime hooks for Docker 26+
- DockerNetnsPath: extract netns path from libnetwork-setkey hook
  args with path traversal protection
- detectHypervisorNetns: verify PID ownership via /proc/pid/cmdline
  to guard against PID recycling
- startVM guard: rescan when len(endpoints)==0 after VM start
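
The mountinfo fix in the list above can be illustrated with a small parser. This is a sketch in Python rather than the runtime's actual code, and `parse_mountinfo_line` is a hypothetical name. It relies on the documented /proc/[pid]/mountinfo layout, where zero or more optional fields (such as `shared:N` or `master:N`) are terminated by a single `-` separator and the filesystem type follows it.

```python
def parse_mountinfo_line(line: str) -> dict:
    """Parse one mountinfo line, locating the fs type via the "-" separator."""
    fields = line.split()
    try:
        # The first six fields are fixed; optional fields follow until "-".
        sep = fields.index("-", 6)
    except ValueError:
        raise ValueError(f"no separator in mountinfo line: {line!r}") from None
    if len(fields) < sep + 3:
        raise ValueError(f"truncated mountinfo line: {line!r}")
    return {
        "mount_point": fields[4],
        "optional": fields[6:sep],
        "fs_type": fields[sep + 1],
        "source": fields[sep + 2],
    }


plain = "36 35 98:0 /mnt1 /mnt2 rw,noatime - ext3 /dev/root rw,errors=continue"
tagged = "36 35 98:0 /mnt1 /mnt2 rw,noatime shared:1 master:2 - ext3 /dev/root rw"
print(parse_mountinfo_line(plain)["fs_type"])
print(parse_mountinfo_line(tagged)["fs_type"])
```

A parser that reads the filesystem type from a fixed index would get the wrong field on the second line, because the two optional tags shift everything after them.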

Fixes: #9340

Signed-off-by: llink5 <llink5@users.noreply.github.com>
2026-04-02 21:23:16 +02:00
Steve Horsman
58101a2166 Merge pull request #12656 from stevenhorsman/actions/checkout-bump
workflows: Update actions/checkout version
2026-04-01 17:34:39 +01:00
Aurélien Bombo
78289d19f7 gha: Pin actionlint version
Pin to the latest released version as a security measure.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-31 10:51:17 -05:00
Aurélien Bombo
3122fa651e gha: Avoid noisy deployment logs in PRs
GitHub recently announced that developers can now use environments without
auto-deployment, which allows us to avoid the noisy deployment logs in our
PRs:

https://github.blog/changelog/2026-03-19-github-actions-late-march-2026-updates/#github-actions-now-allows-developers-to-use-environments-without-auto-deployment

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-31 10:51:13 -05:00
stevenhorsman
99eaa8fcb1 workflows: Create workflow to stale issues based on date
The standard stale/action is intended to be run regularly with
a date offset, but we want to have one we can run against a specific
date in order to run the stale bot against issues created since a particular
release milestone, so calculate the offset in one step and use it in the next.

At the moment we want to run this to stale issues before 9th October 2022, when Kata 3.0 was released, so default to this.

Note the stale action only processes a few issues at a time to avoid rate limiting, which is why we want a cron job: so it can get through
the backlog, and also to stale/unstale issues that are commented on.
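
The offset calculation can be sketched as below. This is an illustrative Python version, not the workflow step itself; `days_since` is a hypothetical helper, and the "today" value is fixed to this commit's date only so the example is reproducible.

```python
from datetime import date


def days_since(cutoff: date, today: date) -> int:
    """Number of whole days between the cutoff date and today."""
    if cutoff > today:
        raise ValueError("cutoff date is in the future")
    return (today - cutoff).days


# Kata 3.0 release date from the commit message; "today" is the commit date.
offset = days_since(date(2022, 10, 9), today=date(2026, 3, 31))
print(offset)
```

The resulting day count is the kind of value a later step could pass to the stale action as its date offset.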

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-31 15:57:37 +01:00
stevenhorsman
b3179bdd8e workflows: Update actions/checkout version
Update the action to resolve the following warning in GHA:
> Node.js 20 actions are deprecated. The following actions are running
> on Node.js 20 and may not work as expected:
> actions/checkout@11bd71901b.
> Actions will be forced to run with Node.js 24 by default starting June 2nd, 2026.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-30 10:45:28 +01:00
Fabiano Fidêncio
514a2b1a7c Merge pull request #12264 from fidencio/topic/nvidia-gpu-cc-use-nydus-snapshotter
nvidia: cc: Use nydus-snapshotter
2026-03-23 12:50:15 +01:00
Fabiano Fidêncio
864f181faf Merge pull request #12694 from manuelh-dev/mahuber/nv-test-timeout
tests: nvidia: Increase run test timeout
2026-03-23 09:13:20 +01:00
Alex Lyn
d2c2ec6e23 Merge pull request #12633 from LandonTClipp/docs_materialx
docs: Move to mkdocs-material, port Helm to docs site
2026-03-23 09:29:25 +08:00
Fabiano Fidêncio
740d380b8e tests: nvidia: cc: Use nydus-snapshotter
So we can test what we just changed in the config files.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-22 10:10:34 +01:00
Agam Dua
91d6c39f06 kernel: Fix debug build and add debug symbols to installation
Fixed a bug with the debug kernel build where common/ was repeated
after the common path variable, resulting in the debug
confs never being picked up.

This exposed a subsequent bug where the debug conf
was included in other builds. This is also fixed by creating a
separate directory for debug confs with one file at the moment,
debug.conf, which contains debug configurations and bpf-specific
configs.

To enable kernel builds (specifically for bpf), the dwarves package, which
provides pahole, was added to the kernel Dockerfile.

Signed-off-by: Agam Dua <agam_dua@apple.com>
2026-03-20 14:50:23 -07:00
Agam Dua
5ab0744c25 ci: Add pipeline for building and distributing the debug kernel
Add the debug kernel to the kata tarball alongside the other kernels.

Also update the kernel README documentation to describe the new debug
kernel build process.

Signed-off-by: Agam Dua <agam_dua@apple.com>
2026-03-20 14:50:23 -07:00
LandonTClipp
795869152d docs: Move to mkdocs-material, port Helm to docs site
This supersedes https://github.com/kata-containers/kata-containers/pull/12622.
I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material
created after mkdocs-material was put into maintenance mode. We'll use this
platform until Zensical is more feature complete.

Added a few of the existing docs into the site to make a more user-friendly flow.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-03-20 14:51:39 -05:00
Manuel Huber
8903b12d34 tests: nvidia: Increase run test timeout
Increase the timeout as a few new features and tests are going to be
onboarded for the NVIDIA GPU CI.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-20 11:12:52 -07:00
Manuel Huber
ae59cf26a0 kata-deploy: Check kata-tarball size limits
For kata tarballs we eventually release to GitHub, check their
size against the GitHub size limit. With this, an ongoing release
fails early in 'CI | Publish Kata Containers payload' instead of
only later on in the 'Release Kata Containers' action, and PR builds
fail as well, avoiding this situation altogether.
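
A size check of this kind might look like the following. This is a minimal sketch, not the script added by this commit: the 2 GiB limit is an assumption about GitHub's per-asset release limit, and `oversized_tarballs` and the file name are hypothetical.

```python
import os
import tempfile

# Assumed limit for a single GitHub release asset; adjust if the real limit differs.
GITHUB_ASSET_LIMIT_BYTES = 2 * 1024**3


def oversized_tarballs(paths, limit=GITHUB_ASSET_LIMIT_BYTES):
    """Return (path, size) pairs for files whose size reaches the limit."""
    result = []
    for path in paths:
        size = os.path.getsize(path)
        if size >= limit:
            result.append((path, size))
    return result


# Demo with a small temporary file and an artificially low limit.
with tempfile.TemporaryDirectory() as tmp:
    tarball = os.path.join(tmp, "kata-static.tar.zst")  # hypothetical file name
    with open(tarball, "wb") as f:
        f.write(b"\0" * 1024)
    for path, size in oversized_tarballs([tarball], limit=512):
        print(f"{os.path.basename(path)}: {size} bytes exceeds the limit")
```

In CI, a non-empty result would be turned into a failing exit status so the problem surfaces during the PR build.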

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-20 10:40:55 -07:00
stevenhorsman
e62df07b6a static-checks: Delete kata-spell-check
The old hunspell based spell-check was causing contributors
challenges and proving a barrier to doc updates. We've replaced
it with a cspell-based solution, so clean up the old approach.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:22:54 +00:00
stevenhorsman
c2cedd7c02 workflows: Add spellcheck workflow
Add a separate spellcheck workflow, so we can replace
the complex hunspell approach embedded in static-checks

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:22:54 +00:00
Manuel Huber
660e3bb653 gpu: Obsolete the NVIDIA initrd build
As the NVIDIA stack has shifted to using an image for both the
confidential and non-confidential variants, we retire the initrd
build.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 21:29:58 -04:00
Aurélien Bombo
f8e234c6f9 Merge pull request #12650 from kata-containers/sprt/remove-csi
ci: Stop building/deploying CSI driver
2026-03-16 16:53:02 -05:00