After containerd 2.0.4, privileged containers handle sysfs mounts
differently, so we can end up with the policy expecting a read-only (RO)
sysfs mount while the input has it read-write (RW).
The sandbox needs to get privileged mounts when any container in the pod
is privileged, not only when the pause container itself is marked
privileged. So we now compute that and pass it into get_mounts.
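A minimal sketch of that pod-level check, using simplified stand-in types
rather than the real genpolicy structures:

```rust
// Simplified stand-in for the genpolicy container representation
// (names are illustrative, not the real genpolicy types).
struct PodContainer {
    privileged: bool,
}

/// The pod counts as privileged if any of its containers is privileged,
/// not only when the pause container itself is marked privileged.
fn pod_is_privileged(containers: &[PodContainer]) -> bool {
    containers.iter().any(|c| c.privileged)
}
```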
One downside: we’re relaxing policy checks (accepting RO/RW mismatch for
sysfs) and giving the pause container privileged mounts whenever the pod
has any privileged workload. For Kata, that means a slightly broader
attack surface for privileged pods—the pause container sees more than it
strictly needs, and we’re being more permissive on sysfs.
It’s a trade-off for compatibility with newer containerd; if you need
maximum isolation, you may want to avoid privileged pods or tighten
policy elsewhere.
Fixes: #12532
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When kata-deploy is deployed with cloud-api-adaptor, it
defaults to qemu instead of configuring the remote shim.
Add ppc64le support so that the remote shim is enabled correctly when shims.remote.enabled=true.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
The VcpuThreadIds struct expects a mapping from vcpu_id to thread_id,
but get_ch_vcpu_tids() was inserting (tid, vcpu_id) instead of
(vcpu_id, tid).
This caused move_vcpus_to_sandbox_cgroup() to interpret vcpu IDs
(0, 1, 2...) as process IDs when sandbox_cgroup_only=false, leading
to failed attempts to read /proc/0/status.
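A minimal sketch of the corrected insertion, with illustrative field and
helper names rather than the exact runtime-rs code:

```rust
use std::collections::HashMap;

// Illustrative shape of the struct: vcpu_id -> host thread id.
struct VcpuThreadIds {
    vcpus: HashMap<u32, u32>,
}

fn collect_vcpu_tids(pairs: &[(u32, u32)]) -> VcpuThreadIds {
    let mut vcpus = HashMap::new();
    for &(vcpu_id, tid) in pairs {
        // The bug was inserting (tid, vcpu_id) here, swapping key and value.
        vcpus.insert(vcpu_id, tid);
    }
    VcpuThreadIds { vcpus }
}
```

With key and value in the right order, move_vcpus_to_sandbox_cgroup()
receives real thread IDs instead of vcpu indexes.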
Fixes: #12479
Signed-off-by: Chiranjeevi Uddanti <244287281+chiranjeevi-max@users.noreply.github.com>
Nowadays on arm64 we use a modern QEMU version which supports the features we
require for NVDIMM, so we remove the arm64-specific code and use the generic
implementation.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
This disables virtio-pmem support for Cloud Hypervisor by changing
Kata config defaults and removing the relevant code paths.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Use rstest parameterized tests for QEMU variants, other hypervisors,
and unknown/empty shim cases.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When copying artifacts from the container to the host, detect source
entries that are symlinks and recreate them as symlinks at the
destination instead of copying the target file.
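For illustration only (the actual copy logic lives in the kata-deploy
scripts, not in Rust), the check amounts to:

```rust
use std::fs;
use std::os::unix::fs::symlink;
use std::path::Path;

fn copy_entry(src: &Path, dst: &Path) -> std::io::Result<()> {
    // symlink_metadata() does not follow symlinks, so we can tell a link
    // apart from the file it points at.
    if fs::symlink_metadata(src)?.file_type().is_symlink() {
        // Recreate the link at the destination instead of copying its target.
        symlink(fs::read_link(src)?, dst)
    } else {
        fs::copy(src, dst).map(|_| ())
    }
}
```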
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This test uses YAML files from a different directory than the other
k8s CI tests, so annotations have to be added into these separate
files.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
When qemu exits prematurely, we usually see a message like
msg="Cannot start VM" error="exiting QMP loop, command cancelled"
This is an indirect hint, caused by the QMP server shutting down. It
takes experience to understand what it even means, and it still does not
show what the actual problem is.
With this commit, we take the error returned by the qemu subprocess and
surface it in the logs if it is not nil. This means we automatically
capture any non-zero exit codes in the logs.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Turn test_toml_value_types into a parameterized test with one case per type
(string, bool, int). Merge the two invalid-TOML tests (get and set) into one
rstest with two cases, and the two "not an array" tests into one rstest
with two cases.
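Roughly, the resulting shape follows the usual rstest pattern (the case
values below are illustrative, not the exact ones from the test):

```rust
use rstest::rstest;

#[rstest]
#[case::string(r#"key = "value""#)]
#[case::bool("key = true")]
#[case::int("key = 42")]
fn test_toml_value_types(#[case] input: &str) {
    // One case per TOML value type, all sharing the same assertions.
    let parsed: toml::Value = toml::from_str(input).unwrap();
    assert!(parsed.get("key").is_some());
}
```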
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When writing a containerd drop-in or other TOML file (e.g. starting from
an initially empty file), the serialized document could start with many newlines.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Replace multiple #[test] functions for snapshotter and erofs version
checks with parameterized #[rstest] #[case] tests for consistency and
easier extension.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
With some older kernels, some filesystem implementations don't handle
empty options strings well, leading to failures in the "setup rootfs"
step, e.g. `cgroup: cgroup2: unknown option ""`.
This is fixed by mapping an empty options string to `None` before passing
it to `nix::mount`.
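A minimal sketch of the mapping, assuming the options arrive as a plain
string and are forwarded to nix::mount::mount():

```rust
use nix::mount::{mount, MsFlags};

fn setup_mount(source: &str, target: &str, fstype: &str, options: &str) -> nix::Result<()> {
    // Older kernels reject an empty data string for some filesystems
    // (e.g. `cgroup2: unknown option ""`), so pass None instead of Some("").
    let data = if options.is_empty() { None } else { Some(options) };
    mount(Some(source), target, Some(fstype), MsFlags::empty(), data)
}
```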
Signed-off-by: Jacek Tomasiak <jtomasiak@arista.com>
Signed-off-by: Jacek Tomasiak <jacek.tomasiak@gmail.com>
Disable provenance and SBOM when building per-arch kata-deploy images so
each tag is a single image manifest. quay.io rejects pushing multi-arch
manifest lists that include attestation manifests (400 manifest invalid).
Add a note in the release script documenting this.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The default RuntimeClass (e.g. kata) is meant to point at the default shim
handler (e.g. kata-qemu-$tee). We were building it in a separate block and
only sometimes adding the same TEE nodeSelectors as the shim-specific
RuntimeClass, leading to kata ending up without the SE/SNP/TDX
nodeSelector while kata-qemu-$tee had it.
The fix is to stop duplicating the RuntimeClass definition and use a
single template that renders one RuntimeClass (name, handler, overhead,
nodeSelectors).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When NFD is detected (deployed by the chart or existing in the cluster),
apply shim-specific nodeSelectors only for TEE runtime classes (snp,
tdx, and se).
Non-TEE shims keep existing behavior (e.g. runtimeClass.nodeSelector for
nvidia GPU from f3bba0885 is unchanged).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Per-arch images were failing publish-multiarch-manifest with 'X is a manifest
list' because Buildx now enables attestations by default, so each arch tag
became an image index. Use 'docker buildx imagetools create' instead of
'docker manifest create' so we can merge those indexes into the final
multi-arch manifest while keeping provenance and SBOM on per-arch images.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
We depend on the GPU Operator v26.3 release. Although we have been
testing with it, it is not yet publicly available, which would break
anyone actually trying to use the GPU runtime classes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enhance k8s-configmap.bats and k8s-credentials-secrets.bats to test that ConfigMap and Secret updates propagate to volume-mounted pods.
- Enhanced k8s-configmap.bats to test ConfigMap propagation
* Added volume mount test for ConfigMap consumption
* Added verification that ConfigMap updates propagate to volume-mounted pods
- Enhanced k8s-credentials-secrets.bats to test Secret propagation
* Added verification that Secret updates propagate to volume-mounted pods
Fixes: #8015
Signed-off-by: Ajay Victor <ajvictor@in.ibm.com>
In containerd config v3 the CRI plugin is split into runtime and images,
and setting the snapshotter only on the runtime plugin is not enough for image
pull/prepare.
The images plugin must have runtime_platform.<runtime>.snapshotter so it
uses the correct snapshotter per runtime (e.g. nydus, erofs).
A PR on the containerd side is open so we can rely on the runtime plugin
snapshotter alone: https://github.com/containerd/containerd/pull/12836
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
In CI the default GPG keyring is often read-only or missing, so
'gpg --import' of the cached keyring fails and verification cannot
succeed. Use a temporary GNUPGHOME for the import and verification steps,
so the cached gperf can be verified without writing to the system keyring.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The CC runtime classes kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu-tdx
are mutually exclusive with kata-qemu-nvidia-gpu, as dictated by the GPU
CC mode setting. In order to properly support a cluster that has both CC and
non-CC nodes, we use a node selector so that scheduling is consistent with the
GPU mode. The GPU Operator sets the label nvidia.com/cc.ready.state=[true, false]
to indicate the GPU CC mode setting.
Fixes: #12431
Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
genpolicy pulls image manifests from nvcr.io to generate policy and was
failing with 'UnauthorizedError' because it had no registry credentials.
Genpolicy (src/tools/genpolicy) uses docker_credential::get_credential()
in registry.rs, which reads from DOCKER_CONFIG/config.json. Add
setup_genpolicy_registry_auth() to create a Docker config with nvcr.io
auth (NGC_API_KEY) and set DOCKER_CONFIG before running genpolicy so it
can authenticate when pulling manifests.
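For reference, the lookup genpolicy performs boils down to the
docker_credential API, roughly as in this sketch (the helper name is
illustrative):

```rust
use docker_credential::{get_credential, DockerCredential};

// Reads the credential that $DOCKER_CONFIG/config.json (or the default
// ~/.docker/config.json) holds for a registry such as "nvcr.io".
fn registry_credential(server: &str) -> Option<(String, String)> {
    match get_credential(server) {
        Ok(DockerCredential::UsernamePassword(user, pass)) => Some((user, pass)),
        // Identity tokens and lookup errors are out of scope for this sketch.
        _ => None,
    }
}
```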
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add push-oras-tarball-cache workflow that runs on push to main when
versions.yaml changes (and on workflow_dispatch). It populates the
ghcr.io ORAS cache with gperf and busybox tarballs from versions.yaml.
Remove the push_to_cache call from download-with-oras-cache.sh since
it was never triggered in CI. Cache population is now done solely by the
new workflow and by populate-oras-tarball-cache.sh when run manually.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add hypervisor devices to the cgroup separately, so that irrelevant
warnings are omitted and we fail if none of the devices are available.
Also fix a test case to reload removed kernel modules for later test
cases, and skip some tests on ARM due to the lack of virtualization support.
Fixes: #6656
Signed-off-by: Balint Tobik <btobik@redhat.com>