This allows adding different runners in case the powerful one goes down
for one reason or another.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Systemd-managed cgroups use the slice:prefix:name format, which is
not a filesystem path. Calling MoveTo() on such paths fails with
"invalid group path" and can abort cleanup before Delete() runs.
In some cases, this causes pod teardown delays.
Skip MoveTo for systemd-formatted sandbox/overhead cgroup paths when
sandbox_cgroup_only is true; systemd moves tasks on unit deletion.
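The check described above can be sketched as follows. This is a minimal Python illustration, not the runtime's actual code: the helper names `is_systemd_cgroup_path` and `cleanup_steps`, the step labels, and the exact matching rule (three colon-separated fields, the first ending in `.slice`) are assumptions made for the example.

```python
def is_systemd_cgroup_path(path: str) -> bool:
    """Return True if path looks like slice:prefix:name (assumed rule)."""
    parts = path.split(":")
    if len(parts) != 3:
        return False
    slice_name, _prefix, name = parts
    # Assumption: the slice field ends in ".slice" and the name is non-empty.
    return slice_name.endswith(".slice") and name != ""


def cleanup_steps(path: str, sandbox_cgroup_only: bool) -> list:
    """List the (hypothetical) cleanup steps that would run for a cgroup path."""
    steps = []
    # Skip the move for systemd-formatted paths when sandbox_cgroup_only is
    # set; systemd moves the tasks itself when the unit is deleted.
    if not (sandbox_cgroup_only and is_systemd_cgroup_path(path)):
        steps.append("move_to")
    steps.append("delete")
    return steps
```

With this shape, a systemd-formatted path always reaches the delete step, because the move that would fail is never attempted.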
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
With the update of NVRC to v0.1.1, cold-plug becomes by design the only
supported mode, so resolve the remaining references to hot-plug.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Enable post-install verification in kata-deploy CI tests. When
HELM_VERIFY_DEPLOYMENT is set, a simple verification pod is created
that runs with the Kata runtime to confirm deployment succeeded.
The verification pod prints kernel info and exits - success indicates
the Kata runtime is properly configured and functional.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add optional verification that runs after kata-deploy installation.
When a pod spec is provided via --set-file verification.pod=<file>,
a verification job runs after install/upgrade to validate deployment.
The user is fully responsible for the verification pod content:
- Pod name, runtimeClassName, annotations, and verification logic
- Pod must exit 0 on success, non-zero on failure
The verification job simply:
1. Waits for kata-deploy DaemonSet to be ready
2. Applies the user-provided pod spec
3. Waits for the pod to complete
4. Shows logs and cleans up
Usage:
  helm install kata-deploy ... \
    --set-file verification.pod=/path/to/your-pod.yaml
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
To unblock the release, move the ppc64le job that publishes the kata
payload after push to an alternate (IBM-owned) runner.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
The new NVRC version works for both CC and non-CC use cases, so
--feature confidential is no longer needed.
Bump versions.yaml and adjust deployment instructions.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Disable NVDIMM. With GPU passthrough, NVDIMM would create a read-only
file-backed memory region. QEMU tries to DMA-map guest memory for the
GPU, which results in a mapping error:
memory listener initialization failed: Region mem0:
vfio_container_dma_map ... -22 (Invalid argument).
For the CC configs, NVDIMM is disabled by default in qemu_amd64.go
with a warning, but we also explicitly disable the setting in the
shim configuration file.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
We don't need to store the kernel headers anymore. We do need to store
the kernel modules, instead.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We previously determined the driver version with unreliable file-based
detection; with versions.yaml there is now a single source of truth.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We need to package the built modules so that the rootfs can
consume them. We package the whole
/lib/modules/$(uname -r) directory with strip=2.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We want deterministic behaviour, with only one valid
driver version accepted via versions.yaml.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We never actually installed yq for the kernel build. Some paths
use yq but were never hit. For the GPU use case we need to read
values from versions.yaml.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
In preparation for CoCo v0.18.0, bump the version of image-rs we use in
agent-ctl to match what we have in versions.yaml.
Drop the snapshotter-overlayfs feature. This was dropped from image-rs
when we removed enclave-cc support.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
Before cutting the Kata release that will be used with CoCo v0.18.0,
let's bump the versions of Trustee and guest-components to latest.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
This is needed as the 580 driver doesn't build against 6.18.x, and the
590 driver is not yet fully working for our case, thus we stick to the
previous version that worked before.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Bump both the kernel and kernel-confidential versions from v6.12.x and
v6.16.x to v6.18.4, aligning with the new LTS release.
Kernel 6.18 introduced several configuration changes that required
updates to our kernel config fragments:
* CRYPTO_FIPS dependencies changed:
- In 6.12: depended on !CRYPTO_MANAGER_DISABLE_TESTS
- In 6.18: now depends on CRYPTO_SELFTESTS (which requires EXPERT)
Added CONFIG_EXPERT=y and CONFIG_CRYPTO_SELFTESTS=y to crypto.conf
to satisfy the new dependency chain.
* CONFIG_EXPERT is a naughty one, as it disables / enables a bunch
of things behind one's back, probably just to prove a point that
it is for experts ;-) ... regardless, a reasonable number of
options had to be re-added in order to make sure nothing ends
up broken.
* Legacy iptables support:
Kernel 6.18 requires explicit legacy xtables/iptables configs for
IP_NF_* options. Added CONFIG_NETFILTER_XTABLES_LEGACY,
CONFIG_IP_NF_IPTABLES_LEGACY, and CONFIG_IP6_NF_IPTABLES_LEGACY
to netfilter.conf.
* Module signing dependencies:
Added CONFIG_MODULES=y and other required dependencies to
module_signing.conf to ensure MODULE_SIG can be properly enabled.
* Whitelist updates:
- Added CONFIG_NF_CT_PROTO_DCCP (removed in 6.18+)
- Added CONFIG_CRYPTO_SELFTESTS, CONFIG_NETFILTER_XTABLES_LEGACY,
CONFIG_IP_NF_IPTABLES_LEGACY, CONFIG_IP6_NF_IPTABLES_LEGACY
(added in 6.18+, not present in older kernels like 6.12)
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
A few minor changes to the Zensical config that make navigation easier. Also
fixed a couple of bugs with local serving and added some quality-of-life
features to Zensical.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
This commit adds a GitHub workflow for building a GitHub Pages site for the markdown
files in the docs/ directory. Zensical is a new markdown-based static site generation
framework built by the creators of Material for MkDocs. https://zensical.org/
This commit does not clean the doc structure, so site navigation is initially going to
be messy.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Remove the agent hotplug timeout parameter from the kernel
command line. Having shifted to VFIO cold-plug, this parameter is
no longer needed.
Remove the no longer required parameter for TDX and thus align the
SNP and TDX configurations.
Add a parameter to prevent the kernel from mounting the /dev tmpfs.
NVRC and, later on, kata-agent attempt this mount themselves. While
kata-agent does not panic when mounting /dev fails, NVRC makes
mounting /dev a hard requirement.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
set_container_command() previously appended command arguments
one-by-one with '.command += [...]'. This made the helper
non-idempotent and could lead to unexpected command arrays when it
was invoked multiple times.
Update the helper to set the full command array in a single yq v4
expression and print the target YAML path plus the command being
applied to simplify debugging when tests fail.
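The difference between appending and setting can be illustrated with a small Python sketch. It only models the YAML document as nested dicts; the function names and the `.spec.containers[N].command` path are assumptions for the example and are not taken from the test helper itself.

```python
import copy


def append_command(pod: dict, args: list, container: int = 0) -> dict:
    """Model of the old behaviour: extend the existing command array."""
    out = copy.deepcopy(pod)
    out["spec"]["containers"][container].setdefault("command", []).extend(args)
    return out


def set_command(pod: dict, args: list, container: int = 0) -> dict:
    """Model of the new behaviour: replace the whole command array at once."""
    out = copy.deepcopy(pod)
    out["spec"]["containers"][container]["command"] = list(args)
    # Print the target path and the command being applied, to ease debugging.
    print(f".spec.containers[{container}].command = {list(args)}")
    return out
```

Calling `set_command` twice with the same arguments yields the same document, whereas calling `append_command` twice duplicates the arguments.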
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The pod config file created by new_pod_config() was generated via
mktemp using the template "pod-config.yaml.in.XXX", which produces
filenames that do not end with ".yaml" (e.g. pod-config.yaml.in.ABC).
If the random suffix happens to form something like ".Csv" or ".Xml",
the subsequent yq operations on the file will fail.
Some helpers and tooling assume the config path ends with ".yaml".
Switch the mktemp template to place the random suffix before the
extension so the returned path always ends with ".yaml".
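The same idea, keeping the random part in front of a fixed extension, can be sketched in Python with the standard `tempfile` module. The function name `new_pod_config_path` and the `pod-config.` prefix are chosen for illustration only; the actual helper is a shell function using mktemp.

```python
import os
import tempfile


def new_pod_config_path(directory=None) -> str:
    """Create an empty temp file whose name always ends with ".yaml"."""
    # mkstemp places its random characters between prefix and suffix,
    # so the extension is fixed regardless of the random part.
    fd, path = tempfile.mkstemp(prefix="pod-config.", suffix=".yaml", dir=directory)
    os.close(fd)
    return path
```

The caller is responsible for removing the file once it is no longer needed.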
Fixes: #12268, #12319
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This is a suggestion from Choi, so we can easily test with a specific
kubectl version and also easily understand which kubectl version is
being used in case of failure.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This image will be used by our helm charts to verify that a
kata-containers deployment is correct.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enhance the wait_for_migration implementation to reliably wait for
QEMU migration completion and avoid the previous `sleep(280ms)`
delay.
(1) Add an initial fast-path query to return immediately if
migration is already completed/failed/cancelled.
(2) Use a hard deadline to enforce timeouts deterministically.
(3) Implement adaptive polling with backoff and a maximum interval
to reduce QMP load while keeping responsiveness.
(4) Unify migration status handling and return clear errors on
failed/cancelled states.
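The four points above can be sketched as a polling loop. This is a Python illustration under stated assumptions, not the actual implementation: `query_status` is a hypothetical callable returning the migration status string, and the interval values are arbitrary.

```python
import time


def wait_for_migration(query_status, timeout_s, initial_interval_s=0.01,
                       max_interval_s=0.5, sleep=time.sleep,
                       clock=time.monotonic):
    """Poll query_status() until the migration completes or the deadline passes."""
    deadline = clock() + timeout_s          # (2) hard deadline
    interval = initial_interval_s
    while True:
        # (1) The first query happens before any sleep, so an already
        # finished migration returns immediately.
        status = query_status()
        # (4) One place decides what each terminal status means.
        if status == "completed":
            return
        if status in ("failed", "cancelled"):
            raise RuntimeError(f"migration {status}")
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"migration still {status!r} after {timeout_s}s")
        # (3) Back off exponentially, capped by max_interval_s and by the
        # time left before the deadline.
        sleep(min(interval, remaining))
        interval = min(interval * 2, max_interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.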
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Return information about the current migration process. The command's
input and output are as follows:
{ 'command': 'query-migrate', 'returns': 'MigrationInfo' }
Note that this QEMU API is only available with qapi-rs v0.15+.
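For orientation, the QMP exchange for this command can be sketched in Python. The helper names are hypothetical and the sketch does not use qapi-rs; it only shows the JSON shape of the request and how the optional `status` field of MigrationInfo could be read from a reply.

```python
import json


def build_query_migrate() -> str:
    """Serialize the QMP request for query-migrate (it takes no arguments)."""
    return json.dumps({"execute": "query-migrate"})


def parse_migration_status(reply: str):
    """Return the 'status' field of a query-migrate reply, or None if absent."""
    msg = json.loads(reply)
    if "error" in msg:
        # QMP errors carry a class and a human-readable description.
        raise RuntimeError(msg["error"].get("desc", "QMP error"))
    # 'status' is optional in MigrationInfo, so it may be missing.
    return msg["return"].get("status")
```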
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The updated versions are as follows:
```
qapi = { version = "0.15", features = ["qmp", "async-tokio-all"] }
qapi-spec = "0.3.2"
qapi-qmp = "0.15.0"
```
The update also corrects some corresponding structures.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Change the secure_storage_integrity option's default value to true.
With this, integrity protection for encrypted block device contents
will be requested from the confidential data hub by default, see the
agent's cdh_handler_trusted_storage function in rpc.rs.
This behavior can be disabled by explicitly setting the
agent.secure_storage_integrity parameter to 0 or false via kernel
command line parameters.
This will affect the trusted storage implementation for the guest-pull
mechanism, and it will affect future implementations using this code
path, such as implementations for ephemeral secure storage.
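The default-on behaviour with an explicit opt-out can be sketched as below. This is a Python illustration, not the agent's Rust code: the function name is hypothetical, and treating every value other than `0` or `false` as enabled is an assumption.

```python
def secure_storage_integrity(cmdline: str) -> bool:
    """Resolve agent.secure_storage_integrity from a kernel command line."""
    enabled = True  # new default: integrity protection is requested
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "agent.secure_storage_integrity" and sep:
            # Only an explicit 0 or false turns the option off
            # (assumption: any other value keeps it enabled).
            enabled = value not in ("0", "false")
    return enabled
```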
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
In some builds we are seeing:
```
error: could not create temp file /opt/rustup/tmp/r2xu46kwuyc7k2kr_file: Permission denied (os error 13)
```
in the agent-ctl build, so port the fix from #12313 to the tools build
to try to resolve this.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Fix deploying kata-containers using k3s. The deploy script fails with:
/opt/kata-artifacts/scripts/kata-deploy.sh: line 397: [: too many arguments
Signed-off-by: Federico A. Corazza <git@facorazza.com>
yamllint complains that there is only one space before the comment,
so add a second to prevent this annoying message showing up.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>