Update the release workflow and scripts to package and publish
the kata-lifecycle-manager Helm chart alongside kata-deploy.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This chart installs an Argo WorkflowTemplate for orchestrating controlled,
node-by-node upgrades of kata-deploy with verification and automatic
rollback on failure.
The workflow processes nodes sequentially rather than in parallel to
ensure fleet consistency. This design choice prevents ending up with a
mixed-version fleet where some nodes run the new version while others
remain on the old version. If verification fails on any node, the
workflow stops immediately, before touching the remaining nodes.
Alternative approaches considered:
- withParam loop with semaphore (max-concurrent: 1): Provides cleaner UI
with all nodes visible at the same level, but Argo's semaphore only
controls concurrency, not failure propagation. When one node fails and
releases the lock, other nodes waiting on the semaphore still proceed.
- withParam with failFast: true: Would be ideal, but Argo only supports
failFast for DAG tasks, not for steps with withParam. Attempting to use
it results in "unknown field" errors.
- Single monolithic script: Would guarantee sequential execution and
fail-fast, but loses per-node visibility in the Argo UI and makes
debugging harder.
The chosen approach uses recursive Argo templates (upgrade-node-chain)
which naturally provides fail-fast behavior because if any step in the
chain fails, the recursion stops. Despite the nesting in the Argo UI,
each node's upgrade steps remain visible for monitoring.
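A minimal sketch of the recursion pattern (template, step, and
parameter names here are assumptions, not the chart's actual
definitions; pick-next-node is a hypothetical helper):
templates:
  - name: upgrade-node-chain
    inputs:
      parameters:
        - name: nodes          # space-separated nodes still to upgrade
    steps:
      - - name: pick-next      # hypothetical helper: splits off the first node
          template: pick-next-node
          arguments:
            parameters:
              - name: nodes
                value: "{{inputs.parameters.nodes}}"
      - - name: upgrade-node   # per-node steps; a failure here fails this template
          template: upgrade-node
          arguments:
            parameters:
              - name: node
                value: "{{steps.pick-next.outputs.parameters.node}}"
      - - name: recurse        # only reached when everything above succeeded
          template: upgrade-node-chain
          when: "'{{steps.pick-next.outputs.parameters.rest}}' != ''"
          arguments:
            parameters:
              - name: nodes
                value: "{{steps.pick-next.outputs.parameters.rest}}"
Because each step group only runs after the previous one succeeds,
a failed upgrade or verification stops the recursion and therefore
the whole rollout.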
A verification pod is required to validate that Kata is functioning
correctly on each node after upgrade. The chart will fail to install
without one. Users must provide the verification pod when installing
kata-lifecycle-manager using --set-file
defaults.verificationPod=./pod.yaml. The pod can also be overridden at
workflow submission time using a base64-encoded workflow parameter.
When passing the verification pod as a workflow parameter, base64
encoding is required because multi-line YAML with special characters
does not survive the journey through Argo CLI and shell script parsing.
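For example (the workflow parameter name here is an assumption, not
necessarily the chart's actual one):
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p verification-pod-b64="$(base64 -w0 < ./my-verification-pod.yaml)"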
The workflow validates prerequisites before touching any nodes. If no
verification pod is configured, the workflow fails immediately with a
clear error message. This prevents partial upgrades that would leave
the cluster in an inconsistent state.
During helm upgrade, kata-deploy's verification is explicitly disabled
(--set verification.pod="") because:
- kata-deploy's verification is cluster-wide, designed for initial install
- kata-lifecycle-manager does per-node verification with proper
placeholder substitution (${NODE}, ${TEST_POD})
- Running kata-deploy's verification on each node would be redundant and
could fail due to unsubstituted placeholders
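A sketch of the helm invocation this implies (the chart reference and
namespace are illustrative, and the target-version selection is
elided):
helm upgrade kata-deploy <kata-deploy-chart> \
  --namespace kata-system \
  --reuse-values \
  --set verification.pod=""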
On verification failure, the workflow triggers an automatic helm
rollback, waits for kata-deploy to stabilize, uncordons the node, and
marks it with a rolled-back status annotation. The workflow then exits
with an error so the failure is clearly visible.
The upgrade flow per node:
1. Prepare: Annotate node with upgrade status
2. Cordon: Mark node unschedulable
3. Drain (optional): Evict pods if enabled
4. Upgrade: Run helm upgrade with --reuse-values
5. Wait: Wait for kata-deploy DaemonSet pod ready
6. Verify: Run verification pod with substituted placeholders
7. Complete: Uncordon and update annotations
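A hedged shell approximation of steps 1-3 and 5-7 (the annotation key,
namespace, and DaemonSet label are assumptions; step 4 is the helm
upgrade sketched earlier):
NODE=worker-1
kubectl annotate node "$NODE" katacontainers.io/upgrade-status=in-progress --overwrite
kubectl cordon "$NODE"
# optional step 3: kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# step 4: helm upgrade as sketched above
POD=$(kubectl -n kata-system get pod -l name=kata-deploy \
  --field-selector spec.nodeName="$NODE" -o name)
kubectl -n kata-system wait --for=condition=Ready "$POD" --timeout=10m
# step 6: apply the verification pod with ${NODE} / ${TEST_POD} substituted
kubectl uncordon "$NODE"
kubectl annotate node "$NODE" katacontainers.io/upgrade-status=upgraded --overwrite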
Draining is disabled by default because running Kata VMs continue using
their in-memory binaries after upgrade. Only new workloads use the
upgraded binaries. Users who prefer to evict all workloads before
maintenance can enable draining.
Known limitations:
- Fleet consistency during rollback: Because kata-deploy uses a DaemonSet
that is updated cluster-wide, nodes that pass verification are
uncordoned and can accept new workloads before all nodes are verified.
If a later node fails verification and triggers a rollback, workloads
that started on already-verified nodes continue running with the new
version's in-memory binaries while the cluster reverts to the old
version. This is generally acceptable since running VMs continue
functioning and new workloads use the rolled-back version. A future
improvement could implement a two-phase approach that cordons all nodes
upfront and only uncordons after all verifications pass.
The chart requires Argo Workflows v3.4+ and uses multi-arch container
images supporting amd64, arm64, s390x, and ppc64le.
Usage:
# Install kata-lifecycle-manager with verification pod (required)
helm install kata-lifecycle-manager ./kata-lifecycle-manager \
  --set-file defaults.verificationPod=./my-verification-pod.yaml
# Label nodes for upgrade
kubectl label node worker-1 katacontainers.io/kata-lifecycle-manager-window=true
# Trigger upgrade
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.25.0 \
  -p node-selector="katacontainers.io/kata-lifecycle-manager-window=true" \
  -p helm-namespace=kata-system
# Monitor progress
argo watch @latest -n argo
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
1. Add disable_block_device_use to the CLH settings file, for parity
with the already existing QEMU setting.
2. Set DEFDISABLEBLOCK := true by default for both QEMU and CLH. After
this change, Kata Guests will by default use virtio-fs to access
container rootfs directories from their Hosts. Hosts that were
designed to use Host block devices attached to the Guests can
re-enable these rootfs block devices by changing the value of
disable_block_device_use back to false in their settings files (see
the sketch after this list).
3. Add a test using a container image without any rootfs layers. Depending
on the container runtime and image snapshotter being used, the empty
container rootfs image might get stored on a host block device that
cannot be safely hotplugged to a guest VM, because the host is using
the same block device.
4. Add a block device hotplug safety warning to the Kata Shim
configuration files.
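For point 2, re-enabling looks like this in the settings file (the CLH
section is shown; QEMU is analogous):
[hypervisor.clh]
# The default after this change is true (virtio-fs for the container rootfs).
# Set back to false to use Host block devices for the rootfs:
disable_block_device_use = false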
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Cameron McDermott <cameron@northflank.com>
The enable_debug parameter was explicitly set to false rather than
being commented out (e.g., # enable_debug = true). The previous
enabling logic did not account for this explicit setting, so it
failed to take effect. This commit updates the matching logic to
correctly handle and toggle the explicit false value.
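A hedged sketch of the kind of matching now needed (the file path is
illustrative):
# handle both the commented-out and the explicit-false forms:
sed -i \
  -e 's/^# *enable_debug *=.*/enable_debug = true/' \
  -e 's/^enable_debug *= *false/enable_debug = true/' \
  /etc/kata-containers/configuration.toml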
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Let's just point to the official documentation rather than explaining
exactly how to deploy (the current text was very outdated).
Removing the fluentd / minikube examples is out of scope for this commit.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
With the update of NVRC to v0.1.1, cold-plug is by design the only
supported mode, so resolve the remaining references to hot-plug.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
A few minor changes to the Zensical config that make navigation easier.
Also fixed a couple of bugs with local serving and added some
quality-of-life features to Zensical.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Create a new page for a reference implementation for Kubernetes
using QEMU, the Go shim, and an NVIDIA rootfs. The new page
contains information on:
- components involved in the NVIDIA (TEE) GPU scenario
- orchestration flow for GPU passthrough scenarios
- deployment guidance
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
- Apply a few structural/grouping changes and improve flow
- Group build sections together
- Move usage examples to last section
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
While the use-case of Intel QuickAssist (QAT) accelerated crypto
and/or compression with k8s and Kata Containers is still valid,
the setup instructions are outdated:
Starting with Intel Xeon Gen4 (Sapphire Rapids), the QAT driver
stack moved to in-tree drivers without a separate SR-IOV VF
driver.
Drop all the setup instructions but keep the use-cases doc
for reference. Users wanting to enable the use-case should consult
the Intel QAT Device plugins or Intel QAT DRA driver authors.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Clippy recommends that format args be inlined for
better clarity, so ensure our docs follow suit.
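For example, under the uninlined_format_args lint:
fn main() {
    let config = "configuration.toml";
    // Clippy's uninlined_format_args lint prefers inlined args:
    println!("loading {config}");
    // over the older positional form:
    println!("loading {}", config);
}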
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add the block device specific annotations, dedicated within
runtime-rs, for num_queues and queue_size to the document to help
users set the two parameters.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Create an initial version of our toolchain policy as agreed in
Architecture Committee meetings and the PTG
Fixes: #9841
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The Firecracker installation docs had an outdated containerd
configuration for the devmapper plugin. This commit updates the
instructions so that they are compatible with more recent versions
of containerd.
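For reference, the modern form of that configuration (pool name, path,
and size are illustrative):
[plugins."io.containerd.snapshotter.v1.devmapper"]
  pool_name = "devpool"
  root_path = "/var/lib/containerd/devmapper"
  base_image_size = "10GB"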
Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
- Explained the concept and benefits of VM templating
- Provided step-by-step instructions for enabling VM templating
- Detailed the setup for using a snapshotter in place of VirtioFS for template-based VM creation
- Added performance test results comparing template-based and direct VM creation
Signed-off-by: ssc <741026400@qq.com>
kata-deploy helm chart is *THE* way to deploy kata-containers on
kubernetes environments, and kubernetes environments are basically the
only reliably tested deployment we have.
For now, let's just drop documentation that is outdated / incorrect, and
in the future let's ensure we update the linked docs as we work on
update / upgrade support for the helm chart.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
As the kata-deploy helm chart has been the only way we've been testing
kata-containers deployment as part of our CI, it's time to finally get
rid of the kustomize yamls and avoid having to maintain two different
methods (one of which is not tested).
Here I removed:
* kata-deploy yamls and kustomize yamls
* kata-cleanup yamls and kustomize yamls
* kata-rbac yamls and kustomize yamls
* README.md for the kustomize yamls
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Document that privileged containers with
privileged_without_host_devices=false are not generally supported.
When you try the above, the runtime will pass all the host devices to Kata
in the OCI spec, and Kata will fail to create the container for various
reasons depending on the setup, e.g.:
- Attempting to hotplug uninitialized loop devices.
- Attempting to remount /dev devices on themselves when the agent had
already created them as default devices (e.g. /dev/full).
- "Conflicting device updates" errors.
- And more...
privileged_without_host_devices was originally created to support
Kata [1][2] and lots of people are having issues when it's set to
false [3].
[1] https://github.com/kata-containers/runtime/issues/1568
[2] https://github.com/containerd/cri/pull/1225
[3] https://github.com/kata-containers/kata-containers/issues?q=is%3Aissue%20%20in%3Atitle%20privileged
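The corresponding containerd CRI setting (the containerd 1.x plugin
path is shown; the runtime handler name is illustrative):
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
  privileged_without_host_devices = true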
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
This change crystallizes and simplifies the current handling of /dev
hostPath mounts with virtually no functional change.
Before this change:
- If a mount DESTINATION is in /dev and it is a non-regular file on the HOST,
the shim passes the OCI bind mount as is to the guest (e.g.
/dev/kmsg:/dev/kmsg). The container rightfully sees the GUEST device.
- If the mount DESTINATION does not exist on the host, the shim relies on
k8s/containerd to automatically create a directory (i.e. a non-regular file) on
the HOST. The shim then also passes the OCI bind mount as is to the guest. The
container rightfully sees the GUEST device.
- For other /dev mounts, the shim passes the device major/minor to the guest
over virtio-fs. The container rightfully sees the GUEST device.
After this change:
- If a mount SOURCE is in /dev and it is a non-regular file on the HOST,
the shim passes the OCI bind mount as is to the guest. The container
rightfully sees the GUEST device.
- The shim no longer relies on k8s/containerd to create missing mount
directories. Instead it explicitly handles missing mount SOURCES, and
treats them like the previous bullet point.
- The shim no longer uses virtio-fs to pass /dev device major/minor to the
guest, instead it passes the OCI bind mount as is.
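For illustration, the kind of OCI spec mount entry that is now passed
through as is:
{
  "destination": "/dev/kmsg",
  "type": "bind",
  "source": "/dev/kmsg",
  "options": ["rbind", "rw"]
}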
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
This commit addresses an issue where base64 output, when used with a
default configuration, would introduce newlines, causing decoding to
fail on the runtime.
The fix ensures base64 output is a single, continuous line using the -w0
flag. This guarantees the encoded string is a valid Base64 sequence,
preventing potential runtime errors caused by invalid characters.
Note that when you use the base64 command without any parameters, it
automatically wraps the output with newlines, usually every 76 chars.
In contrast, base64 -w0 explicitly tells the command not to add any
newlines (-w for wrap, and 0 for a width of zero), which results in a
continuous string with no whitespace.
This is a critical distinction because if you pass a Base64 string with
newlines to a runtime, it may be treated as an invalid string, causing
the decoding process to fail.
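To illustrate (file name illustrative):
# default: output is wrapped with newlines every 76 characters
base64 pod.yaml
# -w0: one continuous line, safe to pass through CLI and shell parsing
base64 -w0 pod.yaml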
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Although the compression ratio is not as good as with xz, it's way
faster to compress / uncompress, and it's "good enough".
This change is not small, but it's still self-contained, and has to get
in at once, in order to help bisects in the future.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
After moving the guest pull abilities to CDH, the guest pull document
should be updated to reflect the new workflow.
Also, replace the PNG diagram with a Mermaid one for better
maintainability.
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
We need more, and more accurate, documentation. Let's start
by providing a Helm Chart install doc and, as a second
step, remove the kustomize steps.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Co-authored-by: Steve Horsman <steven@uk.ibm.com>
Since 3.12 we're shipping the helm chart by default
with each release. Update the documentation to use helm rather
than the kata-deploy manifests.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
This enables guest pull via config, without the need for any external
snapshotter. When the config enables runtime.experimental_force_guest_pull,
instead of relying on annotations to select the way to share the root FS,
we always use guest pull.
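The corresponding settings-file entry:
[runtime]
experimental_force_guest_pull = true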
Co-authored-by: Markus Rudy <mr@edgeless.systems>
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
Add how-to-use-memory-agent.md (how to use mem-agent to decrease the
memory usage of Kata Containers) to the docs.
Fixes: #11013
Signed-off-by: Hui Zhu <teawater@gmail.com>
Kata Containers has support for both the IBM Secure Execution trusted
execution environment and the IBM Crypto Express hardware security
module (used via the Adjunct Processor bus), but using them together
requires specific steps.
In Secure Execution, the Acceleration and Enterprise PKCS11 modes of
Crypto Express are supported. Both modes require the domain to be
_bound_ in the guest, and the latter also requires the domain to be
_associated_ with a _guest secret_. Guest secrets must be submitted to
the ultravisor from within the guest.
Each EP11 domain has a master key verification pattern (MKVP) that can
be established at HSM setup time. The guest secret and its ID are to
be provided at `/vfio_ap/{mkvp}/secret` and
`/vfio_ap/{mkvp}/secret_id` via a key broker service respectively.
Bind each domain, and for each EP11 domain,
- get the secret and secret ID from the addresses above,
- submit the secret to the ultravisor,
- find the index of the secret corresponding to the ID, and
- associate the domain to the index of this secret.
The s390_pv_core crate is used to bind, to add the secret, to parse
the info about the domain, and to associate. The code from this crate
also performs the AP online check, which can therefore be removed
from here.
Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
Referenced the AMD developer page for the latest SEV firmware.
Updated the instructions to point to the upstream 6.11 kernel or later.
Referenced sev-utils and the AMDESE fork for kernel setup.
Signed-off-by: Ryan Savino <ryan.savino@amd.com>
The Clear Linux rootfs is not being tested anywhere, and it seems Intel
doesn't have the capacity to review the PRs related to it (combined
with the lack of interest from the rest of the community in reviewing
PRs that are specific to this untested rootfs).
With this in mind, I'm suggesting we drop Clear Linux support and focus
on what we can actually maintain.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
This PR updates the virtualization document by removing a url link
which is no longer valid.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Add GPU annotations for the remote hypervisor to help
with selecting the right instance based on the number of GPUs
and the model.
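As a hedged illustration (the annotation keys and runtime class name
are assumptions following the usual io.katacontainers.config.hypervisor
naming; check the annotations doc for the exact keys):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  annotations:
    io.katacontainers.config.hypervisor.default_gpus: "2"
    io.katacontainers.config.hypervisor.default_gpu_model: "H100"
spec:
  runtimeClassName: kata-remote
  containers:
    - name: workload
      image: my-gpu-workload:latest   # illustrative image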
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
This PR removes some qemu information which is no longer valid, as
it refers to the tests repository and to kata 1.x.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
- Reflect the need to update the versions in the Helm Chart
- Add the lock branch instruction
- Add clarity about the permissions needed to complete tasks
Signed-off-by: stevenhorsman <steven@uk.ibm.com>