Docker 26+ configures container networking (veth pair, IP addresses,
routes) after task creation rather than before. Kata's endpoint scan
runs during CreateSandbox, before the interfaces exist, resulting in
VMs starting without network connectivity (no -netdev passed to QEMU).
Add RescanNetwork() which runs asynchronously after the Start RPC.
It polls the network namespace until Docker's interfaces appear, then
hotplugs them to QEMU and informs the guest agent to configure them
inside the VM.
Additional fixes:
- mountinfo parser: find fs type dynamically instead of hardcoded
field index, fixing parsing with optional mount tags (shared:,
master:)
- IsDockerContainer: check CreateRuntime hooks for Docker 26+
- DockerNetnsPath: extract netns path from libnetwork-setkey hook
args with path traversal protection
- detectHypervisorNetns: verify PID ownership via /proc/pid/cmdline
to guard against PID recycling
- startVM guard: rescan when len(endpoints)==0 after VM start
Fixes: #9340
Signed-off-by: llink5 <llink5@users.noreply.github.com>
Onboard a test case for deploying a NIM service using the NIM
operator. We install the operator helm chart on the fly, as this is
a fast operation, spinning up a single operand. Once a NIM service
is scheduled, the operator creates a deployment with a single pod.
For now, the TEE-based flow uses an allow-all policy. In future
work, we aim to support generating pod security policies for the
scenario where NIM services are deployed and the pod manifest is
generated on the fly.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
To run all the tests that run in CI, we need to enable external
tests. This can be a bit tricky, so add it to our documentation.
Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
Replace the deprecated CAA deployment with the helm one. Note that this
also installs the CAA mutating webhook, which wasn't installed before.
Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
An FC update caused bad requests for the runtime-rs runtime when
specifying the vcpu count and block rate limiter fields.
Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
Add functional tests that cover two previously untested kata-deploy
behaviors:
1. Restart resilience (regression test for #12761): deploys a
long-running kata pod, triggers a kata-deploy DaemonSet restart via
rollout restart, and verifies the kata pod survives with the same
UID and zero additional container restarts.
2. Artifact cleanup: after helm uninstall, verifies that RuntimeClasses
are removed, the kata-runtime node label is cleared, /opt/kata is
gone from the host filesystem, and containerd remains healthy.
3. Artifact presence: after install, verifies /opt/kata and the shim
binary exist on the host, RuntimeClasses are created, and the node
is labeled.
Host filesystem checks use a short-lived privileged pod with a
hostPath mount to inspect the node directly.
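A sketch of such a helper pod (pod name, image, and the checked path are illustrative, not the test's actual manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: host-inspect
spec:
  nodeName: $NODE            # pin to the node under test
  restartPolicy: Never
  containers:
  - name: inspect
    image: busybox           # any minimal image with a shell
    command: ["sh", "-c", "ls /host/opt/kata"]
    securityContext:
      privileged: true
    volumeMounts:
    - name: host
      mountPath: /host
  volumes:
  - name: host
    hostPath:
      path: /                # host root visible under /host
```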
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When a kata-deploy DaemonSet pod is restarted (e.g. due to a label
change or rolling update), the SIGTERM handler runs cleanup which
unconditionally removes kata artifacts and restarts containerd. This
causes containerd to lose the kata shim binary, crashing all running
kata pods on the node.
Fix this by implementing a three-stage cleanup decision:
1. If this pod's owning DaemonSet still exists (exact name match via
DAEMONSET_NAME env var), this is a pod restart — skip all cleanup.
The replacement pod will re-run install, which is idempotent.
2. If this DaemonSet is gone but other kata-deploy DaemonSets still
exist (multi-install scenario), perform instance-specific cleanup
only (snapshotters, CRI config, artifacts) but skip shared
resources (node label removal, CRI restart) to avoid disrupting
the other instances.
3. If no kata-deploy DaemonSets remain, perform full cleanup including
node label removal and CRI restart.
The Helm chart injects a DAEMONSET_NAME environment variable with the
exact DaemonSet name (including any multi-install suffix), ensuring
instance-aware lookup rather than broadly matching any DaemonSet
containing "kata-deploy".
Fixes: #12761
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Newer k3s releases (v1.34+) no longer include "k3s" in the containerd
version string at all (e.g. "containerd://2.2.2-bd1.34" instead of
"containerd://2.1.5-k3s1"). This caused kata-deploy to fall through to
the default "containerd" runtime, configuring and restarting the system
containerd service instead of k3s's embedded containerd — leaving the
kata runtime invisible to k3s.
Fix by detecting k3s/rke2 via their systemd service names (k3s,
k3s-agent, rke2-server, rke2-agent) rather than parsing the containerd
version string. This is more robust and works regardless of how k3s
formats its containerd version.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Remove the workaround that wrote a synthetic containerd V3 config
template for k3s/rke2 in CI. This was added to test kata-deploy's
drop-in support before the upstream k3s/rke2 patch shipped. Now that
k3s and rke2 include the drop-in imports in their default template,
the workaround is no longer needed and breaks newer versions.
Removed:
- tests/containerd-config-v3.tmpl (synthetic Go template)
- _setup_containerd_v3_template_if_needed() and its k3s/rke2 wrappers
- Calls from deploy_k3s() and deploy_rke2()
This reverts the test infrastructure part of a2216ec05.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's ensure that, in case nydus-snapshotter crashes for one reason or
another, the service is restarted.
This follows containerd's approach and avoids manual intervention on
the node.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's relax our RequiredBy and use a WantedBy in the nydus systemd unit
file, as in case of a nydus crash containerd would otherwise also be
taken down, causing the node to become NotReady.
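Combined with the restart behaviour from the previous commit, the resulting unit could look roughly like this sketch (unit, path, and option values illustrative, not the actual file):

```ini
# nydus-snapshotter.service (sketch)
[Unit]
Description=nydus snapshotter

[Service]
ExecStart=/usr/local/bin/containerd-nydus-grpc
# Restart on crash, mirroring containerd's own unit, so no manual
# intervention is needed on the node.
Restart=on-failure
RestartSec=5

[Install]
# Was RequiredBy=containerd.service: with a hard requirement, a nydus
# crash stops containerd too and the node goes NotReady. Wants keeps
# containerd up while nydus restarts.
WantedBy=containerd.service
```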
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
couldn't initialise QMP: Connection reset by peer (os error 104)
Caused by:
Connection reset by peer (os error 104)
qemu stderr: "qemu-system-ppc64: Maximum memory size 0x80000000 is not aligned to 256 MiB"
When the default max memory was assigned according to the
available host memory, it failed with the above error.
Align the memory values with the block size of 256 MiB on ppc64le.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
While attaching the tap device, it fails on ppc64le with EBADFD:
"cannot create tap device. File descriptor in bad state (os error 77)": unknown
Refactor the ioctl call to use the standard libc::TUNSETIFF constant.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
After the QEMU VM is booted, storing the guest details fails to set
capabilities, as this is not yet implemented for QEMU. This change
adds a default implementation for it.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
Use the container image layer storage feature for the
k8s-nvidia-nim.bats test pod manifests. This reduces the pods'
memory requirements.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
- trusted-storage.yaml.in: use $PV_STORAGE_CAPACITY and
$PVC_STORAGE_REQUEST so that PV/PVC size can vary per test.
- confidential_common.sh: add optional size (MB) argument to
create_loop_device.
- k8s-guest-pull-image.bats: pass PV_STORAGE_CAPACITY and
PVC_STORAGE_REQUEST when generating storage config.
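An illustrative excerpt of the parameterized template, assuming envsubst-style substitution (object names and surrounding fields simplified, not the actual file contents):

```yaml
# trusted-storage.yaml.in (excerpt)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: trusted-pv
spec:
  capacity:
    storage: $PV_STORAGE_CAPACITY    # varies per test
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trusted-pvc
spec:
  resources:
    requests:
      storage: $PVC_STORAGE_REQUEST  # varies per test
```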
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The following differences are observed between containerd 1.x and 2.x:
```
[plugins.'io.containerd.snapshotter.v1.devmapper']
snapshotter = 'overlayfs'
```
and
```
[plugins."io.containerd.snapshotter.v1.devmapper"]
snapshotter = "overlayfs"
```
The current devmapper configuration only works with double quotes.
Make it work with both single and double quotes via tomlq.
In the default configuration for containerd 2.x, the following
configuration block is missing:
```
[[plugins.'io.containerd.transfer.v1.local'.unpack_config]]
platform = "linux/s390x" # system architecture
snapshotter = "devmapper"
```
Ensure the configuration block is added for containerd 2.x.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The govmm workflow isn't run by us, and it and the other CI files
are just legacy from when govmm was a separate repo, so let's clean up
this debt rather than having to update them frequently.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Update the action to resolve the following warning in GHA:
> Node.js 20 actions are deprecated. The following actions are running
> on Node.js 20 and may not work as expected:
> actions/checkout@11bd71901b.
> Actions will be forced to run with Node.js 24 by default starting June 2nd, 2026.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Mermaid code makes it simple and flexible to update the pictures and
diagrams. It also removes the legacy PNG pictures, reducing the
kata-static release file size.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add all nine library crates that are missing from the workspace:
(1) kata-types with annotations, hypervisor configs, and K8s utilities.
(2) kata-sys-util with all sub-modules: cpu, device, fs, hooks, k8s,
mount, netns, numa, pcilibs, protection, spec, validate.
(3) protocols with ttrpc bindings: agent, health, remote, csi, oci,
confidential_data_hub.
(4) runtime-spec with OCI container state types and namespace constants.
(5) shim-interface with RESTful API and Unix socket path.
(6) logging with slog framework features: JSON, journal, filtering.
(7) safe-path with security-focused path resolution utilities.
(8) mem-agent with memory management: memcg, compact, psi.
(9) test-utils with privilege and KVM test macros.
Also, uniformly adopt TOCTOU in place of the redundant TOCTTOU
abbreviation.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add comprehensive hypervisor support table (Dragonball, QEMU,
Cloud Hypervisor, Firecracker, Remote). Document all runtime handlers
(VirtContainer, LinuxContainer, WasmContainer) and resource types.
List all configuration files including CoCo variants (TDX, SNP, SE).
Add shim-ctl crate to crates table for development tooling reference.
Add Feature Flags section documenting dragonball and cloud-hypervisor
options.
Simplify and restructure content for clarity while preserving technical
accuracy.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Comprehensive rewrite of docs/design/virtualization.md to improve
clarity, completeness, and usability.
This document now serves as the authoritative guide for
understanding and selecting hypervisors in Kata Containers deployments.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add 'Choose a Hypervisor', 'Hypervisor Configuration Files', and
'Hypervisor Versions' sections to virtualization.md.
Key changes:
- Integrate hypervisor comparison table from hypervisors.md
- Add configuration file reference table for both go and rust runtimes
- Add current hypervisor versions from versions.yaml:
- Cloud Hypervisor: v51.1
- Firecracker: v1.12.1
- QEMU: v10.2.1
- StratoVirt: v2.3.0
- Dragonball: builtin (part of rust runtime)
- Preserve original structure documenting each hypervisor's device model
and features
- Add reference links for all hypervisors
This consolidates hypervisor selection guidance and version information
into a single comprehensive virtualization design document.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As the document is just for CRI-O, we need to remove containerd-related
settings from it and make it clear for users.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>