Commit Graph

5124 Commits

Author SHA1 Message Date
alex.lyn
1d4ffe6af3 runtime-rs: Implement serializable SocketAddress with Serde
This enables consistent JSON representation of socket addresses
across system components:
(1) Add serde serialization/deserialization with standardized
field naming convention.
(2) Enforce string-based port/cid and unix/path representation
for protocol compatibility.

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
2025-06-10 11:31:25 +08:00
Ruoqing He
781510202a runtime-rs: Log error instead of format
Log the error when the `umount` operation fails instead of only building
an error message with `format!`.

Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
2025-06-08 08:28:22 +00:00
Xuewei Niu
17b2daf0a7 Merge pull request #11357 from justxuewei/nxw/remove-dcode
dragonball: Remove a useless dead_code attribute
2025-06-08 16:07:03 +08:00
Dan Mihai
e067a1be64 Merge pull request #11358 from burgerdev/gid-warning
genpolicy: improvements to /etc/passwd checks
2025-06-06 17:04:27 -07:00
Shunsuke Kimura
5193cfedca runtime: remove hotplug_vfio_on_root_bus from toml
In this commit, hotplug_vfio_on_root_bus parameter is removed.
<dd422ccb69>

The pcie_root_port parameter description
(`This value is valid when hotplug_vfio_on_root_bus is true and
machine_type is "q35"`) no longer has any meaning,
and was never completely valid, since virt and DB also support root ports, as does CLH,
so it is removed.

Fixes: #11316

Co-authored-by: Zvonko Kaiser <zkaiser@nvidia.com>
Signed-off-by: Shunsuke Kimura <pbrehpuum@gmail.com>
2025-06-05 21:53:06 +09:00
Markus Rudy
1c240de58d genpolicy: don't parse /etc/passwd in a loop
Instead of looping over the users per group and parsing passwd for each
user, we can do the reverse lookup uid->user up front and then compare
the names directly. This has the nice side-effect of silencing warnings
about non-existent users mentioned in /etc/group, which is not relevant
for policy decisions.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2025-06-04 17:54:57 +02:00
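
The up-front reverse lookup described in the commit above (1c240de58d) can be sketched as follows. This is a minimal illustration that works on in-memory passwd and group text, not the genpolicy code itself; the function names `users_by_uid` and `additional_gids` are hypothetical.

```rust
use std::collections::HashMap;

/// Build a uid -> user name map from passwd-formatted text in one pass.
fn users_by_uid(passwd: &str) -> HashMap<u32, String> {
    passwd
        .lines()
        .filter_map(|line| {
            // Fields: name:password:uid:gid:gecos:home:shell
            let mut fields = line.split(':');
            let name = fields.next()?;
            let uid = fields.nth(1)?.parse::<u32>().ok()?;
            Some((uid, name.to_string()))
        })
        .collect()
}

/// Collect the GIDs of every group that lists the user with `uid` as a member.
/// (Hypothetical helper; names unknown to passwd are simply never matched.)
fn additional_gids(passwd: &str, group: &str, uid: u32) -> Vec<u32> {
    let users = users_by_uid(passwd); // parsed once, not once per group member
    let Some(user) = users.get(&uid) else {
        return Vec::new();
    };
    group
        .lines()
        .filter_map(|line| {
            // Fields: name:password:gid:member1,member2,...
            let fields: Vec<&str> = line.split(':').collect();
            let gid = fields.get(2)?.parse::<u32>().ok()?;
            let members = fields.get(3)?;
            members.split(',').any(|m| m == user.as_str()).then_some(gid)
        })
        .collect()
}
```

Because group members are compared by name against a single resolved user, members that do not exist in passwd never trigger a lookup, which is why no warning is needed for them.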
Markus Rudy
a1baaf6fe2 genpolicy: ignore groups with same name as user
containerd does not automatically add groups to the list of additional
GIDs when the groups have the same name as the user:

https://github.com/containerd/containerd/blob/f482992/pkg/oci/spec_opts.go#L852-L854

This is a bug and should be corrected, but it has been present since at
least 1.6.0 and thus affects almost all containerd deployments in
existence. Thus, we adopt the same behavior and ignore groups with the
same name as the user when calculating additional GIDs.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2025-06-04 10:29:49 +02:00
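
A small sketch of the behavior adopted in the commit above (a1baaf6fe2): when collecting additional GIDs, a group whose name equals the user name is skipped even if it lists the user as a member. The `Group` struct and `additional_gids_for` function are hypothetical and only illustrate the rule.

```rust
/// One parsed /etc/group entry (hypothetical shape, for illustration only).
struct Group {
    name: String,
    gid: u32,
    members: Vec<String>,
}

/// Additional GIDs for `user`, skipping any group named like the user.
fn additional_gids_for(user: &str, groups: &[Group]) -> Vec<u32> {
    groups
        .iter()
        // Mirror the containerd behavior: same-name groups are ignored.
        .filter(|g| g.name != user)
        .filter(|g| g.members.iter().any(|m| m == user))
        .map(|g| g.gid)
        .collect()
}
```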
Xuewei Niu
3f8dd821e6 dragonball: Remove a useless dead_code attribute
The vhost-user-fs has been added to Dragonball, so we can remove
`update_memory`'s dead_code attribute.

Fixes: #8691

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
2025-06-04 11:34:16 +08:00
Ruoqing He
77e68b164e agent: Upgrade ttrpc-codegen to 0.5.0
Propagate `ttrpc-codegen` upgrade from `libs/protocols` to `agent`.

Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
2025-06-04 01:16:46 +00:00
Ryan Savino
1e686dbca7 agent: Remove casting and fix Arc declaration
Removed unnecessary dynamic dispatch for services. Properly dereferenced
service Box values and stored in Arc.

Co-authored-by: Ruoqing He <heruoqing@iscas.ac.cn>
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Signed-Off-By: Ryan Savino <ryan.savino@amd.com>
2025-06-04 01:16:46 +00:00
Ruoqing He
0471f01074 libs: Bump ttrpc-codegen and protobuf
The previous version of `ttrpc-codegen` generates an outdated
`#![allow(box_pointers)]` attribute, which has been deprecated. Bump
`ttrpc-codegen` from v0.4.2 to v0.5.0 and `protobuf` from vx to v3.7.1
to get rid of this.

Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
2025-06-04 01:16:18 +00:00
Markus Rudy
eeb3d1384b genpolicy: compare additionalGIDs as sets
The additional GIDs are handled by genpolicy as a BTreeSet. This set is
then serialized to an ordered JSON array. On the containerd side, the
GIDs are added to a list in the order they are discovered in /etc/group,
and the main GID of the user is prepended to that list. This means that
we don't have any guarantees that the input GIDs will be sorted. Since
the order does not matter here, comparing the list of GIDs as sets is
close enough.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2025-06-03 20:18:35 +02:00
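
The order-insensitive comparison from the commit above (eeb3d1384b) can be illustrated with a short sketch. It is written in Rust for illustration only and may not match where or how the real check is implemented; `same_gids` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Compare two GID lists while ignoring order and duplicates.
fn same_gids(policy: &[u32], input: &[u32]) -> bool {
    let policy: BTreeSet<u32> = policy.iter().copied().collect();
    let input: BTreeSet<u32> = input.iter().copied().collect();
    policy == input
}
```

A sorted policy list such as `[10, 29, 1000]` then matches an input list such as `[1000, 29, 10]`, where the primary GID was prepended.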
Markus Rudy
02ad39ddf1 genpolicy: push down warning about missing passwd file
The warning used to trigger even if the passwd file was not needed. This
commit moves it down to where it actually matters.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2025-06-03 11:19:29 +02:00
Markus Rudy
ec969e4dcd genpolicy: remove redundant group check
https://github.com/kata-containers/kata-containers/pull/11077
established that the GID from the image config is never used for
deriving the primary group of the container process. This commit removes
the associated logic that derived a GID from a named group.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2025-06-03 10:59:10 +02:00
Xynnn007
39aa481da1 runtime: fix initdata support for SNP
The QEMU command line for SNP should start with `sev-snp-guest`,
followed by the other parameters separated by ','. This patch fixes the
parameter order.

Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
2025-06-02 20:33:19 +08:00
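
The ordering requirement from the commit above (39aa481da1) can be sketched as a small string builder. The affected runtime code is not shown on this page; `snp_object_arg` and the parameter names in the usage note are hypothetical.

```rust
/// Build the comma-separated argument value for an SNP guest object.
/// The object type `sev-snp-guest` always comes first; every other
/// parameter follows as `key=value`.
fn snp_object_arg(params: &[(&str, &str)]) -> String {
    let mut parts = vec!["sev-snp-guest".to_string()];
    parts.extend(params.iter().map(|(k, v)| format!("{k}={v}")));
    parts.join(",")
}
```

Called with the illustrative pairs `("id", "snp")` and `("cbitpos", "51")`, this yields `sev-snp-guest,id=snp,cbitpos=51`.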
RuoqingHe
51cc960cdd Merge pull request #11346 from fidencio/topic/bump-cgroups-rs
rust: Update cgroups-rs to its v0.3.5 release
2025-05-31 04:13:05 +02:00
Fabiano Fidêncio
48f8496209 Merge pull request #11327 from Champ-Goblem/agent/increase-limit-nofile
agent: increase LimitNOFILE in the systemd service
2025-05-30 21:56:01 +02:00
Fabiano Fidêncio
02c46471fd rust: Update cgroups-rs to its v0.3.5 release
We're switching to using a rev as it may take some time for the package
to be updated on crates.io.

Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
2025-05-30 21:49:50 +02:00
Champ-Goblem
ef642fe890 runtime: fix cgroupv2 deletion when sandbox_cgroup_only=false
Currently, when a new sandbox resource controller is created with cgroupsv2 and sandbox_cgroup_only is disabled,
the cgroup management falls back to cgroupfs. During deletion, `IsSystemdCgroup` checks if the path contains `:`
and tries to delete the cgroup via systemd. However, the cgroup was originally set up via cgroupfs and this process
fails with `lstat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/....scope: no such file or directory`.

This patch updates the deletion logic to take into account the sandbox_cgroup_only=false option and, in this case, uses
the cgroupfs delete.

Fixes: #11036
Signed-off-by: Champ-Goblem <cameron@northflank.com>
2025-05-30 17:51:31 +02:00
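
The deletion decision described in the commit above (ef642fe890) can be sketched as follows. This is an illustrative Rust sketch of the logic only; `DeleteVia`, `is_systemd_cgroup` and `delete_backend` are hypothetical names, and only the `:` check is taken from the commit message.

```rust
#[derive(Debug, PartialEq)]
enum DeleteVia {
    Systemd,
    Cgroupfs,
}

/// Mirrors the check described in the commit: a systemd-style path contains ':'.
fn is_systemd_cgroup(path: &str) -> bool {
    path.contains(':')
}

/// Pick the deletion backend. With cgroup v2 and `sandbox_cgroup_only`
/// disabled, the cgroup was created via cgroupfs, so it is deleted the same
/// way regardless of what the path looks like.
fn delete_backend(path: &str, cgroup_v2: bool, sandbox_cgroup_only: bool) -> DeleteVia {
    if cgroup_v2 && !sandbox_cgroup_only {
        return DeleteVia::Cgroupfs;
    }
    if is_systemd_cgroup(path) {
        DeleteVia::Systemd
    } else {
        DeleteVia::Cgroupfs
    }
}
```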
Champ-Goblem
f4007e5dc1 agent: increase LimitNOFILE in the systemd service
Increase the NOFILE limit in the systemd service; this helps with
running databases in the Kata runtime.

Signed-off-by: Champ-Goblem <cameron@northflank.com>
2025-05-30 17:49:29 +02:00
stevenhorsman
088e97075c workflow: Add top-level permissions
Set:
```
permissions:
  contents: read
```
as the default top-level permissions explicitly
to conform to recommended security practices e.g.
https://github.com/ossf/scorecard/blob/main/docs/checks.md#token-permissions
2025-05-28 19:34:28 +01:00
Dan Mihai
353d0822fd Merge pull request #11314 from katexochen/p/svc-name-regex
genpolicy: fix svc_name regex
2025-05-28 10:08:38 -07:00
Alex Lyn
aab6caa141 Merge pull request #10362 from Apokleos/vfio-hotplug-runtime-rs
runtime-rs: add support hotplugging vfio device for qemu-rs
2025-05-28 13:21:58 +08:00
Fabiano Fidêncio
ac934e001e Merge pull request #11244 from katexochen/p/guest-pull-config
runtime: add option to force guest pull
2025-05-27 16:00:09 +02:00
alex.lyn
e69a4d203a runtime-rs: Increase QMP read timeout to mitigate failures
Passing through devices via VFIO in QEMU frequently fails with
"Resource Temporarily Unavailable (OS Error 11)" under the original
250ms read timeout. The root cause is that the synchronization timeout
window fails to accommodate inherent delays during critical hardware
init phases in kernel space. This commit increases the timeout to
5000ms, a value determined through testing. While not guaranteeing
complete resolution for all hardware combinations, this change
significantly reduces timeout failures.

Fixes # 10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
2025-05-27 21:06:57 +08:00
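
A minimal sketch of the idea in the commit above (e69a4d203a), using only the standard library: set a 5000ms read timeout on a Unix socket and treat an expired timeout as "no data yet" rather than a hard failure. The helper names are hypothetical and this is not the runtime-rs QMP code.

```rust
use std::io::{self, Read};
use std::os::unix::net::UnixStream;
use std::time::Duration;

/// Read timeout taken from the commit message above.
const QMP_READ_TIMEOUT: Duration = Duration::from_millis(5000);

/// Apply the read timeout to a QMP-style Unix socket.
fn configure_qmp_socket(stream: &UnixStream) -> io::Result<()> {
    stream.set_read_timeout(Some(QMP_READ_TIMEOUT))
}

/// Read once, returning `Ok(None)` if the timeout expired without data.
fn read_with_timeout(stream: &mut UnixStream, buf: &mut [u8]) -> io::Result<Option<usize>> {
    match stream.read(buf) {
        Ok(n) => Ok(Some(n)),
        // EAGAIN (OS error 11 on Linux) and timeouts surface as these kinds.
        Err(e) if matches!(e.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut) => {
            Ok(None)
        }
        Err(e) => Err(e),
    }
}
```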
Paul Meyer
c4815eb3ad runtime: add option to force guest pull
This enables guest pull via config, without the need of any external
snapshotter. When the config enables runtime.experimental_force_guest_pull, instead of
relying on annotations to select the way to share the root FS, we always
use guest pull.

Co-authored-by: Markus Rudy <mr@edgeless.systems>
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
2025-05-27 12:42:00 +02:00
Fabiano Fidêncio
d3f81ec337 Merge pull request #11240 from Apokleos/copydir
runtime-rs: Propagate k8s configs correctly when sharedfs is disabled
2025-05-27 12:41:21 +02:00
Paul Meyer
8de8b8185e genpolicy: rename svc_name to svc_name_downward_env
Just to be more explicit what this matches.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>
2025-05-27 10:13:43 +02:00
Paul Meyer
78eb65bb0b genpolicy: fix svc_name regex
The service name is specified as an RFC 1035 label name [1]. The svc_name
regex in the genpolicy settings is applied to the downward API env
variables created based on the service name. So it tries to match
RFC 1035 labels after they are transformed to downward API variable
names [2]. So the set of lower case alphanumerics and dashes is
transformed to upper case alphanumerics and underscores.
The previous regex wrongly prohibited the use of numbers, but did allow
dot and dash, which shouldn't be allowed (dots because they do not
conform to RFC 1035, dashes because they are transformed to underscores).

We have to take care not to also try to use the regex in places where
we actually want to check for RFC 1035 label instead of the downward
API transformed version of it.

Further, we should consider using a format like JSON5/JSONC for the
policy settings, as these are far from trivial and would highly benefit
from proper documentation through comments.

[1]: https://kubernetes.io/docs/concepts/services-networking/service/#defining-a-service
[2]: b2dfba4151/pkg/kubelet/envvars/envvars.go (L29-L70)

Signed-off-by: Paul Meyer <katexochen0@gmail.com>
2025-05-27 08:43:25 +02:00
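
The transformation described in the commit above (78eb65bb0b) can be sketched in a few lines. The character-class check stands in for a pattern like `^[A-Z0-9_]+$`, which is an assumption and not necessarily the regex used in the genpolicy settings; both function names are hypothetical.

```rust
/// Transform a service name the way downward API env variable names are
/// built: upper-case it and replace dashes with underscores.
fn to_env_prefix(service: &str) -> String {
    service.to_ascii_uppercase().replace('-', "_")
}

/// Accept only upper-case alphanumerics and underscores (assumed pattern,
/// equivalent to `^[A-Z0-9_]+$`).
fn is_valid_env_prefix(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit() || c == '_')
}
```

The check is meant for the transformed name only; an untransformed RFC 1035 label such as `my-svc` is rejected by it, which is the caveat the commit message raises.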
RuoqingHe
139dc13bdc Merge pull request #11301 from lifupan/fix_cgroup
runtime-rs: fix the issue of delete cgroup failed
2025-05-27 05:05:32 +02:00
Xingru Li
71b6acfd7e dragonball: vsock: support single descriptor
Since kernel v6.3 the vsock packet is not split over two descriptors and
is instead included in a single one.

Therefore, we currently decide the specific method of obtaining
BufWrapper based on the length of descriptor.

Refer:
a2752fe04f
https://git.kernel.org/torvalds/c/71dc9ec9ac7d

Signed-off-by: Xingru Li <lixingru.lxr@linux.alibaba.com>
[ Gao Xiang: port this patch from the internal branch to address Linux 6.1.63+. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-05-26 15:48:19 +08:00
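
The length-based decision mentioned in the commit above (71b6acfd7e) can be sketched as follows. The 44-byte value is the size of the virtio-vsock packet header; `PacketLayout` and `packet_layout` are hypothetical names, and the actual BufWrapper selection in Dragonball is not shown here.

```rust
/// Size of the virtio-vsock packet header in bytes.
const VSOCK_PKT_HDR_SIZE: usize = 44;

#[derive(Debug, PartialEq)]
enum PacketLayout {
    /// Header in the first descriptor, payload in a second one (older kernels).
    SplitDescriptors,
    /// Header and payload share one descriptor; payload starts at this offset.
    SingleDescriptor { payload_offset: usize },
}

/// Decide the layout from the length of the first descriptor in the chain.
/// Returns `None` if the descriptor cannot even hold a header.
fn packet_layout(first_desc_len: usize) -> Option<PacketLayout> {
    if first_desc_len < VSOCK_PKT_HDR_SIZE {
        None
    } else if first_desc_len == VSOCK_PKT_HDR_SIZE {
        Some(PacketLayout::SplitDescriptors)
    } else {
        Some(PacketLayout::SingleDescriptor {
            payload_offset: VSOCK_PKT_HDR_SIZE,
        })
    }
}
```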
Fupan Li
e9b45126fc Merge pull request #11254 from sampleyang/main
runtime-rs: fix vfio pci address domain 0001 problem
2025-05-23 18:13:10 +08:00
yangsong
06c7c5bccb runtime-rs: fix vfio pci address domain 0001 problem
Some NVIDIA GPUs have a PCI address with domain 0001,
but the runtime currently assumes 0000:bdf by default,
which causes address errors during device initialization
and address conflicts during device registration.

Fixes #11252

Signed-off-by: yangsong <yunya.ys@antgroup.com>
2025-05-23 14:33:06 +08:00
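
To illustrate the problem fixed in the commit above (06c7c5bccb), here is a sketch of a PCI address parser that keeps an explicit domain and only falls back to 0000 for the short `bus:device.function` form. The `PciAddress` type and `parse_pci_address` function are hypothetical, not the runtime-rs types.

```rust
/// A PCI address split into domain, bus, device (slot) and function.
#[derive(Debug, PartialEq)]
struct PciAddress {
    domain: u16,
    bus: u8,
    device: u8,
    function: u8,
}

/// Parse `DDDD:BB:DD.F` or the short `BB:DD.F` form (domain defaults to 0).
fn parse_pci_address(s: &str) -> Option<PciAddress> {
    let parts: Vec<&str> = s.split(':').collect();
    let (domain, bus, dev_fn) = match parts.as_slice() {
        [bus, dev_fn] => ("0000", *bus, *dev_fn),
        [domain, bus, dev_fn] => (*domain, *bus, *dev_fn),
        _ => return None,
    };
    let (device, function) = dev_fn.split_once('.')?;
    // All fields are hexadecimal.
    Some(PciAddress {
        domain: u16::from_str_radix(domain, 16).ok()?,
        bus: u8::from_str_radix(bus, 16).ok()?,
        device: u8::from_str_radix(device, 16).ok()?,
        function: u8::from_str_radix(function, 16).ok()?,
    })
}
```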
alex.lyn
043bab3d3e runtime-rs: Handle port allocation in PCIe topology for vfio devices
It's important to handle port allocation in a PCIe topology before vfio
device hotplug via QMP.
The code ensures that VFIO devices are properly allocated to available
ports (either root ports or switch ports) and updates the device's bus
and port information accordingly.
It first retrieves the PCIe port type from the topology using
pcie_topo.get_pcie_port(). It then searches for an available node in
the PCIe topology with RootPort or SwitchPort type and allocates the
VFIO device to that port. Finally, it updates the device's
bus with the allocated port's ID and type.

Fixes # 10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
2025-05-22 18:58:41 +08:00
alex.lyn
01b822de16 runtime-rs: Get available port node in the PCIe topology
This commit implements the `find_available_node` function,
which searches the PCIe topology for the first available
`TopologyPortDevice` or `SwitchDownPort`.
If no available node is found in either the `pcie_port_devices`
or the connected switches' downstream ports, the function returns
`None`.

Fixes # 10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
2025-05-22 18:58:41 +08:00
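
The search described in the commit above (01b822de16) might look roughly like the sketch below. All types here are simplified, hypothetical stand-ins for the real topology structures, and the search order (a port first, then the downstream ports of a switch attached to it) is an assumption.

```rust
/// Hypothetical result type naming the two kinds of node from the commit.
#[derive(Debug, PartialEq)]
enum AvailableNode {
    TopologyPortDevice { port_id: u32 },
    SwitchDownPort { upstream_port_id: u32, port_id: u32 },
}

struct DownPort {
    id: u32,
    allocated: bool,
}

struct PortDevice {
    id: u32,
    allocated: bool,
    /// Downstream ports of a switch connected to this port, if any.
    switch_down_ports: Vec<DownPort>,
}

/// Return the first free port device or switch downstream port, else `None`.
fn find_available_node(pcie_port_devices: &[PortDevice]) -> Option<AvailableNode> {
    for port in pcie_port_devices {
        if !port.allocated {
            return Some(AvailableNode::TopologyPortDevice { port_id: port.id });
        }
        // The port itself is taken; look at a switch attached to it.
        if let Some(down) = port.switch_down_ports.iter().find(|d| !d.allocated) {
            return Some(AvailableNode::SwitchDownPort {
                upstream_port_id: port.id,
                port_id: down.id,
            });
        }
    }
    None
}
```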
alex.lyn
533d07a2c3 runtime-rs: Introduce qemu-rs vfio device hotplug handler
This commit notes a restriction of the current implementation:
'multifunction=on' is temporarily unsupported. While the feature
isn't available in the present version, we explicitly acknowledge
this limitation and commit to addressing it in future iterations
to enhance functional completeness.

Tracking issue #11292 has been created to monitor progress towards
full multifunction support.

Fixes #10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
2025-05-22 18:58:06 +08:00
alex.lyn
f1796fe9ba runtime-rs: Add more fields in VfioDevice to express vfio devices
To support port devices for vfio devices, more fields need to be
introduced to help pass port type, bus and other information.

Fixes #10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
2025-05-22 16:00:40 +08:00
Fupan Li
15cbc545ca runtime-rs: fix the issue of delete cgroup failed
Deleting a cgroup requires first moving all of the tasks/procs
in the cgroup into the root cgroup.

Since cgroup v2 doesn't support moving threads into the
root cgroup, moving the processes instead of the tasks
fixes this issue.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-05-22 12:15:02 +08:00
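
The approach in the commit above (15cbc545ca) can be sketched with plain file operations on `cgroup.procs`, the cgroup v2 interface file that lists processes rather than threads. This is an assumption-laden illustration, not the runtime-rs code (which goes through a cgroups library); `move_procs_to_root` is a hypothetical helper.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Move every process listed in `<cgroup>/cgroup.procs` to
/// `<root>/cgroup.procs`. Returns the number of PIDs moved.
fn move_procs_to_root(cgroup: &Path, root: &Path) -> io::Result<usize> {
    let pids = fs::read_to_string(cgroup.join("cgroup.procs"))?;
    let mut moved = 0;
    for pid in pids.lines().filter(|l| !l.trim().is_empty()) {
        // One PID per write; a fresh handle keeps each write separate.
        let mut target = OpenOptions::new()
            .append(true)
            .open(root.join("cgroup.procs"))?;
        target.write_all(format!("{}\n", pid.trim()).as_bytes())?;
        moved += 1;
    }
    Ok(moved)
}
```

Once the cgroup is empty, the caller can remove its directory, for example with `std::fs::remove_dir`.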
Fabiano Fidêncio
5378e581d8 Merge pull request #11144 from Apokleos/hotplug-block-qemu-rs
Support hot-plug block device in qemu-rs with QMP
2025-05-21 11:31:48 +02:00
Fabiano Fidêncio
6c9b199ef1 Merge pull request #11289 from BbolroC/fix-vfio-coldplug
runtime: Preserve hotplug devices for vfio-coldplug mode
2025-05-21 09:48:25 +02:00
Steve Horsman
f8c5aa6df6 Merge pull request #11259 from fitzthum/bump-gc-0140
Update Trustee and Guest Components for CoCo v0.14.0
2025-05-20 18:05:17 +01:00
Sumedh Alok Sharma
9a4432d197 Merge pull request #11233 from Ankita13-code/ankitapareek/execprocess-additional-input-validation
genpolicy: validate input process fields for ExecProcessRequest
2025-05-20 20:11:41 +05:30
Fabiano Fidêncio
29099d139b Merge pull request #11280 from kata-containers/dependabot/cargo/src/tools/kata-ctl/ring-0.17.14
build(deps): bump ring from 0.17.5 to 0.17.14 in /src/tools/kata-ctl
2025-05-20 13:47:22 +02:00
Ankita Pareek
ad75595dc8 genpolicy: Add tests for various input validations for ExecProcessRequest
These additional tests cover edge cases specific to:
- Terminal validation
- Capabilities validation
- Working directory (Cwd) validation
- NoNewPrivileges validation
- User validation
- Environment variables validation

Signed-off-by: Ankita Pareek <ankitapareek@microsoft.com>
2025-05-20 11:19:55 +00:00
Saul Paredes
1e466bf39c genpolicy: fix validation of env variables sourced from metadata.namespace
Use $(sandbox-namespace) wildcard in case none is specified in yaml. If wildcard is present, compare
input against annotation value.

Fixes regression introduced in https://github.com/microsoft/kata-containers/pull/273
where samples that use metadata.namespace env var were no longer working.

Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
2025-05-20 11:19:46 +00:00
Dan Mihai
a113b9eefd genpolicy: validate probe process fields
Validate more process fields for k8s probe commands - e.g.,
livenessProbe, readinessProbe, etc.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-05-20 11:15:30 +00:00
Dan Mihai
c0b8c6ed5e genpolicy: validate process for commands from settings
Validate more process fields for commands enabled using the
ExecProcessRequest "commands" and/or "regex" fields from the
settings file.
Add a function to get the container from state based on container_id
matching instead of matching it against every policy container's data.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Signed-off-by: Ankita Pareek <ankitapareek@microsoft.com>
2025-05-20 11:15:30 +00:00
Dan Mihai
6f78aaa411 genpolicy: use process inputs for allow_process()
Using process data inputs for allow_process() is easier to
read/understand compared with the older OCI data inputs.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-05-20 11:15:30 +00:00
Steve Horsman
2871c31162 Merge pull request #11273 from mythi/tdx-qemu-params
config: update QEMU TDX configuration
2025-05-20 10:22:59 +01:00
alex.lyn
4b27ca9233 runtime-rs: Implement volume copy allowlist check
For security reasons, we have restricted directory copying.

Introduce the `is_allowlisted_copy_volume` function to verify
if a given volume path is present in an allowed copy directory.
This enhances security by ensuring only permitted volumes are
copied.

Currently, only directories under the path
`/var/lib/kubelet/pods/<uid>/volumes/{kubernetes.io~configmap,
kubernetes.io~secret, kubernetes.io~downward-api,
kubernetes.io~projected}` are allowed to be copied into the
guest. Copying of other directories will be prohibited.

Fixes #11237

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
2025-05-20 16:57:10 +08:00
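
A minimal sketch of the allowlist check described in the commit above (4b27ca9233). Only the function name and the four allowed directories come from the commit message; the body is an assumption about how such a check could be written, not the runtime-rs implementation.

```rust
use std::path::{Component, Path};

/// Volume plugin directories that may be copied into the guest.
const ALLOWED_VOLUME_KINDS: [&str; 4] = [
    "kubernetes.io~configmap",
    "kubernetes.io~secret",
    "kubernetes.io~downward-api",
    "kubernetes.io~projected",
];

/// Check that `path` lies under
/// `/var/lib/kubelet/pods/<uid>/volumes/<allowed kind>/`.
fn is_allowlisted_copy_volume(path: &Path) -> bool {
    // Reject relative paths and any `..` components up front.
    if !path.is_absolute() || path.components().any(|c| c == Component::ParentDir) {
        return false;
    }
    let rest = match path.strip_prefix("/var/lib/kubelet/pods") {
        Ok(rest) => rest,
        Err(_) => return false,
    };
    let parts: Vec<&str> = rest.iter().filter_map(|p| p.to_str()).collect();
    // Expected layout: <uid>/volumes/<kind>/...
    matches!(
        parts.as_slice(),
        [_uid, "volumes", kind, ..] if ALLOWED_VOLUME_KINDS.contains(kind)
    )
}
```

Rejecting `..` before the prefix check is a design choice of this sketch, so that a path cannot name an allowed directory and then climb back out of it.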