For nerdctl and docker runtimes, network is hot-plugged instead of
cold-plugged. While this change was made in the runtime,
we did not have the agent waiting for the device to be ready.
On some systems, the device hotplug could take some time causing
the update_interface rpc call to fail as the interface is not available.
Add a watcher for the network interface based on the pci-path of the
network interface. Note, waiting on the device based on name is really
not reliable especially in case multiple networks are hotplugged.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Hotunplugging memory is not guaranteed or even likely to work.
Nevertheless I'd really like to have this code in for tests and
observation. It shouldn't hurt, from experience so far.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The bulk of this implementation are simple though tedious sanity checks,
alignment computations and logging.
Note that before any hotplugging, we query qemu directly for the current
size of hotplugged memory. This ensures that any request to resize memory
will be properly compared to the actual already available amount and only
necessary amount will be added.
Note also that we borrow checked_next_multiple_of() from CH implementation.
While this might look uncleanly it's just a rather temporary solution since
an equivalent function will apparently be part of std soon, likely the
upcoming 1.75.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The algorithm is rather simple - we query qemu for existing memory devices
to figure out the index of the one we're about to add. Then we add a
backend object and a corresponding frontend device.
Signed-off-by: Pavel Mores <pmores@redhat.com>
As part of aligning the Kata OCI Spec with oci-spec-rs,
the concept of "State" falls outside the scope of the OCI
Spec itself. While we'll retain the existing code for State
management for now, to improve code organizationand clarity,
we propose moving the State-related code from the oci/ dir
to a dedicated directory named runtime-spec/.
This separation will be completed in subsequent commits with
the removal of the oci/ directory.
Fixes#9766
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When create container failed, it should cleanup the container
thus there's no device/resource left.
Fixes: #10044
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Similar to HotAttach, the HotDetach method signature for network
endoints needs to be changed as well to allow for the method to make
use of device manager to manage the hot unplug of physical network
devices.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Enable physical network interfaces to be hotplugged.
For this, we need to change the signature of the HotAttach method
to make use of Sandbox instead of Hypervisor. Similar approach was
followed for Attach method, but this change was overlooked for
HotAttach.
The signature change is required in order to make use of
device manager and receiver for physical network
enpoints.
Fixes: #8405
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
The current implementation for device binding using driver bind/unbind
and new_id fails in the scenario when the physical device is not bound
to a driver before assigning it to vfio.
There exists and updated mechanism to accomplish the same that does not
have the same issue as above.
The driver_override field for a device allows us to specify the driver for a device
rather than relying on the bound driver to provide a positive match of the
device. It also has other advantages referenced here:
https://patchwork.kernel.org/project/linux-pci/patch/1396372540.476.160.camel@ul30vt.home/
So use the updated driver_override mechanism for binding/unbinding a
physical device/virtual function to vfio-pci.
Signed-off-by: liangxianlong <liang.xianlong@zte.com.cn>
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Keep track of individual exec args instead of joining them in the
policy text. Verifying each arg results in a more precise policy,
because some of the args might include space characters.
This improved validation applies to commands specified in K8s YAML
files using:
- livenessProbe
- readinessProbe
- startupProbe
- lifecycle.postStart
- lifecycle.preStop
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
If the agent-config has a value for `image_registry_auth`,
Then pass this to the image-rs client and enable auth mode too
Fixes: #8122
Signed-off-by: Magnus Kulke <magnuskulke@microsoft.com>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add optional config for agent.image_registry_auth, to specify
the uri of credentials to be used when pulling images in the guest
from an authenticated registry
Fixes: #8122
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
Signed-off-by: Magnus Kulke <magnuskulke@microsoft.com>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add setting to allow specifying the cpath for a mount source.
cpath is the root path for most files used by a container. For example,
the container rootfs and various files copied from the Host to the
Guest when shared_fs=none are hosted under cpath.
mount_source_cpath is the root of the paths used a storage mount
sources. Depending on Kata settings, mount_source_cpath might have the
same value as cpath - but on TDX for example these two paths are
different: TDX uses "/run/kata-containers" as cpath,
but "/run/kata-containers/shared/containers" as mount_source_cpath.
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
ocicrypt config is for kata-agent to connect to CDH to request for image
decryption key. This value is specified by an env. We use this
workaround the same as CCv0 branch.
In future, we will consider better ways instead of writting files and
setting envs inside inner logic of kata-agent.
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
- Enable the kata-cc-rustls-tls feature in image-rs, so that it
can get resources from the KBS in order to retrieve the registry
credentials.
- Also bump to the latest image-rs to pick up protobuf fixes
- Add libprotobuf-dev dependency to the agent packaging
as it is needed by the new image-rs feature
- Add extra env in the agent make test as the
new version of the anyhow crate has changed the backtrace capture thus unit
tests of kata-agent that compares a raised error with an expected one
would fail. To fix this, we need only panics to have backtraces, thus
set RUST_BACKTRACE=0 for tests due to document
https://docs.rs/anyhow/latest/anyhow/Fixes#9538
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add ability to auto-generate policy for SecurityContext.runAsUser and
PodSecurityContext.runAsUser.
Fixes: #8879
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Sets SharedFS config to NoSharedFS for remote hypervisor in order to start the file watcher which syncs files from the host to the guest VMs.
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
The DAN feature has already been implemented in kata-runtime-rs, and
this commit brings the same capability to the Go kata-runtime.
Fixes: #9758
Signed-off-by: Lei Huang <leih@nvidia.com>
In dragonball Vfio device passthrough scenarois, the first passthrough
device will be allocated slot 0 which is occupied by root device.
It will cause error, looks like as below:
```
...
6: failed to add VFIO passthrough device: NoResource\n
7: no resource available for VFIO device"): unknown
...
```
To address such problem, we adopt another method with no pre-allocated
guest device id and just let dragonball auto allocate guest device id
and return it to runtime. With this idea, add_device will return value
Result<DeviceType> and apply the change to related code.
Fixes#9813
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Fixes: #9532
Instead of call agent.close_stdin in close_io, we call agent.write_stdin
with 0 len data when the stdin pipe ends.
Signed-off-by: Tim Zhang <tim@hyper.sh>
To test unsealing secrets stored in environment variables,
we create a simple test server that takes the place of
the CDH. We start this server and then use it to
unseal a test secret.
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
Signed-off-by: Linda Yu <linda.yu@intel.com>
When sealed-secret is enabled, the Kata Agent
intercepts environment variables containing
sealed secrets and uses the CDH to unseal
the value.
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
Signed-off-by: Linda Yu <linda.yu@intel.com>
- Bump the commit of image-rs we are pulling in to 413295415
Note: This is the last commmit before a change to whiteout handling
was introduced that lead to the error `'failed to unpack: convert whiteout"`
when pulling the agnhost:2.21 image
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Let me explain why:
In our previous approach, we implemented guest pull by passing PullImageRequest to the guest.
However, this method resulted in the loss of specifications essential for running the container,
such as commands specified in YAML, during the CreateContainer stage. To address this,
it is necessary to integrate the OCI specifications and process information
from the image’s configuration with the container in guest pull.
The snapshotter method does not care this issue. Nevertheless, a problem arises
when two containers in the same pod attempt to pull the same image, like InitContainer.
This is because the image service searches for the existing configuration,
which resides in the guest. The configuration, associated with <image name, cid>,
is stored in the directory /run/kata-containers/<cid>. Consequently, when the InitContainer finishes
its task and terminates, the directory ceases to exist. As a result, during the creation
of the application container, the OCI spec and process information cannot
be merged due to the absence of the expected configuration file.
Fixes: kata-containers#9665
Fixes: kata-containers#9666
Fixes: kata-containers#9667
Fixes: kata-containers#9668
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Currently, the image is pulled by image-rs in the guest and mounted at
`/run/kata-containers/image/cid/rootfs`. Finally, the agent rebinds
`/run/kata-containers/image/cid/rootfs` to `/run/kata-containers/cid/rootfs` in CreateContainer.
However, this process requires specific cleanup steps for these mount points.
To simplify, we reuse the mount point `/run/kata-containers/cid/rootfs`
and allow image-rs to directly mount the image there, eliminating the need for rebinding.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
- Add a check in setup_bundle to see if the bundle already exists
and if it does then skip the setup.
This commit is cherry-picked from 44ed3ab80e.
The reason that k8s-kill-all-process-in-container.bats failed is that
deletion of the directory `/root/kata-containers/cid/rootfs` failed during removing container
because it was mounted twice (one in image-rs and one in set_bundle ) and only unmounted once in removing container.
Fixes: #9664
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Dave Hay <david_hay@uk.ibm.com>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
To unseal a secret, the Kata agent will contact the CDH
using ttRPC. Add the proto that describes the sealed
secret service and messages that will be used.
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
Signed-off-by: Biao Lu <biao.lu@intel.com>
Add a basic runtime-rs `Hypervisor` trait implementation for
AWS Firecracker
- Add basic hypervisor operations (setup / start / stop / add_device)
- Implement AWS Firecracker API on a separate file `fc_api.rs`
- Add support for running jailed (include all sandbox-related content)
- Add initial device support (limited as hotplug is not supported)
- Add separate config for runtime-rs (FC)
Notes:
- devmapper is the only snapshotter supported
- to account for no sharefs support, we copy files in the sandbox (as
in the GO runtime)
- nerdctl spawn is broken (TODO: #7703)
Fixes: #5268
Signed-off-by: George Pyrros <gpyrros@nubificus.co.uk>
Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Signed-off-by: George Ntoutsos <gntouts@nubificus.co.uk>
This commit adds the missing step of passing an attached vfio-ap device
to a container via ociSpec. It instantiates and passes a vfio-ap device
(e.g. a Z crypto device).
A device at `/dev/z90crypt` covers all use cases at the time of writing.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Do not install the packages librados-dev and librbd-dev as they are not needed for building static qemu.
Add machine option cap-ail-mode-3=off while creating the VM to qemu cmdline.
Fixes: #9893
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
In case we are dealing with multiple interfaces and there exists a
network interface with a conflicting name, we temporarily rename it to
avoid name conflicts.
Before doing this, we need to rename bring the interface down.
Failure to do so results in netlink returning Resource busy errors.
The resource needs to be down for subsequent operation when the name is
swapped back as well.
This solves the issue of passing multiple networks in case of nerdctl
as:
nerdctl run --rm --net foo --net bar docker.io/library/busybox:latest ip a
Fixes: #9900
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
This patch re-generates the client code for Cloud Hypervisor v40.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #9929
Signed-off-by: Bo Chen <chen.bo@intel.com>
Updated genpolicy settings to allow 2 empty environment variables that
may be forgotten to specify (AZURE_CLIENT_ID and AZURE_TENANT_ID)
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Add --layers-cache-file-path flag to allow the user to
specify where the cache file for the container layers
is saved. This allows e.g. to have one cache file
independent of the user's working directory.
Signed-off-by: Leonard Cohnen <lc@edgeless.systems>
Move the `sandbox.agent.setPolicy` call out of the remoteHypervisor
if, block, so we can use the policy implementation on peer pods
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
get_vmm_master_tid() currently returns an error with the message "cannot
get qemu pid (though it seems running)" when it finds a valid
QemuInner::qemu_process instance but fails to extract the PID out of it.
This condition however in fact means that a qemu child process was running
(otherwise QemuInner::qemu_process would be None) but isn't anymore (id()
returns None).
Signed-off-by: Pavel Mores <pmores@redhat.com>
Since Hypervisor::stop_vm() is called from the WaitProcess request handling
which appears to be per-container, it can be called multiple times during
kata pod shutdown. Currently the function errors out on any subsequent
call after the initial one since there's no VM to stop anymore. This
commit makes the function tolerate that condition.
While it seems conceivable that sandbox shouldn't be stopped by WaitProcess
handling, and the right fix would then have to happen elsewhere, this
commit at least makes qemu driver's behaviour consistent with other
hypervisor drivers in runtime-rs.
We also slightly improve the error message in case there's no
QemuInner::qemu_process instance.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Since no objections were raised in the linked issue (#9847) this commit
removes the attempt to derive sandbox bundle path from container bundle
path. As described in more detail in the linked issue, this is container
runtime specific and doesn't seem to serve any purpose.
As for implementation, we hoist the only part of
get_shim_info_from_sandbox() that's still useful (getting the socket
address) directly into the caller and remove the function altogether.
Fixes#9847
Signed-off-by: Pavel Mores <pmores@redhat.com>
Reject CreateContainerRequest field values that are not tested by
Kata CI and that might impact the confidentiality of CoCo Guests.
This change uses a "better safe than sorry" approach to untested
fields. It is very possible that in the future we'll encounter
reasonable use cases that will either:
- Show that some of these fields are benign and don't have to be
verified by Policy, or
- Show that Policy should verify legitimate values of these fields
These are the new CreateContainerRequest Policy rules:
count(input.shared_mounts) == 0
is_null(input.string_user)
i_oci := input.OCI
is_null(i_oci.Hooks)
is_null(i_oci.Linux.Seccomp)
is_null(i_oci.Solaris)
is_null(i_oci.Windows)
i_linux := i_oci.Linux
count(i_linux.GIDMappings) == 0
count(i_linux.MountLabel) == 0
count(i_linux.Resources.Devices) == 0
count(i_linux.RootfsPropagation) == 0
count(i_linux.UIDMappings) == 0
is_null(i_linux.IntelRdt)
is_null(i_linux.Resources.BlockIO)
is_null(i_linux.Resources.Network)
is_null(i_linux.Resources.Pids)
is_null(i_linux.Seccomp)
i_linux.Sysctl == {}
i_process := i_oci.Process
count(i_process.SelinuxLabel) == 0
count(i_process.User.Username) == 0
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Commit 'ca02c9f5124e' implements the vhost-user-blk reconnection functionality,
However, it has missed assigning VhostUserDeviceReconnect when new the QEMU
HypervisorConfig, resulting in VhostUserDeviceReconnect always set to default value 0.
Real change is this line, most of changes caused by go format,
return vc.HypervisorConfig{
// ...
VhostUserDeviceReconnect: h.VhostUserDeviceReconnect,
}, nil
Fixes: #9848
Signed-off-by: markyangcc <mmdou3@163.com>
This is a counterpart of commit abf52420a4 for the qemu-coco-dev
configuration. By allowing default_vcpu and default_memory annotations
users can fine-tune the VM based on the size of the container
image to avoid issues related with pulling large images in the guest.
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Just like the TEE configurations (sev, snp, tdx) we want to have the
qemu-coco-dev using shared_fs=none.
Fixes: #9676
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
- Bump tokio to 1.38.0 to fix the security vulnerability
https://rustsec.org/advisories/RUSTSEC-2024-0019
If possible it would be good to add the many runtime-rs creates into the
runtime-rs workspace and provide a centralised version to avoid the updates
in many places.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Run `cargo update -p rustjail` to pick up rustjail's bump of
tokio to 1.38.0 to fix the security vulnerability
https://rustsec.org/advisories/RUSTSEC-2024-0019
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Fix the lint error:
```
error: you seem to use `.enumerate()` and immediately discard the index
--> src/device_manager/mod.rs:427:33
|
427 | for (_index, device) in self.virtio_devices.iter().enumerate() {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
by removing the unnecessary enumerate
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The ci failed with:
```
error: use of `or_insert_with` to construct default value
--> src/address_space_manager.rs:650:14
|
650 | .or_insert_with(NumaNode::new);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: try: `or_default()`
|
```
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We take advantage of the Inner pattern to enable QemuInner::resize_vcpu()
take `&mut self` which we need to call non-const functions on Qmp.
This runs on Intel architecture but will need to be verified and ported
(if necessary) to other architectures in the future.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The QMP_SOCKET_FILE constant in cmdline_generator.rs is made public to make
it accessible from QemuInner. This is fine for now however if the constant
needs to be accessed from additional places in the future we could consider
moving it to somewhere more visible.
The Debug impl for Qmp is empty since first, we don't actually want it,
it's only forced by Hypervisor trait bounds, and second, it doesn't have
anything to display anyway. If Qmp gets any members in the future that
can be meaningfully displayed they should be handled by Qmp's Debug::fmt().
Signed-off-by: Pavel Mores <pmores@redhat.com>
The constructor handles QMP connection initialisation, too, so there can
be non-functional Qmp instance.
Signed-off-by: Pavel Mores <pmores@redhat.com>
We set the VERSION variable consistently across Makefiles to
'unknown' if the file is empty or not present.
We also use git commands consistently for calculating the COMMIT,
COMMIT_NO variables, not erroring out when building outside of
a git repository.
In create_summary_file we also account for a missing/empty VERSION
file.
This makes e.g. the UVM build process in an environment where we
build outside of git with a minimal/reduced set of files smoother.
Signed-off-by: Manuel Huber <mahuber@microsoft.com>
When the total number of files observed is greater than limit, return -1 directly.
runtime has fixed this bug, it should b ported to runtime-rs.
Fixes:#9829
Signed-off-by: gaohuatao <gaohuatao@bytedance.com>
It shouldn't call the initial_size_manager's setup_config
in the load_config since it had been called in the sandbox's
try_init function.
Fixes: #9778
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
For kata container, the container's pid is meaning less to
containerd/crio since the container's pid is belonged to VM,
and containerd/crio couldn't use it. Thus we just return any
tid of kata shim or hypervisor. But since the hypervisor had
been stopped before deleting the container, and it wouldn't
get the hypervisor's tid for some supported hypervisor, thus
we'd better to return the kata shim's pid instead of hypervisor's
tid.
Fixes: #9777
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
While running with a remote hypervisor, whenever kata-monitor tries to access
metrics from the shim, the shim does a "panic" and no metric can be gathered.
The function GetVirtioFsPid() is called on metrics gathering, and had a call
to "panic()". Since there is no virtiofs process for remote hypervisor, the
right implementation is to return nil. The caller expects that, and will skip
metrics gathering for virtiofs.
Fixes: #9826
Signed-off-by: Julien Ropé <jrope@redhat.com>
This corrects the warning to point to the \`-j\` flag,
which is the correct flag for the JSON settings file.
Previously, the warning was confusing, as it pointed to
the \`-p\` flag, which specifies to the path for the Rego ruleset.
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>
This patch re-generates the client code for Cloud Hypervisor v39.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #8694, #9574
Signed-off-by: Bo Chen <chen.bo@intel.com>
- Update the config parsing logic so that when reading from the
agent-config.toml file any envs are still processed
- Add units tests to formalise that the envs take precedence over values
from the command line and the config file
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When the total number of files observed is greater than limit, return (-1, err).
When the returned err is not nil, the func countFiles should return -1.
Fixes:#9780
Signed-off-by: gaohuatao <gaohuatao@bytedance.com>
fixes#9810
Add an annotation to the enum values in the agent config that will
deserialize them using a kebab-case conversion, aligning the behaviour
to parsing of params specified via kernel cmdline.
drive-by fix: add config override for guest_component_procs variable
Signed-off-by: Magnus Kulke <magnuskulke@microsoft.com>
(1) As mis-use of cap.set causing previous Caps lost which
causing assert! failed, just replacing cap.set with cap.add.
(2) It will return error if there's no such name setting when
do update_config_by_annotation {
...
if config.runtime.name.is_empty() {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Runtime name is missing in the configuration",
));
}
...
}
Fixes#9783
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This is required to provide the hashes of kernel, initrd and cmdline
needed during the attestation of the coco.
Fixes: #9150
Signed-off-by: Niteesh Dubey <niteesh@us.ibm.com>
CreateContainerRequest objects can specify devices to be created inside
the guest VM. This change ensures that requested devices have a
corresponding entry in the PodSpec.
Devices that are added to the pod dynamically, for example via the
Device Plugin architecture, can be allowlisted globally by adding their
definition to the settings file.
Fixes: #9651
Signed-off-by: Markus Rudy <mr@edgeless.systems>
This adds structs and fields required to parse PodSpecs with
VolumeDevices and PVCs with non-default VolumeModes.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Since yq frequently updates, let's upgrade to a version from February to
bypass potential issues with versions 4.41-4.43 for now. We can always
upgrade to the newest version if necessary.
Fixes#9354
Depends-on:github.com/kata-containers/tests#5818
Signed-off-by: Beraldo Leal <bleal@redhat.com>
golang.mk is not ready to deal with non GOPATH installs. This is
breaking test on s390x.
Since previous steps here are installing go and yq our way, we could
skip this aditional check. A full refactor to golang.mk would be needed
to work with different paths.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
fixes#9748
A configuration option `guest_component_procs` has been introduced that
indicates which guest component processes are supposed to be spawned by
the agent. The default behaviour remains that all of those processes are
actively spawned by the agent. At the moment this is based on presence
of binaries in the rootfs and the guest_component_api_rest option.
The new option is incremental:
none -> attestation-agent -> confidential-data-hub -> api-server-rest
e.g. api-server-rest implies attestation-agent and confidential-data-hub
the `none` option has been removed from guest_component_api_rest, since
this is addresses by the introduced option.
To not change expected behaviour for non-coco guests we still will still
only attempt to spawn the processes if the requested attestation binaries
are present on the rootfs, and issue in warning in those cases.
Signed-off-by: Magnus Kulke <magnuskulke@microsoft.com>
Add the CLI flag --runtime-class-names, which is used during
policy generation. For resources that can define a
runtimeClassName (e.g., Pods, Deployments, ReplicaSets,...)
the value must have any of the --runtime-class-names as
prefix, otherwise the resource is ignored.
This allows to run genpolicy on larger yaml
files defining many different resources and only generating
a policy for resources which will be deployed in a
confidential context.
Signed-off-by: Leonard Cohnen <lc@edgeless.systems>
It creates this line, as the Golang runtime does:
-object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0
Signed-off-by: Emanuel Lima <emlima@redhat.com>
We need to remove the device from the tracking map, a container
restart will increment the bus index and we will get out of root-ports
and crash the machine.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We need special handling for pod_sandbox, pod_container and
single_container how and when to inject CDI devices
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
In Kubernetes we still do not have proper VM sizing
at sandbox creation level. This KEP tries to mitigates
that: kubernetes/enhancements#4113 but this can take
some time until Kube and containerd or other runtimes
have those changes rolled out.
Before we used a static config of VFIO ports, and we
introduced CDI support which needs a patched contianerd.
We want to eliminate the patched continerd in the GPU case
as well.
Fixes: #8860
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
This option allows to add a debug monitor socket when
`enable_debug = true` to control QEMU within debugging case.
Fixes: #9603
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The `update_env_pci()` function need the PCI address mapping to
translate the host PCI address to guest PCI address in below
environment variables:
- PCIDEVICE_<prefix>_<resource-name>_INFO
- PCIDEVICE_<prefix>_<resource-name>
So collect PCI address mapping for both vfio-pci-gk and
vfio-pci devices.
Fixes#9614
Signed-off-by: Lei Huang <leih@nvidia.com>
Replace tab indentation with spaces for the three lines within the ifneq statements, aligning them with the surrounding code.
Fixes:#9692
Signed-off-by: sidneychang <2190206983@qq.com>
The `dial_timeout` works fine for Runtime-go, but is obsoleted in
Runtime-rs.
When the pod cannot connect to the Agent upon starting, we need to adjust
the `reconnect_timeout_ms` to increase the number of connection attempts to
the Agent.
Fixes: #9688
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
For the TD attestation to work the connection to QGS on the host is needed.
By default QGS runs on vsock port 4050, but can be modified by the host owner.
Format of the qemu object follows the SocketAddress structure, so it needs to be provided in the JSON format, as in the example below:
-object '{"qom-type":"tdx-guest","id":"tdx","quote-generation-socket":{"type":"vsock","cid":"2","port":"4050"}}'
Fixes: #9497
Signed-off-by: Jakub Ledworowski <jakub.ledworowski@intel.com>
The new version of sriov-network-device-plugin adds an env
`PCIDEVICE_<prefix>_<resource-name>_INFO`, which has a json
value; kata-agent can't parse it as env
`PCIDEVICE_<prefix>_<resource-name>` which has value in format
"DDDD:BB:SS.F".
This change updates env `PCIDEVICE_<prefix>_<resource-name>_INFO`.
Signed-off-by: Lei Huang <leih@nvidia.com>
Remove `rand.Seed` call to resolve the following failure:
```
rand.Seed is deprecated: As of Go 1.20 there is no reason to call Seed with a random value.
```
The go rand.Seed docs: https://pkg.go.dev/math/rand@go1.20#Seed
back this up and states:
> If Seed is not called, the generator is seeded randomly at program startup.
so I believe we can just delete the call.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
For now, let's allow the users to set the default_cpu and default_memory
when using TDX, as they may hit issues related to the size of the
container image that must be pulled and unpacked inside the guest,
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
- Make due to us bumping the golang version used in our CI
but `make vendor` fails without the go version in the runtime go.mod
being increased, so update this and run go mod tidy
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Fixes: #9532
Close stdin when write_stdin receives data of length 0.
Stop call notify_term_close() in close_stdin, because it could
discard stdout unexpectedly.
Signed-off-by: Tim Zhang <tim@hyper.sh>
We shouldn't be using 9p, at all, with TEEs, as off right now we have no
way to ensure the channels are encrypted. The way to work this around
for now is using guest pull, either with containerd + nydus snapshotter
or with CRI-O; or even tardev snapshotter for pulling on the host (which
is the approach used by MSFT).
This is only done for TDX for now, leaving the generic, AMD, and IBM
related stuff for the folks working on those to switch and debug
possible issues on their environment.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
In Kubernetes, the following values for namespace are equivalent and all refer to the default namespace:
- ` ` (namespace field missing)
- `namespace: ""` (namespace field is the empty string)
- `namespace: "default"`(namespace field has the explicit value `default`)
Genpolicy currently does not handle the empty string case correctly.
Signed-Off-By: Malte Poll <1780588+malt3@users.noreply.github.com>
Implementation of QemuCmdLine has a fairly uniform and repetitive structure
that's guided by a set of conventions. These conventions have however been
mostly implicit so far, leading to a superfluous and annoying
request/force-push churn during qemu-rs PR reviews.
This commit aims to make things explicit so that contributors can take them
into account before an initial PR submission.
Signed-off-by: Pavel Mores <pmores@redhat.com>
ResizeMemory for Cloud Hypervisor is missing a check for the new
requested memory being greater than the max hotplug size after
alignment. Add the check, and since an earlier check for this
setsrequested memory to the max hotplug size, do the same in the
post-alignment check.
Fixes#9640
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
All these settings are hardcoded as `false` and result in
no extra options on the QEMU command line, like the go
runtime does. There actually not needed :
- we're never going to ask QEMU to survive a guest shutdown
- we're never going to run QEMU daemonized since it prevents
log collection
- we're never going to ask QEMU to start with the guest stopped
No need to keep this code around then.
Signed-off-by: Greg Kurz <groug@kaod.org>
By default, when a container is created with the `--privileged` flag,
all devices in `/dev` from the host are mounted into the guest. If
there is a block device(e.g. `/dev/dm`) followed by a generic
device(e.g. `/dev/null`),two identical block devices(`/dev/dm`)
would be requested to the kata agent causing the agent to exit with error:
> Conflicting device updates for /dev/dm-2
As the generic device type does not hit any cases defined in `switch`,
the variable `kataDevice` which is defined outside of the loop is still
the value of the previous block device rather than `nil`. Defining `kataDevice`
in the loop fixes this bug.
Signed-off-by: cncal <flycalvin@qq.com>
RTC was being built in a wrong fashion on commit #2bc5e3c6e2ab0145fa9e8be95df0d5086c07a517
RTC was being constructed inside the QemuCmdLine struct,
but it should've been built inside the devices vector.
Signed-off-by: Emanuel Lima <emlima@redhat.com>
This was needed when we were using an old (and not maintained anymore)
host stack. Considering what we have as part of the distros, Today,
this can simply be dropped, as I cannot find any reference of this one
being needed in any up-to-date documentation.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit b7cccfa019.
The `private=on` bit has never made its way upstream, and was removed
from the latest iteration that we're using. With that in mind, let's
revert its usage in the code.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit 582b5b6b19.
The `private=on` bit has never made its way upstream, and was removed
from the latest iteration that we're using. With that in mind, let's
revert its addition, and later on its usage in the code.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's add the PLACEHOLDER_FOR_DISTRO_{QEMU,OVMF}_WITH_TDX_SUPPORT
variables instead of actually setting a path, so we can easily replace
those as part of our deployment scripts.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Everytime I create contianer on arm64 machine, containerd/kata logs a redundant warning
as follows:
``` shell
time="2024-05-07" level=warning msg="<nil>" arch=arm64 name=containerd-shim-v2
pid=xxx sandbox=fdd1f05 source=virtcontainers/hypervisor
```
I added an error statement so that the error would be logged when it occurs.
Signed-off-by: cncal <flycalvin@qq.com>
We should init and asign the runtime instance to runtime
handler, otherwise, if the pause container failed to start,
which means the runtime instance failed to start, then the
following delete & shutdown request wouldn't be run, thus
the dead shim would be left.
Fixes: #9597
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
dragonball reserves 2048G of mmio space for the pci root bus by default
on physical addresses greater than 4G. However, for some machines with
smaller physical address widths, such as 39-bit wide physical addresses,
dragonball reserves the mmio space when initializing the memory. It is
less than 2048G, so this commit dynamically calculates and allocates the
mmio size of each pci root bus.
Fixes: #9509
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
The CH v39 upgrade in #9575 is currently blocked because of a bug in the
Mariner host kernel. To address this, we temporarily tweak the Mariner
CI to use an Ubuntu host and the Kata guest kernel, while retaining the
Mariner initrd. This is tracked in #9594.
Importantly, this allows us to preserve CI for genpolicy. We had to
tweak the default rules.rego however, as the OCI version is now
different in the Ubuntu host. This is tracked in #9593.
This change has been tested together with CH v39 in #9588.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The build message shows that yq was not found when I tried to build
runtime binaries, but I've actually installed yq by yum install.
Signed-off-by: cncal <flycalvin@qq.com>
If device /dev/kvm does not exist, kata-runtime check would fail with
an ambiguous error messae 'no such file or directory'. I added a little
more details to make it understandable and it will belike:
```
ERRO[0000] cannot open kvm device: no such file or directory arch=arm64 check-type=full device=/dev/kvm name=kata-runtime pid=2849085 source=runtime
ERRO[0000] no such file or directory arch=arm64 name=kata-runtime pid=2849085 source=runtime
no such file or directory
```
Signed-off-by: cncal <flycalvin@qq.com>
- The testVersionString logic use regex to check that the ociVersion is
displayed correctly, but with the new go module that version has a
`+` in, so we need to quote this to escape special characters
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Created a new configuration to configure Kata for CoCo without requiring TEE
hardware so to allow developers implement/test/debug platform agnostic code
on their workstations. It will also ease testing of CoCo features on CI with
non-TEE supported VMs.
This is based off qemu configuration. The following differences applied:
- switched to confidential guest image/initrd
- switched to confidential kernel
- switched to 9p shared_fs
Fixes#9487
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Since OPA binary was replaced by the regorus crate, we can finally stop
building and shipping the binary.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
iommu_platform support was already added on initial DeviceVhostUserFs
introduction, however it incorrectly enabled iommu_platform also on
non-CCW (e.g. PCI) systems.
Signed-off-by: Pavel Mores <pmores@redhat.com>
iommu_platform is only turned on for CCW systems.
PartialEq is added to VirtioBusType to enable the '==' operator.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The adding itself is done by a new function add_iommu() that conforms with
the add_*() convention. Note though that this function is called
internally, by the QemuCmdLine constructor, simply because there's nothing
to trigger its invocation from QemuInner (unlike the other add_*()
functions so far).
Signed-off-by: Pavel Mores <pmores@redhat.com>
Remove reminder to initialize Policy earlier, because currently there
are no plans to initialize earlier.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Implement Agent Policy using the regorus crate instead of the OPA
daemon.
The OPA daemon will be removed from the Guest rootfs in a future PR.
Fixes: #9388
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Lock anyhow version to 1.0.58 because:
- Versions between 1.0.59 - 1.0.76 have not been tested yet using
Kata CI. However, those versions pass "make test" for the
Kata Agent.
- Versions 1.0.77 or newer fail during "make test" - see
https://github.com/kata-containers/kata-containers/issues/9538.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
In case of block devices using virtio-block, we need to pass the
pci-path as the storage source field to the agent.
Current the virt-path is being passed which works just for mmio block
devices.
In the future when support is added for scsi, block-ccw and pmem
devices, the storage source would need to be handled accordingly.
Fixes: #9034
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
In do_create_container and do_exec_process, we should create the proc_io first,
in case there's some error occur below, thus we can make sure
the io stream closed when error occur.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Fixes: #9334
In linux, when a FIFO is opened and there are no writers, the reader
will continuously receive the HUP event. This can be problematic.
To avoid this problem, we open stdin in write mode and keep the stdin-writer
We need to open the stdout/stderr as the read mode and keep the open endpoint
until the process is delete. otherwise,
the process would exit before the containerd side open and read
the stdout fifo, thus runD would write all of the stdout contents into
the stdout fifo and then closed the write endpoint. Then, containerd
open the stdout fifo and try to read, since the write side had closed,
thus containerd would block on the read forever.
Here we keep the stdout/stderr read endpoint File in the common_process,
which would be destroied when containerd send the delete rpc call,
at this time the containerd had waited the stdout read return, thus it
can make sure the contents in the stdout/stderr fifo wouldn't be lost.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
In scenario passfd-io, we should wait for stdin to close itself
instead of manually intervening in it.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
This patch adds a default implementation for the use_sandbox_pidns
and updates the structs that implement the K8sResource trait to use
the default.
Fixes: #8960
Signed-off-by: Archana Choudhary <archana1@microsoft.com>
- Provide default implementation for get_sandbox_name in K8sResource trait
- Remove default implementation from structs implementing the trait K8sResource
Fixes: #8960
Signed-off-by: Archana Choudhary <archana1@microsoft.com>
concurrently with itself
Based on 3a1461b0a5186a92afedaaea33ff2bd120d1cea0
Previously the tool would use the layers_cache folder for all instances
and hence delete the cache when it was done, interfereing with other
instances. This change makes it so that each instance of the tool will
have its own temp folder to use.
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
updates golang.org/x/net to newer version that closes some reported
vulnerabilities and security issues
Fixes#9486
Signed-off-by: Adil Sadik <sparky.005@gmail.com>
1. EPOLLHUP events also need to be read and will be got len 0.
2. We should kill the connection when EPOLLERR events are received.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
To make `qemu-runtime-rs` working for CI, we have to rename a configuration
template file and `CONFIG_FILE_QEMU` in Makefile.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Linux kernel generates a panic when the init process exits.
The kernel is booted with panic=1, hence this leads to a
vm reboot.
When used as a service the kata-agent service has an ExecStop
option which does a full sync and shuts down the vm.
This patch mimicks this behavior when kata-agent is used as
the init process.
Fixes: #9429
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
isClhRunning uses signal 0 to test whether the process is
still alive or not. This doesn't work because the process is a
direct child of the shim. Once it is dead the process becomes
zombie.
Since no one waits for it the process lingers until
its parent dies and init reaps it. Hence sending signal 0 in
isClhRunning will always return success whether the process is
dead or not.
This patch calls wait to reap the process, if it succeeds that
means it is our child process, if not we send the signal.
Fixes: #9431
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
genpolicy is a handy tool to use in CI systems, to prepare workloads
before applying them to the Kubernetes API server. However, many modern
build systems like Bazel or Nix restrict network access, and rightfully
so, so any registry interaction must take place on localhost.
Configuring certificates for localhost is tricky at best, and since
there are no privacy concerns for localhost traffic, genpolicy should
allow to contact some registries insecurely. As this is a runtime
environment detail, not a target environment detail, configuring
insecure registries does not belong into the JSON settings, so it's
implemented as command line flags.
Fixes: #9008
Signed-off-by: Markus Rudy <webmaster@burgerdev.de>
CRIs don't always use a pause container, but even if they do the
concrete container choice is not specified. Even if the CRI config can
be tweaked, it's not guaranteed that registries in the public internet
can be reached. To be portable across CRI implementations and
configurations, the genpolicy user needs to be able to configure the
container the tool should append to the policy.
Signed-off-by: Markus Rudy <webmaster@burgerdev.de>
Decouple initialization of the Settings struct from creating the
AgentPolicy struct, so that the settings are available for evaluating,
extending or overriding command line arguments.
Signed-off-by: Markus Rudy <webmaster@burgerdev.de>
This patch introduces a one-time cpath to mitigate the cgroup residuals. It
might break the device cgroup merging rules when the cgroup has children.
Fixes: #9456
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
- Add v1 image test case
- Install protobuf-compiler in build check
- Reset containerd config to default in kubernetes test if we are testing genpolicy
- Update docker_credential crate
- Add test that uses default pull method
- Use GENPOLICY_PULL_METHOD in test
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Add optional toggle to use existing containerd installation to pull and manage container images.
This adds support to a wider set of images that are currently not supported by standard pull method,
such as those that use v1 manifest.
Fixes: #9144
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
If 'rest_api' is configured, let's start the api-server-rest after
the attestation-agent and the confidential-data-hub have been started.
Fixes: #7555
Signed-off-by: Biao Lu <biao.lu@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Linda Yu <linda.yu@intel.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Jakob Naucke <jakob.naucke@ibm.com>
Co-authored-by: Wang, Arron <arron.wang@intel.com>
Co-authored-by: zhouliang121 <liang.a.zhou@linux.alibaba.com>
Co-authored-by: Alex Carter <alex.carter@ibm.com>
Co-authored-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Co-authored-by: Xynnn007 <xynnn@linux.alibaba.com>
Add configuration for 'rest api server'.
Optional configurations are
'agent.rest_api=attestation' will enable attestation api
'agent.rest_api=resource' will enable resource api
'agent.rest_api=all' will enable all (attestation and resource) api
Fixes: #7555
Signed-off-by: Biao Lu <biao.lu@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Linda Yu <linda.yu@intel.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Jakob Naucke <jakob.naucke@ibm.com>
Co-authored-by: Wang, Arron <arron.wang@intel.com>
Co-authored-by: zhouliang121 <liang.a.zhou@linux.alibaba.com>
Co-authored-by: Alex Carter <alex.carter@ibm.com>
Co-authored-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Co-authored-by: Xynnn007 <xynnn@linux.alibaba.com>
Let's introduce a new method to start the confidential data hub and the
attestation agent. The former depends on the later, and it needs to be
started before the RPC server.
Starting the attestation components is based on whether the confidential
containers guest components binaries are found in the rootfs.
Fixes: #7544
Signed-off-by: Biao Lu <biao.lu@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Linda Yu <linda.yu@intel.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Jakob Naucke <jakob.naucke@ibm.com>
Co-authored-by: Wang, Arron <arron.wang@intel.com>
Co-authored-by: zhouliang121 <liang.a.zhou@linux.alibaba.com>
Co-authored-by: Alex Carter <alex.carter@ibm.com>
Co-authored-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Co-authored-by: Xynnn007 <xynnn@linux.alibaba.com>
Support different pause images in the guest for guest-pull, such as k8s
pause image (registry.k8s.io/pause) and openshift pause image (quay.io/bpradipt/okd-pause).
Fixes: #9225 -- part III
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Let's rely on the kvm module 'tdx' parameter to do so.
This aligns with both OSVs (Canonical, Red Hat, SUSE) and the TDX
adoption (https://github.com/intel/tdx-linux) stacks.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This commit is a mess, but I'm not exactly sure what's the best way to
make it less messy, as we're getting QEMU TDX to work while partially
reverting 1e34220c41.
With that said, let me cover the content of this commit.
Firstly, we're reverting all the changes related to
"memory-backend-memfd-private", as that's what was used with the
previous host stack, but it seems it
didn't fly upstream.
Secondly, in order to get QEMU to properly work with TDX, we need to
enforce the 'private=on' knob and use the "memory-backend-ram", and
we're doing so, and also making sure to test the `private=on` newly
added knob.
I'm sorry for the confusion, I understand this is not optimal, I just
don't see an easy path to do changes without leaving the code broken
during those changes.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit d1b54ede29.
Conflicts:
src/runtime/virtcontainers/qemu.go
This commit was a hack that was needed in order to get QEMU + TDX to
work atop of the stack our CI was running on. As we're moving to "the
officially supported by distros" host OS, we need to get rid of this.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The private=on|off knob is required in order to properly lauunch a TDX
guest VM.
This is a brand new property that is part of the still in-flight patches
adding TDX support on QEMU.
Please, see:
3fdd8072da
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We used to utilize go runtime's "NumCPUs()", which will give the number
of cores available to the Go runtime, which may be a subset of physical
cores if the shim is started from within a cpuset. From the function's
description:
"NumCPU returns the number of logical CPUs usable by the current
process."
As an example, if containerd is run from within a smaller CPUset, the
maximum size of a pod will be dictated by this CPUset, instead of what
will be available on the rest of the system.
Since the shim will be moved into its own cgroup that may have a
different CPUset, let's stick with checking physical cores. This also
aligns with what we have documented for maxVCPU handling.
In the event we fail to read /proc/cpuinfo, let's use the goruntime.
Fixes: #9327
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Try to reduce duplicated code in decrease_attach_count with public
new function do_decrease_count.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Try to reduce duplicated code in increase_attach_count with public
new function do_increase_count.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce a dedicated public function do_decrease_count to
reduce duplicated code in drivers' decrease_attach_count.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Since there are many implementations of reference counting in the
drivers, all of which have the same implementation, we should try
to reduce such duplicated code as much as possible. Therefore, a
new function is introduced to solve the problem of duplicated code.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Configure the system to mount cgroups-v2 by default during system boot
by the systemd system, We must add systemd.unified_cgroup_hierarchy=1
parameter to kernel cmdline, which will be passed by kernel_params in
configuration.toml.
To enable cgroup-v2, just add systemd.unified_cgroup_hierarchy=true[1]
to kernel_params.
Fixes: #9336
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When there's a pod with multiple containers, there may be case that
attach point more than 2, we should not return Err in that case when
we are doing attach ops, but just return Ok.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This reverts commit 1c5693be86.
Avoid apparent infinite loop when ReadStreamRequest is blocked by
policy - for some of the pods.
When running the k8s-limit-range.bats test with Policy enabled,
the Shim + VMM never get terminated on my cluster. Not sure why
the sandbox clean-up works better for other tests, but the
k8s-limit-range test pod gets stuck in an infinite loop:
stdout io stream copy error happens: error = %wrpc error: code =
PermissionDenied desc = \"ReadStreamRequest is blocked by policy
...
policy check: ReadStreamRequest
...
stdout io stream copy error happens: error = %wrpc error: code =
PermissionDenied desc = \"ReadStreamRequest is blocked by policy
...
policy check: ReadStreamRequest
...
Fixes: #9380
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Since we no longer use the service_offload configuration,
remove the ServiceOffload field from the image struct.
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
These experimental options were added 2 years ago
in anticipation of features that would be added
in CoCo. These do not match the features that were
eventually added and will soon be ported to main.
Fixes: #8047
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
This is the report from `make check`:
```
error: this loop never actually loops
--> src/signal.rs:147:9
|
147 | / loop {
148 | | select! {
149 | | _ = handle => {
150 | | println!("INFO: task completed");
... |
156 | | }
157 | | }
| |_________^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#never_loop
= note: `#[deny(clippy::never_loop)]` on by default
```
There is only one option: you get something or a timeout. You never retry, so
the report is correct.
Fixes: #9342
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
The lint report is the following:
```
error: `flatten()` will run forever if the iterator repeatedly produces an `Err`
--> src/rpc.rs:1754:10
|
1754 | .flatten()
| ^^^^^^^^^ help: replace with: `map_while(Result::ok)`
|
note: this expression returning a `std::io::Lines` may produce an infinite number of `Err` in case of a read error
--> src/rpc.rs:1752:5
|
1752 | / reader
1753 | | .lines()
| |________________^
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#lines_filter_map_ok
= note: `-D clippy::lines-filter-map-ok` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(clippy::lines_filter_map_ok)]`
```
This commit simply applies the suggestion.
Fixes: #9342
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
Running `make check` in the `src/agent` directory gives:
```
error: you seem to use `.enumerate()` and immediately discard the index
--> rustjail/src/mount.rs:572:27
|
572 | for (_index, line) in reader.lines().enumerate() {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unused_enumerate_index
= note: `-D clippy::unused-enumerate-index` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(clippy::unused_enumerate_index)]`
help: remove the `.enumerate()` call
|
572 | for line in reader.lines() {
| ~~~~ ~~~~~~~~~~~~~~
Checking tokio-native-tls v0.3.1
Checking hyper-tls v0.5.0
Checking reqwest v0.11.18
error: could not compile `rustjail` (lib) due to 1 previous error
warning: build failed, waiting for other jobs to finish...
make: *** [../../utils.mk:177: standard_rust_check] Error 101
```
Fixes: #9342
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
support to configure CreateContainerRequestTimeout in the
configurations.
e.g.:
[runtime]
...
create_container_timeout = 300
Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
(https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Most of the content of `docs/Stable-Branch-Strategy.md` got de-facto
deprecated by the re-design of the release process described in #9064.
Remove this file and all its references in the repo.
The `## Versioning` section has some useful information though. It is
moved to `docs/Release-Process.md`. The documentation of the `PATCH`
field is adapted according to new workflow.
Fixes#9064 - part VI
Signed-off-by: Greg Kurz <groug@kaod.org>
Support to configure CreateContainerRequestTimeout in the annotations.
e.g.:
annotations:
"io.katacontainers.config.runtime.create_container_timeout": "300"
Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
(https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
In the situation to pull images in the guest #8484, it’s important to account for pulling large images.
Presently, the image pull process in the guest hinges on `CreateContainerRequest`, which defaults to a 60-second timeout.
However, this duration may prove insufficient for pulling larger images, such as those containing AI models.
Consequently, we must devise a method to extend the timeout period for large image pull.
Fixes: #8141
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
The remaining code in network.rs was mostly moved to utils.rs which seems
better home for these utility functions anyway (and a closely related
function open_named_tuntap() has already lived there).
ToString implementation for Address was removed after some consideration.
Address should probably ideally implement Display (as per RFC 565) which
would also supply a ToString implementation, however it implements Debug
instead, probably to enable automatic implementation of Debug for anything
that Address is a member of, if for no other reason. Rather than having
two identical functions this commit simply switches to using the Debug
implementation for printing Address on qemu command line.
Fixes#9352
Signed-off-by: Pavel Mores <pmores@redhat.com>
Make file descriptors to be passed to qemu owned by QemuCmdLine. See
commit 52958f17cd for more explanation.
Signed-off-by: Pavel Mores <pmores@redhat.com>
add_network_device() doesn't need to be passed NetworkInfo since it
already has access to the full HypervisorConfig.
Also, one of the goals of QemuCmdLine interface's design is to avoid
coupling between QemuCmdLine and the hypervisor crate's device module,
if at all possible. That's why add_network_device() shouldn't take
device module's NetworkConfig but just parts that are useful in
add_network_device()'s implementation.
Signed-off-by: Pavel Mores <pmores@redhat.com>
is_running_in_vm() is enough to figure out whether to disable_modern but
it's clumsy and verbose to use. should_disable_modern() streamlines the
usage by encapsulating the verbosity.
Signed-off-by: Pavel Mores <pmores@redhat.com>
This commit replaces the existing NetDevice-based implementation with one
using Netdev and DeviceVirtioNet.
Signed-off-by: Pavel Mores <pmores@redhat.com>
In keeping with architecture of QemuCmdLine implementation we split the
functionality into two objects: Netdev to represent and generate the
-netdev part and DeviceVirtioNet for the -device virtio-net-<transport>
part.
This change is a pure refactor, existing functionality does not change.
However, we do remove some stub generalizations and govmm-isms, notably:
- we remove the NetDev enum since the only network interface types that
kata seems to use with qemu are tuntap and macvtap, both of which are
implemented by the same -netdev tap
- enum DeviceDriver is also left out since it doesn't seem reasonable to
try to represent VFIO NICs (which are completely different from
virtio-net ones) with the same struct as virtio-net
- we also remove VirtioTransport because there's no use for it so far, but
with the expectation that it will be added soon.
We also make struct Netdev the owner of any vhost-net and queue file
descriptors so that their lifetime is tied ultimately to the lifetime of
QemuCmdLine automatically, instead of returning the fds to the caller and
forcing it to achieve the equivalent functionality but manually.
Signed-off-by: Pavel Mores <pmores@redhat.com>
generate_netdev_fds() takes NetworkConfig from which it however only needs
a host-side network device name. This commit makes it take the device name
directly, making the function useful to callers who don't have the whole
NetworkConfig but do have the requisite device name.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The idea of this function is to make sure O_CLOEXEC is not set on file
descriptors that should be inherited by a child (=hypervisor) process.
The approach so far is however rather heavy-handed - clearing *all* flags
is unjustifiably aggresive for a low-level function with no knowledge of
context whatsoever.
This commit refactors the function so that it only does what's expected
and renames it accordingly. It also clarifies some of its call sites.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Kata CI has full debug output enabled for the cbl-mariner k8s tests,
and the test AKS node is relatively slow. So debug prints from policy
are expensive during CI.
Fixes: #9296
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Change scripts and source that uses files in the tests repo to use the
corresponding file in the current repo.
Fixes#9165
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Currently, `.lock().await.clone()` results in `Option<ImageService>` being duplicated in memory with each call to `singleton()`.
Consequently, if kata-agent receives numerous image pulling requests simultaneously,
it will lead to the allocation of multiple `Option<ImageService>` instances in memory, thereby consuming additional memory resources.
In image.rs, we introduce two public functions:
`merge_bundle_oci()` and `init_image_service()`. These functions will encapsulate
the operations on `IMAGE_SERVICE`, ensuring that its internal details remain
hidden from external modules such as `rpc.rs`.
Fixes: #9225 -- part II
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
It is observed that virtiofsd exits immediately on s390x
if there is no attached console devices.
This commit resolves the issue by migrating `appendConsole()`
from runtime and being triggered in `start_vm()`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
For s390x, it requires an additional option `memory-backend` for `-machine`.
Otherwise, virtiofsd exits with HandleRequest(InvalidParam).
This commit is to add a field `memory_backend` to `struct Machine`
and turn it on for s390x.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Like nvdimm for x86_64, a block device for s390x should be
treated differently with `virtio-blk-ccw`.
This is to generate a QEMU command line parameter for a block
device by using `-blockdev` and `-device` if the `vm_rootfs_driver`
is set to `virtio-blk-ccw`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Add in the full details once cloud-hypervisor/cloud-hypervisor#6103
has been implemented, and the feature is available in a Cloud Hypervisor
release.
Fixes: #8799
Signed-off-by: David Esparza <david.esparza.borquez@intel.com>
Currently, `*-pci` is used as an argument for the device config.
It is not true for a case where a different type of bus is used.
s390x uses `ccw`.
This commit is to make it flexible to generate the device argument
based on the bus type. A structure `DeviceVhostUserFsPci` and
`VhostVsockPci` is renamed to `DeviceVhostUserFs` and `VhostVsock`
because the structure name is not bound to a certain bus type any more.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It has been observed that the runtime stops running around
`sysinfo::total_memory()` while adjusting a config on s390x.
This is to update the crate to the latest version which happened
to resolve the issue. (No explicit release note for this)
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
MTRR, or Memory-Type Range Registers are a group of x86 MSRs providing a way to control access
and cache ability of physical memory regions.
During our test in runtime-rs + Dragonball, we found out that this register support is a must
for passthrough GPU running CUDA application, GPU needs that information to properly use GPU memory.
fixes: #9310
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
When the https_proxy/no_proxy settings are configured alongside agent-policy enabled, the process of pulling image in the guest will hang.
This issue could stem from the instantiation of `reqwest`’s HTTP client at the time of agent-policy initialization,
potentially impacting the effectiveness of the proxy settings during image guest pulling.
Given that both functionalities use `reqwest`, it is advisable to set https_proxy/no_proxy prior to the initialization of agent-policy.
Fixes: #9212
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Enable to build kata-agent with PULL_TYPE feature.
We build kata-agent with guest-pull feature by default, with PULL_TYPE set to default.
This doesn't affect how kata shares images by virtio-fs. The snapshotter controls the image pulling in the guest.
Only the nydus snapshotter with proxy mode can activate this feature.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
To support handle image-guest-pull block volume from different CRIs, including cri-o and containerd.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
By default the pause image and runtime config will provided
by host side, this may have potential security risks when the
host config a malicious pause image, then we will use the pause
image packaged in the rootfs.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Julien Ropé <jrope@redhat.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Add "guest-pull" feature option to determine that the related dependencies
would be compiled if the feature is enabled.
By default, agent would be built with default-pull feature, which would
support all pull types, including sharing images by virtio-fs and
pulling images in the guest.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
support to pass image information to guest by KataVirtualVolumeImageGuestPullType
in KataVirtualVolume, which will be used to pull image on the guest.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
As we do not employ a forked containerd in confidential-containers, we utilize the KataVirtualVolume
which storing the image information as an integral part of `CreateContainer`.
Within this process, we store the image information in rootfs.storage and pass this image url through `CreateContainerRequest`.
This approach distinguishes itself from the use of `PullImageRequest`, as rootfs.storage is already set and initialized at this stage.
To maintain clarity and avoid any need for modification to the `OverlayfsHandler`,we introduce the `ImagePullHandler`.
This dedicated handler is responsible for orchestrating the image-pulling logic within the guest environment.
This logic encompasses tasks such as calling the image-rs to download and unpack the image into `/run/kata-containers/{container_id}/images`,
followed by a bind mount to `/run/kata-containers/{container_id}`.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
When being passed an image name through a container annotation,
merge its corresponding bundle OCI specification and process into the passed container creation one.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Jiang Liu <gerry@linux.alibaba.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
Co-authored-by: jordan9500 <jordan.jackson@ibm.com>
Containerd can support set a proxy when downloading images with a environment variable.
For CC stack, image download is offload to the kata agent, we need support similar feature.
Current we add https_proxy and no_proxy, http_proxy is not added since it is insecure.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
With image-rs pull_image API, the downloaded container image layers
will store at IMAGE_RS_WORK_DIR, and generated bundle dir with rootfs
and config.json will be saved under CONTAINER_BASE/cid directory.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Jiang Liu <gerry@linux.alibaba.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
Introduce structure ImageService, which will be used to pull images
inside the guest.
Fixes: #8103
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
co-authored-by: stevenhorsman <steven@uk.ibm.com>
File descriptors that are passed to QEMU need some special care.
We want them to be closed when the QEMU process is started. But
at the same time, it is required that the associated rust File
structures, either coming from the` std::fs` or the `tokio::fs`
crates, are still in scope when the QEMU process is forked. This
is currently achieved by keeping File structures in variables
at the outer scope of `start_vm()`. This scheme is currently
duplicated, with similar justifications in the corresponding
comments.
Consolidate all this handling in one place with a more generic
explanation.
Fixes#9281
Signed-off-by: Greg Kurz <groug@kaod.org>
The agent now has a number of optional build-time features that can be
enabled.
Add details of these features to the following areas:
- Version output (`kata-agent --version`)
- Announce message (so that the details are always added to the journal
at agent startup).
- The response message returned by the ttRPC `GetGuestDetails()` API.
Fixes: #9285.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Some previous contribution missed to run cargo clippy.
Fix the dependency now so that it doesn't cause noise
in future contributions.
Signed-off-by: Greg Kurz <groug@kaod.org>
Fixes: #9269
From https://github.com/opencontainers/runtime-spec/blob/main/config.md#mounts
type (string, OPTIONAL) The type of the filesystem to be mounted.
bind may be only specified in the oci spec options -> flags update r#type
The agent will ignore bind mounts if they are only specified in the OCI spec options and not in the flags.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
s390x supports a different machine type `s390-ccw-virtio` and it is
not required to configure cpu features by default for the platform.
A hypervisor `dragonball` is not supported on s390x so that `DBCMD`
is not necessary. `vm-rootfs_driver` should be set to `virtio-blk-ccw`.
This commit is to set the architecture-specific flags for Makefile.
Fixes: #9158
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The guest_hook_path item in configuration.toml allows OCI hook scripts
to be executed within Kata's guest environment. Traditionally, these
guest hook programs are pre-built and included in Kata's guest rootfs
image at a fixed location.
While setting guest_hook_path = "/usr/share/oci/hooks" in configuration.toml
works, it lacks flexibility. Not all guest hooks reside in the path
/usr/share/oci/hooks, and users might have custom locations.
To address this, a more flexible and configurable approach is to be proposed
that allows users to specify their desired path. This could include using a
sandbox bind mount path for hooks specific to that particular container.
However, The current implementation of guest hooks and bind mounts in kata-agent
has a reversed order of execution compared to the desired behavior.
To achieve the intended functionality, we simply need to swap the order of their
implementation.
Fixes: #9274
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Fixes: #9267
The doc states we have support for all lifecycle hooks. There are still some missing.
This is the first issue regarding the CreateContainer hook which is run before pivot_root but after prestart and createruntime
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
The open_named_tuntap function is designed as a public function to
open a tuntap device with the specified name. However, in order to
reference existing methods in dbs_utils, we still need to keep the
reference "path = "../../../dragonball/src/dbs_utils" in dependencies
and cannot hide it.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add network helpers and impl ToQemuParams trait to build
netdev params which are put into cmdline for Qemu VM running.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We need ensure the add_network_device happens in netns and
move qemu process into netns which keeps the qemu process
running in this net namespace.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add network device handler in start_vm, which is sepcially
for Qemu VM running with added net params to command line.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We need add a new netns field in struct QemuInner, and
initialize it with argument passed down in prepare_vm().
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The enter_netns function is designed as a public method to help
VMMs running as a independent process enter a network namespace,
reducing duplicate code.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
It just move the related code to a public file(utils.rs) and make
it a common method for both vsock and network, or some others.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Pavel Mores <pmores@redhat.com>
In order to better support non-builtin vmm usage of NetnsGuard and
reduce code duplication, we need to move it to a common path that
can be referenced by both hypervisor and resource manager.
In this patch, it just do moving code from network/utils/netns.rs
to kata-sys-utils/src/netns.rs
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The PID needs to be initialized before calling isClhRunning.
waitVMM() uses isClhRunning and is called by launchClh() just
before returning from function.
Fixes: #9230
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Use containerd's default environment for container images that don't
specify the Env field.
Also, re-enable policy env variable verification, now that these
uncommon images are supported too.
Fixes: #9239
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This change updates the module import to use 'util' instead of the deprecated 'io/util'
Fixes: #9166
Signed-off-by: Chungeun Choi <ce.choi@okestro.com>
copyBuffer returns and the streams will be closed when error occurs.
If the error contains "blocked by policy" it means the log output is
disabled by policy with "ReadStreamRequest" and "WriteStreamRequest" set
to false. But at this moment, we want the real stream still working (not
be seen) because we might want to enable logging for debugging purpose,
so we repeat copybuffer in this case to avoid streams being closed.
Fixes: #8797
Signed-off-by: Linda Yu <linda.yu@intel.com>
logging/debugging information might probably be disabled in production
due to security consideration, but we'd better provide an approach for
customer to get logging information during runtime, this PR implement
setpolicy function in kata-runtime tools, although it can set whole policy
other than logging.
setpolicy would evokes remote attestation, which means before setting
policy during runtime, user has to reconfigure new policy hash in KBS/AS.
usage: kata-runtime policy set policy.rego --sandbox-id XXXXXXXX
Fixes: #8797
Signed-off-by: Linda Yu <linda.yu@intel.com>
Disable env variable verification to unblock CI, until container
images that don't specify the Env variables will be handled correctly
(see #9239).
Also, mark the image config Env field as optional, thus allowing
policy generation for these container images.
Fixes: #9240
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
a) There is some unknown syscalls triggered in new github virt machine
that would break the make test process with SIGSYS after applying
SeccompFilter. In order to fix this, we change the allowlist in this
unit test for seccompfileter into a blocklist to avoid meeting the unknown syscalls.
b) lazy static METRICS is not fully initialize in the unit test and may lead to
unstable result for this UT.
fixes: #9207
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
the mmap region start guest addr hard-code a value and later there
would be check whether the mentioned addr is larger than or equal
to mem_end (default to host_phy_mem >> 1) in order to satisfy the
requirement for DaxMemory. Since github virt machine phy_mem is larger
than previous CI machine we use, the hard-code value could no longer be
worked. To fix this, we change the address to mem_end in unit test to
avoid the influence of host machine change.
fixes: #9207
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This fixes a panic on tracing on container exit.
The root cause is that global var needs to be set by "=" instead of
":=".
Fixes: #9102
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
Add all agent configuration options to README so that users can more easily understand
what these options do and how to configure them at runtime.
Fixes: #9109
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Relax the timeout for calling CLH's CreateVM + BootVM APIs. When
hitting the older 1s timeout, killing a half-booted Guest and
retrying the same boot sequence could have been wasteful and resulting
in unstable CI testing on slower Hosts.
Fixes: #9152
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
For consistency with the go runtime.
As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
agent network metrics (kata_guest_netdev_stat).
Signed-off-by: Julien Ropé <jrope@redhat.com>
As part of the shim network metrics, the shim is reporting network interfaces
from the host with no namespace isolation - this gives insight in interfaces
not tied to the kata containers, and causes an increase in resource usage for
kata metrics.
As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
hypervisor network metrics (kata_hypervisor_netdev) and from the agent
network metrics (kata_guest_netdev_stat).
Fixes: #5738
Signed-off-by: Julien Ropé <jrope@redhat.com>
We need initailize the pci_hotplug_enabled with true before we do GPU
passthrough with runtime-rs/dragonball. Otherwise it fails with error
`InvalidOperation`.
Fixes: #9129
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When cgroup v2 is in use, a container should only see its part of the
unified hierarchy in `/sys/fs/cgroup`, not the full hierarchy created
at the OS level. Similarly, `/proc/self/cgroup` inside the container
should display `0::/`, rather than a full path such as :
0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podde291f58_8f20_4d44_aa89_c9e538613d85.slice/crio-9e1823d09627f3c2d42f30d76f0d2933abdbc033a630aab732339c90334fbc5f.scope
What is needed here is isolation from the OS. Do that by running the
container in its own cgroup namespace. This matches what runc and
other non VM based runtimes do.
Fixes#9124
Signed-off-by: Greg Kurz <groug@kaod.org>
In case an error is encountered while removing a network endpoint during
network cleanup, we cuurently return immediately with the error.
With this change, in case of error we simply log the error and proceed
towards removing the next endpoint. With this, we can cleanup the
network changes made by the shim as much as possible.
This is especially important when multiple interfaces are passed to the
network namespace using a network plugin like multus.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Move the defer for cleaning up network before the call to add network.
This way if any change made by add network is reverted by in case of
failure. This is particulary important for physical network interfaces
as with this step we make sure that driver for the physical interface is
reverted back to the original host driver. Without this the physical
network iterface will remain bound to vfio.
Fixes: #8646
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>