The qemu-coco-dev runtime class should be as close as possible to what
the TEEs runtime classes are doing, and this was one of the options that
ended up overlooked till now.
Shout out to Dan Mihai for noticing that!
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Add GPU annotations for remote hypervisor to help
with the right instance selection based on number of GPUs
and model
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
We may decide to add this later on, but for now this is only targetting
TEEs and the confidential image / initrd.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Return of proper error to the initiator is not guaranteed.
Method StopVM could kill shim process together with VM pieces.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
When QEMU is terminated by signal 15, it deletes the PidFile.
Upon detecting that QEMU has exited, the shim executes the stopVM function.
If the PidFile is not found, the PID is set to 0.
Subsequently, the shim executes `kill -9 0`, which terminates the current process group.
This prevents any further logic from being executed, resulting in resources not being cleaned up.
Signed-off-by: wangyaqi54 <wangyaqi54@jd.com>
When using network adapters that support SR-IOV, a VFIO device can be
plugged into a guest VM and claimed as a network interface. This can
significantly enhance network performance.
Fixes: #9758
Signed-off-by: Lei Huang <leih@nvidia.com>
As we don't have any CI, nor maintainer to keep ACRN code around, we
better have it removed than give users the expectation that it should or
would work at some point.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
kata-shim was not reporting `inactive_file` in memory stat.
This memory is deducted by containerd when calculating the size of container working set, as it can be paged out by the operating
system under memory pressure. Without reporting `inactive_file`, containerd will over report container memory usage.
[Here](https://github.com/containerd/containerd/blob/v1.7.22/pkg/cri/server/container_stats_list_linux.go#L117) is where containerd
deducts `inactive_file` from memory usage.
Note that kata-shim correctly reports `total_inactive_file` for cgroup v1, but this was not implemented for cgroup v2.
This commit:
- Adds code in kata-shim to report "inactive_file" memory for cgroup v2
- Implements reporting of all available cgroup v2 memory stats to containerd
- Uses defensive coding to avoid assuming existence of any memory.stat fields
The list of available cgroup v2 memory stats defined by containerd can be found
[here](https://pkg.go.dev/github.com/containerd/cgroups/v2/stats#MemoryStat).
Fixes#10280
Signed-off-by: Alex Man <alexman@stripe.com>
It will panic when users do GPU vfio passthrough with cdi in runtime.
The root cause is that CustomSpec.Annotations is nil when new element
added.
To address this issue, initialization is introduced when it's nil.
Fixes#10266
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
I know right now we're always passing a value for that, but this doesn't
really have to be set unless attestation is used. Thus, let's also omit
it in case it's empty.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
This is a quick and simple pre-req for supporting initData, which will
take advantage of the mrconfigid in the TDX case.
While already adding mrconfigid, which is hardcoded empty right now,
let's do the same for mrowner and mrownerconfig, and leave it prepared
for future expansions.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
The reason we're relying on yet another function to do so is because the
TDX object will be used in its qom / qapi json format.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
This error is specific to SNP platforms, so let's make sure we only
error this out when an SNP platform is used.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
...or by using a binary with additional suffix.
This allows having multiple versions of nydus-overlayfs installed on the
host, telling nydus-snapshotter which one to use while still detecting
Nydus is used.
Signed-off-by: Paul Meyer <49727155+katexochen@users.noreply.github.com>
PhysicalEndpoint unbinds its VF interface and rebinds it as a VFIO device,
then cold-plugs the VFIO device into the guest kernel.
When `cold_plug_vfio` is set to "no-port", cold-plugging the VFIO device
will fail.
This change checks if `cold_plug_vfio` is enabled before creating PhysicalEndpoint
to avoid unnecessary VFIO rebind operations.
Fixes: #10162
Signed-off-by: Lei Huang <leih@nvidia.com>
This patch re-generates the client code for Cloud Hypervisor v41.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #10203
Signed-off-by: Bo Chen <chen.bo@intel.com>
This commit allows `cdh_api_timeout` to be configured from the configuration file.
The configuration is commented out with specifying a default value (50s) because
the default value is configured in the agent.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Improved error handling to provide clearer feedback on request failures.
For example:
Improve createcontainer request timeout error message from
"Error: failed to create containerd task: failed to create shim task:context deadline exceed"
to "Error: failed to create containerd task: failed to create shim task: CreateContainerRequest timed out: context deadline exceed".
Fixes: #10173 -- part II
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Our code for handling images being pulled inside the guest relies on a
containerType ("sandbox" or "container") being set as part of the
container annotations, which is done by the CRI Engine being used, and
depending on the used CRI Engine we check for a specfic annotation
related to the image-name, which is then passed to the agent.
However, when running kata-containers without kubernetes, specifically
when using `nerdctl`, none of those annotations are set at all.
One thing that we can do to allow folks to use `nerdctl`, however, is to
take advantage of the `--label` flag, and document on our side that
users must pass `io.kubernetes.cri.image-name=$image_name` as part of
the label.
By doing this, and changing our "fallback" so we can always look for
such annotation, we ensure that nerdctl will work when using the nydus
snapshotter, with kata-containers, to perform image pulling inside the
pod sandbox / guest.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This parameter has been deprecated for a long time and QEMU 9.1.0 finally removes it.
Fixes: kata-containers#10112
Signed-off-by: Tom Dohrmann <erbse.13@gmx.de>
Similar to HotAttach, the HotDetach method signature for network
endoints needs to be changed as well to allow for the method to make
use of device manager to manage the hot unplug of physical network
devices.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Enable physical network interfaces to be hotplugged.
For this, we need to change the signature of the HotAttach method
to make use of Sandbox instead of Hypervisor. Similar approach was
followed for Attach method, but this change was overlooked for
HotAttach.
The signature change is required in order to make use of
device manager and receiver for physical network
enpoints.
Fixes: #8405
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
The current implementation for device binding using driver bind/unbind
and new_id fails in the scenario when the physical device is not bound
to a driver before assigning it to vfio.
There exists and updated mechanism to accomplish the same that does not
have the same issue as above.
The driver_override field for a device allows us to specify the driver for a device
rather than relying on the bound driver to provide a positive match of the
device. It also has other advantages referenced here:
https://patchwork.kernel.org/project/linux-pci/patch/1396372540.476.160.camel@ul30vt.home/
So use the updated driver_override mechanism for binding/unbinding a
physical device/virtual function to vfio-pci.
Signed-off-by: liangxianlong <liang.xianlong@zte.com.cn>
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Sets SharedFS config to NoSharedFS for remote hypervisor in order to start the file watcher which syncs files from the host to the guest VMs.
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
The DAN feature has already been implemented in kata-runtime-rs, and
this commit brings the same capability to the Go kata-runtime.
Fixes: #9758
Signed-off-by: Lei Huang <leih@nvidia.com>
Do not install the packages librados-dev and librbd-dev as they are not needed for building static qemu.
Add machine option cap-ail-mode-3=off while creating the VM to qemu cmdline.
Fixes: #9893
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
This patch re-generates the client code for Cloud Hypervisor v40.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #9929
Signed-off-by: Bo Chen <chen.bo@intel.com>
Move the `sandbox.agent.setPolicy` call out of the remoteHypervisor
if, block, so we can use the policy implementation on peer pods
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Commit 'ca02c9f5124e' implements the vhost-user-blk reconnection functionality,
However, it has missed assigning VhostUserDeviceReconnect when new the QEMU
HypervisorConfig, resulting in VhostUserDeviceReconnect always set to default value 0.
Real change is this line, most of changes caused by go format,
return vc.HypervisorConfig{
// ...
VhostUserDeviceReconnect: h.VhostUserDeviceReconnect,
}, nil
Fixes: #9848
Signed-off-by: markyangcc <mmdou3@163.com>
This is a counterpart of commit abf52420a4 for the qemu-coco-dev
configuration. By allowing default_vcpu and default_memory annotations
users can fine-tune the VM based on the size of the container
image to avoid issues related with pulling large images in the guest.
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Just like the TEE configurations (sev, snp, tdx) we want to have the
qemu-coco-dev using shared_fs=none.
Fixes: #9676
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
While running with a remote hypervisor, whenever kata-monitor tries to access
metrics from the shim, the shim does a "panic" and no metric can be gathered.
The function GetVirtioFsPid() is called on metrics gathering, and had a call
to "panic()". Since there is no virtiofs process for remote hypervisor, the
right implementation is to return nil. The caller expects that, and will skip
metrics gathering for virtiofs.
Fixes: #9826
Signed-off-by: Julien Ropé <jrope@redhat.com>
This patch re-generates the client code for Cloud Hypervisor v39.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #8694, #9574
Signed-off-by: Bo Chen <chen.bo@intel.com>
When the total number of files observed is greater than limit, return (-1, err).
When the returned err is not nil, the func countFiles should return -1.
Fixes:#9780
Signed-off-by: gaohuatao <gaohuatao@bytedance.com>
This is required to provide the hashes of kernel, initrd and cmdline
needed during the attestation of the coco.
Fixes: #9150
Signed-off-by: Niteesh Dubey <niteesh@us.ibm.com>
Since yq frequently updates, let's upgrade to a version from February to
bypass potential issues with versions 4.41-4.43 for now. We can always
upgrade to the newest version if necessary.
Fixes#9354
Depends-on:github.com/kata-containers/tests#5818
Signed-off-by: Beraldo Leal <bleal@redhat.com>
golang.mk is not ready to deal with non GOPATH installs. This is
breaking test on s390x.
Since previous steps here are installing go and yq our way, we could
skip this aditional check. A full refactor to golang.mk would be needed
to work with different paths.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
We need to remove the device from the tracking map, a container
restart will increment the bus index and we will get out of root-ports
and crash the machine.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We need special handling for pod_sandbox, pod_container and
single_container how and when to inject CDI devices
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
In Kubernetes we still do not have proper VM sizing
at sandbox creation level. This KEP tries to mitigates
that: kubernetes/enhancements#4113 but this can take
some time until Kube and containerd or other runtimes
have those changes rolled out.
Before we used a static config of VFIO ports, and we
introduced CDI support which needs a patched contianerd.
We want to eliminate the patched continerd in the GPU case
as well.
Fixes: #8860
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
For the TD attestation to work the connection to QGS on the host is needed.
By default QGS runs on vsock port 4050, but can be modified by the host owner.
Format of the qemu object follows the SocketAddress structure, so it needs to be provided in the JSON format, as in the example below:
-object '{"qom-type":"tdx-guest","id":"tdx","quote-generation-socket":{"type":"vsock","cid":"2","port":"4050"}}'
Fixes: #9497
Signed-off-by: Jakub Ledworowski <jakub.ledworowski@intel.com>
Remove `rand.Seed` call to resolve the following failure:
```
rand.Seed is deprecated: As of Go 1.20 there is no reason to call Seed with a random value.
```
The go rand.Seed docs: https://pkg.go.dev/math/rand@go1.20#Seed
back this up and states:
> If Seed is not called, the generator is seeded randomly at program startup.
so I believe we can just delete the call.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
For now, let's allow the users to set the default_cpu and default_memory
when using TDX, as they may hit issues related to the size of the
container image that must be pulled and unpacked inside the guest,
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
- Make due to us bumping the golang version used in our CI
but `make vendor` fails without the go version in the runtime go.mod
being increased, so update this and run go mod tidy
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We shouldn't be using 9p, at all, with TEEs, as off right now we have no
way to ensure the channels are encrypted. The way to work this around
for now is using guest pull, either with containerd + nydus snapshotter
or with CRI-O; or even tardev snapshotter for pulling on the host (which
is the approach used by MSFT).
This is only done for TDX for now, leaving the generic, AMD, and IBM
related stuff for the folks working on those to switch and debug
possible issues on their environment.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
ResizeMemory for Cloud Hypervisor is missing a check for the new
requested memory being greater than the max hotplug size after
alignment. Add the check, and since an earlier check for this
setsrequested memory to the max hotplug size, do the same in the
post-alignment check.
Fixes#9640
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
By default, when a container is created with the `--privileged` flag,
all devices in `/dev` from the host are mounted into the guest. If
there is a block device(e.g. `/dev/dm`) followed by a generic
device(e.g. `/dev/null`),two identical block devices(`/dev/dm`)
would be requested to the kata agent causing the agent to exit with error:
> Conflicting device updates for /dev/dm-2
As the generic device type does not hit any cases defined in `switch`,
the variable `kataDevice` which is defined outside of the loop is still
the value of the previous block device rather than `nil`. Defining `kataDevice`
in the loop fixes this bug.
Signed-off-by: cncal <flycalvin@qq.com>
This was needed when we were using an old (and not maintained anymore)
host stack. Considering what we have as part of the distros, Today,
this can simply be dropped, as I cannot find any reference of this one
being needed in any up-to-date documentation.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit b7cccfa019.
The `private=on` bit has never made its way upstream, and was removed
from the latest iteration that we're using. With that in mind, let's
revert its usage in the code.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit 582b5b6b19.
The `private=on` bit has never made its way upstream, and was removed
from the latest iteration that we're using. With that in mind, let's
revert its addition, and later on its usage in the code.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's add the PLACEHOLDER_FOR_DISTRO_{QEMU,OVMF}_WITH_TDX_SUPPORT
variables instead of actually setting a path, so we can easily replace
those as part of our deployment scripts.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Everytime I create contianer on arm64 machine, containerd/kata logs a redundant warning
as follows:
``` shell
time="2024-05-07" level=warning msg="<nil>" arch=arm64 name=containerd-shim-v2
pid=xxx sandbox=fdd1f05 source=virtcontainers/hypervisor
```
I added an error statement so that the error would be logged when it occurs.
Signed-off-by: cncal <flycalvin@qq.com>
The build message shows that yq was not found when I tried to build
runtime binaries, but I've actually installed yq by yum install.
Signed-off-by: cncal <flycalvin@qq.com>
If device /dev/kvm does not exist, kata-runtime check would fail with
an ambiguous error messae 'no such file or directory'. I added a little
more details to make it understandable and it will belike:
```
ERRO[0000] cannot open kvm device: no such file or directory arch=arm64 check-type=full device=/dev/kvm name=kata-runtime pid=2849085 source=runtime
ERRO[0000] no such file or directory arch=arm64 name=kata-runtime pid=2849085 source=runtime
no such file or directory
```
Signed-off-by: cncal <flycalvin@qq.com>
- The testVersionString logic use regex to check that the ociVersion is
displayed correctly, but with the new go module that version has a
`+` in, so we need to quote this to escape special characters
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Created a new configuration to configure Kata for CoCo without requiring TEE
hardware so to allow developers implement/test/debug platform agnostic code
on their workstations. It will also ease testing of CoCo features on CI with
non-TEE supported VMs.
This is based off qemu configuration. The following differences applied:
- switched to confidential guest image/initrd
- switched to confidential kernel
- switched to 9p shared_fs
Fixes#9487
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
updates golang.org/x/net to newer version that closes some reported
vulnerabilities and security issues
Fixes#9486
Signed-off-by: Adil Sadik <sparky.005@gmail.com>
isClhRunning uses signal 0 to test whether the process is
still alive or not. This doesn't work because the process is a
direct child of the shim. Once it is dead the process becomes
zombie.
Since no one waits for it the process lingers until
its parent dies and init reaps it. Hence sending signal 0 in
isClhRunning will always return success whether the process is
dead or not.
This patch calls wait to reap the process, if it succeeds that
means it is our child process, if not we send the signal.
Fixes: #9431
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Let's rely on the kvm module 'tdx' parameter to do so.
This aligns with both OSVs (Canonical, Red Hat, SUSE) and the TDX
adoption (https://github.com/intel/tdx-linux) stacks.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This commit is a mess, but I'm not exactly sure what's the best way to
make it less messy, as we're getting QEMU TDX to work while partially
reverting 1e34220c41.
With that said, let me cover the content of this commit.
Firstly, we're reverting all the changes related to
"memory-backend-memfd-private", as that's what was used with the
previous host stack, but it seems it
didn't fly upstream.
Secondly, in order to get QEMU to properly work with TDX, we need to
enforce the 'private=on' knob and use the "memory-backend-ram", and
we're doing so, and also making sure to test the `private=on` newly
added knob.
I'm sorry for the confusion, I understand this is not optimal, I just
don't see an easy path to do changes without leaving the code broken
during those changes.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit d1b54ede29.
Conflicts:
src/runtime/virtcontainers/qemu.go
This commit was a hack that was needed in order to get QEMU + TDX to
work atop of the stack our CI was running on. As we're moving to "the
officially supported by distros" host OS, we need to get rid of this.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The private=on|off knob is required in order to properly lauunch a TDX
guest VM.
This is a brand new property that is part of the still in-flight patches
adding TDX support on QEMU.
Please, see:
3fdd8072da
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We used to utilize go runtime's "NumCPUs()", which will give the number
of cores available to the Go runtime, which may be a subset of physical
cores if the shim is started from within a cpuset. From the function's
description:
"NumCPU returns the number of logical CPUs usable by the current
process."
As an example, if containerd is run from within a smaller CPUset, the
maximum size of a pod will be dictated by this CPUset, instead of what
will be available on the rest of the system.
Since the shim will be moved into its own cgroup that may have a
different CPUset, let's stick with checking physical cores. This also
aligns with what we have documented for maxVCPU handling.
In the event we fail to read /proc/cpuinfo, let's use the goruntime.
Fixes: #9327
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
This reverts commit 1c5693be86.
Avoid apparent infinite loop when ReadStreamRequest is blocked by
policy - for some of the pods.
When running the k8s-limit-range.bats test with Policy enabled,
the Shim + VMM never get terminated on my cluster. Not sure why
the sandbox clean-up works better for other tests, but the
k8s-limit-range test pod gets stuck in an infinite loop:
stdout io stream copy error happens: error = %wrpc error: code =
PermissionDenied desc = \"ReadStreamRequest is blocked by policy
...
policy check: ReadStreamRequest
...
stdout io stream copy error happens: error = %wrpc error: code =
PermissionDenied desc = \"ReadStreamRequest is blocked by policy
...
policy check: ReadStreamRequest
...
Fixes: #9380
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Since we no longer use the service_offload configuration,
remove the ServiceOffload field from the image struct.
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
These experimental options were added 2 years ago
in anticipation of features that would be added
in CoCo. These do not match the features that were
eventually added and will soon be ported to main.
Fixes: #8047
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
support to configure CreateContainerRequestTimeout in the
configurations.
e.g.:
[runtime]
...
create_container_timeout = 300
Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
(https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Most of the content of `docs/Stable-Branch-Strategy.md` got de-facto
deprecated by the re-design of the release process described in #9064.
Remove this file and all its references in the repo.
The `## Versioning` section has some useful information though. It is
moved to `docs/Release-Process.md`. The documentation of the `PATCH`
field is adapted according to new workflow.
Fixes#9064 - part VI
Signed-off-by: Greg Kurz <groug@kaod.org>
Support to configure CreateContainerRequestTimeout in the annotations.
e.g.:
annotations:
"io.katacontainers.config.runtime.create_container_timeout": "300"
Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
(https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
In the situation to pull images in the guest #8484, it’s important to account for pulling large images.
Presently, the image pull process in the guest hinges on `CreateContainerRequest`, which defaults to a 60-second timeout.
However, this duration may prove insufficient for pulling larger images, such as those containing AI models.
Consequently, we must devise a method to extend the timeout period for large image pull.
Fixes: #8141
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Change scripts and source that uses files in the tests repo to use the
corresponding file in the current repo.
Fixes#9165
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
To support handle image-guest-pull block volume from different CRIs, including cri-o and containerd.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
support to pass image information to guest by KataVirtualVolumeImageGuestPullType
in KataVirtualVolume, which will be used to pull image on the guest.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
The PID needs to be initialized before calling isClhRunning.
waitVMM() uses isClhRunning and is called by launchClh() just
before returning from function.
Fixes: #9230
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
This change updates the module import to use 'util' instead of the deprecated 'io/util'
Fixes: #9166
Signed-off-by: Chungeun Choi <ce.choi@okestro.com>
copyBuffer returns and the streams will be closed when error occurs.
If the error contains "blocked by policy" it means the log output is
disabled by policy with "ReadStreamRequest" and "WriteStreamRequest" set
to false. But at this moment, we want the real stream still working (not
be seen) because we might want to enable logging for debugging purpose,
so we repeat copybuffer in this case to avoid streams being closed.
Fixes: #8797
Signed-off-by: Linda Yu <linda.yu@intel.com>
logging/debugging information might probably be disabled in production
due to security consideration, but we'd better provide an approach for
customer to get logging information during runtime, this PR implement
setpolicy function in kata-runtime tools, although it can set whole policy
other than logging.
setpolicy would evokes remote attestation, which means before setting
policy during runtime, user has to reconfigure new policy hash in KBS/AS.
usage: kata-runtime policy set policy.rego --sandbox-id XXXXXXXX
Fixes: #8797
Signed-off-by: Linda Yu <linda.yu@intel.com>
This fixes a panic on tracing on container exit.
The root cause is that global var needs to be set by "=" instead of
":=".
Fixes: #9102
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
Relax the timeout for calling CLH's CreateVM + BootVM APIs. When
hitting the older 1s timeout, killing a half-booted Guest and
retrying the same boot sequence could have been wasteful and resulting
in unstable CI testing on slower Hosts.
Fixes: #9152
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
As part of the shim network metrics, the shim is reporting network interfaces
from the host with no namespace isolation - this gives insight in interfaces
not tied to the kata containers, and causes an increase in resource usage for
kata metrics.
As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
hypervisor network metrics (kata_hypervisor_netdev) and from the agent
network metrics (kata_guest_netdev_stat).
Fixes: #5738
Signed-off-by: Julien Ropé <jrope@redhat.com>
In case an error is encountered while removing a network endpoint during
network cleanup, we cuurently return immediately with the error.
With this change, in case of error we simply log the error and proceed
towards removing the next endpoint. With this, we can cleanup the
network changes made by the shim as much as possible.
This is especially important when multiple interfaces are passed to the
network namespace using a network plugin like multus.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Move the defer for cleaning up network before the call to add network.
This way if any change made by add network is reverted by in case of
failure. This is particulary important for physical network interfaces
as with this step we make sure that driver for the physical interface is
reverted back to the original host driver. Without this the physical
network iterface will remain bound to vfio.
Fixes: #8646
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Although we don't seem to be affected by
https://nvd.nist.gov/vuln/detail/CVE-2024-21626, we vendor and use the
runc package in a few different places of our code, and we better update
the package to its latest release.
Fixes: #9097
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This is needed to fix the bug which is not allowing to create SEV container
on SNP enabled host anymore. This is a regression that was introduced as
part of the following commit:
de39fb7d38Fixes: #9036
Signed-off-by: Niteesh Dubey <niteesh@us.ibm.com>
The original handling method does not reach user expectations.
When the ClientSocketAddress method stats the corresponding
path of runtime-rs and has not found it yet, we should return
an error message here that includes the reason for the failure
(which should be an error display indicating that both runtime-go
and runtime-rs were not found). Instead of simply displaying the
corresponding path of runtime-rs as the final error message to
users.
It is also necessary to return the error promptly to the caller
for further error handling.
Fixes: #8999
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Now that we have a confidential image / initrd being built, instead of a
specific one for each TEE, let's use it everywhere possible.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
With this we can properly generate and the the `-confidential` kernel,
which supports SEV / SNP / TDX as part of our configuration files.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This patch can reduce load on systemd process, and
increase the k8s deployment density when using go runtime.
Fixes: #8758
Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
When creating a cgroup, add a SingleContainer when obtaining the OCI Spec to apply to ctr, podman, etc.
Fixes: #5240
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
SharedVersion fiel add a versiontable property that isn't supported by upstream QEMU.
This is dead code since virtcontainers isn't setting SharedVersions to true.
Fixes: #7720
Signed-off-by: Kvlil <kalil.pelissier@gmail.com>
A few CPU related test cases were failing as the version was being verified against Power8 while the CI machine is Power9.
Fixes: #5531
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
This is to reintroduce a configuration rule for IBM Z Secure Execution,
where no initrd path should be configured. For the TEE of interest,
only a kernel image should be specified with `confidential_guest=true`.
Fixes: #8692
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Device mapper is the only supported block device driver so far,
which seems limiting. Kata Containers can work well with other
block devices. It is necessary to enhance supporting of multiple
kinds of host block device.
Fixes#4714
Signed-off-by: yuchen.cc <yuchen.cc@alibaba-inc.com>
This reverts commit b0157ad73a.
```
commit b0157ad73a
Refs: 3.3.0-alpha0-124-gb0157ad73
Author: Fabiano Fidêncio <fabiano.fidencio@intel.com>
AuthorDate: Fri Aug 11 14:55:11 2023 +0200
Commit: Fabiano Fidêncio <fabiano.fidencio@intel.com>
CommitDate: Fri Nov 10 12:58:20 2023 +0100
runtime: confidential: Do not set the max_vcpu to cpu
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
```
This commit was removing a requirement that was made previously, but due
to the SMP issue we're facing with the QEMU used for TDX (see commit
d1b54ede290e95762099fff4e0bcdad10f816126*), QEMU will fail to start due
to:
```
Invalid CPU topology: product of the hierarchy must match maxcpus:
sockets (1) * dies (1) * cores (1) * threads (1) != maxcpus (240)"
```
This has no affect on the SEV / SNP workflow and hopefully we'll be able
to re-revet this soon enough, when this gets solved on te QEMU side.
Last but not least, this is not a "clean" revert as we're using
conf.NumVCPUs() instead of conf.NumVCPUs, to ensure we're dealing with
uint32.
Fixes: #8532
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
- Update the remote hypervisor code to match the re-genned code for
the ttrpc Hypervisor Service
Fixes: #8519
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The new clean-generated-files make target allows for removing the
generated files (including the configuration.toml files).
The tools/packaging/static-build/shim-v2/build.sh script now uses that
target to always force the re-generation of those files.
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
1) Creating storage for all `io.katacontainers.volume=` messages in rootFs.Options,
and then aggregates all storages into `containerStorages`.
2) Creating storage for other data volumes and push them into `volumeStorages`.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
To enhance the construction and administration of `Katavirtualvolume` storages,
this commit expands the 'sharedFile' structure to manage both
rootfs storages(`containerStorages`) including `Katavirtualvolume` and other data volumes storages(`volumeStorages`).
NOTE: `volumeStorages` is intended for future extensions to support Kubernetes data volumes.
Currently, `KataVirtualVolume` is exclusively employed for container rootfs, hence only `containerStorages` is actively utilized.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
The snapshotter will place `KataVirtualVolume` information
into 'rootfs.options' and commence with the prefix 'io.katacontainers.volume='.
The purpose of this commit is to transform the encapsulated KataVirtualVolume data into device information.
Fixes: #8495
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Feng Wang <feng.wang@databricks.com>
Co-authored-by: Samuel Ortiz <sameo@linux.intel.com>
Co-authored-by: Wedson Almeida Filho <walmeida@microsoft.com>
Add the corresponding data structure in the runtime part according to
kata-containers/kata-containers/pull/7698.
Fixes: #8472
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
In order to support different pod VM instance type via
remote hypervisor implementation (cloud-api-adaptor),
we need to pass machine_type, default_vcpus
and default_memory annotations to cloud-api-adaptor.
The cloud-api-adaptor then uses these annotations to spin
up the appropriate cloud instance.
Reference PR for cloud-api-adaptor
https://github.com/confidential-containers/cloud-api-adaptor/pull/1088Fixes: #7140
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
(based on commit 004f07f076)
This patch updates the template configuration file for
the remote hypervisor to set static_sandbox_resource_mgmt
to be true. The remote hypervisor uses the peer pod config
to determine the sandbox size, so requires this to be set to
true by default.
Fixes: #6616
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit 938447803b)
Add the SELinux setting to ensure it is passed through to the remote
hypervisor
Fixes: #5936
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
(based on commit 3ef2fd1784)
This patch adds the support of the remote hypervisor type.
Shim opens a Unix domain socket specified in the config file,
and sends TTPRC requests to a external process to control
sandbox VMs.
Fixes#4482
Co-authored-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit f9278f22c3)
This patch adds a protobuf definiton of the remote hypervisor type.
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
(based on commit 150e8aba6d)