Commit Graph

13813 Commits

Author SHA1 Message Date
alex.lyn
548f252bc4 runtime-rs: bugfix incorrect use of refcount before vfio attach
When there's a pod with multiple containers, there may be case that
attach point more than 2, we should not return Err in that case when
we are doing attach ops, but just return Ok.

Fixes: #8738

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2024-04-01 11:28:57 +08:00
Alex Lyn
aa9cd232cd
Merge pull request #9358 from GabyCT/topic/nerdrandom
gha: Update journal log names for nerdctl artifacts
2024-04-01 09:50:16 +08:00
Alex Lyn
dfa8832406
Merge pull request #9345 from c3d/bug/9342-agent-test-errors
agent: Fix errors in `make check`
2024-04-01 09:48:44 +08:00
Dan Mihai
3a7dbcfc17
Merge pull request #9367 from microsoft/danmihai1/infinite-io-stream-copy-loop
runtime: remove stream copy infinite loop
2024-03-29 09:37:44 -07:00
Dan Mihai
600f9266f3 runtime: remove stream copy infinite loop
This reverts commit 1c5693be86.

Avoid apparent infinite loop when ReadStreamRequest is blocked by
policy - for some of the pods.

When running the k8s-limit-range.bats test with Policy enabled,
the Shim + VMM never get terminated on my cluster. Not sure why
the sandbox clean-up works better for other tests, but the
k8s-limit-range test pod gets stuck in an infinite loop:

stdout io stream copy error happens: error = %wrpc error: code =
PermissionDenied desc = \"ReadStreamRequest is blocked by policy

...

policy check: ReadStreamRequest

...

stdout io stream copy error happens: error = %wrpc error: code =
PermissionDenied desc = \"ReadStreamRequest is blocked by policy

...

policy check: ReadStreamRequest

...

Fixes: #9380

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2024-03-28 22:43:28 +00:00
James O. D. Hunt
13966f4d1d docs: kata-manager: Add help for permissions issue
The 3.3.0 release installs the `kata-manager` script with overly restrictive
permissions (see #9373), so add details to help users handle the situation.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2024-03-28 16:22:10 +00:00
James O. D. Hunt
5589e4e291 docs: kata-manager: Update with latest details
Now that v3.3.0 has been released, simplify
the `kata-manager` documentation.

Fixes: #9227.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2024-03-28 16:22:10 +00:00
James O. D. Hunt
52fe60c94b docs: kata-manager: Fix heading levels
Add an extra heading indent so that there is only a single
top-level heading.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2024-03-28 16:21:31 +00:00
Dan Mihai
ebb26edf42
Merge pull request #9347 from microsoft/danmihai1/reduce-exec-test-policy-prints
genpolicy: reduce policy debug prints
2024-03-27 15:12:10 -07:00
Gabriela Cervantes
a32418bf32 versions: Remove runc version information
This PR removes the runc version information as this is not longer being used
in the kata containers scripts.

Fixes #9364

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
2024-03-27 20:32:38 +00:00
Steve Horsman
b3acbe0b7f
Merge pull request #8046 from fitzthum/clean-config
runtime: remove unimplemented CoCo configurations
2024-03-27 19:39:48 +00:00
Tobin Feldman-Fitzthum
04d021bd12 packaging: remove SERVICEOFFLOAD option
Since we're removing the unused service_offload parameter,
don't set it in any of the packaging scripts.

Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
2024-03-27 12:21:13 -05:00
Tobin Feldman-Fitzthum
9856fe5bea runtime: remove ServiceOffload parameter
Since we no longer use the service_offload configuration,
remove the ServiceOffload field from the image struct.

Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
2024-03-27 12:21:13 -05:00
Tobin Feldman-Fitzthum
a18c7ca307 runtime: remove unimplemented CoCo configurations
These experimental options were added 2 years ago
in anticipation of features that would be added
in CoCo. These do not match the features that were
eventually added and will soon be ported to main.

Fixes: #8047

Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
2024-03-27 12:21:06 -05:00
Steve Horsman
53fa1fd82d
Merge pull request #9349 from fidencio/topic/ci-k8s-update-cpuid
k8s: confidential: Update cpuid to its latest release
2024-03-27 16:57:36 +00:00
Chengyu Zhu
e66a5cb54d
Merge pull request #9332 from ChengyuZhu6/guest-pull-timeout
Support to set timeout to pull large image in guest
2024-03-28 00:34:08 +08:00
Christophe de Dinechin
82c4079fd0 agent: Remove useless loop
This is the report from `make check`:

```
error: this loop never actually loops
   --> src/signal.rs:147:9
    |
147 | /         loop {
148 | |             select! {
149 | |                 _ = handle => {
150 | |                     println!("INFO: task completed");
...   |
156 | |             }
157 | |         }
    | |_________^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#never_loop
    = note: `#[deny(clippy::never_loop)]` on by default
```

There is only one option: you get something or a timeout. You never retry, so
the report is correct.

Fixes: #9342

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2024-03-27 17:03:44 +01:00
Christophe de Dinechin
df5c88cdf0 agent: Remove lint error about .flatten running forever
The lint report is the following:

```
error: `flatten()` will run forever if the iterator repeatedly produces an `Err`
    --> src/rpc.rs:1754:10
     |
1754 |         .flatten()
     |          ^^^^^^^^^ help: replace with: `map_while(Result::ok)`
     |
note: this expression returning a `std::io::Lines` may produce an infinite number of `Err` in case of a read error
    --> src/rpc.rs:1752:5
     |
1752 | /     reader
1753 | |         .lines()
     | |________________^
     = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#lines_filter_map_ok
     = note: `-D clippy::lines-filter-map-ok` implied by `-D warnings`
     = help: to override `-D warnings` add `#[allow(clippy::lines_filter_map_ok)]`
```

This commit simply applies the suggestion.

Fixes: #9342

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2024-03-27 17:03:44 +01:00
Christophe de Dinechin
bfb55312be agent: Fix .enumerate errors during make check
Running `make check` in the `src/agent` directory gives:

```
error: you seem to use `.enumerate()` and immediately discard the index
   --> rustjail/src/mount.rs:572:27
    |
572 |     for (_index, line) in reader.lines().enumerate() {
    |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unused_enumerate_index
    = note: `-D clippy::unused-enumerate-index` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::unused_enumerate_index)]`
help: remove the `.enumerate()` call
    |
572 |     for line in reader.lines() {
    |         ~~~~    ~~~~~~~~~~~~~~

    Checking tokio-native-tls v0.3.1
    Checking hyper-tls v0.5.0
    Checking reqwest v0.11.18
error: could not compile `rustjail` (lib) due to 1 previous error
warning: build failed, waiting for other jobs to finish...
make: *** [../../utils.mk:177: standard_rust_check] Error 101
```

Fixes: #9342

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2024-03-27 17:03:44 +01:00
Greg Kurz
e1068da1a0
Merge pull request #9326 from gkurz/draft-release
Only tag and publish the release when it is fully ready
2024-03-27 15:59:59 +01:00
ChengyuZhu6
c50d3ebacc tests:k8s: Add a test to pull large images in the guest
Add a test to pull large images in the guest.

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2024-03-27 21:58:44 +08:00
ChengyuZhu6
8551ee9533 how-to: add createcontainer timeout to sandbox config documentation
add createcontainer timeout annotation to sandbox config documentation.

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2024-03-27 21:58:44 +08:00
ChengyuZhu6
c2dc13ebaa runtime: support to configure CreateContainer Timeout in configurations
support to configure CreateContainerRequestTimeout in the
configurations.

e.g.:
[runtime]
...
create_container_timeout = 300

Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
(https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2024-03-27 21:58:41 +08:00
Chengyu Zhu
87fc17d4d2
Merge pull request #9341 from ChengyuZhu6/guest-pull-doc
docs: Add documents for kata guest image management
2024-03-27 21:20:22 +08:00
ChengyuZhu6
95b2f7f129 how-to: Add a document for kata guest image management usage
Add a document for kata guest image management usage.

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2024-03-27 20:09:37 +08:00
Greg Kurz
693c9487d4 docs: Adjust release documentation
Most of the content of `docs/Stable-Branch-Strategy.md` got de-facto
deprecated by the re-design of the release process described in #9064.
Remove this file and all its references in the repo.

The `## Versioning` section has some useful information though. It is
moved to `docs/Release-Process.md`. The documentation of the `PATCH`
field is adapted according to new workflow.

Fixes #9064 - part VI

Signed-off-by: Greg Kurz <groug@kaod.org>
2024-03-27 12:41:48 +01:00
Steve Horsman
45aba769c0
Merge pull request #9346 from cmaf/ci-remove-repo-docs
Remove additional links to tests directory
2024-03-27 11:13:32 +00:00
Steve Horsman
a1a615a7c8
Merge pull request #9356 from stevenhorsman/agent-opa-ppc64le-s390x
workflows: Build agent-opa for more archs
2024-03-27 08:53:28 +00:00
ChengyuZhu6
2224f6d63f runtime: support to configure CreateContainer timeout in annotation
Support to configure CreateContainerRequestTimeout in the annotations.

e.g.:
annotations:
      "io.katacontainers.config.runtime.create_container_timeout": "300"

Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
(https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2024-03-27 15:44:29 +08:00
ChengyuZhu6
39bd462431 runtime: support to set timeout for CreateContainerRequest
In the situation to pull images in the guest #8484, it’s important to account for pulling large images.
Presently, the image pull process in the guest hinges on `CreateContainerRequest`, which defaults to a 60-second timeout.
However, this duration may prove insufficient for pulling larger images, such as those containing AI models.
Consequently, we must devise a method to extend the timeout period for large image pull.

Fixes: #8141

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2024-03-27 15:44:29 +08:00
Gabriela Cervantes
a997e282be gha: Update journal log names for nerdctl artifacts
This PR updates the journal log name for nerdctl artifacts to make
sure that we have different names in case we add a parallel GHA job.

Fixes #9357

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
2024-03-26 20:03:54 +00:00
GabyCT
c163d9f114
Merge pull request #9329 from GabyCT/topic/seun
scripts: Fix unbound variables in k8s setup script
2024-03-26 11:19:33 -06:00
stevenhorsman
9aa675abb9 workflows: Build agent-opa for more archs
Since https://github.com/kata-containers/kata-containers/pull/7769, we support
building the OPA binary into the ppc64le and s390x arch versions of the rootfs,
so build the policy enabled agent to match for those architectures too.

Fixes: #9355
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2024-03-26 17:02:14 +00:00
Lukáš Doktor
a671b3fc6e
tests: Use full svc address to check kbs service
the service might not listen on the default port, use the full service
address to ensure we are talking to the right resource.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2024-03-26 16:59:02 +01:00
Lukáš Doktor
6b0eaca4d4
tests: Add support for nodeport ingress for the kbs setup
this can be used on kcli or other systems where cluster nodes are
accessible from all places where the tests are running.

Fixes: #9272

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2024-03-26 16:59:00 +01:00
Greg Kurz
5009fabde4 release: Keep it draft until all artifacts have been published
The automated release workflow starts with the creation of the release in
GitHub. This is followed by the build and upload of the various artifacts,
which can be very long (like hours). During this period, the release appears
to be fully available in https://github.com/kata-containers/kata-containers/
even though it lacks all the artifacts. This might be confusing for users
or automation consuming the release.

Create the release as draft and clear the draft flag when all jobs are
done. This ensure that the release will only be tagged and made public
when it is fully usable.

If some job fails because of network timeout or any other transient
error, the correct action is to restart the failed jobs until they
eventually all succeed. This is by far the quicker path to complete
the release process.

If the workflow is *canceled* for some reason, the draft release is left
behind. A new run of the workflow will create a brand new draft release
with the same name (not an issue with GitHub). The draft release from
the previous run should be manually deleted. This step won't be automated
as it looks safer to leave the decision to a human.

[1] https://github.com/kata-containers/kata-containers/releases

Fixes #9064 - part VI

Signed-off-by: Greg Kurz <groug@kaod.org>
2024-03-26 14:48:05 +01:00
Pavel Mores
4c72b02e53 runtime-rs: remove the now-unused code of NetDevice
The remaining code in network.rs was mostly moved to utils.rs which seems
better home for these utility functions anyway (and a closely related
function open_named_tuntap() has already lived there).

ToString implementation for Address was removed after some consideration.
Address should probably ideally implement Display (as per RFC 565) which
would also supply a ToString implementation, however it implements Debug
instead, probably to enable automatic implementation of Debug for anything
that Address is a member of, if for no other reason.  Rather than having
two identical functions this commit simply switches to using the Debug
implementation for printing Address on qemu command line.

Fixes #9352

Signed-off-by: Pavel Mores <pmores@redhat.com>
2024-03-26 12:52:40 +01:00
Pavel Mores
c94e55d45a runtime-rs: make QemuCmdLine own vsock file descriptor
Make file descriptors to be passed to qemu owned by QemuCmdLine.  See
commit 52958f17cd for more explanation.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2024-03-26 12:50:41 +01:00
Pavel Mores
0cf0e923fc runtime-rs: refactor QemuCmdLine::add_network_device() signature
add_network_device() doesn't need to be passed NetworkInfo since it
already has access to the full HypervisorConfig.

Also, one of the goals of QemuCmdLine interface's design is to avoid
coupling between QemuCmdLine and the hypervisor crate's device module,
if at all possible.  That's why add_network_device() shouldn't take
device module's NetworkConfig but just parts that are useful in
add_network_device()'s implementation.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2024-03-26 12:50:41 +01:00
Pavel Mores
a4f033f864 runtime-rs: add should_disable_modern() utility function
is_running_in_vm() is enough to figure out whether to disable_modern but
it's clumsy and verbose to use.  should_disable_modern() streamlines the
usage by encapsulating the verbosity.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2024-03-26 12:50:41 +01:00
Pavel Mores
12e40ede97 runtime-rs: reimplement add_network_device() using Netdev & DeviceVirtioNet
This commit replaces the existing NetDevice-based implementation with one
using Netdev and DeviceVirtioNet.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2024-03-26 12:50:41 +01:00
Pavel Mores
0a57e2bb32 runtime-rs: refactor NetDevice in qemu driver
In keeping with architecture of QemuCmdLine implementation we split the
functionality into two objects: Netdev to represent and generate the
-netdev part and DeviceVirtioNet for the -device virtio-net-<transport>
part.

This change is a pure refactor, existing functionality does not change.
However, we do remove some stub generalizations and govmm-isms, notably:
- we remove the NetDev enum since the only network interface types that
  kata seems to use with qemu are tuntap and macvtap, both of which are
  implemented by the same -netdev tap
- enum DeviceDriver is also left out since it doesn't seem reasonable to
  try to represent VFIO NICs (which are completely different from
  virtio-net ones) with the same struct as virtio-net
- we also remove VirtioTransport because there's no use for it so far, but
  with the expectation that it will be added soon.

We also make struct Netdev the owner of any vhost-net and queue file
descriptors so that their lifetime is tied ultimately to the lifetime of
QemuCmdLine automatically, instead of returning the fds to the caller and
forcing it to achieve the equivalent functionality but manually.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2024-03-26 12:50:41 +01:00
Pavel Mores
7f23734172 runtime-rs: reduce generate_netdev_fds() dependencies
generate_netdev_fds() takes NetworkConfig from which it however only needs
a host-side network device name.  This commit makes it take the device name
directly, making the function useful to callers who don't have the whole
NetworkConfig but do have the requisite device name.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2024-03-26 12:50:41 +01:00
Pavel Mores
d4ac45d840 runtime-rs: refactor clear_fd_flags()
The idea of this function is to make sure O_CLOEXEC is not set on file
descriptors that should be inherited by a child (=hypervisor) process.
The approach so far is however rather heavy-handed - clearing *all* flags
is unjustifiably aggresive for a low-level function with no knowledge of
context whatsoever.

This commit refactors the function so that it only does what's expected
and renames it accordingly.  It also clarifies some of its call sites.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2024-03-26 12:50:14 +01:00
Fabiano Fidêncio
cfe75f9422
k8s: confidential: Update cpuid to its latest release
Since v2.2.6 it can detect TDX guests on Azure, so let's bump it even if
Azure peer-pods are not currently used as part of our CI.

Fixes: #9348

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2024-03-26 10:21:12 +01:00
Chengyu Zhu
d16971e37e
Merge pull request #9325 from ChengyuZhu6/image_service
agent:image: Refactor code to improve memory efficiency of image service
2024-03-26 10:38:37 +08:00
Dan Mihai
6c72c29535 genpolicy: reduce policy debug prints
Kata CI has full debug output enabled for the cbl-mariner k8s tests,
and the test AKS node is relatively slow. So debug prints from policy
are expensive during CI.

Fixes: #9296

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2024-03-26 02:21:26 +00:00
Alex Lyn
cec943fc26
Merge pull request #9244 from Apokleos/dgb-gpu
runtime-rs/dragonball: add support building kernel with upcall and GPU hotplug
2024-03-26 08:53:54 +08:00
Chelsea Mafrica
4e3deb5a3b tools: Fix path for installing yq in packaging script
The lib.sh script uses the right directory but the wrong path for the
script that installs yq; fix it.

Fixes #9165

Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
2024-03-25 15:09:52 -07:00
Chelsea Mafrica
cfb977625e docs: Remove links to tests repo
Remove links to tests repo and update with corresponding location in the
current repo.

Fixes #9165

Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
2024-03-25 15:09:52 -07:00