The agent registers an event fd in `memory.oom_control`. An OOM event is
forwarded to containerd when the event is emitted, regardless of the
content in that file.
I observed content indicating that events should not be forwarded, as shown
below. When `oom_kill` is set to 0, it means no OOM has occurred. Therefore,
it is important to check the content to avoid mistakenly forwarding OOM
events.
```
oom_kill_disable 0
under_oom 0
oom_kill 0
```
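A minimal sketch of such a content check, assuming the file uses the
`key value` layout shown above. `oom_kill_occurred` is a hypothetical helper
name used for illustration, not necessarily the function in the agent.

```rust
/// Returns true only if the `oom_kill` counter found in the given
/// `memory.oom_control` content is greater than zero.
/// Hypothetical helper: the name and signature are assumptions.
fn oom_kill_occurred(content: &str) -> bool {
    content
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            // Match the key exactly so `oom_kill_disable` is not mistaken
            // for `oom_kill`.
            match (parts.next(), parts.next()) {
                (Some("oom_kill"), Some(value)) => value.parse::<u64>().ok(),
                _ => None,
            }
        })
        .any(|count| count > 0)
}
```

With a check like this, the event would only be forwarded when the function
returns true for the content read after the event fd fires.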
Fixes: #8715
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Don't release the lock between is_allowed and set_policy calls,
because the policy might change in between these calls.
Also, move more policy code into policy.rs.
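A rough sketch of the locking pattern, under the assumption that the policy
state sits behind a mutex. `AgentPolicy`, its fields and `do_set_policy` are
hypothetical stand-ins, and `std::sync::Mutex` is used only to keep the
example self-contained.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the agent's policy state.
struct AgentPolicy {
    allow_set_policy: bool,
    rules: String,
}

impl AgentPolicy {
    /// Every endpoint except SetPolicyRequest is allowed in this toy model.
    fn is_allowed(&self, endpoint: &str) -> bool {
        endpoint != "SetPolicyRequest" || self.allow_set_policy
    }

    fn set_policy(&mut self, rules: &str, allow_set_policy: bool) {
        self.rules = rules.to_string();
        self.allow_set_policy = allow_set_policy;
    }
}

/// The guard is taken once and held across both the check and the update,
/// so no other caller can change the policy in between.
fn do_set_policy(policy: &Mutex<AgentPolicy>, rules: &str) -> Result<(), String> {
    let mut guard = policy.lock().map_err(|e| e.to_string())?;
    if !guard.is_allowed("SetPolicyRequest") {
        return Err("SetPolicyRequest is blocked by policy".to_string());
    }
    // In this sketch the new policy forbids any further SetPolicy call.
    guard.set_policy(rules, false);
    Ok(())
}
```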
Fixes: #8734
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
The vfio commits introduce quite a lot of changes in runtime-rs. This commit
covers all the CI-related changes, including fixes for compilation errors.
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This PR fixes the propagation of updates to Kubernetes configmaps, secrets, etc. when filesystem sharing is disabled.
The commit introduces the changes below, with some limitations:
- creates new timestamped directory in guest
- updates the '..data' symlink
- creates user visible symlinks to newly created secrets.
- Limitation: The older timestamped directory and stale user visible symlinks remain in the guest
because the agent has no DELETE API.
Fixes: #7398
Signed-off-by: Sumedh Alok Sharma <sumsharma@microsoft.com>
- Disable device cgroup restriction when the pod cgroup is not available.
- Remove blacklist-related names and change whitelist-related names to
allowed_all.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The FSManager of the systemd cgroup manager is responsible for setting up the
cgroup path. Launching a container fails if the FSManager is in
read-only mode.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The changes include:
- Lower the logging level to debug for resources after they are processed.
- Remove a todo for pod cgroup cleanup.
- Add an anyhow context to `get_paths_and_mounts()`.
- Remove the code which denies access to VMROOTFS since it won't take effect. If
blacklist mode is in use, the VMROOTFS is denied by default. Otherwise,
device cgroups won't be updated in whitelist mode.
- Add a unit test for `default_allowed_devices()`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The runk is a standard OCI runtime that isn't aware of the concept of a sandbox.
Therefore, the `devcg_info` argument of `LinuxContainer::new()` does not
need to be provided.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The goal is to guarantee that containers cannot escape to access extra
devices, such as the VM rootfs.
Assume that there is a cgroup, such as `/A/B`. `B` is the container cgroup,
and `A` is what we call the pod cgroup. No matter what permissions are
set for the container (`B`), the permission of `A` is always `a *:* rwm`. As
a result, containers could acquire permission to access other devices
in the VM that do not belong to them.
In order to set devices cgroup properly, the order of setting cgroups is
that the pod cgroup comes first and the container cgroup comes after.
The `Sandbox` has a new field, `devcg_info`, to save cgroup states. To
avoid setting the container cgroup too early, the initialization must be done
carefully. `inited`, one of the states, is a boolean that indicates whether the
pod cgroup is initialized. If not, the pod cgroup should be created first and
given default permissions. After that, the pause container cgroup is created
and inherits the permissions from the pod cgroup.
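The ordering can be sketched as follows. This is only an illustration:
`DevCgInfo` here is a simplified stand-in, `created` and
`create_container_cgroup` are hypothetical, and cgroup creation is recorded
in a vector instead of touching cgroupfs.

```rust
/// Simplified stand-in for the sandbox's `devcg_info` state.
#[derive(Default)]
struct DevCgInfo {
    /// Whether the pod cgroup has been initialized.
    inited: bool,
    /// Cgroup paths in creation order (hypothetical, for illustration).
    created: Vec<String>,
}

/// Creates the pod cgroup on first use, then the container cgroup under it.
fn create_container_cgroup(info: &mut DevCgInfo, pod: &str, container: &str) {
    if !info.inited {
        // The pod cgroup comes first, with default permissions.
        info.created.push(pod.to_string());
        info.inited = true;
    }
    // The container cgroup comes after, as a child of the pod cgroup.
    info.created.push(format!("{}/{}", pod, container));
}
```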
If whitelist mode, which allows containers to access all devices in the VM, is
enabled, then device resources from the OCI spec are ignored.
This feature does not support systemd cgroup or cgroup v2, because:
- The systemd cgroup implementation in the agent does not support the devices
subsystem so far, see: https://github.com/kata-containers/kata-containers/issues/7506.
- Cgroup v2's device controller depends on eBPF programs, which is out of
scope of cgroup.
Fixes: #7507
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This is to skip a flaky test `create_tmpfs()` on s390x until a root cause is identified and fixed.
Fixes: #4248
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Remove the ability to block access to kata agent endpoints by using
agent-config.toml. That functionality is now implemented using the
Agent Policy feature (#7573).
The CCv0 branch relied on blocking endpoints using agent-config.toml
but will set up an equivalent default policy file instead (#8219).
Fixes: #8228
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This will help to build the agent binary as part of the kata-deploy
localbuild, as we need to pass the DESTDIR to where the agent will be
installed, and also whether we're building the agent with policy support
enabled or not.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Otherwise `make test` will fail with:
```
error[E0583]: file not found for module `version`
```
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
- Allow `clippy::redundant-closure-call` in `from_cmdline`
which has issues with the guard function passed into
the `parse_cmdline_param` macro
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
1. Directly support CgroupManager::freeze through systemd API.
2. Avoid always passing unit_name by storing it into DBusClient.
3. Implement CgroupManager::destroy more accurately by killing the systemd unit rather than stopping it.
4. Ignore no such unit error when destroying systemd unit.
5. Update zbus version and corresponding interface file.
Acknowledgement: error handling for no such systemd unit error refers to
Fixes: #7080, #7142, #7143, #7166
Signed-off-by: Yuan-Zhuo <yuanzhuo0118@outlook.com>
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
When a storage device is used by more than one container, the second
and subsequent instances leak the storage device's reference count,
and therefore leak the storage device itself. The reason is:
add_storages() increases the reference count of an existing storage device,
but forgets to add the device to the `mount_list` array, so the
reference count is never dropped.
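A simplified model of the intended behaviour, assuming that reference counts
are keyed by mount point. The `Storages` type and both method bodies are
hypothetical and much smaller than the agent's real code.

```rust
use std::collections::HashMap;

/// Hypothetical model of the sandbox's storage reference counts.
#[derive(Default)]
struct Storages {
    ref_counts: HashMap<String, u32>,
}

impl Storages {
    /// Takes one reference per mount point and returns the list the
    /// container must later pass to `remove_storages`.
    fn add_storages(&mut self, mount_points: &[&str]) -> Vec<String> {
        let mut mount_list = Vec::new();
        for mp in mount_points {
            *self.ref_counts.entry(mp.to_string()).or_insert(0) += 1;
            // Record the mount point for new AND existing devices, so that
            // every reference taken here is released on cleanup.
            mount_list.push(mp.to_string());
        }
        mount_list
    }

    /// Drops one reference per entry and forgets devices that reach zero.
    fn remove_storages(&mut self, mount_list: &[String]) {
        for mp in mount_list {
            if let Some(count) = self.ref_counts.get_mut(mp) {
                *count -= 1;
                if *count == 0 {
                    self.ref_counts.remove(mp);
                }
            }
        }
    }
}
```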
Fixes: #7820
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Refine storage related code by:
- remove the STORAGE_HANDLER_LIST
- define type alias
- move code near to its caller
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Introduce StorageDevice and StorageHandlerManager, which will be used
to refine storage device management for kata-agent.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Simplify the way to manage storage objects, and introduce
StorageStateCommon structures for coming extensions.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
When the FileMode field for the device is unset (0), use a default value instead
to allow the use of the device from the container.
This behaviour is typically seen with cri-o.
Note: this is what runc is doing, which is why regular containers don't have an
issue. This change makes sure kata behaves the same as runc.
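A minimal sketch of the fallback. The default value `0o666` and the helper
name `effective_file_mode` are assumptions made for illustration. The commit
message does not state which default is used.

```rust
/// Assumed default permission bits for a device node (not stated in the
/// commit message).
const DEFAULT_DEVICE_FILE_MODE: u32 = 0o666;

/// Falls back to the default when the mode is missing or unset (0).
/// Hypothetical helper, not the agent's actual function.
fn effective_file_mode(file_mode: Option<u32>) -> u32 {
    match file_mode {
        None | Some(0) => DEFAULT_DEVICE_FILE_MODE,
        Some(mode) => mode,
    }
}
```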
Fixes: #7717
Signed-off-by: Julien Ropé <jrope@redhat.com>
When building with AGENT_POLICY=yes and AGENT_INIT=yes:
1. Include OPA and the Policy settings in rootfs.
2. Start OPA from the kata agent.
Before these changes, building with both AGENT_POLICY=yes and
AGENT_INIT=yes was unsupported.
Starting OPA from systemd (when AGENT_INIT=no) was already supported.
Fixes: #7615
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
We extend the `Result` and `Option` types with associated types that
allow converting a `Result<T, E>` and an `Option<T>` into a
`ttrpc::Result<T>`.
This allows the elimination of many `match` statements in favor of
calling the map function plus the `?` operator. This transformation
simplifies the code.
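The idea can be illustrated with extension traits. In this sketch `RpcError`
and `RpcResult` are local stand-ins for the ttrpc types, and the trait and
method names are hypothetical, so the example compiles without the ttrpc
crate.

```rust
use std::fmt::Display;

/// Local stand-in for the ttrpc error type (assumption).
#[derive(Debug, PartialEq)]
enum RpcError {
    Internal(String),
    NotFound(String),
}

type RpcResult<T> = Result<T, RpcError>;

/// Hypothetical extension trait for `Result`.
trait ResultToRpc<T> {
    fn map_rpc_err(self, context: &str) -> RpcResult<T>;
}

impl<T, E: Display> ResultToRpc<T> for Result<T, E> {
    fn map_rpc_err(self, context: &str) -> RpcResult<T> {
        self.map_err(|e| RpcError::Internal(format!("{}: {}", context, e)))
    }
}

/// Hypothetical extension trait for `Option`.
trait OptionToRpc<T> {
    fn ok_or_not_found(self, what: &str) -> RpcResult<T>;
}

impl<T> OptionToRpc<T> for Option<T> {
    fn ok_or_not_found(self, what: &str) -> RpcResult<T> {
        self.ok_or_else(|| RpcError::NotFound(what.to_string()))
    }
}

/// Example handler: two conversions chained with `?` instead of `match`.
fn parse_port(raw: Option<&str>) -> RpcResult<u16> {
    let text = raw.ok_or_not_found("port")?;
    let port = text.parse::<u16>().map_rpc_err("invalid port")?;
    Ok(port)
}
```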
Fixes: #7624
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Fixes: #7573
To enable this feature, build your rootfs using AGENT_POLICY=yes. The
default is AGENT_POLICY=no.
Building rootfs using AGENT_POLICY=yes has the following effects:
1. The kata-opa service gets included in the Guest image.
2. The agent gets built using AGENT_POLICY=yes.
After this patch, the shim calls SetPolicy if and only if a Policy
annotation is attached to the sandbox/pod. When creating a sandbox/pod
that doesn't have an attached Policy annotation:
1. If the agent was built using AGENT_POLICY=yes, the new sandbox uses
the default agent settings, that might include a default Policy too.
2. If the agent was built using AGENT_POLICY=no, the new sandbox is
executed the same way as before this patch.
Any SetPolicy calls from the shim to the agent fail if the agent was
built using AGENT_POLICY=no.
If the agent was built using AGENT_POLICY=yes:
1. The agent reads the contents of a default policy file during sandbox
start-up.
2. The agent then connects to the OPA service on localhost and sends
the default policy to OPA.
3. If the shim calls SetPolicy:
a. The agent checks if SetPolicy is allowed by the current
policy (the current policy is typically the default policy
mentioned above).
b. If SetPolicy is allowed, the agent deletes the current policy
from OPA and replaces it with the new policy it received from
the shim.
A typical new policy from the shim doesn't allow any future SetPolicy
calls.
4. For every agent rpc API call, the agent asks OPA if that call
should be allowed. OPA allows or rejects a call based on the current
policy, the name of the agent API, and the API call's inputs. The
agent rejects any calls that are rejected by OPA.
When building using AGENT_POLICY_DEBUG=yes, additional Policy logging
gets enabled in the agent. In particular, information about the inputs
for agent rpc API calls is logged in /tmp/policy.txt, on the Guest VM.
These inputs can be useful for investigating API calls that might have
been rejected by the Policy. Examples:
1. Load a failing policy file test1.rego on a different machine:

    opa run --server --addr 127.0.0.1:8181 test1.rego

2. Collect the API inputs from Guest's /tmp/policy.txt and test on the
machine where the failing policy has been loaded:

    curl -X POST http://localhost:8181/v1/data/agent_policy/CreateContainerRequest \
    --data-binary @test1-inputs.json

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This patch allows copying of directories and symlinks when
static file copying is used between host and guest. This change is
necessary to support recursive file copying between shim and agent.
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(cherry picked from commit de232b8030)