This changed valid() in hypervisor to check the case where both
initrd and image path are set; in this case it returns an error.
Fixes#1868
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Fix `-l <log-level>` for the trace forwarder which didn't work
previously as it lacked the magic Cargo configuration.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Hybrid VSOCK requires `root` privileges to access the sandbox-specific
host-side AF_UNIX socket created by the hypervisor (CLH or FC). However,
once the socket has been bound, privileges can be dropped, allowing the
forwarder to run as user `nobody`.
Fixes: #2905.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
The oci_to_grpc function just handles part of oci fields,
and others are not copied from oci spec to grpc spec,
such as process.env, process.capabilities, mounts and so on.
Try to implement more handlings to convert thoses fields.
Fixes#2686
Signed-off-by: Lei Li <cdlleili@cn.ibm.com>
Rather than generating a potentially misleading error message if the
socket bind fails, perform an explicit check for `root` for Hybrid
VSOCK.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Updated the trace forwarder README to ensure the real socket path is
created, not the template socket path returned by `kata-runtime env`.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
- Install OpenSSL for key generation in kernel build
- Do not install libpmem
- Do not exclude `*/share/*/*.img` files in QEMU tarball since among
them are boot loader files critical for IPLing.
Fixes: #2895
Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
Show available guest protections in the
`kata-runtime env` output. Also bump the formatVersion.
Fixes: #1982
Signed-off-by: Yujia Qiao <rapiz3142@gmail.com>
Add functions to return guestProtection as a string slice, which
can be then used in `kata-runtime env` output.
Signed-off-by: Yujia Qiao <rapiz3142@gmail.com>
uevents with action=remove was ignored causing the agent to reuse stale
data in the device map. This patch adds handling of such uevents.
Fixes#2405
Signed-off-by: Haitao Li <lihaitao@gmail.com>
On a conventional (e.g. runc) container, passing in a VFIO group device,
/dev/vfio/NN, will result in the same VFIO group device being available
within the container.
With Kata, however, the VFIO device will be bound to the guest kernel's
driver (if it has one), possibly appearing as some other device (or a
network interface) within the guest.
This add a new `vfio_mode` option to alter this. If set to "vfio" it will
instruct the agent to remap VFIO devices to the VFIO driver within the
guest as well, meaning they will appear as VFIO devices within the
container.
Unlike a runc container, the VFIO devices will have different names to the
host, since the names correspond to the IOMMU groups of the guest and those
can't be remapped with namespaces.
For now we keep 'guest-kernel' as the value in the default configuration
files, to maintain current Kata behaviour. In future we should change this
to 'vfio' as the default. That will make Kata's default behaviour more
closely resemble OCI specified behaviour.
fixes#693
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently constrainGRPCSpec always removes VFIO devices from the OCI
container spec which will be used for the inner container. For
upcoming support for VFIO devices in DPDK usecases we'll need to not
do that.
As a preliminary to that, add an extra parameter to the function to
control whether or not it will remove the VFIO devices from the spec.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
"constraint" is a noun, "constrain" is the associated verb, which makes
more sense in this context.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
In order to support DPDK workloads, we need to change the way VFIO devices
will be handled in Kata containers. However, the current method, although
it is not remotely OCI compliant has real uses. Therefore, introduce a new
runtime configuration field "vfio_mode" to control how VFIO devices will be
presented to the container.
We also add a new sandbox annotation -
io.katacontainers.config.runtime.vfio_mode - to override this on a
per-sandbox basis.
For now, the only allowed value is "guest-kernel" which refers to the
current behaviour where VFIO devices added to the container will be bound
to whatever driver in the VM kernel claims them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Add and adjust the vfio devices in the inner container spec so that
rustjail will create device nodes for them.
In order to do that, we also need to make sure the VFIO device node is
ready within the guest VM first. That may take (slightly) longer than
just the underlying PCI device(s) being ready, because vfio-pci needs
to initialize. So, add a helper function that will wait for a
specific VFIO device node to be ready, using the existing uevent
listening mechanism. It also returns the device node name for the
device (though in practice it will always /dev/vfio/NN where NN is the
group number).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Many device nodes go directly under /dev, however some are conventionally
placed in subdirectories under /dev. For example /dev/vfio/vfio or
/dev/pts/ptmx.
Currently, attempting to pass such a device into a Kata container will fail
because mknod() will get an ENOENT because the parent directory is missing
(or an equivalent error for bind_dev()).
Correct that by making subdirectories as necessary in create_devices().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
For each user supplied device, create_devices() checks that the given path
actually is in /dev, by checking that its path starts with /dev and does
not contain "..".
However, this has subtle errors because it's interpreting the path as a raw
string without considering separators. It will accept the path /devfoo
which it should not, while it will not accept the valid (though weird)
paths /dev/... and /dev/a..b.
Correct this by using std::path::Path methods designed for the purpose.
Having done this, it's trivial to also generate the relative path that
mknod_dev() or bind_dev() will need, so do that at the same time.
We also move this logic into a helper function so that we can add some unit
tests for it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Both these functions take the absolute path from LinuxDevice and drop the
leading '/' to make a relative path. They do that with a simple
&dev.path[1..]. That can be technically incorrect in some edge cases such
as a path with redundant /s like "//dev//sda".
To handle cases like that, have the explicit relative path passed into
these functions. For now we calculate it in the same buggy way, but we'll
fix that shortly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
create_devices() within the rustjail module is responsible for creating
device nodes within the (inner) containers. Errors that occur here will
be propagated up, but are likely to be low level failures of mknod() - e.g.
ENOENT or EACCESS - which won't be very useful without context when
reported all the way up to the runtime without the context of what we were
trying to do.
Add some anyhow context information giving the details of the device we
were trying to create when it failed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently, update_spec_device() assumes that the proper device path in the
(inner) container is the same as the device path specified in the outer OCI
spec on the host.
Usually that's correct. However for VFIO group devices we actually need
the container to see the VM's device path, since it's normal to correlate
that with IOMMU group information from sysfs which will be different in the
guest and which we can't namespace away.
So, add an extra "final_path" parameter to update_spec_device() to allow
callers to chose the device path that should be used for the inner
container. All current callers pass the same thing as container_path, but
that will change in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
update_spec_device_list() is used to update the container configuration to
change device major/minor numbers configured by the Kata client based on
host details to values suitable for the sandbox VM, which may differ. It
takes a 'device' object, but the only things it actually uses from there
are container_path and vm_path.
Refactor this as update_spec_device(), taking the host and guest paths to
the device as explicit parameters. This makes the function more
self-contained and will enable some future extensions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Each VFIO device passed into the guest could represent a whole IOMMU group
of devices on the host. Since these devices aren't DMA isolated from each
other, they must appear as the same IOMMU group in the guest as well.
The VMM should enforce that for us, but double check it, since things can't
work otherwise. This also means we determine the guest IOMMU group for the
VFIO device, which we'll be needing later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
For upcoming VFIO extensions we'll need to work with the IOMMU groups of
VFIO devices. This helps us towards that by adding pci_iommu_group() to
retrieve the IOMMU group (if any) of a given PCI device.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
VFIO devices can be added to a Kata container and they will be passed
through to the sandbox guest. However, inside the guest those devices
will bind to a native guest driver, so they will no longer appear as VFIO
devices within the guest. This behaviour differs from runc or other
conventional container runtimes.
This code allows the agent to match the behaviour of other runtimes,
if instructed to by kata-runtime. VFIO devices it's informed about
with the "vfio" type instead of the existing "vfio-gk" type will be
rebound to the vfio-pci driver within the guest.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
For better VFIO support, we're going to need to take control of which guest
driver controls specific guest devices. To assist with that, add the
pci_driver_override() function to force a specific guest device to be
bound to a specific guest driver.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Use `"c".to_string` in the device type of `dev/full`
in order to consistent with the coding style of other devices
Fixes: #2890
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
create_tmpfs won't pass as the race condition in watcher umount. quote
James's words here:
1. Rust runs all tests in parallel.
2. Mounts are a process-wide, not a per-thread resource.
The only test that calls watcher.mount() is create_tmpfs().
However, other tests create BindWatcher objects.
3. BindWatcher's drop() implementation calls self.cleanup(),
which calls unmount for the mountpoint create_tmpfs() asserts.
4. The other tests are calling unmount whenever a BindWatcher goes
out of scope.
To avoid that issue, let the tests using BindWatcher in watcher and
sandbox.rs run sequentially.
Fixes: #2809
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
DefaultMaxVCPUs may be larger than the defaultMaxQemuVCPUs that should
be checked and avoided.
Fixes: #2809
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
The physical current vcpu number should not be used directly as the
largest vcpu number is limited to defaultMaxQemuVCPUs.
Here, a new helper is introduced in pkg/katautils/config.go to get
current vcpu number.
Fixes: #2809
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
The current kernel version parse lib can't process suffix '+', as the
modified kernel version will add '+' as suffix, thus panic will occur.
For example, if the current kernel version is "5.14.0-rc4+", test
TestHostNetworkingRequested will panic:
--- FAIL: TestHostNetworkingRequested (0.00s)
panic: &{DistroName:ubuntu DistroVersion:18.04
KernelVersion:5.11.0-rc3+ Issue: Passed:[] Failed:[] Debug:true
ActualEUID:0}: failed to check test constraints: error: Build meta data
is empty
Here, remove the suffix '+' in kernel version fix helper.
Fixes: #2809
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Export the top level hypervisor type
s/hypervisor/Hypervisor
Fixes: #2880
Signed-off-by: Manohar Castelino <mcastelino@apple.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Last of a series of commits to export the top level
hypervisor generic methods.
s/createSandbox/CreateVM
Fixes#2880
Signed-off-by: Manohar Castelino <mcastelino@apple.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>