Enable "-sandbox on" in qemu can introduce another protect layer
on the host, to make the secure container more secure.
The default option is disable because this feature may introduce some
performance cost, even though user can enable
/proc/sys/net/core/bpf_jit_enable to reduce the impact.
Fixes: #2266
Signed-off-by: Feng Wang <feng.wang@databricks.com>
Remove space from root span name to follow camel casing of other tracing
span names in the runtime and to make parsing easier in testing.
Fixes#4483
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
By comparing the content of the old url and the new url,
ensure that their content is consistent and does not contain ambiguities
Fixes: #4454
Signed-off-by: Binbin Zhang <binbin36520@gmail.com>
Let's improve the log so we make it clear that we're only *actually*
adding the net device to the Cloud Hypervisor configuration when calling
our own version of VmAddNetPut().
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We want to have the file descriptors of the opened tuntap device to pass
them down to the VMMs, so the VMMs don't have to explicitly open a new
tuntap device themselves, as the `container_kvm_t` label does not allow
such a thing.
With this change we ensure that what's currently done when using QEMU as
the hypervisor, can be easily replicated with other VMMs, even if they
don't support multiqueue.
As a side effect of this, we need to close the received file descriptors
in the code of the VMMs which are not going to use them.
Fixes: #3533
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Adding FFI_NO_PI to the netlink flags causes no harm to the supported
and tested hypervisors as when opening the device by its name Cloud
Hypervisor[0], Firecracker[1], and QEMU[2] do set the flag already.
However, when receiving the file descriptor of an opened tutap device
Cloud Hypervisor is not able to set the flag, leaving the guest without
connectivity.
To avoid such an issue, let's simply add the FFI_NO_PI flag to the
netlink flags and ensure, from our side, that the VMMs don't have to set
it on their side when dealing with an already opened tuntap device.
Note that there's a PR opened[3] just for testing that this change
doesn't cause any breakage.
[0]: e52175c2ab/net_util/src/tap.rs (L129)
[1]: b6d6f71213/src/devices/src/virtio/net/tap.rs (L126)
[2]: 3757b0d08b/net/tap-linux.c (L54)
[3]: https://github.com/kata-containers/kata-containers/pull/4292
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This is basically a no-op right now, as:
* netPair.TapInterface.VMFds is nil
* the tap name is still passed to Cloud Hypervisor, which is the Cloud
Hypervisor's first choice when opening a tap device.
In the very near future we'll stop passing the tap name to Cloud
Hypervisor, and start passing the file descriptors of the opened tap
instead.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Knowing that VmAddNetPut works as expected, let's switch to manually
building the request and writing it to the appropriate socket.
By doing this it gives us more flexibility to, later on, pass the file
descriptor of the tuntap device to Cloud Hypervisor, as openAPI doesn't
support such operation (it has no notion of SCM Rights).
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Instead of creating the VM with the network device already plugged in,
let's actually add the network device *after* the VM is created, but
*before* the Vm is actually booted.
Although it looks like it doesn't make any functional difference between
what's done in the past and what this commit introduces, this will be
used to workaround a limitation on OpenAPI when it comes to passing down
the network device's file descriptor to Cloud Hypervisor, so Cloud
Hypervisor can use it instead of opening the device by its name on the
VMM side.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
VmAddNetPut is the API provided by the Cloud Hypervisor client (auto
generated) code to hotplug a new network device to the VM.
Let's expose it now as it'll be used as part this series, mostly to
guide the reviewer through the process of what we have to do, as later
on, spoiler alert, it'll end up being removed.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
So far this has been done for x86_64. Now that the support for building
and testing has been added for all arches, let's do the second part of
the switch.
We're still not done yet for powerpc, as some a virtifosd crash on the
rust version has been found by the maintainer.
Fixes: #4258, #4260
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Changed bitsize for parsing functions to 64-bit in order to avoid
parsing errors.
Fixes#4435
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Add more detail to the `kata-monitor` doc to allow an admin to make a
more informed decision about where and how to run the daemon.
Fixes: #4416.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
GetOOMEvent is a blocking call that will fail if
the container exit, in this case, it's not an error or warning.
Changing the log level for logs in case of GetOOMEvent call fails
will reduce log noise in a large cluster that has pods
creating/deleting frequently.
Fixes: #4376
Signed-off-by: Bin Liu <bin@hyper.sh>
Since #902 the `io.katacontainers.config.hypervisor` pod annotations
have only been permitted if explicitly allowed in the global
configuration. The default global configuration allows no such
annotations. That's important because several of those annotations
would cause Kata to execute arbitrary binaries, and so were wildly
unsafe.
However, this is inconvenient for the
`io.katacontainers.config.hypervisor.enable_iommu` annotation
specifically, which controls whether the sandbox VM includes a vIOMMU.
A guest side vIOMMU is necessary to implement VFIO passthrough devices
with `vfio_mode = vfio`, so enabling that mode of operation currently
requires a global configuration change, and can't just be enabled
per-pod.
Unlike some of the other hypervisor annotations, the `enable_iommu`
annotation is quite safe. By default the vIOMMU is not present, so
allowing a user to override it for a pod only improves their
facilities for isolation. Even if the global default were changed to
enable the vIOMMU, that doesn't compel the guest kernel to use it, so
allowing a user to disable the vIOMMU doesn't materially affect
isolation either.
Therefore, allow the io.katacontainers.config.hypervisor.enable_iommu
annotation to work in the default configurations.
fixes#4330
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Set thestop container force flag to true so that the container state is always set to
“StateStopped” after the container wait goroutine is finished. This is necessary for
the following delete container step to succeed.
Fixes: #4359
Signed-off-by: Feng Wang <feng.wang@databricks.com>
In linux 5.14 and hopefully some backports, core scheduling allows processes to
be co scheduled within the same domain on SMT enabled systems.
Containerd impl sets the core sched domain when launching a shim. This
allows a clean way for each shim(container/pod) to be in its own domain and any
additional containers, (v2 pods) be be launched with the same domain as well as
any exec'd process added to the container.
kernel docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/core-scheduling.html
For Kata specifically, we will look for SCHED_CORE environment variable
to be set to indicate we shuold create a new schedule core domain.
This is equivalent to the containerd shim's PR: e48bbe8394Fixes: #4309
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Michael Crosby <michael@thepasture.io>
While end users can connect directly to the shim, let's provide a way to
easily get/set iptables from kata-runtime itself.
Fixes: #4080
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Without this, potential errors are silently dropped. Let's ensure we
return the error code as well as potenial data from the response.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Before, we had a mix of slash, etc. Unfortunately, when cleaning URL
paths, serve mux seems to mangle the request method, resulting in each
request being a GET (instead of PUT or POST).
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Add two endpoints: ip6tables, iptables.
Each url handler supports GET and PUT operations. PUT expects
the requests' data to be []bytes, and to contain iptable information in
format to be consumed by iptables-restore.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Introduce get/set iptable handling. We add a sandbox API for getting and
setting the IPTables within the guest. This routes it from sandbox
interface, through kata-agent, ultimately making requests to the guest
agent.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
This release has been tracked through the v24.0 project.
virtio-iommu specification describes how a device can be attached by default
to a bypass domain. This feature is particularly helpful for booting a VM with
guest software which doesn't support virtio-iommu but still need to access
the device. Now that Cloud Hypervisor supports this feature, it can boot a VM
with Rust Hypervisor Firmware or OVMF even if the virtio-block device exposing
the disk image is placed behind a virtual IOMMU.
Multiple checks have been added to the code to prevent devices with identical
identifiers from being created, and therefore avoid unexpected behaviors at boot
or whenever a device was hot plugged into the VM.
Sparse mmap support has been added to both VFIO and vfio-user devices. This
allows the device regions that are not fully mappable to be partially mapped.
And the more a device region can be mapped into the guest address space, the
fewer VM exits will be generated when this device is accessed. This directly
impacts the performance related to this device.
A new serial_number option has been added to --platform, allowing a user to
set a specific serial number for the platform. This number is exposed to the
guest through the SMBIOS.
* Fix loading RAW firmware (#4072)
* Reject compressed QCOW images (#4055)
* Reject virtio-mem resize if device is not activated (#4003)
* Fix potential mmap leaks from VFIO/vfio-user MMIO regions (#4069)
* Fix algorithm finding HOB memory resources (#3983)
* Refactor interrupt handling (#4083)
* Load kernel asynchronously (#4022)
* Only create ACPI memory manager DSDT when resizable (#4013)
Deprecated features will be removed in a subsequent release and users should
plan to use alternatives
* The mergeable option from the virtio-pmem support has been deprecated
(#3968)
* The dax option from the virtio-fs support has been deprecated (#3889)
Fixes: #4317
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Today the shim does a translation when doing
direct-volume stats where it takes the source and
returns the mount path within the guest.
The source for a direct-assigned volume is actually
the device path on the host and not the publish
volume path.
This change will perform a lookup of the mount info
during direct-volume stats to ensure that the
device path is provided to the shim for querying
the volume stats.
Fixes: #4297
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
The go default http mux AFAIK doesn’t support pattern
routing so right now client is padding the url
for direct-volume stats with a subpath of the volume
path and this will always result in 404 not found returned
by the shim.
This change will update the shim to take the volume
path as a GET query parameter instead of a subpath.
If the parameter is missing or empty, then return
400 BadRequest to the client.
Fixes: #4297
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
The action function expects a function that returns error
but the current direct-volume stats Action returns
(string, error) which is invalid.
This change fixes the format and print out the stats from
the command instead.
Fixes: #4293
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
The documentation of the bufio package explicitly says
"Err returns the first non-EOF error that was encountered by the
Scanner."
When io.EOF happens, `Err()` will return `nil` and `Scan()` will return
`false`.
Fixes#4079
Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
This allows to get guest early boot logs which are usually
missed when virtconsole is used.
- It utilizes previous work on the govmm side:
https://github.com/kata-containers/govmm/pull/203
- unit test added
Fixes: #4237
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
As now we build and ship the rust version of virtiofsd, which is not
tied to QEMU, we need to update its default location to match with where
we're installing this binary.
Fixes: #4249
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
go-test.sh by default adds the -v option to 'go test' meaning that output
will be printed from all the passing tests as well as any failing ones.
This results in a lot of output in which it's often difficult to locate the
failing tests you're interested in.
So, remove -v from the default flags.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
One of the responsibilities of the go-test.sh script is setting up the
default flags for 'go test'. This is constructed across several different
places in the script using several unneeded intermediate variables though.
Consolidate all the flag construction into one place.
fixes#4190
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
go-test.sh changes behaviour based on both the $CI and $KATA_DEV_MODE
variables, but not in a way that makes a lot of sense.
If either one is set it uses the test_coverage path, instead of the
test_local path. That collects coverage information, as the name
suggests, but it also means it runs the tests twice as root and
non-root, which is very non-obvious.
It's not clear what use case the test_local path is for at all.
Developer local builds will typically have $KATA_DEV_MODE set and CI
builds will have $CI set. There's essentially no downside to running
coverage all the time - it has little impact on the test runtime.
In addition, if *both* $CI and $KATA_DEV_MODE are set, the script
refuses to run things as root, considering it "unsafe". While having
both set might be unwise in a general sense, there's not really any
way running sudo can be any more unsafe than it is with either one
set.
So, simplify everything by just always running the test_coverage path.
This leaves the test_local path unused, so we can remove it entirely.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
go-test.sh accepts subcommands, however invoking it in the usual way via
the Makefile doesn't use them. In fact the only remaining subcommand is
"help" and we already have another way of getting the usage information
(-h or --help). We don't need a second way, so just drop subcommand
handling.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
go-test.sh defaults to testing all the packages listed by go list, except
for a number filtered out. It turns out that none of those filters are
necessary any more:
* We've long required a Go newer than 1.9 which means the vendor filter
isn't needed
* The agent filter doesn't do anything now that we've moved to the Kata
2.x unified repo
* The tests filters don't hit anything on the list of modules in
src/runtime (which is the only user of the script)
But since we don't need to filter anything out any more, we don't even need
to iterate through a list ourselves. We can simply pass "./..." directly
to go test and it will iterate through all the sub-packages itself.
Interestingly this more than doubles the speed of "make test" for me - I
suspect because go test's internal paralellism works better over a larger
pool of tests.
This also lets us remove handling of non-existent coverage files from
test_go_package(), since with default options we will no longer test packages without tests
by default. If the user explicitly requests testing of a package with no
tests, then failing makes sense.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The go-test.sh script has an explicit chmod command, run as root, to
set the mode of the temporary coverage files to 0644. AFAICT the
point of this is specifically the 004 bit allowing world read access,
so that we can then merge the temporary coverage file into the main
coverage file.
That's a convoluted way of doing things. Instead we can just run the tail
command which reads the temporary file as the same user that generated it.
In addition, go-test.sh became root to remove that temporary coverage
file. This is not necessary, since deleting a regular file just requires
write access to the directory, not the file itself.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The html-coverage option to this script doesn't really alter behaviour
it just does the same thing as normal coverage, then converts the
report to HTML. That conversion is a single command, plus a chmod to
make the final output mode 0644. That overrides any umask the user
has set, which doesn't seem like a policy decision this script should
be making.
Nothing in the kata-containers or tests repository uses this, so it doesn't
really make sense to keep this logic inside this script.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
In addition to coverage.txt, the go-test.sh script creates
coverage.txt.tmp files while running. These are temporary and
certainly shouldn't be committed, so add them to the gitignore file.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The go unit tests for the runtime are invoked by the helper script
ci/go-test.sh. Which calls the run_go_test() function in ci/lib.sh. Which
calls into .ci/go-test.sh from the tests repository.
But.. the runtime is the only user of this script, and generally stuff for
unit tests (rather than functional or integration tests) lives in the main
repository, not the tests repository.
So, just move the actual script into src/runtime. A change to remove it
from the tests repo will follow.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We're currently hitting a race condition on the Cloud Hypervisor's
driver code when quickly removing and adding a block device.
This happens because the device removal is an asynchronous operation,
and we currently do *not* monitor events coming from Cloud Hypervisor to
know when the device was actually removed. Together with this, the
sandbox code doesn't know about that and when a new device is attached
it'll quickly assign what may be the very same ID to the new device,
leading to the Cloud Hypervisor's driver trying to hotplug a device with
the very same ID of the device that was not yet removed.
This is, in a nutshell, why the tests with Cloud Hypervisor and
devmapper have been failing every now and then.
The workaround taken to solve the issue is basically *not* passing down
the device ID to Cloud Hypervisor and simply letting Cloud Hypervisor
itself generate those, as Cloud Hypervisor does it in a manner that
avoids such conflicts. With this addition we have then to keep a map of
the device ID and the Cloud Hypervisor's generated ID, so we can
properly remove the device.
This workaround will probably stay for a while, at least till someone
has enough cycles to implement a way to watch the device removal event
and then properly act on that. Spoiler alert, this will be a complex
change that may not even be worth it considering the race can be avoided
with this commit.
Fixes: #4176
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
With everything implemented, let's now expose the disk rate limiter
configuration options in the Cloud Hypervisor configuration file.
Fixes: #4139
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
With everything implemented, let's now expose the net rate limiter
configuration options in the Cloud Hypervisor configuration file.
Fixes: #4017
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The notion of "built-in rate limiter" was added as part of
bd8658e362, and that commit considered
that only Firecracker had a built-in rate limiter, which I think was the
case when that was introduced (mid 2020).
Nowadays, however, Cloud Hypervisor takes advantage of the very same crate
used by Firecraker to do I/O throttling.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's take advantage of the newly added DiskRateLimiter* options and
apply those to the network device configuration.
The logic here is identical to the one already present in the Network
part of Cloud Hypervisor's driver.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's add the newly added disk rate limiter configurations to the Cloud
Hypervisor's hypervisor configuration.
Right now those are not used anywhere, and there's absolutely no way the
users can set those up. That's coming later in this very same series.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This is the disk counterpart of the what was introduced for the network
as part of the previous commits in this series.
The newly added fields are:
* DiskRateLimiterBwMaxRate, defined in bits per second, which is used to
control the network I/O bandwidth at the VM level.
* DiskRateLimiterBwOneTimeBurst, also defined in bits per second, which
is used to define an *initial* max rate, which doesn't replenish.
* DiskRateLimiterOpsMaxRate, the operations per second equivalent of the
DiskRateLimiterBwMaxRate.
* DiskRateLimiterOpsOneTimeBurst, the operations per second equivalent of
the DiskRateLimiterBwOneTimeBurst.
For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's take advantage of the newly added NetRateLimiter* options and
apply those to the network device configuration.
The logic here is quite similar to the one already present in the
Firecracker's driver, with the main difference being the single Inbound
/ Outbound MaxRate and the presence of both Bandwidth and Operations
rate limiter.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Firecracker's driver doesn't expose the RefillTime option of the rate
limiter to the user. Instead, it uses a contant value of 1000
miliseconds (1 second).
As we're following Firecracker's driver implementation, let's expose
create a new constant, use it as part of the Firecracker's driver, and
later on re-use it as part of the Cloud Hypervisor's driver.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Firecracker's revertBytes function, now called "RevertBytes", can be
exposed as part of the virtcontainers' utils file, as this function will
be reused by Cloud Hypervisor, when adding the rate limiter logic there.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's add the newly added network rate limiter configurations to the
Cloud Hypervisor's hypervisor configuration.
Right now those are not used anywhere, and there's absolutely no way the
users can set those up. That's coming later in this very same series.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
In a similar way to what's already exposed as RxRateLimiterMaxRate and
TxRateLimiterMaxRate, let's add four new fields to the Hypervisor's
configuration.
The values added are related to bandwidth and operations rate limiters,
which have to be added so we can expose I/O throttling configurations to
users using Cloud Hypervisor as their preferred VMM.
The reason we cannot simply re-use {Rx,Tx}RateLimiterMaxRate is because
Cloud Hypervisor exposes a single MaxRate to be used for both inbound
and outbound queues.
The newly added fields are:
* NetRateLimiterBwMaxRate, defined in bits per second, which is used to
control the network I/O bandwidth at the VM level.
* NetRateLimiterBwOneTimeBurst, also defined in bits per second, which
is used to define an *initial* max rate, which doesn't replenish.
* NetRateLimiterOpsMaxRate, the operations per second equivalent of the
NetRateLimiterBwMaxRate.
* NetRateLimiterOpsOneTimeBurst, the operations per second equivalent of
the NetRateLimiterBwOneTimeBurst.
For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Currently EnableMockTesting() takes no arguments and will always place the
mock storage in the fixed location /tmp/vc/mockfs. This means that one
test run can interfere with the next one if anything isn't cleaned up
(and there are other bugs which means that happens). If if those were
fixed this would allow developers testing on the same machine to interfere
with each other.
So, allow the mockfs to be placed at an arbitrary place given as a
parameter to EnableMockTesting(). In TestMain() we place it under our
existing temporary directory, so we don't need any additional cleanup just
for the mockfs.
fixes#4140
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently MockFSInit always creates the mockfs at the fixed path
/tmp/vc/mockfs. This change allows it to be initialized at any path
given as a parameter. This allows the tests in fs_test.go to be
simplified, because the by using a temporary directory from
t.TempDir(), which is automatically cleaned up, we don't need to
manually trigger initTestDir() (which is misnamed, it's actually a
cleanup function).
For now we still use the fixed path when auto-creating the mockfs in
MockAutoInit(), but we'll change that later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
virtcontainers/persist/fs/mockfs.go defines a mock filesystem type for
testing. A global variable in virtcontainers/persist/manager.go is used to
force use of the mock fs rather than a normal one.
This patch moves the global, and the EnableMockTesting() function which
sets it into mockfs.go. This is slightly cleaner to begin with, and will
allow some further enhancements.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
storagePathSuffix defines the file path suffix - "vc" - used for
Kata's persistent storage information, as a private constant. We
duplicate this information in fc.go which also needs it.
Export it from fs.go instead, so it can be used in fc.go.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
A number of unit tests under virtcontainers/factory use
MockStorageRootPath() as a general purpose temporary directory. This
doesn't make sense: the mockfs driver isn't even in use here since we only
call EnableMockTesting for the pase virtcontainers package, not the
subpackages.
Instead use t.TempDir() which is for exactly this purpose. As a bonus it
also handles the cleanup, so we don't need MockStorageDestroy any more.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
There are several tests in mount_test.go which perform a sample bind
mount. These need a corresponding unmount to clean up afterwards or
attempting to delete the temporary files will fail due to the existing
mountpoint. Most of them had such an unmount, but
TestBindMountInvalidPgtypes was missing one.
In addition, the existing unmounts where done inconsistently - one was
simply inline (so wouldn't be executed if the test fails too early) and one
is a defer. Change them all to use the t.Cleanup mechanism.
For the dummy mountpoint files, rather than cleaning them up after the
test, the tests were removing them at the beginning of the test. That
stops the test being messed up by a previous run, but messily. Since
these are created in a private temporary directory anyway, if there's
something already there, that indicates a problem we shouldn't ignore.
In fact we don't need to explicitly remove these at all - they'll be
removed along with the rest of the private temporary directory.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The tests in hook_test.go run a mock hook binary, which does some debug
logging to /tmp/mock_hook.log. Currently we don't clean up those logs
when the tests are done. Use a test cleanup function to do this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
SetupOCIConfigFile creates a temporary directory with os.MkDirTemp(). This
means the callers need to register a deferred function to remove it again.
At least one of them was commented out meaning that a /temp/katatest-
directory was leftover after the unit tests ran.
Change to using t.TempDir() which as well as better matching other parts of
the tests means the testing framework will handle cleaning it up.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Several tests in kata_agent_test.go create /tmp/mountPoint as a dummy
directory to mount. This is not cleaned up after the test. Although it
is in /tmp, that's still a little messy and can be confusing to a user.
In addition, because it uses the same name every time, it allows for one
run of the test to interfere with the next.
Use the built in t.TempDir() to use an automatically named and deleted
temporary directory instead.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This is to fix a test failure for the
kata-containers-2.0-ubuntu-20.04-s390x-main-baseline jenkins job
Fixes: #4088
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
kata-monitor allows to get data profiles from the kata shim
instances running on the same node by acting as a proxy
(e.g., http://$NODE_ADDRESS:8090/debug/pprof/?sandbox=$MYSANDBOXID).
In order to proxy the requests and the responses to the right shim,
kata-monitor requires to pass the sandbox id via a query string in the
url.
The profiling index page proxied by kata-monitor contains the link to all
the data profiles available. All the links anyway do not contain the
sandbox id included in the request: the links result then broken when
accessed through kata-monitor.
This happens because the profiling index page comes from the kata shim,
which will not include the query string provided in the http request.
Let's add on-the-fly the sandbox id in each href tag returned by the kata
shim index page before providing the proxied page.
Fixes: #4054
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
The scanner reads nothing from viriofsd stderr pipe, because param
'--syslog' rediercts stderr to syslog. So there is no need to write
scanner.Text() to kata log
Fixes: #4063
Signed-off-by: Zhuoyu Tie <tiezhuoyu@outlook.com>
The fsGroup will be specified by the fsGroup key in
the direct-assign mountinfo metadate field.
This will be set when invoking the kata-runtime
binary and providing the key, value pair in the metadata
field. Similarly, the fsGroupChangePolicy will also
be provided in the mountinfo metadate field.
Adding an extra fields FsGroup and FSGroupChangePolicy
in the Mount construct for container mount which will
be populated when creating block devices by parsing
out the mountInfo.json.
And in handleDeviceBlockVolume of the kata-agent client,
it checks if the mount FSGroup is not nil, which
indicates that fsGroup change is required in the guest,
and will provide the FSGroup field in the protobuf to
pass the value to the agent.
Fixes#4018
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
This change adds two fields to the Storage pb
FSGroup which is a group id that the runtime
specifies to indicate to the agent to perform a
chown of the mounted volume to the specified
group id after mounting is complete in the guest.
FSGroupChangePolicy which is a policy to indicate
whether to always perform the group id ownership
change or only if the root directory group id
does not match with the desired group id.
These two fields will allow CSI plugins to indicate
to Kata that after the block device is mounted in
the guest, group id ownership change should be performed
on that volume.
Fixes#4018
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
Add some links to rendered webpages for better user experience,
let users can jump to pages only by clicking links in browsers.
Fixes: #4061
Signed-off-by: bin <bin@hyper.sh>
virtiofsd's debug will be enabled if hypervisor's debug has been
enabled, this will generate too many noisy logs from virtiofsd.
Unbind the relationship of log level between virtiofsd and
hypervisor, if users want to see debug log of virtiofsd,
can set it by:
virtio_fs_extra_args = ["-o", "log_level=debug"]
Fixes: #3303
Signed-off-by: bin <bin@hyper.sh>
The vhost-user-blk can be hotplugged on the PCI bridge successfully on
X86, but failed on Arm. However, hotplugging it on Root Port as a PCIe
device can work well on ARM.
Open the "pcie_root_port" in configuration.toml is needed.
Fixes: #4019
Signed-off-by: Jaylyn Ren <jaylyn.ren@arm.com>
This configuration option is valid for all the hypervisor that are going
to be used with the confidential containers effort, thus exposing the
configuration option for Cloud Hypervisor as well.
Fixes: #4022
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The directory created by `T.TempDir` is automatically removed when the
test and all its subtests complete.
This commit also updates the unit test advice to use `T.TempDir` to
create temporary directory in tests.
Fixes: #3924
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
(default: "/run/containerd/containerd.sock") is duplicated when
printing kata-monitor usage:
[root@kubernetes ~]# kata-monitor --help
Usage of kata-monitor:
-listen-address string
The address to listen on for HTTP requests. (default ":8090")
-log-level string
Log level of logrus(trace/debug/info/warn/error/fatal/panic). (default "info")
-runtime-endpoint string
Endpoint of CRI container runtime service. (default: "/run/containerd/containerd.sock") (default "/run/containerd/containerd.sock")
the golang flag package takes care of adding the defaults when printing
usage. Remove the explicit print of the value so that it would not be
printed on screen twice.
Fixes: #3998
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
getOOMEvents is a long-waiting call, it will retry when failed.
For cases of agent shutdown, the retry should stop.
When the agent hasn't detected agent has died, we can also check
whether the error is "ttrpc: closed".
Fixes: #3815
Signed-off-by: bin <bin@hyper.sh>
Modify the 2Mib in the comment to 4Mib.
VirtioMem is set by configuration file or annotation. And setupVirtioMem is called only when VirtioMem is true.
Fixes: #3750
Signed-off-by: yaoyinnan <yaoyinnan@foxmail.com>
Previously, it was not permitted to have neither an initrd nor an image.
However, this is the exact config to use for Secure Execution, where the
initrd is part of the image to be specified as `-kernel`. Require the
configuration of no initrd for Secure Execution.
Also
- remove redundant code for image/initrd checking -- no need to check in
`newQemuHypervisorConfig` (calling) when it is also checked in
`getInitrdAndImage` (called)
- use `QemuCCWVirtio` constant when possible
Fixes: #3922
Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
src/runtime/virtcontainers/hook/mock contains a simple example hook in Go.
The only thing this is used for is for some tests in
src/runtime/pkg/katautils/hook_test.go. It doesn't really have anything
to do with the rest of the virtcontainers package.
So, move it next to the test code that uses it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We've now removed the need to install the mock hook binary for unit tests.
However, it turns out that managing that was the *only* thing that the
install and uninstall targets in the virtcontainers Makefile handled.
So, remove them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Running unit tests should generally have minimal dependencies on
things outside the build tree. It *definitely* shouldn't modify
system wide things outside the build tree. Currently the runtime
"make test" target does so, though.
Several of the tests in src/runtime/pkg/katautils/hook_test.go require a
sample hook binary. They expect this hook in
/usr/bin/virtcontainers/bin/test/hook, so the makefile, as root, installs
the test binary to that location.
Go tests automatically run within the package's directory though, so
there's no need to use a system wide path. We can use a relative path to
the binary build within the tree just as easily.
fixes#3941
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The VC_BIN_DIR variable in the virtcontainers Makefile is almost unused.
It's used to generate TEST_BIN_DIR, and it's created in the install target.
However, we also create TEST_BIN_DIR, which is a subdirectory of VC_BIN_DIR
with mkdir -p, so it will necessarily create VC_BIN_DIR along the way.
So we can drop the unnecessary mkdir and expand the definition of
VC_BIN_DIR in the definition of TEST_BIN_DIR.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The INSTALL_EXEC and UNINSTALL_EXEC definitions from the virtcontainers
Makefile (unlike those from the runtime Makefile in the parent directory)
are entirely unused. Remove them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The check-go-test target passes the path to the mock hook test binary to
go-test.sh when it invokes it. But go-test.sh just calls run_go_test from
ci/lib.sh, which invokes a script from the tests repo *without* any
parameters.
That is, this parameter is ignored anyway, so remove it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently kata shim v2 doesn't translate ESRCH signal, causing container
fail to stop and shim leak.
Fixes: #3874
Signed-off-by: Feng Wang <feng.wang@databricks.com>
Currently, the block driver option is specifed by hard coding, maybe it
is better to use const string variables instead of hard coded strings.
Another modification is to remove duplicate consts for virtio driver in
manager.go.
Fixes: #3321
Signed-off-by: Jason Zhang <zhanghj.lc@inspur.com>
This change introduces the `disable_guest_empty_dir` config option,
which allows the user to change whether a Kubernetes emptyDir volume is
created on the guest (the default, for performance reasons), or the host
(necessary if you want to pass data from the host to a guest via an
emptyDir).
Fixes#2053
Signed-off-by: Evan Foster <efoster@adobe.com>
Let's bring in the latest release of Containerd, 1.6.1, released on
March 2nd, 2022.
With this, we take the opportunity to remove containerd/api reference as
we shouldn't need a separate module only for the API.
Here's the list of changes needed in the code due to the bump:
* stop using `grpc.WithInsecure()` as it's been deprecated
- use `grpc.WithTransportCredentials(insecure.NewCredentials())`
instead
Fixes: #3820
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As this is just a initial vcpu hotplug support, thread and socket has
not been supported. So, don't set socket and thread when hotadd cpu for
arm/virt.
Fixes: #3280
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Mount the direct-assigned block device fs only once and keep a refcount
in the guest. Also use the ro flag inside the options field to determine
whether the block device and filesystem should be mounted as ro
Fixes: #3454
Signed-off-by: Feng Wang <feng.wang@databricks.com>
Translate the volume path from host-known path to guest-known path
and forward the request to kata agent.
Fixes: #3454
Signed-off-by: Feng Wang <feng.wang@databricks.com>
During the container creation, it will parse the mount info file
of the direct assigned volumes and update the in memory mount object.
Fixes: #3454
Signed-off-by: Feng Wang <feng.wang@databricks.com>
Add GetVolumeStats and ResizeVolume APIs for the runtime to query stat
and resize fs in the guest.
Fixes: #3454
Signed-off-by: Feng Wang <feng.wang@databricks.com>
To query fs stats and resize fs, the requests need to be passed to
kata agent through containerd-shim-v2. So we're adding to rest APIs
on the shim management endpoint.
Also refactor shim management client to its own go file.
Fixes: #3454
Signed-off-by: Feng Wang <feng.wang@databricks.com>
In the direct assigned volume scenario, Kata Containers persists
the information required for managing the volume inside the guest
on host filesystem.
Fixes: #3454
Signed-off-by: Feng Wang <feng.wang@databricks.com>
Add commands to add, remove, resize and get stats of a direct-assigned volume.
These commands are expected to be consumed by CSI.
Fixes: #3454
Signed-off-by: Feng Wang <feng.wang@databricks.com>
If, for some reason, we're able to launch cloud hypervisor but not able
to boot the VM up, the virtiofsd process would be left behind.
Let's ensure, via defer, that we stop virtiofsd in case of errors.
Fixes: #3819
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>