Sometimes OVFM provides incorrect values to the kernel
we override it by telling the kernel to handle the PCI space setup
like allocating the proper window sizes and assigning the proper busses
to each device.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Stop computing and setting mem-reserve and pref64-reserve on PCIe root
ports and switch ports. Remove getBARsMaxAddressableMemory() which
scanned host GPU BARs to pre-calculate these values.
The previous approach only considered GPU devices (IsGPU(), class
0x0302) when scanning for BAR sizes, so devices like NVSwitches (class
0x0680) with their 32MB non-prefetchable BAR0 were not accounted for
and received the 4MB default. Additionally, GetTotalAddressableMemory()
classifies BARs by 32/64-bit address width rather than by the
prefetchable flag that QEMU's mem-reserve vs pref64-reserve maps to.
Modern QEMU introspects VFIO device BARs when they are attached to
root ports and sizes the MMIO windows accordingly. Modern OVMF
(edk2-stable202502+) automatically calculates the 64-bit PCI MMIO
aperture based on the BARs of actually present devices during PCI
enumeration. Omitting the reserve parameters lets QEMU and OVMF
handle MMIO window sizing correctly for all device types including
GPUs, NVSwitches, and NICs without requiring host-side BAR scanning.
This also removes the nvpci dependency from qemu_arch_base.go.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Modern OVMF (edk2-stable202502 and later) automatically sizes the
64-bit PCI MMIO aperture based on the BARs of actually attached
devices during PCI enumeration. The opt/ovmf/X-PciMmio64Mb fw_cfg
hint is no longer needed to ensure large-BAR devices like NVIDIA
GPUs receive adequate MMIO space.
The previous approach was fragile: the runtime scanned host PCI
devices to estimate the required aperture size, but only considered
GPU devices (class 0x0302), missing NVSwitches and other devices
with large BARs. Removing this code avoids confusion about MMIO
sizing responsibility.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
On containerd v3 config, disable_snapshot_annotations must be set under the
images plugin, not the runtime plugin.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Different kubernetes distributions, such as k0s, use a different kubelet
root dir location instead of the default /var/lib/kubelet, so ConfigMap
and Secret volume propagation were failing.
This adds a kubelet_root_dir config option that the go runtime uses when
matching volume paths and kata-deploy now sets it automatically for k0s
via a drop-in file.
runtime-rs does not need this option: it identifies ConfigMap/Secret,
projected, and downward-api volumes by volume-type path segment
(kubernetes.io~configmap, etc.), not by kubelet root prefix.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The host path of bundles is not portable and could be literally anything
depending on containerd configuration, so we can't rely on a specific
prefix when deriving the bundle-id. Instead, we derive the bundle-id
from the target root path in the guest.
NOTE: fixes https://github.com/kata-containers/kata-containers/issues/10065
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This commit adds logic to properly handle memory hotplug
for QemuCCWVirtio in the ExecMemdevAdd() path.
The new logic is triggered only when virtio-mem is enabled.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
ResizeMemory() already contains the virtio-mem resize logic.
However, hotplugAddMemory(), which is invoked via a different
path, lacked this handling and always fell back to the pc-dimm
path, even when virtio-mem was configured.
This commit adds virtio-mem resize handling to hotplugAddMemory().
It also adds corresponding unit tests.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
When image.reference or kubectlImage.reference already contains a digest
(e.g. quay.io/...@sha256:...), use the reference as-is instead of
appending :tag. This avoids invalid image strings like 'image@sha256🔤'
when tag is empty and allows users to pin by digest.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The actionlint gh extension is outdated and the wrapping seems
unnecessary when there is a github action that seems to be maintained,
so let's update to use that
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
cargo machete has identified the follow crates as unused:
- containerd-shim-protos
- safe-path
- strum
- ttrpc
strum is neded (and maybe isn't picked up due to it being
used by macros?), so add it to the ignore list and remove
the rest
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
`cargo machete` has identified `openssl` and `serde-transcode`
as being un-used. openssl is required, so add it to the ignore
list and just remove serde-transcode
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
`log` and `rustjail` are flagged by cargo machete as unused,
so lets remove them to reduce the footprint of crates in this tool
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Remove unused crates to reduce our size and the work needed
to do updates
- Also update package.metadata.cargo-machete with some crates
that are incorrectly coming up as unused
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
cargo machete can't understand `host-device = ["dep:vfio-bindings"`,
so tell it to ignore `vfio-bindings` and not suggest it's unused
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
I ran cargo machete in trace-forwarder and it suggested that some
of the packages were not used, including a chain with a vulnerability,
so try and remove them to resolve RUSTSEC-2021-0139
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Reinstate mariner host testing - including the Agent Policy tests on
these hosts - now that a new CLH version brought in the required fixes.
This reverts commit ea53779b90.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Specify raw image format for all guest block devices.
- Attempting to auto-detect the image format from CLH would be riskier
for the Host.
- Creating a new raw image file, auto-detecting its format, and then
creating a filesystem from the Guest onto the block device is no
longer supported by CLH v51. Therefore, Kata CI's k8s-block-volume.bats
would fail without specifying the raw format when hot plugging its block
device.
- See cloud-hypervisor/cloud-hypervisor@b3e8e2a for additional information.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Specify raw image format for all guest block devices.
- Attempting to auto-detect the image format from CLH would be riskier
for the Host.
- Creating a new raw image file, auto-detecting its format, and then
creating a filesystem from the Guest onto the block device is no
longer supported by CLH v51. Therefore, Kata CI's k8s-block-volume.bats
would fail without specifying the raw format when hot plugging its block
device.
- See cloud-hypervisor/cloud-hypervisor@b3e8e2a for additional information.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
```
v51.1
=====
This is a bug fix release. The following issues have been addressed:
* Fix image_type in OpenAPI definition (#7734)
v51.0
=====
This release has been tracked in v51.0 group of our roadmap project.
Security Fixes
This release fixes a security vulnerability in disk image handling.
Details can be found in GHSA-jmr4-g2hv-mjj6.
* A new `backing_files=on|off` option has been added to `--disk` to
explicitly control whether QCOW2 backing files are permitted. This
defaults to `off` to prevent the loading of backing files entirely.
(#7685)
* Explicit image type specification via the user interface, removing
reliance on format autodetection (#7728).
* Prevent sector-zero writes for autodetected raw images (#7728).
Significant QCOW2 v3 Improvements
A large number of QCOW2 v3 specification features have been implemented:
* RAW backing file support for QCOW2 overlays (#7570)
* Zero bit in L2 entries (#7627)
* Incompatible feature bit validation (#7612)
* Dirty bit support (#7636)
* Variable refcount widths (1 to 64-bit) (#7633)
* Corrupt bit detection and marking (#7639)
* Autoclear feature bits handling (#7648)
* Thread safety fix for multiple virtio queues (`num_queues > 1`)
(#7661)
* Correct zero-fill for reads beyond backing file size (#7678)
* Live disk resize support (#7687)
ACPI Generic Initiator Support
ACPI Generic Initiator Affinity (SRAT Type 5) support has been added
to associate VFIO-PCI devices with dedicated memory/CPU-less NUMA
nodes. This enables the guest OS to make NUMA-aware memory allocation
decisions for device workloads. A new `device_id` parameter has been
added to `--numa` for specifying VFIO devices. (#7626)
Block Device DISCARD and WRITE_ZEROES Support
The `virtio-blk` device now supports `DISCARD` and `WRITE_ZEROES`
operations for QCOW2 and RAW image formats. This enables thin
provisioning and efficient space reclamation when guests trim
filesystems. A new `sparse=on|off` option has been added to `--disk` to
control disk space management: `sparse=on` (default) enables thin
provisioning with space reclamation, while `sparse=off` provides thick
provisioning with consistent I/O latency. (#7666)
Notable Performance Improvements
* Transparent Huge Pages (THP) support has been extended to cover
anonymous shared memory (`shared=on`) via `madvise`. Previously, THP
was only used for non-shared memory. (#7646)
* The `vhost-user-net` device now uses the default set of vhost-user
virtio features, including `VIRTIO_F_RING_INDIRECT_DESC`, which
provides a performance improvement. (#7653)
MSHV Support Improvements
* Optimize CPU state update after emulation by only updating special
registers when changed (#7603)
* Enable SMT for guests with `threads_per_core > 1` (#7668)
* Stub `save_data_tables()` to unblock VM pause/resume (#7692)
* Handle `GHCB_INFO_SPECIAL_DBGPRINT` VMG exit in SEV-SNP guest exit
handler (#7703)
* Fix CVM boot failure on MSHV (#7548)
* Fix CPU topology detection for multithreaded configurations (#7576)
Notable Bug Fixes
* Fix VFIO device hot-remove leaving group and container file
descriptors open, preventing re-add (#7676)
* Fix snapshot restore when backing file is on read-only storage with
`shared=false` (#7674)
* Enforce `VIRTIO_BLK_F_RO` even if guest does not negotiate it
(#7705)
* Fix read-only block device FLUSH requests from OVMF preventing VMs
from booting (#7706)
* Fix vhost-user device not properly dropping unowned file descriptors
(#7679)
* Fix `vhost-user-block` `get_config` interoperability (#7617)
* Fix vsock TOCTOU race condition by copying packet header from guest
memory before processing (#7530)
* Fix vsock handling of large TX packets spanning multiple data
descriptors (#7680)
* Add `gettid()` to all seccomp filters (#7596)
* Fix MAC address parsing that wrongly allowed `+` instead of hex
characters (#7579)
* Improve UUID parse error message and `--net` fd help text (#7702)
* Fix various inconsistencies in our OpenAPI specification file
(#7716, #7726)
* Various documentation fixes (#7602, #7606)
```
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Trustee now returns the binary SNP TCB claims as hex rather than base64
(for consistency with other platforms). Fortunately, the sev-snp-measure
tool has a flag for setting the output type of the launch digest.
I think hex is the default, but let's keep the flag here to be explicit.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
Unfortunately, due to golang/go#75031, there is an issue
that results in `go: no such tool "covdata"`
with a automatically installed 1.25 toolchain, so
the approach to skip the install_go.sh script (which causes
double install problems) didn't work. Try the alternative approach
of using setup-go action, which should do a more comprehensive job
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
This commit enables the SEV-SNP guest policy to be explicitly
configured via the runtime configuration in runtime-rs.
To provide both ease of use and maximum flexibility, the following
logic is implemented:
1. If the user provides a custom `snp_guest_policy` in the
configuration, this value is passed directly to the QEMU SEV-SNP
guest object.
2. If the user does not specify a policy, the driver defaults to
`0x30000`, matching QEMU's standard default for SEV-SNP guests.
This enhancement allows users to fine-tune security constraints through
the policy bitmask, while ensuring a sensible and functional default
for standard SNP deployments.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit introduces three new fields to the `SecurityInfo` struct
to support SEV-SNP (Secure Nested Paging) attestation and measurement
capabilities:
(1) `snp_id_block`: A 96-byte Base64-encoded ID block for the
SNP_LAUNCH_FINISH command.
(2) `snp_id_auth`: A 4096-byte Base64-encoded authentication structure
accompanying the ID block.
(3) `snp_guest_policy`: A bitmask for the SNP guest policy, passed to
the SNP_LAUNCH_START command.
These fields enable users to provide identity information to the SNP
firmware, allowing for remote attestation and verified guest launches.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
A bitmask for the SNP guest policy is introduced in ObjectSevSnpGuest
to help pass to Qemu cmdline.
And defaults to 0x30000 (QEMU's default) to maintain standard behavior
it just looks like as: "policy=0x30000"
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>