Commit Graph

3113 Commits

Author SHA1 Message Date
Rong Liu
321e560968 hv: add max payload to vrp
The passthrough device's max payload settings must match the settings on
the native device, otherwise the passthrough device may not work. So we
have to set the vrp's max payload capacity to that of the native root
port; otherwise we may accidentally change the passthrough device's max
payload setting, since during the guest OS's PCI device enumeration the
passthrough device renegotiates its max payload setting with the vrp.
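
A minimal sketch of the idea: read the native root port's max payload size
(MPS) field from its PCIe Device Control register and mirror it into the
vrp's virtual config space. The register layout is standard PCIe; the
helper names here are illustrative, not the actual ACRN API.

	/* Illustrative sketch: mirror the physical root port's MPS into the vrp. */
	#define PCIE_DEVCTL_OFF		0x8U	/* Device Control, PCIe capability */
	#define PCIE_DEVCTL_MPS_MASK	0xE0U	/* max payload size, bits [7:5] */

	static void mirror_max_payload(struct pci_vdev *vrp, union pci_bdf phys_rp,
			uint32_t pcie_cap_pos)
	{
		uint16_t devctl = (uint16_t)pci_pdev_read_cfg(phys_rp,
				pcie_cap_pos + PCIE_DEVCTL_OFF, 2U);

		/* Keep only the MPS field, so the guest's renegotiation during
		 * PCI enumeration converges to the native setting. */
		vrp_set_mps(vrp, devctl & PCIE_DEVCTL_MPS_MASK);	/* illustrative */
	}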

Tracked-On: #5915

Signed-off-by: Rong Liu <rong.l.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-15 08:53:53 +08:00
Victor Sun
50868dd594 HV: ramdisk and kernel load addr improve
For the ramdisk, double check the ramdisk GPA limit when locating the
ramdisk load address.

For the SOS kernel load address, there is no need to consider the
hypervisor start and end addresses, since that range has already been set
to RESERVED in e820.

Tracked-On: #5879

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 21:50:22 +08:00
Victor Sun
e371432695 HV: avoid pre-launched VM modules being corrupted by SOS kernel load
When the hypervisor boots, the multiboot modules have already been loaded
into host space by the bootloader. The address range of the pre-launched
VM modules is also exposed to the SOS VM, so the SOS VM kernel might pick
this range to extract its kernel image when KASLR is enabled. That would
corrupt the pre-launched VM modules and make the pre-launched VMs fail to
boot.

This patch fixes the issue by not loading the SOS VM into guest space
until all pre-launched VMs have been loaded successfully.

Tracked-On: #5879

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
1b3a75c984 HV: place kernel and ramdisk by find_space_from_ve820()
We should not hardcode the VM ramdisk load address right after the kernel
load address, for two reasons:
	1. Per the Linux kernel boot protocol, the kernel needs a block of
	   contiguous memory (i.e. the init_size field in the zeropage) from
	   its load address to boot, so that range would overlap with the
	   ramdisk;
	2. The hardcoded address is not guaranteed to be a valid address in
	   the guest e820 table, especially with a huge ramdisk;

Also, we should not hardcode the VM kernel load address to its
pref_address, which works for non-relocatable kernels only. A relocatable
kernel can run from any valid address the bootloader loads it to.

The patch sets the VM kernel and ramdisk load addresses by scanning the
guest e820 table with the find_space_from_ve820() API:
	1. For the SOS VM, the ramdisk has already been loaded by the
	   multiboot bootloader, so its load address is set to the module
	   source address; the relocatable kernel is relocated to an
	   appropriate address outside the space of the hypervisor and boot
	   modules, to avoid corruption during the guest memory copy;
	2. For a pre-launched VM, the kernel is loaded to its pref_address
	   first, then the ramdisk is put at an appropriate address outside
	   the kernel's space, according to the guest memory layout and the
	   maximum ramdisk address limit under 4GB;

Tracked-On: #5879

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
eca245760a HV: create guest efi memmap for SOS VM
The SOS VM should not use the host EFI memmap directly, since some memory
ranges reserved by the hypervisor and pre-launched VMs must not be exposed
to the SOS VM. These ranges should be filtered out of the SOS VM EFI
memmap, otherwise they would cause unexpected issues. For example, the SOS
kernel KASLR first tries to find a random address for the extracted kernel
image in the EFI table, so it is possible that this reserved memory gets
picked for the extracted kernel image, which makes the SOS kernel boot
fail.

The patch creates an EFI memory map for the SOS VM and passes the memory
map info to the zeropage for loading the SOS VM kernel. The boot-service
regions in the host EFI memmap are also kept for the SOS VM, so that the
SOS VM has the same full EFI service capability as the host.

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
a966ed70c2 HV: correct bootargs module size
The bootargs module is a string buffer with a NUL character at the end,
so its size should not be calculated by strnlen_s() alone; otherwise the
NUL character is dropped in the GPA copy, which makes the kernel fail to
boot.
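
A one-line sketch of the sizing rule; the buffer name and bound here are
illustrative:

	/* strnlen_s() does not count the terminating NUL, so add one byte to
	 * keep the terminator when the buffer is copied into guest memory. */
	uint32_t bootargs_size = strnlen_s(bootargs, MAX_BOOTARGS_SIZE) + 1U;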

Tracked-On: #6162

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
268d4c3f3c HV: boot guest with boot params
Previously the load GPAs of LaaG boot params such as the zeropage,
cmdline, and initgdt were all hard-coded, which could cause LaaG boot
issues.

The patch fixes this by finding a 32KB load_params memory block for LaaG
to store these guest boot params.

For other guests with a raw image, in general only the vgdt needs to be
taken care of, so load_params is put at 0x800, a common place that most
guests won't touch when entering protected mode.

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
ed97022646 HV: add find_space_from_ve820() api
The API searches the ve820 table and returns a valid GPA when the
requested amount of memory is available in the specified memory range, or
INVALID_GPA if no suitable memory slot is available, as sketched below.
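
A minimal sketch of such a scan, assuming the ve820 entries carry
baseaddr/length/type fields; the real signature and alignment handling may
differ:

	/* Illustrative sketch, not the actual implementation. */
	uint64_t find_space_from_ve820(const struct e820_entry *ve820, uint32_t num,
			uint64_t size, uint64_t min_addr, uint64_t max_addr)
	{
		uint32_t i;

		for (i = 0U; i < num; i++) {
			uint64_t start = (ve820[i].baseaddr > min_addr) ?
					ve820[i].baseaddr : min_addr;
			uint64_t end = ve820[i].baseaddr + ve820[i].length;

			end = (end < max_addr) ? end : max_addr;

			/* only usable RAM entries are candidates */
			if ((ve820[i].type == E820_TYPE_RAM) &&
					(start < end) && ((end - start) >= size)) {
				return start;
			}
		}
		return INVALID_GPA;
	}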

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
6127c0c5d2 HV: modify low 1MB area for pre-launched VM e820
The memory range [0xA0000, 0xFFFFF] is a well-known reserved area for the
BIOS; the Linux kernel actually enforces this area as reserved during its
boot stage. Marking this area usable could cause compatibility issues.

The patch sets the range to the reserved type to make it consistent with
the real world.

Note that in the real world, for non-EFI boot, an EBDA (Extended BIOS Data
Area) of reserved type also exists right before 0xA0000. But given that
ACRN has no legacy BIOS emulation, we simply skip the EBDA in the vE820.

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
9dfac7a7a3 HV: init hv_e820 from efi mmap if boot from uefi
The hypervisor uses the e820_alloc_memory() API to allocate memory for
trampoline code and EPT pages, but the usable RAM in hv_e820 might include
EFI boot service regions when the system boots from a UEFI environment,
which would break some UEFI services in the SOS. These boot service
regions should be filtered out of hv_e820.

This patch parses the EFI memory descriptor entries from the EFI memory
map pointer when the system boots from UEFI, and then initializes hv_e820
accordingly, so that all EFI boot service regions are kept as reserved in
hv_e820, as sketched below.

Please note that the original EFI memory map could be above the 4GB
address space, so the EFI memory parsing must be done after
enable_paging().
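
A minimal sketch of the type filtering, using the standard UEFI memory
type names; the function itself is illustrative:

	/* Illustrative sketch: keep EFI boot service regions out of usable RAM. */
	static uint32_t efi_type_to_e820_type(uint32_t efi_type)
	{
		switch (efi_type) {
		case EFI_CONVENTIONAL_MEMORY:
			return E820_TYPE_RAM;
		case EFI_BOOT_SERVICES_CODE:
		case EFI_BOOT_SERVICES_DATA:
			/* usable per the UEFI spec, but kept reserved here so
			 * e820_alloc_memory() never hands these regions out */
			return E820_TYPE_RESERVED;
		default:
			return E820_TYPE_RESERVED;
		}
	}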

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
9ac8e292fd HV: add efi memory map parsing function
When the hypervisor boots from an EFI environment, the EFI memory layout
should be used as the main memory map reference for the hypervisor. This
patch adds a function that parses the EFI memory descriptor entries from
the EFI memory map pointer and stores them into a static hv_memdesc[]
array.
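
A minimal sketch of the descriptor walk; note that UEFI requires stepping
by the firmware-reported descriptor size, not sizeof(). The array bound
and struct name are assumptions:

	/* Illustrative sketch: walk EFI descriptors with the firmware-reported
	 * stride (desc_size), which may exceed sizeof(struct efi_memory_desc). */
	static struct efi_memory_desc hv_memdesc[MAX_EFI_MMAP_ENTRIES];

	static void parse_efi_mmap(const uint8_t *mmap, uint64_t mmap_size,
			uint64_t desc_size)
	{
		uint64_t i, count = mmap_size / desc_size;

		for (i = 0UL; (i < count) && (i < MAX_EFI_MMAP_ENTRIES); i++) {
			hv_memdesc[i] = *(const struct efi_memory_desc *)
					(mmap + (i * desc_size));
		}
	}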

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
4e1deab3d9 HV: init paging before init e820
With this patch, hv_e820 is initialized after paging is enabled. This is
because hv_e820 is initialized from the EFI mmap when the system boots
from UEFI, and the EFI mmap could be above the 4G address space.

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
82c28af404 HV: modularization: rename mi_acpi_rsdp_va to acpi_rsdp_va
Simply rename mi_acpi_rsdp_va in the acrn_boot_info struct to
acpi_rsdp_va.

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
4774c79da0 HV: modularization: refine efi_info struct usage in acrn boot info
This patch makes the following changes:
	1. rename mi_efi_info to uefi_info in struct acrn_boot_info;

	2. remove the redundant "efi_" prefix from efi_info struct members;

	3. the efi_info structure in acrn_boot_info is defined the same way
	   as in the Linux kernel, so the native EFI info from the bootloader
	   was passed to the SOS zeropage directly with memcpy(). Replace the
	   memcpy() with explicit struct member assignments;

	4. add a boot_from_uefi() API;

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
82a1d4406c HV: modularization: use abi_mmap struct in acrn boot info
Use the more generic abi_mmap struct to replace the multiboot_mmap struct
in acrn_boot_info.

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
c59ea6c250 HV: modularization: use abi_module struct in acrn boot info
Use the more generic abi_module struct to replace the multiboot_module
struct in acrn_boot_info.

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
16624bab5e HV: modularization: use loader_name char array in acrn boot info
This patch makes the following changes:
	1. rename mi_loader_name in the acrn_boot_info struct to loader_name;
	2. change the loader_name type from pointer to array, to avoid
	   accessing the original multiboot info region;
	3. remove mi_drivers_length and mi_drivers_addr, which are never used;

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
484d3ec9df HV: modularization: use cmdline char array in acrn boot info
The name mi_cmdline in the acrn_boot_info structure could be confused with
mi_cmdline in the multiboot_info structure, so rename it to cmdline. At
the same time, the data type is changed from pointer to array, to avoid
accessing the original multiboot info region, which might be reused by
other software modules.

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
b11dfb6f20 HV: modularization: add boot.c to wrap multiboot module
Add a wrapper API init_acrn_boot_info() so that ACRN can be booted with
any boot protocol.

Another change is replacing the term "multiboot1" with "multiboot",
because there is no such term officially.

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
28b7cee412 HV: modularization: rename multiboot.h to boot.h
Given that the structures in multiboot.h could be used for any boot
protocol, use the more generic name "boot.h" instead.

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
e8f726e321 HV: modularization: remove mi_flags from acrn boot info
mi_flags is not needed any more, so remove it from the acrn_boot_info
struct.

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
8f24d91108 HV: modularization: name change on acrn_multiboot_info
The acrn_multiboot_info structure stores ACRN-specific boot info and
should not be limited to multiboot-protocol-related structures only.

This patch only makes the following changes:

	1. rename acrn_multiboot_info to acrn_boot_info;
	2. rename mbi to abi, following the change above; the old name could
	   also be confused with the native multiboot info;

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
b0e1d610d2 HV: modularization: move module check to sanitize multiboot info
ACRN used to support a deprivileged boot mode, which did not need
multiboot modules, while direct boot mode needs multiboot modules at least
for the service VM bzImage, so ACRN postponed the multiboot module sanity
check to init_vm_boot_info().

Now that deprivileged boot mode has been removed entirely, we can do the
multiboot module check in sanitize_acrn_multiboot_info().

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-11 10:06:02 +08:00
Liang Yi
400d31916a doc: update timer HLD doc after modularization
Replace rdtsc() and get_tsc_khz() with their architecture-agnostic
counterparts cpu_ticks() and cpu_tickrate().
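
A small usage sketch of the renamed API, assuming cpu_tickrate() reports
the tick frequency in kHz as get_tsc_khz() did:

	/* measure an elapsed interval in microseconds */
	uint64_t start = cpu_ticks();
	do_work();	/* illustrative workload */
	uint64_t elapsed_us = ((cpu_ticks() - start) * 1000UL) / cpu_tickrate();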

Tracked-On: #5920
Signed-off-by: Yi Liang <yi.liang@intel.com>
2021-06-09 17:11:25 -04:00
Shuo A Liu
d965f6e6a1 hv: Enlarge E820_MAX_ENTRIES to 64
e820_alloc_memory() splits one E820 entry into two entries. With vEPT
enabled, e820_alloc_memory() is called one more time, and on some
platforms the number of e820 entries might then exceed 32.

Enlarge E820_MAX_ENTRIES to 64. Please note, it must be less than 128 due
to a constraint of the zeropage; the Linux kernel defines it as 128.

Tracked-On: #6168
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-09 10:07:05 +08:00
Shuo A Liu
9ae32f96af hv: Wrap same code as a static function
vmptrld_vmexit_handler() shares a code snippet with
vmclear_vmexit_handler(). Wrap the shared snippet into a static function
clear_vmcs02().

The only small logic change is adding
	nested->current_vmcs12_ptr = INVALID_GPA
in vmptrld_vmexit_handler() for the old VMCS, which is reasonable.
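
A minimal sketch of the factored helper; the field path is an assumption
and the body stands in for the shared snippet:

	/* Illustrative sketch: vmclear_vmexit_handler() and
	 * vmptrld_vmexit_handler() both call this instead of duplicating code. */
	static void clear_vmcs02(struct acrn_vcpu *vcpu)
	{
		struct acrn_nested *nested = &vcpu->arch.nested;	/* illustrative */

		/* ... shared snippet: flush and release the cached VMCS02 ... */
		nested->current_vmcs12_ptr = INVALID_GPA;
	}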

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-09 10:07:05 +08:00
Shuo A Liu
387ea23961 hv: Rename get_ept_entry() to get_eptp()
get_ept_entry() actually returns the EPTP of a VM. So rename it to
get_eptp() for readability.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-09 10:07:05 +08:00
Zide Chen
b6b5373818 hv: deny access to HV owned legacy PIO UART from SOS
We need to deny accesses from the SOS to the HV-owned UART device,
otherwise the SOS could access this physical device directly and mess up
the HV console.

If the ACRN debug UART is configured as PIO based (for example,
CONFIG_SERIAL_PIO_BASE is generated by the acrn-config tool, or the UART
config is overridden by the hypervisor parameter "uart=port@<port address>"),
problems can arise if ACRN doesn't emulate this UART PIO port for the SOS.
For example:

- none of the ACRN-emulated vUART devices has the same PIO port as the
  debug UART device.
- ACRN emulates a PCI vUART for the SOS (configure "console_vuart" with
  PCI_VUART in the scenario configuration).

This patch fixes the above issue by masking PIO accesses from the SOS.
deny_hv_owned_devices() is moved after setup_io_bitmap(), where
vm->arch_vm.io_bitmap is initialized.

Commit 50d852561 ("HV: deny HV owned PCI bar access from SOS") handles the
case where the ACRN debug UART is configured as a PCI device, e.g., when
the hypervisor parameter "uart=bdf@<BDF value>" is appended.

If the hypervisor debug UART is MMIO based, it needs to be configured as a
PCI-type device so that it can be hidden from the SOS.

Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-08 16:16:14 +08:00
Yonghua Huang
25c0e3817e hv: validate input for dmar_free_irte function
A malicious 'index' input may trigger a buffer overflow on the
'irte_alloc_bitmap[]' array.

This patch validates that 'index' is less than 'CONFIG_MAX_IR_ENTRIES',
and with this fix also removes the now-unnecessary check on 'index' in the
'ptirq_free_irte()' function.
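
A minimal sketch of the validation, assuming a 64-bit-word bitmap and a
bitmap_clear_nolock()-style helper; the function name is illustrative:

	/* Illustrative sketch: reject out-of-range IRTE indexes up front. */
	static void free_irte_checked(uint16_t index)
	{
		if (index < CONFIG_MAX_IR_ENTRIES) {
			/* each uint64_t word of the bitmap tracks 64 entries */
			bitmap_clear_nolock(index & 0x3FU,
					&irte_alloc_bitmap[index >> 6U]);
		}
	}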

Tracked-On: #6132
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
2021-06-08 09:03:10 +08:00
Yonghua Huang
4acaeb91bd hv: remove unnecessary ASSERT in vlapic_write
vlapic_write() handles valid 'offset' values and ignores all invalid
ones, so an ASSERT on this 'offset' input is unnecessary.

This patch removes that ASSERT to avoid a potential hypervisor crash
caused by malicious guest input when a debug build is used.

Tracked-On: #6131
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
2021-06-08 09:03:10 +08:00
Tao Yuhong
cb75de2163 HV: vpci: refine vbar sizing
For a PCI BAR, the size-aligned low bits are fixed to 0 (except the memory
type bits, which have another fixed value) and are read-only. When ~0U is
written to a BAR for sizing, (type_bits | size_mask) is what ends up in
the BAR. So there is no need to distinguish between sizing a vBAR and
programming a vBAR: whenever a value is written to a vBAR, always store
((value & size_mask) | type_bits) into the vcfg, as sketched below.
pci_vdev_read_vbar() is unnecessary, because reading the vcfg is enough.
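
A minimal sketch of the unified write path; the mask/type derivation and
the offset helper are illustrative:

	/* Illustrative sketch: one path covers both BAR sizing and programming.
	 * BAR sizes are powers of two, so size_mask is just ~(size - 1). */
	static void vdev_write_vbar(struct pci_vdev *vdev, uint32_t idx, uint32_t val)
	{
		uint32_t size_mask = ~((uint32_t)vdev->vbars[idx].size - 1U);
		uint32_t type_bits = vdev->vbars[idx].type_bits;	/* illustrative */

		/* writing ~0U for sizing naturally stores (size_mask | type_bits) */
		pci_vdev_write_vcfg(vdev, pci_bar_offset(idx), 4U,
				(val & size_mask) | type_bits);
	}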

Tracked-On: #6011
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Li Fei <fei1.li@intel.com>
2021-06-08 08:39:01 +08:00
Tao Yuhong
5ecca6b256 HV: vpci: check if address is in VM BAR MMIO space
When the guest is re-programming a BAR, we should check whether the base
address of the BAR is valid. This patch checks, as sketched below:
1. whether the GPA is located in the corresponding MMIO window
2. whether the GPA is aligned to the BAR size
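
A minimal sketch of the two checks, assuming an MMIO window
[mmio_lo, mmio_hi); the names are illustrative:

	/* Illustrative sketch of the BAR base validity check. */
	static bool is_vbar_base_valid(uint64_t gpa, uint64_t bar_size,
			uint64_t mmio_lo, uint64_t mmio_hi)
	{
		/* 1. the whole BAR must fit in the corresponding MMIO window */
		bool in_window = (gpa >= mmio_lo) && ((gpa + bar_size) <= mmio_hi);
		/* 2. the base must be size-aligned (BAR sizes are powers of two) */
		bool aligned = ((gpa & (bar_size - 1UL)) == 0UL);

		return in_window && aligned;
	}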

Tracked-On: #6011
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Li Fei <fei1.li@intel.com>
2021-06-08 08:39:01 +08:00
Tao Yuhong
7da53ce138 HV: vpci: fix failure to mask I/O BAR upper 16 bits
We now use pci_vdev_update_vbar_base() to update the vBAR base address
when the guest re-programs a BAR. For an I/O BAR, we calculate the 32-bit
base address and then mask off the high 16 bits. However, the masking code
was never reached, because the first if condition was always true.

This patch fixes it by moving the masking code into the first if condition
block.
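
A minimal sketch of the fixed control flow; the predicates and helpers are
illustrative:

	/* Illustrative sketch: the 16-bit mask now sits inside the branch
	 * that is actually taken for an I/O BAR. */
	static void update_vbar_base_fixed(const struct pci_vbar *vbar, uint64_t *base)
	{
		if (is_pio_bar(vbar)) {				/* illustrative */
			*base = (uint64_t)low32_bar_value(vbar);	/* illustrative */
			*base &= 0xFFFFUL;	/* I/O BARs decode 16 bits only */
		} else {
			*base = mem_bar_base(vbar);		/* illustrative */
		}
	}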

Tracked-On: #6011
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Li Fei <fei1.li@intel.com>
2021-06-08 08:39:01 +08:00
Shuo A Liu
15e6c5b9cf hv: nested: audit guest EPT mapping during shadow EPT entries setup
generate_shadow_ept_entry() didn't verify the correctness of the requested
guest EPT mapping. That might leak host memory access to the L2 VM.

To simplify the implementation of the guest EPT audit, hide the
'map 2-Mbyte page' and 'map 1-Gbyte page' capabilities from the L1 VM. In
addition, minimize the attribute bits of the EPT entry when creating a
shadow EPT entry. Also, for an invalid requested mapping address, reflect
the EPT_VIOLATION to the L1 VM.

Here, we have some TODOs:
1) Enable large page support in generate_shadow_ept_entry()
2) Evaluate whether invalid GPA accesses of L2 need to be emulated in the
   HV directly.
3) Minimize EPT entry attributes.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
3110e70d0a hv: nested: INVEPT emulation supports shadow EPT
The L1 VM changes the guest EPT and executes INVEPT to invalidate the
previously cached EPT TLB entries. The shadow EPT relies on the INVEPT
instruction to do the update.

The target shadow EPTs can be found according to the 'type' of INVEPT.
There are two types, each with its own target shadow EPTs:
  1) Single-context invalidation
     Get the EPTP from the INVEPT descriptor, then find the target
     shadow EPT.
  2) Global invalidation
     All shadow EPTs of the L1 VM.

The INVEPT emulation handler invalidates all the EPT entries of the
target shadow EPTs, as sketched below.
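
A minimal sketch of the type dispatch; the descriptor layout follows the
SDM, while the helper names are assumptions:

	/* Illustrative sketch of the INVEPT emulation dispatch. */
	struct invept_desc {
		uint64_t eptp;
		uint64_t reserved;
	};

	static void emulate_invept(struct acrn_vm *vm, uint64_t type,
			const struct invept_desc *desc)
	{
		if (type == 1UL) {	/* single-context invalidation */
			invalidate_shadow_ept(find_nept_desc(desc->eptp));	/* illustrative */
		} else {		/* type 2UL: global invalidation */
			invalidate_all_shadow_epts(vm);				/* illustrative */
		}
	}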

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
1dc7b7f798 hv: nested: Introduce shadow EPT release function
When a shadow EPT is not used any more, its resources need to be
released.

free_sept_table() is introduced to walk the whole shadow EPT table and
free the page-table pages.

Please note, the PML4E page of the shadow EPT is not freed by
free_sept_table(), as it is still used to represent the shadow EPT
pointer.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
b10b5658bd hv: nested: Introduce L2 VM EPT VIOLATION handler
With shadow EPT, the hypervisor walks through the guest EPT table:

  * If the entry is not present in the guest EPT, ACRN injects
    EPT_VIOLATION to the L1 VM and resumes to the L1 VM.

  * If the entry is present in the guest EPT, do the EPT_MISCONFIG check.
    Inject EPT_MISCONFIG to the L1 VM if the check fails.

  * If the entry is present in the guest EPT, do the permission check.
    Reflect EPT_VIOLATION to the L1 VM if the check fails.

  * If the entry is present in the guest EPT but the shadow EPT entry is
    not present, create the shadow entry and resume to the L2 VM.

  * If the entry is present in the guest EPT but the GPA in the entry is
    invalid, inject EPT_VIOLATION to the L1 VM and resume the L1 VM.

A condensed sketch of this decision flow follows below.
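
A condensed sketch of the flow; all helper names except
generate_shadow_ept_entry() are illustrative:

	/* Illustrative sketch of the L2 EPT violation handling. */
	int32_t handle_l2_ept_violation(struct acrn_vcpu *vcpu, uint64_t l2_gpa)
	{
		uint64_t gpte;

		if (!guest_ept_lookup(vcpu, l2_gpa, &gpte)) {
			return inject_to_l1(vcpu, EPT_VIOLATION);	/* not present */
		}
		if (is_ept_misconfig(gpte)) {
			return inject_to_l1(vcpu, EPT_MISCONFIG);
		}
		if (!permission_check_ok(vcpu, gpte)) {
			return inject_to_l1(vcpu, EPT_VIOLATION);
		}
		if (!is_valid_gpa(vcpu->vm, ept_entry_gpa(gpte))) {
			return inject_to_l1(vcpu, EPT_VIOLATION);
		}
		generate_shadow_ept_entry(vcpu, l2_gpa, gpte);	/* then resume L2 */
		return 0;
	}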

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
8565750bbe hv: nested: Hide some capability bits from L1 guest
* Hide the 5-level EPT capability; let the L1 guest stick to 4-level EPT.

* Access/Dirty bits are not supported currently; hide the corresponding
  EPT capability bits.

* "Mode-based execute control for EPT" is also not well supported
  currently; hide its capability bit from MSR_IA32_VMX_PROCBASED_CTLS2.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
540a484147 hv: nested: Manage shadow EPTP according to guest VMCS change
'struct nept_desc' is used to associate a guest EPTP with a shadow EPTP.
It is created on the first reference and freed when there are no more
references.

The life cycle looks like this:

When the guest VMCS VMX_EPT_POINTER_FULL is changed, the 'struct
nept_desc' of the new guest EPTP is referenced and the 'struct nept_desc'
of the old guest EPTP is dereferenced.

When a guest VMCS is cleared (by VMCLEAR in the L1 VM), the 'struct
nept_desc' of the old guest EPTP is dereferenced.

When a new guest VMCS is loaded (by VMPTRLD in the L1 VM), the 'struct
nept_desc' of the new guest EPTP is referenced and the 'struct nept_desc'
of the old guest EPTP is dereferenced.

A reference-counting sketch follows below.
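
A minimal reference-counting sketch under these rules; the lookup and
allocation helpers are assumptions:

	/* Illustrative sketch of nept_desc reference management. */
	static struct nept_desc *get_nept_desc(uint64_t guest_eptp)
	{
		struct nept_desc *desc = find_nept_desc(guest_eptp);	/* illustrative */

		if (desc == NULL) {
			desc = alloc_nept_desc(guest_eptp);	/* also builds shadow_eptp */
		}
		desc->ref_count++;
		return desc;
	}

	static void put_nept_desc(struct nept_desc *desc)
	{
		desc->ref_count--;
		if (desc->ref_count == 0U) {
			free_sept_table(desc->shadow_eptp);	/* argument form illustrative */
			release_nept_desc(desc);		/* illustrative */
		}
	}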

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
10ec896f99 hv: nested: Introduce shadow EPT infrastructure
To shadow a guest EPT, the hypervisor needs to construct a shadow EPT for
each guest EPT. The key to associating a shadow EPT with a guest EPT is
the EPTP (EPT pointer). This patch provides the following structure to do
the association:

	struct nept_desc {
	       /*
	        * A shadow EPTP.
	        * The format is same with 'EPT pointer' in VMCS.
	        * Its PML4 address field is a HVA of the hypervisor.
	        */
	       uint64_t shadow_eptp;
	       /*
	        * A guest EPTP configured by the L1 VM.
	        * The format is same with 'EPT pointer' in VMCS.
	        * Its PML4 address field is a GPA of the L1 VM.
	        */
	       uint64_t guest_eptp;
	       uint32_t ref_count;
	};

Since the hypervisor lacks dynamic memory allocation, an array nept_bucket
of type 'struct nept_desc' is introduced to store the association
information. A guest EPT might be shared between different L2 vCPUs, so
this patch provides several functions to handle references to the
structure.

The interface get_shadow_eptp() is also introduced, to look up the shadow
EPTP of a specified guest EPTP.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
17bc7f08c9 hv: nested: Create a page pool for shadow EPT construction
Shadow EPT uses lots of pages to construct the shadow page table. To
utilize the memory more efficiently, a page pool sept_page_pool is
introduced.

For simplicity, the total platform RAM size is used to calculate the
memory needed for the shadow page tables. This is not an accurate upper
bound, but it satisfies typical use cases where there is not a lot of
overcommitment and sharing of memory between L2 VMs; a rough sizing sketch
follows below.

Memory for the pool is marked as reserved in the E820 table at an early
stage.
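
A rough sizing sketch under the stated simplification; the constant names
are illustrative, not the actual configuration symbols:

	/* Illustrative sizing sketch: with 4KB pages, one page-table page of
	 * 512 PTEs maps 2MB, so covering all platform RAM needs about
	 * (ram_size / 2MB) leaf pages, plus roughly 1/512 of that per upper
	 * level of the table. */
	#define SEPT_LEAF_PAGES	(PLATFORM_RAM_SIZE >> 21U)
	#define SEPT_POOL_PAGES	(SEPT_LEAF_PAGES + (SEPT_LEAF_PAGES >> 9U) + 2U)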

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Zide Chen
811e367ad9 hv: nested: implement nested VM exit handler
Nested VM exits happen when a vCPU is in guest mode (VMCS02 is current).
Initially we reflect all nested VM exits to the L1 hypervisor. To prepare
the environment to run the L1 guest:

- restore some VMCS fields to the values the L1 hypervisor programmed.
- VMCLEAR VMCS02, VMPTRLD VMCS01, and enable VMCS shadowing.
- load the non-shadowing host states from VMCS12 into the VMCS01 guest
  states.
- VMRESUME to the L1 guest with this modified VMCS01.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Alexander Merritt <alex.merritt@intel.com>
2021-06-03 15:23:25 +08:00
Zide Chen
22d225663f hv: nested: update run_vcpu() function for nested case
Since the L2 guest vCPU mode and VPID are managed by the L1 hypervisor,
we can skip this handling in run_vcpu().

Also be careful that we can't cache L2 registers in struct acrn_vcpu.

Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-03 15:23:25 +08:00
Zide Chen
4acc65eacc hv: nested: support for INVEPT and INVVPID emulation
invvpid and invept instructions cause VM exits unconditionally.
For initial support, we pass all the instruction operands as is
to the pCPU.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-03 15:23:25 +08:00
Zide Chen
4c29a0bb29 hv: nested: support for VMLAUNCH and VMRESUME emulation
Implement the VMLAUNCH and VMRESUME instructions, allowing an L1
hypervisor to run nested guests.

- merge VMCS control fields and VMCS guest fields into VMCS02
- clear the shadow VMCS indicator on VMCS02 and load VMCS02 as current
- set the VMCS12 launch state to "launched" in the VMLAUNCH handler

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Alex Merritt <alex.merritt@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-03 15:23:25 +08:00
Yonghua Huang
1a6ead9af5 hv: update RTCT ACPI table detection
The signature of the RTCT ACPI table may be "PTCT" (v1) or "RTCT" (v2),
and the MAGIC number in the CRL header also changed from "PTCM" to
"RTCM".

This patch refines the code to detect the RTCT table for both v1 and v2.
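
A minimal sketch of the dual-signature lookup, assuming a
get_acpi_tbl()-style helper that finds a table by signature:

	/* Illustrative sketch: accept both the v1 and v2 table signatures. */
	static void *detect_rtct_table(void)
	{
		void *tbl = get_acpi_tbl("RTCT");	/* v2 signature */

		if (tbl == NULL) {
			tbl = get_acpi_tbl("PTCT");	/* fall back to v1 */
		}
		return tbl;
	}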

Tracked-On: #6020
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
2021-06-01 08:22:20 +08:00
Rong Liu
43acbdd345 hv: ensure PTM root is always enabled in hw
For post-launch VMs, ACRN supports PTM under these conditions:
1. The HW implements a simple PTM hierarchy: the PTM requestor device (ep)
   is directly connected to a PTM-root-capable root port; or
2. the PTM requestor itself is a root-complex integrated ep.

Currently ACRN doesn't support emulation of other types of PTM hierarchy,
such as when there is an intermediate PTM node (for example, a switch) in
between the PTM requestor and the PTM root.

To avoid the VM touching physical hardware, the ACRN HV ensures PTM is
always enabled in the hardware. During the HV's PCI init, if a root port
is PTM capable, the HV enables PTM on that root port. In addition, it logs
an error (and doesn't enable PTM) if the PTM root capability is on an
intermediate node other than a root port, as sketched below.
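
A minimal sketch of the init-time policy; only pci_enable_ptm_root() is
named by this commit, the other helpers are illustrative:

	/* Illustrative sketch: enforce the PTM policy during HV PCI init. */
	if (has_ptm_root_cap(pdev)) {			/* illustrative */
		if (is_root_port(pdev)) {		/* illustrative */
			pci_enable_ptm_root(pdev);	/* keep PTM always on in HW */
		} else {
			pr_err("PTM root capability on a non-root-port node is unsupported");
		}
	}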

V2:
 - Modify the commit message to clarify the limitations of the current
   PTM implementation.
 - Fix code that may fail FuSa.
 - Remove pci_ptm_info() and put the info log inside
   pci_enable_ptm_root().

Tracked-On: #5915
Signed-off-by: Rong Liu <rong.l.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-05-27 09:00:50 +08:00
Li Fei1
0d5f12e281 hv: vlapic: a minor refine about vlapic_x2apic_pt_icr_access
In physical destination mode, the destination processor is specified by
its local APIC ID. When a CPU switches from xAPIC mode to x2APIC mode or
vice versa, the local APIC ID is not changed. So a vCPU in x2APIC mode
could use physical destination mode to send an IPI to another vCPU in
xAPIC mode by writing the ICR.

This patch adds support for a vCPU A to write the ICR to send an IPI to
another vCPU B that is in a different APIC mode.

Tracked-On: #5923
Signed-off-by: Li Fei1 <fei1.li@intel.com>
2021-05-27 09:00:08 +08:00
dongshen
3e1fa0fb98 hv: hypercall: use 72 reserved bytes in struct hc_platform_info to pass physical APIC IDs and CAT ID shifts to DM
This lets the DM retrieve physical APIC IDs and use them to fill in the
ACPI MADT table for post-launched VMs.

Note:
1. The DM needs to use the same logic as the hypervisor to calculate
vLAPIC IDs, based on the physical APIC IDs and the CPU affinity setting.

2. Using reserved0[] in struct hc_platform_info to pass physical APIC IDs
means we can only support at most 116 cores, and it assumes the LAPIC ID
is 8 bits (x2APIC mode supports 32 bits).

The CAT ID shifts will be used by the DM for RTCT v2.

Tracked-On: #6020
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
2021-05-26 11:23:06 +08:00
dongshen
5e3c6ae941 hv: vcpuid: passthrough host CPUID leaf.0BH to guest VMs
Using physical APIC IDs as vLAPIC IDs for pre-launched and post-launched
VMs is not sufficient to replicate the host CPU and cache topologies in
guest VMs; we also need to pass host CPUID leaf 0BH through to guest VMs,
otherwise guest VMs may see a weird CPU topology.

Note that in the current code, ACRN already passes host cache CPUID leaf
04H through to guest VMs.
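
A minimal sketch of the leaf 0BH passthrough, assuming a cpuid_subleaf()-style
intrinsic and an entry struct holding the four registers:

	/* Illustrative sketch: reflect host CPUID leaf 0BH to the guest so it
	 * sees the host CPU topology (vLAPIC IDs equal physical APIC IDs). */
	static void passthrough_cpuid_leaf_0b(uint32_t subleaf,
			struct vcpuid_entry *entry)
	{
		cpuid_subleaf(0x0BU, subleaf, &entry->eax, &entry->ebx,
				&entry->ecx, &entry->edx);
	}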

Tracked-On: #6020
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
2021-05-26 11:23:06 +08:00