Commit Graph

1290 Commits

Author SHA1 Message Date
Mingqiang Chi
bb0327e700 hv: remove UUID
With current arch design the UUID is used to identify ACRN VMs,
all VM configurations must be deployed with given UUIDs at build time.
For post-launched VMs, end user must use UUID as acrn-dm parameter
to launch specified user VM. This is not friendly for end users
that they have to look up the pre-configured UUID before launching VM,
and then can only launch the VM which its UUID in the pre-configured UUID
list,otherwise the launch will fail.Another side, VM name is much straight
forward for end user to identify VMs, whereas the VM name defined
in launch script has not been passed to hypervisor VM configuration
so it is not consistent with the VM name when user list VM
in hypervisor shell, this would confuse user a lot.

This patch will resolve these issues by removing UUID as VM identifier
and use VM name instead:
1. Hypervisor will check the VM name duplication during VM creation time
   to make sure the VM name is unique.
2. If the VM name passed from acrn-dm matches one of pre-configured
   VM configurations, the corresponding VM will be launched,
   we call it static configured VM.
   If there is no matching found, hypervisor will try to allocate one
   unused VM configuration slot for this VM with given VM name and get it
   run if VM number does not reach CONFIG_MAX_VM_NUM,
   we will call it dynamic configured VM.
3. For dynamic configured VMs, we need a guest flag to identify them
   because the VM configuration need to be destroyed
   when it is shutdown or creation failed.

v7->v8:
    -- rename is_static_vm_configured to is_static_configured_vm
    -- only set DM owned guest_flags in hcall_create_vm
    -- add check dynamic flag in get_unused_vmid

v6->v7:
    -- refine get_vmid_by_name, return the first matching vm_id
    -- the GUEST_FLAG_STATIC_VM is added to identify the static or
       dynamic VM, the offline tool will set this flag for
       all the pre-defined VMs.
    -- only clear name field for dynamic VM instead of clear entire
       vm_config

Tracked-On: #6685
Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
Reviewed-by: Zhao Yakui <yakui.zhao@intel.com>
Reviewed-by: Victor Sun<victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-16 14:42:59 +08:00
Yuanyuan Zhao
4f6aa38ea5 hv: remove CONFIG_LOW_RAM_SIZE
The CONFIG_LOW_RAM_SIZE is used to describe the size of trampoline code
that is never changed. And it totally confused user to configure it.

This patch hard code it to 1MB and remove the macro for configuration.
In the trampoline related code, use ld_trampoline_end and
ld_trampoline_start symbol to calculate the real size.

Tracked-On: #6805
Signed-off-by: Yuanyuan Zhao <yuanyuan.zhao@linux.intel.com>
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-12 11:56:03 +08:00
Shiqing Gao
7bbd17ce80 hv: initialize and save/restore IA32_TSC_AUX MSR for guest
Commit cbf3825 "hv: Pass-through IA32_TSC_AUX MSR to L1 guest"
lets guest own the physical MSR IA32_TSC_AUX and does not handle this MSR
in the hypervisor.
If multiple vCPUs share the same pCPU, when one vCPU reads MSR IA32_TSC_AUX,
it may get the value set by other vCPUs.

To fix this issue, this patch does:
 - initialize the MSR content to 0 for the given vCPU, which is consistent with
   the value specified in SDM Vol3 "Table 9-1. IA-32 and Intel 64 Processor
   States Following Power-up, Reset, or INIT"
 - save/restore the MSR content for the given vCPU during context switch

v1 -> v2:
 * According to Table 9-1, the content of IA32_TSC_AUX MSR is unchanged
   following INIT, v2 updates the initialization logic so that the content for
   vCPU is consistent with SDM.

Tracked-On: #6799
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Reviewed-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-12 09:30:12 +08:00
Zhou, Wu
1bc25ed198 HV: refine the ve820 tab for pre-VMs
This patch moves the ssram area in ve820 tab, and reunites the
hpa1_low_part1/2 areas. The ve820 building code is refined.

before:
|<---low_1M--->|
|<---hpa1_low_part1--->|
|<---SSRAM--->|
|<---hpa1_low_part2--->|
|<---GPU_OpRegion--->|
|<---ACPI DATA--->|
|<---ACPI NVS--->|
---2G---

after:
|<---low_1M--->|
|<---hpa_low--->|
|<---SSRAM--->|
|<---GPU_OpRegion--->|
|<---ACPI DATA--->|
|<---ACPI NVS--->|
---2G---

The SSRAM area's address is described in the ACPI's RTCT/PTCT
table. To simplify the SSRAM implementation, SSRAM area was
identical mapped to GPA, and resulted in the divition of hpa_low.
Then the ve820 building logic became too complicated.

Now we managed to edit the guest's RTCT/PTCT table by offline
tools in the former patch, so we can move the guest's SSRAM
area, and reunite the hpa_low areas again.

After doing this, this patch rewrites the ve820 building code
in a much simpler way.

Tracked-On: #6674

Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
2021-11-08 13:13:14 +08:00
Zhou, Wu
f1f6fe11c1 HV: move the ve820 GPU OpRegion address
The ve820 table' hpa1_low area is divided into two parts, which
is making the code too complicated and causing problems. Moving
the entries that divides the hpa1_low could make things easier.

This patch moves the GPU OpRegion to the tail area of 2G,
consecutive to the acpi data/nvs area.

before:
|<---low_1M--->|
|<---hpa1_low_part1--->|
|<---SSRAM--->|
|<---GPU_OpRegion--->|
|<---hpa1_low_part2--->|
|<---ACPI DATA--->|
|<---ACPI NVS--->|
---2G---

after:
|<---low_1M--->|
|<---hpa1_low_part1--->|
|<---SSRAM--->|
|<---hpa1_low_part2--->|
|<---GPU_OpRegion--->|
|<---ACPI DATA--->|
|<---ACPI NVS--->|
---2G---

Tracked-On: #6674

Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
2021-11-08 13:13:14 +08:00
Zhou, Wu
e00421d5be HV: Fix the problems in ve820 acpi area
The length of the ACPI data entry in ve820 tab was 960K, while the
ACPI file is 1MB. It causes the ACPI file copy failed due to reserved
ACPI regions in ve820 table is not enough when loading pre-launched
VMs. This patch changes ACPI data area to 1MB to fix the problem.

And the ACPI data length was missed when calculating
ENTRY_HPA1_LOW_PART2 length. Fixed here too.

Also adds some refinement to the hard-coded ACPI base/addr definations

Tracked-On: #6674

Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
2021-11-08 13:13:14 +08:00
Junjie Mao
83a938bae6 HV: treewide: fix violations of coding guideline C-TY-27 & C-TY-28
The coding guideline rules C-TY-27 and C-TY-28, combined, requires that
assignment and arithmetic operations shall be applied only on operands of the
same kind. This patch either adds explicit type casts or adjust types of
variables to align the types of operands.

The only semantic change introduced by this patch is the promotion of the
second argument of set_vmcs_bit() and clear_vmcs_bit() to
uint64_t (formerly uint32_t). This avoids clear_vmcs_bit() to accidentally
clears the upper 32 bits of the requested VMCS field.

Other than that, this patch has no semantic change. Specifically this patch
is not meant to fix buggy narrowing operations, only to make these
operations explicit.

Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-04 18:15:47 +08:00
Junjie Mao
9781873e77 HV: treewide: fix violations of coding guideline C-TY-12
The coding guideline rule C-TY-12 requires that 'all type conversions shall
be explicit'. Especially implicit cases on the signedness of variables
shall be avoided.

This patch either adds explicit type casts or adjust local variable types
to make sure that Booleans, signed and unsigned integers are not used
mixedly.

This patch has no semantic changes.

Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-04 18:15:47 +08:00
Junjie Mao
db20e277b6 HV: treewide: fix violations of coding guideline C-TY-02
The coding guideline rule C-TY-02 requires that 'the operands of bit
operations shall be unsigned'. This patch adds explicit casts or literal
suffixes to make explicit the type of values involved in bit operations.
Explicit casts to widen integers before shift operations are also
introduced to make explicit that the variables are expanded BEFORE it is
shifted (which is already so in C99 but implicitly).

This patch has no semantic changes.

Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-04 18:15:47 +08:00
Junjie Mao
fba343bd05 HV: treewide: fix violations of coding guideline C-FN-06
The coding guideline rule C-FN-06 requires that 'a parameter passed by
value to a function shall not be modified directly'. This patch rewrites
two functions which does modify its parameters today.

This patch has no semantic impact.

Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-04 18:15:47 +08:00
Junjie Mao
3f3f4be642 HV: treewide: fix violations of coding guideline C-EP-05
The coding guideline rule C-EP-05 requires that 'parentheses shall be used
to set the operator precedence explicitly'. This patch adds the missing
parentheses detected by the static analyzer.

Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-04 18:15:47 +08:00
Junjie Mao
4cf6c288cd HV: treewide: fix warnings raised by Clang
This patch fixes the following warnings detected by the LLVM/Clang
compiler:

  1. Unused static functions in C sources, which are fixed by explicitly
     tagging them with __unused

  2. Duplicated parentheses around branch conditions

  3. Assigning 64-bit constants to 32-bit variables, which is fixed by
     promoting the variables to uint64_t

  4. Using { '\0' } to zero-fill an array, which is fixed by replacing it
     with { 0 }

  5. Taking a bit out of a variable using && (which should be & instead)

Most changes do not have a semantic impact, except item 5 which is probably
a real code issue.

Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-11-04 18:15:47 +08:00
Yifan Liu
10963b04d1 hv: Fix vcpu signaling racing problem in lock instruction emulation
In lock instruction emulation, we use vcpu_make_request and
signal_event pairs to shoot down/release other vcpus.
However, vcpu_make_request is async and does not guarantee an execution
of wait_event on target vcpu, and we want wait_event to be consistent
with signal_event.

Consider following scenarios:

1, When target vcpu's state has not yet turned to VCPU_RUNNING,
vcpu_make_request on ACRN_REQUEST_SPLIT_LOCK does not make sense, and will
not result in wait_event.

2, When target vcpu is already requested on ACRN_REQUEST_SPLIT_LOCK (i.e., the
corresponding bit in pending_req is set) but not yet handled,
the vcpu_make_request call does not result in wait_event as 1 bit is not
enough to cache multiple requests.

This patch tries to add checks in vcpu_kick_lock_instr_emulation and
vcpu_complete_lock_instr_emulation to resolve these issues.

Tracked-On: #6502
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-11-02 15:01:20 +08:00
Liu Long
3f4ea38158 ACRN: misc: Unify terminology for service vm/user vm
Rename SOS_VM type to SERVICE_VM
rename UOS to User VM in XML description
rename uos_thread_pid to user_vm_thread_pid
rename devname_uos to devname_user_vm
rename uosid to user_vmid
rename UOS_ACK to USER_VM_ACK
rename SOS_VM_CONFIG_CPU_AFFINITY to SERVICE_VM_CONFIG_CPU_AFFINITY
rename SOS_COM to SERVICE_VM_COM
rename SOS_UART1_VALID_NUM" to SERVICE_VM_UART1_VALID_NUM
rename SOS_BOOTARGS_DIFF to SERVICE_VM_BOOTARGS_DIFF
rename uos to user_vm in launch script and xml

Tracked-On: #6744
Signed-off-by: Liu Long <long.liu@linux.intel.com>
Reviewed-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
2021-11-02 10:00:55 +08:00
Liu Long
e9c4ced460 ACRN: hv: Unify terminology for user vm
Rename gpa_uos to gpa_user_vm
rename base_gpa_in_uos to base_gpa_in_user_vm
rename UOS_VIRT_PCI_MMCFG_BASE to USER_VM_VIRT_PCI_MMCFG_BASE
rename UOS_VIRT_PCI_MMCFG_START_BUS to USER_VM_VIRT_PCI_MMCFG_START_BUS
rename UOS_VIRT_PCI_MMCFG_END_BUS to USER_VM_VIRT_PCI_MMCFG_END_BUS
rename UOS_VIRT_PCI_MEMBASE32 to USER_VM_VIRT_PCI_MEMBASE32
rename UOS_VIRT_PCI_MEMLIMIT32 to USER_VM_VIRT_PCI_MEMLIMIT32
rename UOS_VIRT_PCI_MEMBASE64 to USER_VM_VIRT_PCI_MEMBASE64
rename UOS_VIRT_PCI_MEMLIMIT64 to USER_VM_VIRT_PCI_MEMLIMIT64
rename UOS in comments message to User VM.

Tracked-On: #6744
Signed-off-by: Liu Long <long.liu@linux.intel.com>
Reviewed-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
2021-11-02 10:00:55 +08:00
Liu Long
92b7d6a9a3 ACRN: hv: Terminology modification in hv code
Rename sos_vm to service_vm.
rename sos_vmid to service_vmid.
rename sos_vm_ptr to service_vm_ptr.
rename get_sos_vm to get_service_vm.
rename sos_vm_gpa to service_vm_gpa.
rename sos_vm_e820 to service_vm_e820.
rename sos_efi_info to service_vm_efi_info.
rename sos_vm_config to service_vm_config.
rename sos_vm_hpa2gpa to service_vm_hpa2gpa.
rename vdev_in_sos to vdev_in_service_vm.
rename create_sos_vm_e820 to create_service_vm_e820.
rename sos_high64_max_ram to service_vm_high64_max_ram.
rename prepare_sos_vm_memmap to prepare_service_vm_memmap.
rename post_uos_sworld_memory to post_user_vm_sworld_memory
rename hcall_sos_offline_cpu to hcall_service_vm_offline_cpu.
rename filter_mem_from_sos_e820 to filter_mem_from_service_vm_e820.
rename create_sos_vm_efi_mmap_desc to create_service_vm_efi_mmap_desc.
rename HC_SOS_OFFLINE_CPU to HC_SERVICE_VM_OFFLINE_CPU.
rename SOS to Service VM in comments message.

Tracked-On: #6744
Signed-off-by: Liu Long <long.liu@linux.intel.com>
Reviewed-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
2021-11-02 10:00:55 +08:00
Liu Long
26e507a06e ACRN: hv: Unify terminology for service vm
Rename is_sos_vm to is_service_vm

Tracked-On: #6744
Signed-off-by: Liu Long <longliu@intel.com>
2021-11-02 10:00:55 +08:00
dongshen
dcafcadaf9 hv: rename some C preprocessor macros
Rename some C preprocessor macros:
  NUM_GUEST_MSRS --> NUM_EMULATED_MSRS
  CAT_MSR_START_INDEX --> FLEXIBLE_MSR_INDEX
  NUM_VCAT_MSRS --> NUM_CAT_MSRS
  NUM_VCAT_L2_MSRS --> NUM_CAT_L2_MSRS
  NUM_VCAT_L3_MSRS --> NUM_CAT_L3_MSRS

Tracked-On: #5917
Signed-off-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-28 19:12:29 +08:00
dongshen
c0d95558c1 hv: vCAT: propagate vCBM to other vCPUs that share cache with vcpu
Implement the propagate_vcbm() function:
  Set vCBM to to all the vCPUs that share cache with vcpu
  to mimic hardware CAT behavior

Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-28 19:12:29 +08:00
dongshen
a7014f4654 hv: vCAT: implementing the vCAT MSRs write handler
Implement the write_vcbm() function to handle the
MSR_IA32_type_MASK_n vCBM MSRs write request

Call write_vclosid() to handle MSR_IA32_PQR_ASSOC MSR write request

Several vCAT P2V (physical to virtual) and V2P (virtual to physical)
mappings exist:

   struct acrn_vm_config *vm_config = get_vm_config(vm_id)

   max_pcbm = vm_config->max_type_pcbm (type: l2 or l3)
   mask_shift = ffs64(max_pcbm)

   vclosid = vmsr - MSR_IA32_type_MASK_0
   pclosid = vm_config->pclosids[vclosid]

   pmsr = MSR_IA32_type_MASK_0 + pclosid
   pcbm = vcbm << mask_shift
   vcbm = pcbm >> mask_shift

   Where
   MSR_IA32_type_MASK_n: L2 or L3 mask msr address for CLOSIDn, from
   0C90H through 0D8FH (inclusive).

   max_pcbm: a bitmask that selects all the physical cache ways assigned to the VM

   vclosid: virtual CLOSID, always starts from 0

   pclosid: corresponding physical CLOSID for a given vclosid

   vmsr: virtual msr address, passed to vCAT handlers by the
   caller functions rdmsr_vmexit_handler()/wrmsr_vmexit_handler()

   pmsr: physical msr address

   vcbm: virtual CBM, passed to vCAT handlers by the
   caller functions rdmsr_vmexit_handler()/wrmsr_vmexit_handler()

   pcbm: physical CBM

Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-28 19:12:29 +08:00
dongshen
3ab50f2ef5 hv: vCAT: implementing the vCAT MSRs read handlers
Implement the read_vcbm() and read_vclosid() functions to handle the MSR_IA32_PQR_ASSOC
and MSR_IA32_type_MASK_n vCAT MSRs read request.

Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-28 19:12:29 +08:00
dongshen
be855d2352 hv: vCAT: expose CAT capabilities to vCAT-enabled VM
Expose CAT feature to vCAT VM by reporting the number of
cache ways/CLOSIDs via the 04H/10H cpuid instructions, so that the
VM can take advantage of CAT to prioritize and partition cache
resource for its own tasks.

Add the vcat_pcbm_to_vcbm() function to map pcbm to vcbm

Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-28 19:12:29 +08:00
dongshen
77ae989379 hv: vCAT: initialize vCAT MSRs during vmcs init
Initialize vCBM MSRs

Initialize vCLOSID MSR

Add some vCAT functions:
 Retrieve max_vcbm and max_pcbm
 Check if vCAT is configured or not for the VM
 Map vclosid to pclosid
 write_vclosid: vCLOSID MSR write handler
 write_vcbm: vCBM MSR write handler

Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-28 19:12:29 +08:00
Fei Li
1905ed6124 hv: vMSR: minor fix about rdmsr_vmexit_handler
Specifying a reserved or unimplemented MSR address in ECX for rdmsr will cause a
general protection exception. In this case, we should not change the contents of
registers EDX:EAX.

Tracked-On: #4550
Signed-off-by: Fei Li <fei1.li@intel.com>
2021-10-27 08:23:43 +08:00
dongshen
39461ef9dd hv: vCAT: initialize the emulated_guest_msrs array for CAT msrs during platform initialization
Initialize the emulated_guest_msrs[] array at runtime for
MSR_IA32_type_MASK_n and MSR_IA32_PQR_ASSOC msrs, there is no good
way to do this initialization statically at build time

Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-10-26 11:48:27 +08:00
dongshen
cb2bb78b6f hv/config_tools: amend the struct acrn_vm_config to make it compatible with vCAT
For vCAT, it may need to store more than MAX_VCPUS_PER_VM of closids,
change clos in vm_config.h to a pointer to accommodate this situation

Rename clos to pclosids

pclosids now is a pointer to an array of physical CLOSIDs that is defined
in vm_configurations.c by vmconfig. The number of elements in the array
must be equal to the value given by num_pclosids

Add max_type_pcbm (type: l2 or l3) to struct acrn_vm_config, which stores a bitmask
that selects/covers all the physical cache ways assigned to the VM

Change vmsr.c to accommodate this amended data structure

Change the config-tools to generate vm_configurations.c, and fill in the num_closids
and clos pointers based on the information from the scenario file.

Now vm_configurations.c.xsl generates all the clos related code so remove the same
code from misc_cfg.h.xsl.

Examples:

  Scenario file:

  <RDT>
    <RDT_ENABLED>y</RDT_ENABLED>
    <CDP_ENABLED>n</CDP_ENABLED>
    <VCAT_ENABLED>y</VCAT_ENABLED>
    <CLOS_MASK>0x7ff</CLOS_MASK>
    <CLOS_MASK>0x7ff</CLOS_MASK>
    <CLOS_MASK>0x7ff</CLOS_MASK>
    <CLOS_MASK>0xff800</CLOS_MASK>
    <CLOS_MASK>0xff800</CLOS_MASK>
    <CLOS_MASK>0xff800</CLOS_MASK>
    <CLOS_MASK>0xff800</CLOS_MASK>
    <CLOS_MASK>0xff800</CLOS_MASK>
  /RDT>

  <vm id="0">
   <guest_flags>
     <guest_flag>GUEST_FLAG_VCAT_ENABLED</guest_flag>
   </guest_flags>
   <clos>
     <vcpu_clos>3</vcpu_clos>
     <vcpu_clos>4</vcpu_clos>
     <vcpu_clos>5</vcpu_clos>
     <vcpu_clos>6</vcpu_clos>
     <vcpu_clos>7</vcpu_clos>
   </clos>
  </vm>

  <vm id="1">
   <clos>
     <vcpu_clos>1</vcpu_clos>
     <vcpu_clos>2</vcpu_clos>
   </clos>
  </vm>

 vm_configurations.c (generated by config-tools) with the above vCAT config:

  static uint16_t vm0_vcpu_clos[5U] = {3U, 4U, 5U, 6U, 7U};
  static uint16_t vm1_vcpu_clos[2U] = {1U, 2U};

  struct acrn_vm_config vm_configs[CONFIG_MAX_VM_NUM] = {
  {
  .guest_flags = (GUEST_FLAG_VCAT_ENABLED),
  .pclosids = vm0_vcpu_clos,
  .num_pclosids = 5U,
  .max_l3_pcbm = 0xff800U,
  },
  {
  .pclosids = vm1_vcpu_clos,
  .num_pclosids = 2U,
  },
  };

Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-10-26 11:48:27 +08:00
Yonghua Huang
c8e2060d37 hv: unmap IOMMU register pages from service VM EPT
IOMMU hardware resource is owned by hypervisor, while
 IOMMU capability is reported to service VM in its ACPI
 table. In this case, Service VM may access IOMMU hardware
 resource, which is not expected.

 This patch unmaps all Intel IOMMU register pages for service VM EPT.

Tracked-On: #6677
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-10-22 09:31:10 +08:00
Fei Li
df7ffab441 hv: remove CONFIG_HV_RAM_SIZE
It's difficult to configure CONFIG_HV_RAM_SIZE properly at once. This patch
not only remove CONFIG_HV_RAM_SIZE, but also we use ld linker script to
dynamically get the size of HV RAM size.

Tracked-On: #6663
Signed-off-by: Fei Li <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-10-14 15:04:36 +08:00
Zide Chen
e48962faa6 hv: optimize run_vcpu() for nested
This patch implements a separate path for L2 VMEntry in run_vcpu(),
which has several benefits:

- keep run_vcpu() clean, to reduce the number of is_vcpu_in_l2_guest()
  statements:
  - current code has three is_vcpu_in_l2_guest() already.
  - supposed to have another 2 statement so that nested VMEntry won't
    hit the "Starting vCPU" and "vCPU launched" pr_info() and a few
    other statements in the VM launch path.

- save few other things in run_vcpu() that are not needed for nested.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-13 15:55:31 +08:00
Zide Chen
89bbc44962 hv: inject external interrupts only if LAPIC is not passthru
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-08 09:18:34 +08:00
Zide Chen
228b052fdb hv: operations on vcpu->reg_cached/reg_updated don't need LOCK prefix
In run time, one vCPU won't read or write a register on other vCPUs,
thus we don't need the LOCK prefixed instructions on reg_cached and
reg_updated.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-10-08 09:11:10 +08:00
Zide Chen
2b683f8f5b hv: call vcpu_inject_exception() only when ACRN_REQUEST_EXCP is set
move the bitmap test call out of vcpu_inject_exception(), then we call
the expensive bitmap_test_and_clear_lock() only pending_req_bits is
non-zero and call vcpu_inject_exception() only if needed.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-10-07 20:48:43 +08:00
Zide Chen
f801ba4ed7 hv: update guest RIP only if vcpu->arch.inst_len is non zero
In very large number of VM extis, the VM-exit instruction length could be
zero, and it's no need to update VMX_GUEST_RIP.

Some examples:

- all external interrupt VM exits in non LAPIC passthru setup.
- for all the nested VM-exits that are reflecting to L1 hypervisor.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-10-07 20:47:07 +08:00
Zide Chen
b7e9a68923 hv: code cleanup in run_vcpu()
- wrap a new function exec_vmentry() to reduce code duplication.
- remove exec_vmread(VMX_GUEST_RSP) since ACRN doesn't need to know
  guest RSP in run time.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-10-07 20:47:07 +08:00
Zide Chen
ee12daff84 hv: nested: refine vmcs12_read/write_field APIs
Change "uint64_t vmcs_hva" to "void *vmcs_hva" in the input argument,
list, so that no type casting is needed when calling them from pointers.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-10-07 20:45:34 +08:00
Liu,Junming
4105ca2cb4 hv: deny the launch of VM if pass-thru PIO bar isn't identical mapping
In current design, when pass-thru dev,
for the PIO bar, need to ensure the guest PIO start address
equals to host PIO start address.
Then set the VMCS io bitmap to pass-thru the corresponding
port io to guest for performance.

ACRN-DM and acrn-config should ensure the identical mapping of PIO bar.
If ACRN-DM or acrn-config failed to achieve this,
we should deny the launch of VM

Tracked-On: #6508

Signed-off-by: Liu,Junming <junming.liu@intel.com>
Reviewed-by: Zhao Yakui <yakui.zhao@intel.com>
Reviewed-by: Fei Li <fei1.li@intel.com>
2021-09-28 08:49:01 +08:00
Zide Chen
a62dd6ad8a hv: nested: fixed vmxoff_vmexit_handler() issue
In VMXOFF vmexit handler, it's supposed to remove VMCS shadowing.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
2021-09-26 08:49:35 +08:00
Zide Chen
45b036e028 hv: nested: enable multiple active VMCS12 support
This patch changes the size of vvmcs[] array from 1 to
PER_VCPU_ACTIVE_VVMCS_NUM, and actually enables multiple active VMCS12
support in ACRN.  The basic operations:

- if L1 VMPTRLDs a VMCS12 without previously VMCLEAR the current
  VMCS12, ACRN no longer unconditionally flushes the current VMCS12
  back to L1.  Instead, it tries to keep both the current and the newly
  loaded VMCS12 in the nested->vvmcs[] array, unless:

- if there is no more available vvmcs[] entry, ACRN flushes one active
  VMCS12 to make room for this new VMCS12.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-09-26 08:49:35 +08:00
Mingqiang Chi
f39c882359 hv:change log level for check_vmx_ctrl
Some processors don't support VMX_PROCBASED_CTLS_TERTIARY bit
and VMX_PROCBASED_CTLS2_UWAIT_PAUSE bit in MSRs
(IA32_VMX_PROCBASED_CTLS & IA32_VMX_PROCBASED_CTLS2),
HV will output error log which will cause confusion,
change the log level from pr_err to pr_info.

Tracked-On: #6397

Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
2021-09-24 10:17:19 +08:00
Jie Deng
064fd7647f hv: add priority based scheduler
This patch adds a new priority based scheduler to support
vCPU scheduling based on their pre-configured priorities.
A vCPU can be running only if there is no higher priority
vCPU running on the same pCPU.

Tracked-On: #6571
Signed-off-by: Jie Deng <jie.deng@intel.com>
2021-09-24 09:32:18 +08:00
Zide Chen
94cbe909ee hv: irq: identical vector mapping if LAPIC passthough
In local APIC passthrough case, when devices triggered a INTx interrupt, this
interrupt would be delivered to vCPU directly. For this case, need to set the
virtual vector in
the 'Interrupt Vector' field of physical IOxAPIC I/O REDIRECTION TABLE REGISTER
(bits 7:0) and 'Vector' field of vt-d Interrupt Remapping Table Entry (IRTE)
for Remapped Interrupts.

Assumption:
(a) IOAPIC pins won't be shared between LAPIC PT guest and other guests;
(b) The guest would not trigger this IRQ before it switched to x2 APIC mode.

Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
2021-09-18 09:42:44 +08:00
Zide Chen
0466d7055f hv: nested: move the VMCS12 dirty flags to struct acrn_vvmcs
These dirty flags are supposed to be per VMCS12, so move them from the
per vCPU acrn_nested struct to the newly added acrn_vvmcs struct.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-09-17 10:58:43 +08:00
Zide Chen
4e54c3880b hv: nested: remove vcpu->arch.nested.current_vmcs12_ptr
This variable represents the L1 GPA of the current VMCS12.  But it's
no longer needed in the multiple active VMCS12 case, which uses the
following variables for this purpose.

- nested->current_vvmcs refers to the vvmcs[] entry which contains the
  cached current VMCS12, its associated VMCS02, and other context info.

- nested->current_vvmcs->vmcs12_gpa refers to the L1 GPA of this
  current VMCS12.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-09-17 10:58:43 +08:00
Zide Chen
799a4d332a hv: nested: initial implementation of struct acrn_vvmcs
Add an array of struct acrn_vvmcs to struct acrn_nested, so it is
possible to cache multiple active VMCS12s.

This patch declares the size of this array to 1, meaning that there is
only one active VMCS12.  This is to minimize the logical code changes.

Add pointer current_vvmcs to struct acrn_nested, which refers to the
current vvmcs[] entry.  In this patch, if any VMCS12 is active, it
always points to vvmcs[0].

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-09-17 10:58:43 +08:00
Zide Chen
cf697e753d hv: nested: some API signature changes
No any logical changes, this patch is preparing for multiple active
VMCS12 support.

- currently it's easy to get the vmcs12 pointer from the vcpu pointer.
  In multiple active vmcs12 case, we need to explicitly add "struct
  acrn_vmcs12 *vmcs12" to certain APIs' input argument list, in order to
  get the desired vmcs12 pointer.

- merge flush_current_vmcs12() into clear_vmcs02() for multiple reasons:
  a) it's called only once; b) we don't wrap the opposite operation
  (loading vmcs12) in an API; c) this API has simple and clear logic.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-09-17 10:58:43 +08:00
Zide Chen
e9eb72d319 hv: nested: flush L2 VPID only when it could conflict with L1 VPIDs
By changing the way to assign L1 VPID from bottom-up to top-down,
the possibilities for VPID conflicts between L1 and L2 guests are
small.

Then we can flush VPID just in case of conflicting.

Tracked-On: #6289
Signed-off-by: Anthony Xu <anthony.xu@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-09-16 09:26:10 +08:00
Zide Chen
1ab65825ba hv: nested: merge gpa_field_dirty and control_field_dirty flag
In run time, it's rare for L1 to write to the intercepted non host-state
VMCS fields, and using multiple dirty flags is not necessary.

This patch uses one single dirty flag to manage all non host-state VMCS
fields.  This helps to simplify current code and in the future we may
not need to declare new dirty flags when we intercept more VMCS fields.

Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
2021-09-13 15:50:01 +08:00
Zide Chen
6376d5a0d3 hv: nested: fix bug in syncing EPTP from VMCS12 to VMCS02
Currently vmptrld_vmexit_handler() doesn't sync VMX_EPT_POINTER_FULL
from vmcs12 to vmcs02, instead it sets gpa_field_dirty and relies on
nested_vmentry() to sync EPTP in next nested VMentry.

This creates readability issue since all other intercepted VMCS fields
are synced in sync_vmcs12_to_vmcs02().  Another issue is that other
VMCS fields managed by gpa_field_dirty are repeatedly synced in both
vmptrld and nested vmentry handler.

This patch moves get_nept_desc() ahead of sync_vmcs12_to_vmcs02(), such
that shadow_eptp is allocated before sync_vmcs12_to_vmcs02() which
can sync EPTP properly.

BTW, in nested_vmexit_handler(), don't need to read from VMCS to get
the exit reason, since vcpu->arch.exit_reason has it already.

Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
2021-09-13 15:50:01 +08:00
Zide Chen
11c2f3eabb hv: check bitmap before calling bitmap_test_and_clear_lock()
The locked btr instruction is expensive.  This patch changes the
logic to ensure that the bitmap is non-zero before executing
bitmap_test_and_clear_lock().

The VMX transition time gets significant improvement.  SOS running
on TGL, the CPUID roundtrip reduces from ~2400 cycles to ~2000 cycles.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-09-02 16:09:33 +08:00
Zide Chen
aeb3690b6f hv: simplify is_lapic_pt_enabled()
is_lapic_pt_enabled() is called at least twice in one loop of the vCPU
thread, and it's called in vmexit_handler() frequently if LAPIC is not
pass-through.  Thus the efficiency of this function has direct
impact to the system performance.

Since the LAPIC mode is not changed in run time, we don't have to
calculate it on the fly in is_lapic_pt_enabled().

BTW, removed the unused lapic_mask from struct acrn_vcpu_arch.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-08-26 09:52:10 +08:00
Shiqing Gao
d90dbc0d91 hv: check the capability of XSAVES/XRSTORS instructions before execution
For platforms that do not support XSAVES/XRSTORS instructions, like QEMU,
executing these instructions causes #UD.
This patch adds the check before the execution of XSAVES/XRSTORS instructions.

It also refines the logic inside rstore_xsave_area for the following reason:
If XSAVES/XRSTORS instructions are supported, restore XSAVE area if any of the
following conditions is met:
 1. "vcpu->launched" is false (state initialization for guest)
 2. "vcpu->arch.xsave_enabled" is true (state restoring for guest)

 * Before vCPU is launched, condition 1 is satisfied.
 * After vCPU is launched, condition 2 is satisfied because
   is_valid_xsave_combination() guarantees that "vcpu->arch.xsave_enabled"
   is consistent with pcpu_has_cap(X86_FEATURE_XSAVES).
Therefore, the check against "vcpu->launched" and "vcpu->arch.xsave_enabled"
can be eliminated here.

Tracked-On: #6481

Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
2021-08-26 09:42:23 +08:00
Zide Chen
cbf3825140 hv: Pass-through IA32_TSC_AUX MSR to L1 guest
Use an unused MSR on host to save ACRN pcpu ID and avoid saving and
restoring TSC AUX MSR on VMX transitions.

Tracked-On: #6289
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
2021-08-26 09:25:54 +08:00
Yifan Liu
d33c76f701 hv: quirks: SMBIOS passthrough for prelaunched-VM
This feature is guarded under config CONFIG_SECURITY_VM_FIXUP, which
by default should be disabled.

This patch passthrough native SMBIOS information to prelaunched VM.
SMBIOS table contains a small entry point structure and a table, of which
the entry point structure will be put in 0xf0000-0xfffff region in guest
address space, and the table will be put in the ACPI_NVS region in guest
address space.

v2 -> v3:
uuid_is_equal moved to util.h as inline API
result -> pVendortable, in function efi_search_guid
recalc_checksum -> generate_checksum
efi_search_smbios -> efi_search_smbios_eps
scan_smbios_eps -> mem_search_smbios_eps
EFI GUID definition kept

Tracked-On: #6320
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
2021-08-26 09:24:50 +08:00
Zide Chen
6d7eb6d7b6 hv: emulate IA32_EFER and adjust Load EFER VMX controls
This helps to improve performance:

- Don't need to execute VMREAD in vcpu_get_efer(), which is frequently
  called.

- VMX_EXIT_CTLS_SAVE_EFER can be removed from VM-Exit Controls.

- If the value of IA32_EFER MSR is identical between the host and guest
  (highly likely), adjust the VMX controls not to load IA32_EFER on
  VMExit and VMEntry.

It's convenient to continue use the exiting vcpu_s/get_efer() APIs,
other than the common vcpu_s/get_guest_msr().

Tracked-On: #6289
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
2021-08-24 11:16:53 +08:00
Liang Yi
2b3620de7d hv: mask off LA57 in cpuid
Mask off support of 57-bit linear addresses and five-level paging.

ICX-D has LA57 but ACRN doesn't support 5-level paging yet.

Tracked-On: #6357
Signed-off-by: Liang Yi <yi.liang@intel.com>
Signed-off-by: Li, Fei1 <fei1.li@intel.com>
2021-08-20 11:02:21 +08:00
Shiqing Gao
651d44432c hv: initialize the XSAVE related processor state for guest
If SOS is using kernel 5.4, hypervisor got panic with #GP.

Here is an example on KBL showing how the panic occurs when kernel 5.4 is used:
Notes:
 * Physical MSR_IA32_XSS[bit 8] is 1 when physical CPU boots up.
 * vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is initialized to 0.

Following thread switches would happen at run time:
1. idle thread -> vcpu thread
   context_switch_in happens and rstore_xsave_area is called.
   At this moment, vcpu->arch.xsave_enabled is false as vcpu is not launched yet
   and init_vmcs is not called yet (where xsave_enabled is set to true).
   Thus, physical MSR_IA32_XSS is not updated with the value of guest MSR_IA32_XSS.

   States at this point:
    * Physical MSR_IA32_XSS[bit 8] is 1.
    * vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.

2. vcpu thread -> idle thread
   context_switch_out happens and save_xsave_area is called.
   At this moment, vcpu->arch.xsave_enabled is true. Processor state is saved
   to memory with XSAVES instruction. As physical MSR_IA32_XSS[bit 8] is 1,
   ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is set to 1 after the execution
   of XSAVES instruction.

   States at this point:
    * Physical MSR_IA32_XSS[bit 8] is 1.
    * vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.
    * ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is 1.

3. idle thread -> vcpu thread
   context_switch_in happens and rstore_xsave_area is called.
   At this moment, vcpu->arch.xsave_enabled is true. Physical MSR_IA32_XSS is
   updated with the value of guest MSR_IA32_XSS, which is 0.

   States at this point:
    * Physical MSR_IA32_XSS[bit 8] is 0.
    * vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.
    * ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is 1.

   Processor state is restored from memory with XRSTORS instruction afterwards.
   According to SDM Vol1 13.12 OPERATION OF XRSTORS, a #GP occurs if XCOMP_BV
   sets a bit in the range 62:0 that is not set in XCR0 | IA32_XSS.
   So, #GP occurs once XRSTORS instruction is executed.

Such issue does not happen with kernel 5.10. Because kernel 5.10 writes to
MSR_IA32_XSS during initialization, while kernel 5.4 does not do such write.
Once guest writes to MSR_IA32_XSS, it would be trapped to hypervisor, then,
physical MSR_IA32_XSS and the value of MSR_IA32_XSS in vcpu->arch.guest_msrs
are updated with the value specified by guest. So, in the point 2 above,
correct processor state is saved. And #GP would not happen in the point 3.

This patch initializes the XSAVE related processor state for guest.
If vcpu is not launched yet, the processor state is initialized according to
the initial value of vcpu_get_guest_msr(vcpu, MSR_IA32_XSS), ectx->xcr0,
and ectx->xs_area. With this approach, the physical processor state is
consistent with the one presented to guest.

Tracked-On: #6434

Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Reviewed-by: Li Fei1 <fei1.li@intel.com>
2021-08-20 09:46:09 +08:00
Zide Chen
2e6cf2b85b hv: nested: fix bugs in init_vmx_msrs()
Currently init_vmx_msrs() emulates same value for the IA32_VMX_xxx_CTLS
and IA32_VMX_TRUE_xxx_CTLS MSRs.

But the value of physical MSRs could be different between the pair,
and we need to adjust the emulated value accordingly.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-08-20 09:40:50 +08:00
Zide Chen
ad37553873 hv: nested: redundant permission check on nested_vmentry()
check_vmx_permission() is called in vmresume_vmexit_handler() and
vmlaunch_vmexit_handler() already.

Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
2021-08-20 08:14:40 +08:00
Zhou, Wu
53f6720d13 HV: Combine the acpi loading fucntion to one place
Remove the acpi loading function from elf_loader, rawimage_loaer and
bzimage_loader, and call it together in vm_sw_loader.

Now the vm_sw_loader's job is not just loading sw, so we rename it to
prepare_os_image.

Tracked-On: #6323

Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-08-19 20:00:45 +08:00
Victor Sun
3124018917 HV: vm_load: rename vboot_info.h to vboot.h
vboot_info.h declares vm loader function also, so rename the file name to
vboot.h;

Tracked-On: #6323

Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-08-19 20:00:45 +08:00
Fei Li
2e7491a8ec hv: mmiodev: a minor bug fix about refine acrn_mmiodev data structure
Rename base_hpa to host_pa in acrn_mmiodev data structure.

Tracked-On: #6366
Signed-off-by: Fei Li <fei1.li@intel.com>
2021-08-19 12:01:35 +08:00
Liu,Junming
2c5c8754de hv:enable GVT-d for pre-launched linux guest in logical partion mode
When pass-thru GPU to pre-launched Linux guest,
need to pass GPU OpRegion to the guest.
Here's the detailed steps:
1. reserve a memory region in ve820 table for GPU OpRegion
2. build EPT mapping for GPU OpRegion to pass-thru OpRegion to guest
3. emulate the pci config register for OpRegion
For the third step, here's detailed description:
The address of OpRegion locates on PCI config space offset 0xFC,
Normal Linux guest won't write this register,
so we can regard this register as read-only.
When guest reads this register, return the emulated value.
When guest writes this register, ignore the operation.

Tracked-On: #6387

Signed-off-by: Liu,Junming <junming.liu@intel.com>
2021-08-19 11:56:26 +08:00
Fei Li
a705ff2dac hv: relocate ACPI DATA address to 0x7fe00000
Relocate ACPI address to 0x7fe00000 and ACPI NVS to 0x7ff00000 correspondingly.
In this case, we could include TPM event log region [0x7ffb0000, 0x80000000)
into ACPI NVS.

Tracked-On: #6320
Signed-off-by: Fei Li <fei1.li@intel.com>
2021-08-11 14:45:55 +08:00
Fei Li
74e68e39d1 hv: tpm2: do tpm2 fixup for security vm
ACRN used to prepare the vTPM2 ACPI Table for pre-launched VM at the build stage
using config tools. This is OK if the TPM2 ACPI Table never changes. However,
TPM2 ACPI Table may be changed in some conditions: change BIOS configuration or
update BIOS.

This patch do TPM2 fixup to update the vTPM2 ACPI Table and TPM2 MMIO resource
configuration according to the physical TPM2 ACPI Table.

Tracked-On: #6366
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Signed-off-by: Fei Li <fei1.li@intel.com>
2021-08-11 14:45:55 +08:00
Fei Li
f81b39225c HV: refine acrn_mmiodev data structure
1. add a name field to indicate what the MMIO Device is.
2. add two more MMIO resource to the acrn_mmiodev data structure.

Tracked-On: #6366
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Signed-off-by: Fei Li <fei1.li@intel.com>
2021-08-11 14:45:55 +08:00
Fei Li
20061b7c39 hv: remove xsave dependence
ACRN could run without XSAVE Capability. So remove XSAVE dependence to support
more (hardware or virtual) platforms.

Tracked-On: #6287
Signed-off-by: Fei Li <fei1.li@intel.com>
2021-08-10 16:36:15 +08:00
Tao Yuhong
171856c46b hv: uc-lock: Fix do not trap #GP
If HV enable trigger #GP for uc-lock, and is about to emulate guest uc-lock
instructions, should trap guest #GP. Guest uc-lock instrucction trigger #GP,
cause vmexit for #GP, HV handle this vmexit and emulate uc-lock
instruction.

Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
2021-08-09 15:33:12 +08:00
Minggui Cao
80ae3224d9 hv: expose PMC to core partition VM
for core partition VM (like RTVM), PMC is always used for performance
profiling / tuning, so expose PMC capability and pass-through its MSRs
to the VM.

Tracked-On: #6307
Signed-off-by: Minggui Cao <minggui.cao@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-07-27 14:58:28 +08:00
Minggui Cao
eba8c4e78b hv: use ARRAY_SIZE to calc local array size
if one array just used in local only, and its size not used extern,
use ARRAY_SIZE macro to calculate its size.

Tracked-On: #6307
Signed-off-by: Minggui Cao <minggui.cao@intel.com>
Reviewed-by: Junjie Mao <junjie.mao@intel.com>
2021-07-27 14:58:28 +08:00
Yifan Liu
69fef2e685 hv: debug: Add hv console callback to VM-exit event
In some scenarios (e.g., nested) where lapic-pt is enabled for a vcpu
running on a pcpu hosting console timer, the hv console will be
inaccessible.

This patch adds the console callback to every VM-exit event so that the
console can still be somewhat functional under such circumstance.

Since this is VM-exit driven, the VM-exit/second can be low in certain
cases (e.g., idle or running stress workload). In extreme cases where
the guest panics/hangs, there will be no VM-exits at all.

In most cases, the shell is laggy but functional (probably enough for
debugging purpose).

Tracked-On: #6312
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
2021-07-22 10:08:23 +08:00
Tao Yuhong
8360c3dfe6 HV: enable #GP for UC lock
For an atomic operation using bus locking, it would generate LOCK# bus
signal, if it has Non-WB memory operand. This is an UC lock. It will
ruin the RT behavior of the system.
If MSR_IA32_CORE_CAPABILITIES[bit4] is 1, then CPU can trigger #GP
for instructions which cause UC lock. This feature is controlled by
MSR_TEST_CTL[bit28].
This patch enables #GP for guest UC lock.

Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-07-21 11:25:47 +08:00
Tao Yuhong
2aba7f31db HV: rename splitlock file name
Because the emulation code is for both split-lock and uc-lock,
rename splitlock.c/splitlock.h to lock_instr_emul.c/lock_instr_emul.h

Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-07-21 11:25:47 +08:00
Tao Yuhong
7926504011 HV: rename split-lock emulation APIs
Because the emulation code is for both split-lock and uc-lock, Changed
these API names:
vcpu_kick_splitlock_emulation() -> vcpu_kick_lock_instr_emulation()
vcpu_complete_splitlock_emulation() -> vcpu_complete_lock_instr_emulation()
emulate_splitlock() -> emulate_lock_instr()

Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-07-21 11:25:47 +08:00
Tao Yuhong
bbd7b7091b HV: re-use split-lock emulation code for uc-lock
Split-lock emulation can be re-used for uc-lock. In emulate_splitlock(),
it only work if this vmexit is for #AC trap and guest do not handle
split-lock and HV enable #AC for splitlock.
Add another condition to let emulate_splitlock() also work for #GP trap
and guest do not handle uc-lock and HV enable #GP for uc-lock.

Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-07-21 11:25:47 +08:00
Tao Yuhong
553d59644b HV: Fix decode_instruction() trigger #UD for emulating UC-lock
When ACRN uses decode_instruction to emulate split-lock/uc-lock
instruction, It is actually a try-decode to see if it is XCHG.
If the instruction is XCHG instruction, ACRN must emulate it
(inject #PF if it is triggered) with peer VCPUs paused, and advance
the guest IP. If the instruction is a LOCK prefixed instruction
with accessing the UC memory, ACRN Halted the peer VCPUs, and
advance the IP to skip the LOCK prefix, and then let the VCPU
Executes one instruction by enabling IRQ Windows vm-exit. For
other cases, ACRN injects the exception back to VCPU without
emulating it.
So change the API to decode_instruction(vcpu, bool full_decode),
when full_decode is true, the API does same thing as before. When
full_decode is false, the different is if decode_instruction() meet unknown
instruction, will keep return = -1 and do not inject #UD. We can use
this to distinguish that an #UD has been skipped, and need inject #AC/#GP back.

Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
2021-07-21 11:25:47 +08:00
Shuo A Liu
6e0b12180c hv: dm: Use new power management data structures
struct cpu_px_data		->	struct acrn_pstate_data
struct cpu_cx_data		->	struct acrn_cstate_data
enum pm_cmd_type		->	enum acrn_pm_cmd_type
struct acpi_generic_address	->	struct acrn_acpi_generic_address
cpu_cx_data			->	acrn_cstate_data
cpu_px_data			->	acrn_pstate_data

IC_PM_GET_CPU_STATE		->	ACRN_IOCTL_PM_GET_CPU_STATE

PMCMD_GET_PX_CNT		->	ACRN_PMCMD_GET_PX_CNT
PMCMD_GET_CX_CNT		->	ACRN_PMCMD_GET_CX_CNT
PMCMD_GET_PX_DATA		->	ACRN_PMCMD_GET_PX_DATA
PMCMD_GET_CX_DATA		->	ACRN_PMCMD_GET_CX_DATA

Tracked-On: #6282
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
2021-07-15 11:53:54 +08:00
Shuo A Liu
9c910bae44 hv: dm: Use new I/O request data structures
struct vhm_request		->	struct acrn_io_request
union vhm_request_buffer	->	struct acrn_io_request_buffer
struct pio_request		->	struct acrn_pio_request
struct mmio_request		->	struct acrn_mmio_request
struct ioreq_notify		->	struct acrn_ioreq_notify

VHM_REQ_PIO_INVAL		->	IOREQ_PIO_INVAL
VHM_REQ_MMIO_INVAL		->	IOREQ_MMIO_INVAL
REQ_PORTIO			->	ACRN_IOREQ_TYPE_PORTIO
REQ_MMIO			->	ACRN_IOREQ_TYPE_MMIO
REQ_PCICFG			->	ACRN_IOREQ_TYPE_PCICFG
REQ_WP				->	ACRN_IOREQ_TYPE_WP

REQUEST_READ			->	ACRN_IOREQ_DIR_READ
REQUEST_WRITE			->	ACRN_IOREQ_DIR_WRITE
REQ_STATE_PROCESSING		->	ACRN_IOREQ_STATE_PROCESSING
REQ_STATE_PENDING		->	ACRN_IOREQ_STATE_PENDING
REQ_STATE_COMPLETE		->	ACRN_IOREQ_STATE_COMPLETE
REQ_STATE_FREE			->	ACRN_IOREQ_STATE_FREE

IC_CREATE_IOREQ_CLIENT		->	ACRN_IOCTL_CREATE_IOREQ_CLIENT
IC_DESTROY_IOREQ_CLIENT		->	ACRN_IOCTL_DESTROY_IOREQ_CLIENT
IC_ATTACH_IOREQ_CLIENT		->	ACRN_IOCTL_ATTACH_IOREQ_CLIENT
IC_NOTIFY_REQUEST_FINISH	->	ACRN_IOCTL_NOTIFY_REQUEST_FINISH
IC_CLEAR_VM_IOREQ		->	ACRN_IOCTL_CLEAR_VM_IOREQ
HYPERVISOR_CALLBACK_VHM_VECTOR	->	HYPERVISOR_CALLBACK_HSM_VECTOR

arch_fire_vhm_interrupt()	->	arch_fire_hsm_interrupt()
get_vhm_notification_vector()	->	get_hsm_notification_vector()
set_vhm_notification_vector()	->	set_hsm_notification_vector()
acrn_vhm_notification_vector	->	acrn_hsm_notification_vector
get_vhm_req_state()		->	get_io_req_state()
set_vhm_req_state()		->	set_io_req_state()

Below structures have slight difference with former ones.

  struct acrn_ioreq_notify
  strcut acrn_io_request

Tracked-On: #6282
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
2021-07-15 11:53:54 +08:00
Shuo A Liu
107cae316a hv: dm: Use new ioctl ACRN_IOCTL_SET_VCPU_REGS
struct acrn_set_vcpu_regs	->	struct acrn_vcpu_regs
struct acrn_vcpu_regs		->	struct acrn_regs
IC_SET_VCPU_REGS		->	ACRN_IOCTL_SET_VCPU_REGS

Tracked-On: #6282
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
2021-07-15 11:53:54 +08:00
Shuo A Liu
f476ca55ab hv: dm: Use new VM management ioctls
IC_CREATE_VM		->	ACRN_IOCTL_CREATE_VM
IC_DESTROY_VM		->	ACRN_IOCTL_DESTROY_VM
IC_START_VM		->	ACRN_IOCTL_START_VM
IC_PAUSE_VM		->	ACRN_IOCTL_PAUSE_VM
IC_RESET_VM		->	ACRN_IOCTL_RESET_VM

struct acrn_create_vm	->	struct acrn_vm_creation

Tracked-On: #6282
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
2021-07-15 11:53:54 +08:00
Shuo A Liu
9c1caad25a hv: nested: Keep privilege bits sync in shadow EPT entry
Guest may not use INVEPT instruction after enabling any of bits 2:0 from
0 to 1 of a present EPT entry, then the shadow EPT entry has no chance
to sync guest EPT entry. According to the SDM,
"""
Software may use the INVEPT instruction after modifying a present EPT
paging-structure entry (see Section 28.2.2) to change any of the
privilege bits 2:0 from 0 to 1.1 Failure to do so may cause an EPT
violation that would not otherwise occur. Because an EPT violation
invalidates any mappings that would be used by the access that caused
the EPT violation (see Section 28.3.3.1), an EPT violation will not
recur if the original access is performed again, even if the INVEPT
instruction is not executed.
"""

Sync the afterthought of privilege bits from guest EPT entry to shadow
EPT entry to cover above case.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-07-02 09:24:12 +08:00
Shuo A Liu
a431cff94e hv: Use 64 bits definition for 64 bits MSR_IA32_VMX_EPT_VPID_CAP operation
MSR_IA32_VMX_EPT_VPID_CAP is 64 bits. Using 32 bits MACROs with it may
cause the bit expression wrong.

Unify the MSR_IA32_VMX_EPT_VPID_CAP operation with 64 bits definition.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-07-02 09:24:12 +08:00
Victor Sun
e371432695 HV: avoid pre-launched VM modules being corrupted by SOS kernel load
When hypervisor boots, the multiboot modules have been loaded to host space
by bootloader already. The space range of pre-launched VM modules is also
exposed to SOS VM, so SOS VM kernel might pick this range to extract kernel
when KASLR enabled. This would corrupt pre-launched VM modules and result in
pre-launched VM boot fail.

This patch will try to fix this issue. The SOS VM will not be loaded to guest
space until all pre-launched VMs are loaded successfully.

Tracked-On: #5879

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
268d4c3f3c HV: boot guest with boot params
Previously the load GPA of LaaG boot params like zeropage/cmdline and
initgdt are all hard-coded, this would bring potential LaaG boot issues.

The patch will try to fix this issue by finding a 32KB load_params memory
block for LaaG to store these guest boot params.

For other guest with raw image, in general only vgdt need to be cared of so
the load_params will be put at 0x800 since it is a common place that most
guests won't touch for entering protected mode.

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
ed97022646 HV: add find_space_from_ve820() api
The API would search ve820 table and return a valid GPA when the requested
size of memory is available in the specified memory range, or return
INVALID_GPA if the requested memory slot is not available;

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
6127c0c5d2 HV: modify low 1MB area for pre-launched VM e820
The memory range of [0xA0000, 0xFFFFF] is a known reserved area for BIOS,
actually Linux kernel would enforce this area to be reserved during its
boot stage. Set this area to usable would cause potential compatibility
issues.

The patch set the range to reserved type to make it consistent with the
real world.

BTW, There should be a EBDA(Entended BIOS DATA Area) with reserved type
exist right before 0xA0000 in real world for non-EFI boot. But given ACRN
has no legacy BIOS emulation, we simply skipped the EBDA in vE820.

Tracked-On: #5626

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Victor Sun
28b7cee412 HV: modularization: rename multiboot.h to boot.h
Given the structure in multiboot.h could be used for any boot protocol,
use a more generic name "boot.h" instead;

Tracked-On: #5661

Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
2021-06-11 10:06:02 +08:00
Shuo A Liu
9ae32f96af hv: Wrap same code as a static function
vmptrld_vmexit_handler() has a same code snippet with
vmclear_vmexit_handler(). Wrap the same code snippet as a static
function clear_vmcs02().

There is only a small logic change that add
	nested->current_vmcs12_ptr = INVALID_GPA
in vmptrld_vmexit_handler() for the old VMCS. That's reasonable.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-09 10:07:05 +08:00
Shuo A Liu
387ea23961 hv: Rename get_ept_entry() to get_eptp()
get_ept_entry() actually returns the EPTP of a VM. So rename it to
get_eptp() for readability.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-09 10:07:05 +08:00
Zide Chen
b6b5373818 hv: deny access to HV owned legacy PIO UART from SOS
We need to deny accesses from SOS to the HV owned UART device, otherwise
SOS could have direct access to this physical device and mess up the HV
console.

If ACRN debug UART is configured as PIO based, For example,
CONFIG_SERIAL_PIO_BASE is generated from acrn-config tool, or the UART
config is overwritten by hypervisor parameter "uart=port@<port address>",
it could run into problem if ACRN doesn't emulate this UART PIO port
to SOS. For example:

- none of the ACRN emulated vUART devices has same PIO port with the
  port of the debug UART device.
- ACRN emulates PCI vUART for SOS (configure "console_vuart" with
  PCI_VUART in the scenario configuration)

This patch fixes the above issue by masking PIO accesses from SOS.
deny_hv_owned_devices() is moved after setup_io_bitmap() where
vm->arch_vm.io_bitmap is initialized.

Commit 50d852561 ("HV: deny HV owned PCI bar access from SOS") handles
the case that ACRN debug UART is configured as a PCI device. e.g.,
hypervisor parameter "uart=bdf@<BDF value>" is appended.

If the hypervisor debug UART is MMIO based, need to configured it as
a PCI type device, so that it can be hidden from SOS.

Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-08 16:16:14 +08:00
Yonghua Huang
25c0e3817e hv: validate input for dmar_free_irte function
Malicious input 'index' may trigger buffer
 overflow on array 'irte_alloc_bitmap[]'.

 This patch validate that 'index' shall be
 less than 'CONFIG_MAX_IR_ENTRIES' and also
 remove unnecessary check on 'index' in
 'ptirq_free_irte()' function with this fix.

Tracked-On: #6132
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
2021-06-08 09:03:10 +08:00
Yonghua Huang
4acaeb91bd hv: remove unnecessary ASSERT in vlapic_write
vlapic_write handle 'offset' that is valid and ignore
 all other invalid 'offset'. so ASSERT on this 'offset'
 input is unnecessary.

 This patch removes above ASSERT to avoid potential
 hypervisor crash by guest malicious input when debug
 build is used.

Tracked-On: #6131
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
2021-06-08 09:03:10 +08:00
Shuo A Liu
15e6c5b9cf hv: nested: audit guest EPT mapping during shadow EPT entries setup
generate_shadow_ept_entry() didn't verify the correctness of the requested
guest EPT mapping. That might leak host memory access to L2 VM.

To simplify the implementation of the guest EPT audit, hide capabilities
'map 2-Mbyte page' and 'map 1-Gbyte page' from L1 VM. In addition,
minimize the attribute bits of EPT entry when create a shadow EPT entry.
Also, for invalid requested mapping address, reflect the EPT_VIOLATION to
L1 VM.

Here, we have some TODOs:
1) Enable large page support in generate_shadow_ept_entry()
2) Evaluate if need to emulate the invalid GPA access of L2 in HV directly.
3) Minimize EPT entry attributes.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
3110e70d0a hv: nested: INVEPT emulation supports shadow EPT
L1 VM changes the guest EPT and do INVEPT to invalidate the previous
TLB cache of EPT entries. The shadow EPT replies on INVEPT instruction
to do the update.

The target shadow EPTs can be found according to the 'type' of INVEPT.
Here are two types and their target shadow EPT,
  1) Single-context invalidation
     Get the EPTP from the INVEPT descriptor. Then find the target
     shadow EPT.
  2) Global invalidation
     All shadow EPTs of the L1 VM.

The INVEPT emulation handler invalidate all the EPT entries of the
target shadow EPTs.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
1dc7b7f798 hv: nested: Introduce shadow EPT release function
When a shadow EPT is not used anymore, its resources need to be
released.

free_sept_table() is introduced to walk the whole shadow EPT table and
free the pagetable pages.

Please note, the PML4E page of shadow EPT is not freed by
free_sept_table() as it still be used to present a shadow EPT pointer.

Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
b10b5658bd hv: nested: Introduce L2 VM EPT VIOLATION handler
With shadow EPT, the hypervisor walks through guest EPT table:

  * If the entry is not present in guest EPT, ACRN injects EPT_VIOLATION
    to L1 VM and resumes to L1 VM.

  * If the entry is present in guest EPT, do the EPT_MISCONFIG check.
    Inject EPT_MISCONFIG to L1 VM if the check failed.

  * If the entry is present in guest EPT, do permission check.
    Reflect EPT_VIOLATION to L1 VM if the check failed.

  * If the entry is present in guest EPT but shadow EPT entry is not
    present, create the shadow entry and resumes to L2 VM.

  * If the entry is present in guest EPT but the GPA in the entry is
    invalid, injects EPT_VIOLATION to L1 VM and resumes L1 VM.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
8565750bbe hv: nested: Hide some capability bits from L1 guest
* Hide 5 level EPT capability, let L1 guest stick to 4 level EPT.

 * Access/Dirty bits are not support currently, hide corresponding EPT
   capability bits.

 * "Mode-based execute control for EPT" is also not support well
   currently, hide its capability bit from MSR_IA32_VMX_PROCBASED_CTLS2.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
540a484147 hv: nested: Manage shadow EPTP according to guest VMCS change
'struct nept_desc' is used to associate guest EPTP with a shadow EPTP.
It's created in the first reference and be freed while no reference.

The life cycle seems like,

While guest VMCS VMX_EPT_POINTER_FULL is changed, the 'struct nept_desc'
of the new guest EPTP is referenced; the 'struct nept_desc' of the old
guest EPTP is dereferenced.

While guest VMCS be cleared(by VMCLEAR in L1 VM), the 'struct nept_desc'
of the old guest EPTP is dereferenced.

While a new guest VMCS be loaded(by VMPTRLD in L1 VM), the 'struct
nept_desc' of the new guest EPTP is referenced. The 'struct nept_desc'
of the old guest EPTP is dereferenced.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
10ec896f99 hv: nested: Introduce shadow EPT infrastructure
To shadow guest EPT, the hypervisor needs construct a shadow EPT for each
guest EPT. The key to associate a shadow EPT and a guest EPT is the EPTP
(EPT pointer). This patch provides following structure to do the association.

	struct nept_desc {
	       /*
	        * A shadow EPTP.
	        * The format is same with 'EPT pointer' in VMCS.
	        * Its PML4 address field is a HVA of the hypervisor.
	        */
	       uint64_t shadow_eptp;
	       /*
	        * An guest EPTP configured by L1 VM.
	        * The format is same with 'EPT pointer' in VMCS.
	        * Its PML4 address field is a GPA of the L1 VM.
	        */
	       uint64_t guest_eptp;
	       uint32_t ref_count;
	};

Due to lack of dynamic memory allocation of the hypervisor, a array
nept_bucket of type 'struct nept_desc' is introduced to store those
association information. A guest EPT might be shared between different
L2 vCPUs, so this patch provides several functions to handle the
reference of the structure.

Interface get_shadow_eptp() also is introduced. To find the shadow EPTP
of a specified guest EPTP.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Shuo A Liu
17bc7f08c9 hv: nested: Create a page pool for shadow EPT construction
Shadow EPT uses lots of pages to construct the shadow page table. To
utilize the memory more efficient, a page poll sept_page_pool is
introduced.

For simplicity, total platform RAM size is considered to calculate the
memory needed for shadow page tables. This is not an accurate upper
bound.  This can satisfy typical use-cases where there is not a lot
of overcommitment and sharing of memory between L2 VMs.

Memory of the pool is marked as reserved from E820 table in early stage.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
2021-06-04 13:53:47 +08:00
Zide Chen
811e367ad9 hv: nested: implement nested VM exit handler
Nested VM exits happen when vCPU is in guest mode (VMCS02 is current).
Initially we reflect all nested VM exits to L1 hypervisor. To prepare
the environment to run L1 guest:

- restore some VMCS fields to the value as what L1 hypervisor programmed.
- VMCLEAR VMCS02, VMPTRLD VMCS01 and enable VMCS shadowing.
- load the non-shadowing host states from VMCS12 to VMCS01 guest states.
- VMRESUME to L1 guest with this modified VMCS01.

Tracked-On: #5923
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Alexander Merritt <alex.merritt@intel.com>
2021-06-03 15:23:25 +08:00