Commit cbf3825 "hv: Pass-through IA32_TSC_AUX MSR to L1 guest"
lets guest own the physical MSR IA32_TSC_AUX and does not handle this MSR
in the hypervisor.
If multiple vCPUs share the same pCPU, when one vCPU reads MSR IA32_TSC_AUX,
it may get the value set by other vCPUs.
To fix this issue, this patch does:
- initialize the MSR content to 0 for the given vCPU, which is consistent with
the value specified in SDM Vol3 "Table 9-1. IA-32 and Intel 64 Processor
States Following Power-up, Reset, or INIT"
- save/restore the MSR content for the given vCPU during context switch
v1 -> v2:
* According to Table 9-1, the content of IA32_TSC_AUX MSR is unchanged
following INIT, v2 updates the initialization logic so that the content for
vCPU is consistent with SDM.
Tracked-On: #6799
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Reviewed-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Generate serial configuration file for service VM according
to scenario file and vUART ports base address allocated by
config tool.
Currently, some non-standard serial ports are emulated in
hypervisor and will be used to do communication between service
VM and user VM, so need to generate serial configuration file
to configure these serial ports for service VM.
v1-->v2:
Fix some type issues
Refine script code format
Tracked-On: #6652
Signed-off-by: Xiangyang Wu <xiangyang.wu@intel.com>
This patch moves the ssram area in ve820 tab, and reunites the
hpa1_low_part1/2 areas. The ve820 building code is refined.
before:
|<---low_1M--->|
|<---hpa1_low_part1--->|
|<---SSRAM--->|
|<---hpa1_low_part2--->|
|<---GPU_OpRegion--->|
|<---ACPI DATA--->|
|<---ACPI NVS--->|
---2G---
after:
|<---low_1M--->|
|<---hpa_low--->|
|<---SSRAM--->|
|<---GPU_OpRegion--->|
|<---ACPI DATA--->|
|<---ACPI NVS--->|
---2G---
The SSRAM area's address is described in the ACPI's RTCT/PTCT
table. To simplify the SSRAM implementation, SSRAM area was
identical mapped to GPA, and resulted in the divition of hpa_low.
Then the ve820 building logic became too complicated.
Now we managed to edit the guest's RTCT/PTCT table by offline
tools in the former patch, so we can move the guest's SSRAM
area, and reunite the hpa_low areas again.
After doing this, this patch rewrites the ve820 building code
in a much simpler way.
Tracked-On: #6674
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
The ve820 table' hpa1_low area is divided into two parts, which
is making the code too complicated and causing problems. Moving
the entries that divides the hpa1_low could make things easier.
This patch moves the GPU OpRegion to the tail area of 2G,
consecutive to the acpi data/nvs area.
before:
|<---low_1M--->|
|<---hpa1_low_part1--->|
|<---SSRAM--->|
|<---GPU_OpRegion--->|
|<---hpa1_low_part2--->|
|<---ACPI DATA--->|
|<---ACPI NVS--->|
---2G---
after:
|<---low_1M--->|
|<---hpa1_low_part1--->|
|<---SSRAM--->|
|<---hpa1_low_part2--->|
|<---GPU_OpRegion--->|
|<---ACPI DATA--->|
|<---ACPI NVS--->|
---2G---
Tracked-On: #6674
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
The length of the ACPI data entry in ve820 tab was 960K, while the
ACPI file is 1MB. It causes the ACPI file copy failed due to reserved
ACPI regions in ve820 table is not enough when loading pre-launched
VMs. This patch changes ACPI data area to 1MB to fix the problem.
And the ACPI data length was missed when calculating
ENTRY_HPA1_LOW_PART2 length. Fixed here too.
Also adds some refinement to the hard-coded ACPI base/addr definations
Tracked-On: #6674
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>
Previously we prepared 32KB buffer to store guest VM boot parameters, in which
there is at most 17KB buffer for guest EFI memory map to accomodate ~400 EFI
memory descriptors. But this is uncertain to ensure working on all boards, now
change the algorithm that make the EFI mmap buffer adaptable with configured
MAX_EFI_MMAP_ENTRIES macro.
Tracked-On: #6442
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
In previous implementation we leave MAX_EFI_MMAP_ENTRIES in config tool and
let end user to configure it. However it is hard for end user to understand
how to configure it, also it is hard for board_inspector to get this value
automatically because this info is only meaningful during the kernel boot
stage and there is no such info available after boot in Linux toolset.
This patch hardcode the value to 350, and ASSERT if the board need more efi
mmap entries to run ACRN. User could modify the MAX_EFI_MMAP_ENTRIES macro
in case ASSERT occurs in DEBUG stage.
The More size of hv_memdesc[] only consume very little memory, the overhead
is (size * sizeof(struct efi_memory_desc)), i.e. (size * 40) in bytes.
Tracked-On: #6442
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
The coding guideline rules C-TY-27 and C-TY-28, combined, requires that
assignment and arithmetic operations shall be applied only on operands of the
same kind. This patch either adds explicit type casts or adjust types of
variables to align the types of operands.
The only semantic change introduced by this patch is the promotion of the
second argument of set_vmcs_bit() and clear_vmcs_bit() to
uint64_t (formerly uint32_t). This avoids clear_vmcs_bit() to accidentally
clears the upper 32 bits of the requested VMCS field.
Other than that, this patch has no semantic change. Specifically this patch
is not meant to fix buggy narrowing operations, only to make these
operations explicit.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline rule C-TY-24 requires that 'cast shall not be
performed on a function pointer'. This patch removes a duplicated explicit
cast on timer_expired_handler in tsc_deadline_timer.c.
This patch has no semantic impacts.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline rule C-TY-12 requires that 'all type conversions shall
be explicit'. Especially implicit cases on the signedness of variables
shall be avoided.
This patch either adds explicit type casts or adjust local variable types
to make sure that Booleans, signed and unsigned integers are not used
mixedly.
This patch has no semantic changes.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline rule C-TY-02 requires that 'the operands of bit
operations shall be unsigned'. This patch adds explicit casts or literal
suffixes to make explicit the type of values involved in bit operations.
Explicit casts to widen integers before shift operations are also
introduced to make explicit that the variables are expanded BEFORE it is
shifted (which is already so in C99 but implicitly).
This patch has no semantic changes.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline rule C-ST-02 requires that 'the loop body shall be
enclosed with brackets', or more specifically, braces. This patch adds
braces to the single-line loop bodies.
This patch has no semantic change.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline rule C-PP-04 requires that 'parentheses shall be used
when referencing a MACRO parameter'. This patch adds parentheses to macro
parameters or expressions that are not yet wrapped properly.
This patch has no sematic impact.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline gule C-FN-09 requires that 'the formal parameter name
of a function shall be consistent'. This patch fixes two places where the
formal parameters are named differently in declarations and
definitions. More specifically, the names in declarations are replaced with
those in definitions.
This patch has no semantic impact.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline rule C-FN-06 requires that 'a parameter passed by
value to a function shall not be modified directly'. This patch rewrites
two functions which does modify its parameters today.
This patch has no semantic impact.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline rule C-CU-02 requires that 'only one return statement shall
be in a function'. This patch refactors handle_dmar_devscope() which has
multiple return statements today.
This patch has no semantic changes.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The coding guideline rule C-EP-05 requires that 'parentheses shall be used
to set the operator precedence explicitly'. This patch adds the missing
parentheses detected by the static analyzer.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
This patch fixes the following warnings detected by the LLVM/Clang
compiler:
1. Unused static functions in C sources, which are fixed by explicitly
tagging them with __unused
2. Duplicated parentheses around branch conditions
3. Assigning 64-bit constants to 32-bit variables, which is fixed by
promoting the variables to uint64_t
4. Using { '\0' } to zero-fill an array, which is fixed by replacing it
with { 0 }
5. Taking a bit out of a variable using && (which should be & instead)
Most changes do not have a semantic impact, except item 5 which is probably
a real code issue.
Tracked-On: #6776
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In lock instruction emulation, we use vcpu_make_request and
signal_event pairs to shoot down/release other vcpus.
However, vcpu_make_request is async and does not guarantee an execution
of wait_event on target vcpu, and we want wait_event to be consistent
with signal_event.
Consider following scenarios:
1, When target vcpu's state has not yet turned to VCPU_RUNNING,
vcpu_make_request on ACRN_REQUEST_SPLIT_LOCK does not make sense, and will
not result in wait_event.
2, When target vcpu is already requested on ACRN_REQUEST_SPLIT_LOCK (i.e., the
corresponding bit in pending_req is set) but not yet handled,
the vcpu_make_request call does not result in wait_event as 1 bit is not
enough to cache multiple requests.
This patch tries to add checks in vcpu_kick_lock_instr_emulation and
vcpu_complete_lock_instr_emulation to resolve these issues.
Tracked-On: #6502
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Rename SOS_VM type to SERVICE_VM
rename UOS to User VM in XML description
rename uos_thread_pid to user_vm_thread_pid
rename devname_uos to devname_user_vm
rename uosid to user_vmid
rename UOS_ACK to USER_VM_ACK
rename SOS_VM_CONFIG_CPU_AFFINITY to SERVICE_VM_CONFIG_CPU_AFFINITY
rename SOS_COM to SERVICE_VM_COM
rename SOS_UART1_VALID_NUM" to SERVICE_VM_UART1_VALID_NUM
rename SOS_BOOTARGS_DIFF to SERVICE_VM_BOOTARGS_DIFF
rename uos to user_vm in launch script and xml
Tracked-On: #6744
Signed-off-by: Liu Long <long.liu@linux.intel.com>
Reviewed-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
Rename SOS_VM_NUM to SERVICE_VM_NUM.
rename SOS_SOCKET_PORT to SERVICE_VM_SOCKET_PORT.
rename PROCESS_RUN_IN_SOS to PROCESS_RUN_IN_SERVICE_VM.
rename PCI_DEV_TYPE_SOSEMUL to PCI_DEV_TYPE_SERVICE_VM_EMUL.
rename SHUTDOWN_REQ_FROM_SOS to SHUTDOWN_REQ_FROM_SERVICE_VM.
rename PROCESS_RUN_IN_SOS to PROCESS_RUN_IN_SERVICE_VM.
rename SHUTDOWN_REQ_FROM_UOS to SHUTDOWN_REQ_FROM_USER_VM.
rename UOS_SOCKET_PORT to USER_VM_SOCKET_PORT.
rename SOS_CONSOLE to SERVICE_VM_OS_CONSOLE.
rename SOS_LCS_SOCK to SERVICE_VM_LCS_SOCK.
rename SOS_VM_BOOTARGS to SERVICE_VM_OS_BOOTARGS.
rename SOS_ROOTFS to SERVICE_VM_ROOTFS.
rename SOS_IDLE to SERVICE_VM_IDLE.
rename SEVERITY_SOS to SEVERITY_SERVICE_VM.
rename SOS_VM_UUID to SERVICE_VM_UUID.
rename SOS_REQ to SERVICE_VM_REQ.
rename RTCT_NATIVE_FILE_PATH_IN_SOS to RTCT_NATIVE_FILE_PATH_IN_SERVICE_VM.
rename CBC_REQ_T_UOS_ACTIVE to CBC_REQ_T_USER_VM_ACTIVE.
rename CBC_REQ_T_UOS_INACTIVE to CBC_REQ_T_USER_VM_INACTIV.
rename uos_active to user_vm_active.
Tracked-On: #6744
Signed-off-by: Liu Long <long.liu@linux.intel.com>
Reviewed-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
Rename gpa_uos to gpa_user_vm
rename base_gpa_in_uos to base_gpa_in_user_vm
rename UOS_VIRT_PCI_MMCFG_BASE to USER_VM_VIRT_PCI_MMCFG_BASE
rename UOS_VIRT_PCI_MMCFG_START_BUS to USER_VM_VIRT_PCI_MMCFG_START_BUS
rename UOS_VIRT_PCI_MMCFG_END_BUS to USER_VM_VIRT_PCI_MMCFG_END_BUS
rename UOS_VIRT_PCI_MEMBASE32 to USER_VM_VIRT_PCI_MEMBASE32
rename UOS_VIRT_PCI_MEMLIMIT32 to USER_VM_VIRT_PCI_MEMLIMIT32
rename UOS_VIRT_PCI_MEMBASE64 to USER_VM_VIRT_PCI_MEMBASE64
rename UOS_VIRT_PCI_MEMLIMIT64 to USER_VM_VIRT_PCI_MEMLIMIT64
rename UOS in comments message to User VM.
Tracked-On: #6744
Signed-off-by: Liu Long <long.liu@linux.intel.com>
Reviewed-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
Rename sos_vm to service_vm.
rename sos_vmid to service_vmid.
rename sos_vm_ptr to service_vm_ptr.
rename get_sos_vm to get_service_vm.
rename sos_vm_gpa to service_vm_gpa.
rename sos_vm_e820 to service_vm_e820.
rename sos_efi_info to service_vm_efi_info.
rename sos_vm_config to service_vm_config.
rename sos_vm_hpa2gpa to service_vm_hpa2gpa.
rename vdev_in_sos to vdev_in_service_vm.
rename create_sos_vm_e820 to create_service_vm_e820.
rename sos_high64_max_ram to service_vm_high64_max_ram.
rename prepare_sos_vm_memmap to prepare_service_vm_memmap.
rename post_uos_sworld_memory to post_user_vm_sworld_memory
rename hcall_sos_offline_cpu to hcall_service_vm_offline_cpu.
rename filter_mem_from_sos_e820 to filter_mem_from_service_vm_e820.
rename create_sos_vm_efi_mmap_desc to create_service_vm_efi_mmap_desc.
rename HC_SOS_OFFLINE_CPU to HC_SERVICE_VM_OFFLINE_CPU.
rename SOS to Service VM in comments message.
Tracked-On: #6744
Signed-off-by: Liu Long <long.liu@linux.intel.com>
Reviewed-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
Implement the propagate_vcbm() function:
Set vCBM to to all the vCPUs that share cache with vcpu
to mimic hardware CAT behavior
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
Implement the write_vcbm() function to handle the
MSR_IA32_type_MASK_n vCBM MSRs write request
Call write_vclosid() to handle MSR_IA32_PQR_ASSOC MSR write request
Several vCAT P2V (physical to virtual) and V2P (virtual to physical)
mappings exist:
struct acrn_vm_config *vm_config = get_vm_config(vm_id)
max_pcbm = vm_config->max_type_pcbm (type: l2 or l3)
mask_shift = ffs64(max_pcbm)
vclosid = vmsr - MSR_IA32_type_MASK_0
pclosid = vm_config->pclosids[vclosid]
pmsr = MSR_IA32_type_MASK_0 + pclosid
pcbm = vcbm << mask_shift
vcbm = pcbm >> mask_shift
Where
MSR_IA32_type_MASK_n: L2 or L3 mask msr address for CLOSIDn, from
0C90H through 0D8FH (inclusive).
max_pcbm: a bitmask that selects all the physical cache ways assigned to the VM
vclosid: virtual CLOSID, always starts from 0
pclosid: corresponding physical CLOSID for a given vclosid
vmsr: virtual msr address, passed to vCAT handlers by the
caller functions rdmsr_vmexit_handler()/wrmsr_vmexit_handler()
pmsr: physical msr address
vcbm: virtual CBM, passed to vCAT handlers by the
caller functions rdmsr_vmexit_handler()/wrmsr_vmexit_handler()
pcbm: physical CBM
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
Implement the read_vcbm() and read_vclosid() functions to handle the MSR_IA32_PQR_ASSOC
and MSR_IA32_type_MASK_n vCAT MSRs read request.
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
Expose CAT feature to vCAT VM by reporting the number of
cache ways/CLOSIDs via the 04H/10H cpuid instructions, so that the
VM can take advantage of CAT to prioritize and partition cache
resource for its own tasks.
Add the vcat_pcbm_to_vcbm() function to map pcbm to vcbm
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
Initialize vCBM MSRs
Initialize vCLOSID MSR
Add some vCAT functions:
Retrieve max_vcbm and max_pcbm
Check if vCAT is configured or not for the VM
Map vclosid to pclosid
write_vclosid: vCLOSID MSR write handler
write_vcbm: vCBM MSR write handler
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
Specifying a reserved or unimplemented MSR address in ECX for rdmsr will cause a
general protection exception. In this case, we should not change the contents of
registers EDX:EAX.
Tracked-On: #4550
Signed-off-by: Fei Li <fei1.li@intel.com>
Initialize the emulated_guest_msrs[] array at runtime for
MSR_IA32_type_MASK_n and MSR_IA32_PQR_ASSOC msrs, there is no good
way to do this initialization statically at build time
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
For vCAT, it may need to store more than MAX_VCPUS_PER_VM of closids,
change clos in vm_config.h to a pointer to accommodate this situation
Rename clos to pclosids
pclosids now is a pointer to an array of physical CLOSIDs that is defined
in vm_configurations.c by vmconfig. The number of elements in the array
must be equal to the value given by num_pclosids
Add max_type_pcbm (type: l2 or l3) to struct acrn_vm_config, which stores a bitmask
that selects/covers all the physical cache ways assigned to the VM
Change vmsr.c to accommodate this amended data structure
Change the config-tools to generate vm_configurations.c, and fill in the num_closids
and clos pointers based on the information from the scenario file.
Now vm_configurations.c.xsl generates all the clos related code so remove the same
code from misc_cfg.h.xsl.
Examples:
Scenario file:
<RDT>
<RDT_ENABLED>y</RDT_ENABLED>
<CDP_ENABLED>n</CDP_ENABLED>
<VCAT_ENABLED>y</VCAT_ENABLED>
<CLOS_MASK>0x7ff</CLOS_MASK>
<CLOS_MASK>0x7ff</CLOS_MASK>
<CLOS_MASK>0x7ff</CLOS_MASK>
<CLOS_MASK>0xff800</CLOS_MASK>
<CLOS_MASK>0xff800</CLOS_MASK>
<CLOS_MASK>0xff800</CLOS_MASK>
<CLOS_MASK>0xff800</CLOS_MASK>
<CLOS_MASK>0xff800</CLOS_MASK>
/RDT>
<vm id="0">
<guest_flags>
<guest_flag>GUEST_FLAG_VCAT_ENABLED</guest_flag>
</guest_flags>
<clos>
<vcpu_clos>3</vcpu_clos>
<vcpu_clos>4</vcpu_clos>
<vcpu_clos>5</vcpu_clos>
<vcpu_clos>6</vcpu_clos>
<vcpu_clos>7</vcpu_clos>
</clos>
</vm>
<vm id="1">
<clos>
<vcpu_clos>1</vcpu_clos>
<vcpu_clos>2</vcpu_clos>
</clos>
</vm>
vm_configurations.c (generated by config-tools) with the above vCAT config:
static uint16_t vm0_vcpu_clos[5U] = {3U, 4U, 5U, 6U, 7U};
static uint16_t vm1_vcpu_clos[2U] = {1U, 2U};
struct acrn_vm_config vm_configs[CONFIG_MAX_VM_NUM] = {
{
.guest_flags = (GUEST_FLAG_VCAT_ENABLED),
.pclosids = vm0_vcpu_clos,
.num_pclosids = 5U,
.max_l3_pcbm = 0xff800U,
},
{
.pclosids = vm1_vcpu_clos,
.num_pclosids = 2U,
},
};
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Add the VCAT_ENABLED element to RDTType so that user can enable/disable vCAT globally
Add the GUEST_FLAG_VCAT_ENABLED guest flag to enable/disable vCAT per-VM.
Currently we have the following per-VM clos element in scenario file for RDT use:
<clos>
<vcpu_clos>0</vcpu_clos>
<vcpu_clos>0</vcpu_clos>
</clos>
When the GUEST_FLAG_VCAT_ENABLED guest flag is not specified, clos is for RDT use,
vcpu_clos is per-CPU and it configures each CPU in VMs to a desired CLOS ID.
When the GUEST_FLAG_VCAT_ENABLED guest flag is specified, vCAT is enabled for this VM,
clos is for vCAT use, vcpu_clos is not per-CPU anymore in this case, just a list of
physical CLOSIDs (minimum 2) that are assigned to VMs for vCAT use. Each vcpu_clos
will be mapped to a virtual CLOSID, the first vcpu_clos is mapped to virtual CLOSID
0 and the second is mapped to virtual CLOSID 1, etc
Add xs:assert to prevent any problems with invalid configuration data for vCAT:
If any GUEST_FLAG_VCAT_ENABLED guest flag is specified, both RDT_ENABLED and VCAT_ENABLED
must be 'y'
If VCAT_ENABLED is 'y', RDT_ENABLED must be 'y' and CDP_ENABLED must be 'n'
For a vCAT VM, vcpu_clos cannot be set to CLOSID 0, CLOSID 0 is reserved to be used by hypervisor
For a vCAT VM, number of clos/vcpu_clos elements must be greater than 1
For a vCAT VM, each clos/vcpu_clos must be less than L2/L3 COS_MAX
For a vCAT VM, its clos/vcpu_clos elements cannot contain duplicate values
There should not be any CLOS IDs overlap between a vCAT VM and any other VMs
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
IOMMU hardware resource is owned by hypervisor, while
IOMMU capability is reported to service VM in its ACPI
table. In this case, Service VM may access IOMMU hardware
resource, which is not expected.
This patch unmaps all Intel IOMMU register pages for service VM EPT.
Tracked-On: #6677
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
PR #6283 updated code and docs to the new kernel HSM driver. Fix
some references to VHM missed in the doxygen comments. Also fixed some
misspellings while in these files.
Tracked-On: #6282
Signed-off-by: David B. Kinder <david.b.kinder@intel.com>
For the IGD device the opregion addr is returned by reading the 0xFC config of
0:02.0 bdf. And the opregion addr is required by GPU driver.
The opregion_addr should be the GPA addr.
When the IGD is assigned to pre-launched VM, the value in 0xFC of igd_vdev is
programmed into with new GPA addr. In such case the prelaunched VM reads
the value from 0xFC of 0:02.0 vdev.
But for the Service VM, the IGD is initialized by using the same policy as other PCI
devices. We only initialize the vdev_head_conf(0x0-0x3F) by checking the
corresponding pbdf. The remaining pci_config_space will be read by
leveraging the corresponding pdev. But as the above code doesn't handle the
scenario for Service VM, it causes that the Service VM fails to
read the 0xFC config_space for IGD vdev.
Then the i915 GPU driver in SOS has some issues because of incorrect 0xFC
pci_conf_space.
This patch initializes offset 0xfc of CFG space of IGD for Service VM,
it is simple and can cover post-launched VM too.
Tracked-On: #6387
Signed-off-by: Liu,Junming <junming.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In the current hypervisor, only support at most two legacy vuarts
(COM1 and COM2) for a VM, COM1 is usually configured as VM console,
COM2 is configured as communication channel of S5 feature.
Hypervisor can support MAX_VUART_NUM_PER_VM(8) legacy vuart, but only
register handlers for two legacy vuart since the assumption (legacy
vuart is less than 2) is made.
In the current hypervisor configurtion, io port (2F8H) is always
allocated for virtual COM2, it will be not friendly if user wants to
assign this port to physical COM2.
Legacy vuart is common communication channel between service VM and
user VM, it can work in polling mode and its driver exits in each
guest OS. The channel can be used to send shutdown command to user VM
in S5 featuare, so need to config serval vuarts for service VM and one
vuart for each user VM.
The following changes will be made to support at most
MAX_VUART_NUM_PER_VM legacy vuarts:
- Refine legacy vuarts initialization to register PIO handler for
related vuart.
- Update assumption of legacy vuart number.
BTW, config tools updates about legacy vuarts will be made in other
patch.
v1-->v2:
Update commit message to make this patch's purpose clearer;
If vuart index is valid, register handler for it.
Tracked-On: #6652
Signed-off-by: Xiangyang Wu <xiangyang.wu@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
It's difficult to configure CONFIG_HV_RAM_SIZE properly at once. This patch
not only remove CONFIG_HV_RAM_SIZE, but also we use ld linker script to
dynamically get the size of HV RAM size.
Tracked-On: #6663
Signed-off-by: Fei Li <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
This patch implements a separate path for L2 VMEntry in run_vcpu(),
which has several benefits:
- keep run_vcpu() clean, to reduce the number of is_vcpu_in_l2_guest()
statements:
- current code has three is_vcpu_in_l2_guest() already.
- supposed to have another 2 statement so that nested VMEntry won't
hit the "Starting vCPU" and "vCPU launched" pr_info() and a few
other statements in the VM launch path.
- save few other things in run_vcpu() that are not needed for nested.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
In run time, one vCPU won't read or write a register on other vCPUs,
thus we don't need the LOCK prefixed instructions on reg_cached and
reg_updated.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
move the bitmap test call out of vcpu_inject_exception(), then we call
the expensive bitmap_test_and_clear_lock() only pending_req_bits is
non-zero and call vcpu_inject_exception() only if needed.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
In very large number of VM extis, the VM-exit instruction length could be
zero, and it's no need to update VMX_GUEST_RIP.
Some examples:
- all external interrupt VM exits in non LAPIC passthru setup.
- for all the nested VM-exits that are reflecting to L1 hypervisor.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
- wrap a new function exec_vmentry() to reduce code duplication.
- remove exec_vmread(VMX_GUEST_RSP) since ACRN doesn't need to know
guest RSP in run time.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Change "uint64_t vmcs_hva" to "void *vmcs_hva" in the input argument,
list, so that no type casting is needed when calling them from pointers.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
update board name from nuc7i7dnb to nuc11tnbi5 in makefile because
we have removed the nuc7i7dnb board folder, and also update the
scenario name from industry to shared to fix "make all" build issue.
Tracked-On: #6315
Signed-off-by: Kunhui-Li <kunhuix.li@intel.com>
In current design, when pass-thru dev,
for the PIO bar, need to ensure the guest PIO start address
equals to host PIO start address.
But malicious guest may reprogram the PIO bar,
then hv will pass-thru the reprogramed PIO address to guest.
This isn't safe behavior.
When guest tries to reprogram pass-thru dev PIO bar,
inject #GP to guest directly.
Tracked-On: #6508
Signed-off-by: Liu,Junming <junming.liu@intel.com>
Reviewed-by: Zhao Yakui <yakui.zhao@intel.com>
Reviewed-by: Fei Li <fei1.li@intel.com>
In current design, when pass-thru dev,
for the PIO bar, need to ensure the guest PIO start address
equals to host PIO start address.
Then set the VMCS io bitmap to pass-thru the corresponding
port io to guest for performance.
ACRN-DM and acrn-config should ensure the identical mapping of PIO bar.
If ACRN-DM or acrn-config failed to achieve this,
we should deny the launch of VM
Tracked-On: #6508
Signed-off-by: Liu,Junming <junming.liu@intel.com>
Reviewed-by: Zhao Yakui <yakui.zhao@intel.com>
Reviewed-by: Fei Li <fei1.li@intel.com>
In the commit of 4e1deab3d9, we changed the
init sequence that init paging first and then init e820 because we worried
about the efi memory map could be beyond 4GB space on some platform.
After we double checked multiboot2 spec, when system boot from multiboot2
protocol, the efi memory map info will be embedded in multiboot info so it
is guaranteed that the efi memory map must be under 4GB space. Consider that
the page table will be allocated in free memory space in future, we have
to change the init sequence back that init e820 first and then init paging.
If we need to support other boot protocol in future that the efi memory map
might be put beyond 4GB, we could have below options:
1. Request bootloader put efi memory map below 4GB;
2. Call EFI_BOOT_SERVICES.GetMemoryMap() before ExitBootServices();
3. Enable a early 64bit page table to get the efi memory map only;
Tracked-On: #5626
Signed-off-by: Victor Sun <victor.sun@intel.com>
This patch changes the size of vvmcs[] array from 1 to
PER_VCPU_ACTIVE_VVMCS_NUM, and actually enables multiple active VMCS12
support in ACRN. The basic operations:
- if L1 VMPTRLDs a VMCS12 without previously VMCLEAR the current
VMCS12, ACRN no longer unconditionally flushes the current VMCS12
back to L1. Instead, it tries to keep both the current and the newly
loaded VMCS12 in the nested->vvmcs[] array, unless:
- if there is no more available vvmcs[] entry, ACRN flushes one active
VMCS12 to make room for this new VMCS12.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Some processors don't support VMX_PROCBASED_CTLS_TERTIARY bit
and VMX_PROCBASED_CTLS2_UWAIT_PAUSE bit in MSRs
(IA32_VMX_PROCBASED_CTLS & IA32_VMX_PROCBASED_CTLS2),
HV will output error log which will cause confusion,
change the log level from pr_err to pr_info.
Tracked-On: #6397
Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
This patch adds a new priority based scheduler to support
vCPU scheduling based on their pre-configured priorities.
A vCPU can be running only if there is no higher priority
vCPU running on the same pCPU.
Tracked-On: #6571
Signed-off-by: Jie Deng <jie.deng@intel.com>
The current config.mk uses the variable BOARD_FILE as the path to the board
XML when generating an unmodified copy of configuration files for
comparison, which is incorrect. The right variable is HV_BOARD_XML which is
the path to the copy of board XML that is actually used for the build.
This patch corrects the bug above.
In addition, this patch also skips binary files (which are not meant to be
edited manually) when calculating the differences.
Tracked-On: #6592
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
For local APIC passthrough case, EOI would not trigger VM-exit. So virtual
'Remote IRR' would not be updated. Needs to read physical IOxAPIC RTE to
update virtual 'Remote IRR' field each time when guest wants to read I/O
REDIRECTION TABLE REGISTERS
Tracked-On: #5923
Signed-off-by: Fei Li <fei1.li@intel.com>
In local APIC passthrough case, when devices triggered a INTx interrupt, this
interrupt would be delivered to vCPU directly. For this case, need to set the
virtual vector in
the 'Interrupt Vector' field of physical IOxAPIC I/O REDIRECTION TABLE REGISTER
(bits 7:0) and 'Vector' field of vt-d Interrupt Remapping Table Entry (IRTE)
for Remapped Interrupts.
Assumption:
(a) IOAPIC pins won't be shared between LAPIC PT guest and other guests;
(b) The guest would not trigger this IRQ before it switched to x2 APIC mode.
Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
check some essential vmx capablility,
will panic if processor doesn't support it.
Tracked-On: #6584
Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In current RDT code, if CDP is configured, L2/L3 resources' num_closids calculation
is wrong:
res_cap_info[res].num_closids = (uint16_t)((edx & 0xffffU) >> 1U) + 1U;
Should be:
res_cap_info[res].num_closids = (uint16_t)((edx & 0xffffU) >> 1U + 1) >> 1U;
Aslo, in order to enable CDP system-wide, need to enable the CDP bit (bit 0) on all pcpus,
not just on pcpu 0.
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Rename the clos_max field in struct rdt_info to num_closids
Rename variable valid_clos_num to common_num_closids and make it static
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Normalize hypervisor help command output format, remove the 10 lines
limit for one screen, fix the misspelled words.
Tracked-On: #5112
Signed-off-by: Liu Long <long.liu@intel.com>
Reviewed-by: VanCutsem, Geoffroy <geoffroy.vancutsem@intel.com>
These dirty flags are supposed to be per VMCS12, so move them from the
per vCPU acrn_nested struct to the newly added acrn_vvmcs struct.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
This variable represents the L1 GPA of the current VMCS12. But it's
no longer needed in the multiple active VMCS12 case, which uses the
following variables for this purpose.
- nested->current_vvmcs refers to the vvmcs[] entry which contains the
cached current VMCS12, its associated VMCS02, and other context info.
- nested->current_vvmcs->vmcs12_gpa refers to the L1 GPA of this
current VMCS12.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Add an array of struct acrn_vvmcs to struct acrn_nested, so it is
possible to cache multiple active VMCS12s.
This patch declares the size of this array to 1, meaning that there is
only one active VMCS12. This is to minimize the logical code changes.
Add pointer current_vvmcs to struct acrn_nested, which refers to the
current vvmcs[] entry. In this patch, if any VMCS12 is active, it
always points to vvmcs[0].
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
No any logical changes, this patch is preparing for multiple active
VMCS12 support.
- currently it's easy to get the vmcs12 pointer from the vcpu pointer.
In multiple active vmcs12 case, we need to explicitly add "struct
acrn_vmcs12 *vmcs12" to certain APIs' input argument list, in order to
get the desired vmcs12 pointer.
- merge flush_current_vmcs12() into clear_vmcs02() for multiple reasons:
a) it's called only once; b) we don't wrap the opposite operation
(loading vmcs12) in an API; c) this API has simple and clear logic.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
By changing the way to assign L1 VPID from bottom-up to top-down,
the possibilities for VPID conflicts between L1 and L2 guests are
small.
Then we can flush VPID just in case of conflicting.
Tracked-On: #6289
Signed-off-by: Anthony Xu <anthony.xu@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Before checking whether a PCI device is a Multi-Function Device or not, we need
make sure this PCI device is a valid PCI device. For a valid PCI device, the
'Header Layout' field in Header Type Register must be 000 0000b (Type 0 PCI device)
or 000 0001b (Type 1 PCI device).
So for a valid PCI device, the Header Type can't be 0xff.
Tracked-On: #4134
Signed-off-by: Fei Li <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In run time, it's rare for L1 to write to the intercepted non host-state
VMCS fields, and using multiple dirty flags is not necessary.
This patch uses one single dirty flag to manage all non host-state VMCS
fields. This helps to simplify current code and in the future we may
not need to declare new dirty flags when we intercept more VMCS fields.
Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
Currently vmptrld_vmexit_handler() doesn't sync VMX_EPT_POINTER_FULL
from vmcs12 to vmcs02, instead it sets gpa_field_dirty and relies on
nested_vmentry() to sync EPTP in next nested VMentry.
This creates readability issue since all other intercepted VMCS fields
are synced in sync_vmcs12_to_vmcs02(). Another issue is that other
VMCS fields managed by gpa_field_dirty are repeatedly synced in both
vmptrld and nested vmentry handler.
This patch moves get_nept_desc() ahead of sync_vmcs12_to_vmcs02(), such
that shadow_eptp is allocated before sync_vmcs12_to_vmcs02() which
can sync EPTP properly.
BTW, in nested_vmexit_handler(), don't need to read from VMCS to get
the exit reason, since vcpu->arch.exit_reason has it already.
Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
Add a couple of missing dependencies in the ACRN Makefiles:
1. 'acrn.bin' is required before the hypervisor can be installed
2. The 'acrn_mngr.h' needs to be installed ('tools-install') in
the build folder.
Tracked-On: #6360
Signed-off-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
While we have default values of configuration entries stated in the schema
of scenario XMLs, today we still require user-given scenario XMLs to
contain literally ALL XML nodes. Missing of a single node will cause schema
validation errors even though we can use its default value defined in the
schema.
This patch allows user-given scenario XMLs to ignore nodes with default
values. It is done by adding the missing nodes, all containing the defined
default values, to the input scenario XML when copying it to the build
directory. This approach imposes no changes to either the schema or
subsequent scripts in the build system.
Tracked-On: #6292
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Previously it is (falsely) assumed that the major_ver of 32-bit SMBIOS
entry point structure (which is called SMBIOS 2.1 in spec, or SMBIOS2 in code)
will have a value of 2 and major_ver of 64-bit SMBIOS (which is called SMBIOS
3.0 in spec, and SMBIOS3 in code) will have a value of 3. This turned out to be
wrong. This major_ver refers to the implemented doc revision, and 32-bit SMBIOS2
can have its major_ver to be 3 (current most recent implementation).
This patch removes the use of major_ver to distinguish between
SMBIOS2/3, and use a doc-defined anchor string instead.
Tracked-On: #6528
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
The locked btr instruction is expensive. This patch changes the
logic to ensure that the bitmap is non-zero before executing
bitmap_test_and_clear_lock().
The VMX transition time gets significant improvement. SOS running
on TGL, the CPUID roundtrip reduces from ~2400 cycles to ~2000 cycles.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Currently ACRN assumes firmware setup IA32_PAT correctly. This patch
explicitly initializes host IA32_PAT MSR according to ISDM Table 11-12.
Memory Type Setting of PAT Entries Following a Power-up or Reset.
ACRN creates host page tables based on PAT0 (WB) and PAT3 (UC).
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
is_lapic_pt_enabled() is called at least twice in one loop of the vCPU
thread, and it's called in vmexit_handler() frequently if LAPIC is not
pass-through. Thus the efficiency of this function has direct
impact to the system performance.
Since the LAPIC mode is not changed in run time, we don't have to
calculate it on the fly in is_lapic_pt_enabled().
BTW, removed the unused lapic_mask from struct acrn_vcpu_arch.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
For platforms that do not support XSAVES/XRSTORS instructions, like QEMU,
executing these instructions causes #UD.
This patch adds the check before the execution of XSAVES/XRSTORS instructions.
It also refines the logic inside rstore_xsave_area for the following reason:
If XSAVES/XRSTORS instructions are supported, restore XSAVE area if any of the
following conditions is met:
1. "vcpu->launched" is false (state initialization for guest)
2. "vcpu->arch.xsave_enabled" is true (state restoring for guest)
* Before vCPU is launched, condition 1 is satisfied.
* After vCPU is launched, condition 2 is satisfied because
is_valid_xsave_combination() guarantees that "vcpu->arch.xsave_enabled"
is consistent with pcpu_has_cap(X86_FEATURE_XSAVES).
Therefore, the check against "vcpu->launched" and "vcpu->arch.xsave_enabled"
can be eliminated here.
Tracked-On: #6481
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
Use an unused MSR on host to save ACRN pcpu ID and avoid saving and
restoring TSC AUX MSR on VMX transitions.
Tracked-On: #6289
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
This feature is guarded under config CONFIG_SECURITY_VM_FIXUP, which
by default should be disabled.
This patch passthrough native SMBIOS information to prelaunched VM.
SMBIOS table contains a small entry point structure and a table, of which
the entry point structure will be put in 0xf0000-0xfffff region in guest
address space, and the table will be put in the ACPI_NVS region in guest
address space.
v2 -> v3:
uuid_is_equal moved to util.h as inline API
result -> pVendortable, in function efi_search_guid
recalc_checksum -> generate_checksum
efi_search_smbios -> efi_search_smbios_eps
scan_smbios_eps -> mem_search_smbios_eps
EFI GUID definition kept
Tracked-On: #6320
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
This patch renames the GUEST_FLAG_TPM2_FIXUP to
GUEST_FLAG_SECURITY_VM.
v2 -> v3:
The "FIXUP" suffix is removed.
Tracked-On: #6320
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
In ACRN RT VM if the lapic is passthrough to the guest, the ipi can't
trigger VM_EXIT and the vNMI is just for notification, it can't handle
the smp_call function. Modify vcpu_dumpreg function prompt user switch
to vLAPIC mode for vCPU register dump.
Tracked-On: #6473
Signed-off-by: Liu Long <long.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
- remove vcpu->arch.nrexits which is useless.
- record full 32 bits of exit_reason to TRACE_2L(). Make the code simpler.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
GSI of hcall_set_irqline should be checked against target_vm's
total GSI count instead of SOS's total GSI count.
Tracked-On: #6357
Signed-off-by: Jian Jun Chen <jian.jun.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
This helps to improve performance:
- Don't need to execute VMREAD in vcpu_get_efer(), which is frequently
called.
- VMX_EXIT_CTLS_SAVE_EFER can be removed from VM-Exit Controls.
- If the value of IA32_EFER MSR is identical between the host and guest
(highly likely), adjust the VMX controls not to load IA32_EFER on
VMExit and VMEntry.
It's convenient to continue use the exiting vcpu_s/get_efer() APIs,
other than the common vcpu_s/get_guest_msr().
Tracked-On: #6289
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
MAXIMUM_PA_WIDTH will be calculated from board information.
Tracked-On: #6357
Signed-off-by: Liang Yi <yi.liang@intel.com>
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Mask off support of 57-bit linear addresses and five-level paging.
ICX-D has LA57 but ACRN doesn't support 5-level paging yet.
Tracked-On: #6357
Signed-off-by: Liang Yi <yi.liang@intel.com>
Signed-off-by: Li, Fei1 <fei1.li@intel.com>
It is used to specify the maximum number of EFI memmap entries.
On some platforms, like Tiger Lake, the number of EFI memmap entries
becomes 268 when the BIOS settings are changed.
The current value of MAX_EFI_MMAP_ENTRIES (256) defined in hypervisor
is not big enough to cover such cases.
As the number of EFI memmap entries depends on the platforms and the
BIOS settings, this patch introduces a new entry MAX_EFI_MMAP_ENTRIES
in configurations so that it can be adjusted for different cases.
Tracked-On: #6442
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
If SOS is using kernel 5.4, hypervisor got panic with #GP.
Here is an example on KBL showing how the panic occurs when kernel 5.4 is used:
Notes:
* Physical MSR_IA32_XSS[bit 8] is 1 when physical CPU boots up.
* vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is initialized to 0.
Following thread switches would happen at run time:
1. idle thread -> vcpu thread
context_switch_in happens and rstore_xsave_area is called.
At this moment, vcpu->arch.xsave_enabled is false as vcpu is not launched yet
and init_vmcs is not called yet (where xsave_enabled is set to true).
Thus, physical MSR_IA32_XSS is not updated with the value of guest MSR_IA32_XSS.
States at this point:
* Physical MSR_IA32_XSS[bit 8] is 1.
* vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.
2. vcpu thread -> idle thread
context_switch_out happens and save_xsave_area is called.
At this moment, vcpu->arch.xsave_enabled is true. Processor state is saved
to memory with XSAVES instruction. As physical MSR_IA32_XSS[bit 8] is 1,
ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is set to 1 after the execution
of XSAVES instruction.
States at this point:
* Physical MSR_IA32_XSS[bit 8] is 1.
* vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.
* ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is 1.
3. idle thread -> vcpu thread
context_switch_in happens and rstore_xsave_area is called.
At this moment, vcpu->arch.xsave_enabled is true. Physical MSR_IA32_XSS is
updated with the value of guest MSR_IA32_XSS, which is 0.
States at this point:
* Physical MSR_IA32_XSS[bit 8] is 0.
* vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.
* ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is 1.
Processor state is restored from memory with XRSTORS instruction afterwards.
According to SDM Vol1 13.12 OPERATION OF XRSTORS, a #GP occurs if XCOMP_BV
sets a bit in the range 62:0 that is not set in XCR0 | IA32_XSS.
So, #GP occurs once XRSTORS instruction is executed.
Such issue does not happen with kernel 5.10. Because kernel 5.10 writes to
MSR_IA32_XSS during initialization, while kernel 5.4 does not do such write.
Once guest writes to MSR_IA32_XSS, it would be trapped to hypervisor, then,
physical MSR_IA32_XSS and the value of MSR_IA32_XSS in vcpu->arch.guest_msrs
are updated with the value specified by guest. So, in the point 2 above,
correct processor state is saved. And #GP would not happen in the point 3.
This patch initializes the XSAVE related processor state for guest.
If vcpu is not launched yet, the processor state is initialized according to
the initial value of vcpu_get_guest_msr(vcpu, MSR_IA32_XSS), ectx->xcr0,
and ectx->xs_area. With this approach, the physical processor state is
consistent with the one presented to guest.
Tracked-On: #6434
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Reviewed-by: Li Fei1 <fei1.li@intel.com>
Currently init_vmx_msrs() emulates same value for the IA32_VMX_xxx_CTLS
and IA32_VMX_TRUE_xxx_CTLS MSRs.
But the value of physical MSRs could be different between the pair,
and we need to adjust the emulated value accordingly.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
check_vmx_permission() is called in vmresume_vmexit_handler() and
vmlaunch_vmexit_handler() already.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Currently the sched event handling may encounter data race problem, and
as a result some vcpus might be stalled forever.
One example can be wbinvd handling where more than 1 vcpus are doing
wbinvd concurrently. The following is a possible execution of 3 vcpus:
-------
0 1 2
req [Note: 0]
req bit0 set [Note: 1]
IPI -> 0
req bit2 set
IPI -> 2
VMExit
req bit2 cleared
wait
vcpu2 descheduled
VMExit
req bit0 cleared
wait
vcpu0 descheduled
signal 0
event0->set=true
wake 0
signal 2
event2->set=true [Note: 3]
wake 2
vcpu2 scheduled
event2->set=false
resume
req
req bit0 set
IPI -> 0
req bit1 set
IPI -> 1
(doesn't matter)
vcpu0 scheduled [Note: 4]
signal 0
event0->set=true
(no wake) [Note: 2]
event0->set=false (the rest doesn't matter)
resume
Any VMExit
req bit0 cleared
wait
idle running
(blocked forever)
Notes:
0: req: vcpu_make_request(vcpu, ACRN_REQUEST_WAIT_WBINVD).
1: req bit: Bit in pending_req_bits. Bit0 stands for bit for vcpu0.
2: In function signal_event, At this time the event->waiting_thread
is not NULL, so wake_thread will not execute
3: eventX: struct sched_event of vcpuX.
4: In function wait_event, the lock does not strictly cover the execution between
schedule() and event->set=false, so other threads may kick in.
-----
As shown in above example, before the last random VMExit, vcpu0 ended up
with request bit set but event->set==false, so blocked forever.
This patch proposes to change event->set from a boolean variable to an
integer. The semantic is very similar to a semaphore. The wait_event
will add 1 to this value, and block when this value is > 0, whereas signal_event
will decrease this value by 1.
It may happen that this value was decreased to a negative number but that
is OK. As long as the wait_event and signal_event are paired and
program order is observed (that is, wait_event always happens-before signal_event
on a single vcpu), this value will eventually be 0.
Tracked-On: #6405
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
This is a simply implement for the 32bit and 64bit elf loader.
The loading function first reads the image header, and finds the program
entries that are marked as PT_LOAD, then loads segments from elf file to
guest ram. After that, it finds the bss section in the elf section entries, and
clear the ram area it points to.
Limitations:
1. The e_type of the elf image must be ET_EXEC(executable). Relocatable or
dynamic code is not supported.
2. The loader only copies program segments that has a p_type of
PT_LOAD(loadable segment). Other segments are ignored.
3. The loader doesn’t support Sections that are relocatable
(sh_type is SHT_REL or SHT_RELA)
4. The 64bit elf’s entry address must below 4G.
5. The elf is assumed to be able to put segments to valid guest memory.
Tracked-On: #6323
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
This patch adds a function elf_loader() to load elf image.
It checks the elf header, get its 32/64 bit type, then calls
the corresponding loading routines, which are empty, and
will be realized later.
Tracked-On: #6323
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Source: https://github.com/freebsd/freebsd-src/blob/main/sys/sys/elf_common.h
Trimed to meet the minimal requirements for the Zephyr elf file to be loaded
Also added elf file header data struct and program/section entry data structs.
Tracked-On: #6323
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In order to make better sense, vm_elf_loader, vm_bzimage_loader and
vm_rawimage_loader are changed to elf_loaer, bzimage_loaer and
rawimage_loader.
Tracked-On: #6323
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Remove the acpi loading function from elf_loader, rawimage_loaer and
bzimage_loader, and call it together in vm_sw_loader.
Now the vm_sw_loader's job is not just loading sw, so we rename it to
prepare_os_image.
Tracked-On: #6323
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
For the guest OS loaders, prapare_loading_xxx are not accurate for
what those functions actually do. Now they are changed to load_xxx:
load_rawimage, load_bzimage.
And the 'bsp' expression is confusing in the comments for
init_vcpu_protect_mode_regs, changed to a better way.
Tracked-On: #6323
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
vboot_info.h declares vm loader function also, so rename the file name to
vboot.h;
Tracked-On: #6323
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The patch splits the vm_load.c to three parts, the loader function of bzImage
kernel is moved to bzimage_loader.c, the loader function of raw image kernel
is moved to rawimage_loader.c, the stub is still stayed in vm_load.c to load
the corresponding kernel loader function. Each loader function could be
isolated by CONFIG_GUEST_KERNEL_XXX macro which generated by config tool.
Tracked-On: #6323
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>