diff --git a/doc/tutorials/rtvm_performance_tips.rst b/doc/tutorials/rtvm_performance_tips.rst index f988352c1..a53b18cc7 100644 --- a/doc/tutorials/rtvm_performance_tips.rst +++ b/doc/tutorials/rtvm_performance_tips.rst @@ -1,213 +1,196 @@ -.. _rt_perf_tips_rtvm: - -ACRN Real-Time VM Performance Tips -################################## - -Background -********** - -The ACRN real-time VM (RTVM) is a special type of ACRN post-launched VM. In -order to achieve bare metal-like RT performance, a set of constraints and -technologies are applied to the RTVM compared to the ACRN standard VM. With -these additional constraints and technologies, RT tasks can run on the RTVM -without a VM-exit, which is a key virtualization overhead issue. - -In addition to the VM-exit, interference from neighbor VMs, such as Service -VMs, Human-Machine-Interface (HMI) VMs, or other RT VMs may affect the -execution of real-time tasks on a certain RTVM. Other technologies are -applied to isolate noise from the neighbor VMs. - -Here is the list of key technologies applied to enable the bare metal-like -RT performance: - -- LAPIC passthrough with core partitioning. -- PCIe Device Passthrough: Only MSI interrupt-capable PCI devices will be - supported for the RTVM. -- Enable CAT (Cache Allocation Technology)-based cache isolation: RTVM uses - a dedicated CLOS (Class of Service). While others may share CLOS, the GPU - uses a CLOS that will not overlap with the RTVM CLOS. -- PMD virtio: Both virtio BE and FE work in polling mode so that the - interrupts or notification between the Service VM and RTVM are not needed. - The RTVM guest memory is hidden from the Service VM except for the virtio - queue memory which is all that the Service VM can access. - -This document list tips that are summarized from issues encountered and -resolved during real-time development and performance tuning. 
- -Mandatory options for an RTVM -***************************** - -An RTVM is a post-launched VM with LAPIC passthrough. To launch an ACRN -RTVM, take note of the following options: - -**Tip 1:** Apply the acrn-dm option "--lapic_pt" and make the guest RTVM -operate under the LAPIC X2APIC mode to enable the LAPIC passthrough. - -The LAPIC passthrough feature of ACRN is configured via the "--lapic_pt" -option, but the feature is actually enabled when LAPIC is switched to X2APIC -mode. So, both conditions should be met to enable an RTVM. The "--rtvm" -option will be automatically attached once "--lapic_pt" is applied. - -**Tip 2:** If necessary, use virtio polling mode to prevent the frontend of -the VM-exit from sending a notification to the backend. - -We recommend that you passthrough a physical peripheral device to an RTVM, -such as block or an ethernet device. If no physical device is available, -ACRN supports virtio devices and enables the polling mode to avoid a VM-exit -at the frontend. Virtio polling mode can be enabled via the option -"--virtio_poll [polling interval]". - -Avoid VM-exit latency -********************* - -VM-exit has a significant negative impact on virtualization performance. -A single VM-exit can cause several micro-second latencies, or even longer, -depending on what's done in VMX-root mode. VM-exit is classified into two -types: triggered by external CPU events or triggered by operations initiated -by the vCPU. - -ACRN eliminates almost all VM-exits triggered by external events via the -LAPIC passthrough. A few exceptions exist: - -- SMI - it will bring the processor into the SMM, causing a much longer - performance impact. The SMI should be handled in the BIOS. - -- NMI - ACRN uses NMI for system-level notification. - -Users should take care of VM-exits that are triggered by operations -initiated by the vCPU. 
Refer to the Intel SMD: "Instructions Cause VM-exits -Unconditionally" (SDM V3, 25.1.2) and "Instructions That Cause VM-exits -Conditionally" (SDM V3, 25.1.3). - -**Tip 3:** Do not use CPUID in the RT critical section. - -CPUID is an instruction that causes VM-exits unconditionally. As to the -normal usage of CPUID, this can be avoided by detecting the CPU capability -before entering the RT critical section. CPUID can be executed at any -privilege level to serialize instruction execution and its high efficiency -of execution. It's commonly used as a serializing instruction in an -application, and a typical case is using CPUID immediately before and after -RDTSC. In order to remove CPUID in this case, use RDTSCP instead of RDTSC. -Because RDTSCP waits until all previous instructions have been executed -before reading the counter, and the subsequent instructions after the RDTSCP -normally have data dependency on it, they must wait until the RDTSCP has -been executed. - -RDMSR or WRMSR are instructions that cause VM-exits conditionally. On the -ACRN RTVM, most MSRs are not intercepted by the HV, so they won't cause a -VM-exit. But there are exceptions for security consideration: 1) read from -APICID and LDR; 2) write to TSC_ADJUST if VMX_TSC_OFFSET_FULL is zero; -otherwise, read and write to TSC_ADJUST and TSC_DEADLINE; 3) write to ICR. - -**Tip 4:** Do not use RDMSR to access APICID and LDR at the RT critical -section. - -ACRN does not intend to present a physical APICID to a guest so that APICID -and LDR are virtualized even though LAPIC is passthrough. As a result, -access to APICID and LDR can cause a VM-exit. - -**Tip 5:** Guarantee that VMX_TSC_OFFSET_FULL is zero; otherwise, do not -access TSC_ADJUST and TSC_DEADLINE in the RT critical section. - -ACRN uses VMX_TSC_OFFSET_FULL as the offset between vTSC_ADJUST and -pTSC_ADJUST; therefore, if VMX_TSC_OFFSET_FULL is zero, intercepting -TSC_ADJUST and TSC_DEADLINE is not necessary. 
Otherwise, they should be -intercepted to guarantee functionality. - -**Tip 6:** Utilize Preempt-RT Linux mechanisms to reduce the access of ICR -from the RT core: - -#. Add "domain" to the "isolcpus" ( “isolcpus=nohz,domain,1” ) to the kernel parameters. -#. Add "idle=poll" to the kernel parameters. -#. Add "rcu_nocb_poll" along with "rcu_nocbs=1" to the kernel parameters. -#. Disable the logging service like journald, syslogd if possible. - -These parameters are recommended for the guest Preempt-RT Linux. For a UP -RTVM, ICR interception is not a problem. But for an SMP RTVM, IPI may be -needed between vCPUs; these tips are about to reduce the ICR access. The -example above assumes it is a dual-core RTVM, while core 0 is a housekeeping -core and core 1 is a real-time core. The "domain" flag makes strong -isolation of the RT core from the general SMP balancing and scheduling -algorithms. "idle=poll" and "rcu_nocb_poll" could prevent the RT core from -sending reschedule IPI to wakeup tasks on core 0 in most cases. And the -disabling of the logging service is because an IPI may be issued to the -housekeeping core to notify the logging service when there are kernel -messages output on the RT core. - -.. note:: - If an ICR access is inevitable within the RT critical section, please be - aware of the extra 3~4 us latency from each access. - -**TIP 7:** Create and initialize the RT tasks at the beginning to avoid -runtime access to control registers. - -The access to Control Registers is another cause of a VM-exit. An ACRN access -to CR3 and CR8 do not cause a VM-exit, but writes to CR0 and CR4 may cause a -VM-exit, which would happen at the spawning and initialization of a new task. - -Isolating the impact of neighbor VMs -************************************ - -ACRN makes use of several technologies and hardware features to avoid the -impact to the RTVM from neighbor VMs: - -**TIP 8:** Do not share CPUs allocated to the RTVM with other RT/non-RT VMs. 
- -ACRN enables CPU sharing to improve the utilization of CPU resources. -However, for RT VM, CPUs should be dedicatedly allocated for the determinism. - -**TIP 9:** Use RDT such as CAT and MBA to allocate dedicated resources to -the RTVM. - -ACRN enables the Intel® Resource Director Technology, such as CAT and MBA, -components such as the GPU via memory hierarchy. The availability of RDT is -hardware-specific. Refer to the :ref:`rdt_configuration`. - -**TIP 10:** Lock the GPU to a feasible lowest frequency. - -GPU can put heavy pressure on the power/memory subsystem, so locking the GPU -frequency as low as possible can help to improve the determinism of RT -performance. It can be locked in the BIOS, but the availability of certain -BIOS option is platform-specific. - -Miscellaneous -************* - -**TIP 11:** Disable timer migration on Preempt-RT Linux. - -Because most tasks are set affinitive to the housekeeping core, the timer -armed by RT tasks might be migrated to the nearest busy CPU for power -saving. But it will hurt the determinism because the timer interrupts raised -on the housekeeping core need to be resent to the RT core. The timer -migration could be disabled by cmd: "echo 0 > /proc/kernel/timer_migration" - -**TIP 12:** Add "mce=off" to RT VM kernel parameters. - -"mce=off" can disable the mce periodic timer in order to void a VM-exit. - -**TIP 13:** Disable the Intel processor C-State and P-State of the RTVM. - -Power management of a processor could save power, but it could also impact -the RT performance because the power state is changing. C-State and P-State -PM mechanism can be disabled by adding "processor.max_cstate=0 -intel_idle.max_cstate=0 intel_pstate=disabled" to the kernel parameters. - -**TIP 14:** Exercise caution when setting /proc/sys/kernel/sched_rt_runtime_us. - -Setting /proc/sys/kernel/sched_rt_runtime_us to -1 can be dangerous. 
A value -of -1 allows RT tasks to monopolize a CPU, so that the mechanism such as -"nohz" might get no chance to work, which can hurt the RT performance or -even (potentially) lock up a system. - -**TIP 15:** Disable the software workaround for Machine Check Error on Page -Size Change. - -By default, the software workaround for Machine Check Error on Page Size -Change is conditionally applied to the models that may be affected by the -issue. However, the software workaround has a negative impact on -performance. If all guest OS kernels are trusted, the -:option:`CONFIG_MCE_ON_PSC_WORKAROUND_DISABLED` option could be set for performance. - -.. note:: - The tips for preempt-RT Linux is mostly applicable to the Linux-based RT OS as well, such as Xenomai. - +.. _rt_perf_tips_rtvm: + +ACRN Real-Time VM Performance Tips +################################## + +Background +********** + +The ACRN real-time VM (RTVM) is a special type of ACRN post-launched VM. +This document shows how you can configure RTVMs to potentially achieve +near bare-metal performance by configuring certain key technologies and +eliminating use of a VM-exit within RT tasks, thereby avoiding this +common virtualization overhead issue. + +Neighbor VMs such as Service VMs, Human-Machine-Interface (HMI) VMs, or +other real-time VMs, may negatively affect the execution of real-time +tasks on an RTVM. This document also shows technologies used to isolate +potential runtime noise from neighbor VMs. + +Here are some key technologies that can significantly improve +RTVM performance: + +- LAPIC passthrough with core partitioning. +- PCIe Device Passthrough: Only MSI interrupt-capable PCI devices are + supported for the RTVM. +- Enable CAT (Cache Allocation Technology)-based cache isolation: RTVM uses + a dedicated CLOS (Class of Service). While others may share CLOS, the GPU + uses a CLOS that will not overlap with the RTVM CLOS. 
+- PMD virtio: Both virtio BE and FE work in polling mode so + interrupts and notifications between the Service VM and RTVM are not needed. + All RTVM guest memory is hidden from the Service VM except for the virtio + queue memory. + +This document summarizes tips from issues encountered and +resolved during real-time development and performance tuning. + +Mandatory options for an RTVM +***************************** + +An RTVM is a post-launched VM with LAPIC passthrough. Pay attention to +these options when you launch an ACRN RTVM: + +Tip: Apply the acrn-dm option ``--lapic_pt`` + The LAPIC passthrough feature of ACRN is configured via the + ``--lapic_pt`` option, but the feature is actually enabled only when the LAPIC is + switched to X2APIC mode. Both conditions must be met to enable an + RTVM. The ``--rtvm`` option is automatically attached once + ``--lapic_pt`` is applied. + +Tip: Use virtio polling mode + Polling mode keeps the frontend from triggering a VM-exit to send a + notification to the backend. We recommend that you pass through a + physical peripheral device (such as a block or an Ethernet device) to an + RTVM. If no physical device is available, ACRN supports virtio devices + and enables polling mode to avoid a VM-exit at the frontend. Enable + virtio polling mode via the option ``--virtio_poll [polling interval]``. + +Avoid VM-exit latency +********************* + +VM-exit has a significant negative impact on virtualization performance. +A single VM-exit causes a latency of several microseconds or longer, +depending on what's done in VMX-root mode. VM-exits are classified into two +types: triggered by external CPU events or triggered by operations initiated +by the vCPU. + +ACRN eliminates almost all VM-exits triggered by external events by +using LAPIC passthrough. A few exceptions exist: + +- SMI - This brings the processor into the SMM, causing a much longer + performance impact. The SMI should be handled in the BIOS. 
+ +- NMI - ACRN uses NMI for system-level notification. + +You should avoid VM-exits triggered by operations initiated by the +vCPU. Refer to the `Intel Software Developer Manuals (SDM) +`_ sections "Instructions +Cause VM-exits Unconditionally" (SDM V3, 25.1.2) and "Instructions That +Cause VM-exits Conditionally" (SDM V3, 25.1.3). + +Tip: Do not use CPUID in a real-time critical section. + The CPUID instruction causes VM-exits unconditionally. You should + detect CPU capability **before** entering an RT critical section. + Because CPUID can be executed at any privilege level and serializes + instruction execution, applications commonly use it as a + serializing instruction, typically by executing CPUID + immediately before and after RDTSC. Remove use of CPUID in this case by + using RDTSCP instead of RDTSC. RDTSCP waits until all previous + instructions have been executed before reading the counter, and the + subsequent instructions after the RDTSCP normally have a data dependency + on it, so they must wait until the RDTSCP has been executed. + + RDMSR and WRMSR are instructions that cause VM-exits conditionally. On the + ACRN RTVM, most MSRs are not intercepted by the HV, so they won't cause a + VM-exit. But there are exceptions for security reasons: + + 1) read from APICID and LDR; + 2) write to TSC_ADJUST if VMX_TSC_OFFSET_FULL is zero; + otherwise, read and write to TSC_ADJUST and TSC_DEADLINE; + 3) write to ICR. + +Tip: Do not use RDMSR to access APICID and LDR in an RT critical section. + ACRN does not present a physical APICID to a guest, so APICID + and LDR are virtualized even though the LAPIC is passed through. As a result, + access to APICID and LDR can cause a VM-exit. + +Tip: Guarantee that VMX_TSC_OFFSET_FULL is zero; otherwise, do not access TSC_ADJUST and TSC_DEADLINE in the RT critical section. + ACRN uses VMX_TSC_OFFSET_FULL as the offset between vTSC_ADJUST and + pTSC_ADJUST. 
If VMX_TSC_OFFSET_FULL is zero, intercepting + TSC_ADJUST and TSC_DEADLINE is not necessary. Otherwise, they should be + intercepted to guarantee functionality. + +Tip: Utilize Preempt-RT Linux mechanisms to reduce the access of ICR from the RT core. + #. Add ``domain`` to ``isolcpus`` (``isolcpus=nohz,domain,1``) in the kernel parameters. + #. Add ``idle=poll`` to the kernel parameters. + #. Add ``rcu_nocb_poll`` along with ``rcu_nocbs=1`` to the kernel parameters. + #. Disable logging services such as journald and syslogd if possible. + + The parameters shown above are recommended for the guest Preempt-RT + Linux. For a UP RTVM, ICR interception is not a problem. But for an SMP + RTVM, IPIs may be needed between vCPUs. These tips are about reducing ICR + access. The example above assumes a dual-core RTVM, where core 0 + is a housekeeping core and core 1 is a real-time core. The ``domain`` + flag strongly isolates the RT core from the general SMP + balancing and scheduling algorithms. The parameters ``idle=poll`` and + ``rcu_nocb_poll`` can prevent the RT core from sending reschedule IPIs + to wake up tasks on core 0 in most cases. The logging service is disabled + because an IPI may be issued to the housekeeping core to notify the + logging service when there are kernel messages output on the RT core. + + .. note:: + If an ICR access is inevitable within the RT critical section, be + aware of the extra 3~4 microsecond latency for each access. + +Tip: Create and initialize the RT tasks at the beginning to avoid runtime access to control registers. + Accessing Control Registers is another cause of a VM-exit. On ACRN, accesses + to CR3 and CR8 do not cause a VM-exit. However, writes to CR0 and CR4 may cause a + VM-exit, which would happen at the spawning and initialization of a new task. 
+ +Isolating the impact of neighbor VMs +************************************ + +ACRN makes use of several technologies and hardware features to avoid +performance impact on the RTVM by neighbor VMs: + +Tip: Do not share CPUs allocated to the RTVM with other RT or non-RT VMs. + ACRN enables CPU sharing to improve the utilization of CPU resources. + However, for an RT VM, CPUs should be dedicated to it for determinism. + +Tip: Use RDT such as CAT and MBA to allocate dedicated resources to the RTVM. + ACRN supports Intel® Resource Director Technology (RDT) features such as CAT and MBA + to limit interference from components such as the GPU through the memory hierarchy. The availability of RDT is + hardware-specific. Refer to the :ref:`rdt_configuration`. + +Tip: Lock the GPU to the lowest feasible frequency. + A GPU can put a heavy load on the power/memory subsystem. Locking + the GPU frequency as low as possible can help improve RT performance + determinism. GPU frequency can usually be locked in the BIOS, but such + BIOS support is platform-specific. + +Miscellaneous +************* + +Tip: Disable timer migration on Preempt-RT Linux. + Because most tasks have their affinity set to the housekeeping core, the timer + armed by RT tasks might be migrated to the nearest busy CPU for power + saving. But it will hurt RT determinism because the timer interrupts raised + on the housekeeping core need to be resent to the RT core. The timer + migration can be disabled by the command:: + + echo 0 > /proc/sys/kernel/timer_migration + +Tip: Add ``mce=off`` to RT VM kernel parameters. + This parameter disables the MCE periodic timer and avoids a VM-exit. + +Tip: Disable the Intel processor C-State and P-State of the RTVM. + Power management of a processor could save power, but it could also impact + the RT performance because the power state is changing. The C-State and P-State + PM mechanisms can be disabled by adding ``processor.max_cstate=0 + intel_idle.max_cstate=0 intel_pstate=disable`` to the kernel parameters. 
+ +Tip: Exercise caution when setting ``/proc/sys/kernel/sched_rt_runtime_us``. + Setting ``/proc/sys/kernel/sched_rt_runtime_us`` to ``-1`` can be a + problem. A value of ``-1`` allows RT tasks to monopolize a CPU, so that + a mechanism such as ``nohz`` might get no chance to work, which can hurt + the RT performance or even (potentially) lock up a system. + +Tip: Disable the software workaround for Machine Check Error on Page Size Change. + By default, the software workaround for Machine Check Error on Page Size + Change is conditionally applied to the models that may be affected by the + issue. However, the software workaround has a negative impact on + performance. If all guest OS kernels are trusted, the + :option:`CONFIG_MCE_ON_PSC_WORKAROUND_DISABLED` option could be set for performance. + +.. note:: + The tips for preempt-RT Linux are mostly applicable to the Linux-based RT OS as well, such as Xenomai.
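As a rough illustration of the kernel-parameter tips in this patch, the following Python sketch checks a guest kernel command line against a subset of the parameters the tutorial recommends. The function name ``missing_rt_params`` and the constant ``RECOMMENDED_PARAMS`` are hypothetical names made up for this example, not part of ACRN, and the list assumes the dual-core layout from the ``isolcpus`` example (core 1 as the real-time core).

```python
"""Sketch: check a guest kernel command line for RT parameters from the tips."""

# Parameters taken from the tips in the tutorial text. This is an assumed,
# illustrative subset; it presumes a dual-core RTVM where core 1 is the
# real-time core, as in the isolcpus example.
RECOMMENDED_PARAMS = (
    "isolcpus=nohz,domain,1",
    "idle=poll",
    "rcu_nocb_poll",
    "rcu_nocbs=1",
    "mce=off",
    "processor.max_cstate=0",
    "intel_idle.max_cstate=0",
)


def missing_rt_params(cmdline: str) -> list[str]:
    """Return the recommended parameters that are absent from cmdline.

    The comparison is an exact match on whitespace-separated tokens, so a
    different isolcpus core list is reported as missing.
    """
    present = set(cmdline.split())
    return [param for param in RECOMMENDED_PARAMS if param not in present]
```

On a running Preempt-RT guest, the input could be the contents of ``/proc/cmdline``; an empty result means every parameter in the list is present. Exact token matching keeps the sketch simple, at the cost of flagging equivalent settings that are written differently.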