Causes of virtual machine performance problems
In a perfect virtual infrastructure, you will never experience any performance problems and everything will work well within the budget that you allocated. But when problems do arise in the datacenter you've designed, this section should help you identify and resolve them more easily.
CPU performance issues
The following is a summary of some of the common CPU performance issues you may experience in your virtual infrastructure. While it is not an exhaustive list of every possible CPU problem, it can help guide you in the right direction when solving CPU-related performance issues:
- High ready time: When your ready time is above 10 percent, this could indicate CPU contention and could be impacting the performance of CPU-intensive applications. This is not a guarantee of a problem; less sensitive applications can report high values and still perform well within guidelines. Note that ready time is reported in milliseconds and must be converted to a percentage of the sample interval; see KB 2002181 for the conversion.
- High costop time: The costop time will often correlate to contention in multi-vCPU virtual machines. Costop time exceeding 10 percent could cause challenges when vSphere tries to schedule all vCPUs in your multi-vCPU servers.
- CPU limits: As discussed earlier, you will often experience performance problems if your virtual machine tries to use more resources than have been configured in your limits.
- Host CPU saturation: When the vSphere host utilization runs above 80 percent, you may experience host saturation issues. This can introduce performance problems across the host as the CPU scheduler tries to assign resources to virtual machines.
- Guest CPU saturation: This is experienced on high utilization of vCPU resources within the operating system of your virtual machines. This can be mitigated, if required, by adding additional vCPUs to improve the performance of the application.
- Misconfigured affinity: By default, the vSphere scheduler is free to run vCPUs on any physical CPU; problems can be encountered when a virtual machine is manually pinned to a specific physical CPU. This is often experienced when a VM is created with affinity settings and then cloned. VMware advises against manually configuring affinity.
- Oversizing vCPUs: When assigning multiple vCPUs to a virtual machine, ensure that the operating system can take advantage of the additional CPUs and threads, and that your applications can actually use them. The overhead associated with unused vCPUs can impact other applications and resource scheduling within the vSphere host.
- Low guest usage: Poor performance accompanied by low CPU utilization often indicates that the bottleneck lies elsewhere, such as I/O or memory. An underused CPU is a good guiding indicator that you should investigate other resources or the configuration itself.
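As a worked example of the ready-time conversion mentioned above, the following sketch applies the formula from KB 2002181: ready percentage = ready summation (ms) / (sample interval (s) × 1000) × 100, spread across the VM's vCPUs. The function and parameter names here are illustrative, not part of any VMware API; the 20-second default matches vSphere's real-time performance charts.

```python
def ready_percent(ready_ms, interval_s=20, num_vcpus=1):
    """Convert a CPU ready summation (in ms) to a per-vCPU percentage.

    interval_s defaults to 20 s, the real-time chart sample interval;
    per KB 2002181, use 300 s when reading past-day rollup charts.
    """
    return (ready_ms / (interval_s * 1000.0)) / num_vcpus * 100.0

# A VM reporting 2,000 ms of ready time in a 20 s real-time sample
# sits right at the 10 percent warning threshold discussed above.
print(ready_percent(2000))                # 10.0
print(ready_percent(2000, num_vcpus=4))   # 2.5 per vCPU
```

Note how the same 2,000 ms summation looks far less alarming per vCPU on a 4-vCPU machine, which is why the raw millisecond figure alone can mislead.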
Memory performance issues
Additionally, the following list is a summary of some common memory performance issues you may experience in your virtual infrastructure. Because of the way VMware vSphere handles memory management, there is a unique set of challenges in troubleshooting and resolving performance problems as they arise:
- Host memory: Host memory is a finite resource. While VMware vSphere incorporates some creative mechanisms to maximize the amount of available memory through features such as page sharing, memory reclamation, and resource-allocation controls, several of these memory features only take effect when the host is under stress.
- Transparent page sharing: This is the method by which redundant copies of memory pages are eliminated. TPS, enabled by default, operates at the granularity of 4 KB pages. When virtual machines are backed by large pages (2 MB instead of 4 KB), vSphere will not attempt to share them, as the likelihood of two 2 MB pages being identical is far lower than for 4 KB pages. As a result, a system can reach memory overcommit without any TPS savings; if memory stress then occurs, vSphere may break the 2 MB pages into 4 KB pages so that TPS can consolidate them.
- Host memory consumed: When measuring utilization for capacity planning, the host memory consumed value can be deceiving, as features such as TPS mean it does not always reflect actual memory utilization. Active memory or memory demand gives a more accurate picture of the memory actually in use and should be used as the guide instead.
- Memory over-allocation: Memory over-allocation will usually be fine for most applications in most environments. It is typically safe to overcommit memory by up to 20 percent, especially with similar applications and operating systems. The more similarity there is between your applications and guests, the higher you can take that number.
- Swap to disk: If you overcommit memory too aggressively, you may start to experience memory swapping to disk, which can result in performance problems if not caught early enough. In those circumstances, it is best to identify which guests are swapping to disk so you can correct either the application or the infrastructure as appropriate.
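To make the 20 percent guideline above concrete, here is a minimal sketch that computes how far the configured VM memory on a host exceeds its physical RAM. The names are illustrative, and the calculation deliberately ignores per-VM overhead, reservations, and TPS savings; it is a rough capacity-planning check, not a substitute for monitoring active memory.

```python
def memory_overcommit_pct(vm_memory_gb, host_memory_gb):
    """Percentage by which total configured VM memory exceeds host RAM.

    Ignores per-VM overhead, reservations, and TPS savings; treat the
    result as a rough planning figure only.
    """
    total_gb = sum(vm_memory_gb)
    return max(0.0, (total_gb - host_memory_gb) / host_memory_gb * 100.0)

# Four VMs totalling 192 GB on a 160 GB host: 20 percent overcommit,
# right at the edge of the guideline discussed above.
print(memory_overcommit_pct([64, 64, 32, 32], 160))  # 20.0
```

Past this point, watch the swap metrics closely: overcommit only stays "free" while reclamation can keep up.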
For additional details on vSphere Memory management and monitoring, see KB 2017642.
Storage performance issues
When it comes to storage performance issues within your virtual machine infrastructure, there are a few areas you will want to pay particular attention to. Although most storage-related problems you are likely to experience will originate in your backend infrastructure, the following are a few things to examine when determining whether the problem lies with the VM's storage or with the SAN itself:
- Storage latency: Latency experienced at the storage level is usually expressed as a combination of the latency of the storage stack, guest operating system, VMkernel virtualization layer, and the physical hardware. Typically, if you experience slowness and are noticing high latencies, one or more aspects of your storage could be the cause.
- Three layers of latency: ESXi and vCenter typically report on three primary latencies. These are Guest Average Latency (GAVG), Device Average Latency (DAVG), and Kernel Average Latency (KAVG).
- Guest Average Latency (GAVG): This value is the total amount of latency that ESXi can detect. It is not necessarily the total latency the application experiences, only what ESXi reports. So if GAVG shows 5 ms of latency but a tool inside the guest such as Perfmon reports 50 ms of storage latency, something within the guest operating system is adding a 45 ms penalty. In circumstances such as these, you should investigate the VM and its operating system to troubleshoot.
- Device Average Latency (DAVG): Device Average Latency covers the physical layer: the storage adapter (HBA) or interface and its communication with the backend storage array. Problems experienced here tend to lie with the storage itself and are less easily troubleshooted within ESXi; exceptions include firmware or adapter drivers, which may introduce problems, and the queue depth of your HBA. More details on queue depth can be found in KB 1267.
- Kernel Average Latency (KAVG): Kernel Average Latency is not measured directly; it is derived as KAVG = GAVG - DAVG, so treat the metric with some care. KAVG should typically be near zero; small transient values represent I/O moving through the kernel queues and can generally be dismissed. When KAVG is consistently 2 ms or greater, it may indicate a storage performance issue, and your VMs, adapters, and queues should be reviewed for bottlenecks or problems.
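The relationship between the three latency layers, including the in-guest penalty from the GAVG example above, can be sketched as follows. The function and key names are illustrative, and the 2 ms KAVG threshold mirrors the rule of thumb given in the list, not any vSphere API.

```python
def decompose_latency(gavg_ms, davg_ms, guest_observed_ms=None):
    """Split ESXi-reported storage latency into its layers.

    KAVG is derived (GAVG - DAVG), not measured directly. If a tool
    inside the guest (such as Perfmon) observes higher latency than
    GAVG, the difference is being added inside the guest OS.
    """
    kavg = gavg_ms - davg_ms
    layers = {
        "DAVG": davg_ms,    # device/physical layer
        "KAVG": kavg,       # kernel queueing (derived)
        "GAVG": gavg_ms,    # total latency seen by ESXi
        "kernel_queueing_suspect": kavg >= 2.0,
    }
    if guest_observed_ms is not None:
        layers["in_guest_penalty"] = guest_observed_ms - gavg_ms
    return layers

# GAVG of 5 ms but 50 ms observed by Perfmon: a 45 ms in-guest penalty,
# pointing the investigation at the guest OS rather than the SAN.
print(decompose_latency(5.0, 4.5, guest_observed_ms=50.0))
```

Reading the output this way quickly tells you which layer to chase: DAVG points at the array and HBA, KAVG at kernel queues, and a large in-guest penalty at the VM itself.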
The following are some KB articles that can help you further troubleshoot virtual machine storage:
- Using esxtop to identify storage performance issues (KB 1008205)
- Troubleshooting ESX/ESXi virtual machine performance issues (KB 2001003)
- Testing virtual machine storage I/O performance for VMware ESX and ESXi (KB 1006821)
Network performance issues
Lastly, when it comes to addressing network performance issues, there are a few areas you will want to consider. Similar to the storage performance issues, a lot of these are often addressed by the backend networking infrastructure. However, there are a few items you'll want to investigate within the virtual machines to ensure network reliability.
- Networking error, IP already assigned to another adapter: This is a common problem experienced after V2V or P2V migrations, which leave behind ghosted network adapters. VMware KB 1179 guides you through removing these ghosted network adapters.
- Speed or duplex mismatch within the OS: Left at the defaults, the virtual machine will use auto-negotiation to get maximum network performance; manually configuring a lower speed or half duplex can limit virtual machine throughput.
- Choose the correct network adapter for your VM: Newer operating systems should support the VMXNET3 adapter while some virtual machines, either legacy or upgraded from previous versions, may run older network adapter types. See KB 1001805 to help decide which adapters are correct for your usage.
The following are some KB articles that can help you further troubleshoot virtual machine networking:
- Troubleshooting virtual machine network connection issues (KB 1003893)
- Troubleshooting network performance issues in a vSphere environment (KB 1004097)
With this article, you should be able to inspect existing VMs while following design principles that will lead to correctly sized and deployed virtual machines. You also should have a better understanding of when your configuration is meeting your needs, and how to go about identifying performance problems associated with your VMs.