vSphere Design Best Practices
Apply industry-accepted best practices to design reliable high-performance datacenters for your business needs with this book and ebook.
In this article by Christopher Kusek and Brian Bolander, the authors of vSphere Design Best Practices, you will learn the concepts behind the virtual machines themselves.
Causes of virtual machine performance problems
In a perfect virtual infrastructure, you will never experience any performance problems and everything will work well within the budget that you allocated. Should problems arise even in the perfect utopian datacenter you've designed, this section should help you identify and resolve them more easily.
CPU performance issues
The following is a summary of some of the common CPU performance issues you may experience in your virtual infrastructure. While it is not an exhaustive list of every possible problem you can experience with CPUs, it can help guide you in the right direction to solve CPU-related performance issues:
- High ready time: A ready time above 10 percent could indicate CPU contention and may be impacting the performance of CPU-intensive applications. This is not a guarantee of a problem, however; applications that are less sensitive can report high values and still perform well within guidelines. CPU ready is reported in milliseconds; to convert it to a percentage, see KB 2002181.
- High co-stop time: Co-stop time often correlates with contention in multi-vCPU virtual machines. Co-stop time exceeding 10 percent could cause challenges when vSphere tries to schedule all the vCPUs of your multi-vCPU servers.
- CPU limits: As discussed earlier, you will often experience performance problems if your virtual machine tries to use more resources than have been configured in your limits.
- Host CPU saturation: When the vSphere host utilization runs at above 80 percent, you may experience host saturation issues. This can introduce performance problems across the host as the CPU scheduler tries to assign resources to virtual machines.
- Guest CPU saturation: This is experienced on high utilization of vCPU resources within the operating system of your virtual machines. This can be mitigated, if required, by adding additional vCPUs to improve the performance of the application.
- Misconfigured affinity: By default, vSphere does not pin a virtual machine to particular physical CPUs, and the scheduler places vCPUs wherever it sees fit; if affinity is manually configured to bind a VM to a specific physical CPU, problems can be encountered. This often happens when a VM is created with affinity settings and then cloned. VMware advises against manually configuring affinity.
- Oversizing vCPUs: When assigning multiple vCPUs to a virtual machine, ensure that the operating system can take advantage of the additional CPUs and threads, and that your applications support them. The overhead associated with unused vCPUs can impact other applications and resource scheduling within the vSphere host.
- Low guest usage: Poor performance combined with low CPU utilization often points to a bottleneck elsewhere, such as I/O or memory. An underused CPU is a good indicator that another resource, or even a configuration issue, is the real constraint.
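The thresholds in the list above can be checked with a few lines of Python. The following is a minimal sketch, not a VMware tool: the function and constant names are hypothetical, the chart intervals are assumed to be the ones KB 2002181 uses for its millisecond-to-percent conversion, and the 10 percent and 80 percent limits are simply the figures quoted in this article.

```python
# Chart update intervals in seconds (assumed to match KB 2002181).
CHART_INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}

# Thresholds quoted in the list above; tune them for your own environment.
READY_LIMIT_PCT = 10.0
COSTOP_LIMIT_PCT = 10.0
HOST_UTIL_LIMIT_PCT = 80.0


def summation_to_percent(summation_ms, chart="realtime"):
    """Convert a summation counter in milliseconds to a percentage of the interval."""
    interval_ms = CHART_INTERVAL_SECONDS[chart] * 1000
    return summation_ms * 100.0 / interval_ms


def cpu_warnings(ready_ms, costop_ms, host_util_pct, chart="realtime"):
    """Return a human-readable warning for every threshold that is exceeded."""
    warnings = []
    ready_pct = summation_to_percent(ready_ms, chart)
    costop_pct = summation_to_percent(costop_ms, chart)
    if ready_pct > READY_LIMIT_PCT:
        warnings.append(f"ready time {ready_pct:.1f}% is above {READY_LIMIT_PCT:.0f}%")
    if costop_pct > COSTOP_LIMIT_PCT:
        warnings.append(f"co-stop time {costop_pct:.1f}% is above {COSTOP_LIMIT_PCT:.0f}%")
    if host_util_pct > HOST_UTIL_LIMIT_PCT:
        warnings.append(f"host utilization {host_util_pct:.1f}% is above {HOST_UTIL_LIMIT_PCT:.0f}%")
    return warnings


if __name__ == "__main__":
    # Hypothetical real-time sample: 3000 ms ready, 500 ms co-stop, host at 85%.
    for line in cpu_warnings(ready_ms=3000, costop_ms=500, host_util_pct=85.0):
        print(line)
```

A warning from this sketch is only a prompt to investigate; as noted above, a high ready value does not guarantee that an application is suffering.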
Memory performance issues
Additionally, the following list is a summary of some common memory performance issues you may experience in your virtual infrastructure. Because of the way VMware vSphere handles memory management, troubleshooting and resolving memory performance problems brings a unique set of challenges:
- Host memory: Host memory is both a finite and very limited resource. While VMware vSphere incorporates some creative mechanisms to leverage and maximize the amount of available memory through features such as page sharing, memory management, and resource-allocation controls, there are several memory features that will only take effect when the host is under stress.
- Transparent page sharing: Transparent page sharing (TPS) is the method by which redundant copies of memory pages are eliminated. TPS, enabled by default, breaks regular pages into 4 KB chunks for better performance. When virtual machines use large physical pages (2 MB instead of 4 KB), vSphere will not attempt TPS on them, because two 2 MB pages are far less likely to be identical than two 4 KB pages. As a result, a system can become overcommitted on memory and experience performance problems; once the host comes under memory stress, vSphere may break the 2 MB pages into 4 KB chunks so that TPS can consolidate them.
- Host memory consumed: When measuring utilization for capacity planning, the value of host memory consumed can be deceiving, as it does not always reflect actual memory utilization. Active memory or memory demand is a better guide, because features such as TPS affect how much host memory is reported as consumed.
- Memory over-allocation: Memory over-allocation will usually be fine for most applications in most environments. It is typically safe to run with more than 20 percent memory over-allocation, especially with similar applications and operating systems. The more similarity you have between your applications and environment, the higher you can take that number.
- Swap to disk: If you over-allocate memory too aggressively, you may start to experience memory swapping to disk, which can result in performance problems if not caught early enough. In those circumstances, it is best to evaluate which guests are swapping to disk so that you can correct either the application or the infrastructure as appropriate.
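The over-allocation and swap checks described above come down to simple arithmetic. The sketch below is illustrative only: the function names, VM names, and sample figures are hypothetical, and in practice the inputs would come from your own inventory and monitoring data.

```python
def over_allocation_percent(vm_configured_mb, host_physical_mb):
    """Percent by which configured VM memory exceeds host memory (negative if under)."""
    total_configured = sum(vm_configured_mb)
    return (total_configured - host_physical_mb) * 100.0 / host_physical_mb


def swapping_guests(swapped_mb_by_vm):
    """Names of guests with swapped memory, largest swap usage first."""
    swapping = {name: mb for name, mb in swapped_mb_by_vm.items() if mb > 0}
    return sorted(swapping, key=swapping.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical host with 64 GB of RAM running eight 8 GB and four 4 GB VMs.
    configured = [8192] * 8 + [4096] * 4
    print(f"over-allocation: {over_allocation_percent(configured, 65536):.0f}%")

    # Hypothetical per-VM swapped memory in MB.
    swapped = {"web01": 0, "db01": 512.0, "app01": 64.0}
    print("guests swapping to disk:", swapping_guests(swapped))
```

Sorting the swapping guests by size puts the worst offenders first, which is usually where the evaluation of application versus infrastructure should start.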
For additional details on vSphere Memory management and monitoring, see KB 2017642.
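To see why transparent page sharing finds fewer matches with 2 MB pages than with 4 KB pages, consider the following toy model. It is not how ESXi implements TPS; it simply hashes fixed-size chunks of two hypothetical memory images that differ in a single 4 KB page and counts how many chunks are unique at each page size.

```python
import hashlib
from typing import Optional

KB = 1024
MB = 1024 * KB


def count_pages(images, page_size):
    """Return (total_pages, unique_pages) across all memory images."""
    seen = set()
    total = 0
    for image in images:
        for offset in range(0, len(image), page_size):
            seen.add(hashlib.sha256(image[offset:offset + page_size]).digest())
            total += 1
    return total, len(seen)


def build_image(changed_page: Optional[int] = None) -> bytes:
    """Build a 2 MB image of 512 distinct 4 KB pages, optionally altering one page."""
    pages = [i.to_bytes(2, "big") * (4 * KB // 2) for i in range(512)]
    if changed_page is not None:
        pages[changed_page] = b"\xff" * (4 * KB)
    return b"".join(pages)


if __name__ == "__main__":
    vm_a = build_image()
    vm_b = build_image(changed_page=7)  # differs from vm_a in exactly one 4 KB page

    for label, size in (("4 KB", 4 * KB), ("2 MB", 2 * MB)):
        total, unique = count_pages([vm_a, vm_b], size)
        print(f"{label} pages: {total} total, {unique} unique, {total - unique} shareable")
```

In this model, a single changed 4 KB page is enough to make the two 2 MB pages differ, so nothing can be shared at the larger size, while almost every 4 KB page still matches.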
Storage performance issues
When it comes to storage performance issues within your virtual machine infrastructure, there are a few areas you will want to pay particular attention to. Although most storage-related problems you are likely to experience will be more reliant upon your backend infrastructure, the following are a few that you can look at when identifying if it is the VM's storage or the SAN itself:
- Storage latency: Latency experienced at the storage level is usually expressed as a combination of the latency of the storage stack, guest operating system, VMkernel virtualization layer, and the physical hardware. Typically, if you experience slowness and are noticing high latencies, one or more aspects of your storage could be the cause.
- Three layers of latency: ESXi and vCenter typically report on three primary latencies. These are Guest Average Latency (GAVG), Device Average Latency (DAVG), and Kernel Average Latency (KAVG).
- Guest Average Latency (GAVG): This value is the total amount of latency that ESXi is able to detect. It is not necessarily the total latency being experienced; it is only the figure that ESXi can report on. So if you're seeing a 5 ms latency with GAVG while a performance tool such as Perfmon reports a storage latency of 50 ms, something within the guest operating system is incurring a penalty of 45 ms. In circumstances such as these, you should investigate the VM and its operating system to troubleshoot.
- Device Average Latency (DAVG): Device Average Latency covers the more physical side of things aligned with the device; for instance, whether the storage adapter, HBA, or interface is experiencing latency in communicating back to the storage array. Problems seen here tend to lie with the storage itself and are less likely to be something that can be easily troubleshot within ESXi. Exceptions include firmware or adapter drivers that may be introducing problems, and the queue depth of your HBA. More details on queue depth can be found at KB 1267.
- Kernel Average Latency (KAVG): Kernel Average Latency is not measured directly; it is calculated as "Total Latency - DAVG = KAVG", where the total latency is the GAVG value. When using this metric, you should be wary of a few values. KAVG should typically be zero; occasional values slightly above zero may simply be I/O moving through the kernel queue and can generally be dismissed. When KAVG is consistently 2 ms or greater, this may indicate a storage performance issue, and your VMs, adapters, and queues should be reviewed for bottlenecks or problems.
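The relationships between the three latencies can be written down directly. The following sketch uses a hypothetical `LatencySample` class whose field names are illustrative; it derives KAVG from GAVG and DAVG, reproduces the 50 ms versus 5 ms guest example from the list above, and applies the 2 ms rule of thumb across a series of samples.

```python
from dataclasses import dataclass
from typing import Optional

KAVG_REVIEW_MS = 2.0  # rule-of-thumb threshold quoted in the text


@dataclass
class LatencySample:
    gavg_ms: float                             # total latency seen by ESXi
    davg_ms: float                             # latency attributed to the device
    guest_observed_ms: Optional[float] = None  # e.g. as reported by Perfmon in the guest

    @property
    def kavg_ms(self) -> float:
        """Kernel latency: total latency minus device latency."""
        return self.gavg_ms - self.davg_ms

    @property
    def guest_overhead_ms(self) -> Optional[float]:
        """Latency added inside the guest OS, if a guest-side figure is known."""
        if self.guest_observed_ms is None:
            return None
        return self.guest_observed_ms - self.gavg_ms


def kavg_consistently_high(samples) -> bool:
    """True when every sample in a non-empty series has KAVG at or above the threshold."""
    samples = list(samples)
    return bool(samples) and all(s.kavg_ms >= KAVG_REVIEW_MS for s in samples)


if __name__ == "__main__":
    sample = LatencySample(gavg_ms=5.0, davg_ms=4.5, guest_observed_ms=50.0)
    print(f"KAVG: {sample.kavg_ms} ms, guest overhead: {sample.guest_overhead_ms} ms")
```

Requiring every sample to exceed the threshold mirrors the "consistently" wording above, so that a single spike does not trigger a review.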
The following are some KB articles that can help you further troubleshoot virtual machine storage:
- Using esxtop to identify storage performance issues (KB 1008205)
- Troubleshooting ESX/ESXi virtual machine performance issues (KB 2001003)
- Testing virtual machine storage I/O performance for VMware ESX and ESXi (KB 1006821)
Network performance issues
Lastly, when it comes to addressing network performance issues, there are a few areas you will want to consider. Similar to the storage performance issues, a lot of these are often addressed by the backend networking infrastructure. However, there are a few items you'll want to investigate within the virtual machines to ensure network reliability.
- Networking error, IP already assigned to another adapter: This is a common problem after V2V or P2V migrations, which can leave ghosted network adapters behind. VMware KB 1179 walks you through the steps to remove these ghosted network adapters.
- Speed or duplex mismatch within the OS: Left at defaults, the virtual machine will use auto-negotiation to get maximum network performance; if configured down from that speed, this can introduce virtual machine limitations.
- Choose the correct network adapter for your VM: Newer operating systems should support the VMXNET3 adapter while some virtual machines, either legacy or upgraded from previous versions, may run older network adapter types. See KB 1001805 to help decide which adapters are correct for your usage.
The following are some KB articles that can help you further troubleshoot virtual machine networking:
- Troubleshooting virtual machine network connection issues (KB 1003893)
- Troubleshooting network performance issues in a vSphere environment (KB 1004097)
With this article, you should be able to inspect existing VMs while following design principles that will lead to correctly sized and deployed virtual machines. You also should have a better understanding of when your configuration is meeting your needs, and how to go about identifying performance problems associated with your VMs.
About the Authors:
Brian Bolander spent 13 years on active duty in the United States Air Force. A veteran of Operation Enduring Freedom, he was honorably discharged in 2005. He immediately returned to Afghanistan and worked on datacenter operations for the US Department of Defense (DoD) at various locations in southwest Asia. Invited to join a select team of internal consultants and troubleshooters responsible for operations in five countries, he was also the project manager for what was then the largest datacenter built in Afghanistan.
After leaving Afghanistan in 2011, he managed IT operations in a DoD datacenter in the San Francisco Bay Area. His team was responsible for dozens of multimillion dollar programs running on VMware and supporting users across the globe.
He scratched his adrenaline itch in 2011 when he went back "downrange", this time directing the premier engineering and installation team for the DoD in Afghanistan. It was his privilege to lead this talented group of engineers who were responsible for architecting virtual datacenter installations, IT project management, infrastructure upgrades, and technology deployments on VMware for the entire country from 2011 to 2013.
Selected as a vExpert in 2014, he is currently supporting Operation Enduring Freedom as a senior virtualization and storage engineer.
He loves his family, friends, and rescue mutts. He digs science, technology, and geek gear. He's an audiophile, a horologist, an avid collector, a woodworker, an amateur photographer, and a virtualization nerd.
Christopher Kusek had a unique opportunity presented to him in 2013, to take the leadership position responsible for theater-wide infrastructure operations for the war effort in Afghanistan. Leveraging his leadership skills and expertise in virtualization, storage, applications, and security, he's been able to provide enterprise-quality service while operating in an environment that includes the real and regular challenges of heat, dust, rockets, and earthquakes.
He has over 20 years' experience in the industry, with virtualization experience running back to the pre-1.0 days of VMware. He has shared his expertise with many far and wide through conferences, presentations, #CXIParty, and sponsoring or presenting at community events and outings whether it is focused on storage, VMworld, or cloud.
He is the author of VMware vSphere 5 Administration Instant Reference, Sybex, 2012, and VMware vSphere Performance: Designing CPU, Memory, Storage, and Networking for Performance-Intensive Workloads, Sybex, 2014. He is a frequent contributor to VMware Communities' Podcasts and vBrownbag, and has been an active blogger for over a decade.
A proud VMware vExpert and huge supporter of the program and growth of the virtualization community, Christopher continues to find new ways to reach out and spread the joys of virtualization and the transformative properties it has on individuals and businesses alike. He was named an EMC Elect in 2013 and 2014, and continues to contribute to the storage community, whether directly or indirectly, with analysis and regular review.
He continues to update his blog with useful stories of virtualization and storage and his adventures throughout the world, which currently include stories of his times in Afghanistan. You can read his blog at http://pkguild.com or more likely catch him on Twitter; his Twitter handle is @cxi.
When he is not busy changing the world one virtual machine at a time, or FaceTiming with his family on the other side of the world, he's trying to find awesome vegan food in the world at large or somewhat edible food for a vegan in a war zone.