This first chapter introduces the reader to vRealize Operations Manager, starting with an overview of the solution, and how it fits into the rest of the VMware vRealize family of products.
The topics to be covered in this chapter include:
Operational disciplines addressed by the solution, including performance management and capacity planning
Solution architecture, addressing the needs of scalability and resilience
Product packaging and licensing
vRealize Operations Manager is the core component of vRealize Operations, which itself is a suite of integrated products that provide intelligent operational capabilities for IT departments. The solution has been built to monitor and manage not just vSphere, but also other pieces of infrastructure, such as storage, other hypervisors, operating systems, and applications. The suite comprises the following products:
vRealize Operations Manager
vRealize Hyperic and End Point Operations
vRealize Infrastructure Navigator
vRealize Configuration Manager
Other solutions within the vRealize family of products that integrate tightly with vRealize Operations are:
vRealize Log Insight
vRealize Business Standard
The main focus of this book is the vRealize Operations Manager component of vRealize Operations. However, we will look at vRealize Log Insight integration in Chapter 9, vRealize Log Insight Integration, and at End Point Operations in Chapter 10, End Point Operations.
Integrations with the other components of the vRealize portfolio, or with other hardware such as EMC, Dell, or Hitachi storage arrays, are provided by Management Packs. Management Packs and VMware's Solution Exchange will be covered in Chapter 7, vRealize Operations Manager Solutions.
As the name of the solution suggests, vRealize Operations is an operational management solution. It has been designed to address the operational disciplines of Performance, Capacity, Configuration, and Compliance.
Each of these can be thought of as being related and acting in concert with each other. Together they define the level of availability achieved by the infrastructure being managed, and whether the Service Level Agreements (SLAs) in place between the business and the IT department are being met.
For example, if there is insufficient capacity in a cluster, the performance of VMs in that cluster may deteriorate, and the service or application that these VMs support may become unavailable.
vRealize Operations uses a variety of features such as content, alerts, symptoms, management packs, and reporting to provide the required visibility and control of the infrastructure, and deliver on these operational disciplines. Let's look at them in more detail.
vRealize Operations Manager monitors the performance of managed systems and provides system administrators with intuitive dashboards that quickly visualize problems as they arise. When the performance of the systems is not as expected, the solution helps with troubleshooting by directing the administrator quickly to the root cause of the problem. This is all underpinned by analytics and content.
Every five minutes, vRealize Operations collects and stores the metric and property data about every resource it manages. The data is kept for six months at full granularity and is used by the Analytics engine to allow the system to understand normal behavior.
The frequency of data collection and the retention period can be tuned from the defaults of 5 minutes and 6 months respectively. However, care must be taken when changing these, as they can significantly affect the sizing requirements of the vRealize Operations nodes.
Every night, a set of analytics algorithms is run against every metric's historical dataset, to determine the expected behavior of each metric for the upcoming 24 hours. This expected behavior for a metric is called a Dynamic Threshold (DT). As metrics are collected and stored, they are compared against the DT to determine whether the object is exhibiting normal behavior. This is described in more detail in Figure 1.1.
The analytics are designed to look for different patterns of behavior, such as hourly, daily, weekly, monthly, and quarterly.
It will obviously take some time for vRealize Operations to learn all the expected behavior, as it needs to observe at least three data points to start seeing a trend, and many more to predict the trend with greater confidence. For example, a metric exhibiting a weekly cadence of behavior requires at least three weeks of data for a weekly trend to be detected.
The preceding simplified example shows how a DT and metric may be measured and tracked. The grey shading is the DT, and the diagram shows that during the early morning it is expecting this metric's value to be 0-10%, then 50-60% during the work day, and then back down to 0-10% for the evening. There is a short peak just before midnight, which is possibly a batch or a backup job. The black line is the observed metric and we can see that normal behavior has occurred; so in this case, there is no alerting to be done as the metric is operating normally.
If an observed metric deviates outside of the DT range, it is classed as an Anomaly and highlighted in yellow in the Metric Selectors and the associated Metric Graphs in the vRealize Operations dashboards.
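The idea of comparing an observed metric against its expected range can be sketched in a few lines of Python. This is only an illustration: the text does not describe the algorithms vRealize Operations actually runs, so the sketch substitutes a simple band of the mean plus or minus two standard deviations per hour of the day. The function names, the `width` parameter, and the sample data are all assumptions made for the example.

```python
from statistics import mean, pstdev

def dynamic_threshold(history, width=2.0):
    """Build an expected band per hour of day.

    history is a list of (hour_of_day, value) samples. The band is
    mean +/- width * standard deviation, a stand-in for the product's
    own analytics, which are not reproduced here.
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)

    bands = {}
    for hour, values in by_hour.items():
        if len(values) < 3:
            # Fewer than three observations: too early to call it a pattern.
            continue
        centre = mean(values)
        spread = pstdev(values)
        bands[hour] = (centre - width * spread, centre + width * spread)
    return bands

def is_anomaly(bands, hour, value):
    """Return True if value lies outside the band learned for that hour."""
    if hour not in bands:
        return False  # nothing learned yet, so nothing to compare against
    low, high = bands[hour]
    return value < low or value > high

# Hypothetical CPU usage samples: three mornings at 09:00, two nights at 02:00.
history = [(9, 50.0), (9, 55.0), (9, 60.0), (2, 5.0), (2, 6.0)]
bands = dynamic_threshold(history)
print(is_anomaly(bands, 9, 58.0))  # inside the learned 09:00 band
print(is_anomaly(bands, 9, 95.0))  # far outside the band
```

Note that no band is produced for 02:00, because only two samples exist for that hour; this mirrors the point made earlier that at least three data points are needed before a trend can be seen.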
The number of anomalies observed over time is also recorded for every object, and vRealize Operations uses these derived metrics to determine whether the number of anomalies being observed is significant and whether an alert should be generated.
Performance or availability problems are generally caused by something different happening with the resources within an environment, and this "something different" causes associated metrics to breach their DTs. This means that the majority of alerts that are performance or metric related will only be generated when abnormal behavior occurs. This dramatically reduces the number of alerts that IT operations receive and increases the quality of those alerts.
The content baked into vRealize Operations is how the solution creates the intelligent and meaningful alerts. There is a lot of content provided by the solution and much more content will be added with the installation of Management Packs. Custom content can also be created very easily and will be described in Chapter 5, Alerts, Symptoms, Recommendations, and Actions.
Symptom(s): They are descriptions of one or more conditions under which the alert is triggered. In the preceding example, the symptoms are that a VM is swapping to disk, has high ballooning or has memory compressed, and has high memory contention.
Recommendations: They are remediation action(s) that can be taken to resolve the symptoms. In the preceding case, the action may be to add a memory reservation or initiate a vMotion to migrate some VMs to another host or cluster with more capacity.
Actions: They act on the recommendation(s). vRealize Operations has the capability to initiate actions using Python scripts or vRealize Orchestrator workflows to carry out the recommendations. In the preceding example, the out of the box Python script can be used to set a memory reservation, or vRealize Orchestrator can be used to initiate a vMotion.
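The relationship between an alert, its symptoms, and its recommendations can be pictured as a small data structure. The following sketch models the VM memory example described above. It is not the vRealize Operations content format or API; the class names, metric keys, and threshold values are hypothetical and chosen only to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metrics = Dict[str, float]

@dataclass
class Symptom:
    name: str
    condition: Callable[[Metrics], bool]  # True when the symptom is present

@dataclass
class AlertDefinition:
    name: str
    symptoms: List[Symptom]
    recommendations: List[str] = field(default_factory=list)

    def present_symptoms(self, metrics: Metrics) -> List[str]:
        return [s.name for s in self.symptoms if s.condition(metrics)]

    def is_triggered(self, metrics: Metrics) -> bool:
        # In this sketch the alert fires only when every symptom is present.
        return len(self.present_symptoms(metrics)) == len(self.symptoms)

# Hypothetical metric keys and thresholds, for illustration only.
vm_memory_alert = AlertDefinition(
    name="VM has memory contention",
    symptoms=[
        Symptom("VM is swapping to disk",
                lambda m: m.get("swapped_kb", 0) > 0),
        Symptom("High ballooning or memory compressed",
                lambda m: m.get("balloon_pct", 0) > 10
                or m.get("compressed_kb", 0) > 0),
        Symptom("High memory contention",
                lambda m: m.get("contention_pct", 0) > 5),
    ],
    recommendations=[
        "Add a memory reservation",
        "vMotion some VMs to a host or cluster with more capacity",
    ],
)

sample = {"swapped_kb": 120, "balloon_pct": 0, "compressed_kb": 64,
          "contention_pct": 9}
if vm_memory_alert.is_triggered(sample):
    print(vm_memory_alert.recommendations)
```

An action would then be whatever carries out one of the listed recommendations, such as a script or an orchestration workflow; that step is deliberately left out of the sketch.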
Alerting based on metrics, which are outside the range of the calculated DTs, can be considered fairly generic and caused by "things happening differently". They tend to be used to troubleshoot and alert on unexpected behavior.
As well as triggering alerts based on unexpected behavior, much of the content in vRealize Operations Manager is based on specific behavior and documented best practices. For instance, a storage administrator would generally consider storage latency to be performance impacting when it reaches 20-30ms.
Content within vRealize Operations Manager can also include Hard Thresholds (HTs), such as a figure of 20-30ms for storage latency, which can trigger alerts regardless of the state of the DT for the given metrics.
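The difference between the two kinds of threshold can be shown with a short sketch. It assumes a hard threshold of 20 ms, taken from the lower end of the range quoted above, and a DT band that is simply passed in as a pair of numbers; the function name and return format are illustrative and not part of the product.

```python
def evaluate_latency(latency_ms, dt_band, hard_threshold_ms=20.0):
    """Return the list of reasons a latency sample would raise a symptom.

    dt_band is the (low, high) range expected by the Dynamic Threshold.
    The hard threshold applies regardless of what the DT expects.
    """
    low, high = dt_band
    reasons = []
    if latency_ms >= hard_threshold_ms:
        reasons.append("hard threshold")
    if latency_ms < low or latency_ms > high:
        reasons.append("dynamic threshold")
    return reasons

# A datastore that is "normally" slow: 25 ms is within its DT band,
# but the hard threshold still flags it.
print(evaluate_latency(25.0, (15.0, 30.0)))
# A normally fast datastore: 12 ms is below the hard threshold,
# but it is abnormal for this object.
print(evaluate_latency(12.0, (2.0, 8.0)))
```

The first case is the reason hard thresholds exist: behavior can be perfectly normal for an object and still be unacceptable.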
Content and alerts will be covered in much more depth in Chapter 5, Alerts, Symptoms, Recommendations, and Actions.
Capacity management is one of the most important disciplines in IT Operations. Unfortunately, as virtualization has matured, traditional capacity management techniques have tended not to keep up with the technology. My experience of working with clients with mature virtualized environments and outdated capacity management practices is that they find themselves with a lot of underutilized infrastructure, resulting in a lot of wasted resources.
Capacity remaining: Taking into account the reserved capacity for vSphere HA and the headroom buffers, it answers the question, how much capacity does a given resource have remaining? In the preceding screenshot, we can see that we have enough capacity available in this resource to support a further 32 average sized virtual machines.
Time remaining: Again, taking into account the reserved capacity for vSphere HA and the headroom buffers, it answers the question, when am I going to run out of capacity? In the following screenshot, we can see that the capacity for this resource is going to run out in 87 days and that CPU is the constrained resource.
Every object or resource in vRealize Operations can have a capacity model configured against it. This describes the metric(s) used to determine the capacity and the other factors, or constraints to be considered, such as vSphere HA. The models themselves are not configurable; however, how they are applied generally is configurable, and is managed within the policies section of vRealize Operations.
Many of the VMware and third-party Management Packs have capacity models associated with the resources they are managing. The documentation for these Management Packs usually provides the administrator detail on how the capacity of a given object type is calculated.
The policies governing the capacity management in vRealize Operations are very granular and controllable. This allows the administrator to define what combination of demand or allocation capacity policies are applied against specific resources or groups of resources. This will be covered in detail in Chapter 6, Capacity Planning and Capacity Projects.
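To make the capacity remaining and time remaining questions concrete, the following sketch works through the arithmetic for a single resource. It is a simplification: it assumes the HA reservation and headroom buffer are plain fractions of total capacity, and that demand grows linearly. The real capacity models and policies in vRealize Operations are not reproduced here, and all names and figures are hypothetical.

```python
def usable_capacity(total, ha_reserved_fraction, buffer_fraction):
    """Capacity left after setting aside the HA reservation and the buffer."""
    return total * (1 - ha_reserved_fraction) * (1 - buffer_fraction)

def vms_remaining(usable, used, avg_vm_demand):
    """How many more average sized VMs fit into the unused capacity."""
    return max(0, int((usable - used) // avg_vm_demand))

def days_remaining(usable, used, growth_per_day):
    """Days until usage reaches usable capacity, assuming linear growth.

    Returns None when usage is flat or shrinking, since capacity
    would then never run out under this model.
    """
    if growth_per_day <= 0:
        return None
    return max(0.0, (usable - used) / growth_per_day)

# Hypothetical cluster: 1,000 GHz of CPU, 25% reserved for HA, 20% buffer.
usable = usable_capacity(1000, 0.25, 0.20)
print(usable)
print(vms_remaining(600, 440, 5))     # 5 GHz per average VM
print(days_remaining(600, 440, 2))    # demand grows by 2 GHz per day
```

Running the same calculation for CPU, memory, and disk, and taking the smallest result, is one way to see which resource is the constraint.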
As well as understanding the current capacity and the time remaining, many organizations will have ongoing projects that are going to add planned workload or additional hardware to their infrastructure. A new feature, Capacity Projects, introduced in vRealize Operations 6.0, allows the administrator to define these forecasted changes in the workload or resources, and assign a date against them.
The effect on capacity and the time remaining can then be visualized and any capacity shortfalls identified. The projects can be subsequently committed and they will then be reflected in the real-time capacity reporting.
For example, if an infrastructure has the capacity for a further 50 average sized VMs, but a project is planned to implement 20 average sized VMs, the capacity dashboards, badges, and reports will all change to reflect that there is now only capacity for 30 average sized VMs.
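The arithmetic in this example can be sketched as follows. The class and field names are hypothetical, and the rule that only committed projects reduce the reported capacity is a simplification of the behavior described above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class CapacityProject:
    name: str
    planned_for: date       # date assigned to the forecasted change
    vm_demand: int          # average sized VMs the project will add
    committed: bool = False

def remaining_vms(current_remaining: int,
                  projects: List[CapacityProject],
                  include_planned: bool = False) -> int:
    """Remaining capacity in average sized VMs after projects are applied.

    Committed projects always count; uncommitted ones count only when
    include_planned is True, which models a what-if view.
    """
    remaining = current_remaining
    for project in projects:
        if project.committed or include_planned:
            remaining -= project.vm_demand
    return remaining

rollout = CapacityProject("New application rollout", date(2016, 3, 1), 20)
print(remaining_vms(50, [rollout]))                        # not yet committed
print(remaining_vms(50, [rollout], include_planned=True))  # what-if view
rollout.committed = True
print(remaining_vms(50, [rollout]))                        # now reflected
```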
The final operational disciplines being addressed are configuration and compliance. Misconfiguration of systems is the root cause of a large proportion of system outages; so ensuring that all your systems are configured the way you want them to be is one of the key weapons in ensuring up-time.
As well as ensuring up-time, there may be legal and regulatory reasons, such as PCI-DSS, for the systems to be configured in a certain way. Alternatively, there may be security or hardening standards that an organization's security department determines are essential, to ensure that the integrity of the systems is maintained. Both of these would be classed as compliance requirements.
For in-depth configuration and compliance, vRealize Configuration Manager is provided as part of vRealize Operations Advanced and Enterprise editions. However, the use of vRealize Configuration Manager is not covered in this book.
With the release of vRealize Operations 6.0, some configuration and compliance capabilities have been introduced into the vRealize Operations Manager platform. As well as collecting metrics, vRealize Operations now collects properties from the ESXi hosts and the VM containers.
These properties can be used to assess the configuration posture of the ESXi hosts and the VM containers, using the Alerts, Symptoms, Recommendations, and Actions framework.
Content has been created that reflects the vSphere Hardening Guidelines, which means that, out of the box, vRealize Operations can now report on how compliant the ESXi hosts and the VM containers are against these guidelines. The reporting is available through the alerts, views, and reports functionality, and also via the Compliance badge in the vRealize Operations dashboards.
vSphere Hardening Guidelines will be covered in Chapter 5, Alerts, Symptoms, Recommendations, and Actions.
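Conceptually, assessing configuration posture means comparing collected properties against a baseline of expected values. The sketch below shows that comparison in isolation. The property names and expected values are illustrative only and should be checked against the published hardening guidelines; the function is not part of vRealize Operations.

```python
def compliance_violations(properties, baseline):
    """Return {property: (actual, expected)} for every mismatch.

    A property that was never collected shows up with an actual
    value of None, so missing settings are reported as violations too.
    """
    violations = {}
    for key, expected in baseline.items():
        actual = properties.get(key)
        if actual != expected:
            violations[key] = (actual, expected)
    return violations

# Illustrative baseline; verify names and values against the guidelines.
baseline = {
    "isolation.tools.copy.disable": "true",
    "isolation.tools.paste.disable": "true",
}
vm_properties = {
    "isolation.tools.copy.disable": "true",
    "isolation.tools.paste.disable": "false",
}

violations = compliance_violations(vm_properties, baseline)
print("compliant" if not violations else violations)
```

In the product, each mismatch of this kind would surface as a symptom, with a recommendation describing how to bring the host or VM back into line.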
vRealize Operations Manager can also be installed on Windows or Linux. However, that needs special consideration, so is outside the scope of this book.
The basic building block of vRealize Operations is the virtual appliance node. The solution can be scaled from a single node to a maximum of 16 nodes, to support larger scale deployments or high availability (HA). Remote collector nodes can also be additionally installed to collect the metric data from remote datacenters with limited bandwidth connectivity.
Regardless of the role of a node, the same OVA is used to deploy the virtual appliance, and, as of vRealize Operations Manager 6.0, there is no longer the requirement to host remote collectors on Linux or Windows.
To a great extent, this has made the design, implementation, and management of vRealize Operations very straightforward, relative to the complexity and capabilities the solution provides.
The preceding diagram shows the different roles the nodes can take. Although they are all installed using the same virtual appliance OVA file and contain the same code, the nodes will only run the services required to fulfill their role. The roles are as follows:
If the scalability limits of a single large node are reached.
When a larger number of smaller nodes is desired. For example, the ESXi hosts for the nodes may not be able to support the size of the VM required by a large node.
When bandwidth to the remote datacenter is limited. A Remote Collector Node reduces the bandwidth requirements by approximately 65%.
If a firewall is in place between the Master Node and the vCenter server, and the appropriate ports cannot be opened.
Product and Admin User Interface: The Product UI is the main UI used to access vRealize Operations and is available on all the nodes except for Remote Collectors. The Admin UI is used for cluster administration and is available on all the nodes.
The first solution to be configured will be the connection to one or more vSphere environments, using the built-in adapter to connect to the vCenter Server(s) supporting those environments.
The solution can be extended to provide performance and health monitoring, and capacity planning, for storage, either by using the generic vRealize Operations Management Pack for Storage Devices, or one of the specific Management Packs from the storage vendor or third party developers. More information on these integrations can be found in Chapter 7, vRealize Operations Manager Solutions.
Management Packs written by VMware are included with editions of vRealize Operations, depending on the edition licensed. Management packs written by third parties are provided and supported by the third party and may require an additional license fee.
If vCloud Air or Amazon EC2 are used, Management Packs are available to connect the solution to these environments.
Integration with the other vRealize components, such as vRealize Hyperic or End Point Operations and vRealize Log Insight is also enabled using the Management Packs.
Finally, if there are remote datacenters, remote collector nodes can be configured to collect the metrics if low bandwidth conditions exist.
Every 5 minutes, vRealize Operations collects and stores about 250 metrics per VM, 500 metrics per host, and, typically, 50 or more metrics for the other object types. As a result, the total quantity of stored data, and the volume of the metrics being analyzed every night, can become significant.
An environment with 1,000 VMs will typically have over 300,000 metrics being collected every 5 minutes and analyzed every night. With the default 6 month data retention period and HA in place, that means over 30 billion data points being stored!
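The figures above can be reproduced with some simple arithmetic. The per-object metric counts come from the text; the number of hosts and other objects, the approximation of six months as 182 days, and the assumption that HA doubles the stored data are all assumptions made for this sketch.

```python
def estimate_metrics(vms, hosts, other_objects):
    """Approximate metric count using the per-object figures quoted above."""
    return vms * 250 + hosts * 500 + other_objects * 50

def stored_data_points(metrics, interval_minutes=5, retention_days=182,
                       ha=False):
    """Data points held at full granularity over the retention period.

    retention_days=182 approximates six months; ha=True assumes a
    second copy of every data point is kept.
    """
    samples_per_day = (24 * 60) // interval_minutes
    points = metrics * samples_per_day * retention_days
    return points * 2 if ha else points

# 1,000 VMs on a hypothetical 50 hosts, with 500 other objects
# (datastores, clusters, and so on).
metrics = estimate_metrics(vms=1000, hosts=50, other_objects=500)
print(metrics)                                # 300000
print(stored_data_points(metrics, ha=True))   # 31449600000
```

The same function also shows why changing the collection interval or retention period has such a large effect on sizing: halving the interval doubles the stored data.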
VMware provides detailed sizing guidelines, and it is very important that these are followed. If insufficient resources are made available for vRealize Operations, it will cease to function correctly!
The latest version of vRealize Operations Manager, version 6.1, supports up to 120,000 objects and 30,000,000 metrics, with 16 nodes in a non-HA configuration. If an HA environment is required then these figures are halved.
VMware publishes links to the most up to date sizing guidelines and spreadsheets at http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2093783.
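As a quick sanity check before consulting the full sizing guidelines, the version 6.1 maximums quoted above can be encoded in a few lines. This is only a first-pass check against the headline limits for a 16 node cluster; it does not replace the sizing spreadsheets, and the function name is our own.

```python
MAX_OBJECTS = 120_000        # 16 nodes, non-HA, version 6.1
MAX_METRICS = 30_000_000     # 16 nodes, non-HA, version 6.1

def fits_cluster_limits(objects, metrics, ha=False):
    """True if the environment is within the quoted maximums.

    With HA enabled the supported figures are halved.
    """
    divisor = 2 if ha else 1
    return (objects <= MAX_OBJECTS // divisor
            and metrics <= MAX_METRICS // divisor)

print(fits_cluster_limits(100_000, 20_000_000))           # within non-HA limits
print(fits_cluster_limits(100_000, 20_000_000, ha=True))  # too large for HA
```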
vRealize Operations is available in three editions, and these define what functionality and components you are entitled to. The three editions are Standard, Advanced, and Enterprise, and as you move up the range, more features and components are added.
The solution is evolving with every major and minor release, and as functionality is added/removed or moved from one component to another, the capability in each edition can change slightly.
The following table highlights the main capabilities of each edition, with the release of vRealize Operations 6.1. A full and current list of features available in each edition can be found on VMware's website at http://www.vmware.com/uk/products/vrealize-operations/compare.
The Standard Edition is designed for vSphere environments only, and provides the core vRealize Operations Manager platform to deliver the predictive analytics capabilities and associated smart alerts. The alerting framework of symptoms, recommendations, and actions is introduced, as is policy management, so that different SLAs and alerting can be applied to different parts of your environment.
Capacity management of the vSphere environment is also included in the Standard Edition. Within the capacity management framework, capacity projects can be defined to enable you to understand the impact of future workload changes.
Finally, the first step in configuration management is included, with the ability to assess the compliance of the vSphere hosts and the VM containers against the vSphere Hardening Guidelines.
As well as offering more features as described next, the Advanced Edition provides the additional components, vRealize Configuration Manager, vRealize Hyperic/End Point Operations, and vRealize Infrastructure Navigator.
The key enhancements in vRealize Operations Manager Advanced Edition, over the Standard Edition, are the ability to add infrastructure Management Packs and the ability to create customized dashboards and reports.
The vRealize Operations Management Pack for Storage Devices is included with the Advanced Edition. This edition also supports third party management packs, including those from Hitachi, EMC, Dell, HP, and Blue Medora (for NetApp).
This ability, to manage performance, health, and capacity of storage in the same place and alongside vSphere Management, is very compelling for the IT administrators I have worked with.
Customized dashboards and reporting mean that you can effectively visualize, report on, and export all the information within vRealize Operations in almost any way and in any context. We will cover this in detail in Chapter 3, Dashboards, Badges, and Widgets and Chapter 4, Views and Reports.
The capacity projects feature is enhanced with the ability to commit capacity projects. This means that all capacity reporting has the committed projects included in the calculation when they report on available capacity and time remaining.
From a configuration management perspective, the vSphere Hardening Guidelines capabilities introduced in the Standard Edition are enhanced, to allow the reporting of the configuration posture with an additional sub-badge, Compliance, under the Risk badge.
vRealize Infrastructure Navigator is added, which provides application discovery and visualization of dependencies between the VMs within your infrastructure. Metadata about applications running on the VMs and the groupings of the VMs as applications within vRealize Infrastructure Navigator, is exposed in vRealize Operations Manager. This can be used to automate dynamic groupings, or as the basis for the construction of alerts and symptoms. We will look at vRealize Infrastructure Navigator in Chapter 8, vRealize Infrastructure Navigator.
vRealize Hyperic/End Point Operations is also added. This additional solution allows for the collection of the OS metric and property data by vRealize Operations Manager. It also offers the capability to monitor the OS through an agent, allowing you to, for example, monitor specific Windows services and act accordingly if they become unavailable or start to consume too many resources. We will cover the transition of Hyperic to End Point Operations and look at the solution in detail in Chapter 10, End Point Operations.
Finally, vRealize Configuration Manager for vSphere is included, which provides a rich capability to manage the configuration of the vSphere hosts and the VM containers, including templates to assess against configuration and regulatory compliance requirements, such as PCI-DSS, HIPAA, and SOX.
The Enterprise Edition extends the licensing in vRealize Configuration Manager to manage the VM guests (OS and Application) and extends vRealize Hyperic/End Point Operations to manage applications such as SQL and Exchange, bringing their metrics and properties into vRealize Operations Manager.
Additional management packs with an application focus are also made available with this edition.
vSphere with Operations Management (VSOM): VSOM is a license package that includes vSphere Standard, Enterprise, or Enterprise Plus Edition, combined with vRealize Operations Standard Edition. It is available on a per-CPU basis.
vRealize Operations Insight: This is an upgrade to VSOM that upgrades vRealize Operations to the Advanced Edition and adds Log Insight for all the licensed CPUs. Again, this license is provided on a per-CPU basis.
vCloud Suite: There are 3 editions of vCloud Suite - vCloud Suite Standard includes vRealize Operations Standard, vCloud Suite Advanced includes vRealize Operations Advanced, and vCloud Suite Enterprise includes vRealize Operations Enterprise. All are licensed on a per-CPU basis.
vRealize Suite: There are 2 editions of vRealize Suite - vRealize Suite Advanced includes vRealize Operations Advanced, and vRealize Suite Enterprise includes vRealize Operations Enterprise. Both editions of vRealize Suite also include licensing for Log Insight and are available either on a per-CPU basis or in packs of 25 OSIs.
Per-CPU and per OSI can be mixed, as long as every resource being managed has a valid license associated with it.
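When planning a mixed license model, it can help to total up what each part of the estate needs. The sketch below assumes one per-CPU license per physical socket and OSI licenses sold in packs of 25, as described above; the function names and the example estate are hypothetical, and the Product Guide remains the authority on what each license actually covers.

```python
import math

OSI_PACK_SIZE = 25

def osi_packs_needed(osi_count):
    """Number of 25-OSI packs needed to cover osi_count instances."""
    return math.ceil(osi_count / OSI_PACK_SIZE)

def cpu_licenses_needed(sockets_per_host):
    """One per-CPU license per socket across the listed hosts."""
    return sum(sockets_per_host)

# Hypothetical estate: four dual-socket hosts licensed per CPU, plus
# 60 operating system instances elsewhere licensed per OSI.
print(cpu_licenses_needed([2, 2, 2, 2]))  # 8
print(osi_packs_needed(60))               # 3
```

Because packs cannot be split, 60 OSIs require three packs, leaving 15 instances of headroom for growth.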
vRealize Operations Advanced and Enterprise editions can be mixed on a single vRealize Operations Manager cluster. This is to allow you to use vRealize Operations Advanced to manage your entire estate and have a subset, perhaps your business critical applications, managed with the additional functionality provided by vRealize Operations Enterprise.
The latest information on product licensing is available in the VMware Product Guide at http://www.vmware.com/files/pdf/vmware-product-guide.pdf.
In this chapter, we introduced vRealize Operations Manager and explained how it fits with the rest of the vRealize family. We looked at the operational disciplines of performance, capacity, configuration, and compliance, how vRealize Operations addresses them, and the benefits to the administrator.
We then looked at the architecture, its scalability and resilience, and finally covered the somewhat complex topics of editions and licensing.
In the next chapter, we will look at how to install and administer vRealize Operations Manager.