System Center 2012 R2 Operations Manager (SCOM) can be a very in-depth monitoring solution owing to its ability to gain insight into many areas of an infrastructure. SCOM can help you understand your environment from the physical hardware of your network switches, SANs, and physical servers right through to the operating systems, whether Unix, Linux, or Windows, and the applications running either within your data centers or externally in public cloud solutions such as Microsoft Azure.
Because System Center 2012 R2 Operations Manager is such a wide-reaching and powerful solution, it is very important, before implementing it, to understand what the various roles and features of SCOM are, how they function, and, more importantly, how they relate to the business and its needs.
This chapter will help break down these areas to enable you to gain that knowledge as well as show you what areas of your design can be architected to provide high availability and how to size your environment.
While in small or test environments you may consider installing all of the roles on a single server, large or production environments will often have the various roles spread across multiple servers for both performance and availability purposes.
The management server is the brains of SCOM. It is a role that coordinates management pack distribution, monitoring and rule application, as well as agent communication and the interface between the system and you, the admin, via the console.
Every deployment of SCOM will contain at least one management server, but adding more management servers will allow you to start to scale out the implementation for both performance and availability.
When you implement the first management server, SCOM creates what is known as a management group. This can be seen as a control boundary allowing you to select which servers are managed by this implementation of SCOM and, if required, to implement multiple management groups, each with their own sets of management servers, for different purposes.
The operational database is the database backend used by the management servers for short-term storage of data and processing of information related to the management packs implemented within your deployment and their rules, monitors, and overrides (together, the system configuration).
Every management group requires one unique operational database.
The SCOM data warehouse consists of another SQL database but is used for long-term storage of data, the default retention period being 400 days. Data is written to the data warehouse at the same time as it is written to the operational database, but over time, data in the data warehouse, such as performance metrics, is aggregated rather than stored as raw data.
The data warehouse database is a required component for a SCOM management group, but it can be shared between different management groups, allowing for a centralized data warehouse to be implemented, providing you with a rolled-up and consolidated view of the health and performance of the different monitored areas of your environment.
Refer to the Sizing the Environment recipe in this chapter for further information on connected management groups.
The reporting server, while an optional extra, is highly recommended as this is the role that provides access to the reporting features of SCOM. It requires a server with a dedicated SQL Server Reporting Services (SSRS) instance to be designated as the reporting server. SCOM requires a dedicated SSRS instance as it will modify its security to match that of the role-based access model used for SCOM, potentially removing access to any reports you might have previously had on the SSRS instance. It is not recommended or supported to use this SSRS instance for any purpose other than for SCOM reporting.
A gateway can be placed outside of the security boundary where the main management servers reside, such as in an isolated Demilitarized Zone (DMZ), a workgroup, or a domain environment without trusts established.
A gateway within an untrusted Active Directory environment can communicate with the agents there and then act as the communication point from that environment back to the management group, using certificates to secure the communication channel.
A gateway can also be used to manage non-domain-joined devices; in this case, all agents communicate with the gateway using certificates, and the gateway in turn communicates back to the management group over certificate-secured channels.
Agents can be set to communicate with the gateway instead of directly with the management servers. This is useful for low-bandwidth remote sites where instead of having multiple agents reporting data directly across the network link, they report their data to the gateway, which can then compress that data and send it in batches instead. The compression can be as much as 50 percent.
SCOM offers users the ability to access a web-based version of the operator console, rendered using Silverlight. This role can be deployed either on a separate server or on an existing management server. It is worth noting, however, that if the web console is installed on a separate server, the management server role cannot be deployed to that same server afterwards.
Alongside the web-based operator console, the web portals for Application Performance Monitoring (APM) are also deployed as part of the web console role. These consoles give access to the rich diagnostic and performance monitoring that is gathered for .NET and Java applications.
Audit Collection Services (ACS) allows security events generated by audit policies applied to monitored systems to be collected to a central location for review and monitoring. When enabled within your environment, a service installed as part of the SCOM agent called ACS Forwarder will send all security events to the ACS Collector.
The ACS Collector is a role that is enabled on a management server and will then filter and write to the ACS Database any security events you define as being monitored. The ACS Database is a dedicated database for security events. Each ACS Collector will require its own individual database.
System Center 2012 R2 Operations Manager uses either Agent-based or Agentless communication to collect data from servers and devices. Servers with agents will push this data to the management servers or gateways that they have been assigned to, while agentless managed servers and devices, such as network switches, will generally have their information pulled from management servers.
The flow of information and/or connection points around the infrastructure can be visually represented as follows:
SCOM uses a mechanism known as management packs to control what type of information is collected and how to react to this information. These management packs are XML formatted files that define rules, monitors, scripts, and workflows for SCOM to use and essentially tell it how an aspect of your infrastructure should be monitored.
Most of these management packs will come from the suppliers of the software and devices used within your infrastructure, but there is nothing to stop you from creating your own management packs to fill a gap in monitoring if you find one. Chapter 7, Authoring Custom Monitoring Solutions with System Center 2012 R2 Operations Manager, will detail how to approach this.
You are also able to override predefined options within management packs to better tune the monitoring for your environment. Again, this is covered in more detail in Chapter 4, Operating System Center 2012 R2 Operations Manager.
While you've just been introduced to the main roles that will be encountered within almost all deployment scenarios of System Center 2012 R2 Operations Manager, there are a couple more, well, features rather than roles worth introducing.
Global Service Monitor (GSM) allows you to configure a watcher node outside of your organization utilizing Microsoft's Azure platform, which can then be used to perform availability and performance monitoring of your externally accessible web-based applications.
This allows you to gain a true 360 degree perspective on your environment with both internal monitoring happening from within your data center and a customer perspective from outside your network.
This information can then be surfaced through dashboards to see a visual representation of access to your services from different locations around the world.
Another feature introduced fully with the 2012 R2 release is System Center Advisor integration. System Center Advisor is a standalone cloud-based service that helps in the proactive monitoring of the configuration of infrastructure systems and provides suggestions in line with best practices.
At the time of writing, Microsoft had a preview of the replacement for Advisor in testing, named Azure Operational Insights. This allows configuration information from SCOM to be uploaded into the cloud service and the data to be analyzed for different purposes such as capacity planning, change tracking, and security. Refer to Chapter 9, Integrating System Center 2012 R2 with Other Components, for further information.
In versions of Operations Manager prior to 2012, there was a role known as the Root Management Server (RMS). This role was typically held by the first management server deployed into a management group and was responsible for running some distinct workflows such as AD assignment, notifications, and database maintenance.
This meant special attention was required when considering high availability with Failover Clustering, adding a layer of complexity. It also meant the placement of other components, such as the data warehouse or operator console access, needed consideration, owing to the SDK service running on the RMS that scripts and consoles used to connect to SCOM.
The RMS Emulator (RMSE) is present only to provide backward compatibility for legacy management packs that may still contain a workflow specifically targeting the RMS (for example, the Microsoft Exchange 2010 management pack). Most management packs, especially those from Microsoft, should by now have been re-released with the RMS requirement removed, but if you have any in-house management packs, it is recommended you check whether they target the Root Management Server class instance (Target="SC!Microsoft.SystemCenter.RootManagementServer").
You can identify which management server in your environment is running the RMSE by using the Get-SCOMRMSEmulator PowerShell command.
Running this command will display which management server is currently responsible for hosting and running the RMSE. In the event of a failure, however, the RMSE role will not fail over to another server automatically, mainly because it isn't considered critical and its loss should have limited impact on the environment.
If the RMSE does need to move to another management server, whether after a failure or for proactive maintenance, you can combine the command that gets the details of the target management server with the Set-SCOMRMSEmulator command. For example, the following single PowerShell line moves the RMSE to a server named PONSCOM01:
Get-SCOMManagementServer –Name "PONSCOM01" | Set-SCOMRMSEmulator
Microsoft TechNet—About Gateway Servers in Operations Manager: http://technet.microsoft.com/en-us/library/hh212823.aspx
System Center Advisor: https://www.systemcenteradvisor.com/
Microsoft TechNet—Global Service Monitor: http://technet.microsoft.com/en-us/library/jj860368.aspx
In addition to understanding the various roles and how they are installed, as well as the performance, capacity sizing, and availability requirements for System Center 2012 R2 Operations Manager, it is equally important to understand what the business requires from its monitoring solution.
Without a clearly defined set of requirements, you run the risk of not implementing high availability on the roles most in demand (or not implementing those roles at all), or of focusing monitoring on areas of the business that provide no value.
The following information should give you a good idea of the areas and questions that you can then take back to the business and seek answers from those involved in the decision-making processes.
Are you mandated to provide a five nines (99.999 percent) service or in reality can you provide a 98 percent uptime service? Most organizations like the sound of a five nines service, but in reality when they see the costs and controls associated with obtaining this uptime, requirements are often re-thought.
Try gathering information regarding your key systems and their priority. Once ranked, work with the business to agree on individual uptime percentages for each application rather than as a whole, as some may be less critical to the business and therefore shouldn't have the same amount of high availability and expense associated with them.
Rather than concentrating only on the time that an application should be up, work with the business to identify periods when the application can be taken out of service for planned maintenance. This helps maximize percentage uptime by allowing you to schedule work within each application's maintenance window, and to track the different types of downtime so that accurate metrics distinguish unplanned downtime, which lowers uptime, from planned downtime.
Gathering the cost of downtime helps you understand from the business what downtime of an application actually costs it.
Is it a financial loss, such as a stock exchange or mining corporation may see if a critical system is down? Maybe it's a loss of productivity or reputation, or a loss of life in the case of systems used within hospitals.
Whatever the cost of downtime may be, knowing this in advance as you start designing your monitoring solution will enable you to focus on priorities and develop targeted reports that can represent the costs, highlighting areas doing well or others that need investments.
Alongside simply deciding to deploy agents to monitor servers, you must also consider what business services are within the scope of monitoring. As part of this, you need to ensure individual components (servers, network, applications, and so on) are accounted for and that the solution is scaled to support them.
With these business services to be monitored, questions also arise regarding any specific SLAs for performance and availability that may need to be set up against the services, along with any reports that may be needed.
This requires you to take into account not only the scale but also the extra work involved in the creation and maintenance of your services.
Alongside knowing the cost of downtime, you should also know whether there are specific areas of the infrastructure that, if down, will cause business-specific financial penalties so that these can again be prioritized for monitoring.
In addition to ensuring that you are monitoring key systems that may cause expenses to the business if problems aren't quickly identified, you may need to also capture areas within your business that earn revenue.
As multi-tenancy or even just the requirement to recoup costs from individual parts of the organization grows ever more important, you should start gathering information related to how much capital was expended on your infrastructure and how that can be equated to costs for individual resource usage of the components of that infrastructure.
Typically, you would assign costs to CPU, memory, storage, and networking utilization.
This is not so much an area where you gather specific information as one where you gain an understanding from the business of the utilization levels at which they require foresight, for capacity planning and the purchasing of new equipment or the redistribution of workloads.
For example, would the business like to know when the drive space is down to 20 percent or 40 percent of free space? Are they happy with the utilization of server memory at 80 percent or 95 percent?
Having this information on hand will help with the initial tuning of your new monitoring environment and the creation of any forecasting reports.
Gathering this information from the outset, before implementing your SCOM design, allows you to understand exactly what the business is trying to achieve and how the SCOM implementation can best support that.
For example, if the business has no requirement to monitor access to files and systems, then implementing the ACS roles may be a waste of resources better served elsewhere. Equally, if the business decides it has 50 applications that require extensive distributed applications to be created and monitored, then be sure to scale the number of management servers appropriately.
Another area to consider is other systems and their integration. For example, does information regarding NetFlow data from another system need to be fed into SCOM or does SCOM need to output information into a Service Desk tool such as System Center 2012 R2 Service Manager?
These interactions, along with normal notifications and other subscriptions, can again place load on the solution and must be taken into consideration.
You should have already worked with the business using the previous recipe to gather information as to the business requirements.
The following steps will guide you through the various high availability options you have when designing a SCOM infrastructure.
In previous versions of SCOM, high availability was achieved through the use of clustering owing to the reliance on a component known as the RMS. Starting with the 2012 release and carrying forward to the R2 release, the RMS role was deprecated, and therefore, it no longer requires clustering.
In SCOM, high availability is now achieved by deploying multiple management servers and grouping these into resource pools.
It is recommended, even in a simple deployment, to always deploy a minimum of two management servers, as this will provide failover of monitoring and access while simplifying the maintenance of your SCOM infrastructure.
As you scale out your infrastructure for larger monitoring deployments, you need to consider adding management servers that can be allocated to dedicated resource pools used for specific areas of monitoring.
The typical resource pools seen in implementations are All Management Servers, Unix/Linux Servers, and Network Monitoring.
All management servers are added by default to the All Management Servers resource pool. Once you have correctly sized your environment, if you need to dedicate management servers to monitoring, say, network devices, it is recommended that you add those servers to the dedicated resource pool and remove them from the All Management Servers resource pool.
You must be aware that at least 50 percent of the servers in the All Management Servers resource pool need to be running in order for SCOM to fully function, so you should always have, at a minimum, two servers remaining in this pool.
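As a sketch of how a dedicated resource pool might be created from the Operations Manager Shell, the following uses the OperationsManager PowerShell module; the server and pool names here are purely illustrative:

```powershell
# Load the module (already loaded if you are in the Operations Manager Shell)
Import-Module OperationsManager

# Pick the management servers to be dedicated to network monitoring
# (hypothetical server names)
$members = Get-SCOMManagementServer -Name "PONSCOM03.PowerON.local", "PONSCOM04.PowerON.local"

# Create a dedicated resource pool containing just those servers
New-SCOMResourcePool -DisplayName "Network Monitoring Resource Pool" -Member $members
```

Remember that after dedicating servers in this way, you should also remove them from the All Management Servers resource pool, as described above, while keeping at least two servers in that pool.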
To provide high availability for console connections, and to make connecting more seamless, Network Load Balancing can be implemented across the management servers that provide console access (or SDK access for systems such as the System Center 2012 R2 Service Manager connectors), with a DNS name allocated to the virtual IP address used instead of a specific management server name.
Normal SQL high availability scenarios apply here with the use of either standard failover clustering or the newer SQL Server 2012 AlwaysOn availability groups.
SCOM uses SQL Server Reporting Services (SSRS) as its reporting mechanism, and the SSRS instance itself cannot be made highly available. The underlying databases for SSRS can, however, be made highly available by utilizing either a traditional SQL cluster or SQL Server 2012 AlwaysOn.
Nevertheless, the reporting server can be restored quickly as long as the main SQL and SCOM components are still intact.
In a default deployment, ACS will usually be installed with a single ACS Collector and ACS Database pair. You can then implement multiple ACS Forwarders that point to this collector, but if the collector goes offline, the Security Event Log on the forwarder will effectively become a queue for the backlog until it can reconnect to a collector.
Using this configuration has the benefit of simplicity, and if the original ACS Collector can be brought back online within the event retention period, or the ACSConfig.xml file restored to a new ACS Collector, then potentially there would be no loss or duplication of data.
ACS Collectors use a configuration file named ACSConfig.xml, stored locally on the collector. This configuration file, which is updated every 5 minutes, keeps track of each forwarder communicating with the collector along with a sequence number corresponding to the EventRecordID. This allows the collector to be aware of which events have already been collected.
Using this simple configuration, however, does leave open the possibility of loss of data (security events not captured) if the ACS Collector is offline for longer than the retention period (the default is 72 hours), or of duplication of data in the database if the original ACSConfig.xml file is not restored.
Another option would be to implement multiple ACS Collector/ACS Database pairs. This would allow you to specify a failover ACS Collector when deploying an ACS Forwarder and would provide you with automatic failover in case of an outage.
However, while this does provide automatic failover, it is important to note that each ACS Collector/ACS Database pair is independent, and after a failover, Security Event data would be spread across databases, making it harder to query when reporting. It would also mean duplication of data after a failover, as the ACSConfig.xml file on the failover ACS Collector would not be aware of the EventRecordID sequence that the other ACS Collector had reached.
Load balancing could be achieved either through the use of dedicated hardware-based network load balancers or the built-in Windows Network Load Balancing role.
Agents don't specifically have, or need, high availability at the client level; that would defeat one objective of monitoring, which is to see when a server goes offline, even if the agent itself could continue working. You can, however, implement multihoming, which allows an agent to communicate with up to four management groups. This is ideal for obtaining data from live systems in both test and production SCOM environments.
To configure failover, use PowerShell to designate the primary and failover server(s), setting the agent configuration with the Set-SCOMParentManagementServer command and its -FailoverServer parameter. For example, to set a failover management server (PONSCOMGW02) on an agent (PONDC01), use the following command:
Set-SCOMParentManagementServer -Agent (Get-SCOMAgent -DNSHostName "PONDC01.PowerON.local") -FailoverServer (Get-SCOMManagementServer -Name "PONSCOMGW02.PowerON.local")
The gateway servers themselves can also be pointed at multiple management servers as both primary and failovers. This technique uses the same PowerShell command, but with the -GatewayServer parameter identifying the gateway instead of the -Agent parameter.
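As an illustrative sketch of configuring a gateway's primary and failover management servers (the server names are hypothetical, and the cmdlets assume the OperationsManager module is loaded):

```powershell
# Resolve the gateway and the management servers it should report to
$gateway  = Get-SCOMGatewayManagementServer -Name "PONSCOMGW01.PowerON.local"
$primary  = Get-SCOMManagementServer -Name "PONSCOM01.PowerON.local"
$failover = Get-SCOMManagementServer -Name "PONSCOM02.PowerON.local"

# Point the gateway at its primary management server, then define the failover
Set-SCOMParentManagementServer -GatewayServer $gateway -PrimaryServer $primary
Set-SCOMParentManagementServer -GatewayServer $gateway -FailoverServer $failover
```

With this in place, the gateway will automatically redirect its communication to the failover management server if the primary becomes unavailable.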
It would make sense, even for a very small deployment, to have at least two management servers, as this allows for easier maintenance with reduced downtime and you can then scale from there.
Modifying Resource Pool Membership—http://technet.microsoft.com/en-us/library/hh230706.aspx#bkmk_modifyingresourcepoolmembership
Microsoft TechNet Documentation SQL Server 2012 Always On—http://technet.microsoft.com/en-us/library/jj899851.aspx
Configure an agent for Gateway failover—https://technet.microsoft.com/en-us/library/hh212733.aspx
Configure a Gateway for management server failover—https://technet.microsoft.com/en-us/library/hh212904.aspx
Correctly sizing your environment will also certainly help mitigate the need to go back to the business and ask for more storage and compute resources further down the line.
This recipe will show you how to make use of the SCOM Sizing Helper that can be downloaded from http://www.microsoft.com/en-us/download/details.aspx?id=29270.
This sizing helper should only be used as an indicative guide, as your final sizing and design will require further thought around your requirements, and this is discussed more at the end of the recipe.
Click the button under the 2. Standard Deployment heading:
Use the drop-down selectors to choose 500 Windows Computers, 100 Network Devices, and set the APM as Enabled and click Submit:
You will obtain a basic sizing output; this scenario requires six servers, each with 16 GB of memory and 4 cores:
One Management Server to support the selected number of agents, plus one to add high availability
One SQL server for the operational database
One SQL server each for the data warehouse, SQL Reporting, and Web Console
The output will also calculate the disk space required for the SQL databases. It is important to note, however, that as we chose to enable APM in this scenario, the calculator defaults to sizing APM for all 500 of the Windows computers we selected.
It is highly unlikely that you will monitor application performance across every server, so adjust this accordingly.
The Sizing Helper can be used to get a basic idea of the size of servers and disk space required for an environment of your size but must be used in conjunction with a more detailed plan of the infrastructure topology required for your environment.
For example, the preceding recipe includes no gateway servers, no resource pools dedicated to Unix or Linux monitoring, and no high availability across the SQL servers, and it shares the SQL server used for the data warehouse with the Web Console.
While there are no specific sizing fields for calculating Unix or Linux monitoring, you can use the Network Devices fields, as these calculate to roughly the same values. It is worth exploring the other scenario tabs within the sizing calculator to see what other options you have and how they affect your sizing.
One area you may have immediately zoomed in on is the disk space requirements for the SQL databases. While it is important to stress how critical it is to correctly size the databases and even more so to ensure there is enough free disk space always available for normal operation and maintenance tasks, the sizing helper can be a little on the generous side when calculating the requirements.
This is due to the complexity that would be involved in trying to model and account for every management pack you may wish to run, the rules and collections scoped for your devices, the overrides you may put in place, the distributed applications you may build, and so on.
We have seen some environments where the calculator has estimated 200 GB for the data warehouse, yet the environments haven't grown past 80 GB after a couple of years of full use.
However, this is not to say that you can simply ignore the recommendations, as your mileage may vary and the authors cannot stress just how important it is to ensure you have enough free disk space on the volumes used for your SQL databases.
System Center 2012 R2 Operations Manager has its own set of internal SQL maintenance tasks for its SQL Databases. The main operational database, for example, has regular schedules to re-index and defragment the tables. This means there is a hard requirement to ensure there is always at least 40 percent free space within the database for these maintenance tasks to operate.
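As a rough worked illustration of that 40 percent free-space rule (the 60 GB figure below is hypothetical, not a product recommendation):

```powershell
# If the operational data currently occupies 60 GB, the database file must be
# sized so that this represents no more than 60 percent of the total file size
$dataSizeGB  = 60
$minDbSizeGB = [math]::Ceiling($dataSizeGB / 0.6)   # 100 GB total, leaving 40 GB free
```

In other words, divide the expected data volume by 0.6 to find the minimum database size that still leaves the required headroom for re-indexing and defragmentation.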
A contingency must also be in place for the possibility of an event storm. If a large influx of alerts were to occur within your environment, causing the database to dramatically grow in size in a short period of time, not having the free disk space to allow for this could potentially make your monitoring system go offline before you've had a chance to use it to identify the root cause.
SCOM does allow for multiple management groups to be implemented within an environment, along with the ability for these separate management groups to share a single data warehouse.
Multiple management groups may need to be considered for a couple of reasons. The first is the fact that the number of supported agents, namely 15,000, reduces significantly in relation to the number of open console connections. For example, this number of supported agents is based on 25 open consoles. With 50 open consoles, this supported number drops to 6,000 agents.
These are still relatively high numbers for supported agents when considering just server monitoring, but they can be reached quickly when also monitoring workstation-class devices and must therefore be taken into account.
This is not something to consider lightly. Implementations on this scale are likely to already have very large data warehouses, and adding more management groups is just going to increase the size and performance strain further.
Another reason would be political: where Role-based Access Control (RBAC) cannot cover scoping of the console, or where highly rigorous change control processes across different parts of the infrastructure would make the normal deployment of management packs, overrides, and changes more challenging.
In addition to being able to deploy multiple management groups so that they share a single data warehouse, you have the option of connecting the individual management groups together in a connected method. This allows you to view and interact with alerts and discovery data in a consolidated view from a single console while also being able to run tasks on the monitored devices from other management groups.
Connected groups are joined together in a peer-to-peer relationship with a top-level group in the hierarchy that has data fed to it from the other management groups, while the groups themselves have no visibility of each other, thus maintaining separation.
This is ideal for test or pre-production scenarios where typically you don't need a shared data warehouse and want the environments to remain separate, but with easy access from a single console.
Detailed information for the recipes covered can be found here:
Sizing Helper Download: http://www.microsoft.com/en-us/download/details.aspx?id=29270
Kevin Holman's Blog: What SQL Maintenance Plans to use with SCOM? http://blogs.technet.com/b/kevinholman/archive/2008/04/12/what-sql-maintenance-should-i-perform-on-my-opsmgr-databases.aspx