Building the Control-M Infrastructure

Qiang Ding

January 2013


Three ages to workload automation

Enterprise-wide workload automation does not happen overnight. Converting nonstandardized batch processing into a centralized workload automation platform can be time consuming and risky. We need to gain a full understanding of the existing batch jobs before moving them onto the new platform: How are the jobs currently scheduled? What are the relationships between these jobs? Which ones have higher priority? Then, based on the number of jobs and their complexity, we can decide whether the migration should be performed automatically or manually. Once jobs are migrated, a series of tests needs to be performed before moving into production. Each production cutover should preferably be as transparent as possible to the business' normal operation, or be performed within the agreed outage window.

Apart from the technical challenges, "people issues" can be the next road block. First of all, users need to be educated about the tool. It can take a lot of time for users to accept and get used to the new way of operation. The bigger challenge is to get application developers to take in and apply the "centralized workload automation" concept, so that during each IT project's development phase they utilize the built-in features provided by the workload automation platform as much as possible rather than reinventing the wheel.

Forcing users and developers to fully take in the centralized workload automation concept and change their ways of working with batch processing straightaway could lead the project to an ultimate failure. Instead, different areas of approach and actions should be taken in stages according to the actual IT environment condition. We can group these different approaches and actions into three "ages", that is, the stone age, iron age, and golden age.

Stone age

Unless we are building an IT infrastructure from scratch, any reasonably sized organization should already have a noticeable number of batch jobs running to serve different business needs. Such processes are either run by the OS's or the application's inbuilt batch features, or scheduled from homegrown scheduling tools (sometimes they can be manual tasks too). For example, these tasks can be:

  • End of day (EOD) reporting jobs

  • ERP application's overnight batch

  • Housekeeping jobs, for example, database backup and log recycling jobs

Depending on the organization's business requirements, the number of batch jobs required to achieve the outcome can start from a few hundred and go up to tens of thousands across a large number of different job execution hosts. In a heterogeneous environment it is extremely challenging to run cross-platform batch processing by using different tools, especially when the number of tasks is large and the batch window is small. Therefore, these batch processes are the most essential and critical ones to consolidate into a centralized scheduling platform. On the other hand, these types of processing tasks are the "low hanging fruits": relatively easy to identify and migrate, simply because they have already been clearly defined and scheduled by existing batch scheduling mechanisms, which means it is more likely that the job scheduling information can be extracted from these sources.

At the end of the day, it all comes down to the question of how to migrate the jobs into a centralized scheduling platform and how they are going to be triggered in the new environment. "How to migrate" means how the jobs should be extracted from the existing batch scheduling mechanism and how they should be imported into the new environment. It can be done by using a job migration program, if one is available, or else someone has to manually redefine the jobs from scratch. "How jobs should be triggered" means whether the job should directly trigger the script/command or use the scheduling tool's extended features (that is, Control-M Control Modules) for batch processing within a particular application.

The bottom line is – this stage is all about standardizing the way the existing batch jobs are executed and managed by consolidating them into a centralized tool. The migration process should be relatively straightforward and should not require major modification to application codes as well as each application's architecture. However, this will change the way users manage and monitor batch jobs forever. It is the initial step for standardizing batch management and batch optimization, therefore we call it the "stone age".

The successful implementation of "stone age" will benefit the organization without a doubt. After a while, users will realize how easy it is to manage cross-platform batch flows from a centralized GUI. They no longer need to look at different screens to trigger jobs or monitor a job's execution. Instead, batch jobs are triggered automatically and are to be managed by exceptions.

Iron age

A lot of organizations stop improving and stop extending their centralized batch environment once they have completed the stone age. As the business rules are becoming more and more complex, it is common to see silos of batch processing existing in different applications that are related but not linked together, that is, they do not know about other processing taking place and how they relate. Plus on top of that, we have business process steps that are being "patched up" by mechanisms outside the centralized scheduling tool. As a result, batch flows within the centralized scheduling tool are commonly unable to present an end-to-end business process.

One possibility is that these organizations believe that they have already got everything that the centralized scheduling tool is capable of – triggering executables at a fixed time on a predefined day. Rather than someone taking the lead to discover and try other features within the batch scheduling tool, people in different parts of the organization always try to develop their own ways to handle more advanced requirements such as event triggering processing or inter-application file transfers.

In late 2010, I was involved in an EAI development project. During my meeting with some Java developers, I noticed they still thought batch processing (in Control-M) was all about triggering some program to run at a fixed time and nothing more. Unless they change their views on batch processing and understand what "workload automation" is about, they won't be able to fully utilize the features provided by a workload automation tool for the applications they develop. As a result, after the application goes live, a large amount of processing will be done by inbuilt or self-coded scheduling mechanisms while the rest runs in Control-M.

Iron age is about changing how batch processing is initially designed, that is, improving it by fully utilizing the capabilities of the batch scheduling tool. This requires ongoing education and letting application designers and developers accept and use features that are already available in a centralized scheduling tool rather than reinventing the wheel. This is a lot more challenging than simply extracting and importing batch-processing data from one tool to another during the stone age. Also, the benefits we get from the iron age are not as easy to measure as what we can directly see during the stone age. In the stone age, the users instantly get the benefits of managing batch from a centralized scheduling tool.

In reality, application development teams may rather write their own code to meet the processing requirements so that they can have total control of the application they have developed. Application developers may think "Why should we learn a new tool when we can simply write a few lines of code to achieve event-driven triggering?" or "In the future, if we want to change the way my application works, we might have to log a change request for the scheduling team to modify the batch job in Control-M, whereas having everything done in the code, we will have full control, therefore saving us a lot of hassle."

A certain degree of politics can also be involved. For example, the management of the application development team may think "If half of our work is done by the scheduling tool, where is our value?" or "We want to be more important to the organization's IT in front of the IT directors!" Another scenario with organizations is that they outsource their application development. Instead of building a new system from scratch for each project, the outsourcing companies try to modify what they have already implemented in other organizations for a similar project. In such cases, the outsourcing companies, most of the time, will refuse to do any major modifications to the application just to fit into the centralized scheduling tool. They might believe that by doing so, they can ensure that the project gets delivered with minimal time, cost, and risk. But in reality, the result always turns out the opposite.

In order to avoid falling into one of the categories mentioned above, the person who is going to promote "iron age" within an organization should work on people rather than expecting everything to turn out fine by only focusing on the technology. At the same time, higher-level management in the organization should provide a level of assistance by enforcing an organization-wide scheduling standard so the organization's IT can get the most out of the centralized batch scheduling platform and therefore maximize the business' ROI on it.

A successfully implemented iron age means the organization should see that batch flows are becoming more meaningful at the business service level (that is, they present the complete business process) and are optimized to process within a shorter batch window by using features provided with the batch scheduling tool (for example, a percentage of processing is moved into event-triggered batch, which can happen outside the batch window). From an ongoing maintenance point of view, there are fewer homegrown tools to manage. The total time and effort required for application development may also reduce if the developers are familiar with the batch scheduling tool and know how to leverage the available features properly.

Golden age

Golden age refers to the full implementation of workload automation. It is not as easy to achieve as it sounds, because there are a number of prerequisites that need to be met before the organization even considers it.

First of all, the centralized scheduling platform needs to be upgraded to a more up-to-date version that provides the workload automation ability, such as Control-M version 7. Secondly, in order to get the true value from workload automation, the organization needs to have both the stone age and the iron age successfully implemented, that is, jobs in the centralized scheduling tool need to be well defined and represent the actual business processes. Furthermore, it depends on how far the organization wants to go down this road in order to reach the pinnacle. The IT environment may look at providing the foundation to allow the batch environment to become more dynamic by using resource virtualization and cloud computing technologies.

Once all prerequisites are met, implementing the golden age requires the batch environment designer to work closely with a system architect and application developers. They need to transform the existing batch to become more flexible (moving away from batch jobs' static nature), so the workload automation tool can schedule jobs according to business policies and route the workload according to runtime load to the best available virtual resource for execution. The batch jobs should also be designed by following the SOA design principles for reusability and should be loosely coupled.

In the golden age, batch workloads are managed according to the business policies and service agreements. The bottleneck of limited machine resources for batch processing is not much of a concern because resources can be acquired whenever needed. In this case, the system can handle a sudden spike of processing requests, while still ensuring that processing completes within its agreed batch window or SLA.

Planning the Batch environment

The foundation of the golden age centralized workload automation is a well-designed and properly configured batch software infrastructure. Although the technical aspect of installing commercial batch software such as Control-M is fairly straightforward, implementing a complete batch environment is not only about finding a server and installing the software. Building a batch environment without proper planning can turn out to be costly for the organization over the long run (for example, the penalty of constantly missing SLAs due to an unstable batch environment, the cost of modifying the infrastructure, and the cost of additional licensing due to poor planning). It is also very challenging to make major structural modifications to the production batch environment without interrupting the business once everything has gone live and the components are working closely with each other.

A well-designed batch environment should be reliable, secure, and scalable. In order to properly plan such a batch environment from the beginning, the system architect needs to have an overall picture of the organization's IT environment, knowing the business requirements, technical challenges, and technical resource availability, as well as keeping in mind the technical requirements that come from the batch scheduling platform itself, that is, Control-M.

Control-M sizing consideration

Let's assume that we are at the beginning of the stone age. We start with interviewing each application owner to find out which batch processing they currently have and how they are running them at the moment. Based on this we are able to figure out which processing scenarios are suitable to be migrated to Control-M, their priority, and difficulty level. Once we have an estimation of the total number of existing tasks to be consolidated, we should be able to roughly figure out:

  • What is the total number of batch jobs run per day (average and peak)?

  • How many machines are involved in executing these batch jobs?

  • Who are the users currently managing these batch jobs and how many of them are going to directly interact with the Control-M GUI frontend for job management and monitoring?

Total number of batch jobs run per day

When we talk about the total number of batch jobs, it really depends on how the processing tasks are grouped. We can have a single Control-M job to perform multiple tasks, such as to begin with reading information from a file, filter it, load the output into the database, and then send an e-mail notification for completion. Or we can split these tasks into individual jobs. Dividing tasks can make troubleshooting easier and reduce the time required when rerunning a particular task, but at the same time, this approach increases the amount of jobs to be managed. Such decision should be made on a case-by-case basis in reference to a standard rather than using a one size fits all approach.

It is important to have an estimation of the maximum number of actual Control-M jobs run per day. This information might be related to Control-M licensing costs (depending on the licensing agreement) and will affect some Control-M database configuration parameters. Control-M components are stateless; all job-related information and active environment state changes are stored in its database. Therefore, Control-M needs to know this information during installation in order to configure the database to provide the required capacity and performance. There are three levels to choose from: small (less than 2,000 jobs/day), medium (2,001 to 10,000 jobs/day), and large (more than 10,000 jobs/day). With modern databases, the database tablespace can be set to auto extend* to accommodate job and data growth. Multiple Control-M/Servers should be considered to ensure best performance and stability if the estimated maximum number of batch jobs run per day is higher than BMC Software's recommended figure.

* When the BMC-provided PostgreSQL database is used, the tablespace is set to auto extend by default.
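The three sizing levels above can be expressed as a small helper for use during planning. This is only a sketch: the function name `sizing_level` and the sample application estimates are hypothetical, and because the text leaves a count of exactly 2,000 jobs/day unspecified, the sketch assumes it still counts as small.

```python
def sizing_level(peak_jobs_per_day: int) -> str:
    """Map an estimated peak daily job count to a sizing level."""
    if peak_jobs_per_day < 0:
        raise ValueError("job count cannot be negative")
    if peak_jobs_per_day <= 2_000:  # assumption: exactly 2,000 counts as small
        return "small"
    if peak_jobs_per_day <= 10_000:
        return "medium"
    return "large"


# Hypothetical peak estimates gathered from application owners (jobs per day).
estimates = {"erp_overnight": 1_200, "eod_reporting": 4_500, "housekeeping": 800}
peak_total = sum(estimates.values())
print(peak_total, sizing_level(peak_total))  # prints: 6500 medium
```

The result is only a starting point; the actual figure to compare against is whatever BMC Software recommends for the version being installed.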

Control-M does not support load balancing at the Control-M/Server level. This means that jobs are predefined to permanently belong to a specific Control-M/Server. Control-M/Agents on job execution hosts are also preconfigured to communicate with a specific Control-M/Server. Partitioning jobs into multiple Control-M/Servers needs to be carefully planned, because moving existing jobs from one Control-M/Server to another requires many manual actions; for example, a number of fields in each job definition need to be modified in order to shift the jobs from one Control-M/Server to another. The Control-M/Agent on the job execution host may also need to be reconfigured to accept job submissions from the new Control-M/Server.

Partitioning jobs into multiple Control-M/Servers is more than just dividing the total number of jobs by the recommended value and evenly distributing them into different Control-M/Servers. One of the biggest things that needs to be taken into consideration is the dependency between interrelated jobs. Control-M has built-in functionality to support cross-datacenter job dependency, that is, global conditions. In order to use this feature, global condition prefixes need to be defined for those cross-datacenter dependencies in order to connect interrelated jobs that reside on multiple Control-M/Servers. A large number of global conditions can increase complexity and processing overhead. Therefore, related jobs should be grouped into the same Control-M/Server as much as possible to reduce the need for global conditions, unless some of the jobs in the flow have to run in a totally different environment. For example, mainframe jobs have to be scheduled in a mainframe Control-M/Server, and global organizations often have multiple Control-M/Servers to handle jobs in specific geographic regions.
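To make the grouping idea concrete, the following sketch treats jobs and their dependencies as a graph. It groups interrelated jobs into flows and counts how many dependencies would cross Control-M/Server boundaries (and so would need global conditions) for a given assignment. The job names, server names, and both function names are hypothetical; this is a planning aid on paper, not something that talks to Control-M.

```python
from collections import defaultdict


def group_related_jobs(jobs, dependencies):
    """Group jobs into flows: sets of jobs connected by dependencies."""
    parent = {job: job for job in jobs}

    def find(job):
        # Walk up to the representative job of this flow (with path halving).
        while parent[job] != job:
            parent[job] = parent[parent[job]]
            job = parent[job]
        return job

    for upstream, downstream in dependencies:
        root_a, root_b = find(upstream), find(downstream)
        if root_a != root_b:
            parent[root_a] = root_b

    flows = defaultdict(set)
    for job in jobs:
        flows[find(job)].add(job)
    return list(flows.values())


def cross_server_dependencies(assignment, dependencies):
    """Return the dependencies whose two jobs sit on different servers."""
    return [(a, b) for a, b in dependencies if assignment[a] != assignment[b]]


jobs = ["extract", "load", "report", "backup"]
dependencies = [("extract", "load"), ("load", "report")]

# Evenly splitting jobs without looking at dependencies.
naive = {"extract": "server_a", "load": "server_b",
         "report": "server_a", "backup": "server_b"}
# Keeping the whole flow together on one server.
grouped = {"extract": "server_a", "load": "server_a",
           "report": "server_a", "backup": "server_b"}

print(len(cross_server_dependencies(naive, dependencies)))    # prints: 2
print(len(cross_server_dependencies(grouped, dependencies)))  # prints: 0
```

A real partitioning exercise would also have to respect the per-server job limit and constraints such as mainframe or regional servers, which this sketch ignores.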

Total number of job execution hosts

The total number of job execution hosts managed by individual Control-M/Servers also needs to be taken into consideration during planning. Control-M/Server needs to concurrently maintain communication with every connected Control-M/Agent at all times, regardless of whether there are jobs scheduled on the Agent or not. Although Control-M/Server has no hard limit on the number of connected Control-M/Agents, managing these Control-M/Agents creates some overhead for Control-M/Server, especially when the number of Control-M/Agents is high. In a multiple Control-M/Server environment, we would balance the number of connected Control-M/Agents per Control-M/Server if there are thousands of hosts involved in batch processing.

Choosing which job execution hosts are to be grouped to the same Control-M/Server should be based on job relationships. Hosts that are likely going to execute inter-related jobs should be grouped into the same Control-M/Server to minimize the use of global conditions. The decision should be made in conjunction with the total number of jobs, as the total number of jobs running on these hosts can still exceed the recommendation of the maximum number of jobs run per day for each Control-M/Server.

Control-M/Agent itself is lightweight and has no real limit on the number of concurrently running jobs. The real limitation comes from the job execution machine's processing power and the operating system's maximum number of running processes. It is common to see additional machines added down the track to share the increasing processing workload. In this case, the additional Agents need to be added to the Control-M/Server that currently schedules the processing, so that they can be added to the job submission group, that is, the node group. By doing so, Control-M/Server will be able to use its round-robin-based load balancing feature at job submission time to determine the best agent machine for job execution. This should be kept in mind during initial planning, to leave each Control-M/Server room to add Control-M/Agents that share the workload for a particular application's potential future batch processing needs.
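The rotation idea behind a node group can be illustrated with a few lines of code. This sketch shows plain round-robin selection only; the `NodeGroup` class and host names are hypothetical, and Control-M/Server's actual load balancing logic is internal to the product and is not reproduced here.

```python
class NodeGroup:
    """A named set of job execution hosts that are picked in rotation."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self._position = 0

    def add_host(self, host):
        # A machine added later simply joins the rotation.
        self.hosts.append(host)

    def pick_host(self):
        if not self.hosts:
            raise RuntimeError("node group has no hosts")
        host = self.hosts[self._position]
        self._position = (self._position + 1) % len(self.hosts)
        return host


group = NodeGroup(["app01", "app02"])
print([group.pick_host() for _ in range(3)])  # prints: ['app01', 'app02', 'app01']

group.add_host("app03")  # extra machine added to share the workload
print([group.pick_host() for _ in range(3)])  # prints: ['app02', 'app03', 'app01']
```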

Number of datacenters

Each running Control-M/Server needs to be defined into Control-M/Enterprise Manager as a datacenter along with an active gateway process in-between to handle the communication. When planning for the Control-M/Enterprise Manager server component machine, we need to take this factor into consideration to make sure the machine's CPU and RAM will have the capacity to handle those additional gateway processes running. On the other hand, the Control-M/Enterprise Manager database table will also require extra space to store the job definitions and active/archived job information to be presented in the GUI.

Smaller environments which are unlikely to have additional Control-M/Servers may choose to install Control-M/Enterprise Manager and Control-M/Server onto a single machine. However, for a larger environment, it is important to make sure the Control-M/Enterprise Manager machine has the extra capacity, from both a processing power and a storage point of view, to handle the future growth of the batch environment.

Amount of concurrent GUI users

Consolidating jobs into a centralized location changes the way jobs are managed. Users no longer manage and monitor jobs through different systems. Instead, they perform all these tasks from Control-M/Enterprise Manager GUI Clients. All GUI Clients are connected to the Control-M/Enterprise Manager server component, the GUI Server. The GUI Server does most of the work to allow job information to be displayed in graphical form on each client.

Multiple GUI server processes can be defined to share the workload, but there's no real guideline for deciding how many GUI server processes are needed to handle the number of concurrent GUI client sessions, because it depends on the frequency of GUI actions, the type of GUI actions, the machine's processing power, and so on. GUI server processes can be added at any time without any major impact on users or on the batch environment, but additional GUI server processes will also increase CPU and memory usage on the machine. Additional GUI server processes can also be defined on another machine where a full set of Control-M/Enterprise Manager server components is installed.

Even after the number of GUI server processes has been increased, users may still experience slow response when performing mass job actions in the active environment (for example, hold or free a job, view job sysout, rerun a job). The CS process handles GUI user requests sent from Enterprise Manager to Control-M/Server (for example, hold/free/rerun a job) on the Control-M/Server side. The number of CS processes can be increased according to demand. Again, there's no accurate guideline for how many CS processes are needed; CS processes can be increased or decreased at any time, or set to increase/decrease dynamically.

Use Control-M/Agent or go Agentless

Control-M Agentless Technology provides rapid job deployment by avoiding the need to install Control-M/Agent on the job execution host. It can reduce the TCO of the Control-M application because the user no longer needs to worry about patching and upgrading the Control-M/Agent installation on each remote job execution host. At the same time, it can potentially allow first day support on new application rollouts. Control-M Agentless Technology is well suited to the following batch processing scenarios:

  • The total number of hosts involved in batch processing is large, but some of them do not usually run many jobs, and therefore hardly justify the effort of installing and maintaining the Control-M/Agent on these hosts.

  • Between the remote host and Control-M/Server, the Control-M/Agent communication ports are blocked by the firewall and cannot be opened due to security reasons.

  • Control-M/Agent cannot be installed on a remote host (that is, the host is owned by a third-party vendor).

  • The O/S and Platform of remote host is either very old or very special, and no Control-M/Agent is compatible with it.

Of course, Agentless Technology is not a one size fits all approach, simply because there are a number of limitations with the technology. Here's a list of things to be considered before deciding on Agentless Technology:

  • Are there any jobs that need to be scheduled through a Control Module? Control Modules are installed on top of Control-M/Agent, so Control-M/Agent has to be installed first. That said, a Control Module can be installed on another host and still communicate with the application installed on this host.

  • Is Control-M filewatch facility going to be used on the job execution host? Control-M filewatch facility is a file detection feature provided as part of Control-M/Agent. Therefore, in order to use it, Control-M/Agent has to be installed. Alternatively, we can use Control-M CM for AFT to perform filewatch on a remote host, but it may cost extra to get the Control Module software license.

  • Are there any Control-M utilities which need to be triggered from the job execution host? There are a number of Control-M/Server utilities which can be invoked from Control-M/Agent. This relies on Control-M/Agent's internal mechanism to accept the request and pass it to Control-M/Server.

  • Will the jobs produce large sysout (output)? If so, will the users need to view these sysouts from Control-M/Enterprise Manager GUI? With Control-M Agentless Technology, the output of each job execution always gets transferred back to the Control-M/Agent that sent the original job execution request. Frequently transferring large sysouts between hosts can cause significant network overhead. In such cases, either the sysout should get deleted right after job execution, or simply install Control-M/Agent on the job execution host, so the sysout gets generated on the host and stays on the host for users to view from the GUI.

  • Will the job execution host be running a large number of jobs at high frequency? Is the job submission part of a time-critical, real-time system that requires an immediate response? In such cases, Control-M/Agent installation is recommended to guarantee performance. This is because with Agentless Technology, jobs are submitted to the remote host through a selected Control-M/Agent; a time delay is unavoidable when opening the connection to a remote host through WMI or SSH for job submission and tracking.
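The questions in this list can be captured as a simple per-host checklist. The sketch below is only a way of recording the answers; the `HostProfile` fields and function names are hypothetical, and a flagged host may still work agentless through the alternatives mentioned above (for example, a Control Module installed on another host).

```python
from dataclasses import dataclass


@dataclass
class HostProfile:
    """Answers to the agentless checklist for one job execution host."""
    needs_control_module: bool = False
    needs_filewatch: bool = False
    invokes_server_utilities: bool = False
    large_sysout_viewed_in_gui: bool = False
    high_frequency_or_time_critical: bool = False


def reasons_to_install_agent(profile: HostProfile) -> list:
    """List the checklist items that argue for a full Control-M/Agent."""
    checks = [
        (profile.needs_control_module, "jobs are scheduled through a Control Module"),
        (profile.needs_filewatch, "the filewatch facility is used on the host"),
        (profile.invokes_server_utilities, "Control-M utilities are triggered from the host"),
        (profile.large_sysout_viewed_in_gui, "large sysouts are viewed from the GUI"),
        (profile.high_frequency_or_time_critical, "jobs are high frequency or time critical"),
    ]
    return [reason for flagged, reason in checks if flagged]


def recommend(profile: HostProfile) -> str:
    """Suggest a starting point; the final decision remains a judgment call."""
    return "agent" if reasons_to_install_agent(profile) else "agentless"


print(recommend(HostProfile()))                      # prints: agentless
print(recommend(HostProfile(needs_filewatch=True)))  # prints: agent
```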

Agentless Technology allows rapid job deployment. For applications that are just starting to move batch processing into Control-M, users can always set up Agentless job submission to fulfill the initial request and convert the host into Control-M/Agent in the future if needed. Control-M allows users to freely switch a job execution host between real Control-M/Agent and Agentless Technology, and it is transparent to job execution and management.

Production, development, and testing

Regardless of whether the batch jobs are migrated from an existing scheduling platform or developed together with new applications, they need to be tested before going into production. Organizations have to build extra Control-M environments to separate development and testing from production. These development and testing Control-M/Server environments are often very similar to production, or at least run the same version, so once the jobs are developed and tested, they can be transferred to production without significant modifications.

In smaller environments, the production, development, and testing are more likely to be separated only at the Control-M/Server level, that is, the organization will have a single Control-M/Enterprise Manager instance with all production, development, and testing Control-M/Servers defined and connected (we can call it "All-in-One"). User access is given according to the person's role in the organization; for example, an application developer may only be able to access jobs that reside in the development and testing environment, whereas production operators may have full permission to access jobs from all three environments.

In larger environments, the organization may have hundreds of GUI users from all around the world, running multiple Control-M/Servers with tens of thousands of jobs in production, development, and testing. Sure enough, Control-M/Servers run by themselves without interfering with each other, and each user's access privilege can still be defined from Control-M/Enterprise Manager. The concern is that because the Control-M/Enterprise Manager's components, such as the GUI server, database, and global conditions, are shared between production and non-production, a possible usage spike from the development and testing environment could overload the Control-M/Enterprise Manager components, thus affecting the running production system. If this is a concern, it is common sense to completely separate production from the development and testing environments (we can call it "Split"), that is, production Control-M/Server(s) are connected to a Control-M/Enterprise Manager which is dedicated to production. Development and testing Control-M/Servers are connected to their own Control-M/Enterprise Manager. This is a good way to make the production system's performance much more consistent and predictable; at the same time it reduces the complexity of the environment.
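The difference between the "All-in-One" and "Split" layouts can be written down as plain data and checked mechanically. This is a sketch for reasoning about a planned topology; the instance names, environment labels, and function name are hypothetical.

```python
def shared_with_non_production(topology):
    """Return Enterprise Manager instances that serve production and other environments."""
    shared = []
    for em_name, servers in topology.items():
        environments = {environment for _, environment in servers}
        if "production" in environments and len(environments) > 1:
            shared.append(em_name)
    return shared


# Each Enterprise Manager instance maps to (Control-M/Server, environment) pairs.
all_in_one = {
    "em_main": [("ctm_prod", "production"),
                ("ctm_dev", "development"),
                ("ctm_test", "testing")],
}
split = {
    "em_prod": [("ctm_prod", "production")],
    "em_nonprod": [("ctm_dev", "development"), ("ctm_test", "testing")],
}

print(shared_with_non_production(all_in_one))  # prints: ['em_main']
print(shared_with_non_production(split))       # prints: []
```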

Control-M/Agents are to be installed on production, testing, or development job execution hosts, and connected to the corresponding Control-M/Server. Each Control-M/Agent is designed to interact with only one Control-M/Server at any given time. Users can manually reconfigure a Control-M/Agent to talk to a different Control-M/Server (for example, moving a Control-M/Agent from UAT to production), or install additional Control-M/Agents on the same host if job submission is required by multiple Control-M/Servers concurrently.

Control-M high availability requirements

Prior to implementing the batch environment, how to ensure high availability should be taken into consideration as part of the planning phase to ensure that the future batch environment can offer continuous operation during special events such as outages for server maintenance, machine hardware failure, data corruption, power outage, or even flood, fire, and earthquakes at the server location.

Newspaper article: Floods take down CUA net bank (Australia)

By Luke Hopewell, on January 12th, 2011

Credit Union Australia (CUA) has shut down its internet banking services because of the Queensland flood disaster, leaving customers nationwide high and dry.

The services have been shut down because CUA operates out of head offices in the flooded area of the Brisbane central business district. It apologized to customers via its website today.

"Customers can be assured that all data and accounts are secure" it said on its front page, adding that customers can still make withdrawals at branches and ATMs outside the flooded areas.

"Customers can be assured that all their personal data remains secure and CUA is working to restore services as soon as possible. However, we do ask for their patience at this time as we work through the issues" CUA said.

IT outages can cost the business a lot; it is important to at least have a high availability (HA) strategy for the batch environment and to consider disaster recovery (DR) if the required resources are available. In general terms, HA means the application can offer uninterrupted computing during a hardware failure or other issues, whereas DR means the application can offer recovery at a remote site during a disaster, such as fire or earthquake, at the currently running datacenter.

In Control-M, there are two aspects around the topic, that is, HA/DR of Control-M components and continuing batch processing on job execution hosts.

Control-M in a clustered environment

Some organizations may have invested in operating system-level clustering and the associated hardware in their IT infrastructure to achieve HA and DR. This refers to traditional clustering technology such as Microsoft Cluster Server (MSCS), VERITAS Cluster Server (VCS), IBM PowerHA, Oracle Solaris Cluster, and so on. All components of the three tiers (Control-M/Enterprise Manager, Server, and Agent) can be implemented on most common types of Windows/Unix/Linux clusters to achieve high availability. It is essential to have the required hardware resources and operating system clustering fully configured prior to Control-M installation. Installing Control-M in a clustered environment requires additional configuration compared to a normal standalone installation, but it offers nearly transparent failover for running batch jobs and Control-M users.

In a clustered environment, Control-M applications need to be installed on a shared filesystem that can be accessed by each machine (cluster node) within the cluster. These nodes can be located in multiple datacenters in different cities or countries, as long as they are able to communicate with each other and access the shared disk through the TCP/IP network. As Control-M applications support clusters only in active-passive mode, not active-active mode, the shared filesystem is mounted on whichever cluster node is currently active. The active node has a virtual service IP assigned for other machines on the network to access. If the active node fails, the cluster management software immediately detects the failure and switches the Control-M application, together with the service IP, to the available standby node. In theory, the entire process requires no modification to either the component that failed over or the machines that access it.
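The active-passive behavior described above can be illustrated with a toy model. This is only a sketch of the concept, not how any cluster manager or Control-M itself is implemented; the class name, node names, and IP address are all made up for illustration. The point it shows is that clients always address the same service IP, while the node behind it changes on failover.

```python
from dataclasses import dataclass, field


@dataclass
class ActivePassiveCluster:
    """Toy model of an active-passive cluster (all names are hypothetical)."""
    nodes: list[str]
    service_ip: str
    healthy: set[str] = field(default_factory=set)
    active: str = ""

    def __post_init__(self) -> None:
        # Every node starts healthy; the first one listed is the active node.
        self.healthy = set(self.nodes)
        self.active = self.nodes[0]

    def mark_failed(self, node: str) -> None:
        """Record a node failure and fail over if it was the active node."""
        self.healthy.discard(node)
        if node == self.active:
            self._fail_over()

    def _fail_over(self) -> None:
        # Move the application and the service IP to the first healthy standby.
        standby = [n for n in self.nodes if n in self.healthy]
        if not standby:
            raise RuntimeError("no healthy standby node available")
        self.active = standby[0]

    def resolve(self) -> tuple[str, str]:
        """What a client sees: the fixed service IP and the node behind it."""
        return self.service_ip, self.active


cluster = ActivePassiveCluster(["node-a", "node-b"], "10.0.0.50")
before = cluster.resolve()
cluster.mark_failed("node-a")
after = cluster.resolve()
print(before, after)
```

The service IP is identical before and after the failure, which is why neither the agents nor the users need to be reconfigured.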

Because this method offers transparent failover with almost no interruption, it is considered one of the best methods for Control-M high availability. However, organizations should always bear in mind the cost of acquiring and maintaining the clustering infrastructure.

Control-M/Server mirroring and failover

Operating system-level clustering requires purchasing of additional software and special storage hardware, as well as technical experts to implement and maintain it. Some organizations may not have the budget and the expertise to do so. As an alternative, users can consider Control-M's built-in capability to achieve mirroring and failover at the Control-M/Server level with minimal additional cost.

Control-M/Server database mirroring

Control-M/Server database mirroring offers real-time data duplication to a secondary database purely at the application level, using mechanisms within the Control-M/Server. During normal operation in mirroring mode, the Control-M/Server executes each database update query in both the primary and the secondary databases to keep them consistent. Because the two databases always hold the same information, the Control-M/Server can be manually switched over to the secondary database for continuous operation if the primary database fails.

Database mirroring can be implemented on all types of Control-M/Server supported databases, that is, Sybase, Oracle, MSSQL, and PostgreSQL. The method requires an additional machine to hold the secondary database. The secondary machine should have similar (preferably identical) hardware specifications and capacity to the primary database machine to avoid a performance bottleneck. The primary and secondary machines can reside at different locations, as long as the network connection between them is reliable. Configuring the mirrored database is straightforward and does not require specific database knowledge, as every step of the configuration is done by utilities provided with Control-M/Server.
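The dual-write idea behind mirroring can be sketched in a few lines. The example below is a conceptual illustration only, using two in-memory SQLite databases; the `MirroredStore` class and the `job_status` table are hypothetical and say nothing about Control-M's actual schema or internal mechanism.

```python
import sqlite3


class MirroredStore:
    """Apply every update to a primary and a secondary database (illustrative only)."""

    def __init__(self) -> None:
        self.primary = sqlite3.connect(":memory:")
        self.secondary = sqlite3.connect(":memory:")
        self.current = self.primary  # database used for reads
        for db in (self.primary, self.secondary):
            db.execute(
                "CREATE TABLE job_status (job TEXT PRIMARY KEY, state TEXT)"
            )

    def update(self, sql: str, params: tuple = ()) -> None:
        # The same update statement runs against both databases,
        # so the two copies stay consistent.
        for db in (self.primary, self.secondary):
            db.execute(sql, params)
            db.commit()

    def switch_to_secondary(self) -> None:
        # Manual switchover: from now on, read from the mirror.
        self.current = self.secondary

    def state_of(self, job: str):
        row = self.current.execute(
            "SELECT state FROM job_status WHERE job = ?", (job,)
        ).fetchone()
        return row[0] if row else None


store = MirroredStore()
store.update("INSERT INTO job_status VALUES (?, ?)", ("nightly_load", "EXECUTING"))
store.update(
    "UPDATE job_status SET state = ? WHERE job = ?", ("ENDED_OK", "nightly_load")
)
store.primary.close()        # simulate losing the primary database
store.switch_to_secondary()  # manual switch to the mirror
print(store.state_of("nightly_load"))
```

Because each update was applied to both copies, the job state survives the loss of the primary database.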

Control-M/Server failover

The database mirroring offered by Control-M/Server can only handle failures at the database level. The batch environment will still be interrupted if the disruption extends beyond the database to the entire machine that holds the Control-M/Server installation. Control-M/Server failover is another built-in mechanism that works together with database mirroring to allow a complete switchover to a standby Control-M/Server installation in the event of the primary site's total failure. When a failover is required, users need to manually shut down the primary Control-M/Server and start the secondary Control-M/Server, as well as update the Control-M/Server's communication details in Control-M/Enterprise Manager.

Again, the secondary machine that holds the standby Control-M/Server and mirror database installation should be identical to the primary to ensure performance. It can run in a remote location, as long as there is a communication link between it and the primary and the network performance is consistent. Unlike the operating system clustering solution, with the built-in Control-M/Server failover mechanism both the primary and secondary Control-M/Servers operate with their local machine's hostname rather than a virtual hostname. For this reason, each Control-M/Agent needs to be configured to communicate with, and accept job submission requests from, both Control-M/Servers.

The limitation of Control-M/Server mirroring and failover is that a manual switchover to the secondary is required in the event of a failure, as well as a manual switch back to the primary after service is restored. However, these manual steps can be fully automated and triggered by a monitoring solution such as BMC's ProactiveNet Performance Management solution. This is possible because the mirroring and failover actions are driven by command-line utilities, which can be called by a third-party application in a non-interactive way. Technically, the switchover should take less than 10 minutes, but if it is performed manually, additional delay may occur because it takes time for someone to respond to the event. During the switchover, the Control-M/Server needs to be completely shut down, so no jobs can be scheduled until the switchover is complete. If such a delay is not acceptable at the business level, other high availability solutions might be worth considering, even though they come at additional cost.
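As a rough sketch of how a monitoring tool might automate the trigger, the function below calls a failover action once a health probe has failed several times in a row. Everything here is an assumption made for illustration: the function name, the `probe` and `fail_over` callables, and the threshold of three failures. In a real setup, `fail_over` would wrap the vendor's command-line utilities, which are deliberately not named here.

```python
import time
from typing import Callable


def watch_and_fail_over(
    probe: Callable[[], bool],
    fail_over: Callable[[], None],
    max_failures: int = 3,
    max_checks: int = 100,
    interval: float = 0.0,
) -> bool:
    """Run fail_over() once after max_failures consecutive failed probes.

    Returns True if the failover was triggered, False if max_checks
    probes completed without reaching the threshold.
    """
    consecutive = 0
    for _ in range(max_checks):
        if probe():
            consecutive = 0  # a healthy probe resets the counter
        else:
            consecutive += 1
            if consecutive >= max_failures:
                fail_over()
                return True
        if interval:
            time.sleep(interval)  # pause between probes
    return False


# Simulated probe results: one isolated blip, then a sustained outage.
results = iter([True, False, True, False, False, False])
actions: list[str] = []
triggered = watch_and_fail_over(
    probe=lambda: next(results),
    fail_over=lambda: actions.append("switch to secondary"),
    max_checks=6,
)
print(triggered, actions)
```

Requiring several consecutive failures avoids switching over on a single transient network glitch, at the price of a slightly longer detection time.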

Control-M mirroring and failover could be the ideal solution for small and medium-sized batch environments that have a more flexible batch window, because it achieves high availability without a massive investment in clustering software and additional storage technology. It is also worth keeping in mind that although the Control-M/Enterprise Manager server components do not offer the same mirroring and failover mechanism, some degree of HA is still possible through simple backup and restore processes, which can be done manually or automatically through a third-party application.

Control-M node group

As we mentioned in the earlier part of this article, multiple Control-M/Agents can be grouped together into a node group for round-robin-based, load-balanced job submission. Node groups can also provide continuous batch processing when one or more job execution nodes in the node group are unavailable. During such events, Control-M/Server automatically skips the problematic nodes and performs load-balanced job submission on the rest of the available Control-M/Agents. Once the failed nodes are restored, Control-M/Server automatically takes them into consideration again at the next job submission.
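The round-robin selection with skipping of unavailable nodes can be sketched as follows. This is a simplified illustration of the idea rather than Control-M/Server's actual algorithm; the `NodeGroup` class and the agent names are hypothetical.

```python
class NodeGroup:
    """Round-robin node selection that skips unavailable nodes (illustrative)."""

    def __init__(self, name: str, nodes: list[str]) -> None:
        self.name = name
        self.nodes = list(nodes)
        self.unavailable: set[str] = set()
        self._next = 0  # index of the next candidate node

    def pick_node(self) -> str:
        # Try each node at most once, starting from the round-robin position.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node not in self.unavailable:
                return node
        raise RuntimeError(f"no available node in group {self.name!r}")


group = NodeGroup("etl_group", ["agent1", "agent2", "agent3"])
print([group.pick_node() for _ in range(3)])  # all nodes healthy

group.unavailable.add("agent2")               # agent2 goes down
print([group.pick_node() for _ in range(3)])  # agent2 is skipped

group.unavailable.discard("agent2")           # agent2 is restored
print([group.pick_node() for _ in range(2)])  # agent2 is considered again
```

A restored node needs no special handling: it is picked up again simply because it is no longer in the unavailable set at the next submission.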

Node group technology is used to achieve continuous batch processing as an alternative to configuring Control-M/Agents with operating system clustering. To use this technology, each machine within the node group needs to be set up in advance with similar characteristics for handling submitted jobs, such as being able to access the common database, having the same operating system user defined for job execution, and being able to read and write a shared disk drive for the job's input and output. In addition, the batch jobs need to be MSA (Multi-Server Architecture) compliant so that they can be executed on any of the machines within the node group and produce a consistent outcome.

Control-M does not limit the number of nodes defined in a node group. It is good practice to create a node group even for a single node and to define jobs to be scheduled to the node group name. By doing so, the user can add, change, or remove job execution nodes in the node group without modifying the job definitions.
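The benefit of scheduling to a group name rather than a host name is a simple level of indirection, sketched below. The dictionaries and the `eligible_hosts` helper are hypothetical stand-ins and do not reflect Control-M's real job definition format.

```python
# Node group membership, maintained separately from the job definitions.
node_groups = {"billing_group": ["host-a"]}

# The job refers to the group name, never to an individual host.
job = {"name": "daily_invoice", "run_on": "billing_group"}


def eligible_hosts(job: dict, groups: dict) -> list[str]:
    """Resolve the job's node group to its current member hosts."""
    return list(groups[job["run_on"]])


print(eligible_hosts(job, node_groups))

# Adding capacity later only touches the group, not the job definition.
node_groups["billing_group"].append("host-b")
print(eligible_hosts(job, node_groups))
```

Even with a single host in the group, the job definition stays untouched when hosts are added, replaced, or removed later.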

High availability by virtualization technology

Virtualization technology offers a simple and cost-effective high availability solution and, at the same time, increases hardware utilization across the IT environment. As organizations start to transform their physical servers into virtual machines, it is worthwhile to consider implementing the Control-M applications on virtual machines to save physical machine cost while achieving cost-efficient high availability.

If the virtualized environment is configured properly, Control-M components no longer need to be installed on two or more machines for each instance. Instead, a virtual machine image holding the Control-M installation is stored on shared storage and activated in a physical machine's virtual environment. If that physical machine fails, the machine image is immediately restarted on the next available physical machine. The entire restart process is done automatically by the virtualization management software, requires no complex failover steps, and can be totally transparent to the end user and client machines.

Newer features of virtualization technology can even offer live migration of a virtual image from one physical machine to another without impacting the running Control-M applications. For example, by using VMware's vMotion technology, users can schedule hardware maintenance at any time without worrying about interrupting the batch environment. Live migration is also useful when a potential problem or capacity issue is detected on the current physical machine.

Virtualization technology offers a high availability standard similar to that of a traditional operating system cluster, but at much lower cost and with more flexibility. However, before deciding to implement Control-M in a virtual environment, load simulation testing should be performed to ensure that the virtual infrastructure delivers performance similar to a physical machine, and that storage shared across many virtual machines provides sufficient and consistent performance for the Control-M applications.


By the end of this article, you will be able to:

  • Understand the potential challenges of implementing a batch scheduling infrastructure, and overcome those challenges by taking the right approach at the right time to achieve enterprise-wide workload automation.

  • Analyze batch requirements and design an efficient batch environment by using the right Control-M technology at the right place to meet complex technical needs.

  • Perform Control-M installation on Windows and Linux-based systems and carry out Control-M post installation tasks to ensure installed components are running and working with each other properly.

You've been reading an excerpt of:

BMC Control-M 7: A Journey from Traditional Batch Scheduling to Workload Automation