The purpose of this chapter is to establish a foundation for the reader by identifying the key concepts, both technical and operational, that will eventually need to be applied to a SharePoint deployment within an organization.
In this chapter, we will cover:
Identifying Disaster Recovery (DR) scenarios within SharePoint and its associated technology stack
Inheriting a mission critical environment that has no existing DR plans
Traditional DR – the battle between cost and speed
Thinking in terms of service disruptions versus disasters
Four datacenter outages in 2012 that we can learn from
Building confidence by refining DR plans with more frequent testing
What is virtualization and how does it help with DR?
Efficiently supporting hybrid environments with virtualized DR
Cloud-based solutions welcoming a new approach
Tackling DR in a SharePoint environment is often a struggle for both seasoned SharePoint administrators and newbies, because of the different ways in which the platform can be deployed within an organization. Furthermore, it can prove challenging to apply existing DR experience to SharePoint, due to the distributed and componentized nature of SharePoint and the supporting technologies that need to be in place for SharePoint to function. SharePoint relies on technologies such as Microsoft SQL Server, Active Directory Domain Services (ADDS), Internet Information Services (IIS), and the .NET Framework, to name just a few, so there are a lot of dependencies and points of failure to identify during the DR planning stage.
In IT there is a misconception that more documentation, procedures, and processes equal better documentation, procedures, and processes. In fact, the opposite is true. To avoid piles of documentation, procedures, and processes that may be redundant or contradictory, the first thing you should do, as a best practice, is create a solid governance plan that details your procedures and processes.
A good governance plan is a living document that requires constant revision and adjustment to maintain a crisp and agile process. The administrators who do the work should own the processes and maintain them with the help and input of the business stakeholders. I have seen too many businesses where the stakeholders define the policies and procedures thinking only about the business needs and giving little or no thought to the technical side of things, so the documentation and procedures are unrealistic and prone to failure.
SharePoint environments are extremely complex systems that require constant monitoring, planning, and maintenance; you cannot just deploy a farm and hope to have a stable, secure collaboration platform. This is why having a solid and well thought out governance plan is crucial. But the reality is that most organizations either have an inadequate governance plan or none at all.
One of the main causes of system failure is weak processes and procedures. This usually happens when the people responsible for creating, implementing, and tweaking them (usually the governance board) are not continually monitoring and reviewing them to keep them up-to-date, and the administrators are not testing regularly and reporting the issues they find back to the governance board. What usually happens then is that the administrators start coming up with quick fixes and workarounds to keep things going in the short term, but sooner or later they get tired or frustrated, or they leave before things really go wrong. By then, it is too late to prevent the catastrophe that has been brewing.
So management must understand that they need enough staff on the ground, not just to keep things up and running but to maintain a healthy and stable environment, and that they need a well thought out governance plan. Staff on the ground must immediately report to management any situation that could lead to system failure or data loss.
Is failure necessary for success? Processes and procedures must be tested and improved continuously; testing is how you find the weaknesses and flaws that you would otherwise discover only in the midst of a system failure. This is the main reason for governance: people taking ownership of change and reacting to it constructively. So the answer is yes, failure is necessary for success, but if you are testing regularly, these failures will happen in a controlled environment.
Part of the problem for an administrator is understanding how the supporting technology stack integrates with SharePoint. Although an administrator may know ADDS, IIS, and SQL Server, and how these software stacks work, they may be unfamiliar with how these technologies work together with SharePoint. This topic is covered in later chapters of this book.
The other challenge with SharePoint DR is that an administrator may not realize or understand which business activities rely on SharePoint, and will have a hard time putting together an appropriate DR plan. With e-mail, there is no question that it is considered mission critical and needs constant uptime in an organization. But with SharePoint, this may not always be the case.
Perish the thought. Someone in IT needs to wear a business hat and speak to the business managers to understand their business needs and SharePoint activities, and how mission critical these are. This should not be a once-a-year exercise, but rather an ongoing interaction, so that the entire team is on the same page and completely understands the business needs.
In scenarios where proper backups were not taken, restoring a SharePoint server is much more problematic. Because SharePoint does not run in a vacuum, proper planning must account for three components: SQL Server, IIS, and Active Directory.
SQL-specific issues will be covered later in this chapter. With regard to server restoration, remember that if a SQL alias was not used, SQL Server may not perform as expected when the server is renamed. Although most connections to a SQL Server are socket or named pipe-based, a rename can cause some aberrant behavior if not properly planned. In addition to this issue, permissions at the instance and database levels also merit inspection.
For more information see Plan for backup and recovery in SharePoint 2013 available at:
For more information see Backup and restore: SharePoint server 2013 available at:
IIS maintains its configuration in different stores depending on the version; the location and the tools for managing it changed between IIS 6.0 (Windows Server 2003) and IIS 7.0 (Windows Server 2008). In the early versions of IIS, the web server's configuration was stored in an XML-based Metabase. Under the current versions of IIS, the server-level configuration is saved in the applicationHost.config file, with site- and application-level settings stored in web.config files. Previously, backup and restoration of this data was integrated into the IIS Manager utility; now the command-line tool AppCmd.exe is used for disaster protection of the configuration.
The AppCmd.exe file is located in the %systemroot%\system32\inetsrv\ directory. This directory is not in the PATH environment variable, so the tool will not run by name alone; you need to use the full path when executing commands, for example, %systemroot%\system32\inetsrv\AppCmd.exe list sites, or you can manually add the inetsrv directory to the PATH on your machine so that you can run AppCmd.exe from any location. The configuration data is specific to the current web applications and settings and may be lost if you simply move a site to a different server.
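As a minimal sketch of this protection in practice, the following AppCmd commands create, list, and restore a named snapshot of the IIS configuration; the backup name PreRestoreCheck is only an example, and the snapshots are written under the inetsrv\backup directory:

```
%systemroot%\system32\inetsrv\AppCmd.exe add backup "PreRestoreCheck"
%systemroot%\system32\inetsrv\AppCmd.exe list backup
%systemroot%\system32\inetsrv\AppCmd.exe restore backup "PreRestoreCheck"
```

Shipping these snapshot directories offsite along with your other backups gives you a way to rebuild the IIS layer of a SharePoint server without reconfiguring it by hand.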
Active Directory is another service that touches SharePoint in several ways. The primary concern is ensuring that the computer account for the SharePoint server has the correct memberships in Active Directory. The various service accounts and permission groups for SharePoint are also held in Active Directory. If the identity of the server was maintained, then Active Directory will not need to change when the server is restored. Otherwise, remapping the old identity to the new one may be necessary.
Disaster protection of a SharePoint server is a layered approach. The outer ring of software protection is the operating system; protecting and restoring it is the first, critical step in restoring a SharePoint server. The most important goal in server restoration is maintaining the identity of the server, even in cases where both the software and the hardware are destroyed. The second step is to identify the factor that shapes and, in many cases, dictates a restoration strategy. A proper DR plan will allow rapid restoration of the server, sometimes with several options available.
When choosing a DR approach, organizations base the decision on the level of service required, as measured by two recovery objectives:
The preceding objectives are still relevant with SharePoint, along with the amount of money the business is willing to spend. This is covered in depth in Chapter 2, Creating, Testing, and Maintaining the DR Plan, and Chapter 6, Working with Data Sizing and Data Structure.
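As a rough illustration of how these objectives interact with a candidate standby option, the following sketch checks whether an option qualifies; all names and figures here are hypothetical, not prescriptions:

```python
# Hypothetical check of candidate DR options against recovery objectives.
# RTO: maximum tolerable time to restore service, in hours.
# RPO: maximum tolerable window of data loss, in hours.

def meets_objectives(restore_hours, backup_interval_hours, rto_hours, rpo_hours):
    """A DR option qualifies if it can restore service within the RTO and
    its worst-case data loss (the gap between backups or replication
    cycles) stays within the RPO."""
    return restore_hours <= rto_hours and backup_interval_hours <= rpo_hours

# Cold standby: rebuild the farm (~48 h) from nightly backups.
print(meets_objectives(48, 24, rto_hours=4, rpo_hours=1))   # False
# Hot standby with near real-time replication and a ~1 h manual failover.
print(meets_objectives(1, 0.1, rto_hours=4, rpo_hours=1))   # True
```

The point of the exercise is that the objectives, not the technology, drive the choice: once the business sets the RTO and RPO, whole classes of standby options rule themselves out.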
With dedicated and shared DR models, organizations are often forced to make trade-offs between cost and speed. As the need to achieve high availability while reducing costs continues to grow, organizations can no longer accept such trade-offs. A bank, for example, cannot use a cold standby model just because it is cheaper; the C-level executives, such as your CIO, will want to know why it took four or five days to recover and why data was lost, possibly costing your organization thousands of dollars. There is no set rule here; the formula is simply how much your organization is willing to pay and how much data loss is acceptable.
Most organizations where SharePoint is mission critical use a hot standby; this is a duplicate farm in a DR datacentre. Depending on how much downtime is acceptable to your organization and how much time you want to spend keeping both farms synchronized, you would make one of the following decisions:
Have only three servers running and the rest turned off; in the case of a disaster, you would turn on the remaining servers and add whatever solutions and patches need to be applied.
Have all your servers live all the time; this is much faster, but obviously more expensive.
I was the lead architect for recovery.gov. They have 45 servers on the AWS cloud in one region and 45 servers in their DR region. Although all the servers are live, it is not an active-active environment; it is an active-passive environment.
In case of a disaster, they would need to fail over to their DR farm manually; this is about a one-hour window, which is acceptable to them. So you see, the decision is yours: what is an acceptable loss of data, and what is an acceptable amount of downtime?
While DR was originally intended for critical back-office processes, many organizations are now dependent on real-time enterprise applications like SharePoint that handle everything from their Internet, intranet, and extranet sites, which are primary interfaces for their clients and employees. A single minute of downtime may cost them thousands of dollars.
Standby datacentres are required for scenarios where local redundant systems and backups cannot recover from an outage at the primary datacentre. The time it takes to get a farm up and running in a different location determines whether it is called a hot, warm, or cold standby. Our definitions of these farm recovery datacentres are as follows:
Each of these standby datacentres has an associated cost to operate and maintain.
The slowest option to recover.
Some datacentres do not have in-house SharePoint expertise to deploy and configure your farm, so you will need to implement a solution to facilitate this, such as Microsoft's System Center Data Protection Manager or a PowerShell script. You may still run into problems, such as the hardware not being the same, which can cause all sorts of problems and delays.
Warm standby DR strategy: A business ships/uploads backups or virtual machine images to local and regional disaster recovery farms.
Can be very expensive and time consuming to maintain.
You pay a lot of money in storage fees. If you take a backup of one of your servers and it is 90 GB in size, the virtual machine image will be 90 GB in size; multiply that by 6 or 10 servers, and add the cost of uploading that data every time you send the datacentre a new backup, not to mention the cost of having them load those images and, of course, test them at least once a month. (Remember: if you haven't tested it and had a successful restore, it is not a good DR plan; it's a shot in the dark.)
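To make that arithmetic concrete, here is a back-of-the-envelope sketch; the per-GB storage and transfer rates are invented placeholders, not real provider pricing, so substitute your own figures:

```python
# Rough monthly cost of shipping full server images to a warm standby site.
# The storage and transfer rates below are hypothetical placeholders.
image_gb = 90            # size of one server image, as in the example above
servers = 6              # number of servers in the farm
refreshes_per_month = 4  # how often a fresh set of images is uploaded
storage_rate = 0.05      # $ per GB-month stored at the DR site
transfer_rate = 0.09     # $ per GB uploaded

stored_gb = image_gb * servers                     # GB held at the DR site
transferred_gb = stored_gb * refreshes_per_month   # GB moved each month
monthly_cost = stored_gb * storage_rate + transferred_gb * transfer_rate
print(f"{stored_gb} GB stored, {transferred_gb} GB uploaded, "
      f"${monthly_cost:.2f}/month")
```

Even with modest rates, the transfer component dominates once images are refreshed weekly, which is why warm standby costs climb quickly with farm size.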
Hot standby DR strategy: A business runs multiple datacentres, but serves content and services through only one datacentre.
It is often fairly fast to recover. If you are using third-party tools, such as Metalogix Replicator, that can synchronize two or more distant SharePoint farms in real time, you can ensure that SharePoint content is always available and up-to-date. Bi-directional replication syncs all your SharePoint content: documents, sites, applications, permissions, and workflows, with full metadata and versioning. Replicator can sync immediately after changes happen or on a regular schedule.
In a cold standby disaster recovery scenario, you recover by setting up a new farm in your cold standby datacentre and restoring the backups that you have stored there. In this scenario, if your primary farm fails before you make the backups to ship out to the cold standby datacentre, you will lose all the data added or changed since your last backup.
In a warm standby disaster recovery scenario, you have to create a duplicate farm in the warm standby datacentre and ensure that it is updated regularly by using full and incremental backups of the farm in the primary datacentre. This requires some continuous monitoring, server maintenance, SharePoint upgrades, and other data activity to keep the environment warm. In the event of a failure, you will lose all the data added or changed since your last backup.
Virtualization provides a cost-effective option for a warm standby recovery solution. Typically, you can use Hyper-V or VMware as an in-house solution for recovery. This is explained in further detail in Chapter 4, Virtual Environment Backup and Restore Procedures. But even this has its downside: if it takes two days for the VMs or backups to reach the DR datacentre, or to upload all the VMs there, your backups are now two days out of date.
Otherwise, you have to make sure that the virtual images are created often enough to provide the level of farm configuration and content freshness that you must have for recovering the farm at the secondary DR site. You must have an environment available in which you can host the VMs. We will dig a bit deeper into virtualization technologies later in this chapter.
In a hot standby disaster recovery scenario, you have to create a duplicate farm in the hot standby datacentre, so that it can assume production operations almost immediately after the primary farm fails. This requires a third-party tool, such as Metalogix Replicator for real time synchronization.
For more information on Metalogix Replicator visit, http://www.metalogix.com/Products/Replicator/Replicator-for-SharePoint.aspx.
Whichever RTO and RPO you target, traditional DR infrastructure comes in shared and dedicated models. These are explained below.
In a dedicated model, the infrastructure is dedicated to a single organization. Compared to other traditional models, this can offer a faster recovery time, because the IT infrastructure is mirrored at the disaster recovery site and is ready to be called upon in the event of a disaster. While this model can reduce the RTO because the hardware and software are preconfigured, it does not eliminate all delays; you still need to restore the data. This approach is costly because the hardware sits idle when not being used for disaster recovery. Some organizations use the DR infrastructure for development and testing to mitigate the cost, but that introduces additional risk. When organizations start using their DR site for development or testing, it becomes a huge problem: when the time comes to use it for an actual disaster, the farms are drastically different, there are solutions that were not maintained or documented correctly, and now you are in a bind.
In a shared model, the infrastructure is shared among multiple organizations so it is more cost effective. After a disaster is declared, the hardware, the operating system, and the application software at the disaster site must be configured from the ground up to match the IT site that has declared a disaster. On top of that, the data restoration process must be completed. This can take hours or even days.
This is normally a service provided by the company that is managing your data operations.
There is also a hybrid model, where a SharePoint technology such as SQL Server leverages a DR process from another application; this does reduce costs, but of course both DR plans need to stay in sync. This can also become very complex: how do you separate the two, and when it comes to restoring, what is the process? I personally don't like this model because of its complexity, and as a best practice it is never a good idea to add any other database to your SharePoint SQL Server.
Most people think of disaster recovery as a plan that is in place in case of a disaster, such as:
Weather related events, such as floods, tornadoes, hurricanes, and forest/brush fires
Any of these disasters can disable your primary datacentre and you would have to failover to your DR datacentre. However, most application interruptions are due to more mundane everyday occurrences, such as:
Fiber or communication lines are cut – loss of network
Power failures – outage or sporadic service
Cut power line
Security breach – hacking and/or malicious code
Water pipe breaks in a facility
Human error, such as a redundant system's failure that goes unnoticed
Another dimension to this point is covered in Chapter 8, Disaster Recovery Techniques for End Users. Think of who is interrupted: sales force, trading floor, executives, or end users.
This may seem like a trivial point, but IT has only so much manpower to dedicate to issues.
Superstorm Sandy, Oct 29-30: Datacenters in New York and New Jersey were impacted by the storm, ranging from downtime because of flooding to days on generator power for facilities around the region. Sandy was a storm that caused more than just a single outage, and it tested the resilience and determination of the data center industry on an unprecedented scale.
Go Daddy DNS outage, Sept 10: Go Daddy is one of the biggest DNS service providers, as it hosts 5 million websites and manages more than 50 million domain names. That's why the Sept 10 outage was one of the most disruptive incidents of 2012. The six-hour incident was a result of corrupted data in router tables.
Amazon Outage, June 29-30: AWS EC2 cloud computing service powers some of the web's most popular sites and services, including Netflix, Heroku, Pinterest, Quora, Hootsuite and Instagram. A system of strong thunderstorms, known as a derecho, rolled through northern Virginia causing a power outage to the AWS Ashburn datacenter. The generators failed to operate properly, depleting the emergency power in the Uninterruptible Power Supply (UPS) systems.
Calgary data center fire, July 11: A datacenter fire in the Shaw Communications facility in Calgary, Alberta delayed hundreds of surgeries at the local hospitals. The fire disabled both the primary and backup systems that supported key public services. This was a wake-up call for government agencies to ensure that the datacenters that manage emergency services have failover systems.
This is why having a well-planned DR strategy is so important: because of unforeseen events like the preceding ones.
When you think of virtualization, think of it as a way of consolidating servers. Virtualization is the process of separating the software layer of a server from its hardware layer. A new layer is placed between the two to act as a go-between; this layer is known as the hypervisor.
Companies used to have multiple servers, with each server operating system on its own piece of hardware. In virtualization, the entire server, including the operating system, applications, patches, and data, is encapsulated into a single image, or virtual server.
A single physical server, called the host, can run four or five of these images or virtual servers simultaneously, giving the company the following benefits:
Lower hardware purchase costs, as fewer servers are needed
Consolidated management of the machines
Reduced energy consumption, as there are fewer servers
Much more efficient use of resources, as the virtual machines share the host's resources
Failure in one machine will not lead to the failure of others
Most computers operate using as little as 4 to 7 percent of their resources. There are many virtualization companies in the market; the two that we have mentioned in this chapter are Microsoft Hyper-V and VMware.
As the complexity of IT departments increases, including multiple server farms, multiple farm environments, and federated farms, the ability to respond to a disaster or outage has become more complex. Depending on what standby recovery model you are using, you may need to recover on different hardware, which can take longer and increase the possibility for errors and data loss.
Organizations are implementing virtualization technologies, such as Hyper-V and VMware in their datacenters to help remove some of the underlying complexities, and optimize infrastructure utilization. Cloud-based business resilience solutions must offer both Physical-to-Virtual (P2V) and Virtual-to-Virtual (V2V) recovery, in order to support these types of environments.
When a cloud is made available in a pay-as-you-go manner to the public, we call it a public cloud; the service being sold is Utility Computing. Current examples of public clouds include Amazon Web Services, Google, Rackspace, and Microsoft Azure.
The term 'private cloud' refers to internal datacenters of a business or other organization that are not made available to the public.
There are three major cloud models:
Because the servers are on demand and do not need to be purchased, there are a lot of benefits to cloud computing. There is a huge cost benefit, as well as rapid provisioning, scalability, and elasticity. Some cloud vendors even offer a complete DR package and allow you to outsource your DR to them. The last chapter in this book is dedicated to evaluating DR in the context of the cloud.
It is a good practice to test your DR plan regularly by failing over to your DR site. I would suggest doing this failover test once a month, but this may not be possible for all enterprises; at the very least, you should be doing it once a quarter to gain confidence and peace of mind, knowing that you have a solid and reliable DR plan that works. You have no idea how many clients I have visited who tell me they have a solid DR plan, but when we test it, it rarely works! This mind-set exists because too many companies believe in set it and forget it. As we mentioned previously, you must test your DR plan by failing over to your DR site to ensure that it works.
You wouldn't want to find all the issues in your DR plan in the middle of a disaster. Backups alone are useless if you don't have a place to restore them. The way to go from "ought to work" to "known to work" is through testing!
The reasons for infrequent testing are usually budgets and the scarcity of time. This is why most failures are discovered during a disaster, and at that point you have few or no practical alternatives.
A well-developed disaster recovery plan will identify all the key processes and steps needed to fail over to your DR site. It should include a predefined schedule for testing; after each test, document any weaknesses found and what was done to correct them.
New technologies, such as virtualization and cloud computing, make regular (even daily) testing feasible. These technologies allow you to automate processes and provide a foundation for ongoing RTO and RPO reporting at the management level, allowing you to better estimate and mitigate risks for the business.
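As a small illustration of what such reporting might look like, the following sketch compares a set of failover drill timings against a target RTO; the drill names and durations are invented sample data:

```python
# Illustrative RTO report for regular failover drills.
# The drill names and durations below are hypothetical sample data.
from datetime import timedelta

target_rto = timedelta(hours=4)

# (drill name, measured time to restore service)
drills = [
    ("Q1 failover drill", timedelta(hours=5, minutes=30)),
    ("Q2 failover drill", timedelta(hours=3, minutes=45)),
    ("Q3 failover drill", timedelta(hours=2, minutes=50)),
]

for name, duration in drills:
    status = "PASS" if duration <= target_rto else "FAIL"
    print(f"{name}: {duration} -> {status}")
```

A trend of shrinking drill times, tracked this way over quarters, is exactly the kind of evidence that gives management confidence the DR plan actually works.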
This chapter has set the stage so that readers understand and are aware of the importance of planning and creating a SharePoint DR plan to support their SharePoint deployment.
The following chapter explains how to perform testing and maintenance on a DR environment so that the administrator has confidence that the documentation is actually workable.