Active Directory Disaster Recovery

By Florian Rommel

About this book

Murphy's law states that anything that can go wrong will go wrong. In relation to Information Systems and Technology this could mean an incident that completely destroys data, slows down productivity or causes any other major interruption of your operations or your business. How bad can it get?—"Most large companies spend between 2% and 4% of their IT budget on disaster recovery planning; this is intended to avoid larger losses. Of companies that had a major loss of computerized data, 43% never reopen, 51% close within two years, and only 6% will survive long-term." —Jim Hoffer, Backing Up Business – Industry Trend or Event.

Active Directory (AD) is a great system, but it is also very delicate. If a problem occurs, you need to know how to recover from it. You need to understand Disaster Recovery and be prepared with a business continuity plan. If Active Directory is part of the backbone of your network and infrastructure, the guide for bringing it back online after an incident needs to be as clear and concise as possible. Whether you are facing such an incident or want to be prepared for one, this is the book for you.

Recovering Active Directory from any kind of disaster is trickier than most people think. If you do not understand the processes associated with recovery, you can cause more damage than you fix. This is why you need this book.

This book has a unique approach—the first half focuses on planning and shows you how to configure your AD to be resilient; the second half is response focused and meant as a reference in which we discuss different disaster scenarios. We follow a Symptom-Cause-Recovery approach—so all you have to do is follow along and get back on track.


This book describes the most common scenarios and how to properly recover your infrastructure from them. It provides commands and steps for each process, along with information on how to plan for disaster and how to leverage technologies in your favor when one strikes.

You will encounter these types of disaster and incident in the book, and learn how to recover from them:

  • Deleted objects
  • Single domain controller hardware failure
  • Single domain controller AD corruption
  • Site AD corruption
  • Site hardware failure
  • Corporate AD corruption
  • Complete corporate hardware failure
Publication date: June 2008
Publisher: Packt
Pages: 252
ISBN: 9781847193278

 

Chapter 1. An Overview of Active Directory Disaster Recovery

When Microsoft introduced Active Directory (AD) with Windows 2000, it was a huge step forward compared to the aged NT 4.0 domain model. AD has since evolved further and has emerged as almost the de facto standard for corporate directory services.

Today, if an organization is running a Windows Server based infrastructure, then it is almost certainly running AD. There are still some organizations that have NT 4.0 DCs, though that is quickly changing.

AD is often used as THE authentication database even for non-Windows-based systems because of its stability and flexibility. There are many network-based applications relying on AD without their users being aware of it. For example, an HR application can use AD as a directory for personnel information such as name, phone number, email address, location in the company, and even the computer of the user. Yet the HR personnel may not be aware that the same information directory is used to fetch all the information for the global address book in the email system, and to authenticate the user when he or she logs on to his or her workstation.

Due to the strong integration between applications and AD, an event that causes an outage can have a huge impact on systems, from sales to human resources, all the way to payroll and even logistics in manufacturing companies.

In most cases where AD is used for more than just authentication, it quickly becomes the IT infrastructure's lifeline, which, if interrupted or stopped, causes chain reactions of failures that can bring a company to a halt, and stop production, communications, and delivery of goods.

Of course, once you have an AD running, a logical step is to have Exchange as your email and collaboration system. If you have both systems, then you know how critical AD is for Exchange. Without an AD, the email and collaboration systems will not function. For many companies, being without email functionality for even a day can be catastrophic. If email is your main method of communication within the organization, then picture having your preferred method of communicating taken away for an entire day (or more) within your entire organization. This applies to receiving as well as sending, and access to your mailbox and related functions.

As you might have noted by now, a proper Disaster Recovery (DR) plan is a necessity, and carrying out the recovery properly is just as critical. You need to cut the possible downtime of your mission-critical systems to a minimum.

What is Disaster Recovery?

Disaster Recovery (DR) is, or should be, part of your Business Continuity plan. It is defined as the way of recovering from a disturbance to, or a destructive incident in, your daily operations. In the context of Information Systems and Technology, this means that if an incident completely destroys data, slows down productivity, or causes any other major interruption of your operations or your business, the process of reverting to normal operations with minimum outage from that incident is called Business Continuity. Disaster Recovery is, or should be, a part of that process.

You could say that Business Continuity and Disaster Recovery go hand in hand, but they do vary depending on the area and subject. For example, if your WAN connection goes offline, it means that your business units can no longer communicate via email or share documents with each other, although each local unit can still operate and continue to work. This scenario would definitely be outlined in your Business Continuity Plan. However, if your server room burns down in one location, the rebuilding of the server room and the data housed in it would be Disaster Recovery.

The problem with Disaster Recovery is that the approach varies for different domains and applications. Also, the urgency and criticality vary across areas and subjects. A lot of companies have a very superficial Business Continuity plan, if they have any plan at all, and have Disaster Recovery plans that are just as superficial. A visual outline of a sample Business Continuity plan is shown below:

As you can see, DR is only a part of the greater picture. It is, however, one of the most crucial parts that many IT departments forget, or decide to overlook. Some even seem to think that DR is not an important step at all.

 


Why is Disaster Recovery Needed?


A lot of people may ask themselves: "Why would we need a 'guide' for Disaster Recovery? If a Domain Controller (DC) has a critical failure, we just install another one". This might seem to work at first, and even for a longer period in small organizations, but in the long run, there would be problems, and a lot of error messages. Correct recovery is crucial to ensure a stable AD environment. The speed at which problems appear grows exponentially if there are multiple locations of various sizes across different time zones and countries. For example, let's say a company called Nail Corporation (www.nailcorp.com) has its headquarters in Los Angeles, California, and branch offices with several hundred employees in Munich, Germany, in addition to branch offices in Brazil and India.

NailCorp has one big AD domain and a data center in Brazil with a 512 kilobit-per-second link to the headquarters. Let's suppose that the data center in Brazil is partially destroyed due to an earthquake. Network connectivity is restored fairly quickly, but both DCs are physically broken and have therefore become non-functional. The company has around 10,000 employees and, according to Microsoft's AD Sizer software, the space requirement for each Global Catalog server is about 5 GB.

As you have to start the rebuild process from scratch, and you have no other DC at the site, you have to replicate 5 GB over a 512 kilobit-per-second link. Even assuming that you get maximum connectivity speed and that no other traffic is flowing at the same time, which is nearly impossible because your users will inevitably boot their machines and want to start working, you would need about a day to replicate the database, and in practice more. This will increase your restoration time even further, in this case by roughly a day or more.
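The estimate above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 kilobit = 1,000 bits) and introduces a hypothetical efficiency parameter for the share of the link that replication actually gets; neither assumption comes from the NailCorp scenario itself. At the full theoretical line rate the transfer takes a little under a day, and with any realistic overhead or competing traffic it takes more than a day.

```python
def transfer_time_hours(size_gb: float, link_kbit_s: float, efficiency: float = 1.0) -> float:
    """Estimate how many hours it takes to copy size_gb over a link.

    efficiency is a hypothetical factor (0 < efficiency <= 1) for the share
    of the nominal bandwidth that replication can really use.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    size_bits = size_gb * 1e9 * 8                     # decimal gigabytes to bits
    usable_bits_per_s = link_kbit_s * 1e3 * efficiency
    return size_bits / usable_bits_per_s / 3600       # seconds to hours


if __name__ == "__main__":
    # NailCorp example: 5 GB database, 512 kbit/s link.
    print(f"ideal link:      {transfer_time_hours(5, 512):.1f} hours")
    print(f"80% of the link: {transfer_time_hours(5, 512, 0.8):.1f} hours")
```

Running the same function with the size of your own directory database and the speed of your slowest site link gives a quick lower bound for a rebuild that depends on replication alone.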

In the event of a disaster at a company such as NailCorp, you would want to replicate and rebuild as fast as possible. During that time, since you have machines authenticating against the other domain controllers in your company (assuming your DNS service is globally configured to support failover), your replication will be much slower. In this case, you should have different plans in place than just installing another DC.

Note

To learn more about how DNS and authentication (DC selection) for Windows XP clients work, please read Microsoft's Knowledgebase article 314861 (http://support.microsoft.com/kb/314861).

Another good example is an application that authenticates against a specific DC, or pulls specific information from one. If that DC breaks, the DC will have to be rebuilt with the same name. If you do not do this the right way, you may see strange things happening. This is not very far-fetched, especially in, for example, a software development company.

The need for Disaster Recovery is ever-increasing, and there are several books that touch upon the subject. But none of them are dedicated to different scenarios, and certainly none of them explain the entire process.

Recovering AD from any kind of disaster is trickier than most people think. If you do not understand the processes associated with recovery, you can damage more than you fix.

In order to prevent any kind of major interruption, and to speed up recovery in the event of a disaster, there are several things that can be done.

For example, AD relies heavily on DNS. If you use AD Integrated (ADI) DNS zones, make sure that you have a standard backup DNS server that holds a complete copy of your zones in a non-integrated form. This DNS server should be on an isolated network, and should contain only the records and zones relating to AD, and not all existing dynamic updates.

You should also have a Delayed Replication Site (DRS), also called a lag site. This is a standard part of your AD domain. It should have one or two DCs, maybe a DNS server, and even a standby Exchange server in case one is needed. However, the AD replication is set up with a high link cost in order to prevent replication for a longer time period. Alternatively, you can make it a completely isolated site behind a firewall and force replication only once every one to three months. This gives you a known, stable copy of your infrastructure. Its state may be up to three months old, but if anything happens you can have a running AD within a few hours, instead of days.

Virtualization can be a boon, especially in this case. Buying a server is fairly cheap nowadays, and for a DRS, you only need a lot of memory in the machine. VMware Server (http://vmware.com/products/server/) and Microsoft Virtual Server (http://www.microsoft.com/windowsserversystem/virtualserver/) can be downloaded and used for free nowadays. Both of these systems allow the DRS to be run in a virtualized, isolated environment.

Having a DRS can reduce restore time tremendously because, even if there is a global failure, the old DCs can be removed and new ones installed to replicate from the DRS.

 

Conventions Used in This Book


To avoid repetition, acronyms have been used wherever possible in this book. The following is a list of acronyms, with their respective explanations, used in this book:

  • DC: Domain Controller (the server that acts as an authentication and directory authority within a domain).

  • OS: Operating System (Windows 2000 and all 2003 Server varieties).

  • IP Address: Internet Protocol Address. (This is the address that a computer uses to uniquely identify itself in a network.)

  • AD: Active Directory (Microsoft Directory Service used for authentication and domain related information).

  • DNS: Domain Name System (a crucial service that AD relies on to map domain names to IP addresses, and vice versa).

  • FSMO Roles: Flexible Single Master Operations roles (the special roles that particular DCs hold within a domain or forest).

  • NTDSAM and NTDS: NT Data Storage and Architecture (in AD, the data store contains database files and processes that store and manage directory information for users, services, and applications; basically, this is the back end of AD).

  • FRS: File Replication Service (the service that replicates the SYSVOL share, including Group Policy files and logon scripts, between DCs).

 

Disaster Recovery for Active Directory


We have established that DR is an important part of a Business Continuity plan. Now we can go further and say that DR for AD is only a part of a Disaster Recovery plan, and not the whole plan by itself.

You are correct if you think that you should have different DR guides for different things. While writing good DR documentation, it is important to take the standpoint that the person who performs the recovery has little or no knowledge of the system. If you roll out your own hardened and customized version of Windows Server 2003, some things might differ during the installation, and someone who has no clear guide will install a system that differs from your actual DC install guidelines. This can cause incompatibility or result in an improperly functioning system later on. This happens, say, when specific policies are applied to DCs and the policies selected during the install process differ from what the DC policy dictates.

You might think that this situation will never arise, but Hurricane Katrina in the U.S., and the tsunami that struck Thailand, India, and other countries, prove that it can. Situations may arise when a knowledgeable person is not around at the time of crisis, so the guide needs to be as clear as possible. It may also be possible that the person doing the actual recovery is an external IT consultant or a junior IT staff member because the senior and trained staff are not available. In this case, the person handling the recovery may not be familiar with your environment at all.

AD is a great system, but it is also very complex. Performing correct DR is therefore crucial. If AD forms a part of, or is the backbone of, your network and IT infrastructure, a proper guide to bringing it back online in the event of an incident needs to be as clear and concise as possible.

The Business Continuity plan, and the DR guides, especially the AD DR guides, should be practiced and tested at regular intervals. This effectively means that once a year or so, you need to test that your guides are working and that they will actually bring your business back online. In order to test all kinds of scenarios, building a test environment—preferably virtualized because it gives you much more flexibility such as rollbacks and snapshots—is a necessity.

Note

Never test anything in your production environment. Rather, take a backup of your live AD database and restore it to an isolated (virtual) test AD. Make the test AD as close to your production AD as possible, and test there. This also goes for hotfixes and schema changes, even if it is just "a small change that won't affect anything". If it's a change, it will eventually affect something.

It may be difficult to convince the top management that your systems could actually fail, but replicating your systems, or even just a crucial portion of your server infrastructure, and testing that would definitely be acceptable to them.

 

Disaster Types and Scenarios Covered by This Book


Since this book is meant as a reference, and we discuss different scenarios here, an overview of these scenarios is necessary. The following types of disasters or incidents are covered in this book. Illustrations and flowcharts are provided to visualize the disasters more easily, wherever necessary.

Recovery of Deleted Objects

The most common scenario (more common than a single DC hardware failure) is the accidental deletion of objects, such as computer accounts, users, or Organizational Units (OUs), within the AD. This typically happens where no proper change management controls are in place, or where testing is not done properly. The restore can take some time, even if the backup tapes are immediately at hand, because the object relationships in AD are quite complex, and simply restoring the deleted objects will not work.

The real fun starts when you have a "safe" replication schedule due to various time zones and other reasons, such as office locations and line speeds. There are, and have been, scenarios where the deletion or modification of a critical service account, such as the Exchange service group, gets replicated in the course of 12 hours to all locations within the organization. The service that uses the account then stops working and, as it is probably a mission-critical service, the problem gets noticed, fixed, and force-replicated to the closest DC. If things proceed smoothly, all locations will have their service restored, one after another, up to the point where one of the last locations replicates forward in the chain to the first DC again before it has had the restored information applied. Then a vicious circle forms, as shown in the following diagram, giving way to some interesting possibilities. One possibility is that the service in different locations goes from working to non-working and back within a few hours, or that everything returns to step one while the account remains deleted. This underlines the need for proper restoration of lost objects, and for a proper process of forced replication.
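Why a restored object can disappear again can be illustrated with a toy model. The sketch below assumes that every copy of an object carries a version number and that, when two DCs replicate, the copy with the higher version wins. This is a deliberate simplification of how AD resolves conflicts, not its actual replication algorithm, and the class name, function names, ring topology, and version numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectState:
    version: int   # hypothetical per-object version stamp
    deleted: bool  # True once the object has been deleted


def replicate(source: ObjectState, target: ObjectState) -> ObjectState:
    """Return what the target holds after pulling from source (higher version wins)."""
    return source if source.version > target.version else target


def converge(states: list[ObjectState]) -> list[ObjectState]:
    """Replicate around a ring of DCs until no copy changes any more."""
    states = list(states)
    changed = True
    while changed:
        changed = False
        for i in range(len(states)):
            nxt = (i + 1) % len(states)  # the last DC replicates back to the first
            merged = replicate(states[i], states[nxt])
            if merged != states[nxt]:
                states[nxt] = merged
                changed = True
    return states


deletion = ObjectState(version=2, deleted=True)

# Plain restore on the first DC: the backup copy is older than the deletion.
plain_restore = [ObjectState(version=1, deleted=False)] + [deletion] * 3

# Restore whose version has been raised above the deletion before replicating.
raised_restore = [ObjectState(version=100, deleted=False)] + [deletion] * 3

if __name__ == "__main__":
    print([s.deleted for s in converge(plain_restore)])
    print([s.deleted for s in converge(raised_restore)])
```

In this model, a copy restored from backup carries an older version than the deletion and is overwritten as soon as replication reaches it, whereas a restored copy whose version is higher than the deletion spreads to every DC in the ring.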

Single DC Hardware Failure

This is another common scenario. You lose a DC due to a hardware or software failure. The reason for this can of course be failure of any of the hardware components caused by a faulty part, or an external event, such as water damage, a computer virus, or other reasons. At this stage, the DC is no longer operational and cannot be booted again.

If you have a small branch office with only one DC, this can be catastrophic, and bringing the lost DC back online is critical because no one at the location will be able to log in or use the directory service. Bringing a failed DC back is not very difficult, but there are steps that need to be taken to ensure that this does not affect the rest of your AD infrastructure. This incident might not be classified as extremely critical if you have two DCs at the site, but if some of these steps are not taken, and the DC has not been cleanly demoted, this can cause issues in the long term.

Some small offices also like to combine the file server, Exchange server, and DC onto one physical server so that more than just the authentication and the directory service is hosted on it. In the case of a file server, the recovery of the files is out of the scope of this book. However, if you run an Exchange server, and/or use the distributed file system service (DFS), or run services with domain accounts, such as Microsoft SQL, then the procedures outlined in this book will most definitely help you get your services back up and running.

Single DC AD Corruption

The single DC AD corruption is also quite common, especially in smaller companies where the DC has more than one role, such as also being a file, Exchange, and print server. AD corruption essentially means that the Directory service cannot be initiated because the directory database is corrupted, and that no user can log on to this DC with domain authentication or use any of the AD services, such as a global address book in Exchange. It is also possible (though not very common) that during a write process or replication process, one of the DCs fails or interrupts the data stream for some reason. It then replicates the changes with its nearest DC, which is usually its failover, located in the same server room. Both AD databases are then corrupted, and essentially all Directory services for that site fail.

Owing to the nature of AD, DNS, and the client authentication process (mentioned earlier in this chapter), the clients may still try to authenticate against the corrupted DCs but may not get a valid response, and may therefore have to rely on the cached login information on the client. The users will be allowed to log in, but will not be able to access any file shares or other services in the domain if the information on the servers has not been cached, or the cache has expired (in Windows Server 2003, Universal Group caching lasts for 8 hours).

Site AD Corruption

If your AD gets corrupted on one DC in one site, the corrupted data is likely to replicate itself to other DCs within the same site very quickly. This leaves your entire site with a corrupted AD that makes it impossible for any users or services to use domain authentication. Basically, this is the same as the Single DC AD corruption, except that steps are outlined to recover an entire site, and not just a single DC.

Corporate (Complete) AD Corruption

This scenario is very dramatic, but it can happen faster than you would have thought possible. A corruption can be anything from failed forest preps to schema modifications that were either incomplete or wrong. Another possibility is denial of service attacks, or exploits of vulnerabilities by a disgruntled employee (maybe an administrator within the organization), although this is quite remote. Consider a situation where one DC has a corrupt AD due to a human error, such as making changes to the AD schema at a remote location on a Saturday night, and the remote person does not recognize his or her mistake. The chances are high that this mistake is replicated out to the other DCs before anyone notices it.

Now, this becomes something of a race condition with the clients or systems continuously authenticating against the AD. The DCs will replicate the corrupt AD one by one, while the clients don't notice anything, because if one DC gives no answer, the client continues to query the next one in the list, and so on, until the last DC receives the replication of the bad database and goes offline. Then the alarm bells go off and the systems come to a grinding halt. In addition, if you have a very decentralized organization, a lot of time will be spent coordinating the restoration efforts as well.
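The masking effect described above can be made concrete with a small simulation. This is a sketch under simplified assumptions: the DC names are hypothetical, exactly one more DC becomes corrupt per step, and the client simply walks its list in a fixed order. Real DC selection is driven by DNS and site information.

```python
from typing import Optional


def first_healthy_dc(dc_order: list[str], corrupted: set[str]) -> Optional[str]:
    """Return the first DC in the client's list that still answers, or None."""
    for dc in dc_order:
        if dc not in corrupted:
            return dc
    return None


def simulate_spread(dc_order: list[str]) -> list[Optional[str]]:
    """Corrupt one more DC per step and record which DC the client ends up using."""
    corrupted: set[str] = set()
    answers: list[Optional[str]] = []
    for dc in dc_order:
        corrupted.add(dc)  # the bad database has now reached this DC
        answers.append(first_healthy_dc(dc_order, corrupted))
    return answers


if __name__ == "__main__":
    steps = simulate_spread(["DC1", "DC2", "DC3", "DC4"])
    for step, dc in enumerate(steps, start=1):
        print(f"step {step}: client authenticates against {dc or 'nothing, no DC answers'}")
```

From the client's point of view every step but the last one still succeeds, which is why the failure tends to be noticed only when the final DC has gone offline.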

Of course, there are steps to initiate and recover from this as well, but response time is very important in this situation, and effective and correct processes and steps are also necessary.

Complete Site Hardware Failure

This scenario, describing an AD site and not necessarily a single physical site, is already quite drastic as it describes a total loss of AD service due to a complete hardware failure at a specific site. A site is a branch in your organization that is connected to your domain forest via a LAN or a WAN connection. This could also mean that a site includes two or more buildings, possibly distributed across an entire city. This scenario assumes that you have at least one other DC in your organization at another location that is unaffected. This scenario can be caused by anything that affects the whole server room, and is most likely to be physical. Fire and water, as well as storms or explosions, are very high on the probability list.

In this scenario, it is most likely that you have other servers that are also affected. This scenario will address the issue of how to get a complete site back up and running as quickly as possible. This is a critical scenario that needs to be fixed as soon as possible. You can, of course, re-route your users to another site for authentication if your WAN link comes back up quickly, but if the links are not very fast, this can cause extreme slowness and precipitate incidents such as timeouts and "domain controller not found" messages on the clients.

This is even worse if you have mission-critical systems authenticating against the AD as illustrated in the following diagram:

Corporate (Complete) Hardware Failure

If your corporation or organization has its entire AD infrastructure in one location (which is not recommended, but neither is it unheard of in small organizations), and a disaster such as fire, water, or any other destructive incident happens, you need to rebuild everything. Backups are valuable but will not do the work for you. The most crucial task, at that point, is to get the working system back so that users can start their work. Damage control is not part of your job, but bringing back the company's domain infrastructure is. This means that your first priority is to get the DCs back online, and restore the applications that rely on them. Don't waste valuable time trying to get the print server to work when your clients and applications cannot authenticate. You also need to be aware that just re-installing the DCs from scratch will not work, as you have hundreds or even thousands of systems bound to your AD infrastructure. Some services depend on this structure very heavily, and re-configuring all the clients and services is definitely not an option once your organization grows to critical size.

Your client machines at this point have no way of getting any information out of the AD, and the only reason why most of them are still operating is because of cached logins. You might even have a Group Policy preventing cached logins in which case you will have quite a few users who cannot get anything done, and a Management team that is calculating the loss of revenue per hour.

 

Summary


This chapter provided you with an overview of AD DR and how a DR guide fits into your organization's Business Continuity plan. It also provided you with a brief overview of the scenarios that will be addressed by this book. As you will have noticed by now, the subject of DR with an AD is much deeper and more crucial than you might think.

Microsoft's emphasis on the decentralized architecture of AD is justified, and that architecture really does work for small incidents, connection errors, and so on. The picture is very different, however, when we look at how complex and devastating AD failure can be in some situations, and how crucial it is to have a proper recovery guide in place.

It is important to understand that the risks and threats discussed here exist in very real forms. In today's multi-platform environments and heterogeneous networks, there may be services and systems that authenticate against AD which probably didn't figure in your initial designs, but had to be added to the actual schema.

All these things make your infrastructure all the more mission-critical. To use an analogy: would you rather have insurance for your house that covers all your valuable items but not the house, or one that covers the house as well?

About the Author

  • Florian Rommel

    Florian was born and raised in his native Germany until the age of 15, when he moved with his family to Central America and then the US. He has worked in the IT industry for more than 14 years and has gained a wealth of experience in many different IT environments. He also has a personal interest in Information Security.

    His certifications include CISSP, SANS GIAC, MCSE, MCSA, MCDBA, and several others. Together with his extensive experience, he is a qualified expert in the area of Information Security. After writing several Disaster Recovery guides for Active Directory environments he now brings you this unique publication, which he hopes will become a key title in the collection of many Windows Server Administrators and specialists.

    Florian lives with his wife and daughter in Finland where he also works as an IT Manager in a global company.

     
