Microsoft Exchange Server 2013 High Availability

By Nuno Mota

About this book

Microsoft Exchange 2013 is the most widely used messaging platform in the world. Learning how to deploy it in a highly available manner is as fascinating and challenging as it is crucial for every organization.

This practical, hands-on guide will provide you with a number of clear scenarios and examples that explain the mechanics behind Exchange Server 2013 high availability and how maximum availability and resilience can be achieved through it. For most organizations around the world, e-mail is their top mission-critical service. Throughout nearly 20 years of Exchange development, Microsoft has been improving the Exchange platform, making it more user-friendly and reliable with each release. From Windows clusters, to Cluster Continuous Replication, to Database Availability Groups, the progress of Exchange in terms of availability and resilience is extraordinary.

Throughout this book, you will go through all the roles, components, and features that should be considered when addressing high availability. You will go through how to achieve high availability for the Client Access and Mailbox server roles, what’s new in load balancing, site resilience, the new public folders, and much more.

You will learn to successfully design, configure, and maintain a highly available Exchange 2013 environment by going through different examples and real-world scenarios, saving you and your company time and money, and eliminating errors.

Publication date: February 2014
Publisher: Packt
Pages: 266
ISBN: 9781782171508

 

Chapter 1. Getting Started

For most organizations around the world, e-mail is their top mission-critical service. Throughout nearly 20 years of Exchange development, Microsoft has been improving the Exchange platform, making it more user-friendly and reliable with each release. From Windows clusters to Cluster Continuous Replication to Database Availability Groups and much more, the progress of Exchange in terms of availability and resilience is extraordinary.

In this chapter, we will look at the definitions of availability and resilience, as well as the new architecture of Exchange 2013.

 

Defining high availability and resilience


Before we delve into how we will make Exchange 2013 a highly available solution, it is important to understand the differences between a highly available solution and a resilient solution.

Availability

According to the Oxford English Dictionary, available means able to be used or obtained; at someone's disposal. From an Exchange perspective, we can interpret availability as the proportion of time that Exchange is accessible to users during normal operations and during planned maintenance or unplanned outages. In simple terms, we are trying to provide service availability, that is, keep the messaging service running and available to users. Remember that uptime and availability are not synonymous; Exchange can be up and running but not available to users, as in the case of a network outage.

The availability of any IT system is often measured as a percentage; more specifically, by the number of nines in the figure (the "class of nines"). The higher the percentage, the higher the availability of the system. As an example, when the business states that the organization's target is 99.9 percent Exchange availability, this is referred to as three nines, or class three. And 99.9 percent sounds excellent, right? Actually, that depends on the organization itself and its requirements and goals. Looking at the following table, we can see that 99.9 percent availability means that Exchange is actually down for almost 9 hours a year, or 10 minutes every week on average. While this might seem acceptable, imagine if that downtime happened every week during peak utilization hours. The following table gives an overview of the approximate downtime for different levels of availability, starting from 90 percent:

Availability (%)          | Downtime per year (365d) | Per month (30d) | Per week
--------------------------|--------------------------|-----------------|---------------
90 percent (1 nine)       | 36.50 days               | 3.00 days       | 16.80 hours
95 percent                | 18.25 days               | 36.00 hours     | 8.40 hours
99 percent (2 nines)      | 3.65 days                | 7.20 hours      | 1.68 hours
99.5 percent              | 1.82 days                | 3.60 hours      | 50.40 minutes
99.9 percent (3 nines)    | 8.76 hours               | 43.20 minutes   | 10.08 minutes
99.95 percent             | 4.38 hours               | 21.60 minutes   | 5.04 minutes
99.99 percent (4 nines)   | 52.56 minutes            | 4.32 minutes    | 1.01 minutes
99.999 percent (5 nines)  | 5.26 minutes             | 25.92 seconds   | 6.05 seconds
99.9999 percent (6 nines) | 31.54 seconds            | 2.59 seconds    | 0.60 seconds
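The downtime figures in the table follow directly from the availability percentage. The conversion can be sketched in a few lines of Python (the period lengths match the table's 365-day year, 30-day month, and 7-day week):

```python
def downtime(availability_pct, period_hours):
    """Hours of allowed downtime for a given availability over a period."""
    return (1 - availability_pct / 100.0) * period_hours

# Hours in each reference period used by the table
YEAR, MONTH, WEEK = 365 * 24, 30 * 24, 7 * 24

for pct in (90, 95, 99, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999):
    print(f"{pct:>8}% -> {downtime(pct, YEAR):9.2f} h/year, "
          f"{downtime(pct, MONTH) * 60:9.2f} min/month, "
          f"{downtime(pct, WEEK) * 60:9.2f} min/week")
```

For example, three nines (99.9 percent) over a 365-day year gives 0.001 x 8,760 hours, the 8.76 hours shown in the table.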

While a typical user would probably be content with an availability of 99.9 percent, users in a financial institution may expect, or even demand, better than 99.99 percent. High levels of availability do not happen naturally or by chance; they are the result of excellent planning, design, and maintenance.

The ideal environment for any Exchange administrator is obviously one that is capable of achieving the highest level of availability possible. However, the higher the level of availability one tries to achieve, the higher the cost and complexity of the requirements that guarantee those extra few minutes or hours of availability.

Furthermore, how does one measure the availability of an Exchange environment? Is it by counting the minutes for which users were unable to access their mailboxes? What if only a subset of the user population was affected? Unfortunately, how availability is measured changes from organization to organization, and sometimes even from administrator to administrator, depending on their interpretation. An Exchange environment that has been up for an entire year might have been unavailable to users because of a network failure that lasted 8 hours. Users, and possibly the business, will consider Exchange unavailable, while its administrators may still claim 100 percent availability. If we take the true definition of availability, Exchange was only approximately 99.9 percent available. But is this fair to the Exchange administrator? After all, Exchange was unavailable not because of an issue with Exchange itself, but because of the network.
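The 8-hour outage scenario can be checked with simple arithmetic. This sketch (the figures are from the example above, not real measurements) computes availability as users experience it, regardless of the cause of the outage:

```python
def measured_availability(total_hours, outage_hours):
    """Availability as the percentage of time the service was usable to
    users, counting any outage they experienced regardless of its cause."""
    return 100.0 * (total_hours - outage_hours) / total_hours

# A full year of server uptime, but one 8-hour network outage:
print(f"{measured_availability(365 * 24, 8):.2f}%")  # ~99.91%
```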

The use of "nines" has been questioned a few times, since it does not appropriately reflect the impact of unavailability according to when it occurs. If, in an entire year, Exchange was only unavailable for 50 minutes, on Christmas Day at 3 a.m. when no one tried to access it, should its availability be quantified as 99.99 percent or 100 percent?

The definition of availability must be properly established and agreed upon. It also needs to be accurately measured, ideally with powerful monitoring tools, such as Microsoft System Center Operations Manager, which are themselves highly available. Only when everyone agrees on a shared interpretation and defines how to accurately measure availability will it actually be useful to the business.

The level of availability that the business expects from Exchange will not be simply expressed as, for example, 99.9 percent. It will be part of a Service Level Agreement (SLA), which is one of the few ways of ensuring that Exchange meets the business objectives. SLAs differ for every organization, and there is not an established process on how to define one for Exchange. Typically, Exchange SLAs contain five categories:

  • Performance: An SLA of this category pertains to the delivery and speed of e-mails. An example would be: 90 percent of all e-mails are to be delivered within 10 minutes. If desired, the SLA might also define targets for the remaining 10 percent.

  • Availability: An SLA of this category establishes the level of availability of Exchange to the end users using the "class of nines" that we discussed previously.

  • Disaster Recovery: An SLA of this category defines how long it should take to recover data or restore a service when a disaster occurs. These SLAs typically focus on the service recovery time as well as on more specific targets such as a single server or a mailbox. To help establish these SLAs, two other elements of business continuity are used:

    • Recovery Time Objective (RTO): This element establishes the duration of time in which Exchange must be restored after a disaster. For example, Exchange must be made available within 4 hours in the secondary datacenter if a major incident happens in the primary datacenter.

    • Recovery Point Objective (RPO): This element establishes the maximum tolerable period in which data might be lost from Exchange due to a major incident. For example, in case of a major incident, no more than 1 hour of data can be lost. In environments where a secondary datacenter is used for disaster recovery, the RPO can be defined as the amount of time taken for the data to be replicated to the secondary datacenter. If a disaster occurs during this time, any data written during that time frame could be lost if the primary datacenter is unrecoverable.

  • Security: An SLA of this category generally includes assurances regarding malware-detection rate, encryption performance, data at rest and in transit, e-mail-scanning time, and physical security of servers and the datacenter(s) where these are located.

  • Management: An SLA of this category helps ensure that the messaging solution meets both user and maintenance requirements.
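As a rough illustration of how a performance SLA of this kind might be verified, here is a minimal sketch. The delivery times, function, and thresholds are hypothetical illustrations, not part of any Exchange tooling:

```python
# Hypothetical delivery times (in minutes) sampled from message tracking logs
delivery_minutes = [0.5, 1.2, 2.0, 3.5, 4.1, 6.0, 7.5, 8.0, 9.9, 45.0]

def performance_sla_met(latencies, threshold_min=10.0, target_ratio=0.90):
    """True when at least target_ratio of messages were delivered within
    threshold_min minutes (e.g. "90 percent within 10 minutes")."""
    within = sum(1 for m in latencies if m <= threshold_min)
    return within / len(latencies) >= target_ratio

print(performance_sla_met(delivery_minutes))  # 9 of 10 within 10 min -> True
```

The same shape of check applies to availability and disaster recovery SLAs: define the target numerically, measure, and compare.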

Putting an SLA document into practice requires administrators to be suitably skilled and to have the infrastructure and tools necessary to achieve the SLA. After SLAs have been planned, developed, and deployed, they must be reviewed periodically to ensure they are being met and are achieving the desired results. It is extremely important to ensure SLAs remain cost-effective and realistic.

Resilience

According to the Oxford English Dictionary, the adjective resilient means able to withstand or recover quickly from difficult conditions.

Resilience (or resiliency, as it is sometimes called) is the ability to provide a satisfactory level of service when faced with faults and challenges during normal operation. More specifically, it is the ability of a server, network, or an entire datacenter to recover quickly and continue operating normally during a disruption.

Resilience is usually achieved by installing additional equipment (redundancy) together with careful design to eliminate single points of failure (deploying multiple Hub Transport servers, for example) and well-planned maintenance. Although adding redundant equipment might be straightforward, it can be expensive and, as such, should be done only after considering its costs versus its benefits.

A typical example: when a server's power supply fails, the server also fails, and its services become unavailable until they are restored on another suitable server or the server itself is repaired. If this same server had a redundant power supply, however, the server would keep running while the failed power supply was being replaced.

A resilient network infrastructure, for example, is expected to continue operating at or above the minimum service levels, even during localized failures, disruptions, or attacks. Continuing operation, in this example, refers to the service provided by the communications infrastructure. If the routing infrastructure is capable of maintaining its core purpose of routing packets despite local failures or attacks, it is said to be robust or resilient.

The same concept holds true from the server level all the way up to the datacenter facilities. Datacenter resilience is typically guaranteed by using redundant components and/or facilities. When an element experiences a disruption (or fails), its redundant counterpart seamlessly takes over and continues to provide services to the users. For example, datacenters are usually powered by two independent utility feeds from different providers. This way, a backup feed is available in case the other fails. If one is to design a resilient Exchange environment that extends across multiple datacenters, no detail should be overlooked.
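The benefit of redundancy can be quantified. Assuming the components fail independently and any one of them is sufficient to keep the service running, the combined availability is one minus the product of the individual failure probabilities. A minimal sketch:

```python
def parallel_availability(*availabilities):
    """Combined availability of redundant components where any single one
    suffices: A = 1 - product of the individual failure probabilities.
    Assumes independent failures."""
    failure = 1.0
    for a in availabilities:
        failure *= (1.0 - a)
    return 1.0 - failure

# Two independent 99%-available utility feeds together behave like ~99.99%:
print(parallel_availability(0.99, 0.99))
```

This is why a second utility feed, power supply, or server copy buys disproportionately more availability than improving a single component.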

Let us briefly throw another term into the mix: reliability, the probability that a component or system will perform for an anticipated period of time without failing. Reliability in itself does not account for any repairs that may take place, only for the time it takes the component (or system) to fail while in operation.

Reliability is also an important notion because maintaining a high level of availability with unreliable equipment is unrealistic, as it would require too much effort and a large stock of spares and redundant equipment. A resilient design takes into consideration the reliability of equipment in a redundant topology.

Taking storage as an example, the advertised reliability of the disks used in a storage array might influence the decision between RAID 1, RAID 5, RAID 6, or even RAID 10.

Note

Designing and implementing a highly available and resilient Exchange 2013 environment is the sole purpose of this book. Although the main focus will be on the Exchange application layer, technologies such as Active Directory, DNS, and virtualization are also covered in some detail in the final chapter.

 

Introducing the new Exchange architecture


Before we delve into how to design and deploy a highly available Exchange 2013 infrastructure, we need to look at the architecture changes and improvements made over the previous editions of Exchange.

Note

This is not an exhaustive list of all the improvements introduced in Exchange 2013. Only the main changes are mentioned, together with those relevant to high availability.

Looking at the past

In the past, Exchange has often been architected and optimized around a few technological constraints. An example is the key constraint that led to the creation of different server roles in Exchange 2007: CPU performance. A downside of this approach is that the server roles in Exchange 2007/2010 are tightly coupled. They introduce version dependency, geo-affinity (requiring several roles to be present in a specific Active Directory site), session affinity (often requiring a complex and expensive load-balancing solution), and namespace complexity.

Today, memory and CPU are no longer the constraining factors, as they are far less expensive. As such, the primary design goals for Exchange 2013 became failure isolation, improved hardware utilization, and simplicity of scale.

Let us start by having a quick look at how things have evolved since Exchange 2000.

Exchange 2000/2003

In Exchange 2000 and 2003, server tasks are distributed among frontend and backend servers, with frontend servers accepting client requests and proxying them for processing by the appropriate backend server. This includes proxying RPC-over-HTTP (known as Outlook Anywhere), HTTP/S (Outlook Web App (OWA)), POP, and IMAP clients. However, internal Outlook clients do not use the frontend servers as they connect directly to the backend servers using MAPI-over-RPC.

An advantage of this architecture over Exchange 5.5 is that it allows the use of a single namespace such as mail.domain.com. This way, users do not need to know the name of the server hosting their mailbox. Another advantage is the offloading of SSL encryption/decryption to the frontend servers, freeing up the backend servers from this processor-intensive task.

While the frontend server is a specially configured Exchange server, there is no special configuration option to designate a server as a backend server.

High availability was achieved on the frontend servers by using Network Load Balancing (NLB) and by configuring backend servers in an active/active or active/passive cluster.

Exchange 2007

While in Exchange 2003 the setup process installs all features regardless of which ones the administrators intend to use, in Exchange 2007, Microsoft dramatically changed the server roles architecture by splitting the Exchange functionality into five different server roles:

  • Mailbox server (MBX): This role is responsible for hosting mailbox and public folder data. This role also provides MAPI access for Outlook clients.

  • Client Access Server (CAS): This role is responsible for optimizing the performance of mailbox servers by hosting client protocols such as POP, IMAP, ActiveSync, HTTP/S, and Outlook Anywhere. It also provides the following services: Availability, Autodiscover, and web services. Compared to Exchange 2000 or 2003 frontend servers, this role is not just a proxy server. For example, it processes ActiveSync policies and does OWA segmentation, and it also renders the OWA User Interface. All client connections with the exception of Outlook (MAPI) use the CAS as the connection endpoint, offloading a significant amount of processing that occurs against backend servers in Exchange 2000 or 2003.

  • Unified Messaging Server: This role is responsible for connecting a Private Branch eXchange (PBX) telephony system to Exchange.

  • Hub Transport Server (HTS): This role is responsible for routing e-mails within the Exchange organization.

  • Edge Transport Server: This role is typically placed in the perimeter of an organization's network topology (DMZ) and is responsible for routing e-mails in and out of the Exchange organization.

Each of these server roles logically groups specific features and functions, allowing administrators to choose which ones to install on an Exchange server so that they can configure a server the way they intend to use it. This offers other advantages, such as a reduced attack surface, simpler installation, full customization of servers to support the business needs, and the ability to designate hardware according to the role, since each role has different hardware requirements.

This separation of roles also means that high availability and resilience are achieved using different methods depending on the role: by load-balancing CASs (using either NLB or hardware/software load-balancing solutions); by deploying multiple Unified Messaging and Hub Transport servers per Active Directory site; and, at the Mailbox server level, by using Local Continuous Replication, Standby Continuous Replication, or the cluster technologies of Cluster Continuous Replication and Single Copy Cluster.

Exchange 2010

In terms of server roles, Exchange 2010 has the same architecture, but under the hood, it takes a step further by moving Outlook connections to the CAS role as well. This means there will be no more direct connections to Mailbox servers. This way, all data access occurs over a common and single path, bringing several advantages, such as the following:

  • Improved consistency

  • Better user experience during switchover and failover scenarios, as Outlook clients are connected to a CAS and not to the Mailbox server hosting their mailbox

  • Support for more mailboxes per mailbox server

  • Support for more concurrent connections

The downside is that this change greatly increases the complexity involved in load-balancing CASs as these devices now need to load-balance RPC traffic as well.

To enable a quicker reconnect time to a different CAS when the server that a client is connected to fails, Microsoft introduced the Client Access array feature, which is typically an array of all CASs in the Active Directory (AD) site where the array is created. Instead of users connecting to the Fully Qualified Domain Name (FQDN) of a particular CAS, Outlook clients connect to the FQDN of the CAS array itself, which typically has a generic name such as outlook.domain.com.

Exchange 2010 includes many other changes to its core architecture. New features including shadow redundancy and the transport dumpster provide increased availability and resilience, but the biggest change of all is the introduction of the Database Availability Group (DAG), the base component of high availability and site resilience in Exchange 2010. A DAG is a group of up to 16 Mailbox servers that hosts databases and provides automatic database-level recovery by replicating database data between the members of the DAG. As each server in a DAG can host a copy of any database from any other server in the same DAG, each mailbox database is now a unique global object in the Exchange organization, whereas in Exchange 2007, for example, databases were only unique to the server hosting them. DAGs provide automatic recovery from failures ranging from a single database to an entire server.

The following diagram provides an overview of the evolution from Exchange 2003 to Exchange 2010:

Exchange 2013

While looking back at ways of improving Exchange, Microsoft decided to address three main drawbacks of Exchange 2010:

  • Functionality scattered across all different server roles, forcing HTS and CASs to be deployed in every Active Directory site where Mailbox servers are deployed.

  • Versioning between different roles, meaning a lower-version HTS or CAS should not communicate with a higher-version Mailbox server. Versioning restrictions also mean that administrators cannot simply upgrade a single server role (such as a Mailbox server) without upgrading the others first.

  • Geographical affinity, where a set of users served by a given Mailbox server is always served by a given set of CAS and HTS servers.

Enter Exchange 2013, which once more introduces major changes, and we seem to be back in the Exchange 2000/2003 era, only in a far improved way, as one would expect. In order to address the issues just mentioned, Microsoft introduced impressive architectural changes and investments across the entire product. Hardware was no longer seen as a limiting boundary from a memory and disk perspective, while CPU power keeps increasing, which a separate-role architecture does not take full advantage of. This reopened the potential for the consolidation of server roles.

The array of Client Access servers and Database Availability Group are now the only two basic building blocks, each providing high availability and fault tolerance, but now decoupled from one another. When compared with a typical Exchange 2010 design, the differences are clear:

To accomplish this, the number of server roles has been reduced to three, providing greater simplicity of scale, isolation of failures, and improved hardware utilization:

  • The Mailbox server role hosts mailbox databases and handles all activity for a given mailbox. In Exchange 2013, it also includes the Transport service (virtually identical to the previous Hub Transport Server role), Client Access protocols, and the Unified Messaging role.

  • The Client Access Server role includes the new Front End Transport service and provides authentication, proxy, and redirection services without performing any data rendering. This role is now a thin and stateless server which never queues or stores any data. It provides the usual client-access protocols: HTTP (Outlook Anywhere, Web Services, and ActiveSync), IMAP, POP, and SMTP. MAPI is no longer provided as all MAPI connections are now encapsulated using RPC-over-HTTPS.

  • The Edge server role, added only in Service Pack 1, brings no additional features when compared to a 2010 Edge server. However, it is now only configurable via PowerShell in order to minimize its attack surface, as adding a user interface would require Internet Information Services, virtual directories, opening more ports, and so on.

In this new architecture, the CAS and Mailbox servers are not as dependent on one another (role affinity) as in previous versions of Exchange. This is because all mailbox processing occurs only on the Mailbox server hosting the mailbox. As data rendering is performed locally on the active database copy, administrators no longer need to be concerned about version incompatibility between CAS and Mailbox servers (versioning). This also means that a CAS can be upgraded independently of Mailbox servers, and in any order.

Encapsulating MAPI connections in RPC-over-HTTPS and changing the CAS role to perform pure connection proxying means that advanced layer 7 load balancing is no longer required. As connections are now stateless, CASs simply accept connections and forward them to the appropriate Mailbox server. Therefore, session affinity is not required at the load balancer; only layer 4 TCP load balancing with source IP is required. This means that if a CAS fails, the user's session can simply be transferred to another CAS, because there is no session affinity to maintain.
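The idea behind layer 4, source IP-based distribution can be sketched as follows. The server names and hashing scheme here are illustrative assumptions, not how any particular load balancer is implemented:

```python
import hashlib

cas_pool = ["cas01", "cas02", "cas03", "cas04"]  # hypothetical CAS names

def pick_cas(source_ip, pool):
    """Layer 4-style source-IP distribution: hash the client's IP and map
    it onto the pool. No layer 7 session state needs to be maintained."""
    digest = hashlib.md5(source_ip.encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]

ip = "10.0.0.42"
assert pick_cas(ip, cas_pool) == pick_cas(ip, cas_pool)  # deterministic
# If one CAS fails, shrink the pool and clients are simply re-hashed:
print(pick_cas(ip, [s for s in cas_pool if s != "cas02"]))
```

Because any CAS can proxy any connection, losing a member of the pool costs nothing more than a reconnect to a different, equally capable server.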

Note

All these improvements bring another great advantage: users no longer need to be serviced by CASs located within the same Active Directory site as that of the Mailbox servers hosting their mailboxes.

Other changes introduced in Exchange 2013 include the relegation of RPC: all Outlook connections are now established using RPC-over-HTTP (Outlook Anywhere). With this change, CASs no longer need the RPC Client Access service, reducing the number of namespaces normally required for a site-resilient solution. As we will see in Chapter 4, Achieving Site Resilience, a site-resilient Exchange infrastructure is deployed across two or more datacenters so that it is able to withstand the failure of a datacenter and still continue to provide messaging services to users.

Outlook clients no longer connect to the FQDN of a CAS or CAS array. Instead, Outlook uses Autodiscover to create a new connection point composed of the mailbox GUID, the @ symbol, and the domain portion of the user's primary SMTP address (for example, <mailboxGUID>@domain.com). Because the GUID does not change no matter where the mailbox is replicated, restored, or switched over to, there are no client notifications or changes. The GUID abstracts the backend database name and location from the client, nearly eliminating the message Your administrator has made a change to your mailbox. Please restart Outlook.
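The construction of this connection point can be sketched as follows. The GUID and address below are hypothetical, and this is only an illustration of the naming scheme, not Exchange code:

```python
import uuid

def outlook_connection_point(mailbox_guid, primary_smtp_address):
    """Build an Exchange 2013-style Outlook endpoint from the mailbox GUID
    and the domain part of the user's primary SMTP address. The GUID stays
    stable across moves and failovers, so the client never needs to learn
    the backend database name or server."""
    domain = primary_smtp_address.split("@")[1]
    return f"{mailbox_guid}@{domain}"

guid = uuid.UUID("d1f43f56-0a1b-4c2d-8e3f-5a6b7c8d9e0f")  # hypothetical
print(outlook_connection_point(guid, "user@domain.com"))
```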

The following diagram shows the client protocol architecture of Exchange 2013:

As shown in the preceding diagram, Exchange Unified Messaging (UM) works slightly differently from the other protocols. First, a Session Initiation Protocol (SIP) request is sent to the UM call router residing on a CAS, which answers the request and sends an SIP redirection to the caller, who then connects to the mailbox via SIP and Real-time Transport Protocol (RTP) directly. This is due to the fact that the real-time traffic in the RTP media stream is not suitable for proxying.

There are also other great improvements made to Exchange 2013 at the Mailbox server level. The following are some of them:

  • Improved archiving and compliance capabilities

  • Much better user experience across multiple devices provided by OWA

  • New modern public folders that take advantage of the DAG replication model

  • 50 percent to 70 percent reduction in IOPS compared to Exchange 2010, and around 99 percent compared to Exchange 2003

  • A new search infrastructure called Search Foundation based on the FAST search engine

  • The Managed Availability feature and changes made to the Transport pipeline

In terms of high availability, only minor changes have been made to the mailbox component from Exchange 2010 as DAGs are still the technology used. Nonetheless, there have been some big improvements:

  • The Exchange Store has been fully rewritten

  • There is a separate process for each database that is running, which allows for isolation of storage issues down to a single database

  • Due to code enhancements around transaction logs and a deeper checkpoint on passive databases, failover times have been reduced

Another great improvement is in terms of site resilience. Exchange 2010 requires multiple namespaces in order for an Exchange environment to be resilient across different sites (such as two datacenters): Internet protocol namespaces, OWA fallback, Autodiscover, RPC Client Access, SMTP, and a legacy namespace while upgrading from Exchange 2003 or Exchange 2007. It does allow the configuration of a single namespace, but this requires a global load balancer and additional configuration at the Exchange level. With Exchange 2013, the minimum number of namespaces has been reduced to just two: one for client protocols and one for Autodiscover. While coexisting with Exchange 2007, the legacy hostname is still required, but while coexisting with Exchange 2010, it is not.

This is explored in more depth in Chapter 4, Achieving Site Resilience.

 

Summary


In this first chapter, we had a look at the differences between availability and resilience, as well as an overview of the new architecture of Exchange 2013 and all the improvements it brings over the past editions of Exchange.

Throughout the following chapters, we will explore all these new availability and resilience features in depth. We will see how to take full advantage of them through the design and configuration of a highly available Exchange 2013 infrastructure, starting with the CAS role in the next chapter.

About the Author

  • Nuno Mota

    Nuno Mota is a Senior Microsoft Messaging Consultant currently working for a large sovereign wealth fund. He has been responsible for designing and deploying Exchange and Office 365 solutions for organizations across the UK. He also shares a passion for Skype for Business, Active Directory, and PowerShell.

    Besides writing for his personal Exchange blog, LetsExchange, he is also an author for the MSExchange website, with dozens of published articles and product reviews, as well as multiple scripts on TechNet.

    He has also been awarded Microsoft Most Valuable Professional (MVP) for Exchange five times since 2012.
