PostgreSQL 12 High Availability Cookbook - Third Edition

By Shaun Thomas
About this book

Databases are nothing without the data they store. In the event of an outage or technical catastrophe, immediate recovery is essential. This updated edition ensures that you will learn the important concepts related to node architecture design, as well as techniques such as using repmgr for failover automation. From cluster layout and hardware selection to software stacks and horizontal scalability, this PostgreSQL cookbook will help you build a PostgreSQL cluster that will survive crashes, resist data corruption, and grow smoothly with customer demand.

You’ll start by understanding how to plan a PostgreSQL database architecture that is resistant to outages and scalable, as it is the scaffolding on which everything else rests. With the bedrock established, you'll cover the topics that PostgreSQL database administrators need to know to manage a highly available cluster. This includes configuration, troubleshooting, monitoring and alerting, backups through proxies, failover automation, and other considerations that are essential for a healthy PostgreSQL cluster. You'll then learn to use multi-master replication to maximize server availability, and later chapters will guide you through managing major version upgrades without downtime.

By the end of this book, you’ll have learned how to build an efficient and adaptive PostgreSQL 12 database cluster.

Publication date:
February 2020


Architectural Considerations

In many ways, database server architecture is treated as a mere afterthought. It's often much easier to simply create a single node, install some software, and consider the whole affair resolved. If a company is particularly paranoid, they may even spare some thought for a replica server, or perhaps some kind of backup.

The true importance of database cluster architecture is easily overlooked as a result. But what is server architecture? Why does it matter?

Look down the street. Any street is fine. What do you see? Homes, offices, and buildings of various descriptions. With very rare exceptions, each one of these was meticulously planned, from the foundation and walls to the electrical wiring and pipes, up to the roof and drainage systems. A failure in any of these components could, given enough time, lead to the demise of the entire structure.

The same also applies to a PostgreSQL cluster! Database architecture defines what goes into a database server cluster, and the reason for each element. How does it communicate? How many nodes are required? Where do we put those nodes, and why? What common problems are inherent in those decisions? How will our decisions influence the underlying cost? What trade-offs can we make, given some important constraints? How does all of this affect data availability? We need those answers before we even consider hardware or virtualization. There are many important considerations we must entertain when designing a highly available PostgreSQL cluster.

Why, then, is the critical application and user data that drives the entire application stack, and the company behind it, so often treated so callously? We direct so much attention and focus to the application, with its various layers of indirection, queues, caches, container automation, and microarchitecture, that the data layer is overlooked or considered a nuisance.

This is actually quite understandable. In most cases, a PostgreSQL database layer demands an entirely different approach, one that developers, system administrators, and other information technology professionals may not be familiar with managing. Even experienced database administrators may not be familiar with the scale and theoretical concepts that underpin database high availability.

While we can't reduce the subtle art of database server architecture to a few memorable quips sure to entertain at parties, we can make the subject far more approachable. It shouldn't be necessary to have a Ph.D. in abstract theoretical frameworks to prevent a costly database outage.

In this chapter, we will learn how the layout of the nodes in our PostgreSQL cluster can drastically influence its availability. We will cover the following recipes:

  • Setting expectations with RPO
  • Defining timetables through RTO
  • Picking redundant copies
  • Selecting locations
  • Having enough backups
  • Considering quorum
  • Introducing indirection
  • Preventing split brain
  • Incorporating multi-master
  • Leveraging multi-master

Setting expectations with RPO

RPO is a common business continuity term that stands for Recovery Point Objective. In the context of a database system, it describes the amount of data that may be lost between an unexpected outage and the moment the system is once again operational. It's important to understand this at an early stage because it will drive decisions such as node count, data synchronization methods, and backup technologies.

In this recipe, we will examine the ingredients for concocting a comprehensive RPO that will influence the PostgreSQL cluster composition itself.

Getting ready

The first thing we need to do is set expectations. These are most often defined by upper management or some other decision-making entity. Data loss is never desirable but is unavoidable in catastrophic scenarios. How much data loss can the business tolerate under these circumstances? Seconds, minutes, or hours' worth?

This recipe will mainly focus on information gathering from key individuals, so make sure it's possible to at least email anyone involved with the application stack. Hardware purchases depend on budget proposals, so it may even be necessary to interact with VP and C-level executives as well. Even if we don't do this right away, try to determine the extent of influence available to you.

How to do it...

Since we're dealing with many vectors, we should iterate them if possible. Try to follow a process like this:

  1. Seek the input of major decision makers:
  • VP and C-level executives involved with technology
  • Product manager
  • Application designers and architects
  • Infrastructure team lead
  2. Find an amount of time that will satisfy most or all of the above.
  3. Follow the rest of the advice in this chapter to find a suitable architecture.
  4. Try to determine a rough cost for this and the closest alternative.
  5. Present one or more designs and cost estimates to decision makers.
  6. Document the final RPO decision and architecture as reference material.

How it works...

Decision makers such as the technology VP, CEO, CTO, and such are the final word in most cases. Their input is vital and should be considered a requirement before ever taking a step further. Keep in mind that these people are likely not familiar with the technical feasibility of their demands at this extreme implementation level. When asked a question such as "How much data can we lose in a major outage?", they're probably going to say "None!" Regardless, this is a vital first step for reasons that will shortly become apparent.

Then, we simply traverse the stack of people who helped define the features the application stack fulfills, those who designed and implemented it, and whoever may be in charge of the requisite hardware and network where everything runs. Perhaps the design has a built-in tolerance for certain amounts of loss. Perhaps inherent queues or caches act as a sort of buffer for data backend difficulties. Maybe the design assumes there are multiple data systems all ingesting the same stream for redundancy. The architecture and those who built it are the best sources of this information.

Once we know the maximum amount of data the backend can lose before being restored, we must apply what we learn from the rest of this chapter and choose one or two best-case designs that can deliver that promise. The point here is that we will be executing this recipe several times until everyone agrees to all inherent design costs and limitations before continuing.

The best way to estimate cost is to take the chosen database server architectures and iterate a gross cost for each element. The next chapter on Hardware Planning describes in detail how to do this. We don't have to be exact here; the goal is to have some numbers we can present to decision makers. Do they still want zero RPO if it costs 10x as much as ten seconds of data loss? Are they willing to compromise on a hybrid design?

Once we have chosen a final structure, possibly the most important step is to produce a document describing that architecture, why it was chosen, the known limitations, and the RPO it delivers. Present this document to decision makers and encourage them to sign it if possible. Save it in any corporate documentation management system available, and make sure it's one of the first things people see regarding the database cluster layer. This document will single-handedly answer multiple questions about the capabilities of the database cluster, all while acting as a reference specification.

There's more...

RPO is considered a vital part of business continuity planning. Entire books have been written on this subject, and what we've presented here is essentially a functional summary. The subject is deep and varied, rich with its own inherent techniques beyond simply architecture and design. It is the language of business and resource management, so it can be a key component when interacting with decision makers.

Learning these concepts in depth can help influence the overall application stack to a more sustainable long-term structure. We'll cover more of these techniques in this chapter, but don't be afraid to proactively incorporate these techniques into your repertoire.


Defining timetables through RTO

Like RPO, RTO refers to a common business continuity term known as Recovery Time Objective. In practice, this is the amount of time an outage of the database layer may last. Often, it is incorporated into a Service Level Agreement (SLA) contract presented to clients or assumed as a metric within the application stack. Like RPO, this is a contractual-level element that can determine the number of required nodes at steadily increasing expense as the amount of tolerable downtime decreases.

In this recipe, we will examine the necessary steps to defining a realistic RTO, and what that could mean given known industry standards.

Getting ready

As with RPO, our goal in determining a functional RTO is to set expectations regarding inherent architecture limitations. The primary difference here is that RTO is more easily quantifiable. Fire up your favorite spreadsheet program, such as OpenOffice, Microsoft Excel, or Google Sheets; we'll be using it to keep track of how much time each layer of the application, including the database layer, contributes to a potential outage scenario.

How to do it...

We simply need to produce a spreadsheet to track all of the elements of known RTO that depend on the database. We can do this with the following steps:

  1. Locate an already-defined RTO SLA for each portion of the application dependent on PostgreSQL if possible.
  2. If this does not exist, seek the input of major decision makers:
  • VP and C-level executives involved with technology
  • Product manager
  • Application designers and architects
  • Infrastructure team lead
  3. Find an amount of time that will satisfy most or all of the above.
  4. Create a new spreadsheet for RTO.
  5. Create a heading row with the following columns:
  • Activity
  • Time (seconds)
  • Count
  • Total (seconds)
  6. In the Total column, create a formula that multiplies the Time and Count columns (for example, =B2*C2).
  7. Create one row for each of the following Activity categories:
  • Minor Upgrade
  • Major Upgrade
  • Reboot
  • Switchover
  • Failover
  • OS Upgrade
  • Etc.
  8. Copy and paste the formula into the Total column for all the rows we created.
  9. At the bottom of the Total column, after all relevant rows (row 21, for example), create a formula that sums the column (for example, =SUM(D2:D20)).
  10. Ensure that the end result looks something like the following screenshot:
  11. Follow the rest of the advice in this chapter to find a suitable architecture.
  12. Try to determine a rough cost for this and the closest alternative(s).
  13. Present the design and cost estimates to decision makers.
  14. Document this final RTO decision and architecture as reference material.

How it works...

In order to see where our PostgreSQL cluster fits company expectations, we need to know whether the company and each individual part of the existing application stack has an overall target RTO. If it doesn't, it's our job to approximate one. This means contacting any decision-makers, product owners, architects, and so on, to know what RTO target we're trying to attain and how other resources may contribute. These will act as a type of maximum value we can't exceed.

Keep in mind that RTO values tend to be amplified between layers. If our RTO is higher than some portion of the application stack, that will necessarily raise the RTO of that layer as well, which may increase the RTO of each subsequent layer. This is the exact scenario we're trying to avoid.

Once we have an RTO expectation, we need to examine how possible it is to fall under that target. The easiest way to accomplish this is to build a spreadsheet that essentially consists of a list of dependencies, maintenance tasks, or other occurrences related to PostgreSQL.

The rows we used for Activity are mainly suggestions, and producing an exhaustive list is generally dependent on the architecture to a certain extent. However, all software requires upgrades, machines need to be rebooted, switchover tests to prove high availability functionality may be required, past experience with the full application stack and hardware may imply two unexpected outages per year, and so on. Each of these will contribute to the cumulative RTO for PostgreSQL which we can use as a reference value.

The number we use for the Count column should be the number of times the Activity happens on a yearly basis. As an example, PostgreSQL has a quarterly release schedule for non-critical bug and security enhancements. If you want to follow along with these, it could make sense to set the Count column of Minor Upgrade to 4.

A number of architectural examples that we'll discuss later in this chapter will make it possible to set the Time column to 0 for some actions, or at least to a much lower value. We'll discuss these where relevant. This is also one of the reasons we'll need to execute this recipe multiple times when deciding on an appropriate architecture.

Once we have accounted for as many foreseeable Activity components as may be necessary over the course of a year, we'll have a cumulative total that may represent the RTO that PostgreSQL can achieve for a given architecture. As a sanity check, we should compare that value to the lowest RTO for any parts of the application stack that depend on PostgreSQL. It's important we don't exceed this target.
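The spreadsheet arithmetic described above can be sketched in a few lines of Python. The activities, durations, and yearly counts here are illustrative assumptions only, not recommendations:

```python
# Estimate the yearly downtime a PostgreSQL node may accrue.
# Each entry: (activity, seconds of downtime per occurrence, occurrences per year).
activities = [
    ("Minor Upgrade", 120, 4),  # quarterly PostgreSQL minor releases
    ("Reboot", 300, 2),
    ("Switchover", 30, 2),      # assumes a practiced, scripted switchover
    ("Failover", 60, 1),        # one unexpected outage per year
]

# Total column: Time * Count; the grand total is the cumulative RTO estimate.
total_seconds = sum(seconds * count for _, seconds, count in activities)
print(f"Estimated yearly downtime: {total_seconds} seconds "
      f"({total_seconds / 60:.1f} minutes)")
```

With these example numbers, the estimate comes to 1,200 seconds (20 minutes) per year, a figure we could then compare against the lowest RTO of any dependent layer.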

Then, as with RPO, we need to present the possible RTO to decision-makers so that it can be integrated into the overall company RTO. To do that, we must continue with the rest of the chapter to find one or two architectures with either higher or lower expected RTO, estimate the cost of each, and work on a suitable compromise.

Deriving an appropriate RTO may require multiple iterations of this recipe, from estimation, architecture selection, presenting it to appropriate parties, and so on. This isn't a fast or simple process, and it pays to get it right early. We need to know how many PostgreSQL nodes to purchase, where each will reside, how we switch to alternatives, how much time each step may take, and so on.

There's more...

Besides what we discussed in the main recipe, there are other RTO concepts we would like to explore.

This may seem familiar

Believe it or not, it's very likely you've encountered this concept without even realizing it. Internet service providers or application hosts often advertise how many 9s of availability their platform can maintain. It's often presented as a chart like this:

Uptime (%)          Daily      Weekly       Monthly      Yearly
99 ("two nines")    14m 24s    1h 40m 48s   7h 18m 18s   3d 15h 39m 30s
99.9                1m 26s     10m 5s       43m 50s      8h 45m 57s
99.99               8.6s       1m 1s        4m 23s       52m 36s
99.999              0.9s       6.1s         26.3s        5m 16s

As you can imagine, it's generally more desirable to stay toward the higher end of 9s to minimize downtime. On the other hand, this is highly restrictive, as Five 9s only allows just over five minutes of downtime over the course of an entire year. This doesn't leave much room for database maintenance tasks or unexpected outages at any other layer of the stack.
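The chart values follow from simple arithmetic: the downtime allowed over a period is the period length multiplied by the unavailable fraction. A minimal sketch:

```python
# Downtime permitted per year for a given uptime percentage ("nines").
SECONDS_PER_YEAR = 365 * 24 * 3600

def allowed_downtime(uptime_percent: float) -> float:
    """Seconds of downtime allowed per year at the given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - uptime_percent / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime(pct) / 60
    print(f"{pct}% uptime allows {minutes:,.1f} minutes of downtime per year")
```

At Five 9s (99.999%), this comes out to roughly 315 seconds, the five-and-a-quarter minutes per year mentioned above.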

Node counts

Generally, the more nodes we have, the lower our RTO will be. It may make sense to start with an initial estimate spreadsheet, and then create another for each architecture or variant that seems applicable. This will make it easier to rank the monetary cost and associated RTO for each. This may influence the final decision, and hence make it easier to track what options we may have.


Picking redundant copies

How many database servers should any architecture have as part of the inherent design? There are several factors that contribute to this answer, including the design of the final architecture itself. The number of redundant data copies ultimately determines how many nodes must exist, and in turn influences whether we require more data centers, whether we should account for latency, and so on.

The goal in this recipe is to consult our needs to derive a node count that won't break the bank, but still deliver the level of availability we want. In other words, aside from our primary data node, we will explain how to figure out the number of redundant data nodes necessary to adequately represent the entire cluster safely and ensure high availability.

Getting ready

Luckily, this recipe is a simple counting exercise. The only necessary elements are a healthy imagination and perhaps some idea of the budgetary constraints before we begin. Just consider that, for any of our reference designs, we will always require more than one server node as a minimum.

How to do it...

Observe the following steps when considering node counts driven by high availability architectures:

  1. Always add one separate server for backups.
  2. Always allocate one server for a logical or physical replica.
  3. For automated failover, allocate the following:
  • An additional small VM / node to act as a voter
  • OR a fully qualified replica
  4. For every active data center beyond the first two, allocate one replica.
  5. If non-local data access latency is a concern, allocate the following:
  • An additional replica in the primary location
  • An additional replica in each location for symmetric clusters
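The counting rules above can be condensed into a small function. This is only a sketch of this recipe's arithmetic, and the function and parameter names are our own:

```python
def redundant_node_count(data_centers: int,
                         automated_failover: bool,
                         latency_sensitive: bool,
                         symmetric: bool = False) -> int:
    """Count the PostgreSQL-related nodes implied by the recipe's guidelines."""
    nodes = 1                 # the primary data node itself
    nodes += 1                # a separate server for backups
    nodes += 1                # at least one logical or physical replica
    if automated_failover:
        nodes += 1            # a voting witness (or a fully qualified replica)
    nodes += max(0, data_centers - 2)  # one replica per data center beyond two
    if latency_sensitive:
        # one extra replica per location for symmetric clusters,
        # or just one in the primary location otherwise
        nodes += data_centers if symmetric else 1
    return nodes

# Company A: two data centers, manual failover, latency is not a concern.
print(redundant_node_count(2, False, False))                # 3 nodes
# Company B: three active data centers, automated failover, symmetric latency needs.
print(redundant_node_count(3, True, True, symmetric=True))  # 8 nodes
```

Note how the two calls reproduce the three-node Company A cluster and the eight-node Company B cluster discussed later in this recipe.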

How it works...

Why do we demand at least one backup server? The full answer to this question actually has its own recipe in this chapter. However, catastrophic failure is a fact of life and we must be ready for such an event. Even if the separate server is not a fully operational PostgreSQL node, it must exist and should be part of the reference design.

Likewise, we must have at least one PostgreSQL replica. Some of our designs work with either physical or logical replicas, so we won't differentiate between them here. Simply assume that every highly active PostgreSQL cluster must have at least two nodes that can fulfill the role of a primary database. Backups take time to restore, whereas replicas are generally writable in a minute or less.

One replica only really covers the case where switching from one PostgreSQL node to the alternate is a manual procedure. Fully automated failure detection mechanisms require an odd number of nodes for voting purposes. This third node can either be a mere voting entity, or a full PostgreSQL replica. We cover this in greater depth in the Considering quorum recipe.

Once we start accounting for multiple geographic locations at different data centers, things don't change exceptionally. By now, we have at least one PostgreSQL replica that is probably at the first alternate location. If we have three or more active data centers where the application is using PostgreSQL, we'll want a local replica for each.

Then, consider the implications of limiting ourselves to merely one PostgreSQL node per location. This means any minor upgrade or other maintenance task will mean switching to an alternate data center while the maintenance is active. This can introduce unwanted latency that will affect the application. To reduce this, add one replica to the primary location to account for this effect. For symmetrical data centers that have no primary location, add a PostgreSQL replica to each location for the same reasons.

As a quick example, consider two scenarios. Our first company (Company A) only uses two data centers, and doesn't need automated database failover, nor is it necessary to worry about downtime caused by minor upgrades. In this case, they decided to use two PostgreSQL servers and a backup system. This is a minimum of three nodes related to PostgreSQL, and their cluster looks like this:

Company A

In the second case (Company B), we have a much more demanding financial institution that requires all three of their data centers to be active at all times. They chose to have one Primary PostgreSQL server, two Replicas per data center, a Witness node, and a Backup server. In that extreme case, they used a total of eight nodes dedicated to PostgreSQL.

Their cluster would look like this:

Company B
Nodes denoted here are specifically meant to be running PostgreSQL. Architectures discussed in this chapter will include various abstraction layers and, as a result, will likely require even more nodes in the cluster. PostgreSQL nodes tend to be much larger and do far more work, so are the focus of this recipe.

There's more...

Chosen node counts will naturally suggest certain architectures over others. The reason we're performing this recipe so early is to get an idea of compatible architectures. It may be necessary to revise these counts once we learn information from other recipes in this chapter.


Selecting locations

Once we've decided how many PostgreSQL nodes to allocate in our cluster, where should we put them? Generally, this is easy to answer, but there are some subtleties we need to consider as well. A truly high availability cluster can resist many different types of failure, including where the servers themselves reside.

In this recipe, we will learn all about the ways separate geographical locations can affect our chosen design.

Getting ready

It's time to start drawing diagrams. Find your favorite drawing program, such as Visio or Dia, or grab a convenient sheet of paper or whiteboard.

Keep in mind that the ideal design may require more data centers than the company currently utilizes. In these cases, it may be possible to supply sufficient justification to contract at least one more location if it benefits the RPO or RTO. Hence, we recommend following the Setting expectations with RPO recipe and Defining timetables through RTO recipe before continuing here.

How to do it...

Consider these basic guidelines while thinking about how many data centers are necessary, and which nodes should be placed in each:

  1. If data must be available in case of a site outage, use one additional location.
  2. Always place the backup in a separate location if possible.
  3. If two locations are in the same general geographical area, use one additional location at least 100 miles (160 km) away.
  4. If automated failover is desirable, consider at least three data centers.
  5. Place one PostgreSQL server (or witness) in each location.
  6. Continue placing PostgreSQL servers evenly until the count is exhausted.
  7. Try to place witness servers in a location that is unlikely to lose contact with more than one location simultaneously.

How it works...

Let's consider an extreme example to explain how this works: a financial institution wants to place six PostgreSQL nodes, one witness server, and a backup system. This would clearly be a silly design, as shown in the following diagram:

This places every node in a single location that could lose its connection to the internet, lose power, or suffer some other catastrophe that would render the entire database stack unusable or even destroyed.

Now, let's apply the guidelines. First of all, we want to protect the backup; let's place that elsewhere, as seen in this following diagram:

Now, one PostgreSQL server and the backup are safe in case something happens to the first data center. Of course, we now have a new problem: what happens if Chicago itself is somehow isolated from the rest of the internet? Though incredibly rare, major internet backbone outages like this are possible.

So, let's add a third data center in Dallas. This allows us to actually follow three separate rules. We can move the backup to that remote location so it's even safer. We can relocate at least one more PostgreSQL server to that data center as well, so it acts as an alternate in case Chicago becomes unavailable. And finally, we have three data centers, so it's possible to safely use automated failover.

Recipes later in this chapter will explain why we keep insisting that safety comes in odd numbers. For now, just keep it in mind when considering the design.

With these revisions, our cluster looks more like the following diagram:

The only remaining rules suggest that our node distributions should be more even, and that the witness should be less likely to lose contact with more than one location at once. This means we need to move a couple more of our nodes to the other data centers. But what about the witness node? It's actually best to leave it in the second Chicago location. If Chicago is separated from Dallas, the witness is still at least in another data center, and is less likely to lose contact with Chicago, thereby preserving its voting abilities.

Given we're discussing a large financial institution that likely has access to multiple data centers, perhaps there's an even better solution. Following one final node reorganization, the cluster probably looks more like this diagram:

Now, we can utilize three data centers that are relatively diverse in their geographic isolation. This may appear to be an extreme case, but is merely illustrative of the process we recommend to ensure that the overall state of the cluster is as safe as possible.

There's more...

As database sizes increase, having only a single offsite backup as the sole recovery source can become inconvenient. Not only do we have to wait for the data to copy from the backup location, but also for the time it takes to restore the database instance to working order, plus any further recovery steps.

In these cases, it's reasonable and even suggested to maintain a backup server at each major data center. This can be done by backing up a local replica, or by using some other kind of filesystem copy or distribution mechanism synchronizing between the locations.



Having enough backups

Database backups are a crucial component of any architecture, and should be considered a required part of the central design. The only real question in most cases is: how many backups? All highly available clusters account for relevant backup copies, lest the cluster itself be lost.

In this recipe, we'll cover one simple set of rules to provide an answer.

Getting ready

This is very important, so write it down if necessary. Put it in company documentation if possible.

How to do it...

When considering how many backups to allocate, follow the 3-2-1 backup rule, which consists of these elements:

  1. Keep at least three copies of your data.
  2. Store two copies on different devices.
  3. Keep at least one copy offsite.

Take note that many things may qualify as a copy, including PostgreSQL replicas, and the original data itself.

How it works...

Notice how we don't really recommend a specific backup method, how it should be stored, or how filesystem-based features might contribute. All of those things are implementation details and don't matter as much as the rules themselves.

Consider the first rule: keep at least three copies of your data. Since our PostgreSQL instance is the first copy, we need two more. What might these two be? Could we use one replica and a backup? Maybe we could use two replicas? Perhaps, but let's examine the remaining rules first.

Imagine we've produced a PostgreSQL backup, and it's stored on our primary server and the same filesystem as the database instance. What happens if that storage device is damaged or destroyed? Now, we've lost the database and the backup.

That naturally leads to the second rule: store two copies on different devices. It's fine to retain a copy on the local PostgreSQL server and even the same physical storage device, provided we store a copy of the backup on a device that won't be lost simultaneously. Store another copy of the backup on a second physical device. This can be a separate set of storage drives, a SAN, a shared NFS filesystem, or anything else, so long as it's separate from the database itself.

Be wary of relying on shared infrastructure for following the second rule. If we have two separate LUNs from the same SAN mounted on our database server for PostgreSQL and the backup, this means nothing if the SAN itself is lost. If possible, try to ensure that the backup is actually on a physically distinct device.

So, to fulfill the second rule, we merely need to ensure that the second copy of our PostgreSQL instance is on another device. This is most easily done by creating a replica on another server, or a VM hosted on a different hypervisor. That's two copies of the data.

Finally there's rule three: keep at least one copy offsite. This is the third copy of our data, and it's best to place it somewhere that's immune from a catastrophic failure of the data center itself. In limited circumstances, it may be safe enough to place the backup on a server in another rack of the same data center, but why take the risk?

There are ample cloud providers, vault systems, and cheap storage services that can fill the role of hosting the third and final copy of our data. If we have our own second data center, that's an easy choice of venue. If not, it's important to select, allocate, and designate some tertiary location that won't be lost if the worst happens. This third data copy is an insurance policy, and it doesn't even have to be easily available. As long as we can obtain the backup upon request, that fits the minimum requirement.

There's more...

There's one important corollary here. PostgreSQL replicas tend to immediately reflect every change made to the primary node. What happens if someone accidentally drops a table? In this case, the 3-2-1 rule by itself is not sufficient. Relying on replicas alone means we've permanently lost this data.

Thus, we strongly recommend the following additional rule: At least one copy must be a true backup.

Databases such as PostgreSQL are equipped with Point-In-Time Recovery (PITR), which allows the user to start from any past backup and apply changes until a specific point where recovery is stopped. This lets us recover a backup to the point before a table was damaged or removed, capture the desired contents, and reintroduce them into the original database. This can only be done with a real binary-level backup, and there are multiple tools dedicated to performing this task. This book even contains a chapter dedicated to Backup Management.
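The 3-2-1 rule and the corollary above lend themselves to a quick automated check. Here's a sketch in Python, where the copy attributes and example names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DataCopy:
    device: str        # the physical storage device holding this copy
    site: str          # the data center or location of the copy
    true_backup: bool  # a real binary backup supporting PITR, not just a replica

def satisfies_3_2_1(copies: list[DataCopy], primary_site: str) -> bool:
    """Check the 3-2-1 rule, plus: at least one copy must be a true backup."""
    return (len(copies) >= 3                                 # three copies of the data
            and len({c.device for c in copies}) >= 2         # on two different devices
            and any(c.site != primary_site for c in copies)  # one copy offsite
            and any(c.true_backup for c in copies))          # one real backup for PITR

copies = [
    DataCopy("local-nvme", "chicago-1", False),   # the primary instance itself
    DataCopy("replica-ssd", "chicago-2", False),  # a streaming replica
    DataCopy("cloud-vault", "dallas-1", True),    # an offsite binary backup
]
print(satisfies_3_2_1(copies, primary_site="chicago-1"))  # True
```

Dropping the offsite vault from the list would make the check fail, since both the copy count and the true-backup corollary would be violated.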

See also


Considering quorum

Quorum can best be explained by imagining any voting system. It's a result of trusted consensus, and the underlying algorithms are backed by decades of formal and quantitative study. The most common way to guarantee a quorum for a PostgreSQL cluster is to utilize a witness node, which exists only to observe the state of the cluster and cast votes. This helps us reach maximum availability by guaranteeing there's always an active primary node.

In this recipe, we'll examine why it's important to apply the concept of quorum to our PostgreSQL cluster, and how we may do so.

Getting ready

The primary criterion for establishing a quorum is that we must be able to avoid tie votes, also known as establishing consensus. Basically, this means we must have an odd number of PostgreSQL nodes within our cluster so that there's always a majority. We should already have a preliminary node count from working through previous recipes in this chapter, in particular, the Picking redundant copies and Selecting locations recipes.

That being said, the concept of quorum is only necessary in clusters that intend to provide automated failover capabilities. If this is not going to be a feature of the end architecture, this recipe may be skipped.

How to do it...

Once we have an initial node count, we should apply these guidelines to adjust the total count and node distribution:

  1. If the initial PostgreSQL node count is even, add one witness node.
  2. If the initial PostgreSQL node count is odd, convert one replica into a witness node.
  3. In the presence of two locations, the witness node should reside in the same data center as the primary node.
  4. If possible, allocate witness nodes in an independent tertiary location.
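As an illustration of why the witness matters, here's a minimal election sketch in Python. It isn't modeled on any particular failover tool, and the voting rules are simplified assumptions: replicas vote for themselves, while the witness must pick another node.

```python
# Hypothetical election sketch: replicas vote for themselves, while a
# witness must vote for another node. Not modeled on any real tool.

def elect_new_primary(nodes):
    """Count votes after the primary fails.

    `nodes` maps node name -> role ('replica' or 'witness').
    """
    replicas = sorted(n for n, role in nodes.items() if role == "replica")
    votes = {}
    for name, role in nodes.items():
        # A witness cannot vote for itself; here it backs the first
        # replica by name, purely for illustration.
        choice = name if role == "replica" else replicas[0]
        votes[choice] = votes.get(choice, 0) + 1
    winner, count = max(votes.items(), key=lambda kv: kv[1])
    majority = len(nodes) // 2 + 1
    return winner if count >= majority else None

# Two replicas deadlock at one vote each; adding a witness breaks the tie.
print(elect_new_primary({"replica1": "replica", "replica2": "replica"}))
print(elect_new_primary({"replica1": "replica", "replica2": "replica",
                         "witness": "witness"}))
```

With two replicas alone, no node reaches the majority of two votes, so no promotion occurs; the witness's extra vote is what makes a decisive outcome possible.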

How it works...

While deceptively simple, there's actually a lot of thought involved in correctly placing that odd node, and in why we use witness nodes rather than yet another PostgreSQL replica:

  1. Our first guideline is the most straightforward: ensure there is an odd number of nodes in the cluster. Once we have that, any event in the cluster is submitted to the entire quorum for a decision, and only agreement guarantees subsequent action. Further, since the witness cannot vote for itself, only one eligible node will ever win the election. Consider this sample cluster diagram:

We have three nodes in this cluster and, in the event of a failure of the Primary node, the Witness must vote for the only remaining Replica. If the Witness had been a standard replica node, it could have voted for itself and potentially led to a tied vote. In an automated scenario, this would prevent the cluster from promoting a replacement Primary node.

  2. The second guideline is a variant of this concept. If we already had an odd number of nodes, one of these should be a Witness rather than a standard replica. Consider this diagram:

We can see here that the third node is still a replica, but it also acts as a Witness. Essentially, we don't allow this node to vote for itself to become the new Primary. This kind of role works well for read-only replicas that exist only for application use and is a good way to reuse existing resources.

  3. The third guideline, of placing the Witness in the same location as the Primary node, safeguards node visibility. More important than automation is safety. By placing the Witness in the same location as the Primary when there are only two data centers, we can ensure that a network partition (a situation where we lose network connectivity between the data centers) won't result in the alternate location incorrectly promoting one of its replicas. Consider this diagram:

If the connection between Chicago and Dallas is lost, Chicago still has the majority of voting nodes, and Dallas does not. As a result, the cluster will continue operating normally until the network is repaired, and we didn't experience an accidental activation of a node in Dallas.

Some failover automation systems also take physical location into account by verifying that all nodes in one location agree that all nodes in the other location are not responding. In these cases, the only time automation will not work normally is when a network partition has occurred. This approach is only viable when more than one node exists in each location, which can be accomplished by allocating further replicas, or even more witness nodes.

Unfortunately, our cluster is no longer symmetrical. If we activate the node in Dallas, there are no witnesses in that location, so we must eventually move the Primary back to Chicago. This means every failover will be followed by a manual switch to the other location, thus doubling our downtime.

The easiest way to permanently address these concerns is to add a third location and assign a node there. In most cases, this will be the Witness node itself. Consider this example:

In this case, we may desire that only Chicago or San Jose host the active PostgreSQL node. In the event of a failure of our Primary node, San Jose should take over instead. The Witness can see both data centers and decide voting based on this. Furthermore, it doesn't matter if the Primary is active in Chicago or San Jose, because the Witness is not tied directly to either location.

There's more...

What happens in the case of a tie? Even if the original cluster contained an odd number of nodes, when the Primary node goes offline, this is no longer true. In simple quorum systems, each node votes for itself. However, a Witness, by its definition, must vote for some other node. This means some replica in the cluster will have more than one vote, and thus win the election.

If there are somehow multiple witnesses and the votes are split anyway, PostgreSQL quorum systems usually account for the Log Sequence Number (LSN) reached by each node. Even if the difference is only a single transaction, one of the tied nodes will have replicated more data than the others, and this will break the tie.
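As a rough sketch of this tie-breaking idea (not the implementation of any specific tool), we can compare LSNs numerically. PostgreSQL formats an LSN as two hexadecimal numbers separated by a slash, such as '0/3000060':

```python
# Hypothetical tie-break sketch: when votes are split, prefer the
# candidate that has replayed the most WAL (the highest LSN).

def lsn_to_int(lsn):
    # An LSN like '0/3000060' is a 64-bit position split into two
    # hexadecimal halves.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def break_tie(candidates):
    """`candidates` maps node name -> last replayed LSN string."""
    return max(candidates, key=lambda n: lsn_to_int(candidates[n]))

# replica2 has replayed slightly more WAL, so it wins the tie.
print(break_tie({"replica1": "0/3000060", "replica2": "0/3000128"}))
```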


Introducing indirection

What happens to connections to a PostgreSQL system when the service must be shut down for maintenance, or the node itself experiences a hardware problem? Previous recipes have already recommended we integrate at least one data replica into our design, but how should we handle switching between these resources? A great way to achieve high availability is to make server maintenance or replacement as simple as possible.

The concept we'll be exploring in this recipe will be one of anticipating system outages, and even welcoming them, by incorporating proxy techniques into the design.

Getting ready

There are actually several methods for switching from one PostgreSQL node to another. However, when considering the node architecture as a whole, we need to know the four major techniques to handle node indirection:

  1. Domain name reassignment
  2. Virtual IP address
  3. Session multiplexing software
  4. Software or hardware load balancer

In real terms, these are all basically the same thing: a proxy for our PostgreSQL primary node. Keep this in mind as we consider how they may affect our architecture. It would also be a good idea to have some diagram software ready to describe how communication flows through the cluster.

How to do it...

Integrating a proxy into a PostgreSQL cluster is generally simple if we consider these steps in the design phase:

  1. Assign a proxy to the primary node.
  2. Redirect all communication to the primary node through the proxy.
  3. If the proxy requires dedicated hardware or software, designate two to account for failures.
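The effect of the first two steps on application code can be sketched as follows. The application builds its connection string against a stable proxy name rather than any physical node; the hostname shown is a hypothetical placeholder for whatever name your network team assigns:

```python
# Sketch: applications target a stable proxy name, never a node.
# 'pg-proxy.example.com' is a hypothetical placeholder hostname.

PROXY_HOST = "pg-proxy.example.com"  # permanent; survives node replacement

def connection_dsn(dbname, user, port=5432):
    """Return a libpq-style DSN that targets the proxy, not a node."""
    return f"host={PROXY_HOST} port={port} dbname={dbname} user={user}"

# Swapping the primary behind the proxy changes nothing here:
print(connection_dsn("sales", "app_user"))
```

Because no application ever embeds a node name, replacing or promoting nodes behind the proxy requires no application changes at all.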

How it works...

These rules are simple, but that's one of the reasons they're often overlooked. Always communicate with the Primary node through at least one proxy.

Even if this is merely an abstract network name, or an ephemeral IP address, doing so prevents problems that could occur, as seen in the following diagram:

What happens when the Primary PostgreSQL node is offline and the cluster is now being managed by the Standby? We have to reconfigure (and possibly restart) any and all applications that connect directly to it. With one simple change, we can avoid that concern, as seen here:

By following the second guideline, all traffic is directed through the Proxy, thus ensuring that either the Primary or Standby will stay online and remain accessible without further invasive changes. Now, we can switch the active primary node, perform maintenance, or even replace nodes entirely, and the application stack will only see the proxy.

We've encountered clusters that do not follow these two guidelines. Sometimes, applications will actually communicate directly with the primary node as assigned by the inventory reference number. This means any time the infrastructure team or vendor needs to reassign or rename nodes, the application becomes unusable for a short period of time.

Sometimes, hardware load balancers are utilized to redirect application traffic to PostgreSQL. On other occasions, this is done with connection multiplexing software such as PgBouncer or HAProxy. In these cases, the proxy is not simply a permanent network name or IP address associated with the PostgreSQL cluster, but an actual piece of hardware or a running service. This means that a software or hardware failure could also affect the proxy itself.

In this case, we recommend using two proxies, as seen here:

This is especially useful in microarchitectures, which may consist of dozens or even hundreds of different application servers. Each may target a different proxy such that a failure of either only affects the application servers assigned to it.

There's more...

Given that applications must always access PostgreSQL exclusively through the Proxy, we always recommend assigning a reference hostname that is as permanent as possible. This may fit with the company naming scheme, and should always be documented. PostgreSQL nodes may come and go and, in extreme cases, the cluster itself can be swapped for a replacement, but the Proxy is (or should be) forever.

Physical proxy nodes themselves are not immune to maintenance or failure. Thus, it may be necessary to contact the network team to assign a CNAME or other fixture that can remain static even as the proxy hardware fluctuates.

See also


Preventing split brain

Split brain is the scenario that occurs when more than one primary node is active in a PostgreSQL cluster simultaneously. In these circumstances, if any data was written to both nodes from the application, it becomes extremely difficult to rectify. Certainly, no cluster with such data corruption can be considered highly available!

In this recipe, we will further explore the concept, and how we might mitigate this problem.

Getting ready

An important concept necessary for preventing split brain scenarios is fencing: isolating a node from the application and database stack. Often, this is accomplished through STONITH (which stands for Shoot The Other Node In The Head). In situations where this is not possible, the old primary must invoke SMITH instead, or Shoot Myself In The Head. While it may sound extreme, for servers this is really a temporary measure that prevents a far more worrying complication.

Keep these terms in mind while we explore how they may affect our architecture.

How to do it...

Apply these steps when designing a cluster to help minimize or eliminate the risk of split brain:

  1. If available, allocate STONITH hardware for nodes that may take the role of the Primary.
  2. Consider situations, such as network interruptions, where SMITH must be used instead.
  3. Ensure PostgreSQL does not start automatically following a system reboot.
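The SMITH decision in step 2 can be sketched as a simple timeout check. The threshold and the exact peer set are illustrative assumptions, not values taken from any real HA tool:

```python
# Hypothetical SMITH sketch: a primary that cannot see the standby or
# witness for too long assumes it is partitioned away and fences itself.
# The threshold and shutdown action are illustrative only.

DISCONNECT_LIMIT = 30  # seconds both peers must be unreachable

def should_self_fence(standby_down_for, witness_down_for,
                      limit=DISCONNECT_LIMIT):
    """Return True when the primary should power itself down."""
    # Only losing *both* peers suggests this node is the isolated one.
    return standby_down_for >= limit and witness_down_for >= limit

# Losing only the standby is not enough; losing both peers is.
print(should_self_fence(standby_down_for=45, witness_down_for=5))   # False
print(should_self_fence(standby_down_for=45, witness_down_for=40))  # True
```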

How it works...

Modern servers are often equipped with hardware that enables remote administration, and these components frequently allow network access to the boot process itself. Where PostgreSQL is installed on virtual servers, the hypervisor serves this role. Many companies also install Power Distribution Units (PDUs) that can be remotely instructed to cut power to a server.

Whatever the approach, working with infrastructure or systems operations teams is likely necessary to gain access to interact with these devices. As an example, imagine we have a 2-node cluster consisting of a Primary and Standby, and a PDU is available for each. We could do something like this:

Here, in the case of a failover and subsequent promotion, the Standby could instruct the PDU to cut power to the Primary, ensuring that no applications could still be connected to it. But what about a scenario where network interruptions are far more common, such as between two data centers, as seen here:

In this scenario, the Standby would be unable to shut down the Primary node during a failover. This is why it's important to consider SMITH approaches as well: the Primary should monitor the state of the Standby and Witness, and if both disconnect for a sufficient period, it should assume the Standby will be promoted and power itself down to prevent applications from interacting with it.

Finally, always disable the PostgreSQL startup mechanism on clusters equipped with high availability management software. That software should manage starting and stopping the service, which also prevents unintended events such as two primary nodes being active simply because a reboot started a previously failed PostgreSQL node.

Advanced recipes later in this book will adhere to the rule of disabling PostgreSQL on startup and provide exact instructions for doing so.

There's more...

Split brain is difficult to prevent because failure detection is never perfect, and this is why dedicated software for managing high availability exists. While these tools are not immune to the issue, they greatly reduce its likelihood.

Pacemaker has components specifically for interacting with STONITH hardware. repmgr implements the concept of hook scripts for event notifications, and accounts for scenarios where the Primary is isolated from the remainder of the cluster as seen in the previous diagram. Patroni uses a sophisticated locking mechanism that only allows one primary node to be registered at once.

Don't try to invent a solution for an already solved problem when most of the work has already been done by companies dedicated to the cause.


Incorporating multi-master

Some PostgreSQL vendors provide proprietary extended functionality that makes it possible for a cluster to contain multiple writable Primary nodes simultaneously. Users of this kind of software can expect certain enhanced capabilities, though concessions are often necessary. This recipe will explore how PostgreSQL multi-master can influence cluster topology.

Getting ready

This recipe will require some knowledge of where the nodes are likely to reside on a global scale. Will some PostgreSQL nodes be in Dubai, while others are in Cairo or Toronto? We will also need to have a very basic understanding of how the application operates. This may mean interacting with application developers or designers to derive a rough approximation of queries required for basic operation.

How to do it...

When considering deploying multiple writable PostgreSQL nodes, utilize these guiding questions:

  1. Is there significant geographical distance between nodes?
  2. Does the application use multiple transactions or queries per operation?
  3. Are accounts or users likely to operate primarily in a certain region?

How it works...

Probably the most obvious benefit arising from using multiple writable PostgreSQL nodes is one of reduced write latency. Consider an initial cluster that may resemble this diagram:

Each write to Tokyo or Sydney must first cross thousands of miles before being committed. And due to how replication works, the local replicas in those regions must wait for the transaction to be replayed before it becomes visible there. These times can be considerable. Consider this table of round-trip times for network traffic between the locations we've chosen:






              Primary     Tokyo      Sydney
  Primary        -        205 ms     145 ms
  Tokyo        200 ms       -        195 ms
  Sydney       145 ms     195 ms       -


Each write may require over 200 ms simply to reach the primary node. Then, the same data must be transmitted from the Primary to each Standby, doubling the time necessary before the transaction becomes visible in the continent where it originated. Since many application actions invoke multiple transactions, this can cause a time amplification effect that, in extreme cases, could last several minutes.
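This amplification can be illustrated with a small, deliberately simplified model that charges each remote write two network crossings: one round trip to commit on the far-away primary, and roughly one more before the local replica reflects the change.

```python
# Rough arithmetic for write latency amplification. This is a
# simplification: real replay delay varies, but two crossings per
# remote write is a useful order-of-magnitude model.

def operation_latency_ms(rtt_ms, writes_per_operation):
    # One RTT to commit remotely, one more before the local replica
    # becomes consistent.
    return writes_per_operation * (2 * rtt_ms)

# A task issuing 12 writes from a region with ~200 ms RTT:
print(operation_latency_ms(200, 12), "ms")  # 4800 ms, nearly five seconds
```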

This is why we ask whether an application performs multiple actions per task. Displaying a web page may require a dozen queries. Submitting a credit application can mean several writes and polling for results. With competition around every corner, every second of waiting increases the chances that a user will simply switch to another application without such latency issues. If each of those nodes were a Primary, the transaction write overhead would be effectively zero.

The last question we should answer is one of expandability. As the usage volume of the cluster increases, we will inevitably require further nodes. A popular method of addressing this is to regionalize the primary nodes, but otherwise follow standard replication concepts. As an example, imagine we needed a further two nodes in each region to fulfill read traffic. It could look something like this:

This Hub + Spoke model helps ensure each region can keep up with demand, without adding latency by including nodes outside of a particular region. Note also that, when using these multi-master clusters, all nodes often require direct connections to each other.

There's more...

These types of multi-master PostgreSQL clusters often require two direct connections between all participating nodes, one for each direction of communication. This is called a Mesh topology, and is considered by some to be a source of excessive communication overhead. If we think about it, that's a valid criticism given that every transaction in the cluster must eventually be acknowledged by every other primary node. In very active systems, the impact could be significant.

A scenario along the lines of the following diagram, for example, may present complications:

By merely adding three more primary nodes, we've increased the number of communication channels to 30. In fact, the general formula for this can be expressed for PostgreSQL multi-master as follows:

C = N * (N - 1) 

So, if we have 3 nodes, we can expect 6 channels, but if we have 10 nodes, there are 90 instead. This is one major reason for the Hub + Spoke model: the local Replica nodes are not primary nodes, so they don't contribute to the topology's communication overhead.
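The channel formula above translates directly into code:

```python
# Mesh channel count: every node maintains a channel to every other
# node, in each direction, giving C = N * (N - 1).

def mesh_channels(nodes):
    return nodes * (nodes - 1)

print(mesh_channels(3))   # 6
print(mesh_channels(6))   # 30
print(mesh_channels(10))  # 90
```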

Consider the potential impact of this before simply embracing the benefits of operating in multiple locations simultaneously.

See also


Leveraging multi-master

One significant benefit to using software that enables multiple primary nodes in a PostgreSQL cluster is the associated increase in availability. This functionality can eliminate node promotion time and allow a fully active application stack on all data backends if properly configured.

In this recipe, we'll explore advanced usage of a multi-master cluster, and how it can help us reach the pinnacle of high availability.

Getting ready

It's crucially important to become familiar with the benefits and drawbacks of how multi-master operation can affect the cluster. The previous Incorporating multi-master recipe is a good place to start. Additionally, information we cover here can be directly relevant to the Defining timetables through RTO recipe and the Picking redundant copies recipe.

In a way, this recipe will bring together a lot of concepts we've covered through the chapter, so we recommend covering it last if possible.

How to do it...

To really make the most of multi-master architecture, follow these guidelines:

  1. Always allocate a proxy layer.
  2. If cross-data center latency is relevant, allocate at least two nodes per location.
  3. It's no longer necessary to worry about adding nodes specifically to maintain quorum.
  4. Geographically partition data if possible.

How it works...

We actually recommend applying the first rule to all clusters, as suggested in the Introducing indirection recipe. It's especially important here as the focus is specifically centered on maximizing availability.

Unlike a standard PostgreSQL node, a cluster containing multiple primary nodes does not require the promotion of alternate systems to writable status. This means we can switch to them in a nearly instantaneous manner. A properly configured proxy layer means this is possible without directly alerting the application layer. Such a cluster could resemble this diagram:

Given this configuration, it's possible to switch from one Primary to the other with a pause of mere milliseconds in between. This effectively means zero RTO contribution for that action. This allows us to perform maintenance on any node, essentially without disturbing the application layer at all.

In the preceding configuration, however, we only have one node per location. In the event that the Primary in Chicago fails or is undergoing maintenance, applications in that location will be interacting with the Dallas node. A better design would be something like this:

With two nodes per data center, we're free to swap between them as necessary. If the proxy uses a connection check mechanism, it can even autodetect online status and ensure traffic always goes to the online node in the same location.

The extra Primary per data center need not remain idle when not in use. Some proxy systems can allocate application sessions by user, account, or some other identifying characteristic. This allows safe load balancing that avoids risks associated with multi-master systems, such as conflict resolution.

Pay attention to the preceding diagrams and try to find one common attribute they both share.

Find it yet?

Each cluster has an even number of nodes. Also note that we didn't compensate for this by adding any kind of witness node to help arbitrate the quorum state. This is because each node is a Primary with no failover process to manage. As a consequence, we no longer have the usual cause of split brain, nor must we worry too much about network partition events.

Finally, try to arrange the cluster so that data is associated as closely with its users as possible. If users are bank clients interacting with their own accounts and can be regionalized by country, this is an easy choice. If it's a shared-service microarchitecture with applications indiscriminately modifying data from arbitrary accounts, it's not so simple.

For those more advanced circumstances, it's possible to approach the problem from a smaller scale. Perhaps servers in the same rack only communicate with the database nearest to them physically. Perhaps the proxy layer can use sticky sessions and route connections to specific primary nodes based on a stable metric.

The goal here is data locality. While multi-master PostgreSQL allows multiple nodes to ingest writes simultaneously, consider transmission latency. We can observe this in a simple two-node interaction:

  1. Node A accepts a write for Account X.
  2. Node A sends the result to Node B.
  3. The application is stateless and connects to Node B.
  4. The application notices the data is missing on Node B and submits the change again.
  5. Node B replays data from Node A.
  6. Account X has now been modified twice.

If the application session were tightly coupled to one primary node, this scenario would not be possible. There are numerous ways to accomplish this coupling, and it also ensures the fastest turnaround for associated data that was previously modified.
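One hypothetical way to achieve this coupling is to hash a stable attribute of the session, such as an account ID, so that every connection for that account is routed to the same primary. The node names here are placeholders:

```python
# Hypothetical sticky-routing sketch: hash a stable attribute (the
# account ID) so every session for an account reaches the same primary.

import hashlib

def route_account(account_id, primaries):
    """Deterministically map an account to one node in `primaries`."""
    digest = hashlib.sha256(str(account_id).encode()).hexdigest()
    return primaries[int(digest, 16) % len(primaries)]

nodes = ["chicago-1", "chicago-2", "dallas-1", "dallas-2"]
# The same account always lands on the same node:
print(route_account("acct-42", nodes) == route_account("acct-42", nodes))
```

A real proxy layer would implement this with sticky sessions rather than application code, but the principle is identical: deterministic placement eliminates the duplicate-write scenario above.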

There's more...

PostgreSQL multi-master solutions use logical replication to transfer data between nodes by necessity. As a result, software versions need not match. This means that PostgreSQL 11 and PostgreSQL 12 nodes may coexist in the same cluster. Combined with a proxy layer as recommended, this allows fully online, major-version upgrades. From an RTO perspective, this means the following elements may all be assumed to contribute zero or a small number of milliseconds:

  • Node failover and switchover
  • Minor upgrades (v12.1 to v12.2)
  • Node maintenance
  • Major upgrades (v11 to v12)

Due to its proprietary nature, PostgreSQL multi-master is generally not available without additional cost. Consider any associated pricing when tabulating RTO architecture variant cost sheets. This should enable management to make an informed decision based on expenses associated with pursuing extremely low RTO features such as these.

See also

About the Author
  • Shaun Thomas

    Shaun M. Thomas has been working with PostgreSQL since late 2000. He is a frequent contributor to the PostgreSQL Performance and General mailing lists, assisting other DBAs with the knowledge he's gained over the years. In 2011 and 2012, he gave presentations at the Postgres Open conference on topics such as handling extreme throughput, high availability, server redundancy, and failover techniques. Most recently, he has contributed the Shard Manager extension and the walctl WAL management suite. Currently, he serves as the database architect at PEAK6 Investments, where he develops standard operating procedure (SOP) guidelines to facilitate reliable server architecture among many other tasks. Many of the techniques used in this book were developed specifically for this extreme environment. He believes that PostgreSQL has a stupendous future ahead, and he can't wait to see the advancements subsequent versions will bring.
