In this chapter, we will learn about selection and provisioning of hardware necessary to build a highly-available PostgreSQL database. We will cover the following recipes in this chapter:
- Planning for redundancy
- Having enough IOPS
- Sizing storage
- Investing in a RAID
- Picking a processor
- Making the most of memory
- Exploring nimble networking
- Managing motherboards
- Selecting a chassis
- Saddling up to a SAN
- Tallying up
- Protecting your eggs
What does high availability mean? In the context of what we're trying to build, it means we want our database to start and remain online for as long as possible. A critical component of this is the hardware that hosts the database itself. No matter how perfect a machine and its parts may be, failure or unexpected behavior of any element can result in an outage.
So how do we avoid these unwanted outages? Expect them. We must start by assuming hardware can and will fail, and at the worst possible moment. If we start with that in mind, it becomes much easier to make decisions regarding the composition of each server we are building.
Make no mistake! Much of this planning will rely on worksheets, caveats, and compromise. Some of our choices will have several expensive options, and we will have to weigh the benefits offered against our total cost outlay. We want to build something stable, which is not always easy. Depending on the size of our company, our purchasing power, and available hosting choices, we may be in for a rather complicated path to that goal.
This chapter will attempt to paint a complete picture of a highly-available environment in such a way that you can pick and choose the best solution without making too many detrimental compromises. Of course, we'll offer advice on what we believe is the best overall solution, but you don't always have to take our word for it.
For the purposes of this chapter, we will not cover cloud computing or other elastic allocation options. Many of the concepts we introduce can be adapted to those solutions, yet many are implementation-specific. If you want to use a cloud vendor such as Amazon or Rackspace, you will need to obtain manuals and appropriate materials for applying what you learn here.
Redundancy means having a spare; but a spare for what? Everything. Every single part, from motherboard to chassis, power supply to network cable, disk space to throughput, should have at least one piece of excess equipment or capacity available for immediate use. Let's go through as many of these as we can imagine, before we do anything that might depend on something we bought.
Fire up your favorite spreadsheet program; we'll be using it to keep track of all the parts that go into the server, and any capacity concerns. If you don't have one, OpenOffice and LibreOffice are good free options for building these spreadsheets. Subsequent sections will help determine most of the row contents.
We simply need to produce a hardware spreadsheet to track our purchase needs. We can do that with the following steps:
- Create a new spreadsheet for parts and details.
- Create a heading row with the following columns:
- Total cost
- Create a new row for each type of the following components:
- Hard Drive (3.5")
- Hard Drive (2.5")
- Hard Drive (SSD)
- Network Card
- Power Supply
- RAID Controller
- In the Chassis row, under the Total Cost column, enter the following formula:
- Copy and paste the formula into the Total Cost column for all the rows we created. The end result should look something like the following screenshot:
What we've done is prepare a spreadsheet that we can fill in with information collected from the rest of this chapter. We will have very long discussions regarding each part of the server we want to build, so we need a place to collect each decision we make along the way.
The heading row can include any other details you wish to retain about each part, but for the sake of simplicity, we have stuck to the bare minimum. This also goes for the parts we chose for each row. Depending on the vendor you select to supply your server, many of these decisions will already be made. It's still a good idea to include each component in case you need an emergency replacement.
The Total Cost column exists for one purpose: to itemize the cost of each part, multiplied by how many we will need to complete the server.
To make sure we account for the redundancy element of the spreadsheet, we strongly suggest inflating the number you use for the Count column, which will also increase the price automatically. This ensures we automatically include extra capacity in case something fails. If you would rather track this separately, add a Spare Count column to the spreadsheet instead.
We'll have discussions later as to failure rates of different types of hardware, which will influence how many excess components to allocate. Don't worry about that for now.
It's also a very good idea to include a sum of the Total Cost column, so we get an aggregate cost estimate for the whole server. To do that with our spreadsheet example, keep in mind that the Total Cost column is listed as column F. To add a Sum Total to your spreadsheet, go to the first empty row below the parts in column F and enter the formula =SUM(F2:F12). If you've added more columns, substitute for column F whichever column now represents the Total Cost. Likewise, if you have more than 13 rows of different parts, move the summary price to a row below them and extend the range in the formula to match.
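The same Total Cost and Sum Total arithmetic can be sketched outside a spreadsheet. The Python snippet below is only an illustration: the prices and counts are made-up placeholders, not recommendations, and the function names are our own.

```python
# Hypothetical parts list: (type, unit price in dollars, count including spares).
# Every price and count here is a placeholder for illustration only.
parts = [
    ("Chassis", 1500, 1),
    ('Hard Drive (2.5")', 250, 12),
    ("Network Card", 120, 2),
    ("Power Supply", 200, 2),
    ("RAID Controller", 600, 2),
]


def total_cost(price, count):
    """Cost of one spreadsheet row: unit price multiplied by count."""
    return price * count


def sum_total(rows):
    """Aggregate cost of every row, like a SUM over the Total Cost column."""
    return sum(total_cost(price, count) for _, price, count in rows)


for name, price, count in parts:
    print(f"{name:20} {total_cost(price, count):>8}")
print(f"{'Sum Total':20} {sum_total(parts):>8}")
```

Inflating a count to cover spares raises both the row total and the sum, which is exactly the behavior we want from the spreadsheet.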
There are a lot of spreadsheet options available. Many corporations supply a copy of Microsoft Excel. However, if this is not the case, there are many alternatives as follows:
All of these options are free to use and popular enough that support and documentation are readily available.
IOPS stands for Input/Output Operations Per Second. Essentially, this describes how many operations a device can perform per second before it should be considered saturated. If a device is saturated, further requests must wait until the device has spare bandwidth. On a server overwhelmed with requests, these waits can add up to seconds, minutes, or even hours of delayed results.
Depending on application timeout settings and user patience, a device with low IOPS appears as a bottleneck that reduces both system responsiveness and the perception of quality. A database with insufficient IOPS to service queries in a timely manner is unavailable for all intents and purposes. It doesn't matter if PostgreSQL is still available and serving results in this scenario, as its availability has already suffered. We are trying to build a highly-available database. To do so, we need to build a server with enough performance to survive daily operation. In addition, we must overprovision for unexpected surges in popularity, and account for future storage and throughput needs based on monthly increases in storage utilization.
This process is more of a thought experiment. We will present some very rough estimates of IO performance for many different disk types. For each, we should increment entries in our hardware spreadsheet based on perceived need.
The main things we will need for this process are numbers. During development, applications commonly have a goal, expected client count, table count, estimated growth rates, and so on. Even if we have to guess for many of these, they will all contribute to our IOPS requirements. Have these numbers ready, even if they're simply guesses.
We need to figure out how many operations per second we can expect. We can estimate this by using the following steps:
- Collect the number of simultaneous database connections. Start with the expected user count, and divide by 50.
- Obtain the average number of queries per page. If this is unavailable, use 10.
- Count the number of tables used in those queries. If this is unavailable, use three.
- Multiply these numbers together, and double the result. Then multiply the total by eight.
- Increment the Count column in our hardware spreadsheet for one or more of the following, and round up:
- For 3.5" hard drives, divide by 200
- For 2.5" hard drives, divide by 150
- For SSDs, divide by 50,000, then add two
- Add 10 percent to any count greater than 0 and then round up.
Wow, that's a lot of work! There's a reason for everything, of course.
In the initial three steps, we're trying to figure out how many operations might touch an object on disk. For every user that's actively loading a page, for every query in that page, and for every table in that query, that's a potential disk read or write.
We double that number to account for the fact we're estimating all of this. It's a common engineering trick to double or triple calculations to absorb unexpected capacity, variance in materials, and so on. We can use that same technique here.
Why did we suggest dividing the user count by 50 to get the connection total? Since we do not know the average query runtime, we assume 20 ms for each query. For every query that's executing, a connection is in use. Assuming full utilization, a single connection can complete up to 50 such queries per second (1000 ms / 20 ms). If you have a production system that can provide a better query runtime average, we suggest using that value instead.
But why do we then multiply by eight? In a worst (or best) case scenario, it's not uncommon for an application to double the amount of users or requests on a yearly basis. Doubled usage means doubled hardware needs. If requirements double in one year, we would need a server three times more powerful (1 + 2) than the original estimates to account for the second year. Another doubling would mean a server seven times better (1 + 2 + 4). CPUs, RAM, and storage are generally available as powers of two. Since it's fairly difficult to obtain storage seven times faster than what we already have, we multiply the total by eight.
That gives a total IOPS value roughly necessary for our database to immediately serve every request for the next three years, straight from the disk device. Several companies buy servers every three or four years as a balance between cost and capacity, so these estimates are based on that assumption.
In the next step, we get a rough estimate of the number of disks necessary to serve the required IOPS. Our numbers in these steps are based on hard drive performance. A 15,000 RPM hard drive can serve, under ideal conditions, roughly 200 operations per second. Likewise, a 10,000 RPM drive can provide about 150 operations per second. Current SSDs at the time of writing commonly reach 200,000-300,000 IOPS, and some even regularly eclipse a cool million. However, because they are so fast, we need far fewer of them, and thus the risk is not as evenly distributed. We artificially increase the number of these drives because, again, we are erring toward availability.
Finally, we add a few extra devices for spares that will go in a closet somewhere, just in case one or more drives fail. This also insulates us from the rare event that hardware is discontinued or otherwise difficult to obtain.
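The steps in this recipe can be sketched as a small Python calculator. This is a rough illustration under the recipe's own assumptions (20 ms per query, 200 and 150 IOPS for the two hard drive sizes, 50,000 IOPS plus two extra devices for SSDs); the function names and the 5,000-user input are ours, chosen only for the example.

```python
def ceil_div(a, b):
    """Integer division that rounds up, avoiding floating-point surprises."""
    return -(-a // b)


def required_iops(users, queries_per_page=10, tables_per_query=3):
    """Worst-case IOPS estimate following the steps in this recipe."""
    connections = ceil_div(users, 50)       # assumes 20 ms per query
    touches = connections * queries_per_page * tables_per_query
    doubled = touches * 2                   # margin for all the guesswork
    return doubled * 8                      # room for three years of growth


def drive_counts(iops):
    """Drives needed per type, plus 10 percent for spares, rounded up."""
    base = {
        '3.5" hard drive': ceil_div(iops, 200),
        '2.5" hard drive': ceil_div(iops, 150),
        "SSD": ceil_div(iops, 50_000) + 2,
    }
    return {kind: n + ceil_div(n, 10) for kind, n in base.items()}


iops = required_iops(5_000)  # 5,000 users is an arbitrary example input
print(iops, drive_counts(iops))
```

Integer rounding is used throughout so that a count such as 240 plus 10 percent lands on 264 rather than being pushed to 265 by floating-point error.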
Figuring out the number of IOPS we need and the devices involved is only part of the story.
Sometimes these large lists of calculations make more sense if we see them in practice. So let's make the assumption that 20,000 users will use our application each second. This is how that would look:
- 20000 / 50 = 400
- Default queries per page = 10
- Default tables per query = 3
- 400 * 10 * 3 * 2 = 24000
- 24000 * 8 = 192000
- 192000 IOPS in drives:
- 3.5" drives: 192000 / 200 = 960
- 2.5" drives: 192000 / 150 = 1280
- SSDs: 2 + (192000 / 50000) = 5.84 ~ 6
- Add 10 percent:
- 3.5" drives: 960 + 96 = 1056
- 2.5" drives: 1280 + 128 = 1408
- SSDs: 6 + 0.6 = 6.6 ~ 7
We are not taking space into account either, which would also increase our SSD count. We will be discussing capacity soon.
Our calculations always assume worst-case scenarios. This is both expensive and in many cases, overzealous. We ignore RAM caching of disk blocks, we don't account for application frontend caches, and the PostgreSQL shared buffers are also not included.
Why? Crashes are always a concern. If a database crashes, buffers are forfeit. If the application frontend cache gets emptied or has problems, reads will be served directly from the database. Until caches are rebuilt, query results can be multiple orders of magnitude slower than normal for minutes or hours. We will discuss methods of circumventing these effects, but these IOPS numbers give us a baseline.
The number of necessary IOPS, and hence the disk requirements, is subject to risk evaluation and cost-benefit analysis. Deciding between 100 percent coverage and an acceptable fraction is a careful balancing act. Feel free to reduce these numbers; just consider the cost of an outage as part of the total. If a delay is considered standard operating procedure, fractions up to 50 percent are relatively low risk. If possible, try to run tests for an ultimate decision before purchase.
Capacity planning for a database server involves a lot of variables. We must account for table count, user activity, compliance storage requirements, indexes, object bloat, maintenance, archival, and more. We may even have to consider application features that do not exist. New functionality often brings new tables, new storage standards, and archival needs. Planning done now may have little relevance to future usage.
So how do we produce functional estimates for disk space, with so many uncertain or fluctuating elements? Primarily, we want to avoid a scenario where we do not have enough space. Running out of disk space results in ignored queries at best, and a completely frozen and difficult to repair database at worst. Neither is an ingredient of a highly-available environment.
So we have a lower bound in this case, enough to avoid catastrophe, though it's in our best interest to allocate more than the bare minimum.
Since there are a lot of variables that contribute to the volume of storage we want, we need information about each of them. Gather as many data points as possible regarding things such as: largest expected tables and indexes, row counts per day, indexes per table, desired excess, and anything else imaginable. We'll use all of it.
This is much easier if we already have a database, and are now trying to ensure it is highly-available. Even if the database is only in development or staging environments at this moment, a few activity simulations at expected user counts should provide a basis for many of our numbers. No matter the case, revisit estimates as concrete details become available.
We can collect some of the information we want from PostgreSQL if we have a running instance already. If not, we can use baseline numbers. Follow these steps if you already have a PostgreSQL database available:
- Submit this query to get the amount of space used by all databases:
SELECT pg_size_pretty(sum(pg_database_size(oid))::BIGINT) FROM pg_database;
- Wait one week.
- Perform the preceding query again.
- Subtract the first reading from the second.
If we don't have an existing install and are working with a project that has yet to start development, we can substitute a few guesses instead. Without a running PostgreSQL instance, use the following assumptions:
- Our databases have a total size of 100 GB
- After one week, our install grew by 1.5 GB
Of course, you don't have to start with these rather arbitrary numbers for your own use case. Without a source database, we simply recommend starting with medium-size growth values to avoid underestimating. If our estimates are too low, the database could exceed our plans and require emergency resource allocation. That's not something we want in a highly-available cluster!
Next, we can calculate our growth needs for the next three years. Perform the following steps:
- Multiply the change in install size by four.
- Apply the following formula, where x is the most recent size of the databases, and y is the value from the previous step: x * (1 + y/x)^36.
- Multiply the previous result by two.
In the end, this is the magic of compounding interest. If we have an existing database installed, it can tell us not only how much space it currently consumes, but also how quickly it's currently growing. If not, we can start with a medium size and substitute a growth assumption that will cause the cumulative total to double in size every year. Remember, we begin by working with worst-case scenarios, and modify the numbers afterwards.
What if we don't need compounding interest because our expected growth is linear? It's always easier to start with too much space than to add more later. If you know your table count will rarely change, users will not increase in number, or data streams are relatively consistent, feel free to drop the compounded interest formula. Otherwise, we suggest using it anyway.
The PostgreSQL query we used takes advantage of the system catalog and known statistics regarding the database contents. The
pg_database_size function always returns the number of bytes a database uses, so we must use the
pg_size_pretty function to make it more human readable.
Once we know the size of the database instance and its growth rate, we can apply a simple compounding interest function to estimate the volume at any point in the future. This not only accounts for the current growth rate, but also incorporates additional accumulation caused by increases in clients, table counts, and other unspecified sources. It's extremely aggressive, since we take the weekly growth rate, translate that to a monthly rate, and apply the compounding monthly instead of yearly.
And then we use a standard engineering tactic and double the estimate, just in case. Using the provided values (a 100 GB database that grows at 1.5 GB per week), we would have an 815 GB database install in three years. With a system that large, we should allocate at least 1630 GB. If we simply added the 1.5 GB weekly growth rate for three years, the final tally would only be 334 GB, and we could get by with 668 GB.
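The compounding and linear estimates can be compared with a short Python sketch. It assumes, as this recipe does, that four weeks make a month and that three years is 36 months or 156 weeks; the function names are ours.

```python
def compounded_size_gb(current_gb, weekly_growth_gb, months=36):
    """Apply x * (1 + y/x)^months, where y is four weeks of growth."""
    monthly_growth = weekly_growth_gb * 4
    return current_gb * (1 + monthly_growth / current_gb) ** months


def linear_size_gb(current_gb, weekly_growth_gb, weeks=156):
    """Simply add the same weekly growth for every week."""
    return current_gb + weekly_growth_gb * weeks


compounded = round(compounded_size_gb(100, 1.5))
linear = round(linear_size_gb(100, 1.5))

# Double each estimate to get the suggested allocation.
print(compounded, compounded * 2)
print(linear, linear * 2)
```

Substituting your own current size and weekly growth shows quickly how far apart the aggressive and conservative estimates are for your system.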
Don't let our formulas define your only path. Let's explore how they apply in a real-world situation, and how we can modify them to better fit our systems.
There are quite a few very large databases using PostgreSQL. Whether or not they have thousands of tables and indexes, billions of rows, or handle billions of queries per day, statistics help us plan for the future. Let's apply the previous steps to an example database that actually exists:
- The database is currently 875 GB
- The database was 865 GB last week
- The database grows by 10 GB per week
- Thus, the database grows by 40 GB every four weeks
- Using the formula we discussed in step two of this recipe, the numbers become: 875 * (1 + 40/875)^36 = 4374 GB
- Doubled, this is 8748 GB
Keep in mind that this estimation technique may grossly exaggerate the necessary space. If we take the existing 40 GB monthly growth rate, the database would only be 2315 GB in three years. Of course, 2.3 TB is still a very large database; it's just half as large as our estimate.
We already mentioned that the growth curve used here is extremely aggressive. We can't risk ever running out of space in a production database and still consider ourselves highly-available. However, there is probably a safe position between the current growth rate of the database, and the compounded estimate, especially since we are doubling the allocation anyway.
In the preceding real-world example, the database is likely to have a size between 2315 GB and 4374 GB. If we split the difference, that's 3345 GB. Furthermore, we don't necessarily have to double that number if we're comfortable having a disk device that's 70 percent full three years from now instead of 50 percent. With that in mind, we would probably be safe with 5 TB of space instead of 9 TB. That's a vast saving if we're willing to make those assumptions.
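The adjustment just described (split the difference, then size the device for a target fill level) can be written as a small Python helper. The function name and the default 70 percent target are only assumptions for illustration.

```python
def adjusted_allocation_gb(linear_gb, compounded_gb, target_fill=0.70):
    """Midpoint of the two estimates, scaled so it fills target_fill of the device."""
    midpoint = (linear_gb + compounded_gb) / 2
    return midpoint / target_fill


# Estimates from the real-world example: 2315 GB linear, 4374 GB compounded.
needed = adjusted_allocation_gb(2315, 4374)
print(f"{needed / 1000:.1f} TB")
```

Lowering target_fill to 0.50 reproduces the original doubling rule applied to the midpoint instead of the compounded figure.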
At the beginning of this chapter, we created a hardware cost spreadsheet to estimate the total cost of a highly-available server. If we were following the chapter, our spreadsheet already accounts for the minimum number of devices necessary to provide the IOPS we want.
Suppose we needed 15,000 IOPS, and decided to use 2.5-inch drives. At roughly 150 IOPS each, that would require 100 drives. Even at only 300 GB each, that's 30 TB of total available space. Yet the case for SSDs is the opposite. For our previous example, we would need at least five 1 TB SSDs, or one very large PCIe SSD, to provide 5 TB of space for the adjusted sample.
Whichever solution we finally choose, we can take the advice from every section so far. At this point, the spreadsheet should have a device count that should satisfy most, if not all, of our space and IOPS requirements.
RAID stands for Redundant Array of Independent (or Inexpensive) Disks, and often requires a separate controller card for management. The primary purpose of a RAID is to combine several physical devices into a single logical unit for the sake of redundancy and performance.
This is especially relevant to our interests. Carnegie Mellon University published a study in 2007 on hard drive failure rates. They found that hard drives fail at about 3 percent per year. Furthermore, they found that drive type and interface contributed little to disk longevity, and that hard drives do not reflect a tendency to fail early, as was commonly accepted. These findings were largely corroborated by a parallel study released the same year by Google.
What does this mean? For our purposes in building a highly-available server, it means hard drives should be looked at with great disdain. Larger databases will depend on tens or hundreds of hard drives in order to represent several terabytes of data. With a 3 percent failure rate per year, a 100-drive array would lose roughly nine devices after three years.
This is the primary reason that all of our calculations regarding disk devices automatically assume a 10 percent excess inventory allotment. If a drive fails, we need an immediate replacement. Vendors are not always capable of delivering a new drive quickly enough. Having a spare on hand, ideally at the hosting facility or in the server itself, helps ensure continuous uptime.
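The failure arithmetic is easy to check with a few lines of Python. The first function multiplies the rate out linearly; the second is a variant of our own that assumes each surviving drive faces the same independent risk every year. Neither is a substitute for vendor or field data.

```python
def expected_failures(drives, annual_rate=0.03, years=3):
    """Linear estimate: the same share of the array fails each year."""
    return drives * annual_rate * years


def expected_failures_compounded(drives, annual_rate=0.03, years=3):
    """Variant: only drives that survived so far can fail in later years."""
    survival = (1 - annual_rate) ** years
    return drives * (1 - survival)


print(expected_failures(100))
print(round(expected_failures_compounded(100), 2))
```

Both approaches land close to nine drives out of 100 over three years, which is why a 10 percent spare allotment is a reasonable starting point.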
So how does RAID figure into this scenario? If we hosted our database on several bare hard drives, knowing that around 10 percent of these drives will fail in three years, outages would be inevitable. What we want is an abstraction layer, one that can present any amount of hard drives as a single whole, keeping reserves for drive errors, handling checksums for integrity, and mirroring for redundancy.
RAID provides all of that in several convenient configurations. Good controller cards often include copious amounts of cache and other management capabilities. Instead of manually assigning dozens of drives, split them into several usable array allocations that reflect much lower operational risk.
Knowing all of this, databases have special needs when it comes to RAID and the performance characteristics associated with each RAID type. Now we will explore the selection criteria for our database, and how to simplify the process.
That was a long introduction, wasn't it? Well, we also strongly suggest taking a look at the Having enough IOPS and Sizing storage recipes before continuing. Make sure the hardware spreadsheet has a drive count for the type of drives going into the server we're designing. If we're using PCIe instead of standard SSD drives, this section can be skipped.
Only a few RAID levels matter in a database context. Perform these steps to decide which one is right for this server:
- If this is an OLTP (Online Transaction Processing) database primarily for handling very high speed queries, use RAID level 1+0
- If this is a non-critical development or staging system, use RAID level 5
- If this is a non-critical OLAP (Online Analytic Processing) reporting system, use RAID level 5
- If this is a critical OLAP reporting system, use RAID level 6
- If this is a long-term storage OLAP warehouse, use RAID level 6
We made a lot of snap decisions here. There are quite a few RAID levels that we simply ignored, so there should be some discussion regarding the reasoning we used.
Let's begin with RAID level 0. Level 0 stripes data across all disks at once. It's certainly convenient, but a single drive failure will lose all stored information in the array. What about RAID level 1? Level 1 acts as a full mirror of all data stored. For every set of drives, a second set of drives has an exact copy. If a drive fails in one set, the second set is still available. However, if that set also experiences any failure, all data is lost.
When we talk about RAID 1+0, we actually combine the mirroring capability of RAID 1 with the striping of RAID 0. How? Take a look at the following diagram for six disks:
In this RAID 1+0, we have three sets, each consisting of two disks. Each of the two disks mirror each other, and the data is striped across all three sets. We could lose a disk from each set and still have all of our data. We only have a problem if we lose two disks from the same set, since they mirror each other. Overall, this is the most robust RAID level available, and the most commonly used for OLTP systems.
RAID level 5 and 6 take a different approach. Again, let's look at six drives and see a very simplified view of how RAID 5 would operate in that situation:
The solid line shows that the data is spread across all six drives. The dotted line is the parity information. If a drive fails and the block can't be read directly from the necessary location, a RAID 5 will use the remaining data and parity information from the other drives to reconstruct the missing data. The only real difference between a RAID 5 and a RAID 6 is that a RAID 6 contains a second parity line, so it can survive up to two simultaneous drive failures, where a RAID 5 can survive only one. In either case, the array operates in a degraded manner until the failed drives are replaced and rebuilt.
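To make the parity idea concrete, here is a minimal Python sketch of single-parity reconstruction using XOR, the operation RAID 5 parity is built on. It is heavily simplified: real controllers rotate parity across drives and work on much larger stripes, and the block contents here are arbitrary sample bytes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Five data blocks plus one parity block, as in a six-drive stripe.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]
parity = xor_blocks(data)

# Simulate losing the third drive, then rebuild its block from the survivors.
survivors = data[:2] + data[3:] + [parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[2])
```

Because XOR-ing every block in a stripe with its parity yields all zeroes, any single missing block equals the XOR of everything that remains.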
Using a RAID 5 or 6 offers more protection than a RAID 0, with less cost than a RAID 1+0, which requires double the amount of desired space. We selected these for non-critical OLAP systems because they usually need space over performance, and are not as sensitive to immediate availability pressures as an OLTP system.
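The space trade-off between these levels can be sketched in Python. The calculation assumes identical drives and ignores controller overhead and hot spares; the function name and the 300 GB drive size are only illustrative.

```python
def usable_capacity_gb(level, drives, drive_gb):
    """Usable space for a RAID level built from identical drives."""
    if level == "0":
        data_drives = drives          # striping only, no redundancy
    elif level == "1+0":
        data_drives = drives // 2     # every drive has a mirror
    elif level == "5":
        data_drives = drives - 1      # one drive's worth of parity
    elif level == "6":
        data_drives = drives - 2      # two drives' worth of parity
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    return data_drives * drive_gb


for level in ("0", "1+0", "5", "6"):
    print(level, usable_capacity_gb(level, drives=6, drive_gb=300))
```

With six drives, RAID 1+0 gives up half of the raw space, while RAID 5 and RAID 6 give up only one or two drives' worth.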
We mentioned controller cards earlier, and noted that they also offer on-board cache. RAID has been around for a long time, and though disks are getting much larger, they haven't experienced an equivalent increase in speed. In scenarios that use RAID 5 or 6, writes can also be slowed since each write must be committed to several devices simultaneously in the form of parity.
To combat this, RAID controllers allow configuration of the cache itself, to buffer writes in favor of reads, or vice versa. Don't be afraid to adjust this and run tests to determine the best cache mix. If all else fails, start with 100 percent of the cache allocated to writes, as they are the most in need of caching. Keep a close eye on write performance, and give it priority. Generally, the OS cache does a better job of caching reads, and has much more memory available to do so.
In selecting a CPU for our server, we have a lot to consider. At the time of writing, the current trend among processors in every space, including mobile, is toward multiple cores per chip. CPU manufacturers have found that providing a large number of smaller processing units spreads workload horizontally for better overall scalability.
As users of PostgreSQL, this benefits us tremendously. PostgreSQL is based on processes instead of threads. This means each connected client is assigned to a process that can use a CPU core when available. The host operating system can perform such allocations without any input from the database software. Motherboards have limited space, so we need more cores on the same limited real estate, which means more simultaneously active database clients.
Once again, our discussion veers toward capacity planning for a three or four year cycle. Limited processing capability leads to slow or delayed queries, or a database that is incapable of adequately handling increasing amounts of simultaneous users. Yet simply choosing the fastest CPU with the most cores and filling the motherboard can be a staggering waste of resources. So how, then, do we know what to buy?
That's what we're here to figure out.
Luckily, there are only really two manufacturers that produce commodity server-class CPUs. Furthermore, each vendor has a line of CPUs designed specifically for server use. AMD and Intel both provide similar price-to-performance curves, but that's where the comparison ends.
At the time of writing, Intel Xeon CPUs benchmark significantly higher than equivalently priced AMD Opterons. This is true for both mid-range and high-end processors. Before going through this recipe, it would be a good idea to visit AnandTech, Tom's Hardware, Intel, and AMD, just to get a basic idea of the landscape. There are a lot of benchmarks that compare various models of CPUs, so don't take our word for it.
Because of this current performance disparity, we'll focus exclusively on Intel processors for now. This situation has changed in the past, and may do so again in the future.
We can collect some of the information we want from the database if we have one already. If we already have a PostgreSQL database available, we can execute a query to start our calculations. This works best if used at the most active time of day.
Execute this query as a superuser to get the count of simultaneous active users if you have PostgreSQL 9.2 or higher:
SELECT count(1) FROM pg_stat_activity WHERE state = 'active';
Use this query if you have an older version:
SELECT count(1) FROM pg_stat_activity WHERE current_query NOT LIKE '<IDLE>%';
If we don't have a PostgreSQL server, we need to make an educated guess. Use these steps to approximate:
- Work with the application developers to obtain a count of expected clients active per second.
- Divide the previous number by 50 to remain consistent with our 20 ms query assumption.
Once we have some idea of how many queries will be active simultaneously, we need to figure out the processor count. Follow these steps:
- If we already know how many disks will store our data, use this number. In the case of an SSD base, use 0.
- Subtract the previous number from our count of active users.
- Divide the previous result by two.
- Apply the following formula, where x is the value from the previous step: x * (1.4)^3.
Before we can even begin to decide on a processor count, we need a baseline. With a working PostgreSQL server to base our numbers on, we can just use the amount of existing users during a busy period. Without that, we need to guess. This guess can actually be pretty close, depending on how the application was targeted. If the intent is to service 1000 users per second, we should start there since that's the same assumption the company is using to buy application and web servers.
After that, we are applying a commonly accepted formula used by PostgreSQL administrators for a very long time. The ideal number of active connections is equal to twice the amount of available processor cores, plus the amount of disk spindles. Amusingly, the disk spindles increase the ideal number of connections because they contribute seek time, which forces the processor to wait for information. While a processor is waiting for input for one connection, the operating system may decide to lend the processor to another until the data is retrieved.
So, we apply that accepted formula in reverse. First, we subtract the number of spindles, and then divide by two to obtain how many CPU cores we should have for our expected workload.
Afterwards, we assume a 40 percent increase in active clients on a yearly basis, and increase the CPU core count accordingly for three years. Note that this is a very aggressive growth rate. If we have historical growth data available, or the company is expecting a different value, we should use that instead.
When purchasing CPUs, no matter how cores are distributed, the final total should be equal or greater than the number we calculated. If it isn't, the application may require more aggressive caching than expected, or we may need to horizontally scale the database. We're not ready to introduce that yet, but keep it in mind for later.
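The reverse formula and the growth adjustment can be sketched as a short calculation. This is only an illustration in Python, not part of the recipe itself; the function name `estimate_cores`, its parameter names, the rounding up to a whole core, and the guard against a result below one core are all our own assumptions.

```python
import math


def estimate_cores(active_clients, spindles=0, growth=1.4, years=3):
    """Reverse the 'connections = 2 * cores + spindles' rule of thumb.

    active_clients: simultaneous queries expected during a busy period
    spindles:       rotating disks holding the data (0 for SSD storage)
    growth, years:  assumed yearly growth in clients, compounded
    """
    cores_today = (active_clients - spindles) / 2
    # Assumption: never report less than one core, even when the number
    # of spindles exceeds the number of active clients.
    cores_today = max(cores_today, 1)
    return math.ceil(cores_today * growth ** years)


# 1000 users per second at 20 ms per query is 1000 / 50 = 20 active clients.
print(estimate_cores(1000 / 50))
```

If historical growth data is available, passing a different `growth` value replaces the aggressive 40 percent default.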
The processor count is only part of the story. Intel CPUs have a few added elements we need to consider.
Newer generations of Intel processors often provide a feature called hyperthreading, which splits each physical processor core into two virtual cores. Historically, this was not well received, as benchmarks often illustrated performance degradation when the feature was enabled.
Since the introduction of Nehalem-based architecture in 2008, this is no longer the case. While doubling the processor count does not result in a doubling of throughput, we've run several tests that show up to 40 percent improvement over using physical cores alone. This may not be universal, but it does apply to PostgreSQL performance tests. What this means is that the commonly accepted formula for determining ideal connection count requires modification.
Current advice is to only multiply the physical core count by two. Assuming a 40 percent increase by enabling hyperthreading, the new formula becomes: 2 * 1.4 * CPUs + spindles. With that in mind, if we wanted to serve 1000 connections per second, and used SSDs to host our data, our minimum CPU count would be: 1000 / 50 / 1.4, or 14. Half of that is seven, but no CPU has seven physical cores, so we would need at least eight. If we used the physical cores alone for our calculation, we would need 10.
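As a rough check of the arithmetic above, the modified formula can be inverted in a few lines of Python. The function name `physical_cores` and the `ht_gain` parameter are hypothetical names chosen for this sketch, and rounding up to a whole core is our assumption.

```python
import math


def physical_cores(active_clients, spindles=0, ht_gain=1.4):
    """Invert 'connections = 2 * ht_gain * cores + spindles'.

    ht_gain=1.4 models the 40 percent hyperthreading benefit quoted in the
    text; ht_gain=1.0 gives the classic formula for physical cores only.
    """
    return math.ceil((active_clients - spindles) / (2 * ht_gain))


active = 1000 / 50          # 20 simultaneous clients
print(physical_cores(active))               # with hyperthreading
print(physical_cores(active, ht_gain=1.0))  # physical cores only
```

Rounding up is only a stand-in for the real constraint, which is the set of core counts that processors are actually sold with.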
Recent Intel processors also have something called Turbo Boost. Some vendor motherboards disable this by default. Make sure to go through BIOS settings before performing acceptability tests, as turbo mode can provide up to 25 percent better performance in isolated cases.
This is possible because the maximum speed of the core itself is increased when resources are available. A 2.6 GHz core might operate temporarily at 3.0 GHz. For queries that are dependent on nested loops or other CPU-intensive operations, this can drastically reduce query execution times.
Intel family chips often have low voltage versions of their high performance offerings. While these processors require up to 30 percent less electricity, they also run up to 25 percent slower. Low power name designations are not always consistent, so when choosing a processor, make sure to compare specifications of all similarly named chips.
Beware of accidentally choosing a low power chip meant for a high performance database. However, these chips may be ideal for warehouse or reporting database use, since those systems are not meant for high throughput or vast amounts of simultaneous users. They often cost less than their high-performance counterparts, making them perfect for systems expecting low utilization.
The primary focus when selecting memory for a highly-available system is stability. It's no accident that most, if not all, server-class RAM is of the error-correcting variety. There are a few other things to consider, which may not appear obvious at first glance.
Due to the multi-core nature of our CPUs, the amount of addressable memory may depend on the core count. In addition, speed, latency, and parity are all considerations. We also must consider the number of channels reported by each CPU; failing to match this with an equal count of memory sticks will drastically reduce performance.
Let's make our server fast and stable by considering our memory options.
Some of the decisions we will make depend on the capabilities of the CPU. Make sure to read through the Picking a processor recipe before continuing. If we have a PostgreSQL database available, there's also a query that can prepare us for selecting the most advantageous count of memory modules. It's also a very good idea to complete the Sizing storage recipe to get a better idea for choosing an amount of memory.
We can collect some of the information we want from PostgreSQL if we have an install already. Follow these steps if there's an existing database install that we can use:
- Execute the following query to obtain the size of all databases in the instance:
SELECT pg_size_pretty(sum(pg_database_size(oid))::BIGINT) FROM pg_database;
- Multiply the result by eight.
If we don't have an existing database, we should use a size estimate of the database install after three years. Refer to the Sizing storage recipe to obtain this estimate. Then, perform the following steps:
- Divide the current or estimated database storage size by ten to obtain the minimum amount of memory.
- Multiply our ideal CPU chip count by four to get the memory module count.
- Divide the minimum memory amount by the module count to get the minimum module size.
- Round up to the nearest available memory module size.
The important part of this recipe is starting with a viable estimate of the database size. Since a lack of RAM won't cause the database to crash or operate improperly, we can use looser guidelines to obtain this number. Hence, three years down the road, an existing database install could be eight times larger than its current size.
Why do we then divide that number by ten? Our goal here is to maximize the benefit of the OS-level cache, which will consume a majority of our RAM. This estimate gives us a value that is ten times smaller than the space our database consumes. At this scale, data that is frequently fetched from disk is likely to be served from memory instead. The alternative is read latency due to insufficient memory for disk caching.
Most current CPUs are quad-channel, and thus operate best when the number of modules per processor is a multiple of four. Since we should have determined how many processor cores would be ideal for our system in the Picking a processor recipe, we automatically know the most efficient memory module count. Why do we multiply by four, regardless of how many memory channels the CPU has? Adding more memory modules is not wasted on chips with fewer channels, and provides a possible upgrade path.
Dividing the memory amount by the module count gives our minimum module size. RAM comes in many dimensions, and our calculation is not likely to match any of the available dimensions for purchase, so we need to round up. Why not round down? The operating system will utilize all available RAM to cache and buffer important data. Unless the greater amount is extremely expensive in comparison, any excess memory will not be wasted.
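The whole memory calculation fits in a small Python sketch. The list of module sizes is an assumed catalog rather than a vendor listing, and the names `memory_plan` and `MODULE_SIZES_GB` are ours; treat it as a worksheet aid, not a definitive sizing tool.

```python
import bisect

# Assumed catalog of module sizes in GB; check what the vendor really offers.
MODULE_SIZES_GB = [4, 8, 16, 32, 64, 128]


def memory_plan(db_size_gb, cpu_chips, existing_install=True):
    """Return (module_count, module_size_gb, total_gb) for the recipe."""
    # An existing install is projected to grow eightfold over three years;
    # otherwise db_size_gb is assumed to already be the three-year estimate.
    projected_gb = db_size_gb * 8 if existing_install else db_size_gb
    min_memory_gb = projected_gb / 10
    module_count = cpu_chips * 4
    min_module_gb = min_memory_gb / module_count
    # Round up to the nearest size in the catalog.
    index = bisect.bisect_left(MODULE_SIZES_GB, min_module_gb)
    if index == len(MODULE_SIZES_GB):
        raise ValueError("no single module is large enough; add modules")
    size = MODULE_SIZES_GB[index]
    return module_count, size, module_count * size


# A 100 GB database today, hosted on a server with two CPU chips.
print(memory_plan(100, cpu_chips=2))
```

Here a 100 GB database projects to 800 GB, which calls for at least 80 GB of RAM spread over eight modules, so each module is rounded up from 10 GB to the next catalog size.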
We didn't focus on memory speed, timings, or latency here. Timing and latency can affect performance, but our primary focus is stability. We're always free to order faster or better memory as our budget allows.
Memory speed, on the other hand, is a more visible factor. Every memory speed works with a multiplier to match the highest compatible motherboard bus speed. This directly controls how quickly the CPU can utilize available RAM. Before buying memory, research the stated clock speed and try to match it with one of the faster settings compatible with both the CPU and motherboard.
For example, DDR3-1600 is twice as fast as DDR3-800 since it operates at 200 MHz, as opposed to 100 MHz. Database benchmarks would be vastly different between these two memory speeds, even with the same CPU. Fast memory means PostgreSQL can make more immediate use of cached data, and produce results more quickly.
The network card enables the database server to exchange data with the outside world. This includes far more than web servers, spreadsheets, loading jobs, application servers, and other data consumers. The database server is part of a large continuum of activity, much of which will center around maintenance, management, and even filesystem availability.
Little of this other traffic involves PostgreSQL directly. Much happens in the background regardless of the database and its current workload. Yet even one mishandled network packet across an otherwise normal driver can render the entire server invisible to the outside world, or in extreme cases, even lead to a system panic and subsequent shutdown. On a busy database server, network cards can handle several terabytes of traffic on a daily basis; the margin of error for such a critical piece of hardware is exceptionally slim.
What's more, network bandwidth can easily be saturated by an aggressive backup strategy, which is something critical to a highly-available database. For PostgreSQL systems utilizing streaming replication or WAL archival, that traffic contributes quite a bit of bandwidth to the overall picture. If our backups are delayed, or replicas sit idle waiting for network packets, our exposure to risk is high indeed.
That's not to say everything is doom and gloom! With the right network setup and accompanying hardware, there should be more than enough room for any and all traffic our database server needs. Let's explore all the copious options for connecting our database to the outside world, and making sure it stays there.
This is one of those times it pays to do research. At the time of writing, the current high-speed network standards include 1 Gb/s, 10 Gb/s, 40 Gb/s, and even 100 Gb/s Ethernet. However, 40 Gb/s network cards are still extremely rare, and 100 Gb/s is generally reserved for fiber-based switches and data center use.
This means we will be covering 1 Gb/s and 10 Gb/s interfaces. While we will do our best to outline all of the important aspects of these technologies to simplify the process, we strongly encourage using the Internet to validate current availability and performance characteristics.
Let's begin with a few basic calculations. Look at these following numbers that represent an estimate of interface speed after accounting for overhead:
- 1,000 Mb/s * (1 B / 10 b) = 100 MB/s
- 10,000 Mb/s * (1 B / 10 b) = 1,000 MB/s
Next, consider how many ways this will be distributed. If we have an existing PostgreSQL setup, follow these steps:
- Execute the following query to determine the number of existing replicas:
SELECT count(1)+1 AS streams FROM pg_stat_replication;
- Multiply streams by 160 for maximum MB/s needed by replication streams.
- Execute the following queries together in a psql connection during a busy time of day on a production database:
SELECT SUM(pg_stat_get_db_tuples_fetched(oid)) AS count1 FROM pg_database;
SELECT pg_sleep(1);
SELECT SUM(pg_stat_get_db_tuples_fetched(oid)) AS count2 FROM pg_database;
- Subtract the results of count1 from count2 for the number of rows fetched from the database per second.
- Divide the number of rows per second by 10,000 for MB/s used by PostgreSQL connections.
- Add MB/s for streams to MB/s for connections.
Without an existing database, follow these steps for some basic bandwidth numbers:
- Multiply the desired number of PostgreSQL replicas by 160 for the maximum MB/s needed by replication streams.
- Assume one WAL stream for an offsite disaster recovery database copy.
- Start with at least one live hot streaming standby copy.
- Include any additional database mirrors.
- Estimate the active client count as discussed in the Picking a Processor recipe.
- Multiply the active client count estimate by 5 for MB/s used by PostgreSQL connections.
- Add MB/s for streams to MB/s for connections.
No matter which checklist we follow, we should double the final tally.
If we have an existing database, there is a wealth of statistical information at our fingertips. The first query we ran gave us a slightly inflated count of copies of our database. For each copy, data must be transferred from the database to another server. This data is based on PostgreSQL WAL output, and these files are 16 MB each. A busy server can produce more than ten of these per second, so we multiply the count of streams by 160 to produce an aggressive amount of network overhead used by database replicas. As usual, this may be overzealous; it's always best to observe an actual system to measure maximum WAL segments generated during heavy write loads.
In PostgreSQL 9.2 and higher, database replicas can stream from other database replicas. This means network traffic can be distributed better among streaming clients, reducing network bandwidth pressure on production systems. PostgreSQL 9.2 also allows direct backup of streaming replicas. This means one or two replicas may be the most the production database ever needs to supply with WAL traffic.
For the next set of numbers, we need to know how much data database connections commonly retrieve. PostgreSQL tracks the number of table rows fetched, but it's a cumulative total. By waiting until a busy time of day and asking the database how many rows have been fetched before and after a one-second wait, we know how many rows are fetched per second.
However, we still don't know how many bytes these rows consume. A good estimate of this is 100 bytes per row. Then we only have to multiply the number of rows by 100 to find the amount of bandwidth we would need. So why do we divide by 10,000? What's 10,000 multiplied by 100? One million. On dividing by 10,000, we produce the number of megabytes per second that those tuple fetches probably used.
If an average of 100 bytes per row isn't good enough, we can connect to one of our primary databases and ask what the average is. Use this query:
SELECT sum(pg_relation_size(oid)) / sum(reltuples)
  FROM pg_class
 WHERE relkind = 'r';
By adding the amount of streaming traffic to the amount of connection traffic, we have a good, if slightly inflated, idea of how much bandwidth the server needs.
Without a working database to go by, we need to use a few guesses instead. Luckily, the number of streams for a reliable database infrastructure starts at two: one for a live standby, and one for an offsite archive. Each additional desired mirror should increase this total. Again, we multiply by 160 to obtain the maximum megabytes per second that all these streams are likely to require.
The amount of bandwidth client connections use is slightly harder to estimate. However, if we worked through previous chapter sections, we have a CPU estimate, which also tells us the maximum number of database clients that the server can reliably support. If we take that value and multiply by five, that provides a rough value in megabytes per second as well.
Again, we just add those two totals together, and we know the minimum speed of our network.
Finally, we multiply the final tally by two, to account for any unknown maintenance, backup, and filesystem synchronization overhead.
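Both checklists reduce to the same arithmetic, which the following Python sketch illustrates. The constants come from the figures used in this recipe; the function names and the sample inputs are hypothetical and chosen only for demonstration.

```python
WAL_MB_PER_STREAM = 160   # ten 16 MB WAL segments per second, per stream
BYTES_PER_ROW = 100       # rough average row width used in the text
MB_PER_CLIENT = 5         # rule of thumb when no database exists yet


def bandwidth_existing(streams, rows_per_second):
    """Estimate MB/s from measured replication streams and row fetches."""
    replication = streams * WAL_MB_PER_STREAM
    connections = rows_per_second * BYTES_PER_ROW / 1_000_000
    return 2 * (replication + connections)   # doubled for overhead


def bandwidth_planned(streams, active_clients):
    """Estimate MB/s when only a design exists."""
    replication = streams * WAL_MB_PER_STREAM
    connections = active_clients * MB_PER_CLIENT
    return 2 * (replication + connections)


print(bandwidth_existing(streams=2, rows_per_second=250_000))
print(bandwidth_planned(streams=2, active_clients=20))
```

If the measured average row width differs from 100 bytes, changing `BYTES_PER_ROW` is the only adjustment needed.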
Besides producing an estimate through some simple calculations, we also want to make note of a few other networking details.
This may be easier to visualize with a real example. Let's start with a very active database that has one streaming replica, and one offsite archive. Furthermore, connected clients regularly fetch five hundred thousand rows per second. Now, let's go through our steps:
- 2 * 160 = 320 MB/s.
- 500,000 / 10,000 = 50 MB/s.
- 320 + 50 = 370 MB/s.
- 370 * 2 = 740 MB/s.
That's a very high value! A 1 Gb/s interface can only supply 100 MB/s at most, so we would need eight of those to produce the necessary bandwidth. Yet a 10 Gb/s interface can supply 1000 MB/s, so it can easily handle 740 MB/s, and have room to spare. Would we rather have eight network cables coming out of our server, or one?
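The interface count follows directly from the usable throughput figures given at the start of this recipe. The sketch below is a minimal Python illustration; the dictionary keys and the function name `interfaces_needed` are our own labels.

```python
import math

# Usable throughput per interface after overhead, from the 10-bits-per-byte
# estimate used earlier in this recipe.
USABLE_MB_PER_S = {"1GbE": 1000 / 10, "10GbE": 10_000 / 10}


def interfaces_needed(required_mb_per_s, kind):
    """Count how many interfaces of one kind cover the required MB/s."""
    return math.ceil(required_mb_per_s / USABLE_MB_PER_S[kind])


required = (2 * 160 + 50) * 2     # 740 MB/s from the worked example
for kind in USABLE_MB_PER_S:
    print(kind, interfaces_needed(required, kind))
```

The count covers bandwidth only; it does not include the extra interfaces reserved for redundant switch connections.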
One of the first things this chapter suggested was to consider extra inventory. What we haven't really covered yet involves online backups. Most server-class motherboards include not just one, but two on-board network modules. Each module commonly provides four Ethernet interfaces.
Usually each interface is considered separate, and two interfaces from each module are connected to two switches in the data center. This allows server administrators to seamlessly perform maintenance on either switch without disrupting our network traffic. Furthermore, if a switch or network module fails, there's always a backup available.
In our working example, we would need eight 1 Gb/s interfaces to avoid experiencing network congestion. However, we've already used four of our eight available interfaces simply to satisfy basic server hosting requirements. That doesn't leave enough available capacity, and as a consequence, this server would experience a network bottleneck.
This would not be the case with a 10 Gb/s interface. Each of the interfaces connected to redundant switches can carry the entire network requirements of the server.
We suggested doing research on 1 Gb/s and 10 Gb/s network cards. Well, don't do too much. It's very likely that the infrastructure department already has a standard server profile for high-bandwidth systems. This is primarily due to the fact that 10 Gb/s is a very complicated standard compared to 1 Gb/s or lower. There are several different cable types available along with complementary network modules, one or more of which are probably already deployed in the data center.
Just make sure that the infrastructure department knows to allocate high-bandwidth resources if our calculations call for it.
- To read more about how 10-gigabit Ethernet works, please visit the following URL: http://en.wikipedia.org/wiki/10-gigabit_Ethernet
We have been working up to this for quite some time. None of our storage, memory, CPU, or network matters if we have nothing to plug all of it into.
This could have been a long section dedicated to properly weighing the pros and cons of selecting a motherboard manufacturer for maximum stability. It turns out that most server vendors have already done all the hard work in that regard.
In fact, few vendors even disclose many details about the motherboard in their servers outside of model documentation. We can't really read hundreds of pages of documentation about every potential server we would like to consider, so what is the alternative?
No matter where we decide to purchase our server, vendors will not sell, or even present, incompatible choices. If we approached this chapter as intended, we already have a long list of parts, counts, and necessary details to exclude potential offerings very quickly. These choices will often come in the form of drop-down lists for every component that the motherboard and chassis will accept.
The chassis will come later. For now, let's focus on CPU, RAM, RAID, and network compatibility.
Keep in mind that motherboards and the requisite case are almost exclusively a package deal. This means we can't keep an extra motherboard available in case of failure, unlike other swappable elements. This breaks our redundancy rule, but there are ways of circumventing that problem.
This is one of the times when the hardware spreadsheet will show its true usefulness. So, as long as we have been keeping track of our counts through each section, this segment of server selection will be much simpler. By this point, our spreadsheet should look something like this:
We don't care about the total cost for each part yet. It might be a good idea to create a separate tab or copy of the spreadsheet for each vendor we want to consider. This way, we can comparison shop. Also remember that the counts are inflated by at least one replacement in case of failure. So we want to look for two 10-core CPUs, eight 16 GB memory modules, and so on.
Now it's time to do some research. Follow these steps:
- Make a list of desired server vendors. This list may even be available from the infrastructure department, if our company has one.
- For each vendor, check their available 1U and 2U products.
- For each 1U or 2U server, remove from consideration any that can't fulfill minimum CPU requirements.
- Repeat for RAM.
- Repeat for RAID controller cards.
- Repeat for network interface cards.
- Fill in actual selections where appropriate to obtain unit prices.
- Make corrections to the spreadsheet.
While this is straightforward, it requires a lot of time. The amount of server variants available, even from a single vendor, can be staggering. This is one of the reasons we only consider 1U and 2U servers. The other is that 4U servers and larger are often designed for much different use patterns related to vertical scaling, incorporating more CPUs, hard drives, and even multiple concurrent motherboards.
For our purposes, that is simply too powerful. When purchasing servers with the explicit intention to obtain multiple, redundant, and compatible examples, this becomes more difficult as the cost and complexity of the servers increase.
Although we have reduced our sample size, there is still more work to do. When considering the compatible CPUs, if we want 10-core chips, and the motherboard only supports up to 8-core chips, we can remove that from consideration. This also applies to available memory slots and sizes. Yet there's an unwritten element to RAM: maximum amount. If the motherboard only supports up to 384 GB, and our earlier calculations show we may eventually want 512 GB, we can immediately cross it off our list.
Since RAID and network cards must be plugged directly into the motherboard or an expansion daughter card, it's the number of available slots that directly concerns us. We need at least two slots to hold both cards, a requirement that should drastically reduce the size of our list, especially in the case of 1U servers.
While doing this compatibility verification, it is difficult to ignore prices listed next to each choice, or the total price changing with each selection. We might as well take advantage of that and fill in the rest of the spreadsheet, and make a copy for each vendor or configuration. Some overall choices are likely to be better complete matches, or offer better future expandability, or better price points, so tracking all of this is beneficial.
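The elimination process described in this recipe amounts to filtering a list of candidates against our spreadsheet. The Python sketch below shows one way to express it; every server model in the catalog is made up for illustration, and the class and field names are hypothetical rather than taken from any vendor's data.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    rack_units: int
    max_cores_per_cpu: int
    max_ram_gb: int
    expansion_slots: int


@dataclass
class Requirements:
    cores_per_cpu: int = 10
    ram_gb: int = 512
    expansion_slots: int = 2   # one RAID controller plus one network card


def acceptable(server, req):
    """Apply the elimination steps: size, CPU, RAM, then card slots."""
    return (
        server.rack_units in (1, 2)
        and server.max_cores_per_cpu >= req.cores_per_cpu
        and server.max_ram_gb >= req.ram_gb
        and server.expansion_slots >= req.expansion_slots
    )


# Entirely made-up models, for illustration only.
catalog = [
    Server("model-a", 1, 8, 384, 1),
    Server("model-b", 2, 12, 768, 3),
    Server("model-c", 2, 10, 384, 4),
    Server("model-d", 4, 18, 2048, 8),
]
shortlist = [s.name for s in catalog if acceptable(s, Requirements())]
print(shortlist)
```

Keeping one such list per vendor mirrors the suggestion to keep a separate spreadsheet tab for each one.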
RAID controllers and network interfaces are somewhat special cases. Some servers, in order to reduce size, integrate these directly into the motherboard. This is especially true when it comes to network modules. If at all possible, try to resist integrated components.
If these fail, the entire server will require replacement. This makes it much more difficult and expensive to fulfill our redundancy requirement. Server-class motherboards without integrated network interfaces are rare, but we can use the integrated interfaces as our backup path if their minimum speed matches what we've configured.
For instance, if we want a 10 GbE card, and the motherboard has integrated a 10 GbE module, we can reduce the amount of excess cards on our spreadsheet by one. It's very likely the integrated version is of lower quality, but it can suffice until the bad card is replaced.
Redundancy doesn't have to be expensive.
To round out our hardware selection phase, it's time to decide just what kind of case to order from our server vendor. This is the final protective element that hosts the motherboard, drives, and power supplies necessary to keep everything running. And like always, we place heavy emphasis on redundancy.
For the purposes of this section, we will concentrate primarily on 1U and 2U rack-mounted servers. Why not 4U or larger? Our goal is to obtain at least two of everything, with similar or matching specifications in every possible scenario. The idea is to scale horizontally, in order to more easily replace a failed component or server. As the size of the chassis increases, its cost, complexity, and resource consumption also rise. In this delicate balancing act, it's safer to err toward two smaller systems with respectable capabilities than one giant server that's twice as powerful.
Since the server chassis and motherboard are generally a package deal, it's a good idea to refer to the Managing motherboards recipe. We will be using a very similar process to choose a server case. This time, we will focus on adequate room for hard drives and redundant power supplies.
Now it's time to do some more research. Follow these steps:
- Refer to the final list of servers from our motherboard selection.
- For our ideal count of active (not replacement) hard drives, remove any choice that doesn't have enough drive slots. Use this list if it's not immediately obvious:
  - Maximum 2.5" drives in a 2U server is 24
  - Maximum 3.5" drives in a 2U server is 8
  - Maximum 2.5" drives in a 1U server is 8
  - Maximum 3.5" drives in a 1U server is 4
- Remove from consideration any chassis that does not support dual power supplies. This should rarely happen in server-class systems.
- As the list dwindles, give higher priority to cases with more fans or lower average operating temperatures.
Our job here was much easier than considering motherboard constraints. This time, drives determine most of our decision.
Hot-swappable hard drives are slightly larger than their standard brethren, due to the swap enclosure. Yet cases exist that can hold up to 24 hot-swap drives across the front when stacked vertically. If we need that many storage devices, we save space by taking advantage of cases that can accommodate them. We also need to remember to reserve two drives for the operating system in a RAID-1, separate from our PostgreSQL storage. We can't diagnose problems on a server that can't boot.
Some cases reserve mounts inside, or at the rear, for operating-system drives. They are harder to replace, but make more room for storage dedicated to PostgreSQL. Here, operating system drives are treated as operating overhead without sacrificing case functionality.
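The drive-slot check can be written down as a tiny lookup. This Python sketch uses the bay limits from the steps in this recipe; the function name `chassis_fits` and the `os_in_front` flag, which models cases with separate internal or rear mounts for the operating system drives, are assumptions made for illustration.

```python
# Maximum hot-swap bays by (rack units, drive size in inches), taken from
# the limits listed in the steps of this recipe.
MAX_BAYS = {(2, 2.5): 24, (2, 3.5): 8, (1, 2.5): 8, (1, 3.5): 4}
OS_DRIVES = 2   # RAID-1 pair reserved for the operating system


def chassis_fits(rack_units, drive_inches, data_drives, os_in_front=True):
    """Check whether a chassis has enough bays for data plus OS drives."""
    needed = data_drives + (OS_DRIVES if os_in_front else 0)
    return needed <= MAX_BAYS[(rack_units, drive_inches)]


print(chassis_fits(2, 2.5, data_drives=20))
print(chassis_fits(1, 2.5, data_drives=8))
print(chassis_fits(1, 2.5, data_drives=8, os_in_front=False))
```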
If we need more drives than are available in any configuration, we should consider Direct Attached Storage (DAS), Network Attached Storage (NAS), or Storage Area Network (SAN). Some vendors supply drive extension cages specifically to provide more hot-swap bays for specific server models. While we want to conserve space when possible, these are relatively inexpensive and much smaller than an NAS or SAN if we haven't progressed to requiring such a device.
Regarding the dual power supplies, this is not negotiable. Many data centers provide two power rails per server rack. The intent is to provide two separate sources of power to the server in case one of the server's power supplies fails, or power is cut to one of the sources. Sometimes these power sources even have separate generators. We're not the only ones interested in redundancy; data centers want to avoid outages too.
The last, more optional element, involves investigating the case itself. Many server cases have several fans inside and along the rear, and as a consequence, are very loud. This won't matter when the server is in the data center, but the number of fans and the shape of the airflow will directly affect the server temperature. Higher temperatures decrease system stability. It's not uncommon for vendors to list maximum operating temperatures of each case, so try to gravitate toward the cooler ones if all else is equal.
We use the word vendor frequently, and there's a reason for that. Without going so far as to call bare cases and motherboards faulty, we consider them simply not stable enough for our use. That holds even though there are some great cases available that in many ways exceed the capabilities provided by established server providers.
We don't suggest the smaller vendors for a few reasons. Larger companies often have replacement policies for each server component, including the case and motherboard. Building a system ourselves may provide more satisfaction, but vendors presumably spend time testing for compatibility and failure conditions. They produce manuals hundreds of pages long detailing viable parts, configurations, and failure conditions of the entire unit.
However, one could just as easily argue that redundant servers increase failure tolerance, as there's always an available backup. Bare cases and motherboards are usually cheaper, and user-serviceable besides. That is a completely valid path, and if risk assessment suggests it's viable, give it a try. The advice we give is by no means set in stone.
SAN stands for Storage Area Network. Working in the industry, you may have encountered NAS (Network Attached Storage) as well. How exactly is that different, and how is it relevant to us?
It's subtle, but important. While both introduce networked storage, only a SAN grants direct block-level access, as if the allocation were raw, unformatted disk space. NAS systems operate one level higher, providing a fully formatted filesystem such as NFS or CIFS. This means our PostgreSQL database does not have direct control over the filesystem; locks, flushes, allocation, and read cache management are all controlled by a remote server.
When building a highly-available server, raw I/O and synchronization messages are very important, and NFS is more for sharing storage than extending the storage capabilities of a server. So what must we consider when deciding on how to best utilize a SAN, and when should we do this instead of using a cheaper solution such as direct attached storage?
We won't be discussing how to evaluate a SAN, which vendors produce the best hardware, or even basic configuration strategies. There are several entire books dedicated to SAN management and evaluation that are far beyond the scope of our overview. For building a highly-available PostgreSQL architecture, all we need to consider is the when and why, not the how.
Because we're going to cover both SAN performance and storage allocation, we recommend referring to the Having enough IOPS and Sizing storage recipes. Just like physical disks, we need to know how much space we need, and roughly how fast it should be to fulfill our transaction and query requirements.
Do we need a SAN? We can ask ourselves a few questions:
- Do our IOPS or storage requirements demand more than 20 hard drives?
- Will the size of our database reach or exceed 3 TB within the next three years?
- Would the risk to the company be too high if we ever ran out of space?
- Is there already a SAN available for testing?
If we answer yes to any of these, a SAN might be in our best interests. In that case, we can determine if it would fulfill our needs.
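The four questions act as a simple any-of test, as this Python sketch shows. The function and parameter names are hypothetical; the thresholds of 20 drives and 3 TB are the ones stated in the questions.

```python
def san_worth_testing(drives_needed, projected_size_tb,
                      out_of_space_is_critical, san_available):
    """Return True when any of the four questions is answered 'yes'."""
    return any([
        drives_needed > 20,          # more than 20 hard drives required
        projected_size_tb >= 3,      # reaches or exceeds 3 TB in three years
        out_of_space_is_critical,    # running out of space is too risky
        san_available,               # a SAN already exists for testing
    ])


print(san_worth_testing(12, 1.5, False, False))
print(san_worth_testing(12, 3, False, False))
```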
Follow these steps if possible:
- Request a LUN from the infrastructure department with the necessary IOPS and storage requirements.
- If a SAN isn't available, many SAN vendors will provide testing equipment to encourage purchase. Try to obtain one of these.
- Have the infrastructure department format the allocation and attach it to a testing server. Keep note of the path to the storage.
- Create a basic PostgreSQL testing database with the following command-line operations as the postgres user:
createdb pgbench
pgbench -i -s 4000 pgbench
- Drop the system caches as a user capable of performing root-level commands, as follows:
echo 3 | sudo tee /proc/sys/vm/drop_caches
- Test the storage read IOPS with one final command as the postgres user:
pgbench -S -c 24 -T 600 -j 2 pgbench
The first part of our process is to decide whether or not we actually need a SAN at all. If the database will remain relatively small, capable of residing easily on local hard drives for several years, we don't need a SAN just yet.
While it might seem arbitrary, setting 3 TB as a cutoff for local storage comes with a few justifications. First, consider the local drives. Even if they were capable of saturating a 6 Gbps disk controller, 3 TB would require over an hour to transfer to another local storage device. If that wasn't a bottleneck, there is still the network. With a 10 Gbps NIC and assuming no overhead, that's 40 minutes of transfer at full speed.
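The transfer times quoted above are easy to reproduce. The Python sketch below is a back-of-the-envelope calculation only; the function name `transfer_minutes` is ours, and using ten bits per byte to approximate link encoding overhead is an assumption borrowed from the shortcut in the networking recipe.

```python
def transfer_minutes(size_tb, link_gbps, bits_per_byte=8):
    """Minutes needed to move size_tb terabytes over a link_gbps link.

    bits_per_byte=8 assumes no overhead; 10 approximates encoding overhead,
    the same shortcut the networking recipe uses.
    """
    bits = size_tb * 1e12 * bits_per_byte
    return bits / (link_gbps * 1e9) / 60


print(round(transfer_minutes(3, 10)))                     # 10 Gbps NIC
print(round(transfer_minutes(3, 6, bits_per_byte=10)))    # 6 Gbps controller
```

Either way of counting bits puts the 6 Gbps case above an hour, which is the point of the 3 TB cutoff.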
That directly affects the speed of backups, synchronization, emergency data restores, and any number of other critical operations. Some RAID cards also require special configuration when handling over 4 TB of storage, and 3 TB is uncomfortably close to that limit if we ever need an extension. SAN devices can perform local storage snapshots for nearly instant data copies intended for other servers. If the other server also uses the same SAN, there's no transfer overhead.
And lastly, while RAID devices can be extended when online, there is a limit imposed by how many local disks are available to our server, either directly in the chassis, or from direct attached storage extensions. If there's ever any risk we can reach that maximum, SAN devices do not have any of these inherent limitations, which we can use to our advantage.
If a SAN is ever available for testing, we're still not done. Depending on the speed of configuration of the SAN or the storage allocation itself, performance may not be sufficient, so we should test the claims made by the SAN manufacturer before committing all of our storage to it.
A very easy way to do this is with a basic pgbench test. The pgbench command is provided by the PostgreSQL software, and it can test various aspects of a server. For our uses, we want to focus on the disk storage. We start by creating a new pgbench database with createdb, so the pgbench command has somewhere to store its test data. The -i option to pgbench tells it to initialize new test data, and the -s option describes the scale of test data we want.
A scale of 4000 creates a database roughly 60 GB in size. Adjust this scale so that the test database is larger than the amount of available RAM; this guarantees that the server cannot cache all of the test data and taint our performance results by inflating the numbers.
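To pick a scale for a particular server, we can work backward from the figure above. This is a rough sketch only: the 15 MB per scale unit comes from dividing 60 GB by 4000, the actual size varies between PostgreSQL versions, and the headroom multiplier is a hypothetical parameter of our own choosing.

```python
import math

# Derived from the text: a scale of 4000 is roughly 60 GB,
# which works out to about 15 MB per scale unit (an approximation).
MB_PER_SCALE_UNIT = 15


def pgbench_scale_for(ram_gb: float, headroom: float = 2.0) -> int:
    """Smallest pgbench scale whose data set is about headroom times RAM.

    headroom is an illustrative safety factor, not a pgbench setting.
    """
    if ram_gb <= 0 or headroom <= 0:
        raise ValueError("ram_gb and headroom must be positive")
    target_mb = ram_gb * headroom * 1000
    return math.ceil(target_mb / MB_PER_SCALE_UNIT)


if __name__ == "__main__":
    for ram in (30, 64, 128):
        print(f"{ram} GB RAM -> pgbench -i -s {pgbench_scale_for(ram)} pgbench")
```

With the default headroom of 2, a server with 30 GB of RAM lands on the same scale of 4000 used in this recipe.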
After initializing a new test database, there is a Linux command that can instruct the server to drop all available cached data. This means none of our test data is in memory before we start the benchmark. Again, we don't want to inflate our results, otherwise the SAN looks more capable than it really is.
The test itself comes from pgbench again, which is instructed to only read the test data with the -S option. Furthermore, we tell the benchmark to launch 24 clients with the -c parameter, and to run the test for ten full minutes with the -T option. While we used 24 clients here, consider any amount up to three times the number of available processor cores. The final -j flag merely launches two concurrent benchmark threads, preventing the test itself from reducing overall performance due to CPU throttling.
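If we run this test on several servers, a small helper can derive the client count from the core count. The sketch below only assembles the command line from the options already described; the function name and defaults are illustrative, and rounding the client count down to a multiple of the thread count is a precaution we have added because some pgbench releases require it.

```python
import shlex


def pgbench_read_test(cores: int, seconds: int = 600,
                      threads: int = 2, database: str = "pgbench") -> list:
    """Build a select-only pgbench command using three clients per core."""
    if cores < 1 or threads < 1:
        raise ValueError("cores and threads must be at least 1")
    clients = cores * 3                # upper bound suggested in the text
    clients -= clients % threads       # keep clients a multiple of threads
    clients = max(clients, threads)    # never fewer clients than threads
    return ["pgbench", "-S", "-c", str(clients),
            "-T", str(seconds), "-j", str(threads), database]


if __name__ == "__main__":
    # An 8-core server reproduces the command used in this recipe.
    print(shlex.join(pgbench_read_test(8)))
```

The result is a list suitable for subprocess.run, or it can be printed and pasted into a shell as shown.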
This process should reveal how capable the SAN is, and if our production database will be safe and have good performance while relying on remote storage.
Notice how we never ask for a specific number of disks when requesting a SAN allocation. Modern SAN equipment operates on an implied service level agreement based on installed components. In effect, if we need 6,000 IOPS and 10 TB of space, the SAN will combine disks, cache, and even SSDs if necessary, to match those numbers as closely as possible.
This not only reduces the amount of risky micromanagement we perform as DBAs, but it acts as an abstraction layer between storage and server. In this case, storage can be modified any number of ways, enhanced, adjusted, or copied, without affecting the database installation itself.
The main problem we encounter when using a SAN instead of several servers configured with local storage, is that the SAN becomes a single point of failure. This is something to keep in mind as our journey to high availability progresses.
Now it's time to get serious. For several pages, we have discussed all the components that go into a stable server, and have strongly suggested obtaining multiple spares for each. Well, that also applies to the server itself. Not only does this mean having a spare idle server in case of a catastrophic failure, but it means having an online server as well.
Determining how many excess servers we should have isn't quite that simple, but it's fairly close. This is where the project starts to get expensive, but high availability is never cheap; the company itself might depend on it.
For this, we want to consider the overall state of the application architecture. The database doesn't exist in a vacuum. Work with the system and application teams to get an idea of the other servers that depend on the database.
This won't be a very long list. In any case, follow these steps:
- For every critical OLTP system, allocate one online replica.
- For every two non-cached applications or web servers, consider one online replica.
- For each 10 cached applications or web servers, consider one online replica.
- For every stage or QA database server analog, allocate one spare server.
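The steps above translate into a quick tally. This is a sketch rather than a sizing formula: the function and argument names are our own, and rounding partial groups up to a whole replica is an assumption, since the steps only say to consider one replica per group.

```python
import math


def replica_plan(critical_oltp: int, uncached_clients: int,
                 cached_clients: int, qa_servers: int) -> dict:
    """Tally online replicas and QA spares from the rules of thumb above."""
    online = (
        critical_oltp                        # one per critical OLTP system
        + math.ceil(uncached_clients / 2)    # one per two non-cached clients
        + math.ceil(cached_clients / 10)     # one per ten cached clients
    )
    return {"online_replicas": online, "qa_spares": qa_servers}


if __name__ == "__main__":
    # Hypothetical environment: 1 OLTP system, 4 uncached and 20 cached
    # application servers, and 1 QA database server.
    print(replica_plan(1, 4, 20, 1))
```

For the hypothetical environment in the example, this suggests five online replicas and one spare server for QA.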
OLTP systems, by their very nature, produce a very high transactional volume. Any disruption to this volume is extremely visible and costly. A primary goal of running a highly-available service, such as a database, is to minimize downtime. So for any database instance that is a critical component, there should be a copy of the server configured in such a manner that near-immediate promotion to production status is possible.
Any server that needs direct access to the database, whether it be a queue system, application server, or web frontend, is sensitive to database overload. One way of defusing this risk is to set up one database copy for every two to four directly-connected servers. These copies are only usable for reads and not writes, but a properly designed application can accommodate this limitation. Not only does this reduce contention on the database instance that must handle data writes, it all but eliminates the likelihood of one misbehaving query taking down the entire constellation of client-visible services.
When a sophisticated cache is involved, the risk to the frontend is greatly reduced. Properly designed, a failed read from the database can default to a cached copy until reads can be re-established. This means we can subsist on fewer database replicas. If the application does not provide that kind of cache, our job as database advocate becomes one of working with appropriate technical leads until such a cache is established.
The extra QA resource may seem excessive at first, but it has a very important role. While the testing teams may never touch the spare server, we can use it in their stead. We can never safely configure a production system for online failover without first testing that configuration on two similarly equipped systems. To do otherwise risks failure of the automatic activation of alternate production servers, which is a de facto outage. Database migrations, upgrades, resynchronization, backup restores, all of these can be tested in the QA environment before they are needed for production use. Without a second server, none of this would be possible.
We have brought this up as a tip before, but this deserves special attention. PostgreSQL 9.2 and above now has the capability to stream replicated data from one database standby to another. Even with 10 GbE network cards, there is a limit to the amount of data our master server can or should transmit before its role is put at risk.
While there is still a limit to the number of replicas we can maintain, this new functionality mitigates overall traffic, and therefore risk. If our database is stuck on a version before 9.2, we may never realize these new benefits. At the time of writing, PostgreSQL 9.6 is the latest release, and 10.0 is well underway. A crafty DBA can encourage the company to adopt a forward stance regarding upgrades by providing an upgrade proposal, procedural checklist, and deployment integration tests.
Because pg_upgrade is a standard part of PostgreSQL, producing a robust upgrade plan and associated compatibility tests is much easier than in the past. By pushing for upgrades early, we can use new features such as cascading replication, which can heavily influence our resulting architecture. Consider this when choosing your hardware.
Did we suggest that having several servers was serious? We lied. The place where our servers live, the data center, also has several redundancies in place. Extra network lines, separate power sources, multiple generators, air conditioning and ventilation, everything a server can require.
Yet, some have joked that a common backhoe is the natural enemy of the Internet. There is more truth to that statement than its apparent lack of gravitas might suggest. Data centers are geographically insecure. Inclement weather, natural disasters, disrupted backbones, power outages, accidentally damaged trunk lines (from an errant backhoe?), and simple human error can all remove a data center from the grid. When a data center vanishes from the Internet, our servers become collateral damage.
However, we've done everything right! We have duplicates of everything, multiple parts, cables, even whole servers. What can we possibly do about the data center?
Well, it's complicated...
For this section, we will need a list of every database server in our proposed architecture, and the desired role for each.
This won't be a very long list. In any case, follow these steps:
- For every critical OLTP operating pair, allocate at least one standby.
- For every two online standby replicas, consider at least one standby.
- For every other database instance, allocate one standby.
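As with the online replicas, these steps reduce to a short calculation. The sketch below is illustrative only: the names are hypothetical, and we again assume that an odd number of replicas rounds up to the next whole standby.

```python
import math


def dr_standbys(oltp_pairs: int, online_replicas: int,
                other_instances: int) -> int:
    """Count disaster recovery standbys for the secondary data center."""
    return (
        oltp_pairs                           # one per critical OLTP pair
        + math.ceil(online_replicas / 2)     # one per two online replicas
        + other_instances                    # one per remaining instance
    )


if __name__ == "__main__":
    # Hypothetical primary site: 1 OLTP pair, 4 read replicas, and
    # 2 other database instances such as reporting servers.
    print(dr_standbys(1, 4, 2))
```

These are minimums; each step says "at least one," so a larger disaster recovery footprint is always an option.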
This type of scenario is known as Disaster Recovery. In order to truly defuse a data center outage, we need backups of every major database server, and even minor servers. The reasoning is simple: we don't know how long we have to operate at reduced capacity. At that point, even non-critical reporting services still need analogs, otherwise business decisions that depend on activity analysis may not be possible.
We only really need half the number of database servers, as most disaster recovery scenarios are severe enough to warrant raised alertness, reduced refresh times, manually extended queue timeouts, and more. Not only is this less expensive than duplicating every server in the primary data center, but it also encourages closer monitoring until the primary data center can be restored. Larger companies can opt for complete parity between data centers, but this is not a requirement.
As DBAs, our scenario often resembles this:
Notice that we didn't make any reservations for QA or development database servers. In the case of a disaster, the primary concern is ensuring the continued availability of the application platform. Further development or testing is likely on hold for the duration of the outage in any case.
We cannot stress the importance of this section strongly enough. Some may consider an entire extra data center as optional due to the cost. It is not. Others may think a total of three servers for every primary system is too much maintenance overhead. Again, it is not. The price of a few servers must be weighed against the future of the company itself; it is the cost of admission into the world of high availability.
By the time we begin utilizing failover nodes, or any replicas in a separate data center, the damage has already been done. In the absence of these resources, a database crash can result in hours or even days of unavailability depending on the size of our database, exponentially compounding the effects of the original problem.
With this in mind, all critical production systems the author designs always have a minimum of four nodes: two mirrored production systems, and two mirrored disaster recovery analogs. This ensures even the disaster recovery system is online with one node while the other node is undergoing maintenance. Outages are unexpected, and we must always be prepared for them.