PostgreSQL certainly has a reputation. It's known for a rich feature set and very stable software releases. The secure stance its default configuration takes is simultaneously praised by security fans and criticized for its learning curve. Its SQL specification conformance and data integrity features allow only the strictest ways to interact with the database, which surprises those coming from a background of looser desktop database software. All of these points have an element of truth to them.
Another part of PostgreSQL's reputation is that it's slow, although in recent releases its performance has improved dramatically. There are many database operations where doing "the right thing" just plain takes longer than the alternative. As the simplest example, if you ask a database to store February 29, 2009 into a date field, it can just do that, the fast approach. Or it can check whether that date is valid for the destination field, note that there is no such date in a regular calendar (2009 was not a leap year), and reject your change. That's always going to be slower. PostgreSQL is designed by and for the sort of people who don't like cutting corners just to make things faster or easier, and in cases where the only way to handle something properly takes a while, that may be the only option available.
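The cost of doing the right thing here is easy to demonstrate outside of PostgreSQL. This sketch uses Python's datetime module as a stand-in for the database's validation; it is not PostgreSQL's actual implementation:

```python
from datetime import date

def store_date(year, month, day):
    """'Store' a date the strict way: validate before accepting it.

    Raises ValueError for impossible dates such as February 29, 2009,
    since 2009 is not a leap year.
    """
    return date(year, month, day)  # the constructor checks the calendar

print(store_date(2008, 2, 29))  # 2008-02-29: a real leap day, accepted

try:
    store_date(2009, 2, 29)
except ValueError as err:
    print("rejected:", err)  # no such date exists, so the change is refused
```

The fast approach would skip the calendar check entirely and accept whatever bytes arrive; the strict approach always costs a little more.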
However, once you have a correct implementation of something, you can go back and optimize it. That's the mode PostgreSQL has been in for the last few years. It now usually rises above these smaller issues to give excellent database performance anyway. Parts of it have the sort of great design that outperforms simpler approaches, even after paying the overhead that complexity can introduce. This is a fairly recent phenomenon though, which explains quite a bit of the lingering perception that PostgreSQL is a slower database than its competitors.
In this chapter, we will cover the following topics:
- Performance of historical PostgreSQL releases
- PostgreSQL or another database?
- PostgreSQL tools
- PostgreSQL application scaling life cycle
- Performance tuning as a practice
In September 2016, PostgreSQL 9.6 was released. It included a number of internal architectural changes with significant performance impact, such as parallel execution of sequential scans, aggregates, and joins, along with notable scalability improvements on multi-CPU-socket servers. The complete change log can be found at https://www.postgresql.org/docs/9.6/static/release-9-6.html.
The result was a major improvement in the database's ability to scale upwards to handle a heavy load. Benchmarks on modern hardware really highlight just how far that version leapfrogged earlier ones. The following benchmark results compare versions 8.3 through 9.6, using PostgreSQL's own pgbench tool to benchmark the different releases. The complete pgbench documentation can be found at https://www.postgresql.org/docs/current/static/pgbench.html.
The test provides a Transactions Per Second (TPS) figure that measures total system speed, for either read-only transactions or mixed read/write transactions. Here is the mixed mode (read/write) performance comparison from version 8.3 to 9.6.
The trend across versions is easy to see: increasing the number of clients increases TPS up to some peak client count, after which TPS starts decreasing. That peak indicates approximately the maximum number of clients that version can handle on that particular hardware. The rise in the number of clients at peak load therefore gives an idea of how well the database internals handle access to shared resources, an area where 9.6 in particular is a significant upgrade over 8.3/8.4:
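The clients-versus-TPS behavior just described can be sketched numerically. The figures below are invented for illustration only; they are not results from any real pgbench run:

```python
def tps(transactions, seconds):
    """Transactions per second: completed transactions / elapsed time."""
    return transactions / seconds

# Hypothetical (client count -> TPS) figures from a pgbench-style test run.
results = {
    1:  tps(6_000, 60),   # 100 TPS: a single client leaves the server idle
    8:  tps(42_000, 60),  # 700 TPS: throughput climbs with concurrency
    32: tps(54_000, 60),  # 900 TPS: the peak for this hypothetical hardware
    64: tps(45_000, 60),  # 750 TPS: past the peak, contention drags TPS down
}

# The client count at peak TPS approximates the maximum useful concurrency.
peak_clients = max(results, key=results.get)
print(peak_clients, results[peak_clients])  # 32 900.0
```

A newer release that pushes that peak to a higher client count is handling shared-resource contention better, which is exactly what the 9.6 scalability work targeted.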
These improvements have been confirmed by other benchmarking results, albeit normally not covering such a wide range of versions. It's easy to see that any conclusion about PostgreSQL performance reached before late 2005, when 8.1 shipped, is completely out of date at this point. The speed improvements in the 9.0 releases were another large leap. Performance is not the only reason to choose a version greater than 9.0; there are other reasons to prefer that one or later too.
More detailed PostgreSQL benchmarks can be found at http://blog.pgaddict.com/posts/performance-since-postgresql-7-4-to-9-4-pgbench.
Due to the significant gains detailed in the previous section, if you have an older PostgreSQL system you'd like to make faster, the very first thing you should ask yourself is not how to tweak its settings, but instead whether it's possible to upgrade to a newer version. If you're starting a new project, 9.6 is the earliest version you should consider. In addition to the performance improvements, there are some changes in that version that impact application coding that you'd be better off starting with, to avoid needing to retrofit later.
The final chapter of this book includes a reference guide to the performance-related features added to each major version of PostgreSQL, from 8.4 through 9.6. You might discover that one of the features only available in a very recent version is compelling to you, and therefore have a strong preference to use that one. Many of these version-specific changes are also highlighted throughout the book.
The older way to upgrade an existing PostgreSQL install to a newer major version, such as going from 9.1.X to 9.2.X, was to dump and reload. The pg_dump and pg_dumpall programs are used to write the entire contents of the database to a file, using the newer version of those programs. That way, if any changes need to be made to upgrade, the newer dumping program can try to handle them. Not all upgrade changes will happen automatically though. Then, depending on the format you dumped in, you can either restore just by running the script it generates, or use the pg_restore program to handle that task. pg_restore can be a much better alternative in newer PostgreSQL versions that include parallel restore capabilities.
If you are using a system that doesn't easily allow running more than one system PostgreSQL version at a time, such as the current RedHat Linux RPM packages, getting both old and new versions of PostgreSQL installed on your system at the same time can be difficult. There were changes to improve this situation under development as of PostgreSQL 9.0. Make sure to check the feasibility of running more than one version at once as part of planning an upgrade.
Dumping can take a while, and restoring even longer. While this is going on, your database likely needs to be down, so that you don't allow any changes that won't then be migrated over by the dump. For large databases, this downtime can be both large and unacceptable.
Nowadays, one approach available for zero-downtime, 24/7 sites that need to do PostgreSQL upgrades has been to use statement replication. Slony (http://slony.info/) is the most popular tool for that, and more information about it will be covered in the later replication chapter. One of Slony's features is that you don't have to run the same version of PostgreSQL on all the nodes you are replicating to. You can bring up a new node running a newer PostgreSQL version, wait for replication to catch it up, and then switch over once it matches the original.
PostgreSQL also provides a way that works without needing replication software. A program called pg_upgrade (https://www.postgresql.org/docs/9.6/static/pgupgrade.html), shipped with the PostgreSQL distribution, is capable of upgrading from an older version to a newer one without the dump and reload, a process called in-place upgrading. While all in-place upgrades have some risk and need careful testing, in many cases pg_upgrade will take you from 8.3 to the latest version.
The PostgreSQL development community is now committed to allowing in-place upgrades to future versions. Now that terabyte and larger PostgreSQL installs are common, upgrading only via dump and reload just isn't always practical.
A dump/reload, or the use of tools such as pg_upgrade, is not needed for minor version updates, for example going from 9.4.3 to 9.4.4. These simply require stopping the server, installing the new version, and then running the newer database binary against the existing server data files. Some people avoid ever doing such upgrades once their application is running, for fear that a change in the database will cause a problem. This should never be the case for PostgreSQL. The policy of the PostgreSQL project, described at http://www.postgresql.org/support/versioning, states very clearly:
While upgrades always have some risk, PostgreSQL minor releases fix only frequently encountered, security, and data corruption bugs to reduce the risk of upgrading. The community considers not upgrading to be riskier than upgrading.
You should never find an unexpected change that breaks an application in a minor PostgreSQL upgrade. Bug, security, and corruption fixes are always done in a way that minimizes the odds of introducing an externally visible behavior change, and if that's not possible, the reason why and suggested workarounds will be detailed in the release notes. What you will find is that some subtle problems resulting from resolved bugs can clear up after even a minor version update. It's not uncommon to discover that a problem reported to one of the PostgreSQL mailing lists has already been resolved in the latest minor version update compatible with that installation, and that upgrading is all that's needed to make the issue go away.
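Under the pre-10 numbering scheme this book uses, the rule separating the two kinds of upgrade can be sketched in a few lines of Python (a simplification that ignores the one-part major versions introduced later, in PostgreSQL 10):

```python
def is_minor_update(old, new):
    """Return True when two pre-10 PostgreSQL versions differ only in
    their third component, i.e. share the same major version."""
    major = lambda version: tuple(version.split(".")[:2])
    return major(old) == major(new)

# Binaries can simply be swapped in place for a minor update...
print(is_minor_update("9.4.3", "9.4.4"))   # True

# ...while a major upgrade needs dump/reload, replication, or pg_upgrade.
print(is_minor_update("9.1.24", "9.2.0"))  # False
```

The first two components together form the major version, which fixes the on-disk data format; only the third component can change without touching the data files.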
There are certainly situations where other database solutions will perform better. For example, PostgreSQL is missing features needed to perform well on some of the more difficult queries in the TPC-H test suite (see Chapter 8, Database Benchmarking, for more details). It's correspondingly less suitable for running large data warehouse applications than many of the commercial databases. If you need queries like some of the very heavy ones TPC-H includes, you may find that databases such as Oracle, DB2, and SQL Server still have a performance advantage worth paying for. There are also several PostgreSQL-derived databases that include features making them more appropriate for data warehouses and similar larger systems. Examples include Greenplum, Aster Data, and Netezza.
For some types of web applications, you can only get acceptable performance by cutting corners on the data integrity features in ways that PostgreSQL just won't allow. A less strict database such as MySQL, or even a really minimal one such as SQLite, might better serve those applications. Unlike the fairly mature data warehouse market, the design of this type of application is still evolving quite a bit. Work on key-value-based NoSQL approaches, including CouchDB, MongoDB, and Cassandra, was becoming more popular at the time of writing. All of them can easily outperform a traditional database, provided you have no need to run the sort of advanced queries that key/value stores are slower at handling.
But for many "normal" database use cases in the middle ground between those two extremes, PostgreSQL performance in 9.6 has reached a point where you're more likely to run into the limitations of your hardware or application design before the database becomes your limiting factor. And some of PostgreSQL's traditional strengths, such as its ability to handle complicated queries well and its heavy programmability, are all still there.
If you're used to your database vendor supplying a full tool chain with the database itself, from server management to application development, PostgreSQL may be a shock to you. Like many successful open-source projects, PostgreSQL tries to stay focused on the features it's uniquely good at. This is what the development community refers to as the PostgreSQL core: the main database server, and associated utilities that can only be developed as part of the database itself. When new features are proposed, if it's possible for them to be built and distributed "out of core", this is the preferred way to do things.
This approach keeps the database core as streamlined as possible, as well as allowing those external projects to release their own updates without needing to synchronize against the main database's release schedule.
Successful PostgreSQL deployments should recognize that a number of additional tools, each with their own specialized purpose, will need to be integrated with the database core server to build a complete system.
One part of the PostgreSQL core that you may not necessarily have installed is what's called the contrib modules, named after the contrib directory they are stored in. These are optional utilities shipped with the standard package, but they aren't necessarily installed by default on your system. The contrib code is maintained and distributed as part of the PostgreSQL core, but is not required for the server to operate.
From a code quality perspective, the contrib modules aren't held to quite as high a standard, primarily in how they're tested. The main server includes heavy regression tests for every feature, run across a large build farm of systems that look for errors. The optional contrib modules don't get that same level of testing coverage. But the same development team maintains the code, and some of the modules are extremely popular and well tested by users.
A listing of all the contrib modules available is at http://www.postgresql.org/docs/current/static/contrib.html.
One good way to check if you have the contrib modules installed is to see if the pgbench program is available. That's one of the few contrib components that installs a full program, rather than just scripts you can use. Here's a UNIX example of checking for pgbench:

    $ pgbench -V
    pgbench (PostgreSQL) 9.6
If you're using an RPM or DEB packaged version of PostgreSQL, as would be the case on many Linux systems, the optional postgresql-contrib package contains all of the contrib modules and their associated installer scripts. You may have to add that package using apt-get or a similar mechanism if it isn't installed already. On Solaris, the package is named
If you're not sure where your system's PostgreSQL contrib modules are installed, you can use a filesystem utility to search for them. The locate command works well for this purpose on many UNIX-like systems, as does the find command. The file search utilities available on the Windows Start menu will also work. A sample file you could look for is pg_buffercache.sql, which will be used in the upcoming Chapter 5, Memory for Database Caching, on memory allocation. Here's where you might find the file on some of the platforms that PostgreSQL supports:
- RHEL and CentOS Linux systems will put the main file you need into
- Debian or Ubuntu Linux systems will install the file at
- Solaris installs into
- The standard Windows one-click installer with default options will always include the contrib modules, and this one will be in
Building your own PostgreSQL from source code can be a straightforward exercise on some platforms, if you have the appropriate requirements already installed on the server. Details are documented at http://www.postgresql.org/docs/current/static/install-procedure.html.
After building the main server code, you'll also need to compile contrib modules such as pg_buffercache yourself. Here's an example of how that would work, presuming that your PostgreSQL destination is /usr/local/postgresql and that there's a directory under there named source that you put the source code into (this is not intended to be a typical or recommended structure you should use):

    $ cd /usr/local/postgresql/source
    $ cd contrib/pg_buffercache/
    $ make
    $ make install
    /bin/mkdir -p '/usr/local/postgresql/lib/postgresql'
    /bin/mkdir -p '/usr/local/postgresql/share/postgresql/contrib'
    /bin/sh ../../config/install-sh -c -m 755 pg_buffercache.so '/usr/local/postgresql/lib/postgresql/pg_buffercache.so'
    /bin/sh ../../config/install-sh -c -m 644 ./uninstall_pg_buffercache.sql '/usr/local/postgresql/share/postgresql/contrib'
    /bin/sh ../../config/install-sh -c -m 644 pg_buffercache.sql '/usr/local/postgresql/share/postgresql/contrib'
It's also possible to build and install all the contrib modules at once by running make install from the contrib directory. Note that some of these have more extensive source code build requirements; the uuid-ossp module is an example of a more challenging one to compile yourself.
While some contrib programs, such as pgbench, are directly executable, most are utilities that you install into a database in order to add extra features to it.
As an example, to install the pg_buffercache module into a database named abc, the following command line would work (adjust the path to match where your packaging installed the file):

    $ psql -d abc -f /usr/share/postgresql/contrib/pg_buffercache.sql
You could instead use the pgAdmin III GUI management utility, which is bundled with the Windows installer for PostgreSQL, instead of the command line:
- Navigate down to the database you want to install the module into
- Click on the SQL icon in the toolbar to bring up the command editor
- Use File/Open to navigate to C:\Program Files\PostgreSQL\version\share\contrib\pg_buffercache.sql and open that file
- Execute using either the green arrow or the Query/Execute menu option
You can do a quick test of whether the module is installed on any type of system by running the following query:

    SELECT * FROM pg_buffercache;

If any results come back, the module was installed. Note that pg_buffercache will only be installable and usable by database superusers.
PostgreSQL provides another way to install contrib modules, called extensions. A contrib module built to the extension specification includes:
- The extension SQL file
- The extension control file
- The extension library file
Let's look at the following commands:
- CREATE EXTENSION extension_name: This command installs a new extension, so there is no need to manually execute an SQL file as we did when installing pg_buffercache
- ALTER EXTENSION extension_name: This command alters an extension and is mostly used to upgrade one
- DROP EXTENSION extension_name: This command deletes an extension, cleaning up its objects as it does so
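Assuming a module that has been packaged as an extension and whose files are already installed on the server (pg_buffercache is packaged this way in recent releases), the full life cycle looks like this:

```sql
-- Install into the current database, replacing the manual psql -f step
CREATE EXTENSION pg_buffercache;

-- Upgrade to the newest installed extension version
ALTER EXTENSION pg_buffercache UPDATE;

-- Remove the extension, cleaning up the objects it created
DROP EXTENSION pg_buffercache;
```

These statements run inside psql or any SQL client connected to the target database; the extension's control and library files still have to be present on the server before CREATE EXTENSION can find them.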
The official home of many PostgreSQL-related projects is pgFoundry (http://pgfoundry.org/). pgFoundry only hosts software for PostgreSQL, and it provides resources such as mailing lists and bug tracking in addition to file distribution. Many of the most popular PostgreSQL add-on programs are hosted there, including the following:
- Windows software allowing access to PostgreSQL through .Net and OLE
- Connection poolers pgpool and pgBouncer
- Database management utilities such as pgFouine, SkyTools, and pgtune
While sometimes maintained by the same people who work on the PostgreSQL core, pgFoundry code varies significantly in quality. One way to help spot the healthier projects is to note how regularly and recently new versions have been released.
Beyond what comes with the PostgreSQL core, the contrib modules, and the software available on pgFoundry, there are plenty of other programs available from sources all over the Internet that will make PostgreSQL easier to use and more powerful. There are actually so many available that choosing the right package for a requirement can itself be overwhelming.
Some of the best programs will be highlighted throughout the book, to help provide a shortlist of ones you should consider early on. This approach, where you get a basic system running and then add additional components as needed, is the standard way in which large open-source projects are built.
It can be difficult for some corporate cultures, ones where any software installation requires everything from approval to a QA cycle, to adapt to that style. To improve the odds that your PostgreSQL installation will be successful in such environments, it's important to introduce this concept early on. Additional programs that add components building on the intentionally slim database core will be needed later, and not all of what's needed will be obvious at the beginning.
While every application has unique growth aspects, there are many common techniques that you'll find necessary as the application using a PostgreSQL database becomes used more heavily. The chapters of this book each focus on one of the common aspects of this process. The general path database servers follow includes:
- Select hardware to run the server on. Ideally, you'll test that hardware to make sure it performs as expected too.
- Set up all the parts of the database disk layout: RAID level, filesystem, and possibly table/index layout on disk.
- Optimize the server configuration.
- Monitor server performance and how well queries are executing.
- Improve queries to execute more efficiently, or add indexes to help accelerate them.
- As it gets more difficult to just tune the server to do more work, instead reduce the amount it has to worry about by introducing connection pooling and caching.
- Replicate the data onto multiple servers and distribute reads among them.
- Partition larger tables into sections. Eventually really large ones may need to be split so they're written to multiple servers simultaneously.
This process is by no means linear. You can expect to make multiple passes over optimizing the server parameters. It may be the case that you decide to buy newer hardware first, rather than launching into replication or partitioning work that requires application redesign work. Some designs might integrate caching into the design from the very beginning. The important thing is to be aware of the various options available, and to collect enough data about what limits the system is reaching to help you decide which of the potential changes are most likely to help.
Work on improving database performance has its own terminology just like any other field. Here are some terms or phrases that will be used throughout the book:
- Bottleneck or limiting factor: Both these terms refer to the current limitation that is most hindering better performance.
- Benchmarking: This refers to running a test to determine how fast a particular operation can run. This is often done to figure out where the bottleneck of a program or system is.
- Profiling: This refers to monitoring which parts of a program are using the most resources when running a difficult operation such as a benchmark. This is typically used to help prove where the bottleneck is, and whether it's been removed as expected after a change. Profiling a database application usually starts with monitoring tools such as iostat. Popular profiling tools at the code level include gprof.
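As a toy example of the benchmarking side (pure Python, nothing PostgreSQL-specific), timing a repeated operation and turning it into an operations-per-second figure works like this:

```python
import time

def benchmark(operation, iterations=100_000):
    """Run operation() repeatedly and return operations per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        operation()
    elapsed = time.perf_counter() - start
    return iterations / elapsed

# Time a purely CPU-bound operation. Here the CPU is the bottleneck,
# so profiling would point at the summation rather than at disk I/O.
rate = benchmark(lambda: sum(range(100)))
print(f"{rate:,.0f} ops/sec")
```

Database benchmarks such as pgbench follow the same pattern at a larger scale: run a workload for a fixed period, divide work done by elapsed time, and then profile to find out what limited the result.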
One of the interesting principles of performance tuning work is that, in general, you cannot figure out the next bottleneck an application will run into until you remove the current one. When presented with a system that's not as fast as someone would expect it to be, you'll often see people guessing what the current bottleneck is, or what the next one will be. That's generally a waste of time. You're always better off measuring performance, profiling the parts of the system that are slow, and using that data to guess at causes and guide changes.
Let's say what you've looked at suggests that you should significantly increase shared_buffers, the primary tunable for memory used to cache database reads and writes. This normally has some positive impact, but there are also potential negative effects you could encounter instead. The information needed to figure out which category a new application will fall into, whether this change will increase or decrease performance, cannot be predicted from watching the server run with the smaller setting. This falls into the category of chaos theory, where even a tiny change in starting conditions can ripple out to a very different end condition, as the server makes millions of decisions impacted to a small degree by that change. Similarly, if shared_buffers is set too small, there are several other parameters that won't work as expected at all, such as those governing database checkpoints.
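For reference, that change is a single line in postgresql.conf. The value shown is purely illustrative, not a recommendation, and changing it requires a server restart:

```
# postgresql.conf (illustrative value only; the 9.6 default is 128MB)
shared_buffers = 2GB
```

Whether a value like this helps or hurts a given workload is exactly the kind of question that has to be answered by measurement rather than prediction.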
Since you can't predict what's going to happen most of the time, the mindset you need to adopt is one of heavy monitoring and change control. Monitor as much as possible, from application to database server to hardware. Introduce a small targeted change. Try to quantify what's different, and be aware that a change you rejected as unhelpful earlier won't necessarily stay that way forever. Move the bottleneck somewhere else, and you may discover that some parameter that didn't matter before is now suddenly the next limiting factor.
There's a popular expression on the mailing list devoted to PostgreSQL performance, used when people speculate about root causes without doing profiling to prove their theories: "less talk, more gprof". While gprof may not be the tool of choice for every performance issue, given it's more of a code profiling tool than a general monitoring one, the idea that you measure as much as possible before speculating about root causes is always a sound one. And measure again to validate that your change did what you expected, too.
Another principle that you'll find is a recurring theme of this book is that you must be systematic about investigating performance issues. Do not assume your server is fast because you bought it from a reputable vendor; benchmark the individual components yourself. Don't start your database performance testing with application-level tests; first run synthetic database performance tests you can compare against other people's results. That way, when you run into the inevitable application slowdown, you'll already know your hardware is operating as expected and that the database itself is running well. Once your system goes into production, some of the basic things you might need to do to find a performance problem, such as testing hardware speed, require downtime you can no longer afford to take.
You'll be in much better shape if every server you deploy is tested with a common methodology, which is exactly what later chapters of this book lead you through. Just because you're not a "hardware guy" doesn't mean you should skip the parts covering things like testing your disk performance. You need to do such work as often as possible when exposed to new systems; that's the only way to get a basic feel for whether something is operating within the standard range of behavior or whether, instead, there's something wrong.
PostgreSQL has come a long way in the last five years. After building solid database fundamentals, the many developers adding features across the globe have made significant strides in adding both new features and performance improvements in recent releases. The replication and read-scaling features first added in PostgreSQL 9.0, making both easier than ever before, are expected to further expand the types of applications the database is appropriate for. There are still some situations where PostgreSQL's feature set results in slower query processing than some of the commercial databases it might otherwise displace. If you're starting a new project using PostgreSQL, use the latest version possible, and definitely deploy 9.6 or later. PostgreSQL works well in many common database applications, but there are certainly applications it's not the best choice for. Not everything you need to manage and optimize a PostgreSQL server will be included in a basic install. Expect to need some additional utilities that add features outside of what the core database aims to provide. Performance tuning is best approached as a systematic, carefully measured practice. In the following chapter, we will discuss the hardware best suited for the PostgreSQL server.