|Read more about this book|
General logging setup is important but it is somewhat outside the scope of this article. You may need to set parameters such as log_destination, log_directory, and log_filename to save your log files in a way compatible with the system administrations requirements of your environment. These will all be set to reasonable defaults to get started with on most systems.
On UNIX-like systems, it's common for some of the database logging to be set in the script that starts and stops the server, rather than directly in the postgresql.conf file. If you instead use the pg_ctl command to manually start the server, you may discover that logging ends up on your screen instead. You'll need to look at the script that starts the server normally (commonly /etc/init.d/postgresql) to determine what it does, if you want to duplicate that behavior. In most cases, you just need to add –l logfilename to the pg_ctl command line to redirect its output to the standard location.
The default log_line_prefix is empty, which is not what you want. A good starting value here is the following:
This will put the following into every log line:
- %t: Timestamp
- %u: Database user name
- %r: Remote host connection is from
- %d: Database connection is to
- %p: Process ID of connection
It may not be obvious what you'd want all of these values for initially, particularly, the process ID. Once you've tried to chase down a few performance issues, the need for saving these values will be more obvious, and you'll be glad to already have this data logged.
Another approach worth considering is setting log_line_prefix such that the resulting logs will be compatible with the pgFouine program. That is a reasonable, general purpose logging prefix, and many sites end up needing to do some sort of query analysis eventually.
The options for this setting are as follows:
- none: Do not log any statement-level information.
- ddl: Log only Data Definition Language (DDL) statements such as CREATE and DROP. This can normally be left on even in production, and is handy to catch major changes introduced accidentally or intentionally by administrators.
- mod: Log any statement that modifies a value, which is essentially everything except for simple SELECT statements. If your workload is mostly SELECT based with relatively few data changes, this may be practical to leave enabled all the time.
- all: Log every statement. This is generally impractical to leave on in production due to the overhead of the logging. However, if your server is powerful enough relative to its workload, it may be practical to keep it on all the time.
Statement logging is a powerful technique for finding performance issues. Analyzing the information saved by log_statement and related sources for statement-level detail can reveal the true source for many types of performance issues. You will need to combine this with appropriate analysis tools.
Once you have some idea of how long a typical query statement should take to execute, this setting allows you to log only the ones that exceed some threshold you set. The value is in milliseconds, so you might set:
And then you'll only see statements that take longer than one second to run. This can be extremely handy for finding out the source of "outlier" statements that take much longer than most to execute.
If you are running 8.4 or later, you might instead prefer to use the auto_explain module: http://www.postgresql.org/docs/8.4/static/auto-explain.html instead of this feature. This will allow you to actually see why the queries that are running slowly are doing so by viewing their associated EXPLAIN plans.
Vacuuming and statistics
PostgreSQL databases require two primary forms of regular maintenance as data is added, updated, and deleted.
VACUUM cleans up after old transactions, including removing information that is no longer visible and returning freed space to where it can be re-used. The more often you UPDATE and DELETE information from the database, the more likely you'll need a regular vacuum cleaning regime. However, even static tables with data that never changes once inserted still need occasional care here.
ANALYZE looks at tables in the database and collects statistics about them— information like estimates of how many rows they have and how many distinct values are in there. Many aspects of query planning depend on this statistics data being accurate.
As both these tasks are critical to database performance over the long-term, starting in PostgreSQL 8.1 there is an autovacuum daemon available that will run in the background to handle these tasks for you. Its action is triggered by the number of changes to the database exceeding a threshold it calculates based on the existing table size.
The parameter for autovacuum is turned on by default in PostgreSQL 8.3, and the default settings are generally aggressive enough to work out of the box for smaller database with little manual tuning. Generally you just need to be careful that the amount of data in the free space map doesn't exceed max_fsm_pages, and even that requirement is automated away from being a concern as of 8.4.
Enabling autovacuum on older versions
If you have autovacuum available but it's not turned on by default, which will be the case with PostgreSQL 8.1 and 8.2, there are a few related parameters that must also be enabled for it to work, as covered in http://www.postgresql.org/docs/8.1/interactive/maintenance.html or http://www.postgresql.org/docs/8.2/interactive/routine-vacuuming.html.
The normal trio to enable in the postgresql.conf file in these versions are:
Note that as warned in the documentation, it's also wise to consider adjusting superuser_reserved_connections to allow for the autovacuum processes in these earlier versions.
The autovacuum you'll get in 8.1 and 8.2 is not going to be as efficient as what comes in 8.3 and later. You can expect it to take some fine tuning to get the right balance of enough maintenance without too much overhead, and because there's only a single worker it's easier for it to fall behind on a busy server. This topic isn't covered at length here. It's generally a better idea to put time into planning an upgrade to a PostgreSQL version with a newer autovacuum than to try and tweak an old one extensively, particularly if there are so many other performance issues that cannot be resolved easily in the older versions, too.
A few operations in the database server need working memory for larger operations than just regular sorting. VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY all can allocate up to maintainance_work_mem worth of memory instead. As it's unlikely that many sessions will be doing one of these operations at once, it's possible to set this value much higher than the standard per-client work_mem setting. Note that at least autovacuum_max_workers (defaulting to 3 starting in version 8.3) will allocate this much memory, so consider those sessions (perhaps along with a session or two doing a CREATE INDEX) when setting this value.
Assuming you haven't increased the number of autovacuum workers, a typical high setting for this value on a modern server would be at five percent of the total RAM, so that even five such processes wouldn't exceed a quarter of available memory. This works out to approximately 50 MB of maintainance_work_mem per GB of server RAM.
PostgreSQL makes its decisions about how queries execute based on statistics collected about each table in your database. This information is collected by analyzing the tables, either with the ANALYZE statement or via autovacuum doing that step. In either case, the amount of information collected during the analyze step is set by default_statistics_target. Increasing this value makes analysis take longer, and as analysis of autovacuum happens regularly this turns into increased background overhead for database maintenance. But if there aren't enough statistics about a table, you can get bad plans for queries against it.
The default value for this setting used to be the very low (that is,10), but was increased to 100 in PostgreSQL 8.4. Using that larger value was popular in earlier versions, too, for general improved query behavior. Indexes using the LIKE operator tended to work much better with values greater than 100 rather than below it, due to a hard-coded change at that threshold.
Note that increasing this value does result in a net slowdown on your system if you're not ever running queries where the additional statistics result in a change to a better query plan. This is one reason why some simple benchmarks show PostgreSQL 8.4 as slightly slower than 8.3 at default parameters for each, and in some cases you might return an 8.4 install to a smaller setting. Extremely large settings for default_statistics_target are discouraged due to the large overhead they incur.
If there is just a particular column in a table you know that needs better statistics, you can use ALTER TABLE SET STATISTICS on that column to adjust this setting just for it. This works better than increasing the system-wide default and making every table pay for that requirement. Typically, the columns that really require a lot more statistics to work properly will require a setting near the maximum of 1000 (increased to 10,000 in later versions) to get a serious behavior change, which is far higher than you'd want to collect data for on every table in the database.
|Read more about this book|
Discussion in this article will focus mainly on common practice for initially setting these values.
Each WAL segment takes up 16 MB. As described at http://www.postgresql.org/docs/current/interactive/wal-configuration.html the maximum number of segments you can expect to be in use at any time is:
(2 + checkpoint_completion_target) * checkpoint_segments + 1
Note that in PostgreSQL versions before 8.3 that do not have spread checkpoints, you can still use this formula, just substitute the following code snippet for the value you'll be missing:
The easiest way to think about the result is in terms of the total size of all the WAL segments that you can expect to see on disk, which has both a disk cost and serves as something that can be used to estimate the time for recovery after a database crash. The expected peak pg_xlog size grows as shown in the following table:
The general rule of thumb you can extract here is that for every 32 checkpoint segments, expect at least 1 GB of WAL files to accumulate. As database crash recovery can take quite a while to process even that much data, 32 is as high as you want to make this setting for anything but a serious database server. The default of 3 is very low for most systems though; even a small install should consider an increase to at least 10.
Normally, you'll only want a value greater than 32 on a smaller server when doing bulk-loading, where it can help performance significantly and crash recovery isn't important. Databases that routinely do bulk loads may need a higher setting.
The default for this setting of 5 minutes is fine for most installations. If your system isn't able to keep up with writes and you've already increased checkpoint_segments to where the timeout is the main thing driving when checkpoints happen, it's reasonable to consider an increase to this value. Aiming for 10 minutes or more between checkpoints isn't dangerous; again it just increases how long database recovery after a crash will take. As this is one component to database server downtime after a crash, that's something you need a healthy respect for.
If you have increased checkpoint_segments to at least 10, it's reasonable at that point to also increase checkpoint_competion_target to its practical maximum of 0.9. This gives maximum checkpoint spreading, which theoretically means the smoothest I/O, too. In some cases keeping the default of 0.5 will still be better however, as it makes it less likely that one checkpoint's writes will spill into the next one.
It's unlikely that a value below 0.5 will be very effective at spreading checkpoints at all. Moreover, unless you have an extremely large value for the number of segments the practical difference between small changes in its value are unlikely to matter. One approach for the really thorough is to try both 0.5 and 0.9 with your application and see which one gives the smoother disk I/O curve over time, as judged by OS-level monitoring.
WAl settings refer to the PostgreSQL Write-Ahead Log (WAL).
While the documentation on wal_buffers suggests that the default of 64 KB is sufficient as long as no single transaction exceeds that value, in practice write-heavy benchmarks see optimal performance at higher values than you might expect from that, at least 1 MB or more. With the only downside being the increased use of shared memory, and as there's no case where more than a single WAL segment could need to be buffered, given modern server memory sizes the normal thing to do nowadays is to just set:
Then forget about it as a potential bottleneck or item to tune further. Only if you're tight on memory should you consider a smaller setting.
One purpose of wal_sync_method is to tune such caching behavior.
The default behavior here is somewhat different from most of the options. When the server source code is compiled, a series of possible ways to write are considered. The one believed most efficient then becomes the compiled-in default. This value is not written to the postgresql.conf file at initdb time though, making it different from other auto-detected, platform-specific values such as shared_buffers.
Before adjusting anything, you should check what your platform detected as the fastest safe method using SHOW; the following is a Linux example:
postgres=# show wal_sync_method;
On both Windows and the Mac OS X platforms, there is a special setting to make sure the OS clears any write-back caches. The safe value to use on these platforms that turns on this behavior is as follows:
If you have this setting available to you, you really want to use it! It does exactly the right thing to make database writes safe, while not slowing down other applications the way disabling an entire hard drive write cache will do.
This setting will not work on all platforms however. Note that you will see a performance drop going from the default to this value, as is always the case when going from unsafe to reliable caching behavior.
On other platforms, tuning wal_sync_method can be much more complicated. It's theoretically possible to improve write throughput on any UNIX-like system by switching from any write method that uses a write/fsync or write/fdatasync pair to using a true synchronous write. On platforms that support safe DSYNC write behavior, you may already see this as your default when checking it with SHOW:
Even though you won't see it explicitly listed in the configuration file as such. If this is the case on your platform, there's little optimization beyond that you can likely perform. open_datasync is generally the optimal approach, and when available it can even use direct I/O as well to bypass the operating system cache.
The Linux situation is perhaps the most complicated. As shown in the last code, this platform will default to fdatasync as the method used. It is possible to switch this to use synchronous writes with:
Also, in many cases you can discover this is faster—sometimes much faster—than the default behavior. However, whether this is safe or not depends on your filesystem. The default filesystem on most Linux systems, ext3, does not handle O_SYNC writes safely in many cases, which can result in corruption. See "PANIC caused by open_sync on Linux" at http://archives.postgresql.org/pgsqlhackers/2007-10/msg01310.php for an example of how dangerous this setting can be on that platform. There is evidence that this particular area has fi nally been cleaned up on recent (2.6.32) kernels when using the ext4 filesystem instead, but this has not been tested extensively at the database level yet.
In any case, your own tests of wal_sync_method should include the "pull the cord" test, where you power the server off unexpectedly, to make sure you don't lose any data with the method you've used. Testing at a very high load for a long period of time is also advisable, to find intermittent bugs that might cause a crash.
While all of the settings in this section can be adjusted per client, you'll still want good starting settings for these parameters in the main configuration file. Individual clients that need values outside the standard can always do so using the SET command within their session.
PostgreSQL is expected to have both its own dedicated memory (shared_buffers) as well as utilize the filesystem cache. In some cases, when making decisions like whether it is efficient to use an index or not, the database compares sizes it computes against the effective sum of all these caches; that's what it expects to find in effective_cache_size.
The same rough rule of thumb that would put shared_buffers at 25 percent of system memory would set effective_cache_size to between 50 and 75 percent of RAM. To get a more accurate estimate, first observe the size of the filesystem cache:
- UNIX-like systems: Add the free and cached numbers shown by the free or top commands to estimate the filesystem cache size
- Windows: Use the Windows Task Manager's Performance tab and look at the System Cache size
Assuming you have already started the database, you need to then add the shared_buffers figure to this value to arrive at a figure for effective_cache_size. If the database hasn't been started yet, usually the OS cache will be an accurate enough estimate, when it's not running. Once it is started, most of the database's dedicated memory will usually be allocated to its buffer cache anyway.
effective_cache_size does not allocate any memory. It's strictly used as input on how queries are executed, and a rough estimate is sufficient for most purposes. However, if you set this value much too high, actually executing the resulting queries may result in both the database and OS cache being disrupted by reading in the large number of blocks required to satisfy the query believed to fit easily in RAM.
It's rare you'll ever see this parameter tuned on a per-client basis, even though it is possible.
The overhead of waiting for physical disk commits was stressed as a likely bottleneck for committing transactions. If you don't have a battery-backed write cache to accelerate that, but you need better commit speed, what can you do? The standard approach is to disable synchronous_commit, which is sometimes alternately referred to as enabling asynchronous commits. This groups commits into chunks at a frequency determined by the related wal_writer_delay parameter. The default settings guarantee a real commit to disk at most 600 milliseconds after the client commit. During that window, which you can reduce in size with a corresponding decrease in speed-up, that data will not be recovered afterwards if your server crashes.
Note that it's possible to turn this parameter off for a single client during its session rather than making it a server-wide choice:
SET LOCAL synchronous_commit TO OFF;
This provides you with the option of having different physical commit guarantees for different types of data you put into the database. A routine activity monitoring table, one that was frequently inserted into and where a fraction of a second of loss is acceptable, would be a good candidate for asynchronous commit. An infrequently written table holding real-world monetary transactions should prefer the standard synchronous commit.
When a query is running that needs to sort data, the database estimates how much data is involved and then compares it to the work_mem parameter. If it's larger (and the default is only 1 MB), rather than sorting in memory it will write all the data out and use a disk-based sort instead. This is much, much slower than a memory based one. Accordingly, if you regularly sort data, and have memory to spare, a large increase in work_mem can be one of the most effective ways to speed up your server. A data warehousing report might on a giant server run with a gigabyte of work_mem for its larger reports.
The catch is that you can't necessarily predict the number of sorts any one client will be doing, and work_mem is a per-sort parameter rather than a per-client one. This means that memory use via work_mem is theoretically unbounded, where a number of clients sorting large enough things to happen concurrently.
In practice, there aren't that many sorts going on in a typical query, usually only one or two. And not every client that's active will be sorting at the same time. The normal guidance for work_mem is to consider how much free RAM is around after shared_buffers is allocated (the same OS caching size figure needed to compute effective_cache_size), divide by max_connections, and then take a fraction of that figure; a half of that would be an aggressive work_mem value. In that case, only if every client had two sorts active all at the same time would the server be likely to run out of memory, which is an unlikely scenario.
The work_mem computation is increasingly used in later PostgreSQL versions for estimating whether hash structures can be built in memory. Its use as a client, memory size threshold is not limited just to sorts. That's simply the easiest way to talk about the type of memory allocation decision it helps to guide.
Like synchronous_commit, work_mem can also be set per-client. This allows an approach where you keep the default to a moderate value, and only increase sort memory for the clients that you know are running large reports.
This parameter is common to tune, but explaining what it does requires a lot of background about how queries are planned. Particularly in earlier PostgreSQL versions, lowering this value from its default—for example, a reduction from 4.0 to 2.0—was a common technique. It was used for making it more likely that the planner would use indexed queries instead of the alternative of a sequential scan. With the smarter planner in current versions, this is certainly not where you want to start tuning at. You should prefer getting better statistics and setting the memory parameters as primary ways to influence the query planner.
If you are using PostgreSQL 8.3 or earlier versions, and you are using the database's table inheritance feature to partition your data, you'll need to turn this parameter on.
Starting in 8.4, constraint_exclusion defaults to a new smarter setting named partition that will do the right thing in most situations without it ever needing to be adjusted.
Tunables to avoid
There are a few parameters in the postgesql.conf that have gathered up poor guidance in other guides you might come across, and they might already be set badly in a server whose configuration you're now responsible for. Others have names suggesting a use for the parameter that actually doesn't exist. This section warns you about the most common of those to avoid adjusting.
If you just want to ignore crash recovery altogether, you can do that by turning off the fsync parameter. This makes the value for wal_sync_method irrelevant, because the server won't be doing any WAL sync calls anymore.
It is important to recognize that if you have any sort of server crash when fsync is disabled, it is likely your database will be corrupted and no longer start afterwards. Despite this being a terrible situation to be running a database under, the performance speedup of turning crash recovery off is so large that you might come across suggestions you disable fsync anyway. You should be equally hesitant to trust any other advice you receive from sources suggesting this, as it is an unambiguously dangerous setting to disable.
One reason this idea gained traction is that in earlier PostgreSQL versions, there was no way to reduce the number of fsync calls to a lower number—to trade-off some amount of reliability for performance. Starting 8.3, in most cases where people used to disable fsync it's a better idea to turn off synchronous_commit instead.
There is one case where fsync=off may still make sense: initial bulk loading. If you're inserting a very large amount of data into the database, and do not have hardware with a battery-backed write cache, you might discover this takes far too long to ever be practical. In this case, turning the parameter off during the load— where all data can easily be recreated if there is a crash causing corruption—may be the only way to get loading time below your target. Once your server is back up again, you should turn it right back on again.
Some systems will also turn off fsync on servers with redundant copies of the database— for example, slaves used for reporting purposes. These can always resynchronize against the master if their data gets corrupted.
Much like fsync, turning this parameter off increases the odds of database corruption in return for an increase in performance. You should only consider adjusting this parameter if you're doing extensive researching into your filesystem and hardware, in order to assure partial page writes do not happen.
commit_delay and commit_siblings
Before synchronous_commit was implemented, there was an earlier attempt to add that sort of feature enabled by the commit_delay and commit_siblings parameters. These are not effective parameters to tune in most cases. It is extremely difficult to show any speedup by adjusting them, and quite easy to slow every transaction down by tweaking them. The only case where they have shown some value is for extremely high I/O rate systems. Increasing the delay to a very small amount can make writes happen in bigger blocks, which sometimes turn out better aligned when combined with larger RAID stripe sizes in particular.
Many people see this name and assume that as they use prepared statements, a common technique to avoid SQL injection, that they need to increase this value. This is not the case; the two are not related. A prepared transaction is one that uses PREPARE TRANSACTION for two-phase commit (2PC). If you're not specifically using that command and 2PC, you can leave this value at its default. If you are using those features, only then will you likely need to increase it to match the number of connections.
Query enable parameters
It's possible to disable many of the query planner's techniques, in hopes of avoiding a known bad type of query. This is sometimes used as a work-around for the fact that PostgreSQL doesn't support direct optimizer hints for how to execute a query. You might see the following code snippet, suggested as a way to force use of indexes instead of sequential scans for example:
enable_seqscan = off
Generally this is a bad idea, and you should improve the information the query optimizer is working with so it makes the right decisions instead.
There are almost 200 values you might adjust in a PostgreSQL database's configuration, and getting them all right for your application can be quite a project. The guidelines here should get you into the general area where you should start at though, avoid the most common pitfalls, and give you an idea what settings are more likely to be valuable when you do run into trouble.
- The default values in the server configuration file are very short on logging information and have extremely small memory settings. Every server should get at least a basic round of tuning to work around the worst of the known issues.
- The memory-based tunables, primarily shared_buffers and work_mem, need to be adjusted carefully and in unison to make sure your system doesn't run out of memory altogether.
- The query planner needs to know about the memory situation and have good table statistics in order to make accurate plans.
- The autovacuum process is also critical to make sure the query planner has the right information to work with, as well as to keep tables maintained properly.
- In many cases, the server does not need to be restarted to make a configuration change, and many parameters can even be adjusted on a per-client basis for really fine-tuning.