UNIX Monitoring Tool for PostgreSQL

by Gregory Smith | October 2010 | Open Source

The performance of your database server is directly tied to how well the underlying operating system is working, and performance there is in turn driven by the hardware you're using. To fit all of these pieces together (hardware performance, operating system performance, and database performance), you need a good monitoring system.

The simple performance tools on a UNIX-derived system are straightforward to use, and it's easy to show examples of good and bad behavior, which is the best way to teach how those tools are useful for monitoring. In this article by Gregory Smith, author of PostgreSQL 9.0 High Performance, we will cover iostat, a UNIX monitoring tool.

iostat

The data vmstat gives is a total across all devices on the system. If you want totals per disk device instead, you need to use iostat.

On Linux, iostat defaults to slightly different behavior than vmstat. When it says "block", it means a 512 byte chunk of data, not the 1024 byte chunk vmstat uses. You can switch iostat to kilobyte units with iostat -k, or you can just divide all the figures by two to get them on the same scale. Here's an example of the same data shown both ways:

$ iostat
Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda1 0.07 3.29 0.24 1579784 115560
$ iostat -k
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda1 0.07 1.64 0.12 789892 57780

Since not all UNIX versions will have the kilobyte option available, the examples here all use the default 512 byte blocks, and accordingly halve the block figures to interpret using kilobyte units.
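
If you'd rather not do that division by hand, a small awk filter can rescale the default output for you. This is only a sketch, and it assumes the Linux column layout shown above (device name, tps, then blocks read and written per second); adjust the field numbers if your iostat prints something different:

$ iostat 5 3 | awk '
  # Columns assumed: Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
  # Blocks here are 512 bytes, so dividing by 2048 turns blocks/s into MB/s
  /^sd|^md/ { printf "%-6s %8.2f MB/s read %8.2f MB/s written\n", $1, $3/2048, $4/2048 }'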

You'll likely find that you need to average iostat data over a slightly longer period of time than vmstat data. A single second of vmstat data is a summary of all the disks on the system. A PostgreSQL database goes through several common phases:

  • Just after a checkpoint: heavy full-page writes to the WAL, fewer writes to the database disks because there are fewer dirty buffer evictions.
  • Between checkpoints: an even mix of WAL and database writes most of the time.
  • Checkpoint in progress: small to moderate WAL writes; increasingly heavy database writes as checkpoint data is written and starts flowing to disk.
  • Checkpoint sync phase: minimal WAL writes because fewer full-page writes are likely happening; heavy writes to the database disks as all data is flushed out of the OS cache.
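
If you want to tell which of those phases a given sample lines up with, the cumulative counters in pg_stat_bgwriter give you the database's side of the story. Here's a minimal sketch that samples them on the same five second interval used below; the database name is just the one the later pgbench examples use:

$ while sleep 5; do
    psql -d pgbench -At -c "SELECT now()::time(0), checkpoints_timed,
      checkpoints_req, buffers_checkpoint, buffers_clean, buffers_backend
      FROM pg_stat_bgwriter"
  done

It's the change in those counters between samples, rather than their absolute values, that tells you whether a checkpoint is in progress.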

If you are looking at the vmstat data, or if you don't have the pg_xlog WAL data broken out onto a separate disk, you can't see the balance of database vs. WAL writes change; you just see a total. But if you're grabbing really short iostat snapshots, you're likely to see writes bounce between the WAL and database disks, with the exact pattern depending on where you are in the checkpoint cycle. You need to combine a few seconds of data (5 seconds is used for these examples) in order to have both types of writes be usefully averaged out:

$ iostat 5
avg-cpu: %user %nice %system %iowait %steal %idle
42.69 0.00 18.07 6.69 0.30 32.25

Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.00 0.00 0.00 0 0
sda1 0.00 0.00 0.00 0 0
sdc 80.80 0.00 1286.40 0 6432
sdc1 80.80 0.00 1286.40 0 6432
sdd 77.80 0.00 1251.20 0 6256
sdd1 77.80 0.00 1251.20 0 6256
sde 69.40 0.00 1086.40 0 5432
sde1 69.40 0.00 1086.40 0 5432
sdf 2348.20 0.00 88262.40 0 441312
sdf1 2348.20 0.00 88262.40 0 441312
md0 311.40 0.00 2491.20 0 12456

Since all of the activity relates to the single partition on these disks, there's a lot of redundant data in here. You should also note that many of the statistics for the software RAID volume used here are not very interesting—you have to look at the underlying physical disk devices instead. If you're using hardware RAID, that particular problem will go away, but you won't have any easy way to get actual disk performance information out of that abstraction layer either; you'll just see the summary for the whole logical RAID device. The following examples eliminate all the redundant lines, and place the md0 array device between its individual components and the device the WAL is on (sdf1), for easier readability.
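
One low-tech way to trim that redundancy when watching live data is to filter the output down to just the devices you care about. This sketch simply greps for the partition and array names used on this particular server (plus the header line); substitute your own device names, and note that it also drops the avg-cpu summary:

$ iostat -x 5 3 | grep -E 'Device|sd[c-f]1|md0'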

Examples of good performance

When busy but not overloaded, iostat data for this system looks like the following:

$ iostat 5
avg-cpu: %user %nice %system %iowait %steal %idle
18.54 0.00 9.45 23.49 0.15 48.38

Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 1068.80 0.00 15740.80 0 78704
sdd1 1041.80 0.00 15459.20 0 77296
sde1 1028.00 0.00 15377.60 0 76888
md0 5969.20 0.00 47753.60 0 238768
sdf1 989.00 0.00 40449.60 0 202248

The %iowait figure of 23% is high enough to know the disks are busy, but not completely saturated yet. This is showing 20 MB/s (40449.6 512-byte blocks per second) being written to the WAL and 24 MB/s to the entire database disk array, the latter of which is evenly split as almost 8 MB/s to each of the three drives.

Linux also features an extended iostat mode. This produces a large number of derived statistics from the underlying data. Since that's too wide to display here, the first example, showing all of the data, has been transposed to swap its rows and columns:

$ iostat -x 5
sdc1 sdd1 sde1 md0 sdf1
rrqm/s 0 0 0 0 0
wrqm/s 411.8 404.6 396.2 0 3975.4
r/s 0 0 0 0 0
w/s 438.6 442 444.2 2461.4 1229.8
rsec/s 0 0 0 0 0
wsec/s 6956.8 6966.4 6915.2 19691.2 41643.2
avgrq-sz 15.86 15.76 15.57 8 33.86
avgqu-sz 67.36 67.09 62.93 0 0.65
await 158.18 158.85 148.39 0 0.55
svctm 1.2 1.2 1.19 0 0.51
%util 52.8 52.88 53.04 0 63.04

All of the values here with a "q" in them (most of what's listed on the following bulleted line) represent figures related to the read or write queues on these devices. Since the queue size doesn't correspond with any real-world figure you can benchmark the device against, it's hard to do anything with that data. The number of read and write requests is similarly useless in a database context. The following fields of iostat -x data are therefore not that useful here:

  • rrqm/s, wrqm/s, r/s, w/s, avgrq-sz, avgqu-sz

Solaris has a similar extended mode available using iostat -xc.

This next example is similar to the iostat one given previously:

$ iostat -x 5
avg-cpu: %user %nice %system %iowait %steal %idle
21.51 0.00 11.08 23.75 0.10 43.56

Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc1 0.00 6956.80 15.86 67.36 158.18 1.20 52.80
sdd1 0.00 6966.40 15.76 67.09 158.85 1.20 52.88
sde1 0.00 6915.20 15.57 62.93 148.39 1.19 53.04
md0 0.00 19691.20 8.00 0.00 0.00 0.00 0.00
sdf 0.00 41643.20 33.86 0.65 0.55 0.51 63.04

That's 21 MB/s written to the WAL and about 10 MB/s to the database array, roughly 3.5 MB/s to each of its disks. However, recall that the total disk read or write throughput available depends heavily on how random the workload is, which is normally a hard thing to estimate. The %util figure, which is by far the most valuable of the derived figures shown here, gives you a rough idea of that by noting how congested the device is to achieve that throughput. In this next example, there's minimal database I/O and heavy WAL I/O, typical of the period just after a checkpoint:

$ iostat -x 5
avg-cpu: %user %nice %system %iowait %steal %idle
49.35 0.00 22.00 3.80 0.25 24.60

Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc1 0.00 2649.10 15.01 0.76 4.31 0.06 1.04
sdd1 0.00 2895.01 14.95 0.90 4.64 0.06 1.12
sde1 0.00 2728.94 15.06 0.82 4.51 0.06 1.04
md0 0.00 8273.05 8.00 0.00 0.00 0.00 0.00
sdf1 0.00 103760.48 38.11 0.23 0.09 0.09 23.47

This is happily getting >50 MB/s out of the WAL volume but it's still only busy 23.5% of the time. This suggests writes to it are being cached by the disk controller and written quite efficiently. One of the reasons to break out the WAL onto its own disk is that it makes it easy to monitor this balance between WAL and database writes, and to determine whether the WAL volume (which normally only gets sequential writes) is keeping up. Since there are techniques to accelerate WAL writes at the expense of something else, such as switching to an unjournaled filesystem, the %util figure can help you determine when the WAL is the system bottleneck, and therefore when it's worth accelerating it that way.
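
Since %util is the figure you'll end up watching most often for that, it's easy to have a script nag you whenever a device stays congested. The following is just a sketch: it assumes %util is the last column of iostat -x output, as it is in the sysstat versions shown here, and the 90 percent threshold is an arbitrary example:

$ iostat -x 5 3 | awk '
  # Flag any device whose %util (assumed to be the last column) crosses 90
  /^sd|^md/ { if ($NF + 0 > 90) printf "%s is %.1f%% utilized\n", $1, $NF }'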

A final example of good performance involves the database disks. There are some operations in PostgreSQL that can bypass writing to the WAL. For example, if you start a transaction that creates a new table and does a COPY into it, as long as you don't have PITR archiving turned on, that data is not put through the WAL before being written to disk. The idea is that if the server crashes, the whole transaction will be rolled back anyway, which includes deleting the table data; therefore, whether it's consistent or not at the block level doesn't matter.
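
A minimal sketch of that pattern looks like the following. The target table and file names are invented for this example (the source table name matches what recent pgbench versions create), and the WAL bypass only applies when the table is created in the same transaction as the COPY and archiving is off:

$ psql -d pgbench <<'EOF'
BEGIN;
-- Creating the table in the same transaction as the COPY is what lets
-- PostgreSQL skip WAL for the loaded data (PITR archiving must be off)
CREATE TABLE accounts_load (LIKE pgbench_accounts);
-- Server-side COPY FROM a file needs superuser rights; the path is made up
COPY accounts_load FROM '/tmp/accounts.dat';
COMMIT;
EOF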

Here is what the database disks are capable of when running such a COPY, which essentially turns into sequential write I/O directly to the database:

$ iostat -x 5
avg-cpu: %user %nice %system %iowait %steal %idle
16.39 0.00 6.85 12.84 0.00 63.92

Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc1 25.60 58710.40 249.09 27.22 115.43 1.19 28.08
sdd1 24.00 58716.80 249.11 27.76 117.71 1.20 28.24
sde1 1.60 58667.20 250.51 28.31 120.87 1.14 26.80
md0 51.20 176094.40 8.00 0.00 0.00 0.00 0.00
sdf1 0.00 0.00 0.00 0.00 0.00 0.00 0.00

This is over 29 MB/s being written to each database disk, for a total of 88 MB/s to the RAID 0 array, and even that isn't fully utilizing the disks, as shown by the %util at about 28%. Given that this is a four-core server and the COPY is the only process running, a %user of 16 means that about 64% of a single CPU is busy here. The CPU and disks are likely waiting for each other a bit in this situation, and you might have to improve both to significantly speed this up. This example is from a server with a battery-backed RAID controller; without one, it's much easier to run into one of the disk bottlenecks here before the CPU ones.

Final iostat hint: on some versions you can switch the output to use megabytes/second as its units, which is often the easiest format to read. The following syntax, for example, usually makes for a good summary on Linux systems:

$ iostat -x -m 5

Overloaded system samples

To get a more realistic workload, the next few samples use a much larger pgbench database (scale of 1000) and more clients (64):

$ pgbench -i -s 1000 pgbench
$ pgbench -j 4 -c 64 -T 300 pgbench
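
Initializing at a scale of 1000 produces a fairly large database, and it's worth confirming just how much disk space the result takes up relative to the RAM in your server. A quick check looks like the following:

$ psql -d pgbench -At -c "SELECT pg_size_pretty(pg_database_size('pgbench'))"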

This test run gives about 400 TPS on this server. The following snapshot shows one type of problem you can discover from the vmstat data:

$ vmstat 1
procs ----io---- --system--- -----cpu------
r b bi bo in cs us sy id wa st
3 62 4916 34296 11504 23871 8 6 0 85 0
2 64 7132 35828 13247 32406 11 8 0 81 0
4 63 6120 40756 11313 29722 35 7 0 58 0
0 48 3896 13712 6953 19544 28 3 2 66 1
0 12 400 25564 2222 3417 0 1 25 73 0
0 5 44 3700 818 1020 0 0 39 61 0
1 12 64 8364 1388 1773 6 0 44 50 0
1 45 704 7204 2133 3883 23 1 3 73 0
5 60 2912 26148 8677 15774 17 3 1 79 0
0 62 2376 15900 5648 12084 3 2 0 95 0

As mentioned, the spots where the cs figures drop dramatically (while the system was under heavy load) represent a drop in total system throughput. This example is a bit different, because the wa actually drops too; the system is so overloaded that it isn't even generating a full-sized write load. This happens when the server gets bogged down to the point that it even stops servicing client work, typically because of lock contention, and you can see that from how all the user time has disappeared too. Also, when in a functional state, you can see most of the 64 clients (as well as some other database and system processes) in either the running or sleeping category. During the worst cs entry here, a mere five of those client processes show up as even waiting on the server (the b column) during the one second interval. This profile is common when the cache on a disk controller has completely filled up, and clients are all stuck waiting for WAL data to flush to disk.
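
When you suspect that kind of stall is lock contention rather than pure disk congestion, counting how many sessions are blocked waiting on a lock is a quick way to confirm it. This sketch uses the waiting flag in pg_stat_activity, which is available in the PostgreSQL versions covered here:

$ psql -d pgbench -At -c "SELECT count(*) FROM pg_stat_activity WHERE waiting"

If that count regularly approaches the number of clients during the slow spots, contention rather than raw I/O is the more likely culprit.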

Sometimes the slowdown is purely I/O though, as in the following example:

$ iostat -x 5
avg-cpu: %user %nice %system %iowait %steal %idle
5.21 0.00 2.80 91.74 0.25 0.00
Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc1 2625.60 2057.60 14.67 38.79 101.47 3.14 100.08
sdd1 2614.40 2000.00 14.90 57.96 162.95 3.23 100.08
sde1 2736.00 1963.20 14.64 48.50 135.26 3.12 100.00
md0 7916.80 7206.40 10.60 0.00 0.00 0.00 0.00
sdf1 0.00 22190.40 50.20 0.84 1.79 1.34 59.20

The first thing to notice here is that the %util figure is rounded badly so it looks like over 100% in spots. That's a result of some sloppy accounting and computation, and not anything to be worried about. Next, note that the WAL is only getting 11 MB/s of writes to it. Meanwhile, the database disks are 100% utilized, but are actually processing under 8 MB/s in total (an even mix of reads and writes). This is what it looks like when the database has heavy random I/O. These underlying disks are only capable of about 2 MB/s of true random I/O, and sure enough they aren't doing much better than that even with a caching controller sitting in the middle to buffer and sort. This is typical of the checkpoint sync phase, where a large amount of random I/O has been pushed into the write cache and is now being forced out to disk.
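
If you want to confirm that a burst like this really lines up with the checkpoint sync phase, turning on log_checkpoints makes the server log when each checkpoint starts and finishes, including a breakdown of its write and sync times on recent versions. The data directory path below is just an example for this sketch:

$ psql -c "SHOW log_checkpoints"
$ # Set log_checkpoints = on in postgresql.conf (location varies by install),
$ # then have the server re-read its configuration:
$ pg_ctl reload -D /var/lib/pgsql/data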

To show some really unpleasant performance, now let's crank the scale up to 4000:

$ pgbench -i -s 4000 pgbench
$ pgbench -j 4 -c 64 -T 300 pgbench

This managed 250 TPS, but with very intermittent processing. A particularly bad period looks like the following:

$ vmstat 1
procs ----io---- --system--- -----cpu------
r b bi bo in cs us sy id wa st
1 63 5444 9864 8576 18268 4 3 0 92 0
0 42 4784 13916 7969 16716 4 2 3 90 0
0 59 464 4704 1279 1695 0 0 25 75 0
0 54 304 4736 1396 2147 0 0 25 74 0
0 42 264 5000 1391 1796 0 0 10 90 0
0 42 296 4756 1444 1996 1 0 10 89 0
0 29 248 5112 1331 1720 0 0 25 75 0
0 47 368 4696 1359 2113 0 0 23 76 0
1 48 344 5352 1287 1875 0 0 0 100 0
0 64 2692 12084 5604 9474 8 2 0 90 0
1 63 5376 9944 8346 18003 4 3 20 74 0
0 64 5404 10128 8890 18847 4 3 25 67 0

That's an 8 second long period of seriously degraded performance. Note the low cs counts, that the total procs figures drop below the number of clients, and that there's no user time. By now these should be familiar characteristics of all the clients getting stuck waiting for something. Here the cause is pretty obvious: wa is high the whole time and even hits a full 100%, showing the server just can't keep up with the disk I/O load.

You might wonder what that disk load looks like through the extended iostat data; here's a similar period shown next:

$ iostat -x 5
avg-cpu: %user %nice %system %iowait %steal %idle
2.35 0.00 1.85 85.49 0.15 10.16

Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc1 2310.40 1169.60 14.63 39.85 147.59 4.21 100.00
sdd1 2326.40 1216.00 14.53 71.41 264.60 4.10 100.00
sde1 2438.40 1230.40 14.77 47.88 157.12 4.01 99.60
md0 7044.80 4820.80 11.12 0.00 0.00 0.00 0.00
sdf1 0.00 19040.00 70.31 4.20 15.31 2.10 56.88

Note how low the read/write rates are, while still having 100% utilization. This shows another I/O load heavy on random reads and writes. Even without the Linux specific iostat -x data, you could also tell that from the combination of extremely high %iowait with low throughput:

$ iostat 5
avg-cpu: %user %nice %system %iowait %steal %idle
1.35 0.00 0.55 88.99 0.10 9.02

Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 252.80 1236.80 2550.40 6184 12752
sdd1 268.80 1377.60 2611.20 6888 13056
sde1 264.40 1312.00 2622.40 6560 13112
md0 1234.20 3924.80 7784.00 19624 38920
sdf1 118.00 0.00 6380.80 0 31904

We know that on sequential I/O these disks are capable of much higher throughputs than this. So when the server is at a %iowait of 89% but only managing to write less than 4 MB/s to the database disks, it's sensible to conclude they are coping with a mix of random read and write requests instead.
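
If you'd rather put a number on that sequential baseline than rely on memory, a crude dd test against the same array gives a rough figure to compare with. The target path here is made up; point it at a scratch file on the database array, and remember it writes several gigabytes of real data:

$ # Write roughly 8 GB of sequential data, forcing it out to the disks at the end
$ dd if=/dev/zero of=/db-array/dd-test bs=8k count=1000000 conv=fdatasync
$ rm /db-array/dd-test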

Summary

In this article we saw how to use basic operating system monitoring tools to determine what the database is doing. Running iostat allows you to see the balance of reads and writes to the various disks you've split your database and application over.

