Introducing PostgreSQL 9

In this article by Simon Riggs, Gianni Ciolli, Hannu Krosing, Gabriele Bartolini, the authors of PostgreSQL 9 Administration Cookbook - Second Edition, we will introduce PostgreSQL 9. PostgreSQL is a feature-rich, general-purpose database management system. It's a complex piece of software, but every journey begins with the first step.

(For more resources related to this topic, see here.)

We'll start with your first connection. Many people fall at the first hurdle, so we'll try not to skip that too swiftly. We'll quickly move on to enabling remote users, and from there, we will move to access through GUI administration tools.

We will also introduce the psql query tool.

PostgreSQL is an advanced SQL database server, available on a wide range of platforms. One of the clearest benefits of PostgreSQL is that it is open source, meaning that you have a very permissive license to install, use, and distribute PostgreSQL without paying anyone fees or royalties. On top of that, PostgreSQL is well-known as a database that stays up for long periods and requires little or no maintenance in most cases. Overall, PostgreSQL provides a very low total cost of ownership.

PostgreSQL is also noted for its huge range of advanced features, developed over the course of more than 20 years of continuous development and enhancement. Originally developed by the Database Research Group at the University of California, Berkeley, PostgreSQL is now developed and maintained by a huge army of developers and contributors. Many of those contributors have full-time jobs related to PostgreSQL, working as designers, developers, database administrators, and trainers. Some, but not many, of those contributors work for companies that specialize in support for PostgreSQL, like we (the authors) do. No single company owns PostgreSQL, nor are you required (or even encouraged) to register your usage.

PostgreSQL has the following main features:

  • Excellent SQL standards compliance up to SQL:2011
  • Client-server architecture
  • Highly concurrent design where readers and writers don't block each other
  • Highly configurable and extensible for many types of applications
  • Excellent scalability and performance with extensive tuning features
  • Support for many kinds of data models: relational, document (JSON and XML), and key/value

What makes PostgreSQL different?

The PostgreSQL project focuses on the following objectives:

  • Robust, high-quality software with maintainable, well-commented code
  • Low maintenance administration for both embedded and enterprise use
  • Standards-compliant SQL, interoperability, and compatibility
  • Performance, security, and high availability

What surprises many people is that PostgreSQL's feature set is more comparable with Oracle or SQL Server than it is with MySQL. The only connection between MySQL and PostgreSQL is that these two projects are open source; apart from that, the features and philosophies are almost totally different.

One of the key features of Oracle, since Oracle 7, has been snapshot isolation, where readers don't block writers and writers don't block readers. You may be surprised to learn that PostgreSQL was the first database to be designed with this feature, and it offers a complete implementation. In PostgreSQL, this feature is called Multiversion Concurrency Control (MVCC).

PostgreSQL is a general-purpose database management system. You define the database that you would like to manage with it. PostgreSQL offers you many ways to work. You can use a normalized database model, augmented with features such as arrays and record subtypes, or use a fully dynamic schema with the help of JSONB and an extension named hstore. PostgreSQL also allows you to create your own server-side functions in any of a dozen different languages.

PostgreSQL is highly extensible, so you can add your own data types, operators, index types, and functional languages. You can even override different parts of the system using plugins to alter the execution of commands or add a new optimizer.

All of these features offer a huge range of implementation options to software architects. There are many ways out of trouble when building applications and maintaining them over long periods of time.

In the early days, when PostgreSQL was still a research database, the focus was solely on the cool new features. Over the last 15 years, enormous amounts of code have been rewritten and improved, giving us one of the most stable and largest software servers available for operational use.

You may have read that PostgreSQL was, or is, slower than My Favorite DBMS, whichever that is. It's been a personal mission of mine over the last ten years to improve server performance, and the team has been successful in making the server highly performant and very scalable. That gives PostgreSQL enormous headroom for growth.

Who is using PostgreSQL? Prominent users include Apple, BASF, Genentech, Heroku, IMDB.com, Skype, McAfee, NTT, The UK Met Office, and The U. S. National Weather Service. 5 years ago, PostgreSQL received well in excess of 1 million downloads per year, according to data submitted to the European Commission, which concluded, "PostgreSQL is considered by many database users to be a credible alternative."

We need to mention one last thing. When PostgreSQL was first developed, it was named Postgres, and therefore many aspects of the project still refer to the word "postgres"; for example, the default database is named postgres, and the software is frequently installed using the postgres user ID. As a result, people shorten the name PostgreSQL to simply Postgres, and in many cases use the two names interchangeably.

PostgreSQL is pronounced as "post-grez-q-l". Postgres is pronounced as "post-grez."

Some people get confused, and refer to "Postgre", which is hard to say, and likely to confuse people. Two names are enough, so please don't use a third name!

The following sections explain the key areas in more detail.

Robustness

PostgreSQL is robust, high-quality software, supported by automated testing for both features and concurrency. By default, the database provides strong disk-write guarantees, and the developers take the risk of data loss very seriously in everything they do. Options to trade robustness for performance exist, though they are not enabled by default.

All actions on the database are performed within transactions, protected by a transaction log that will perform automatic crash recovery in case of software failure.

Databases may be optionally created with data block checksums to help diagnose hardware faults. Multiple backup mechanisms exist, with full and detailed Point-In-Time Recovery, in case of the need for detailed recovery. A variety of diagnostic tools are available.

Database replication is supported natively. Synchronous Replication can provide greater than "5 Nines" (99.999 percent) availability and data protection, if properly configured and managed.

Security

Access to PostgreSQL is controllable via host-based access rules. Authentication is flexible and pluggable, allowing easy integration with any external security architecture.

Full SSL-encrypted access is supported natively. A full-featured cryptographic function library is available for database users.

PostgreSQL provides role-based access privileges to access data, by command type.

Functions may execute with the permissions of the definer, while views may be defined with security barriers to ensure that security is enforced ahead of other processing.

All aspects of PostgreSQL are assessed by an active security team, while known exploits are categorized and reported at http://www.postgresql.org/support/security/.

Ease of use

Clear, full, and accurate documentation exists as a result of a development process where doc changes are required. Hundreds of small changes occur with each release that smooth off any rough edges of usage, supplied directly by knowledgeable users.

PostgreSQL works in the same way on small or large systems and across operating systems.

Client access and drivers exist for every language and environment, so there is no restriction on what type of development environment is chosen now, or in the future.

SQL Standard is followed very closely; there is no weird behavior, such as silent truncation of data.

Text data is supported via a single data type that allows storage of anything from 1 byte to 1 gigabyte. This storage is optimized in multiple ways, so 1 byte is stored efficiently, and much larger values are automatically managed and compressed.

PostgreSQL has a clear policy to minimize the number of configuration parameters, and with each release, we work out ways to auto-tune settings.

Extensibility

PostgreSQL is designed to be highly extensible. Database extensions can be loaded simply and easily using CREATE EXTENSION, which automates version checks, dependencies, and other aspects of configuration.

PostgreSQL supports user-defined data types, operators, indexes, functions and languages.

Many extensions are available for PostgreSQL, including the PostGIS extension that provides world-class Geographical Information System (GIS) features.

Performance and concurrency

PostgreSQL 9.4 can achieve more than 300,000 reads per second on a 32-CPU server, and it benchmarks at more than 20,000 write transactions per second with full durability.

PostgreSQL has an advanced optimizer that considers a variety of join types, utilizing user data statistics to guide its choices.

PostgreSQL provides MVCC, which enables readers and writers to avoid blocking each other.

Taken together, the performance features of PostgreSQL allow a mixed workload of transactional systems and complex search and analytical tasks. This is important because it means we don't always need to unload our data from production systems and reload them into analytical data stores just to execute a few ad hoc queries. PostgreSQL's capabilities make it the database of choice for new systems, as well as the right long-term choice in almost every case.

Scalability

PostgreSQL 9.4 scales well on a single node up to 32 CPUs. PostgreSQL scales well up to hundreds of active sessions, and up to thousands of connected sessions when using a session pool. Further scalability is achieved in each annual release.

PostgreSQL provides multinode read scalability using the Hot Standby feature. Multinode write scalability is under active development. The starting point for this is Bi-Directional Replication.

SQL and NoSQL

PostgreSQL follows SQL Standard very closely. SQL itself does not force any particular type of model to be used, so PostgreSQL can easily be used for many types of models at the same time, in the same database. PostgreSQL supports the more normal SQL language statement.

With PostgreSQL acting as a relational database, we can utilize any level of denormalization, from the full Third Normal Form, to the more normalized Star Schema models. PostgreSQL extends the relational model to provide arrays, row types, and range types.

A document-centric database is also possible using PostgreSQL's text, XML, and binary JSON (JSONB) data types, supported by indexes optimized for documents and by full text search capabilities.

Key/value stores are supported using the hstore extension.

Popularity

When MySQL was taken over some years back, it was agreed in the EU monopoly investigation that followed that PostgreSQL was a viable competitor. That's been certainly true, with the PostgreSQL user base expanding consistently for more than a decade.

Various polls have indicated that PostgreSQL is the favorite database for building new, enterprise-class applications. The PostgreSQL feature set attracts serious users who have serious applications. Financial services companies may be PostgreSQL's largest user group, though governments, telecommunication companies, and many other segments are strong users as well. This popularity extends across the world; Japan, Ecuador, Argentina, and Russia have very large user groups, and so do USA, Europe, and Australasia.

Amazon Web Services' chief technology officer Dr. Werner Vogels described PostgreSQL as "an amazing database", going on to say that "PostgreSQL has become the preferred open source relational database for many enterprise developers and start-ups, powering leading geospatial and mobile applications".

Commercial support

Many people have commented that strong commercial support is what enterprises need before they can invest in open source technology. Strong support is available worldwide from a number of companies.

2ndQuadrant provides commercial support for open source PostgreSQL, offering 24 x 7 support in English and Spanish with bug-fix resolution times.

EnterpriseDB provides commercial support for PostgreSQL as well as their main product, which is a variant of Postgres that offers enhanced Oracle compatibility.

Many other companies provide strong and knowledgeable support to specific geographic regions, vertical markets, and specialized technology stacks.

PostgreSQL is also available as hosted or cloud solutions from a variety of companies, since it runs very well in cloud environments.

A full list of companies is kept up to date at http://www.postgresql.org/support/professional_support/.

Research and development funding

PostgreSQL was originally developed as a research project at the University of California, Berkeley in the late 1980s and early 1990s. Further work was carried out by volunteers until the late 1990s. Then, the first professional developer became involved. Over time, more and more companies and research groups became involved, supporting many professional contributors. Further funding for research and development was provided by the NSF. The project also received funding from the EU FP7 Programme in the form of the 4CaaST project for cloud computing and the AXLE project for scalable data analytics. AXLE deserves a special mention because it is a 3-year project aimed at enhancing PostgreSQL's business intelligence capabilities, specifically for very large databases. The project covers security, privacy, integration with data mining, and visualization tools and interfaces for new hardware. Further details of it are available at http://www.axleproject.eu.

Other funding for PostgreSQL development comes from users who directly sponsor features and companies selling products and services based around PostgreSQL.

Monitoring

Databases are not isolated entities. They live on computer hardware using CPUs, RAM, and disk subsystems. Users access databases using networks. Depending on the setup, databases themselves may need network resources to function in any of the following ways: performing some authentication checks when users log in, using disks that are mounted over the network (not generally recommended), or making remote function calls to other databases.

This means that monitoring only the database is not enough. As a minimum, one should also monitor everything directly involved in using the database. This means knowing the following:

  • Is the database host available? Does it accept connections?
  • How much of the network bandwidth is in use? Have there been network interruptions and dropped connections?
  • Is there enough RAM available for the most common tasks? How much of it is left?
  • Is there enough disk space available? When will it run out of disk space?
  • Is the disk subsystem keeping up? How much more load can it take?
  • Can the CPU keep up with the load? How many spare idle cycles do the CPUs have?
  • Are other network services the database access depends on (if any) available? For example, if you use Kerberos for authentication, you need to monitor it as well.
  • How many context switches are happening when the database is running?
  • For most of these things, you are interested in history; that is, how have things evolved? Was everything mostly the same yesterday or last week?
  • When did the disk usage start changing rapidly?
  • For any larger installation, you probably have something already in place to monitor the health of your hosts and network.

The two aspects of monitoring are collecting historical data to see how things have evolved and getting alerts when things go seriously wrong. Tools based on Round Robin Database Tool (RRDtool) such as Cacti and Munin are quite popular for collecting the historical information on all aspects of the servers and presenting this information in an easy-to-follow graphical form. Seeing several statistics on the same timescale can really help when trying to figure out why the system is behaving the way it is.

Another popular open source solution is Ganglia, a distributed monitoring solution particularly suitable for environments with several servers and in multiple locations.

Another aspect of monitoring is getting alerts when something goes really wrong and needs (immediate) attention.

For alerting, one of the most widely used tools is Nagios, with its fork (Icinga) being an emerging solution. The aforementioned trending tools can integrate with Nagios.

However, if you need a solution for both the alerting and trending aspects of a monitoring tool, you might want to look into Zabbix.

Then, of course, there is Simple Network Management Protocol (SNMP), which is supported by a wide array of commercial monitoring solutions. Basic support for monitoring PostgreSQL through SNMP is found in pgsnmpd. This project does not seem very active though. However, you can find more information about pgsnmpd and download it from http://pgsnmpd.projects.postgresql.org/.

Providing PostgreSQL information to monitoring tools

Historical monitoring information is best to use when all of it is available from the same place and at the same timescale. Most monitoring systems are designed for generic purposes, while allowing application and system developers to integrate their specific checks with the monitoring infrastructure. This is possible through a plugin architecture. Adding new kinds of data inputs to them means installing a plugin. Sometimes, you may need to write or develop this plugin, but writing a plugin for something such as Cacti is easy. You just have to write a script that outputs monitored values in simple text format.

In most common scenarios, the monitoring system is centralized and data is collected directly (and remotely) by the system itself or through some distributed components that are responsible for sending the observed metrics back to the main node.

As far as PostgreSQL is concerned, some useful things to include in graphs are the number of connections, disk usage, number of queries, number of WAL files, most numbers from pg_stat_user_tables and pg_stat_user_indexes, and so on, as shown here:

PostgreSQL 9 Administration Cookbook - Second Edition

An example of a dashboard in Cacti

The preceding Cacti screenshot includes data for CPU, disk, and network usage; pgbouncer connection pooler; and the number of PostgreSQL client connections. As you can see, they are nicely correlated.

One Swiss Army knife script, which can be used from both Cacti and Nagios/Icinga, is check_postgres. It is available at http://bucardo.org/wiki/Check_postgres. It has ready-made reporting actions for a large array of things worth monitoring in PostgreSQL.

For Munin, there are some PostgreSQL plugins available at the Munin plugin repository at https://github.com/munin-monitoring/contrib/tree/master/plugins/postgresql.

The following screenshot shows a Munin graph about PostgreSQL buffer cache hits for a specific database, where cache hits (blue line) dominate reads from the disk (green line):

PostgreSQL 9 Administration Cookbook - Second Edition

Finding more information about generic monitoring tools

Setting up the tools themselves is a larger topic. In fact, each of these tools has more than one book written about them. The basic setup information and the tools themselves can be found at the following URLs:

Real-time viewing using pgAdmin

You can also use pgAdmin to get a quick view of what is going on in the database. For better control, you need to install the adminpack extension in the destination database, by issuing this command:

CREATE EXTENSION adminpack;

This extension is a part of the additionally supplied modules of PostgreSQL (aka contrib). It provides several administration functions that PgAdmin (and other tools) can use in order to manage, control, and monitor a Postgres server from a remote location.

Once you have installed adminpack, connect to the database and then go to Tools | Server Status. This will open a window similar to what is shown in the following screenshot, reporting locks and running transactions:

PostgreSQL 9 Administration Cookbook - Second Edition

Loading data from flat files

Loading data into your database is one of the most important tasks. You need to do this accurately and quickly. Here's how.

Getting ready

You'll need a copy of pgloader, which is available at http://github.com/dimitri/pgloader.

At the time of writing this article, the current stable version is 3.1.0. The 3.x series is a major rewrite, with many additional features, and the 2.x series is now considered obsolete.

How to do it…

PostgreSQL includes a command named COPY that provides the basic data load/unload mechanism. The COPY command doesn't do enough when loading data, so let's skip the basic command and go straight to pgloader.

To load data, we need to understand our requirements, so let's break this down into a step-by-step process, as follows:

  1. Identify the data files and where they are located. Make sure that pgloader is installed at the location of the files.
  2. Identify the table into which you are loading, ensure that you have the permissions to load, and check the available space.
  3. Work out the file type (fixed, text, or CSV) and check the encoding.
  4. Specify the mapping between columns in the file and columns on the table being loaded. Make sure you know which columns in the file are not needed—pgloader allows you to include only the columns you want. Identify any columns in the table for which you don't have data. Do you need them to have a default value on the table, or does pgloader need to generate values for those columns through functions or constants?
  5. Specify any transformations that need to take place. The most common issue is date formats, though possibly there may be other issues.
  6. Write the pgloader script.
  7. pgloader will create a log file to record whether the load has succeeded or failed, and another file to store rejected rows. You need a directory with sufficient disk space if you expect them to be large. Their size is roughly proportional to the number of failing rows.
  8. Finally, consider what settings you need for performance options. This is definitely last, as fiddling with things earlier can lead to confusion when you're still making the load work correctly.
  9. You must use a script to execute pgloader. This is not a restriction; actually it is more like best practice, because it makes it much easier to iterate towards something that works. Loads never work the first time, except in the movies!

Let's look at a typical example from pgloader's documentation—the example.load file:

LOAD CSV
   FROM 'GeoLiteCity-Blocks.csv' WITH ENCODING iso-646-us
       HAVING FIELDS
       (
           startIpNum, endIpNum, locId
       )
   INTO postgresql://user@localhost:54393/dbname?geolite.blocks
       TARGET COLUMNS
       (
           iprange ip4r using (ip-range startIpNum endIpNum),
         locId
       )
   WITH truncate,
       skip header = 2,
       fields optionally enclosed by '"',
       fields escaped by backslash-quote,
       fields terminated by '\t'
 
   SET work_mem to '32 MB', maintenance_work_mem to '64 MB';

We can use the load script like this:

pgloader --summary summary.log example.load

How it works…

pgloader copes gracefully with errors. The COPY command loads all rows in a single transaction, so only a single error is enough to abort the load. pgloader breaks down an input file into reasonably sized chunks, and loads them piece by piece. If some rows in a chunk cause errors, then pgloader will split it iteratively until it loads all the good rows and skips all the bad rows, which are then saved in a separate "rejects" file for later inspection. This behavior is very convenient if you have large data files with a small percentage of bad rows; for instance, you can edit the rejects, fix them, and finally, load them with another pgloader run.

Versions 2.x of pgloader were written in Python and connected to PostgreSQL through the standard Python client interface. Version 3.x is written in Common Lisp. Yes, pgloader is less efficient than loading data files using a COPY command, but running a COPY command has many more restrictions: the file has to be in the right place on the server, has to be in the right format, and must be unlikely to throw errors on loading. pgloader has additional overhead, but it also has the ability to load data using multiple parallel threads, so it can be faster to use as well. pgloader's ability to call out to reformat functions is often essential in most cases; straight COPY is just too simple.

pgloader also allows loading from fixed-width files, which COPY does not.

There's more…

If you need to reload the table completely from scratch, then specify the –WITH TRUNCATE clause in the pgloader script.

There are also options to specify SQL to be executed before and after loading the data. For instance, you may have a script that creates the empty tables before, or you can add constraints after, or both.

After loading, if we have load errors, then there will be some junk loaded into the PostgreSQL tables. It is not junk that you can see, or that gives any semantic errors, but think of it more like fragmentation. You should think about whether you need to add a VACUUM command after the data load, though this will make the load take possibly much longer.

We need to be careful to avoid loading data twice. The only easy way of doing that is to make sure that there is at least one unique index defined on every table that you load. The load should then fail very quickly.

String handling can often be difficult, because of the presence of formatting or nonprintable characters. The default setting for PostgreSQL is to have a parameter named standard_conforming_strings set to off, which means that backslashes will be assumed to be escape characters. Put another way, by default, the \n string means line feed, which can cause data to appear truncated. You'll need to turn standard_conforming_strings to on, or you'll need to specify an escape character in the load-parameter file.

If you are reloading data that has been unloaded from PostgreSQL, then you may want to use the pg_restore utility instead. The pg_restore utility has an option to reload data in parallel, -j number_of_threads, though this is only possible if the dump was produced using the custom pg_dump format. This can be useful for reloading dumps, though it lacks almost all of the other pgloader features discussed here.

If you need to use rows from a read-only text file that does not have errors, and you are using version 9.1 or later of PostgreSQL, then you may consider using the file_fdw contrib module. The short story is that it lets you create a "virtual" table that will parse the text file every time it is scanned. This is different from filling a table once and for all, either with COPY or pgloader; therefore, it covers a different use case. For example, think about an external data source that is maintained by a third party and needs to be shared across different databases.

You may wish to send an e-mail to Dimitri Fontaine, the current author and maintainer of most of pgloader. He always loves to receive e-mails from users.

Summary

PostgreSQL provides a lot of features, which make it the most advanced open source database.

Resources for Article:


Further resources on this subject:


You've been reading an excerpt of:

PostgreSQL 9 Administration Cookbook - Second Edition

Explore Title