PostgreSQL – New Features

In this article, Jayadevan Maymala, author of the book, PostgreSQL for Data Architects, you will see how to troubleshoot the initial hiccups faced by people who are new to PostgreSQL.

We will look at a few useful, but not commonly used data types. We will also cover pgbadger, a nifty third-party tool that can run through a PostgreSQL log. This tool can tell us a lot about what is happening in the cluster. Also, we will look at a few key features that are part of PostgreSQL 9.4 release. We will cover a couple of useful extensions.

(For more resources related to this topic, see here.)

Interesting data types

We will start with the data types. PostgreSQL does have all the common data types we see in databases. These include:

  • The number data types (smallint, integer, bigint, decimal, numeric, real, and double)
  • The character data types (varchar, char, and text)
  • The binary data types
  • The date/time data types (including date, timestamp without timezone, and timestamp with timezone)
  • BOOLEAN data types

However, this is all standard fare. Let's start off by looking at the RANGE data type.


This is a data type that can be used to capture values that fall in a specific range. Let's look at a few examples of use cases.

Cars can be categorized as compact, convertible, MPV, SUV, and so on. Each of these categories will have a price range. For example, the price range of a category of cars can start from $15,000 at the lower end and the price range at the upper end can start from $40,000.

We can have meeting rooms booked for different time slots. Each room is booked during different time slots and is available accordingly.

Then, there are use cases that involve shift timings for employees. Each shift begins at a specific time, ends at a specific time, and involves a specific number of hours on duty. We would also need to capture the swipe-in and swipe-out time for employees.

These are some use cases where we can consider range types. Range is a high-level data type; we can use int4range as the appropriate subtype for the car price range scenario. For the booking the meeting rooms and shifting use cases, we can consider tsrange or tstzrange (if we want to capture time zone as well).

It makes sense to explore the possibility of using range data types in most scenarios, which involve the following features:

  • From and to timestamps/dates for room reservations
  • Lower and upper limit for price/discount ranges
  • Scheduling jobs
  • Timesheets

Let's now look at an example. We have three meeting rooms. The rooms can be booked and the entries for reservations made go into another table (basic normalization principles). How can we find rooms that are not booked for a specific time period, say, 10:45 to 11:15? We will look at this with and without the range data type:

CREATE TABLE rooms(id serial, descr varchar(50));
INSERT INTO rooms(descr)
SELECT concat('Room ', generate_series(1,3));
CREATE TABLE room_book (id serial , room_id integer, from_time 
timestamp, to_time timestamp , res tsrange);   INSERT INTO room_book (room_id,from_time,to_time,res) values(1,'2014-7-30 10:00:00', '2014-7-30 11:00:00', '(2014-7-30
10:00:00,2014-7-30 11:00:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 10:00:00', '2014-7-30 10:40:00', '(2014-7-30
10:00,2014-7-30 10:40:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 11:20:00', '2014-7-30 12:00:00', '(2014-7-30
11:20:00,2014-7-30 12:00:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(3,'2014-7-30 11:00:00', '2014-7-30 11:30:00', '(2014-7-30
11:00:00,2014-7-30 11:30:00)');

PostgreSQL has the OVERLAPS operator. This can be used to get all the reservations that overlap with the period for which we wanted to book a room:

SELECT room_id FROM room_book WHERE (from_time,to_time) OVERLAPS 
('2014-07-30 10:45:00','2014-07-30 11:15:00');

If we eliminate these room IDs from the master list, we have the list of rooms available. So, we prefix the following command to the preceding SQL:

SELECT id FROM rooms

We get a room ID that is not booked from 10:45 to 11:15. This is the old way of doing it. With the range data type, we can write the following SQL statement:

SELECT id FROM rooms
SELECT room_id FROM room_book WHERE res && '(2014-07-30 
10:45:00,2014-07-30 11:15:00)';

Do look up GIST indexes to improve the performance of queries that use range operators.

Another way of achieving the same is to use the following command:

SELECT id FROM rooms
SELECT room_id FROM room_book WHERE
'2014-07-30 10:45:00' < to_time AND '2014-07-30 11:15:00' > 

Now, let's look at the finer points of how a range is represented. The range values can be opened using [ or ( and closed with ] or ). [ means include the lower value and ( means exclude the lower value. The closing (] or )) has a similar effect on the upper values.

When we do not specify anything, [) is assumed, implying include the lower value, but exclude the upper value. Note that the lower bound is 3 and upper bound is 6 when we mention 3,5, as shown here:

SELECT int4range(3,5,'[)') lowerincl ,int4range(3,5,'[]') 
bothincl, int4range(3,5,'()') bothexcl , int4range(3,5,'[)') upperexcl; lowerincl | bothincl | bothexcl | upperexcl -----------+----------+----------+----------- [3,5)       | [3,6)       | [4,5)       | [3,5)

Using network address types

The network address types are cidr, inet, and macaddr. These are used to capture IPv4, IPv6, and Mac addresses. Let's look at a few use cases.

When we have a website that is open to public, a number of users from different parts of the world access it. We may want to analyze the access patterns. Very often, websites can be used by users without registering or providing address information. In such cases, it becomes even more important that we get some insight into the users based on the country/city and similar location information. When anonymous users access our website, an IP is usually all we get to link the user to a country or city. Often, this becomes our not-so-accurate unique identifier (along with cookies) to keep track of repeat visits, to analyze website-usage patterns, and so on.

The network address types can also be useful when we develop applications that monitor a number of systems in different networks to check whether they are up and running, to monitor resource consumption of the systems in the network, and so on.

While data types (such as VARCHAR or BIGINT) can be used to store IP addresses, it's recommended to use one of the built-in types PostgreSQL provides to store network addresses. There are three data types to store network addresses. They are as follows:

  • inet: This data type can be used to store an IPV4 or IPV6 address along with its subnet. The format in which data is to be inserted is Address/y, where y is the number of bits in the netmask.
  • cidr: This data type can also be used to store networks and network addresses. Once we specify the subnet mask for a cidr data type, PostgreSQL will throw an error if we set bits beyond the mask, as shown in the following example:
CREATE TABLE nettb (id serial, intclmn inet, cidrclmn 
cidr); CREATE TABLE INSERT INTO nettb (intclmn , cidrclmn) VALUES
('', ''); INSERT 0 1 INSERT INTO nettb (intclmn , cidrclmn) VALUES
('', ''); ERROR: invalid cidr value: "" LINE 1: ...b (intclmn , cidrclmn) VALUES ('',
^ DETAIL: Value has bits set to right of mask. INSERT INTO nettb (intclmn , cidrclmn) VALUES
('', ''); INSERT 0 1 SELECT * FROM nettb; id |     intclmn     |   cidrclmn     ----+-----------------+----------------- 1 |   | 2 | |

Let's also look at a couple of useful operators available within network address types. Does an IP fall in a subnet? This can be figured out using <<=, as shown here:

SELECT id,intclmn FROM nettb ;
id |   intclmn  
1 |
3 |
4 |
5 |
SELECT id,intclmn FROM nettb where intclmn <<= 
inet''; id |   intclmn   3 | 5 |   SELECT id,intclmn FROM nettb where intclmn <<=
inet''; id |   intclmn   3 |

The operator used in the preceding command checks whether the column value is contained within or equal to the value we provided. Similarly, we have the equality operator, that is, greater than or equal to, bitwise AND, bitwise OR, and other standard operators.

The macaddr data type can be used to store Mac addresses in different formats.

hstore for key-value pairs

A key-value store available in PostgreSQL is hstore. Many applications have requirements that make developers look for a schema-less data store. They end up turning to one of the NoSQL databases (Cassandra) or the simple and more prevalent stores such as Redis or Riak. While it makes sense to opt for one of these if the objective is to achieve horizontal scalability, it does make the system a bit complex because we now have more moving parts. After all, most applications do need a relational database to take care of all the important transactions along with the ability to write SQL to fetch data with different projections. If a part of the application needs to have a key-value store (and horizontal scalability is not the prime objective), the hstore data type in PostgreSQL should serve the purpose. It may not be necessary to make the system more complex by using different technologies that will also add to the maintenance overhead.

Sometimes, what we want is not an entirely schema-less database, but some flexibility where we are certain about most of our entities and their attributes but are unsure about a few. For example, a person is sure to have a few key attributes such as first name, date of birth, and a couple of other attributes (irrespective of his nationality). However, there could be other attributes that undergo change. A U.S. citizen is likely to have a Social Security Number (SSN); someone from Canada has a Social Insurance Number (SIN). Some countries may provide more than one identifier. There can be more attributes with a similar pattern.

There is usually a master attribute table (which links the IDs to attribute names) and a master table for the entities. Writing queries against tables designed on an EAV approach can get tricky. Using hstore may be an easier way of accomplishing the same. Let's see how we can do this using hstore with a simple example.

The hstore key-value store is an extension and has to be installed using CREATE EXTENSION hstore. We will model a customer table with first_name and an hstore column to hold all the dynamic attributes:

CREATE TABLE customer(id serial, first_name varchar(50), 
dynamic_attributes hstore); INSERT INTO customer (first_name ,dynamic_attributes) VALUES ('Michael','ssn=>"123-465-798" '), ('Smith','ssn=>"129-465-798" '), ('James','ssn=>"No data" '), ('Ram','uuid=>"1234567891" , npr=>"XYZ5678",
ratnum=>"Somanyidentifiers" ');

Now, let's try retrieving all customers with their SSN, as shown here:

SELECT first_name, dynamic_attributes FROM customer
       WHERE dynamic_attributes ? 'ssn';
first_name | dynamic_attributes
Michael   | "ssn"=>"123-465-798"
Smith     | "ssn"=>"129-465-798"
James     | "ssn"=>"No data"

Also, those with a specific SSN:

SELECT first_name,dynamic_attributes FROM customer
       WHERE dynamic_attributes -> 'ssn'= '123-465-798';
first_name | dynamic_attributes
Michael   | "ssn"=>"123-465-798"

If we want to get records that do not contain a specific SSN, just use the following command:

WHERE NOT dynamic_attributes -> 'ssn'= '123-465-798'
Also, replacing it with WHERE NOT dynamic_attributes ? 'ssn'; gives us the following command:
first_name |                          dynamic_attributes        
Ram       | "npr"=>"XYZ5678", "uuid"=>"1234567891", 

As is the case with all data types in PostgreSQL, there are a number of functions and operators available to fetch data selectively, update data, and so on.

We must always use the appropriate data types. This is not just for the sake of doing it right, but because of the number of operators and functions available with a focus on each data type; hstore stores only text. We can use it to store numeric values, but these values will be stored as text. We can index the hstore columns to improve performance. The type of index to be used depends on the operators we will be using frequently.


JavaScript Object Notation (JSON) is an open standard format used to transmit data in a human-readable format. It's a language-independent data format and is considered an alternative to XML. It's really lightweight compared to XML and has been steadily gaining popularity in the last few years.

PostgreSQL added the JSON data type in Version 9.2 with a limited set of functions and operators. Quite a few new functions and operators were added in Version 9.3. Version 9.4 adds one more data type: jsonb.json, which is very similar to JSONB.

The jsonb data type stores data in binary format. It also removes white spaces (which are insignificant) and avoids duplicate object keys. As a result of these differences, JSONB has an overhead when data goes in, while JSON has extra processing overhead when data is retrieved (consider how often each data point will be written and read). The number of operators available with each of these data types is also slightly different. As it's possible to cast one data type to the other, which one should we use depends on the use case. If the data will be stored as it is and retrieved without any operations, JSON should suffice. However, if we plan to use operators extensively and want indexing support, JSONB is a better choice. Also, if we want to preserve whitespace, key ordering, and duplicate keys, JSON is the right choice.

Now, let's look at an example. Assume that we are doing a proof of concept project for a library management system. There are a number of categories of items (ranging from books to DVDs). We wouldn't have information about all the categories of items and their attributes at the piloting stage. For the pilot stage, we could use a table design with the JSON data type to hold various items and their attributes:

   item_id serial,
   details json

Now, we will add records. All DVDs go into one record, books go into another, and so on:

INSERT INTO items (details) VALUES
                 "DVDs" :[
                        {"Name":"The Making of Thunderstorms", 
"Types":"Educational",                          "Age-group":"5-10","Produced By":"National Geographic"                          },                          {"Name":"My nightmares", "Types":"Movies", "Categories":"Horror",                          "Certificate":"A", "Director":"Dracula","Actors":                                [{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}]                          },                          {"Name":"My Cousin Vinny", "Types":"Movies", "Categories":"Suspense",                          "Certificate":"A", "Director": "Jonathan
Lynn","Actors":                          [{"Name":"Joe "},{"Name":"Marissa"}] }] }' );

A better approach would be to have one record for each item. Now, let's take a look at a few JSON functions:

SELECT   details->>'DVDs' dvds, pg_typeof(details->>'DVDs') 
datatype      FROM items; SELECT   details->'DVDs' dvds ,pg_typeof(details->'DVDs')
datatype      FROM items;

Note the difference between ->> and -> in the following screenshot. We are using the pg_typeof function to clearly see the data type returned by the functions. Both return the JSON object field. The first function returns text and the second function returns JSON:

PostgreSQL for Data Architects

Now, let's try something a bit more complex: retrieve all movies in DVDs in which Meena acted with the following SQL statement:

WITH tmp (dvds) AS
(SELECT json_array_elements(details->'DVDs') det FROM items)
SELECT * FROM tmp , json_array_elements(tmp.dvds#>'{Actors}') as a 
WHERE    a->>'Name'='Meena';

We get the record as shown here:

PostgreSQL for Data Architects

We used one more function and a couple of operators. The json_array_elements expands a JSON array to a set of JSON elements. So, we first extracted the array for DVDs. We also created a temporary table, which ceases to exist as soon as the query is over, using the WITH clause. In the next part, we extracted the elements of the array actors from DVDs. Then, we checked whether the Name element is equal to Meena.


PostgreSQL added the xml data type in Version 8.3. Extensible Markup Language (XML) has a set of rules to encode documents in a format that is both human-readable and machine-readable. This data type is best used to store documents. XML became the standard way of data exchanging information across systems. XML can be used to represent complex data structures such as hierarchical data. However, XML is heavy and verbose; it takes more bytes per data point compared to the JSON format. As a result, JSON is referred to as fat-free XML. XML structure can be verified against XML Schema Definition Documents (XSD). In short, XML is heavy and more sophisticated, whereas JSON is lightweight and faster to process.

We need to configure PostgreSQL with libxml support (./configure --with-libxml) and then restart the cluster for XML features to work. There is no need to reinitialize the database cluster.

Inserting and verifying XML data

Now, let's take a look at what we can do with the xml data type in PostgreSQL:

CREATE TABLE tbl_xml(id serial, docmnt xml);
INSERT INTO tbl_xml(docmnt ) VALUES ('Not xml');
INSERT INTO tbl_xml (docmnt)
       SELECT query_to_xml( 'SELECT now()',true,false,'') ;
SELECT xml_is_well_formed_document(docmnt::text), docmnt
       FROM tbl_xml;

Then, take a look at the following screenshot:

PostgreSQL for Data Architects

First, we created a table with a column to store the XML data. Then, we inserted a record, which is not in the XML format, into the table. Next, we used the query_to_xml function to get the output of a query in the XML format. We inserted this into the table. Then, we used a function to check whether the data in the table is well-formed XML.

Generating XML files for table definitions and data

We can use the table_to_xml function if we want to dump the data from a table in the XML format. Append and_xmlschema so that the function becomes table_to_xml_and_xmlschema, which will also generate the schema definition before dumping the content.

If we want to generate just the definitions, we can use table_to_xmlschema.

PostgreSQL also provides the xpath function to extract data as follows:

SELECT xpath('/table/row/now/text()',docmnt) FROM tbl_xml
       WHERE id = 2;

Using properly designed tables with separate columns to capture each attribute is always the best approach from a performance standpoint and update/write-options perspective. Data types such as json/xml are best used to temporarily store data when we need to provide feeds/extracts/views to other systems or when we get data from external systems. They can also be used to store documents. The maximum size for a field is 1 GB. We must consider this when we use the database to store text/document data.


Now, we will look at a must-have tool if we have just started with PostgreSQL and want to analyze the events taking place in the database.

For those coming from an Oracle background, this tool provides reports similar to AWR reports, although the information is more query-centric. It does not include data regarding host configuration, wait statistics, and so on.

Analyzing the activities in a live cluster provides a lot of insight. It tells us about load, bottlenecks, which queries get executed frequently (we can focus more on them for optimization). It even tells us if the parameters are set right, although a bit indirectly. For example, if we see that there are many temp files getting created while a specific query is getting executed, we know that we either have a buffer issue or have not written the query right.

For pgbadger to effectively scan the log file and produce useful reports, we should get our logging configuration right as follows:

log_destination = 'stderr'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_min_duration_statement = 0
log_connections = on
log_disconnections = on
log_duration = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
log_lock_waits = on
track_activity_query_size = 2048

It might be necessary to restart the cluster for some of these changes to take effect.

We will also ensure that there is some load on the database using pgbench. It's a utility that ships with PostgreSQL and can be used to benchmark PostgreSQL on our servers. We can initialize the tables required for pgbench by executing the following command at shell prompt:

pgbench -i pgp

This creates a few tables on the pgp database. We can log in to psql (database pgp) and check:


             List of relations
Schema |       Name      | Type | Owner  
public | pgbench_accounts | table | postgres
public | pgbench_branches | table | postgres
public | pgbench_history | table | postgres
   public | pgbench_tellers | table | postgres

Now, we can run pgbench to generate load on the database with the following command:

pgbench -c 5 -T10 pgp

The T option passes the duration for which pgbench should continue execution in seconds, c passes the number of clients, and pgp is the database.

At shell prompt, execute:


Once the file is downloaded, unzip the file using the following command:


Use cd to the directory pgbadger-master as follows:

cd pgbadger-master

Execute the following command:

./pgbadger /pgdata/9.3/pg_log/postgresql-2014-07-31.log –o 

Replace the log file name in the command with the actual name. It will generate a myoutput.html file.

The HTML file generated will have a wealth of information about what happened in the cluster with great charts/tables. In fact, it takes quite a bit of time to go through the report. Here is a sample chart that provides the distribution of queries based on execution time:

PostgreSQL for Data Architects

The following screenshot gives an idea about the number of performance metrics provided by the report:

PostgreSQL for Data Architects

If our objective is to troubleshoot performance bottlenecks, the slowest individual queries and most frequent queries under the top drop-down list is the right place to start. Once the queries are identified, locks, temporary file generation, and so on can be studied to identify the root cause. Of course, EXPLAIN is the best option when we want to refine individual queries.

If the objective is to understand how busy the cluster is, the Overview section and Sessions are the right places to explore.

The logging configuration used may create huge log files in systems with a lot of activity. Tweak the parameters appropriately to ensure that this does not happen.

With this, we covered most of the interesting data types, an interesting extension and a must-use tool from PostgreSQL ecosystem. Now, let's cover a few interesting features in PostgreSQL Version 9.4.

Features over time

Applying filters in Versions 8.0, 9.0, and 9.4 gives us a good idea about how quickly features are getting added to the database.

Interesting features in 9.4

Each version of PostgreSQL adds many features grouped into different categories (such as performance, backend, data types, and so on). We will look at a few features that are more likely to be of interest (because they help us improve performance or they make maintenance and configuration easy).

Keeping the buffer ready

As we saw earlier, reads from disk have a significant overhead compared to those from memory. There are quite a few occasions when disk reads are unavoidable. Let's see a few examples.

In a data warehouse, the Extract, Transform, Load (ETL) process, which may happen once a day usually, involves a lot of raw data getting processed in memory before being loaded into the final tables. This data is mostly transactional data. The master data, which does not get processed on a regular basis, may be evicted from memory as a result of this churn.

Reports typically depend a lot on master data. When users refresh their reports after ETL, it's highly likely that the master data will be read from disk, resulting in a drop in the response time. If we could ensure that the master data as well as the recently processed data is in the buffer, it can really improve user experience.

In a transactional system like an airline reservation system, a change to the fare rule may result in most of the fares being recalculated. This is a situation similar to the one described previously, ensuring that the fares and availability data for the most frequently searched routes in the buffer can provide a better user experience. This applies to an e-commerce site selling products also. If the product/price/inventory data is always available in memory, it can be retrieved very fast.

You must use PostgreSQL 9.4 for trying out the code in the following sections.

So, how can we ensure that the data is available in the buffer? A pg_prewarm module has been added as an extension to provide this functionality. The basic syntax is very simple: SELECT pg_prewarm('tablename');. This command will populate the buffers with data from the table. It's also possible to mention the blocks that should be loaded into the buffer from the table.

We will install the extension in a database, create a table, and populate some data. Then, we will stop the server, drop buffers (OS), and restart the server. We will see how much time a SELECT count(*) takes. We will repeat the exercise, but we will use pg_prewarm before executing SELECT count(*) at psql:

INSERT INTO myt(name) SELECT concat(generate_series(1,10000),'name');

Now, stop the server using pg_ctl at the shell prompt:

pg_ctl stop -m immediate

Clean OS buffers using the following command at the shell prompt (will need to use sudo to do this):

echo 1 > /proc/sys/vm/drop_caches

The command may vary depending on the OS. Restart the cluster using pg_ctl start.

Then, execute the following command:

Time: 333.115 ms

We should repeat the steps of shutting down the server, dropping the cache, and starting PostgreSQL. Then, execute SELECT pg_prewarm('myt'); before SELECT count(*).

The response time goes down significantly. Executing pg_prewarm does take some time, which is close to the time taken to execute the SELECT count(*) against a cold cache. However, the objective is to ensure that the user does not experience a delay.

(1 row)
Time: 7.002 ms

Better recoverability

A new parameter called recovery_min_apply_delay has been added in 9.4. This will go to the recovery.conf file of the slave server. With this, we can control the replay of transactions on the slave server. We can set this to approximately 5 minutes and then the standby will replay the transaction from the master when the standby system time is 5 minutes past the time of commit at the master. This provides a bit more flexibility when it comes to recovering from mistakes. When we keep the value at 1 hour, the changes at the master will be replayed at the slave after one hour. If we realize that something went wrong on the master server, we have about 1 hour to stop the transaction replay so that the action that caused the issue (for example, accidental dropping of a table) doesn't get replayed at the slave.

Easy-to-change parameters

An ALTER SYSTEM command has been introduced so that we don't have to edit postgresql.conf to change parameters. The entry will go to a file named We can execute ALTER SYSTEM SET work_mem='12MB'; and then check the file at psql:

\! more
# Do not edit this file manually!
# It will be overwritten by ALTER SYSTEM command.
work_mem = '12MB'

We must execute SELECT pg_reload_conf(); to ensure that the changes are propagated.

Logical decoding and consumption of changes

Version 9.4 introduces physical and logical replication slots. We will look at logical slots as they let us track changes and filter specific transactions. This lets us pick and choose from the transactions that have been committed. We can grab some of the changes, decode, and possibly replay on a remote server. We do not have to have an all-or-nothing replication. As of now, we will have to do a lot of work to decode/move the changes.

Two parameter changes are necessary to set this up. These are as follows:

The max_replication_slots parameter (set to at least 1) and wal_level (set to logical). Then, we can connect to a database and create a slot as follows:


The first parameter is the name we give to our slot and the second parameter is the plugin to be used. Test_decoding is the sample plugin available, which converts WAL entries into text representations as follows:

INSERT INTO myt(id) values (4);
INSERT INTO myt(name) values ('abc');

Now, we will try retrieving the entries:

SELECT * FROM pg_logical_slot_peek_changes('myslot',NULL,NULL);

Then, check the following screenshot:

PostgreSQL for Data Architects

This function lets us take a look at the changes without consuming them so that the changes can be accessed again:

SELECT * FROM pg_logical_slot_get_changes('myslot',NULL,NULL);

This is shown in the following screenshot:

PostgreSQL for Data Architects

This function is similar to the peek function, but the changes are no longer available to be fetched again as they get consumed.


In this article, we covered a few data types that data architects will find interesting. We also covered what is probably the best utility available to parse the PostgreSQL log file to produce excellent reports. We also looked at some of the interesting features in PostgreSQL version 9.4, which will be of interest to data architects.

Resources for Article:

Further resources on this subject:

You've been reading an excerpt of:

PostgreSQL for Data Architects

Explore Title
comments powered by Disqus