PostgreSQL as an Extensible RDBMS

Usama Dar

March 2015

This article by Usama Dar, the author of the book PostgreSQL Server Programming - Second Edition, explains the process of creating a new operator, overloading it, optimizing it, creating index access methods, and much more. PostgreSQL is an extensible database. I hope you've learned this much by now. It is extensible by virtue of the design that it has. As discussed before, PostgreSQL uses a catalog-driven design. In fact, PostgreSQL is more catalog-driven than most of the traditional relational databases. The key benefit here is that the catalogs can be changed or added to, in order to modify or extend the database functionality. PostgreSQL also supports dynamic loading, that is, a user-written code can be provided as a shared library, and PostgreSQL will load it as required.

(For more resources related to this topic, see here.)

Extensibility is critical for many businesses, which have needs that are specific to that business or industry. Sometimes, the tools provided by the traditional database systems do not fulfill those needs. People in those businesses know best how to solve their particular problems, but they are not experts in database internals. It is often not possible for them to cook up their own database kernel or modify the core or customize it according to their needs. A truly extensible database will then allow you to do the following:

  • Solve domain-specific problems in a seamless way, like a native solution
  • Build complete features without modifying the core database engine
  • Extend the database without interrupting availability

PostgreSQL not only allows you to do all of the preceding things, but also does these, and more with utmost ease. In terms of extensibility, you can do the following things in a PostgreSQL database:

  1. Create your own data types
  2. Create your own functions
  3. Create your own aggregates
  4. Create your own operators
  5. Create your own index access methods (operator classes)
  6. Create your own server programming language
  7. Create foreign data wrappers (SQL/MED) and foreign tables

What can't be extended?

Although PostgreSQL is an extensible platform, there are certain things that you can't do or change without explicitly doing a fork, as follows:

  1. You can't change or plug in a new storage engine. If you are coming from the MySQL world, this might annoy you a little. However, PostgreSQL's storage engine is tightly coupled with its executor and the rest of the system, which has its own benefits.
  2. You can't plug in your own planner/parser. One can argue for and against the ability to do that, but at the moment, the planner, parser, optimizer, and so on are baked into the system and there is no possibility of replacing them. There has been some talk on this topic, and if you are of the curious kind, you can read some of the discussion at http://bit.ly/1yRMkK7.
  3. We will now briefly discuss some more of the extensibility capabilities of PostgreSQL. We will not dive deep into the topics, but we will point you to the appropriate link where more information can be found.

Creating a new operator

Now, let's take look at how we can add a new operator in PostgreSQL. Adding new operators is not too different from adding new functions. In fact, an operator is syntactically just a different way to use an existing function. For example, the + operator calls a built-in function called numeric_add and passes it the two arguments.

When you define a new operator, you must define the data types that the operator expects as arguments and define which function is to be called.

Let's take a look at how to define a simple operator. You have to use the CREATE OPERATOR command to create an operator.

Let's use that function to create a new Fibonacci operator, ##, which will have an integer on its left-hand side:

CREATE OPERATOR ## (PROCEDURE=fib, LEFTARG=integer);

Now, you can use this operator in your SQL to calculate a Fibonacci number:

testdb=# SELECT 12##;
?column?
----------
144
(1 row)

Note that we defined that the operator will have an integer on the left-hand side. If you try to put a value on the right-hand side of the operator, you will get an error:

postgres=# SELECT ##12;
ERROR: operator does not exist: ## integer at character 8
HINT: No operator matches the given name and argument type(s). You
might need to add explicit type casts.
STATEMENT: select ##12;
ERROR: operator does not exist: ## integer
LINE 1: select ##12;
^
HINT: No operator matches the given name and argument type(s). You
might need to add explicit type casts.

Overloading an operator

Operators can be overloaded in the same way as functions. This means, that an operator can have the same name as an existing operator but with a different set of argument types. More than one operator can have the same name, but two operators can't share the same name if they accept the same types and positions of the arguments. As long as there is a function that accepts the same kind and number of arguments that an operator defines, it can be overloaded.

Let's override the ## operator we defined in the last example, and also add the ability to provide an integer on the right-hand side of the operator:

CREATE OPERATOR ## (PROCEDURE=fib, RIGHTARG=integer);

Now, running the same SQL, which resulted in an error last time, should succeed, as shown here:

testdb=# SELECT ##12;
?column?
----------
144
(1 row)

You can drop the operator using the DROP OPERATOR command.

You can read more about creating and overloading new operators in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/sql-createoperator.html and http://www.postgresql.org/docs/current/static/xoper.html.

There are several optional clauses in the operator definition that can optimize the execution time of the operators by providing information about operator behavior. For example, you can specify the commutator and the negator of an operator that help the planner use the operators in index scans. You can read more about these optional clauses at http://www.postgresql.org/docs/current/static/xoper-optimization.html.

Since this article is just an introduction to the additional extensibility capabilities of PostgreSQL, we will just introduce a couple of optimization options; any serious production quality operator definitions should include these optimization clauses, if applicable.

Optimizing operators

The optional clauses tell the PostgreSQL server about how the operators behave. These options can result in considerable speedups in the execution of queries that use the operator. However, if you provide these options incorrectly, it can result in a slowdown of the queries. Let's take a look at two optimization clauses called commutator and negator.

COMMUTATOR

This clause defines the commuter of the operator. An operator A is a commutator of operator B if it fulfils the following condition:

x A y = y B x.

It is important to provide this information for the operators that will be used in indexes and joins. As an example, the commutator for > is <, and the commutator of = is = itself.

This helps the optimizer to flip the operator in order to use an index. For example, consider the following query:

SELECT * FROM employee WHERE new_salary > salary;

If the index is defined on the salary column, then PostgreSQL can rewrite the preceding query as shown:

SELECT * from employee WHERE salary < new_salary

This allows PostgreSQL to use a range scan on the index column salary. For a user-defined operator, the optimizer can only do this flip around if the commutator of a user-defined operator is defined:

CREATE OPERATOR > (LEFTARG=integer, RIGHTARG=integer, PROCEDURE=comp, COMMUTATOR = <)

NEGATOR

The negator clause defines the negator of the operator. For example, <> is a negator of =. Consider the following query:

SELECT * FROM employee WHERE NOT (dept = 10);

Since <> is defined as a negator of =, the optimizer can simplify the preceding query as follows:

SELECT * FROM employee WHERE dept <> 10;

You can even verify that using the EXPLAIN command:

postgres=# EXPLAIN SELECT * FROM employee WHERE NOT
dept = 'WATER MGMNT';
QUERY PLAN
---------------------------------------------------------
Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)
Filter: ((dept)::text <> 'WATER MGMNT'::text)
Foreign File: /Users/usamadar/testdata.csv
Foreign File Size: 197
(4 rows)

Creating index access methods

Let's discuss how to index new data types or user-defined types and operators. In PostgreSQL, an index is more of a framework that can be extended or customized for using different strategies. In order to create new index access methods, we have to create an operator class. Let's take a look at a simple example.

Let's consider a scenario where you have to store some special data such as an ID or a social security number in the database. The number may contain non-numeric characters, so it is defined as a text type:

CREATE TABLE test_ssn (ssn text);
INSERT INTO test_ssn VALUES ('222-11-020878');
INSERT INTO test_ssn VALUES ('111-11-020978');

Let's assume that the correct order for this data is such that it should be sorted on the last six digits and not the ASCII value of the string.

The fact that these numbers need a unique sort order presents a challenge when it comes to indexing the data. This is where PostgreSQL operator classes are useful. An operator allows a user to create a custom indexing strategy.

Creating an indexing strategy is about creating your own operators and using them alongside a normal B-tree.

Let's start by writing a function that changes the order of digits in the value and also gets rid of the non-numeric characters in the string to be able to compare them better:

CREATE OR REPLACE FUNCTION fix_ssn(text)
RETURNS text AS $$
BEGIN
RETURN substring($1,8) || replace(substring($1,1,7),'-','');
END;
$$LANGUAGE 'plpgsql' IMMUTABLE;

Let's run the function and verify that it works:

testdb=# SELECT fix_ssn(ssn) FROM test_ssn;
fix_ssn
-------------
02087822211
02097811111
(2 rows)

Before an index can be used with a new strategy, we may have to define some more functions depending on the type of index. In our case, we are planning to use a simple B-tree, so we need a comparison function:

CREATE OR REPLACE FUNCTION ssn_compareTo(text, text)
RETURNS int AS
$$
BEGIN
IF fix_ssn($1) < fix_ssn($2)
THEN
RETURN -1;
ELSIF fix_ssn($1) > fix_ssn($2)
THEN
RETURN +1;
ELSE
RETURN 0;
END IF;
END;
$$ LANGUAGE 'plpgsql' IMMUTABLE;

It's now time to create our operator class:

CREATE OPERATOR CLASS ssn_ops
FOR TYPE text USING btree
AS
OPERATOR 1 < ,
OPERATOR 2 <= ,
OPERATOR 3 = ,
OPERATOR 4 >= ,
OPERATOR 5 > ,
FUNCTION 1 ssn_compareTo(text, text);

You can also overload the comparison operators if you need to compare the values in a special way, and use the functions in the compareTo function as well as provide them in the CREATE OPERATOR CLASS command.

We will now create our first index using our brand new operator class:

CREATE INDEX idx_ssn ON test_ssn (ssn ssn_ops);

We can check whether the optimizer is willing to use our special index, as follows:

testdb=# SET enable_seqscan=off;
testdb=# EXPLAIN SELECT * FROM test_ssn WHERE ssn = '02087822211';
QUERY PLAN
------------------------------------------------------------------
Index Only Scan using idx_ssn on test_ssn (cost=0.13..8.14 rows=1
width=32)
Index Cond: (ssn = '02087822211'::text)
(2 rows)

Therefore, we can confirm that the optimizer is able to use our new index.

You can read about index access methods in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/xindex.html.

Creating user-defined aggregates

User-defined aggregate functions are probably a unique PostgreSQL feature, yet they are quite obscure and perhaps not many people know how to create them. However, once you are able to create this function, you will wonder how you have lived for so long without using this feature.

This functionality can be incredibly useful, because it allows you to perform custom aggregates inside the database, instead of querying all the data from the client and doing a custom aggregate in your application code, that is, the number of hits on your website per minute from a specific country.

PostgreSQL has a very simple process for defining aggregates. Aggregates can be defined using any functions and in any languages that are installed in the database. Here are the basic steps to building an aggregate function in PostgreSQL:

  1. Define a start function that will take in the values of a result set; this function can be defined in any PL language you want.
  2. Define an end function that will do something with the final output of the start function. This can be in any PL language you want.
  3. Define the aggregate using the CREATE AGGREGATE command, providing the start and end functions you just created.

Let's steal an example from the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Aggregate_Median.

In this example, we will calculate the statistical median of a set of data. For this purpose, we will define start and end aggregate functions.

Let's define the end function first, which takes an array as a parameter and calculates the median. We are assuming here that our start function will pass an array to the following end function:

CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$
WITH q AS
(
SELECT val
FROM unnest($1) val
WHERE VAL IS NOT NULL
ORDER BY 1
),
cnt AS
(
SELECT COUNT(*) AS c FROM q
)
SELECT AVG(val)::float8
FROM
(
SELECT val FROM q
LIMIT 2 - MOD((SELECT c FROM cnt), 2)
OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)
) q2;
$$ LANGUAGE sql IMMUTABLE;

Now, we create the aggregate as shown in the following code:

CREATE AGGREGATE median(anyelement) (
SFUNC=array_append,
STYPE=anyarray,
FINALFUNC=_final_median,
INITCOND='{}'
);

The array_append start function is already defined in PostgreSQL. This function appends an element to the end of an array.

In our example, the start function takes all the column values and creates an intermediate array. This array is passed on to the end function, which calculates the median.

Now, let's create a table and some test data to run our function:

testdb=# CREATE TABLE median_test(t integer);
CREATE TABLE
testdb=# INSERT INTO median_test SELECT generate_series(1,10);
INSERT 0 10

The generate_series function is a set returning function that generates a series of values, from start to stop with a step size of one.

Now, we are all set to test the function:

testdb=# SELECT median(t) FROM median_test;
median
--------
5.5
(1 row)

The mechanics of the preceding example are quite easy to understand. When you run the aggregate, the start function is used to append all the table data from column t into an array using the append_array PostgreSQL built-in. This array is passed on to the final function, _final_median, which calculates the median of the array and returns the result in the same data type as the input parameter. This process is done transparently to the user of the function who simply has a convenient aggregate function available to them.

You can read more about the user-defined aggregates in the PostgreSQL documentation in much more detail at http://www.postgresql.org/docs/current/static/xaggr.html.

Using foreign data wrappers

PostgreSQL foreign data wrappers (FDW) are an implementation of SQL Management of External Data (SQL/MED), which is a standard added to SQL in 2013.

FDWs are drivers that allow PostgreSQL database users to read and write data to other external data sources, such as other relational databases, NoSQL data sources, files, JSON, LDAP, and even Twitter.

You can query the foreign data sources using SQL and create joins across different systems or even across different data sources.

There are several different types of data wrappers developed by different developers and not all of them are production quality. You can see a select list of wrappers on the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Foreign_data_wrappers.

Another list of FDWs can be found on PGXN at http://pgxn.org/tag/fdw/.

Let's take look at a small example of using file_fdw to access data in a CSV file.

First, you need to install the file_fdw extension. If you compiled PostgreSQL from the source, you will need to install the file_fdw contrib module that is distributed with the source. You can do this by going into the contrib/file_fdw folder and running make and make install. If you used an installer or a package for your platform, this module might have been installed automatically.

Once the file_fdw module is installed, you will need to create the extension in the database:

postgres=# CREATE EXTENSION file_fdw;
CREATE EXTENSION

Let's now create a sample CSV file that uses the pipe, |, as a separator and contains some employee data:

$ cat testdata.csv
AARON, ELVIA J|WATER RATE TAKER|WATER MGMNT|81000.00|73862.00
AARON, JEFFERY M|POLICE OFFICER|POLICE|74628.00|74628.00
AARON, KIMBERLEI R|CHIEF CONTRACT EXPEDITER|FLEET
MANAGEMNT|77280.00|70174.00

Now, we should create a foreign server that is pretty much a formality because the file is on the same server. A foreign server normally contains the connection information that a foreign data wrapper uses to access an external data resource. The server needs to be unique within the database:

CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;

The next step, is to create a foreign table that encapsulates our CSV file:

CREATE FOREIGN TABLE employee (
emp_name VARCHAR,
job_title VARCHAR,
dept VARCHAR,
salary NUMERIC,
sal_after_tax NUMERIC
) SERVER file_server
OPTIONS (format 'csv',header 'false' , filename '/home/pgbook/14/
testdata.csv', delimiter '|', null '');'');

The CREATE FOREIGN TABLE command creates a foreign table and the specifications of the file are provided in the OPTIONS section of the preceding code. You can provide the format, and if the first line of the file is a header (header 'false'), in our case there is no file header.

We then provide the name and path of the file and the delimiter used in the file, which in our case is the pipe symbol |. In this example, we also specify that the null values should be represented as an empty string.

Let's run a SQL command on our foreign table:

postgres=# select * from employee;
-[ RECORD 1 ]-+-------------------------
emp_name | AARON, ELVIA J
job_title | WATER RATE TAKER
dept | WATER MGMNT
salary | 81000.00
sal_after_tax | 73862.00
-[ RECORD 2 ]-+-------------------------
emp_name | AARON, JEFFERY M
job_title | POLICE OFFICER
dept | POLICE
salary | 74628.00
sal_after_tax | 74628.00
-[ RECORD 3 ]-+-------------------------
emp_name | AARON, KIMBERLEI R
job_title | CHIEF CONTRACT EXPEDITER
dept | FLEET MANAGEMNT
salary | 77280.00
sal_after_tax | 70174.00

Great, looks like our data is successfully loaded from the file.

You can also use the \d meta command to see the structure of the employee table:

postgres=# \d employee;
Foreign table "public.employee"
Column | Type | Modifiers | FDW Options
---------------+-------------------+-----------+-------------
emp_name | character varying | |
job_title | character varying | |
dept | character varying | |
salary | numeric | |
sal_after_tax | numeric | |
Server: file_server
FDW Options: (format 'csv', header 'false',
filename '/home/pg_book/14/testdata.csv', delimiter '|',
"null" '')

You can run explain on the query to understand what is going on when you run a query on the foreign table:

postgres=# EXPLAIN SELECT * FROM employee WHERE salary > 5000;
QUERY PLAN
---------------------------------------------------------
Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)
Filter: (salary > 5000::numeric)
Foreign File: /home/pgbook/14/testdata.csv
Foreign File Size: 197
(4 rows)

The ALTER FOREIGN TABLE command can be used to modify the options.

More information about the file_fdw is available at http://www.postgresql.org/docs/current/static/file-fdw.html.

You can take a look at the CREATE SERVER and CREATE FOREIGN TABLE commands in the PostgreSQL documentation for more information on the many options available. Each of the foreign data wrappers comes with its own documentation about how to use the wrapper. Make sure that an extension is stable enough before it is used in production. The PostgreSQL core development group does not support most of the FDW extensions.

If you want to create your own data wrappers, you can find the documentation at http://www.postgresql.org/docs/current/static/fdwhandler.html as an excellent starting point. The best way to learn, however, is to read the code of other available extensions.

Summary

This includes the ability to add new operators, new index access methods, and create your own aggregates. You can access foreign data sources, such as other databases, files, and web services using PostgreSQL foreign data wrappers. These wrappers are provided as extensions and should be used with caution, as most of them are not officially supported.

Even though PostgreSQL is very extensible, you can't plug in a new storage engine or change the parser/planner and executor interfaces. These components are very tightly coupled with each other and are, therefore, highly optimized and mature.

Resources for Article:


Further resources on this subject:


You've been reading an excerpt of:

PostgreSQL Server Programming - Second Edition

Explore Title