If you think that a PostgreSQL server is just a storage system, and the only way to communicate with it is by executing SQL statements, you are limiting yourself tremendously. That is using just a tiny part of the database's features.
A PostgreSQL server is a powerful framework that can be used for all kinds of data processing, and even some non-data server tasks. It is a server platform that allows you to easily mix and match functions and libraries from several popular languages. Consider this complicated, multi-language sequence of work:
Call a string parsing function in Perl.
Convert the string to XSLT and process the result using JavaScript.
Ask for a secure stamp from an external time-stamping service such as www.guardtime.com, using their SDK for C.
Write a Python function to digitally sign the result.
This can be implemented as a series of simple function calls using several of the available server programming languages. The developer needing to accomplish all this work can just call a single PostgreSQL function without having to be aware of how the data is being passed between languages and libraries:
SELECT convert_to_xslt_and_sign(raw_data_string);
In this book, we will discuss several facets of PostgreSQL server programming. PostgreSQL has all of the native server-side programming features available in most larger database systems, such as triggers (automated actions invoked each time data is changed). But it also has uniquely deep abilities to override the built-in behavior, down to very basic operators. Examples of this customization include the following:
Write user-defined functions (UDFs) in C for carrying out complex computations.
Add complicated constraints to make sure that data in the server meets guidelines.
Create triggers in many languages to make related changes to other tables, log the actions, or forbid the action to happen if it does not meet certain criteria.
Define new data types and operators in the database.
Use the geography types defined in the PostGIS package.
Add your own index access methods for either existing or new data types, making some queries much more efficient.
What sort of things can you do with these features? There are limitless possibilities, such as the ones listed as follows:
Write data extractor functions to get just the interesting parts from structured data, such as XML or JSON, without needing to ship the whole, possibly huge, document to the client application.
Process events asynchronously, like sending mail without slowing down the main application. You could create a mail queue for changes to user info, populated by a trigger. A separate mail-sending process can consume this data whenever it's notified by an application process.
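As a sketch of how such a queue might look (the table, trigger, channel, and column names here are illustrative, not from a real application), a trigger can populate a mail queue and use NOTIFY to wake up the consumer:

```sql
-- Hypothetical sketch: a mail queue populated by a trigger,
-- with a NOTIFY waking up a separate mail-sending process.
CREATE TABLE mail_queue (
    id serial PRIMARY KEY,
    recipient text NOT NULL,
    body text NOT NULL,
    queued_at timestamp DEFAULT current_timestamp
);

CREATE OR REPLACE FUNCTION queue_user_info_mail() RETURNS trigger AS $$
BEGIN
    -- assumes the audited table has an email column
    INSERT INTO mail_queue(recipient, body)
    VALUES (NEW.email, 'Your user info was changed');
    PERFORM pg_notify('mail_queue', NEW.email);  -- wake the mail sender
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER notify_mail_sender
AFTER UPDATE ON user_info
FOR EACH ROW EXECUTE PROCEDURE queue_user_info_mail();
```

A separate process then issues LISTEN mail_queue; and drains the table whenever it is notified, so the main application never waits on the mail server.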
The rest of this chapter is presented as a series of descriptions of common data management tasks showing how they can be solved in a robust and elegant way via server programming.
The samples in this chapter are all tested to work, but they come with minimal commentary. They are here just to show you various things server programming can accomplish. The techniques described will be explained thoroughly in later chapters.
Developers write their code in a number of different languages, and it could be designed to run just about anywhere. When writing an application, some people follow the philosophy that as much of the application's logic as possible should be pushed to the client. We see this in the explosion of applications leveraging JavaScript inside browsers. Others like to push the logic into the middle tier, with an application server handling the business rules. These are all valid ways to design an application, so why would you want to program in the database server?
Let's start with a simple example. Many applications include a list of customers who have a balance in their account. We'll use this sample schema and data:
CREATE TABLE accounts(owner text, balance numeric);
INSERT INTO accounts VALUES ('Bob',100);
INSERT INTO accounts VALUES ('Mary',200);
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
When using a database, the most common way to interact with it is to use SQL queries. If you want to move 14 dollars from Bob's account to Mary's account, with simple SQL it would look like this:
UPDATE accounts SET balance = balance - 14.00 WHERE owner = 'Bob';
UPDATE accounts SET balance = balance + 14.00 WHERE owner = 'Mary';
But you also have to make sure that Bob actually has enough money (or credit) in his account. It's also important that if anything fails, none of the changes happen. In an application program, the preceding code snippet becomes:
BEGIN;
SELECT balance FROM accounts WHERE owner = 'Bob' FOR UPDATE;
-- now the application checks that the balance is actually bigger than 14
UPDATE accounts SET balance = balance - 14.00 WHERE owner = 'Bob';
UPDATE accounts SET balance = balance + 14.00 WHERE owner = 'Mary';
COMMIT;
But did Mary actually have an account? If she did not, the last UPDATE will "succeed" by updating zero rows. If any of the checks fail, you should do a ROLLBACK instead of a COMMIT. Once you have done all this for all the clients that transfer money, a new requirement will invariably arrive. Perhaps the minimum amount that can be transferred is now 5.00, and you will need to revisit all your code in all your clients again.
So what can you do to make all of this more manageable, more secure, and more robust? This is where server programming, executing code on the database server itself, can help. You can move the computations, checks, and data manipulations entirely into a user-defined function (UDF) on the server. This not only ensures that you have only one copy of the operation logic to manage, but also makes things faster by avoiding several round-trips between client and server. If required, you can also make sure that only as much information as needed is given out of the database. For example, most client applications have no business knowing how much money Bob has in his account. Mostly, they only need to know whether there is enough money to make the transfer, or more to the point, whether the transaction succeeded.
PostgreSQL includes its own programming language named PL/pgSQL, which is designed to integrate easily with SQL commands. PL stands for programming language, and pgSQL is shorthand for PostgreSQL; this is just one of the many languages available for writing server code.
Unlike basic SQL, PL/pgSQL includes procedural elements, such as if/then/else statements and loops. You can easily execute SQL statements, or even loop over the result of a SQL statement, in the language.
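To give a flavor of these procedural elements, here is a minimal anonymous DO block (available in PostgreSQL 9.0 and later) combining a loop with an IF/ELSE statement; it only raises notices and touches no tables:

```sql
DO $$
DECLARE
    i integer;
BEGIN
    FOR i IN 1..3 LOOP           -- loop over a range
        IF i % 2 = 0 THEN        -- conditional branching
            RAISE NOTICE '% is even', i;
        ELSE
            RAISE NOTICE '% is odd', i;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

The same control structures are available inside named functions, as the transfer() example below shows.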
The integrity checks needed for the application can be done in a PL/pgSQL function which takes three arguments: names of the payer and recipient, and the amount to pay. This sample also returns the status of the payment:
CREATE OR REPLACE FUNCTION transfer(
    i_payer text,
    i_recipient text,
    i_amount numeric(15,2))
RETURNS text AS
$$
DECLARE
    payer_bal numeric;
BEGIN
    SELECT balance INTO payer_bal
      FROM accounts
     WHERE owner = i_payer FOR UPDATE;
    IF NOT FOUND THEN
        RETURN 'Payer account not found';
    END IF;
    IF payer_bal < i_amount THEN
        RETURN 'Not enough funds';
    END IF;
    UPDATE accounts SET balance = balance + i_amount
     WHERE owner = i_recipient;
    IF NOT FOUND THEN
        RETURN 'Recipient does not exist';
    END IF;
    UPDATE accounts SET balance = balance - i_amount
     WHERE owner = i_payer;
    RETURN 'OK';
END;
$$ LANGUAGE plpgsql;
Here are a few examples of using this function, assuming you haven't executed the previously proposed UPDATE statements yet:
postgres=# SELECT * FROM accounts;
 owner | balance
-------+---------
 Bob   |     100
 Mary  |     200
(2 rows)

postgres=# SELECT * FROM transfer('Bob','Mary',14.00);
 transfer
----------
 OK
(1 row)

postgres=# SELECT * FROM accounts;
 owner | balance
-------+---------
 Mary  |  214.00
 Bob   |   86.00
(2 rows)
Your application would need to check the return code and decide how to handle these errors. As long as it is written to reject any unexpected value, you could later extend this function to do more checking, such as enforcing a minimum transferable amount, and be sure that the new rule could not be bypassed. There are three errors this function can return:
postgres=# SELECT * FROM transfer('Fred','Mary',14.00);
        transfer
-------------------------
 Payer account not found
(1 row)

postgres=# SELECT * FROM transfer('Bob','Fred',14.00);
         transfer
--------------------------
 Recipient does not exist
(1 row)

postgres=# SELECT * FROM transfer('Bob','Mary',500.00);
     transfer
------------------
 Not enough funds
(1 row)
For these checks to always work, you would need to make all transfer operations go through the function, rather than manually changing the values with SQL statements.
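One way to enforce this, sketched here under the assumption that transfer() is created by a privileged role and declared SECURITY DEFINER (the simple version above is not), is to revoke direct write access to the table and grant EXECUTE on the function instead:

```sql
-- sketch: direct changes to accounts are forbidden for ordinary users,
-- so the transfer() function becomes the only way to move money
REVOKE INSERT, UPDATE, DELETE ON accounts FROM PUBLIC;
GRANT EXECUTE ON FUNCTION transfer(text, text, numeric) TO PUBLIC;
```

With these grants in place, an ordinary user running a bare UPDATE on accounts gets a permission error, while SELECT * FROM transfer(...) still works.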
The sample output shown here has been created with PostgreSQL's psql utility, usually running on a Linux system. Most of the code will work the same way if you are using a GUI utility such as pgAdmin3 to access the server instead. When you see lines like this:

postgres=# SELECT 1;

the postgres=# part is the prompt shown by the psql command.
Examples in this book have been tested using PostgreSQL 9.2. They will probably work on PostgreSQL 8.3 and later. There have not been many major changes to how server programming happens in the last few versions of PostgreSQL, but the syntax has become stricter over time to reduce the possibility of mistakes in server programming code. Due to the nature of those changes, most code written for newer versions will still run on older ones, unless it uses very new features. However, older code can easily fail to run on newer versions because of one of the newly enforced restrictions.
When using the psql utility to execute a query, PostgreSQL normally outputs the result using vertically aligned columns:
$ psql -c "SELECT 1 AS test"
 test
------
    1
(1 row)

$ psql
psql (9.2.1)
Type "help" for help.

postgres=# SELECT 1 AS test;
 test
------
    1
(1 row)
You can tell that you are seeing regular output because it ends with a count of the rows returned.
This type of output is hard to fit into the text of a book like this. It's easier to print the output from what the program calls expanded display, which breaks each column onto a separate line. You can switch to expanded display using either the -x command-line switch, or by sending \x to the psql program. Here is an example of each:
$ psql -x -c "SELECT 1 AS test"
-[ RECORD 1 ]
test | 1

$ psql
psql (9.2.1)
Type "help" for help.

postgres=# \x
Expanded display is on.
postgres=# SELECT 1 AS test;
-[ RECORD 1 ]
test | 1
Notice how the expanded output doesn't show the row count, and instead labels each output row with its number. To save space, not all of the examples in the book will show expanded output being turned on. You can normally tell which type you are seeing by differences like this: whether you see aligned rows or RECORD labels. Expanded mode is normally preferred when the output of the query is too wide to fit into the available width of the book.
Server programming means more than just writing server functions. There are many other things you can do in the server that can be considered programming.
For more complex tasks you can define your own types, operators, and casts from one type to another, letting you actually compare apples and oranges.
As shown in the next example, you can define a type, fruit_qty, for fruit-with-quantity, and then teach PostgreSQL to compare apples and oranges, say, by making one orange worth 1.5 apples when converting between the two:
postgres=# CREATE TYPE FRUIT_QTY as (name text, qty int);
CREATE TYPE
postgres=# SELECT '("APPLE", 3)'::FRUIT_QTY;
 fruit_qty
-----------
 (APPLE,3)
(1 row)

CREATE FUNCTION fruit_qty_larger_than(
    left_fruit FRUIT_QTY,
    right_fruit FRUIT_QTY)
RETURNS BOOL AS
$$
BEGIN
    IF (left_fruit.name = 'APPLE' AND right_fruit.name = 'ORANGE') THEN
        RETURN left_fruit.qty > (1.5 * right_fruit.qty);
    END IF;
    IF (left_fruit.name = 'ORANGE' AND right_fruit.name = 'APPLE') THEN
        RETURN (1.5 * left_fruit.qty) > right_fruit.qty;
    END IF;
    RETURN left_fruit.qty > right_fruit.qty;
END;
$$ LANGUAGE plpgsql;

postgres=# SELECT fruit_qty_larger_than('("APPLE", 3)'::FRUIT_QTY,
                                        '("ORANGE", 2)'::FRUIT_QTY);
 fruit_qty_larger_than
-----------------------
 f
(1 row)

postgres=# SELECT fruit_qty_larger_than('("APPLE", 4)'::FRUIT_QTY,
                                        '("ORANGE", 2)'::FRUIT_QTY);
 fruit_qty_larger_than
-----------------------
 t
(1 row)

CREATE OPERATOR > (
    leftarg = FRUIT_QTY,
    rightarg = FRUIT_QTY,
    procedure = fruit_qty_larger_than,
    commutator = <  -- the commutator of > is <
);

postgres=# SELECT '("ORANGE", 2)'::FRUIT_QTY > '("APPLE", 2)'::FRUIT_QTY;
 ?column?
----------
 t
(1 row)

postgres=# SELECT '("ORANGE", 2)'::FRUIT_QTY > '("APPLE", 3)'::FRUIT_QTY;
 ?column?
----------
 f
(1 row)
Server programming can also mean setting up automated actions (triggers), so that some operations in the database automatically cause other things to happen as well. For example, you can set up a process where making an offer on some items automatically reserves those items in the stock table.
So let's create a fruit stock table:
CREATE TABLE fruits_in_stock (
    name text PRIMARY KEY,
    in_stock integer NOT NULL,
    reserved integer NOT NULL DEFAULT 0,
    CHECK (in_stock BETWEEN 0 AND 1000),
    CHECK (reserved <= in_stock)
);
The CHECK constraints make sure that some basic rules are followed: you can't have more than 1000 fruits in stock (they'll probably go bad), you can't have negative stock, and you can't reserve more than what you have.
CREATE TABLE fruit_offer (
    offer_id serial PRIMARY KEY,
    recipient_name text,
    offer_date timestamp DEFAULT current_timestamp,
    fruit_name text REFERENCES fruits_in_stock,
    offered_amount integer
);
The fruit_offer table has an ID for the offer (so that you can distinguish between offers later), the recipient, the date, the offered fruit's name, and the offered amount.
To automate the reservation management, you first need a trigger function that implements the management logic:
CREATE OR REPLACE FUNCTION reserve_stock_on_offer () RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE fruits_in_stock
           SET reserved = reserved + NEW.offered_amount
         WHERE name = NEW.fruit_name;
    ELSIF TG_OP = 'UPDATE' THEN
        UPDATE fruits_in_stock
           SET reserved = reserved - OLD.offered_amount
                                   + NEW.offered_amount
         WHERE name = NEW.fruit_name;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE fruits_in_stock
           SET reserved = reserved - OLD.offered_amount
         WHERE name = OLD.fruit_name;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
You then have to tell PostgreSQL to call this function each time a row in the fruit_offer table is changed:
CREATE TRIGGER manage_reserve_stock_on_offer_change
AFTER INSERT OR UPDATE OR DELETE ON fruit_offer
FOR EACH ROW EXECUTE PROCEDURE reserve_stock_on_offer();
After this we are ready to test the functionality. First, we will add some fruit to our stock:
INSERT INTO fruits_in_stock(name,in_stock)
VALUES ('APPLE', 500),
       ('ORANGE', 500);
Then, we check that stock (this is using the expanded display):
postgres=# \x
Expanded display is on.
postgres=# SELECT * FROM fruits_in_stock;
-[ RECORD 1 ]----
name     | APPLE
in_stock | 500
reserved | 0
-[ RECORD 2 ]----
name     | ORANGE
in_stock | 500
reserved | 0
Next, let's make an offer of 100 apples to Bob:
postgres=# INSERT INTO fruit_offer(recipient_name,fruit_name,offered_amount)
           VALUES('Bob','APPLE',100);
INSERT 0 1
postgres=# SELECT * FROM fruit_offer;
-[ RECORD 1 ]--+---------------------------
offer_id       | 1
recipient_name | Bob
offer_date     | 2013-01-25 15:21:15.281579
fruit_name     | APPLE
offered_amount | 100
On checking the stock we see that indeed 100 apples are reserved:
postgres=# SELECT * FROM fruits_in_stock;
-[ RECORD 1 ]----
name     | ORANGE
in_stock | 500
reserved | 0
-[ RECORD 2 ]----
name     | APPLE
in_stock | 500
reserved | 100
If we change the offered amount, the reservation follows:
postgres=# UPDATE fruit_offer SET offered_amount = 115 WHERE offer_id = 1;
UPDATE 1
postgres=# SELECT * FROM fruits_in_stock;
-[ RECORD 1 ]----
name     | ORANGE
in_stock | 500
reserved | 0
-[ RECORD 2 ]----
name     | APPLE
in_stock | 500
reserved | 115
We also get some extra benefits. First, because of the constraint on the stock table, you can't sell the reserved apples:
postgres=# UPDATE fruits_in_stock SET in_stock = 100 WHERE name = 'APPLE';
ERROR:  new row for relation "fruits_in_stock" violates check constraint "fruits_in_stock_check"
DETAIL:  Failing row contains (APPLE, 100, 115).
More interestingly, you also can't reserve more than you have, even though the constraints are on another table:
postgres=# UPDATE fruit_offer SET offered_amount = 1100 WHERE offer_id = 1;
ERROR:  new row for relation "fruits_in_stock" violates check constraint "fruits_in_stock_check"
DETAIL:  Failing row contains (APPLE, 500, 1100).
CONTEXT:  SQL statement "UPDATE fruits_in_stock SET reserved = reserved - OLD.offered_amount + NEW.offered_amount WHERE name = NEW.fruit_name"
PL/pgSQL function reserve_stock_on_offer() line 8 at SQL statement
When you finally delete the offer, the reservation is released:
postgres=# DELETE FROM fruit_offer WHERE offer_id = 1;
DELETE 1
postgres=# SELECT * FROM fruits_in_stock;
-[ RECORD 1 ]----
name     | ORANGE
in_stock | 500
reserved | 0
-[ RECORD 2 ]----
name     | APPLE
in_stock | 500
reserved | 0
In a real system, you probably would archive the old offer before deleting it.
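Archiving could itself be automated with one more trigger. The following sketch (the archive table name is made up for illustration) copies each offer into an archive table just before it is deleted:

```sql
-- sketch: keep a copy of every deleted offer
CREATE TABLE fruit_offer_archive (LIKE fruit_offer);

CREATE OR REPLACE FUNCTION archive_fruit_offer() RETURNS trigger AS $$
BEGIN
    INSERT INTO fruit_offer_archive SELECT OLD.*;  -- whole old row
    RETURN OLD;  -- returning OLD lets the DELETE proceed
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER archive_offer_on_delete
BEFORE DELETE ON fruit_offer
FOR EACH ROW EXECUTE PROCEDURE archive_fruit_offer();
```

Because this is a BEFORE DELETE trigger, the copy is taken before the row disappears, and the reservation-release trigger shown earlier still fires afterwards.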
If you need to know who did what to the data and when it was done, one way to do that is to log every action that is performed on an important table.
There are at least two equally valid ways of doing the auditing:
Use auditing triggers
Allow tables to be accessed only through functions, and do the auditing inside these functions
Here, we will take a look at minimal examples of both approaches.
First, let's create the tables:
CREATE TABLE salaries (
    emp_name text PRIMARY KEY,
    salary integer NOT NULL
);

CREATE TABLE salary_change_log (
    changed_by text DEFAULT CURRENT_USER,
    changed_at timestamp DEFAULT CURRENT_TIMESTAMP,
    salary_op text,
    emp_name text,
    old_salary integer,
    new_salary integer
);

REVOKE ALL ON salary_change_log FROM PUBLIC;
GRANT ALL ON salary_change_log TO managers;
You don't generally want your users to be able to change audit logs, so grant the right to access them only to managers. If you plan to let users access the salary table directly, you should put an auditing trigger on it:
CREATE OR REPLACE FUNCTION log_salary_change () RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO salary_change_log(salary_op,emp_name,new_salary)
        VALUES (TG_OP,NEW.emp_name,NEW.salary);
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO salary_change_log(salary_op,emp_name,old_salary,new_salary)
        VALUES (TG_OP,NEW.emp_name,OLD.salary,NEW.salary);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO salary_change_log(salary_op,emp_name,old_salary)
        VALUES (TG_OP,OLD.emp_name,OLD.salary); -- OLD here: NEW is not set on DELETE
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

CREATE TRIGGER audit_salary_change
AFTER INSERT OR UPDATE OR DELETE ON salaries
FOR EACH ROW EXECUTE PROCEDURE log_salary_change();
Now, let's test out some salary management:
postgres=# INSERT INTO salaries values('Bob',1000);
INSERT 0 1
postgres=# UPDATE salaries set salary = 1100 where emp_name = 'Bob';
UPDATE 1
postgres=# INSERT INTO salaries values('Mary',1000);
INSERT 0 1
postgres=# UPDATE salaries set salary = salary + 200;
UPDATE 2
postgres=# SELECT * FROM salaries;
-[ RECORD 1 ]--
emp_name | Bob
salary   | 1300
-[ RECORD 2 ]--
emp_name | Mary
salary   | 1200
Each one of those changes is saved into the salary change log table for auditing purposes:
postgres=# SELECT * FROM salary_change_log;
-[ RECORD 1 ]--------------------------
changed_by | frank
changed_at | 2012-01-25 15:44:43.311299
salary_op  | INSERT
emp_name   | Bob
old_salary |
new_salary | 1000
-[ RECORD 2 ]--------------------------
changed_by | frank
changed_at | 2012-01-25 15:44:43.313405
salary_op  | UPDATE
emp_name   | Bob
old_salary | 1000
new_salary | 1100
-[ RECORD 3 ]--------------------------
changed_by | frank
changed_at | 2012-01-25 15:44:43.314208
salary_op  | INSERT
emp_name   | Mary
old_salary |
new_salary | 1000
-[ RECORD 4 ]--------------------------
changed_by | frank
changed_at | 2012-01-25 15:44:43.314903
salary_op  | UPDATE
emp_name   | Bob
old_salary | 1100
new_salary | 1300
-[ RECORD 5 ]--------------------------
changed_by | frank
changed_at | 2012-01-25 15:44:43.314903
salary_op  | UPDATE
emp_name   | Mary
old_salary | 1000
new_salary | 1200
On the other hand, you may not want anybody to have direct access to the salary table, in which case you can perform the following:
REVOKE ALL ON salaries FROM PUBLIC;
Instead, give users access to only two functions: the first for looking up salaries, and the second, available only to managers, for changing them.
The functions themselves will have full access to the underlying tables because they are declared as SECURITY DEFINER, which means they run with the privileges of the user who created them.
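In practice, that means revoking the default PUBLIC execute right on the salary-changing function and granting it back only to managers, once the two functions shown below exist. A minimal sketch:

```sql
-- sketch: anybody may look up salaries, but only managers may change them
-- (functions are executable by PUBLIC by default, so revoke that first)
REVOKE EXECUTE ON FUNCTION set_salary(text, int) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION set_salary(text, int) TO managers;
```

get_salary() keeps its default PUBLIC execute privilege, since any user is allowed to look.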
The salary lookup function will look like the following:
CREATE OR REPLACE FUNCTION get_salary(text) RETURNS integer AS $$
    -- if you look at other people's salaries, it gets logged
    INSERT INTO salary_change_log(salary_op,emp_name,new_salary)
    SELECT 'SELECT',emp_name,salary
      FROM salaries
     WHERE upper(emp_name) = upper($1)
       AND upper(emp_name) != upper(CURRENT_USER); -- don't log selects of one's own salary
    -- return the requested salary
    SELECT salary FROM salaries WHERE upper(emp_name) = upper($1);
$$ LANGUAGE SQL SECURITY DEFINER;
Notice that we implemented a "soft security" approach: you can look up other people's salaries, but you have to do it responsibly, that is, only when you need to, as your manager will know that you have checked.
The set_salary() function abstracts away the need to check whether the employee exists; if not, the employee is created. Setting someone's salary to 0 removes him from the salary table. Thus, the interface is much simplified, and the client application of these functions needs to know and do less:
CREATE OR REPLACE FUNCTION set_salary(i_emp_name text, i_salary int)
RETURNS text AS $$
DECLARE
    old_salary integer;
BEGIN
    SELECT salary INTO old_salary
      FROM salaries
     WHERE upper(emp_name) = upper(i_emp_name);
    IF NOT FOUND THEN
        INSERT INTO salaries VALUES(i_emp_name, i_salary);
        INSERT INTO salary_change_log(salary_op,emp_name,new_salary)
        VALUES ('INSERT',i_emp_name,i_salary);
        RETURN 'INSERTED USER ' || i_emp_name;
    ELSIF i_salary > 0 THEN
        UPDATE salaries
           SET salary = i_salary
         WHERE upper(emp_name) = upper(i_emp_name);
        INSERT INTO salary_change_log(salary_op,emp_name,old_salary,new_salary)
        VALUES ('UPDATE',i_emp_name,old_salary,i_salary);
        RETURN 'UPDATED USER ' || i_emp_name;
    ELSE -- salary set to 0
        DELETE FROM salaries WHERE upper(emp_name) = upper(i_emp_name);
        INSERT INTO salary_change_log(salary_op,emp_name,old_salary)
        VALUES ('DELETE',i_emp_name,old_salary);
        RETURN 'DELETED USER ' || i_emp_name;
    END IF;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
Now, drop the audit trigger (or the changes will be logged twice) and test the new functionality:
postgres=# DROP TRIGGER audit_salary_change ON salaries;
DROP TRIGGER
postgres=# SELECT set_salary('Fred',750);
-[ RECORD 1 ]------------------
set_salary | INSERTED USER Fred

postgres=# SELECT set_salary('frank',100);
-[ RECORD 1 ]-------------------
set_salary | INSERTED USER frank

postgres=# SELECT * FROM salaries;
-[ RECORD 1 ]---
emp_name | Bob
salary   | 1300
-[ RECORD 2 ]---
emp_name | Mary
salary   | 1200
-[ RECORD 3 ]---
emp_name | Fred
salary   | 750
-[ RECORD 4 ]---
emp_name | frank
salary   | 100

postgres=# SELECT set_salary('mary',0);
-[ RECORD 1 ]-----------------
set_salary | DELETED USER mary

postgres=# SELECT * FROM salaries;
-[ RECORD 1 ]---
emp_name | Bob
salary   | 1300
-[ RECORD 2 ]---
emp_name | Fred
salary   | 750
-[ RECORD 3 ]---
emp_name | frank
salary   | 100

postgres=# SELECT * FROM salary_change_log;
...
-[ RECORD 6 ]--------------------------
changed_by | gsmith
changed_at | 2013-01-25 15:57:49.057592
salary_op  | INSERT
emp_name   | Fred
old_salary |
new_salary | 750
-[ RECORD 7 ]--------------------------
changed_by | gsmith
changed_at | 2013-01-25 15:57:49.062456
salary_op  | INSERT
emp_name   | frank
old_salary |
new_salary | 100
-[ RECORD 8 ]--------------------------
changed_by | gsmith
changed_at | 2013-01-25 15:57:49.064337
salary_op  | DELETE
emp_name   | mary
old_salary | 1200
new_salary |
We notice that the employee names are not stored with consistent case. It would be easy to enforce consistency by adding a constraint:
CHECK (emp_name = upper(emp_name))
However, it is even better to simply make sure that the name is stored in uppercase, and the simplest way to do that is with a trigger:
CREATE OR REPLACE FUNCTION uppercase_name () RETURNS trigger AS $$
BEGIN
    NEW.emp_name = upper(NEW.emp_name);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- only INSERT and UPDATE: NEW is not defined for DELETE triggers
CREATE TRIGGER uppercase_emp_name
BEFORE INSERT OR UPDATE ON salaries
FOR EACH ROW EXECUTE PROCEDURE uppercase_name();
The next set_salary() call for a new employee will now insert emp_name in uppercase:
postgres=# SELECT set_salary('arnold',80);
-[ RECORD 1 ]--------------------
set_salary | INSERTED USER arnold
As the uppercasing happened inside a trigger, the function response still shows a lowercase name, but in the database it is uppercase:
postgres=# SELECT * FROM salaries;
-[ RECORD 1 ]----
emp_name | Bob
salary   | 1300
-[ RECORD 2 ]----
emp_name | Fred
salary   | 750
-[ RECORD 3 ]----
emp_name | frank
salary   | 100
-[ RECORD 4 ]----
emp_name | ARNOLD
salary   | 80
After fixing the existing mixed-case emp_name values, we can make sure that all future emp_name values will be in uppercase by adding a constraint:
postgres=# update salaries set emp_name = upper(emp_name)
postgres-#  where not emp_name = upper(emp_name);
UPDATE 3
postgres=# alter table salaries add constraint emp_name_must_be_uppercase
postgres-#  CHECK (emp_name = upper(emp_name));
ALTER TABLE
If this behavior is needed in more places, it would make sense to define a new type, say u_text, which is always stored as uppercase. You will learn more about this approach in the chapter about defining user types.
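As a taste of that, a cut-down version of the idea can already be had with a domain. A domain can reject (though not automatically convert) lowercase input; the names here are illustrative:

```sql
-- sketch: a text domain that only accepts uppercase values
CREATE DOMAIN u_text AS text
    CHECK (VALUE = upper(VALUE));

CREATE TABLE salaries_v2 (
    emp_name u_text PRIMARY KEY,
    salary integer NOT NULL
);

-- INSERT INTO salaries_v2 VALUES ('bob', 100);  -- fails the domain check
-- INSERT INTO salaries_v2 VALUES ('BOB', 100);  -- accepted
```

A full u_text type that silently uppercases its input needs the user-defined type machinery covered later; a domain can only validate, not rewrite, the value.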
The last example in this chapter is about using functions for different ways of sorting.
Say we are given the task of sorting words by their vowels only, and in addition to that, making the last vowel the most significant one when sorting. While this task may seem really complicated at first, it is easy to solve with functions:
CREATE OR REPLACE FUNCTION reversed_vowels(word text)
RETURNS text AS $$
    vowels = [c for c in word.lower() if c in 'aeiou']
    vowels.reverse()
    return ''.join(vowels)
$$ LANGUAGE plpythonu IMMUTABLE;

-- assumes a table created along the lines of: CREATE TABLE words(word text);
postgres=# SELECT word, reversed_vowels(word)
           FROM words
           ORDER BY reversed_vowels(word);
    word     | reversed_vowels
-------------+-----------------
 Abracadabra | aaaaa
 Great       | ae
 Barter      | ea
 Revolver    | eoe
(4 rows)
The best part is that you can use your new function in an index definition:
postgres=# CREATE INDEX reversed_vowels_index
           ON words (reversed_vowels(word));
CREATE INDEX
The system will automatically use this index whenever the function reversed_vowels(word) is used in a WHERE clause or an ORDER BY.
Developing application software is complicated. Some of the approaches to help manage that complexity are so popular that they've been given simple acronyms to remember them. Next, we'll introduce some of these principles and show how server programming helps make them easier to follow.
The first is KISS (keep it simple, stupid). One of the main techniques of successful programming is writing simple code, that is, code that you can easily understand three years from now, and that others can understand as well. It is not always achievable, but it almost always makes sense to write your code in the simplest way possible. You may rewrite parts of it later for various reasons, such as speed, code compactness, to show off how clever you are, and so on. But always write the code first in a simple way, so you can be absolutely sure that it does what you want. Not only do you get working code fast, you also have something to compare against when you try more advanced ways to do the same thing.
And remember, debugging is harder than writing code; so if you write the code in the most complex way you can, you will have a really hard time debugging it.
It is often easier to write a set-returning function instead of a complex query. Yes, it will probably run slower than the same thing implemented as a single complex query, because the optimizer can do very little with code written as functions, but the speed may be sufficient for your needs. If more speed is needed, it's usually possible to refactor the code piece by piece, folding parts of the function into larger queries where the optimizer has a better chance of discovering better query plans, until the performance is acceptable again.
Remember that most of the time, you don't need the absolutely fastest code. For your clients or bosses, the best code is the one that does the job well and arrives on time.
DRY stands for "don't repeat yourself". It means implementing any piece of business logic just once, and putting the code for doing it in the right place.
It may sometimes be hard; for example, you do want to do some checking of your web forms in the browser, but still do the final check in the database. As a general guideline, though, it is very much valid.
Server programming helps a lot here. If your data manipulation code lives in the database, near the data, all the data's users have easy access to it, and you will not need to maintain similar code in a C++ Windows program, two PHP websites, and a bunch of Python scripts doing nightly management tasks. If any of them needs to do this thing to the customers table, they just call:
SELECT * FROM do_this_thing_to_customers(arg1, arg2, arg3);
And that's it!
If the logic behind the function needs changing, you just change the function with no downtime and no complicated orchestration of pushing database query updates to several clients. Once the function is changed in the database, it is changed for all users.
YAGNI stands for "you ain't gonna need it". In other words, don't do more than you absolutely need to.
If you have a creepy feeling that your client is not yet well aware of what the final database will look like or what it will do, it's helpful to resist the urge to design "everything" into the database. A much better way is to do the minimal implementation that satisfies the current spec, but do it with extensibility in mind. It is much easier to "paint yourself into a corner" when implementing a big spec with large imaginary parts.
If you organize your access to the database through functions, it is often possible to do even large rewrites of business logic without touching the frontend application code. Your application still does SELECT * FROM do_this_thing_to_customers(arg1, arg2, arg3), even after you have rewritten the function five times and changed the whole table structure twice.
Usually, when you hear the acronym SOA (service-oriented architecture), it comes from enterprise software people selling you a complex set of SOAP services. But the essence of SOA is organizing your software platform as a set of services that clients and other services call to perform certain well-defined atomic tasks, such as:
Checking a user's password and credentials
Presenting him/her with a list of his/her favorite websites
Selling him/her a new red dog collar with complementary membership in the red-collared dog club
These services can be implemented as SOAP calls with corresponding WSDL definitions, Java servers with servlet containers, and complex management infrastructure. They can also be a set of PostgreSQL functions, taking a set of arguments and returning a set of values. If the arguments or return values are complex, they can be passed as XML or JSON, but often a simple set of standard PostgreSQL data types is enough. In Chapter 9, Scaling Your Database with PL/Proxy, we will learn how to make such a PostgreSQL-based SOA service infinitely scalable.
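For instance, the first of those tasks could be exposed as a single function taking plain text arguments and returning a boolean. The table, function name, and hashing choice below are illustrative only; a real system should use a proper password-hashing scheme rather than a bare md5():

```sql
-- hypothetical credential store and login-check service function
CREATE TABLE app_user (
    username text PRIMARY KEY,
    password_hash text NOT NULL
);

CREATE OR REPLACE FUNCTION check_login(i_username text, i_password text)
RETURNS boolean AS $$
    SELECT EXISTS (
        SELECT 1
          FROM app_user
         WHERE username = i_username
           AND password_hash = md5(i_password)  -- illustrative; use a real KDF
    );
$$ LANGUAGE sql STABLE SECURITY DEFINER;
```

The client simply calls SELECT check_login('bob', 'secret'); and never sees the stored hashes, which is exactly the kind of well-defined atomic service SOA asks for.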
Some of the preceding techniques are available in other databases, but PostgreSQL's extensibility does not stop here. In PostgreSQL, you can just write User-defined functions (UDFs) in any of the most popular scripting languages. You can also define your own types, not just domains, which are standard types with some extra constraints attached, but new full-fledged types too.
For example, the Dutch company MGRID has developed a value-with-unit set of data types, so that you can divide 10 km by 0.2 hours and get the result 50 km/h. Of course, you can also cast the same result to meters per second or any other unit of speed. And yes, you can even get it as a fraction of c, the speed of light.
This kind of functionality needs both the types themselves and overloaded operators, which know that if you divide a distance by a time, the result is a speed. You will also need user-defined casts, which are conversion functions between types, invoked either automatically or manually.
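A greatly simplified sketch of the idea looks as follows (the type and function names are invented; MGRID's actual implementation is far more complete):

```sql
-- Toy value-with-unit types: the unit is fixed by the type itself.
CREATE TYPE distance AS (amount numeric);  -- kilometres
CREATE TYPE duration AS (amount numeric);  -- hours
CREATE TYPE speed    AS (amount numeric);  -- km/h

-- The function that knows distance / duration = speed.
CREATE FUNCTION distance_div_duration(d distance, t duration)
RETURNS speed AS $$
    SELECT ROW((d).amount / (t).amount)::speed;
$$ LANGUAGE sql IMMUTABLE;

-- Overload the division operator for these two types.
CREATE OPERATOR / (
    LEFTARG  = distance,
    RIGHTARG = duration,
    FUNCTION = distance_div_duration
);

-- SELECT ROW(10)::distance / ROW(0.2)::duration;  yields (50) as speed
```

A real implementation would carry the unit in the value, define casts between compatible units, and reject operations that make no dimensional sense.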
MGRID developed this for use in medical applications, where the cost of an error can be high: the difference between 10 ml and 10 cc can be vital. A similar system could also have averted many other disasters where using the wrong units produced bad computation results. If the unit always travels together with the amount, the possibility of these kinds of errors is greatly diminished.

You can also add your own index methods if you have some programming skills and your problem domain is not well served by the existing indexes. There is already a respectable set of index types included in core PostgreSQL, as well as several others developed outside the core.
The latest index method to be officially included in PostgreSQL is KNN (K Nearest Neighbor), a clever index that can return K rows ordered by their distance from the desired search target. One use of KNN is fuzzy text search, where it can rank full-text search results by how well they match the search terms. Before KNN, this kind of thing was done by querying all rows that matched even a little, then sorting them all by the distance function and returning the top K rows as the last step.
With a KNN index, the index access can start returning the rows in the desired order, so a simple LIMIT K clause returns the top K matches.
The KNN index can also be used for real distances, for example answering the request "Give me the 10 nearest pizza places to central station."
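The pizza query above can be sketched like this (the table, its contents, and the coordinates are invented; the <-> distance operator over a GiST index on a point column is what makes the ordered index scan possible):

```sql
-- Hypothetical pizza-places table with a point-typed location column.
CREATE TABLE pizza_places (
    name     text,
    location point
);

-- A GiST index on point supports KNN ordering by distance.
CREATE INDEX pizza_places_location_idx
    ON pizza_places USING gist (location);

-- The 10 nearest pizza places to the central station at (59.44, 24.75):
SELECT name
  FROM pizza_places
 ORDER BY location <-> point '(59.44,24.75)'
 LIMIT 10;
```

The planner can satisfy the ORDER BY ... <-> directly from the index, returning rows in distance order without sorting the whole table.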
As you can see, index types are separate from the data types they index. As another example, the same GIN (Generalized Inverted Index) can be used for full-text search (together with stemmers, thesauri, and other text-processing machinery) as well as for indexing the elements of integer arrays.
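Both uses look the same from the DDL side (the table and column names here are invented):

```sql
-- One index method, two very different data types.

-- GIN over a tsvector expression, for full-text search:
CREATE INDEX docs_fts_idx
    ON docs USING gin (to_tsvector('english', body));

-- GIN over an integer array column, for fast element lookups
-- with operators like @> (contains):
CREATE INDEX items_tags_idx
    ON items USING gin (tags);   -- tags is of type integer[]
```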
Yet another place where server-side programming can be used is for caching values that are expensive to compute. The basic pattern here is:
Check if the value is cached.
If not or the value is too old, compute and cache it.
Return the cached value.
For example, calculating sales for a company is a perfect candidate for caching. Suppose a large retail company has 1,000 stores with potentially millions of individual sales transactions per day. If corporate headquarters is looking for sales trends, it is much more efficient if the daily sales numbers are precalculated at the store level instead of summing up millions of transactions each time.
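Using this sales example, the check/compute/return pattern can be sketched as a PL/pgSQL function over a small cache table (all the table, column, and function names here are invented; the ON CONFLICT clause needs PostgreSQL 9.5 or later, and older versions would use a DELETE-then-INSERT or an UPDATE-or-INSERT loop instead):

```sql
CREATE TABLE sales_cache (
    store_id   integer,
    sales_date date,
    total      numeric,
    computed   timestamptz DEFAULT now(),
    PRIMARY KEY (store_id, sales_date)
);

CREATE OR REPLACE FUNCTION daily_sales(p_store integer, p_date date)
RETURNS numeric AS $$
DECLARE
    v_total numeric;
BEGIN
    -- 1. Check if the value is cached (and still fresh).
    SELECT total INTO v_total
      FROM sales_cache
     WHERE store_id = p_store AND sales_date = p_date
       AND computed > now() - interval '1 hour';
    IF FOUND THEN
        RETURN v_total;                -- 3. Return the cached value.
    END IF;

    -- 2. Not cached or too old: compute and cache it.
    SELECT sum(amount) INTO v_total
      FROM sales
     WHERE store_id = p_store AND sales_date = p_date;

    INSERT INTO sales_cache (store_id, sales_date, total)
         VALUES (p_store, p_date, v_total)
    ON CONFLICT (store_id, sales_date)
    DO UPDATE SET total = EXCLUDED.total, computed = now();

    RETURN v_total;
END;
$$ LANGUAGE plpgsql;
```

Callers simply run SELECT daily_sales(42, current_date); and never need to know whether the value came from the cache or was just computed.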
If the value is simple, such as a user's information looked up from a single table by user ID, you don't need to do anything. The value gets cached in PostgreSQL's internal page cache, and all lookups to it are so fast that even on a very fast network, most of the lookup time is spent in the network, not in the actual lookup. In such a case, getting data from a PostgreSQL database is as fast as getting it from any other in-memory cache (such as memcached), but without any extra overhead of managing the cache.
Another use case of caching is implementing materialized views. These are views that are precomputed when needed, rather than each time one selects from them. Some SQL databases have materialized views as a separate database object, but in PostgreSQL you have to build them yourself, using other database features to automate the process.
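A minimal hand-rolled version looks as follows (the table names are invented; note that newer PostgreSQL versions, 9.3 and later, also provide a built-in CREATE MATERIALIZED VIEW command):

```sql
-- The "materialized view" is just a real table holding precomputed data.
CREATE TABLE matview_daily_totals AS
    SELECT sales_date, sum(amount) AS total
      FROM sales
  GROUP BY sales_date;

-- A refresh function, to be called on a schedule or from a trigger.
CREATE OR REPLACE FUNCTION refresh_daily_totals() RETURNS void AS $$
BEGIN
    TRUNCATE matview_daily_totals;
    INSERT INTO matview_daily_totals
        SELECT sales_date, sum(amount)
          FROM sales
      GROUP BY sales_date;
END;
$$ LANGUAGE plpgsql;
```

More sophisticated variants refresh only the changed rows, driven by triggers on the base tables.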
The main advantages of keeping most data manipulation code server-side are the following.
Doing the computation near the data is almost always a performance win, as the latencies for getting the data are minimal. In a typical data-intensive computation, most of the time tends to be spent getting the data; therefore, making data access inside the computation faster is the best way to make the whole thing fast. On my laptop, it takes 2.2 ms to query one random row from a 100,000-row table into the client, but only 0.12 ms to get the same data inside the database. That is almost 20 times faster, and this is on the same machine over Unix sockets. The difference can be bigger if there is a network connection between the client and the server.
A small real-world story:
A friend of mine was called in to help a large company (I'm sure you all know it, though I can't tell you which one) try to make its e-mail sending application faster. They had implemented their e-mail generation system with all the latest Java EE technologies: first getting the data from the database, passing it around between services, serializing and deserializing it several times, and finally running an XSLT transform on the data to produce the e-mail text. The end result was that it produced only a few hundred e-mails per second, and they were falling behind with their responses.
When he rewrote the process to use a PL/Perl function inside the database to format the data, so that the query returned fully formatted e-mails, it suddenly started spewing out tens of thousands of e-mails per second, and they had to add a second copy of sendmail to actually be able to send them out.
If all data manipulation code is in the database, either as database functions or views, the actual upgrade process becomes very easy. All that is needed is running a DDL script that redefines the functions; all the clients automatically use the new code, with no downtime and no complicated coordination between several frontend systems and teams.
If all access for possibly insecure servers goes through functions, the database user those servers use can be granted access only to the needed functions and nothing else. They can't see the table data, or even the fact that the tables exist. So even if such a server is compromised, all it can do is continue to call the same functions. There is also no way to steal passwords, e-mails, or other sensitive information by issuing ad hoc queries like SELECT * FROM users; and getting all the data in the database.
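Locking a role down this way takes only a few statements (the role name and the check_login function are hypothetical examples, not real objects):

```sql
-- A login role for the possibly insecure application server.
CREATE ROLE webapp LOGIN PASSWORD 'secret';

-- Take away direct table access...
REVOKE ALL ON ALL TABLES IN SCHEMA public FROM webapp;

-- ...and grant execute rights on exactly the functions it needs.
GRANT EXECUTE ON FUNCTION check_login(text, text) TO webapp;

-- webapp can now run:  SELECT check_login('bob', '...');
-- but SELECT * FROM users; fails with a permission error.
```

For a tight setup, you would also revoke the default PUBLIC execute privilege on other functions in the schema, so the role can call only what is explicitly granted.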
And the most important thing: programming in the server is fun!
Programming inside the database server is not always the first thing that comes to mind for many developers, but its unique placement inside the application stack gives it some powerful advantages. Your application can be faster, more secure, and more maintainable by pushing your logic into the database. With server-side programming in PostgreSQL, you can:
Secure your data using functions
Audit access to your data using triggers
Enrich your data using custom data types
Analyze your data using custom operators
And this is just the very start of what you can do inside PostgreSQL. Throughout the rest of this book, you will learn about many other ways to write powerful applications by programming inside PostgreSQL.