PostgreSQL is an object-relational database management system (ORDBMS) descended from the POSTGRES package, which was developed at the University of California, Berkeley, as a successor to INGRES (INteractive Graphics REtrieval System). The POSTGRES (Post Ingres) project started in 1986, and version 1 was released to a small number of external users in June 1989. Now, with more than 20 years of development behind it, PostgreSQL has become the most advanced open source database, used all over the world.
This chapter introduces the development of PostgreSQL and how PostgreSQL has become a popular Database as a Service (DBaaS) among the current clouds. We will present an overview of the numerous features of PostgreSQL in various development environments, such as NodeJS and Django. Moving forward, we will introduce you to the PostGIS extension, which adds geospatial capabilities to a PostgreSQL database. We will also present the PostgREST standalone web server, which aims to do one thing well: add a RESTful HTTP interface to any PostgreSQL database.
Then, in the second part of the chapter, we will learn about the administration of PostgreSQL. We will utilize DevOps through the setting up of PostgreSQL high availability (HA) clusters. We will also set up New Relic to monitor a PostgreSQL database, carry out performance tests on a PostgreSQL database with PGBench and JMeter, and use PostgreSQL testing frameworks.
In this chapter, we will cover the following main topics:
- An overview of PostgreSQL development
- An overview of PostgreSQL administration
An overview of PostgreSQL development
In 1994, Postgres95 was released to the world by Andrew Yu and Jolly Chen as an open source descendant of the original POSTGRES Berkeley code; they added a SQL language interpreter to POSTGRES. By 1996, the name "Postgres95" was changed to PostgreSQL – starting at version 6.0 – to reflect both the original POSTGRES project and the more recent versions with SQL capability. More recently, DB-Engines and the SD Times 2018 100 have featured PostgreSQL as the "DBMS of the Year 2017."
In October 2019, PostgreSQL 12.0 was released by the PostgreSQL Global Development Group, reaffirming PostgreSQL's position as the most advanced open source database in the world. Version 12 provides many important improvements, including the following:
- Increased performance of standard B-tree indexes, along with a reduction in the size of B-tree indexes.
- The ability to rebuild indexes concurrently, so a REINDEX operation will not block any index writes; the parallel index builds introduced in PostgreSQL 11 are now complemented by the new REINDEX CONCURRENTLY statement.
- Improved partitioning performance, so developers can now process (that is, query or alter) thousands of partitions simultaneously without blocking, and can use foreign keys to reference partitioned tables.
- Most-common-value (MCV) statistics for the CREATE STATISTICS command, leading to improved query plans for columns whose values are unevenly distributed.
- Common table expressions (WITH queries) can now be inlined into the parent query for faster execution.
- Support for the INCLUDE clause in generalized search tree (GiST) indexes; GiST is an extensible index structure that allows you to develop indexes over many kinds of data.
- The reduction of write-ahead log (WAL) overheads generated from a GiST, GIN, or SP-GiST index.
- The ability to control page checksums via the pg_checksums utility (formerly pg_verify_checksums), so developers can enable or disable checksums in an offline cluster without dumping and reloading data (note that online checksum enablement is still in progress and not yet available in PostgreSQL 12).
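A few of the version 12 features above can be sketched with plain SQL. The statements below are illustrative only; the table, column, and index names (orders, city, zip_code, and so on) are hypothetical:

```sql
-- Rebuild an index without blocking concurrent writes (new in v12).
REINDEX INDEX CONCURRENTLY orders_customer_id_idx;

-- Collect most-common-value statistics on a pair of correlated columns,
-- helping the planner estimate multi-column predicates.
CREATE STATISTICS orders_stats (mcv) ON city, zip_code FROM orders;
ANALYZE orders;

-- Control CTE inlining explicitly: NOT MATERIALIZED lets the planner
-- fold the CTE into the outer query (inlining is the default in v12
-- when the CTE is referenced only once and has no side effects).
WITH recent AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '7 days'
)
SELECT count(*) FROM recent WHERE city = 'Hanoi';
```

Note that MATERIALIZED can be used instead of NOT MATERIALIZED to restore the pre-12 behavior, where the CTE acts as an optimization fence.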
PostgreSQL 12 also improves the performance of the INSERT and COPY statements for partitioned tables and allows a new table partition to be attached without blocking queries. You can read more about these and other changes in the release notes at https://www.postgresql.org/docs/12/release-12.html.
Finally, PostgreSQL 12 benefits all users with notable improvements to query performance over larger datasets and to space utilization; it has quickly received good reviews and evaluations across the database industry, and it is clearly a preferred open source database for developers.
PostgreSQL 12 is also widely deployed in the cloud, in the form of so-called cloud databases and DBaaS offerings.
What is DBaaS?
DBaaS, which is sometimes referred to as a cloud database, provides many kinds of databases as a managed service. DBaaS works in the same way as Infrastructure as a Service (IaaS) or Platform as a Service (PaaS). IaaS provides infrastructure components and PaaS provides development platforms as managed services in the cloud. In fact, the offerings of IaaS and PaaS often include databases.
When customers demand a DBaaS offering in the cloud, they only pay for what they use on a monthly or annual basis. They do not have to pay for what they do not use. The cloud providers are responsible for managing database services such as maintenance, upgrades, or administration for their customers. At present, the DBaaS offerings include both relational database management systems (RDBMS) and NoSQL databases.
The primary advantage of the cloud service model, well understood by developers, is that customers do not have to install or maintain their software in a data center. However, there are also some disadvantages, such as a lack of control over network performance issues, or the inability to perform compression and other low-level database maintenance tasks.
PostgreSQL can be delivered as DBaaS on many clouds, such as Amazon Web Services (AWS), Google Cloud SQL, Microsoft Azure, Heroku, and EnterpriseDB Cloud.
The development of PostgreSQL by various environments
Most applications, at some point, need to persist data. This can be through files, local storage, cloud services, or often databases. Relational database systems are usually a good default choice, particularly PostgreSQL, which is a very powerful open source SQL server.
Some companies have resources to organize their own dedicated database team. If you are lucky enough to work in such a company, they will probably craft all of their stored procedures inside their databases. So, you will only have to use the language of your choice – NodeJS or Python – to call these stored procedures. In practice, this might not occur very often, and the truth is that many developers, or even so-called full stack developers, know very little about SQL and will seek out various abstractions in their favorite package repository.
Features of PostgreSQL with NodeJS
As a comparatively new technology, NodeJS is a cutting-edge server-side JavaScript runtime that has attracted vast attention, not only from start-ups but also from giant enterprises. NodeJS is considered an exceptional platform for the IT market because of the following factors:
- NodeJS is ideal for data-intensive, real-time applications, and it enables event-driven programming.
- Outperforming conventional web servers, NodeJS has been a viable option for Yahoo, eBay, and Walmart.
- NodeJS employs push technology.
- It is also a lightweight web server in terms of memory usage.
- NodeJS can be a good solution for data-dense, real-time web applications across multiple devices.
The following features of NodeJS are usually emphasized by developers:
- The super-fast web server: NodeJS operates at a very high speed by using a single-threaded event loop model to process all of its asynchronous I/O operations. Therefore, any major actions can be performed quickly with NodeJS, such as network connections, filesystems, and reading/writing to databases. NodeJS supports developers by allowing them to create quick and robust network applications and offers parallel connections to increase throughput.
- Real-time data streaming: NodeJS considers both HTTP requests and responses as data streams. Hence, when data comes in the form of streams, the overall processing time can be reduced because NodeJS can process files and upload files at the same time. Developers can choose NodeJS for real-time video or audio recording.
- Real-time web applications: Because NodeJS is very fast, it is an obvious winner for games and chat apps, as well as any other multi-user real-time web apps. Synchronization is fast and orderly due to the event-driven architecture of NodeJS, and its event loop handles multi-user communication over the WebSocket protocol.
- Node Package Manager (NPM): NodeJS is an open source suite with more than 60,000 modules in the NPM.
- A good solution for synchronization: NodeJS is very efficient at solving common network development problems because it manages asynchronous I/O very well with its single-threaded event loop. In order to handle many clients, all I/O tasks are processed concurrently in NodeJS.
Features of PostgreSQL with Python
Many Python developers prefer to use PostgreSQL for their applications. As a result, Python sample code for using PostgreSQL is well documented, with typical usage scenarios. Both PostgreSQL and Python are popular because of the following:
- PostgreSQL's open source license is free, so developers can easily operate as many databases as they wish without any cost.
- Developers with PostgreSQL experience are now easier to find than those experienced with many other relational databases.
- Django is a Python Model-View-Template (MVT) web framework, which defines itself as a "batteries included" web framework.
- Django is simple and robust, and it is one of the most famous frameworks used by websites such as Instagram, YouTube, Google, and NASA.
The following features are emphasized by developers of Python (Django):
- PostgreSQL's connection with Python: Python developers connect to PostgreSQL databases via the psycopg2 database driver and then use an object-relational mapper (ORM) to turn PostgreSQL tables into objects; finally, these objects can be utilized in their Python applications.
- A "batteries included" philosophy: The philosophy of "batteries included" means that Python provides plenty of functionalities, including magical ORM, multisite and multi-language support, MVT layouts, RSS and Atom feeds, AJAX support, free APIs, URL routing, easy database migrations, session handling, HTTP libraries and templating libraries, code layouts, and a default admin section.
- Good tutorials: Django has very good documentation that developers can quickly update with the latest information, such as technical requirements and quick-start details, detailed release notes, backward-incompatible changes, and online topics on Python development.
- The Django admin interface: The admin interface is one of the key advantages of Django. From a few simple lines of Python code, developers get a full-featured admin interface.
- Built-in templates: One of the advantages of Django is its robust built-in template system that facilitates the process of application development.
- Good scalability: Django can handle heavy traffic and the mobile app usage of many users; it maximizes scalability while minimizing web hosting costs, and it runs well across a large number of hosts at a relatively cheap or even free hosting price.
- Strong web application security: Django provides built-in protection against SQL injection, XSS and CSRF attacks, and clickjacking.
PostGIS spatial extension
PostGIS is a spatial extender for PostgreSQL. It turns PostgreSQL into a spatial database compliant with the OpenGIS Simple Features for SQL specification, allowing you to run location-aware SQL queries on geographic objects while remaining free and open source. Because spatial data usually relates to many other types of data, PostGIS allows PostgreSQL developers to encode more complex spatial relationships. The first version (0.1) of PostGIS was released in May 2001 by Refractions Research. The latest released version of PostGIS is PostGIS 3.1.2.
When using PostGIS, we can work with our spatial data just like anything else in SQL statements. With the power of SQL, developers conveniently get spatial database transactions, backups, integrity checks, reduced data redundancy, multi-user operations, and security controls.
The following is a list of advantages of PostGIS:
- It offers complex spatial processing, spatial operators, and spatial functions.
- It significantly shortens the development time of applications.
- It allows spatial SQL querying using simple expressions for spatial relationships, such as distance, adjacency, and containment, and for spatial operations, such as area, length, intersection, union, and buffer.
In 2006, the Open Geospatial Consortium evaluated that "PostGIS implements the specified standard for simple features for SQL." In fact, PostGIS can be used as a backend for many software systems such as OpenStreetMap, ArcGIS, OpenJUMP, MapGuide, Kosmo, and QGIS. Furthermore, PostGIS also has an open source extension named "pgRouting" that provides geospatial routing functionality with many algorithms, including the all-pairs algorithm, Johnson’s algorithm, the Floyd-Warshall algorithm, A*, the bidirectional Dijkstra and traveling salesperson algorithms, and more.
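To give a flavor of spatial SQL, the following sketch enables PostGIS and runs a simple proximity query; the places table, its contents, and the coordinates are made up for illustration:

```sql
-- Enable the PostGIS extension in the current database.
CREATE EXTENSION IF NOT EXISTS postgis;

-- A table with a geographic point column (WGS 84, SRID 4326).
CREATE TABLE places (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    geom geography(Point, 4326)
);

INSERT INTO places (name, geom)
VALUES ('Sample cafe',
        ST_GeogFromText('SRID=4326;POINT(-122.3321 47.6062)'));

-- Find all places within 1 km of a given point; with the geography
-- type, ST_DWithin takes its distance argument in meters.
SELECT name
FROM places
WHERE ST_DWithin(geom,
                 ST_GeogFromText('SRID=4326;POINT(-122.33 47.60)'),
                 1000);
```

The geography type was chosen here because it computes distances on the spheroid in meters; the geometry type is faster but works in the planar units of its coordinate system.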
The PostgREST RESTful API for PostgreSQL databases
PostgREST is a web server created by Joe Nelson that effectively turns any PostgreSQL database into an API. This tool manages client authentication through JSON Web Tokens (JWTs). It also securely embeds a database role name in the JWT and uses that database role for all connections. The latest PostgREST release is version 7.0.1, and one of its earliest versions was PostgREST 0.2.6 from February 18, 2015.
PostgREST is built on three core concepts:
- JSON encoding
- Authentication by JWTs
- Resource embedding
Taking advantage of the very fast PostgreSQL JSON encoder for large responses, PostgREST is itself a very fast RESTful API. When PostgREST first connects to a database, it connects using an authenticator role that has only login capability. Then, whenever an authenticated request containing a JWT is sent to PostgREST, the token is decoded using the secret key to determine the database role for that request. Using the select= parameter, PostgREST can also query related tables for us through foreign keys that have been declared inside the PostgreSQL database.
PostgREST's philosophy is that it aims to do one thing well: add an HTTP interface to any PostgreSQL database.
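The authenticator-role flow described above is configured with plain SQL. The following sketch follows the pattern used in the PostgREST tutorials; the role names, schema name, and password here are illustrative assumptions, not fixed requirements:

```sql
-- A role for anonymous web requests; it can only read one schema.
CREATE ROLE web_anon NOLOGIN;
GRANT USAGE ON SCHEMA api TO web_anon;
GRANT SELECT ON ALL TABLES IN SCHEMA api TO web_anon;

-- The authenticator role that PostgREST itself logs in with; it has
-- no privileges of its own (NOINHERIT) and switches to web_anon
-- (or to the role named in a request's JWT) for each request.
CREATE ROLE authenticator NOINHERIT LOGIN PASSWORD 'change-me';
GRANT web_anon TO authenticator;
```

Because authorization lives in database roles and grants, the same access rules apply whether a table is reached through PostgREST or through any other SQL client.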
An overview of PostgreSQL administration
The daily tasks of any PostgreSQL database administrator can include the optimization of parameter values defined by the PostgreSQL architecture, setting up and maintaining PostgreSQL HA clusters, and creating dashboards to monitor their PostgreSQL database clusters. They may also be asked to run performance tests a few times per month if their PostgreSQL databases frequently encounter heavy traffic loads.
The PostgreSQL architecture
The PostgreSQL architecture is very simple. It consists of shared memory and a few background processes and data files, as shown in the following diagram:
Figure 1.1 – The PostgreSQL architecture
For a better understanding, let's discuss the preceding architecture in the following sections.
Shared memory refers to the memory that is reserved for database caching and transaction log caching. The most important elements in shared memory are shared buffers and WAL buffers:
- Shared buffers: The purpose of the shared buffers is to minimize disk I/O. For this purpose, the following principles must be met:
- You need to access very large buffers (that is, tens or hundreds of gigabytes worth) quickly.
- You should minimize contention when many users access it at the same time.
- Frequently used blocks must remain in the buffer for as long as possible.
- WAL buffers: The WAL buffer is a buffer that temporarily stores changes to the database:
- The contents stored in the WAL buffer are written to the WAL file at a predetermined point in time.
- From a backup and recovery point of view, WAL buffers and WAL files are very important.
PostgreSQL includes the VACUUM statement, which reclaims storage occupied by dead tuples inside the database. When we delete tuples, they are not physically removed from their table until a vacuum is performed. So, on frequently updated tables, we need to run VACUUM periodically.
The PostgreSQL autovacuum launcher is an optional daemon. By default, it is enabled inside the PostgreSQL configuration using the autovacuum = on parameter. Therefore, unless you turn it off, the autovacuum launcher will wake up every autovacuum_naptime seconds.
Whenever autovacuum wakes up (by default, every 1 minute), it invokes multiple worker processes. The number of concurrent vacuum worker processes is defined by the autovacuum_max_workers parameter in the PostgreSQL configuration file. Autovacuum work does not take strong locks on tables, so it does not interrupt other database tasks.
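These vacuum settings can be inspected and tuned from SQL. The statements below are a sketch; the accounts table and the chosen threshold are illustrative, not recommendations:

```sql
-- Inspect the current autovacuum settings.
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum', 'autovacuum_naptime',
               'autovacuum_max_workers');

-- Make a frequently updated table vacuum-eligible sooner:
-- trigger autovacuum at roughly 5% dead rows instead of the
-- default 20% scale factor.
ALTER TABLE accounts SET (autovacuum_vacuum_scale_factor = 0.05);

-- Vacuum and refresh planner statistics manually when needed.
VACUUM (VERBOSE, ANALYZE) accounts;
```

Per-table storage parameters like this are often preferable to lowering the global scale factor, which would affect every table in the cluster.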
The PostgreSQL writer and checkpointer
Inside the shared buffer, there are pages that have been modified, but their changes have not been written to disk yet; those pages are called dirty pages. A checkpoint is a specific point in time when all of the dirty pages in the shared buffer are written to disk by the checkpointer. So, if there is a crash, the database recovery will start from the most recent checkpoint where all the data was surely consistent. The PostgreSQL checkpointer flushes all of the dirty pages at certain intervals to create a checkpoint for necessary recovery.
There are two types of writers in PostgreSQL: the WAL writer and the background writer. Between the two checkpoints, the background writer flushes some dirty pages to disk so that there are always enough clean pages to reduce the amount of blocking that occurs during the queries.
PostgreSQL does not write updates directly to the data files on disk. Instead, on commit, the updates are written sequentially to the WAL file; later, the checkpointer writes all of the dirty blocks to their respective data files. Because data file writes cannot be sequential, deferring them in this way means users do not have to wait while each data block is located inside a data file.
Hence, the WAL writer will write our updates from the shared buffer to the WAL file.
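The division of labor between the checkpointer, the background writer, and the backends themselves can be observed through the cumulative statistics view pg_stat_bgwriter:

```sql
-- How often checkpoints happen and how much work each writer does.
SELECT checkpoints_timed,   -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,     -- checkpoints forced early (e.g., by WAL volume)
       buffers_checkpoint,  -- dirty buffers written by the checkpointer
       buffers_clean,       -- dirty buffers written by the background writer
       buffers_backend      -- dirty buffers written by backends themselves
FROM pg_stat_bgwriter;
```

As a rule of thumb, a high ratio of checkpoints_req to checkpoints_timed, or a large buffers_backend count, suggests that the checkpoint or background writer settings deserve tuning.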
PostgreSQL process types
PostgreSQL has four types of processes:
- The postmaster process: The postmaster process is the first process to begin when you start PostgreSQL. It performs recovery, initializes shared memory, and runs background processes. It also creates a backend process when there is a connection request from the client process.
- The background process: The list of background processes required for PostgreSQL operations can be found in the following table:
| Process | Description |
| --- | --- |
| Logger | Writes error messages to the log file. |
| Checkpointer | When a checkpoint occurs, writes the dirty data inside the shared buffer to the data files. |
| Writer | Periodically writes dirty buffers to the data files. |
| WAL writer | Writes the WAL buffer to the WAL file. |
| Autovacuum launcher | Launches new worker processes and carries out vacuum operations on bloated tables. |
| Archiver | When WAL archiving is enabled, copies completed WAL files to the specified directory. |
| Stats collector | Collects DBMS usage statistics, such as session execution information (pg_stat_activity) and table usage statistics (pg_stat_all_tables). |
Table 1.1 – The background processes of PostgreSQL
- The backend process: The backend process performs the query request of the user process and then transmits the result. Some memory structures are required for query execution; this is called local memory.
- The client process: The client process is the user application that connects to the database. For every client connection, the postmaster forks a dedicated backend child process to serve it.
Managing HA in PostgreSQL
Managing HA in PostgreSQL is very important to ensure that database clusters maintain exceptional uptime and strong operational performance so that your data is always available to the application.
Master-slave might be the most basic and easiest HA architecture for developers. It is based on one master database with one or more standby servers. These standby databases will remain synchronized (or almost synchronized) with the master, depending on whether the replication is synchronous or asynchronous.
It is important to understand that PostgreSQL does not have a native mechanism to control failover, that is, to automatically promote the standby server to the new master when the master fails, with downtime as close to 0 as possible. To perform this procedure, there are third-party tools such as repmgr, pgpool-II, or Patroni, to name a few.
These tools are placed in a layer above PostgreSQL, and they control the health status of the master; when a problem occurs, these tools fire a series of actions to promote the standby server as the new master.
There are several ways to classify a standby database:
- By the nature of the replication:
- Physical standbys: Disk blocks are copied.
- Logical standbys: The data changes themselves are streamed.
- By the synchronicity of the transactions:
- Asynchronous: There is a possibility of data loss.
- Synchronous: There is no possibility of data loss; the commits in the master wait for the response of the standby.
- By usage:
- Hot standbys: These support read-only connections. When hot standbys are configured in synchronous-commit mode, the master server must wait for them to confirm that they have consolidated the transaction log; a commit performed on the master only completes after all of the synchronous hot standbys have finished consolidating the transaction log.
- Async standbys: These servers are configured in asynchronous-commit mode; the master server does not wait for async standbys to consolidate the transaction log, so a commit performed on the master completes without waiting for them.
PostgreSQL uses a stream of WAL records to synchronize the standby databases. They can be synchronous or asynchronous, and the entire database server is replicated.
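On the master, the synchronous or asynchronous behavior of each standby is controlled with settings such as the following, and replication progress can be watched from SQL; the standby name 'standby1' is an illustrative assumption:

```sql
-- Make commits wait for the standby named 'standby1' (synchronous).
-- An empty value here leaves all standbys asynchronous.
ALTER SYSTEM SET synchronous_standby_names = 'standby1';
SELECT pg_reload_conf();

-- Observe each standby's replication state and lag on the master.
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
           AS replay_lag_bytes
FROM pg_stat_replication;
```

The sync_state column reports sync for synchronous standbys and async for the rest, which makes it a quick way to verify that the configuration took effect.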
However, a master-slave setup is not enough to effectively ensure HA, as we also need to handle failures. To handle failures, we must first be able to detect them. Once we know there is a failure – for example, errors on the master, or the master not responding – we can select a slave and fail over to it with the smallest amount of delay possible. It is important that this process is as efficient as possible, in order to restore full functionality so that applications can start working again. PostgreSQL itself does not include an automatic failover mechanism, so this automation requires a custom script or third-party tools.
After a failover happens, the applications need to be notified accordingly so that they can start using the new master. Additionally, we need to evaluate the state of our architecture after a failover because we can run into a situation where we only have the new master running (for instance, we had a master and only one slave before the issue). In that case, we will need to somehow add a slave so as to recreate the master-slave setup we originally had for HA.
Benchmarking PostgreSQL performance
PostgreSQL is the most advanced open source database; hence, its performance should be one of the first things to evaluate.
Developers benchmark a PostgreSQL database so that they can check the capability and behavior of the database against their application. Based on the benchmarking plan, different hardware can yield different results. It is very important to separate the benchmarked database server from other servers, such as the servers generating the load or the servers collecting performance metrics. As a part of the benchmarking results, developers will obtain application characteristics, such as the following:
- Is the application read/write-intensive?
- What is the read/write split (for example, 60:40)?
- How large is the dataset?
- Is the data and structure representative of the actual production database?
Key performance factors in PostgreSQL
A production environment is composed of different components: hardware such as CPU and memory, the operating system, and other software. PostgreSQL is installed to communicate with the other components of the production environment. Please bear in mind that if the production environment is not properly configured, overall performance will degrade.
Some PostgreSQL queries can run faster or slower depending on the configuration that has been set. The goal of database performance benchmarking is to achieve the largest possible throughput. The key performance factors that affect a database are the workload, resources, optimization, and contention. The workload includes batch jobs, dynamic queries for online transactions, and data analytics queries that are used to generate reports. The workload will be different according to the period of each day, week, or month, and it also depends on the applications. The optimization of every database is unique: it can be a database-level configuration or query-level optimization. Contention refers to the conflict condition because two or more components of the workload could attempt to use a single resource at the same time. If contention increases, the throughput will decrease.
Using pgbench for PostgreSQL benchmarking
The pgbench application is the standard benchmarking tool provided alongside the official distribution of PostgreSQL. Using pgbench, we can repeatedly execute given SQL commands to measure the following performance aspects of a PostgreSQL database:
- Memory performance and disk performance
- Read/write performance
- Connection performance
- Ad hoc queries' performance (that is, special queries for the needs of services or operational information)
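By default, pgbench initializes its own tables (pgbench_accounts, pgbench_branches, pgbench_tellers, and pgbench_history) and runs a TPC-B-like transaction against them. Its built-in script is roughly the following, where \set lines are pgbench meta-commands that pick random values per transaction:

```sql
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
```

Custom script files in this same format can be supplied to pgbench to benchmark workloads that better resemble your own application.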
Monitoring PostgreSQL databases
There are several key metrics that you will definitely want to keep track of when it comes to database performance, and they are not all database-specific.
You should keep track of shared buffer usage while reading or updating data. When executing a request, PostgreSQL first checks the shared buffer cache for the required data block; if the block is not found there, it is read from disk and then cached in the shared buffer cache of the database.
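A common starting point for tracking this shared buffer usage is the cache hit ratio, which can be computed from the cumulative counters in the pg_stat_database view:

```sql
-- Fraction of block reads served from the shared buffer cache.
-- Values close to 1.0 mean most reads avoid disk I/O.
SELECT sum(blks_hit)::float
       / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;
```

Because these counters accumulate since the last statistics reset, trends over time are more informative than a single reading.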
Implementing effective database monitoring can offer benefits such as increasing performance, increasing the availability of the application, faster database outage detection, and accurately analyzing storage requirements and index performance.
The DevOps environment for PostgreSQL
DevOps is a new and revolutionary way to deploy applications and services, including tools for automation, continuous development, integration, and deployment. Frequent delays can occur in Agile deployment sprints when databases – especially PostgreSQL databases – are added to the continuous integration and deployment stages. DevOps can provide various tools for PostgreSQL database deployment, automation, and scalability.
Among DevOps tools, Vagrant and VirtualBox are two pieces of software that focus on automation tasks. They allow you to develop and manage virtual machines and environments in a single workflow, and they reduce development environment setup time. Additionally, running PostgreSQL in a virtual machine standardizes the operating system to provide a consistent environment and completely isolates unrelated databases from each other. Using Vagrant and VirtualBox, it takes only a couple of minutes to create a fresh PostgreSQL database server.
Vagrant also integrates with other DevOps tools such as Ansible and Terraform, two infrastructure tools that provide script templates to start PostgreSQL database clusters automatically in the cloud. Ansible and Terraform are good ways to automate PostgreSQL in order to speed up the development process.
The goal of database automation with DevOps is clear: your PostgreSQL servers will all look the same, so they will remain consistent through the quality assurance, staging, and production phases. Puppet is also a well-known DevOps tool that we can use to manage PostgreSQL configuration files, the postgres user, and PostgreSQL backup crontabs. Jenkins, another DevOps tool, can be used to deploy and test virtually any software project, and it can also automate PostgreSQL test databases during software testing tasks.
PostgreSQL testing frameworks
PostgreSQL testing frameworks allow developers to run automated tests for existing stored procedures, with unit tests and Test-Driven Development (TDD). Because all test cases are saved in our PostgreSQL databases, they are not stored in files, version control utilities, or external resources. One test case is a group of many database tests, and each test case is executed independently. After a test case has been executed, the changes that it has made are automatically rolled back.
There are four famous test frameworks for PostgreSQL, as follows:
- pgTap: This is a unit test framework for PostgreSQL.
- PGUnit: This is a simpler testing tool than pgTap.
- Simple pgunit: This offers another approach for unit tests.
- Testgres: This is a Python testing framework for PostgreSQL.
These testing frameworks allow users to verify the structure of their database schema and to exercise any views, procedures, functions, rules, or triggers. Furthermore, the testgres Python tool can also start a replica of a PostgreSQL database and even execute performance benchmarks against the database.
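To give a flavor of what such tests look like, here is a minimal pgTap sketch; it assumes the pgtap extension is installed and that a users table exists in the database under test:

```sql
BEGIN;
SELECT plan(2);             -- declare the number of tests up front

SELECT has_table('users');  -- schema assertion: the table must exist
SELECT ok(count(*) >= 0, 'users table is readable')
FROM users;

SELECT * FROM finish();     -- report pass/fail results in TAP format
ROLLBACK;                   -- leave the database unchanged
```

Because the whole run is wrapped in a transaction that is rolled back, the tests never leave residue in the database, which matches the rollback behavior described above.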
In this chapter, we learned about PostgreSQL development. We discovered how PostgreSQL is a popular DBaaS among the clouds and explored the good features of PostgreSQL through various application environments such as NodeJS, Django, PostGIS, and PostgREST. Next, we explored the administration of PostgreSQL by understanding the PostgreSQL architecture, PostgreSQL HA clusters, and PostgreSQL performance. Later, we learned the definition and/or principles of PostgreSQL database monitoring and introduced various DevOps tools for PostgreSQL. Finally, we developed an understanding of popular PostgreSQL database testing frameworks.
In the next chapter, we will learn how to set up a PostgreSQL RDS on a famous cloud – AWS – for ATM machines. This RDS will then be used in the remaining chapters of this book.