You're reading from PostgreSQL 14 Administration Cookbook

Product type: Book
Published in: Mar 2022
Publisher: Packt
ISBN-13: 9781803248974
Edition: 1st Edition
Authors (2):
Simon Riggs

Simon Riggs is the CTO of 2ndQuadrant, having contributed to PostgreSQL as a major developer and committer for 14 years. He has written and designed features for replication, performance, BI, management, and security. Under his guidance, 2ndQuadrant is now a leading developer of open source PostgreSQL, serving hundreds of clients in the USA, Europe, and worldwide. Simon is a frequent speaker at many conferences on PostgreSQL Futures. He has worked as a database architect for 30 years.

Gianni Ciolli

Gianni Ciolli is the Vice President for Solutions Architecture at EnterpriseDB (EDB). As a PostgreSQL consultant, he has driven many successful enterprise deployments for customers in every part of the globe. Gianni is respected worldwide as a popular speaker and trainer at many PostgreSQL conferences in Europe and abroad over the last 14 years. He has worked with free and open-source software since the 1990s as an active member of the community (Prato Linux User Group, and Italian PostgreSQL Users Group). Gianni has a Ph.D. in Mathematics from the University of Florence. He lives in London with his son. His other interests include music, drama, poetry, and athletics.

Chapter 5: Tables and Data

This chapter covers a range of general recipes for your tables and for working with the data they contain. Many of the recipes contain general advice, with specific PostgreSQL examples.

Some system administrators that I've met work only on the external aspects of a database server. What's actually in the database is someone else's problem.

Look after your data, and your database will look after you. Keep your data clean, and your queries will run faster and cause fewer application errors. You'll also gain many friends in the business. Getting called in the middle of the night to fix data problems just isn't cool.

In this chapter, we will cover the following recipes:

  • Choosing good names for database objects
  • Handling objects with quoted names
  • Enforcing the same name and definition for columns
  • Identifying and removing duplicates
  • Preventing duplicate rows
  • Finding a unique key for a set of data
  • ...

Choosing good names for database objects

The easiest way to help other people understand a database is to ensure that all the objects have a meaningful name.

What makes a name meaningful?

Getting ready

Take some time to reflect on your database to make sure you have a clear view of its purpose and main use cases. This is because all the items in this recipe describe certain naming choices that you need to consider carefully given your specific circumstances.

How to do it…

Here are the points you should consider when naming your database objects:

  • The name follows the existing standards and practices in place. Inventing new standards isn't helpful; enforcing existing standards is.
  • The name clearly describes the role or table contents.
  • For major tables, use short, powerful names.
  • Name lookup tables after the table to which they are linked, such as account_status.
  • For associative or linked tables, use...
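
As a minimal sketch of the lookup-table convention above, assuming a hypothetical account table (the column names are illustrative and do not come from the recipe):

```sql
-- Lookup table named after the table it describes (account -> account_status).
-- All column names here are assumptions made for illustration.
CREATE TABLE account_status (
  status_code text PRIMARY KEY,
  description text NOT NULL
);

-- Major table with a short, descriptive name that references the lookup table.
CREATE TABLE account (
  account_id  bigint PRIMARY KEY,
  status_code text NOT NULL REFERENCES account_status (status_code)
);
```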

Handling objects with quoted names

PostgreSQL object names can contain spaces and mixed-case characters if we enclose the object names in double quotes. This can cause some difficulties and security issues, so this recipe is designed to help you if you get stuck with this kind of problem.

Case-sensitivity issues can often be a problem for people more used to working with other database systems, such as MySQL, or for people who are facing the challenge of migrating code away from MySQL.

Getting ready

First, let's create a table that uses a quoted name with mixed cases, such as the following:

CREATE TABLE "MyCust"
AS
SELECT * FROM cust;

How to do it...

If we try to access this table without the proper case, we get this error:

postgres=# SELECT count(*) FROM mycust;
ERROR:  relation "mycust" does not exist
LINE 1: SELECT count(*) FROM mycust;

So, we write it in the correct case:

postgres=# SELECT count(*) FROM MyCust...
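
The following is a small, hedged sketch of how case folding and quoting interact. It uses its own temporary table, "MyDemo", a hypothetical name chosen so that the example does not depend on the tables above. PostgreSQL folds unquoted identifiers to lowercase, while quote_ident() and the %I specifier of format() add double quotes only when they are needed, which helps when building SQL dynamically:

```sql
-- Hypothetical demo table; the name "MyDemo" is an assumption for illustration.
CREATE TEMP TABLE "MyDemo" (id integer);

-- Unquoted identifiers are folded to lowercase, so the next statement would
-- look for a relation named mydemo and fail (left commented out on purpose):
-- SELECT count(*) FROM MyDemo;

-- Double quotes preserve the exact case, so this works:
SELECT count(*) FROM "MyDemo";

-- quote_ident() quotes only when required: the first value comes back
-- double-quoted, the second is returned unchanged.
SELECT quote_ident('MyDemo') AS mixed_case,
       quote_ident('mydemo') AS lower_case;

-- format() with %I applies the same identifier quoting inside dynamic SQL text.
SELECT format('SELECT count(*) FROM %I', 'MyDemo') AS generated_sql;
```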

Enforcing the same name and definition for columns

Sensibly designed databases have smooth, easy-to-understand definitions. This allows all users to understand the meaning of data in each table. It is an important way of removing data quality issues.

Getting ready

If you want to run the queries in this recipe as a test, then use the following examples. Alternatively, you can just check for problems in your own database:

CREATE SCHEMA s1;
CREATE SCHEMA s2;
CREATE TABLE s1.X(col1 smallint,col2 TEXT); 
CREATE TABLE s2.X(col1 integer,col3 NUMERIC);

How to do it...

First, we will show you how to identify columns that are defined in different ways in different tables, using a query against the catalog. We will use an information_schema query, as follows:

SELECT
 table_schema
,table_name
,column_name
,data_type
  ||coalesce(' ' || text(character_maximum_length), '')
  ||coalesce(' ' || text(numeric_precision...
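
As a simplified sketch of the same idea (not the recipe's full query), the following lists every column name that appears with more than one data type across user schemas. It compares only data_type and ignores length and precision, which is an assumption made to keep the example short:

```sql
-- Column names that are declared with more than one data type.
-- Simplification: only data_type is compared, not length or precision.
SELECT column_name,
       array_agg(DISTINCT data_type::text) AS data_types,
       count(*)                            AS occurrences
FROM information_schema.columns
WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
GROUP BY column_name
HAVING count(DISTINCT data_type::text) > 1
ORDER BY column_name;
```

With the s1.x and s2.x test tables created earlier, col1 should be reported, because it is a smallint in one table and an integer in the other.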

Identifying and removing duplicates

Relational databases work on the idea that items of data can be uniquely identified. However hard we try, there will always be bad data arriving from somewhere. This recipe shows you how to diagnose that and clean up the mess.

Getting ready

Let's start by looking at an example table, cust. It has a duplicate value in customerid:

CREATE TABLE cust (
 customerid BIGINT NOT NULL
,firstname  TEXT NOT NULL
,lastname   TEXT NOT NULL
,age       INTEGER NOT NULL);
INSERT INTO cust VALUES (1, 'Philip', 'Marlowe', 33);
INSERT INTO cust VALUES (2, 'Richard', 'Hannay', 37);
INSERT INTO cust VALUES (3, 'Harry', 'Palmer', 36);
INSERT INTO cust VALUES (4, 'Rick', 'Deckard', 4);
INSERT INTO cust VALUES (4, 'Roy', 'Batty', 41);
postgres=# SELECT * FROM cust ORDER...
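
A common way to diagnose this kind of problem, shown here as a sketch that assumes the cust table created above, is to group by the column that should be unique and keep only the groups with more than one row:

```sql
-- Which customerid values occur more than once? (Assumes the cust table above.)
SELECT customerid, count(*) AS copies
FROM cust
GROUP BY customerid
HAVING count(*) > 1;

-- Show the offending rows in full, with their physical row identifier (ctid),
-- so that each one can be inspected before anything is changed.
SELECT ctid, *
FROM cust
WHERE customerid IN (
  SELECT customerid
  FROM cust
  GROUP BY customerid
  HAVING count(*) > 1
)
ORDER BY customerid;
```

In the example data, the two rows with customerid 4 describe different people, so inspecting the rows first matters: deleting one of them blindly would lose real data.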

Preventing duplicate rows

Preventing duplicate rows is one of the most important aspects of data quality for any database. PostgreSQL offers some useful features in this area, extending beyond most relational databases.

Getting ready

Identify the set of columns that you wish to make unique. Does this apply to all rows or just a subset of rows?

Let's start with our example table:

postgres=# SELECT * FROM new_cust;
 customerid
------------
          1
          2
          3
          4
(4 rows)

How to do it…

To prevent duplicate rows, we need to create a unique index that the database server can use to enforce the uniqueness of a particular set of columns. We can do this in the following three similar ways for basic data types:

  1. Create a primary...
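
PostgreSQL's usual mechanisms for enforcing uniqueness are a primary key, a unique constraint, and a unique index. The following sketch shows each of them against the new_cust table above; it is an illustration under that assumption and may not match the recipe's elided steps exactly. In practice, you would normally choose one of them:

```sql
-- Alternative 1: a primary key (also makes the column NOT NULL).
ALTER TABLE new_cust ADD PRIMARY KEY (customerid);

-- Alternative 2: a unique constraint.
ALTER TABLE new_cust ADD UNIQUE (customerid);

-- Alternative 3: a unique index created directly.
CREATE UNIQUE INDEX ON new_cust (customerid);
```

Each statement fails if the table already contains duplicate values in customerid, so existing duplicates need to be cleaned up first.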

Finding a unique key for a set of data

Sometimes, it can be difficult to find a unique set of key columns that describe the data. In this recipe, we will analyze the data in the database to allow us to identify the column(s) that together form a unique key. This is useful when a key is not documented, not defined, or has been defined incorrectly.

Getting ready

Let's start with a small table, where the answer is fairly obvious:

postgres=# select * from ord;

We assume that the output is as follows:

 orderid | customerid |  amt
---------+------------+--------
   10677 |          2 |   5.50
    5019 |          3 | 277.44
    9748 |          3 |  77.17
(3 rows)

How to do it…

First of all...
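
One simple way to test a candidate key, sketched here against the ord table above, is to compare the total number of rows with the number of distinct values in each candidate column. This is an illustrative approach that scans the whole table, so it suits small tables or samples; it is not necessarily the method the recipe goes on to describe:

```sql
-- A column (or column set) is a candidate key when its distinct count
-- equals the total row count.
SELECT count(*)                   AS total_rows,
       count(DISTINCT orderid)    AS distinct_orderid,
       count(DISTINCT customerid) AS distinct_customerid
FROM ord;
```

For the three rows shown, orderid has three distinct values and customerid has only two, so orderid is the candidate key.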

Generating test data

DBAs frequently need to generate test data for a variety of reasons, whether it's for setting up a test database or just for generating a test case for a SQL performance issue.

How to do it...

To create a table of test data, we need the following:

  • Some rows
  • Some columns
  • Some order

The steps are as follows:

  1. First, generate a lot of rows of data. We use what is called a set-returning function. You can write your own, though PostgreSQL includes a couple of very useful ones.
  2. You can generate a sequence of rows using a query like the following:
    postgres=# SELECT * FROM generate_series(1,5);
     generate_series
    -----------------
                   1
                   2
                   3
       ...
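
Building on generate_series, a hedged sketch of a complete test table might combine the generated rows with a few derived columns. The table name, column names, and value ranges below are assumptions chosen for illustration:

```sql
-- Hypothetical test table: 1,000 rows with a key, a random amount,
-- a pseudo-random text label, and a date derived from the row number.
CREATE TEMP TABLE test_data AS
SELECT n                         AS id,
       (random() * 100)::integer AS amount,
       md5(n::text)              AS label,
       date '2022-01-01' + n     AS created_on
FROM generate_series(1, 1000) AS n;

-- Look at a few rows in a defined order.
SELECT * FROM test_data ORDER BY id LIMIT 5;
```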

Randomly sampling data

DBAs may be asked to set up a test server and populate it with test data. Often, that server will be old hardware, possibly with smaller disk sizes. So, the subject of data sampling raises its head.

The purpose of sampling is to reduce the size of the dataset and improve the speed of later analysis. Some statisticians are so used to the idea of sampling that they may not even question whether its use is valid or if it might cause further complications.

The SQL standard way to perform sampling is by adding the TABLESAMPLE clause to the SELECT statement. 
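
As a brief sketch of the clause, the following builds a throwaway table (sample_demo is a hypothetical name) and samples roughly 1% of it with each of the two built-in sampling methods:

```sql
-- Hypothetical table with 100,000 rows to sample from.
CREATE TEMP TABLE sample_demo AS
SELECT n AS id FROM generate_series(1, 100000) AS n;

-- BERNOULLI considers each row individually; about 1% of rows are returned.
SELECT count(*) FROM sample_demo TABLESAMPLE BERNOULLI (1);

-- SYSTEM samples whole blocks, which is faster but returns clustered rows.
-- REPEATABLE fixes the seed so that the same sample can be drawn again
-- while the table is unchanged.
SELECT count(*) FROM sample_demo TABLESAMPLE SYSTEM (1) REPEATABLE (42);
```

The sample size is approximate in both cases, so the counts vary from run to run unless REPEATABLE is used.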

How to do it…

In this section, we will take a random sample of a given collection of data (for example, a given table). First, you should realize that there isn't a simple tool to slice off a sample of your database. It would be neat if there were, but there isn't. You'll need to read all of this to understand why:

  1. We first...

Loading data from a spreadsheet

Spreadsheets are the most obvious starting place for most data stores. Studies within a range of businesses consistently show that more than 50% of smaller data stores are held in spreadsheets or small desktop databases. Loading data from these sources is a frequent and important task for many DBAs.

Getting ready

Spreadsheets combine data, presentation, and programs all into one file. That's perfect for power users wanting to work quickly. As with other relational databases, PostgreSQL is mainly concerned with the lowest level of data, so extracting just data from these spreadsheets can present some challenges.

We can easily handle spreadsheet data if that spreadsheet's layout follows a very specific form, as follows:

  • Each spreadsheet column becomes one column in one table.
  • Each row of the spreadsheet becomes one row in one table.
  • Data is only in one worksheet of the spreadsheet.
  • Optionally, the first...
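
When a spreadsheet follows this form and has been exported as CSV, loading it can be sketched as follows. This is a psql script with inline data so that it is self-contained; the table name, column names, and values are made up for illustration:

```sql
-- Hypothetical target table: one column per spreadsheet column.
CREATE TEMP TABLE sample (id integer, name text, amount numeric);

-- Load CSV data, skipping the header row. In psql, the data follows the
-- COPY command and ends with a line containing only \.
COPY sample FROM STDIN WITH (FORMAT csv, HEADER true);
id,name,amount
1,widget,9.99
2,gadget,12.50
\.

SELECT * FROM sample;
```

To load from a file on the client machine instead, psql's \copy meta-command accepts the same options, for example \copy sample FROM 'sample.csv' WITH (FORMAT csv, HEADER true), where sample.csv is a hypothetical file name.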

Loading data from flat files

Loading data into your database is one of the most important tasks. You need to do this accurately and quickly. Here's how.

Getting ready

For basic loading, COPY works well for many cases, including CSV files, as shown in the last recipe.

If you want advanced functionality for loading, you may wish to try pgloader, which is commonly available in all main software distributions. At the time of writing, the current stable version is 3.6.3. It has many features and is stable, with very few new features added in recent years.

How to do it...

To load data with pgloader, we need to understand our requirements, so let's break this down into a step-by-step process, as follows:

  1. Identify the data files and where they are located. Make sure that pgloader is installed on the machine where the files are located.
  2. Identify the table into which you are loading, ensure that you have the permissions to load, and check...

Making bulk data changes using server-side procedures with transactions

In some cases, you'll need to make bulk changes to your data. In many cases, you need to scroll through the data making changes according to a complex set of rules. You have a few choices in that case:

  • Write a single SQL statement that can do everything.
  • Open a cursor and read the rows out, and then make changes with a client-side program.
  • Write a procedure that uses a cursor to read the rows and make changes using server-side SQL.

Writing a single SQL statement that does everything is sometimes possible, but if you need to do more than just use UPDATE, then it becomes difficult very quickly. The main difficulty is that the SQL statement isn't restartable, so if you need to interrupt it, you will lose all of your work.

Reading all the rows back to a client-side program can be very slow – if you need to write this kind of program, it is better to do it all...
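
The third option can be sketched as a PL/pgSQL procedure that loops over the rows and commits at intervals, so that an interruption loses at most one batch of work. The table, procedure name, and the trivial "rule" applied below are assumptions for illustration, not the recipe's own code:

```sql
-- Hypothetical table with a flag that records which rows are already done.
CREATE TABLE bulk_demo AS
SELECT n AS id, false AS processed
FROM generate_series(1, 10000) AS n;

CREATE PROCEDURE bulk_demo_process(batch_size integer DEFAULT 1000)
LANGUAGE plpgsql
AS $$
DECLARE
  r record;
  done_in_batch integer := 0;
BEGIN
  -- Only unprocessed rows are read, so a rerun resumes where it stopped.
  FOR r IN SELECT id FROM bulk_demo WHERE NOT processed ORDER BY id LOOP
    -- Stand-in for a complex set of rules applied to one row.
    UPDATE bulk_demo SET processed = true WHERE id = r.id;
    done_in_batch := done_in_batch + 1;

    -- Commit after each batch so that finished work survives an interruption.
    IF done_in_batch >= batch_size THEN
      COMMIT;
      done_in_batch := 0;
    END IF;
  END LOOP;
END;
$$;

-- CALL must run outside an explicit transaction block for COMMIT to be allowed.
CALL bulk_demo_process(1000);
```

Any rows changed after the last batch boundary are committed when the CALL itself completes.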
