(For more resources on Oracle, see here.)
In order to join two tables, we will utilize the basic structure of a SELECT statement; but, we must add a few qualifiers.
Understanding the structure and syntax of ANSI join statements
When we join two tables, we add a WHERE clause that qualifies that the common columns between the tables are equivalent. The following is the syntax tree for an ANSI-compliant join:
SELECT column1, column2, ...
FROM table1, table2
WHERE table1.common_column = table2.common_column;
While the join syntax is similar in many ways to a typical select statement, we notice some differences.
- As you might expect, since we are selecting data from two tables, both table names are specified in the FROM clause, separated by commas
- The syntax of the WHERE clause is different from what we've previously seen
- In our WHERE clause, we specify the condition that the common column from the first table must be equal to the common column in the second
In constructing this part of the statement, it is important to have first identified the common column between the tables that forms the inter-table relationship. This equivalence forms the bond between the two tables.
Examining ambiguous Cartesian joins
Before we look at some examples of typical join statements, it is important to discuss one type of join that is considered undesirable in most circumstances. A Cartesian join is a join between two tables that omits the WHERE clause. The result is known as a Cartesian product. A Cartesian product is formed when every row of one table is joined to every row of another table. An example of this is shown as follows:
The resulting number of rows shown has been truncated for the sake of brevity. It is fairly easy to see what is happening without displaying all 40 rows that are returned. In a Cartesian join, the query returns the first row selected from the project table, in this case Desktop Rollout, and joins it to every row selected from the award table; first, Salesperson of the year, then Technological paper winner, and so on. It then moves to the second row returned from the project table, Security awareness training, and again joins each row from the award table. It continues in this manner until every row in the project table has been joined to every row in the award table.
The Cartesian product returned from a Cartesian join is generally considered undesirable because it has little meaning. Since no relationship has been established between the two tables based on a common column, the data from the project table does not relate in any logical way to the award table. Such joins are said to be ambiguous, since no row has any particular relationship to any other row.
It is important to remember that even if the two tables share a common column, that relationship must be specified in the WHERE clause. Failure to do so will result in a Cartesian product.
A Cartesian join is generally said to produce an a times b product, where a and b are the number of rows in the two tables. In the next screenshot, the project table has five rows, while the award table has eight rows. We can therefore say that the number of rows returned by the Cartesian product of the two tables will be 5 x 8, or 40, rows. In this way, we can predict the number of rows returned by any Cartesian join as being the number of rows in the first table times the number of rows in the second.
SQL in the real world
Another practical reason that Cartesian joins are considered undesirable is the immense strain they can put on a system from a performance perspective. While the 40-row Cartesian product from our example may not seem significant, consider two tables with one million rows each. The resulting number of rows from such a Cartesian product would be 1,000,000 x 1,000,000, or 1x1012, rows—one trillion rows. Such a mistake can cause excessive resource usage on your database system to the point of affecting other users. In fact, when tuning SQL statements, one of the most common examples of improper code to watch out for is Cartesian joins.
Using equi joins—joins based on equivalence
The core of the RDBMS is the relationships that are formed between tables. The most common relationships are based on equivalence. In this section, we examine the concept of an equi join.
Implementing two table joins with a table-dot notation
To see an example of the kind of join that would be advantageous to an SQL programmer, let's return to the earlier request to display name, date of birth, and address information for all of the Companylink employees. We use two different queries to find the required information and note the inefficiency of the process. To get this information using a join, we issue the query shown in the following screenshot:
Let's deconstruct the statement one line at a time. The first line is simply a list of all the columns that we want to display. The first four are columns from the employee table and the last four come from the address table. The second line specifies the tables that are involved in the query. We are requesting columns from the employee and address tables, so those tables are listed. The third line contains the clause that actually performs the join. We have stated that a join requires the linking of a common column between the tables. The only column that is common between the employee and address tables is the employee_id column; this column forms the relationship between the two tables. It is this column that we use to execute the join. The join clause performs this joining by creating a condition that sets the values for the employee_id column in the employee table equal to the employee_id column in the address table. Thus, each row in the address table is joined to each row in the employee table, but only where the values in the common columns are equivalent. The clause, however, requires that we specify which columns are being referred to, since they have the same name. To clearly delineate them, we prefix each of the columns in the WHERE clause with the name of the originating table, followed by a dot (.). We refer to this as the table-dot notation. As a result, the previous statement could be read as: Display name and address information from the employee and address tables, where the employee_id column in the employee table is equivalent to the employee_id column in the address table.
To see the relationship more clearly, we could rewrite the preceding statement to include the columns that form the relationship. In the following example, we have reduced the number of columns returned (for clarity) and included both columns that form the relationship. The result shows how the employee_id values for each table match.
As you can see, the rows displayed essentially show two tables joined together. The first_name, last_name, and employee.employee_id columns all belong to the employee table. The address.employee_id, street_address, and city columns belong to the address table. Yet information from both tables can be displayed together, provided that we join them with a common column. These joins are categorized as equi joins; joins based on the equivalence of values between common columns.
It is crucial that we explicitly define the two columns that form the join in the WHERE clause. A failure to do so will generate an ORA-00918 error or a column ambiguously defined error.
The following example shows the previous statements rewritten to exclude the tabledot notation with the originating tables, and the resulting error. Again, this results because the columns have the same name in each table, yet we have not defined the originating tables.
Using two table joins with alias notation
Thus far, our join examples have used two tables. However, a join can be done with any number of tables, provided they have common columns between them. We will discuss multi-table joins later in the article, but using the table-dot notation with multi-table joins is considered by some to be cumbersome, since each table and several columns must be prefixed with the associated table name.
We use double quotation marks to present column headings that are different than the actual column name. It allows us not only to display a different column name, but also to utilize case sensitivity and whitespace. This type of alias is called a column alias. We now look at a table alias, a type of alias that allows us to reference a table using a different name. In the following screenshot, we have rewritten the join shown previously, this time using table aliases, or alias notation:
This example is very similar to the original statement, but there is one significant difference. In the FROM clause, we have added aliases for our two tables. These aliases are designated by the letters that follow each of the table names. For the employee table, the alias is e, and for the address table, it is a. Thus, in the WHERE clause, instead of prefixing the table names employee and address to our columns, we simply use e and a in place of them. These same aliases are also used in the SELECT clause, where employee.employee_id and address.employee_id are simply written as e.employee_id and a.employee_id, respectively. Note that there is nothing particularly significant about using the letters a and e as aliases. We could just as easily have used emp and addr as our aliases. The purpose is simply to reduce the amount of coding that has to be written. Many SQL developers feel that using aliases is an efficient and readable way to write joins, especially those that involve numerous tables.
SQL in the real world
Although both table-dot notation and alias notation are supported in ANSI-compliant joins, the organization you work for may decide that one is preferable to the other. As with many of the choices offered to a SQL programmer, an organization's coding standards may determine how code will be written. This isn't to say that one way is necessarily better than the other, but rather to support the idea of having standards for the code written in an organization. Code standards provide rules for writing and reading code that generally lead to better interoperability between programmers.
(For more resources on Oracle, see here.)
Understanding the row inclusiveness of outer joins
The joins we've seen thus far obey one simple rule—they only produce a corresponding row if there is a match between the values of a column that is common to both tables. Joins that meet this one-for-one matching criteria are categorized as inner joins. There are, however, certain circumstances in which we may wish to include values that do not match. Examine the following screenshot:
Look carefully at the results of this query. In real-world terms, we have requested the names of each employee and the project to which they are assigned. However, our query has returned rows for 12 employees, while our Companylink database contains rows for 16 employees. This means that the project name information for four employees was not included in the record set. Why would this be? This lack of inclusiveness occurs because not all employees have a value for the project_id column in the employee table. Instead, four of the employees have a null value in the project_id column. If you wish to see this, simply enter a select * from employee; query in SQL Developer. In essence, not all employees have been assigned to a project, thus their rows are not returned because there is no match of values.
However, if we wished to include these, for example, for a report that was inclusive of all employees, we could use an outer join. An outer join is a join that purposely includes null values along with matching values in the result set. To accomplish an outer join using ANSI-compliant syntax, we use a plus symbol within parentheses, or (+), as shown in the following example:
As we can see, the four employees who were excluded from the original join are now included, along with null values for their project_name. Outer joins can be signified by the terms right, left, or full. In the ANSI syntax for outer joins, which term we use is determined by the placement of the (+) symbol. This example is known as a left outer join, since the (+) symbol for inclusiveness is placed on the right side of the query condition.
The inclusiveness of outer joins is often required in data reporting. It prevents values from "falling through the cracks". A report using the query shown in the example could be used to point out that some employees have not yet been assigned to a project; the query shown in the example previous to it lacks this information.
Notice that the query that we just saw is identical to the original query, except for the addition of the (+) symbol next to the p.project_id column. This symbol simply instructs Oracle to include any rows associated with null values found in the project_id column.
When potential SQL programmers are learning to do outer joins, a common syntactical question raised is, "which side do I put the (+) on?". This refers to the question of where the (+) symbol is placed in the statement. Do we put the (+) with the e.project_id column or with the p.project_id column? This is an important question, since the placement of the (+) symbol in an outer join is critical. Fortunately, there is an easy rule of thumb to use in order to remember this. Always place the (+) symbol with the table that does not contain null values. In this case, that would be the project table. For instance, the project_id column in the employee table, or e.project_id, includes null values. The project_id column in the project table, p.project_id, contains no nulls. In order to properly execute an outer join of the two tables, we place the (+) symbol with the table that has no nulls, the project table and the p.project_id column. Since the project table contains all the values for project_id and the employee table lacks some of these values, we place the (+) with the column from the project table, p.project_id. In the following example, we show an example of improper placement of the (+) symbol. Here, we execute the same query as the original, correct outer join, but we place the (+) with the e.project_id column from the employee table. The (+) is paired with the wrong table. The resulting record set is only 12 rows instead of the inclusive 16 we were attempting to retrieve, since we have not properly instructed Oracle to retrieve rows with a null value. Always be aware of the placement of the (+) symbol in outer joins.
The following example demonstrates a right outer join, since the (+) symbol is on the left side of the condition.
Outer joins have three configurations—right, left, and full. We have seen examples of both right and left outer joins. In a full outer join, the (+) symbol is placed with both columns, allowing the join to include null values from either column in the resulting set of rows.
In the Oracle proprietary join syntax, outer joins are much easier to interpret. Right, left, and full outer joins are constructed using the RIGHT OUTER JOIN, LEFT OUTER JOIN, and FULL OUTER JOIN keywords, respectively. The (+) symbol is not used.
Retrieving data from multiple tables using n-1 join conditions
As we have seen, joins can be a very effective way to retrieve information from two different tables. However, situations can arise that require data to be retrieved from more than two tables. In these circumstances, we can use our previously-discussed join techniques to retrieve data from as many tables as we wish, provided that we can establish the proper relationships between them. For instance, say that we've been tasked with writing the SQL for a report that will retrieve the information for employees have created websites and blogs in Companylink. When we are asked to create joins over multiple tables, it is often advantageous to have an ERD, or Entity Relationship Diagram, to which to refer. ERDs are used to graphically represent the relationships between tables. Let's look at an ERD for our Companylink database. It is shown in the following diagram:
Study this diagram closely. The tables are each represented by their names inside of boxes. The lines between the tables indicate the type of relationship between them. Tables can have different kinds of relationships, although the most common are one-to-one and one-to-many. In a one-to-one relationship, there is a "one and only one" relation between each of the rows in the tables. An example of this is the relationship between the website table and the blog table, shown by a single solid line between the two. This indicates that for each row in the blog table, there is one and only one corresponding row in the website table. Thus, we can locate the common column in the two tables— blog_id—and join the two tables on that column.
The more prevalent type of relationship between tables in a relational model is the one-to-many relationship. We can see an example of this between the employee and email tables, denoted by the line and "crow's foot" (three diverging lines) terminating at the table. When we see this diagram we can interpret it as: for every row in the employee table, there can exist one or more corresponding rows in the email table. In simpler terms, we can say that based on this diagram, each employee can have one or more e-mail addresses. Using an ERD can be a powerful, efficient way to form our joins.
Let us return to the requirement given to us at the beginning of this section, that is, to write an SQL query for a report that will retrieve the information for employees who have created websites and blogs in Companylink. We can surmise that we will need to select data from the employee, website, and blog tables: a situation that will require a join. From our ERD, we can see that although relationships exist between employee and website, and website and blog, no direct relationship exists between the employee and blog tables. How, then, can we retrieve the requested data? Even though no direct relationship exists between the employee and blog tables, an indirect relationship does. This indirect relationship exists because employee relates to website and website relates to blog. In essence, we can get from employee to blog through the website table. Doing so will require a multi-table join, as shown in the following screenshot:
As we can see, this multi-table join is similar to our two-table joins, with two exceptions. First, our FROM clause contains three tables (the three targets of our query) instead of two. These three table names are separated by commas and denoted with aliases. Second, there is an additional join condition specified. Our first condition, the joining of employee and website, based on the employee_id column, is specified after the WHERE clause. Our second condition, the joining of website and blog, is specified following a Boolean AND operator. In short, we only retrieve rows if both join conditions are true.
Notice that the number of conditions, or joins, that must be specified is one less than the number of tables involved. This is always the case.
Thus, we can say that for any number of tables, n, that must be joined, there must exist n-1 join conditions. We can refer to this as the n-1 join rule. It is true no matter how many tables are involved. Thus, if we need to join 20 tables, 19 join conditions will be required.
In our example, even though we used the website table to get from the employee table to the blog table, it is not required that we include any columns from the intermediary table in our query. For instance, let's say that we receive the requirement, display all employees who have received awards and the name of the award they received. To begin, we refer back to our ERD. We recognize that we will need data from the employee table and the award table, so we locate them on the diagram. We see, however, that they have no direct relationship. Even though no direct relationship exists, we notice that the employee_award table exists between the two. Thus, we can use n-1 join conditions (in this case, two conditions) to join three tables, as shown in the following screenshot. We could interpret this query as, display all employees who have received awards and the name of the award they received.
Here, even though we utilize the employee_award table as a bridge between the tables that contain the data we want, we do not retrieve any columns from it. We are only required to make use of it in order to form a relationship between the employee and award tables. In fact, the employee_award table is interesting because it holds very little actual Companylink data, as we can see from the following screenshot:
As we can see, beside the ID values from the employee and award tables, the employee_award table only contains one valuable column—date_awarded. If we were to attempt to form a relationship between only the employee and award tables, we would refer to this as a many-to-many relationship; meaning more than one employee can have more than one award. In the relational paradigm, this is generally considered unacceptable due to the problems of forming direct relationships. For our model to be fully relational, we must resolve this many-to-many relationship with a bridging entity. In our case, this bridging entity is the employee_award table, which has the primary function of completing the relationship between employee and award.
SQL in the real world
In real-world development situations, you may not always have an Entity Relationship Diagram at your disposal. Even though an ERD is tremendously useful, the lack of one should not prevent you from being able to construct complex joins. Instead, simply look at the tables with your target data and follow the trail of common values between them. It is often helpful to sketch the tables and columns out on a piece of paper and draw your own relationships.
Working with less commonly-used joins—non-equi joins and self-joins
Our final look at ANSI-compliant joins will conclude with a brief discussion of two uncommon types of joins: non-equi joins and self-joins. These types of joins are less frequently used in actual development and are, sometimes, only considered necessary in an improperly constructed data model. Nevertheless, we review them here for the sake of completeness.
Although joins are typically constructed using equivalent values in common columns, which we have referred to as equi joins, we can join tables based on conditions of non-equality. A non-equi join is formed between tables where the join condition is based on any condition other than an equal sign, which can include <, >, !=, <>, and BETWEEN. An example of a non-equi join is shown in the following screenshot:
In this example, we have used a new technique—joining two instances of the same table. We can do this because we have separately aliased the employee table as e1 and e2. The two instances of the employee table are joined together based on two conditions. The first condition specifies an employee_id of 10, limiting the e1 instance of the table to only that row. The second condition, the non-equi join, returns all values with a signup_date in the e2 instance that is greater than the signup_date in instance e1. Since the e1 instance is reduced to only rows where employee_id equals 10, the query returns rows greater than that value. In real-world language, this query could be described as, Return employee information where the employee signed up for Companylink after employee #10. A non-equi join does not always have to use two instances of the same table to complete the join, as in our example. Some non-equi joins create a join between two tables without a common column utilizing the BETWEEN clause. However, the table data must be structured in a way that makes this possible.
A table can also be joined back on itself using a self-join. We've seen joins between different tables where a common column is used as the join condition. However, a self-join is constructed when one column of a table is joined to a column in the same table. Self-joins are sometimes used where a developer needs to iterate over a single table. An example of a self-join is shown in the following screenshot. Again, it uses a single table, aliased as two instances.
In this statement, the first instance of the employee table, e1, is limited by rows having an employee_id equal to 2. This is then joined to the e2 instance, where the project_id values are equivalent. In real language, this query could be stated as, Display the employee and project ID information for all employees who have the same project ID number as employee #2. A self-join can also be accomplished without aliasing the same table, provided that the table has a column that relates back to a different column in the same table. An example might be an employee table with a supervisor_id column that contains the employee_id of the supervisor. Since the supervisor is also an employee, the supervisor_id column could be related back to the employee_id column.
In this article, we've added the powerful capabilities of joins to our SQL repertoire. We've examined join techniques from two separate syntax families. We learned to write Cartesian joins, equi joins, non-equi joins, and self-joins in the ANSI syntax for both two-table and multi-table joins.
- An Overview of Oracle Advanced Pricing [Article]
- Getting Started with Oracle GoldenGate [Article]
- Oracle GoldenGate 11g: Performance Tuning [Article]
- Oracle Web Services Manager: Authentication and Authorization [Article]