Indexing in MySQL Admin

by Daniel Schneller and Udo Schwedt | June 2010 | MySQL

In this article series by Daniel Schneller and Udo Schwedt, authors of MySQL Admin Cookbook, we will cover:

  • Adding indexes to tables
  • Adding a fulltext index
  • Creating a normalized text search column
  • Removing indexes from tables
  • Estimating InnoDB index space requirements
  • Using prefix primary keys
  • Choosing InnoDB primary key columns
  • Speeding up searches for (sub)domains
  • Finding duplicate indexes

Read the first part of this article series, Optimizing your MySQL Servers' Performance using Indexes.

Using prefix primary keys

In this example we will add indexes to two tables that are almost identical. The only difference will be the definition of their primary keys. You will see the difference in space consumption for secondary indexes between a regular full-column primary key and a prefix primary key. The sample table structure and data are designed to make the effect clearly visible. In real-world scenarios the effect will usually be less pronounced.

Getting ready

Connect to the database server with your administrative account.

How to do it...

  1. Download the sample script for this article from the book's website and save it to your local disk. In the example below, we will assume it is stored in /tmp/idxsizeestimate_sample.sql.
  2. Create a new database and make it the default database:
    CREATE DATABASE pktests; USE pktests;
  3. Import the sample data from the downloaded file. When done, you will be presented with some statistics about the two tables loaded. Note that both tables have an Index Length of 0.
    SOURCE /tmp/idxsizeestimate_sample.sql;
  4. Now with the sample tables present, add an index to each of them:
    ALTER TABLE LongCharKey ADD INDEX IDX_PAY_10(Payload(10));
    ALTER TABLE LongCharKey10 ADD INDEX IDX_PAY_10(Payload(10));
  5. Display the data and index sizes of the tables now:
    SHOW TABLE STATUS LIKE 'LongCharKey%';
  6. Add another index to each table to make the difference even more evident:
    ALTER TABLE LongCharKey ADD INDEX IDX2_PAY_10(Payload(10));
    ALTER TABLE LongCharKey10 ADD INDEX IDX2_PAY_10(Payload(10));
  7. Display the data and index sizes of the tables again and compare with the previous values:
    SHOW TABLE STATUS LIKE 'LongCharKey%';

    Name            Rows    Data Length    Index Length    Index/Data Ratio
    LongCharKey     50045   30392320       28868608        94.99%
    LongCharKey10   50045   29949952       3178496         10.61%

    With the second index added, the difference in index length becomes even clearer.

How it works...

Executing the downloaded script will set up two tables with the following structures:

CREATE TABLE `LongCharKey` (
  `LongChar` char(255) NOT NULL,
  `Payload` char(255) DEFAULT NULL,
  PRIMARY KEY (`LongChar`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `LongCharKey10` (
  `LongChar` char(255) NOT NULL,
  `Payload` char(255) DEFAULT NULL,
  PRIMARY KEY (`LongChar`(10))
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The two tables are almost identical, except for the primary key definition. Both are populated with exactly the same 50,000 records of pseudo-random data. The Payload column is filled with sequences of 255 random letters each. The LongChar column carries a sequential number in its first 10 characters, followed by the same sort of random data for the remaining 245 characters.

SELECT LEFT(LongChar,20), LEFT(Payload, 20) FROM LongCharKey LIMIT 5;



LEFT(LongChar,20)      LEFT(Payload, 20)
0000000000KEAFAYVEJD   RHSKMEJITOVBPOVAGOGM
0000000001WSSGKGMIJR   VARLGOYEONSLEJVTVYRP
0000000002RMNCFBJSTL   OVWGTTSHEQHJHTHMFEXV
0000000003SAQVOQSINQ   AHDYUXTAEWRSHCLJYSMW
0000000004ALHYUDSRBH   DPLPXJVERYHUOYGGUFOS

While the LongCharKey table simply marks the whole LongChar column as a primary key with its entire 255-character length, the LongCharKey10 table limits the primary key to the first 10 characters of that column. This is perfectly fine for this table, because the test data was crafted to be unique in this range.

Neither of the two tables has any secondary indexes defined yet. Looking at the relevant table status data shows that they are roughly equally big (some columns left out for brevity):
SHOW TABLE STATUS LIKE 'LongCharKey%';



Name            Rows    Data Length    Index Length
LongCharKey     50045   30392320       0
LongCharKey10   50045   29949952       0

With each index added, the Index Length for the first table will increase significantly, while for the second one it grows much more slowly.

In the case of the LongCharKey table, each secondary index record carries a complete copy of the LongChar column, because the full column is the primary key. Assuming a single-byte character encoding, every secondary index record is inflated by 255 bytes on top of the 10 bytes needed for the actual index entry. In other words, a whole kilobyte is spent just on primary key references for every 4 records!

In contrast to that, the primary key definition of the LongCharKey10 table only includes the leading 10 characters of the LongChar column, making the secondary index entry 245 bytes shorter and thereby explaining the much slower growth upon adding further indexes.
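If you prefer to compute the Index/Data ratio instead of reading it off the SHOW TABLE STATUS output, the same figures are available from the information_schema metadata. A minimal sketch (the pktests schema name stems from the earlier steps):

SELECT TABLE_NAME,
       DATA_LENGTH,
       INDEX_LENGTH,
       ROUND(INDEX_LENGTH / DATA_LENGTH * 100, 2) AS index_data_ratio
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'pktests'    -- byte counts as reported by the storage engine
  AND TABLE_NAME LIKE 'LongCharKey%';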

Choosing InnoDB primary key columns

In the article introduction we promised to shed some light on how to choose your InnoDB primary key columns sensibly. Be advised that choosing good primary key columns is not an exact science—there are multiple aspects that influence this decision. Depending on your needs and preconditions you will want to prioritize them differently from one table to the next. Consider the following as general advice rather than hard rules that must be obeyed unconditionally.

Getting ready

In order to make reasonable decisions on primary key columns, it is important to have a very clear understanding of what the data looks like in the table at hand. If you already have existing data that is to be stored in an InnoDB table—for example, data currently in MyISAM format—it can be helpful to check it against the criteria below.

If you are planning a new schema, you might have to guess about some characteristics of the future data. As is often the case, the quality of your choices is directly proportional to how good those guesses are.

This recipe is less a strict set of step-by-step instructions to be followed from top to bottom, and more a list of properties a good primary key should have—even though you might decide some of them do not apply to your actual environment. As a rule of thumb, however, a column that fulfills all or most of the attributes described below is most probably a sensible choice for a primary key. See the How it works... section for details on the individual items.

How to do it...

  1. Identify unique attributes: This is an absolute (technical) requirement for primary keys in general. Any data attribute that is not strictly guaranteed to be free of duplicates cannot be used alone as a primary key.
  2. Identify immutable attributes: While not absolutely necessary, a good primary key is never changed once it has been assigned. For all intents and purposes, you should avoid columns that have even a small chance of being changed for existing records.
  3. Use reasonably short keys: This is the "softest" criterion of all. In general, longer keys have negative impacts on overall database performance—the longer the worse. Also, consider a prefix primary key. See Using prefix primary keys earlier in this article for more information.
  4. Prefer single-column keys: Even though nothing prevents you from choosing a composite primary key (a combination of columns that together form the uniqueness), this can easily become a hassle to work with, especially when handling foreign keys.
  5. Consider the clustered index nature of the primary key: As InnoDB's primary key is also clustered, you should take this special nature into account as well. It can speed up read access a lot if you often have to query for key ranges, because disk seek times will be minimized. A table definition meeting all five criteria is sketched after this list.
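As an illustration, here is a minimal sketch of a table whose key satisfies all five criteria; the table and column names are invented for this example:

CREATE TABLE orders (
  order_id   INT UNSIGNED NOT NULL AUTO_INCREMENT, -- unique, immutable, short, single column
  client_id  INT UNSIGNED NOT NULL,
  ordered_at DATETIME     NOT NULL,
  PRIMARY KEY (order_id)  -- monotonically increasing, so new rows append to the clustered index
) ENGINE=InnoDB DEFAULT CHARSET=latin1;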

How it works...

In the following sections, we will try to shed some light on what each step of the recipe is concerned with in a little more detail.

Uniqueness

An absolute requirement for primary keys is their uniqueness. Every record in your table must have a distinct primary key value. Otherwise, neither MySQL nor any other database product, for that matter, could be sure it was operating on exactly the right rows when executing your queries.

Usually, most entities you might want to store in a relational database have some sort of unique characteristic that might be a suitable primary key. If they do not, you can always assign a so-called surrogate key for each record. Often this is some sort of unique numeric value, either generated by an application working on top of the database or by MySQL itself using an AUTO_INCREMENT column.
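When MySQL generates the surrogate value, the application can pick it up immediately after the insert. A sketch, reusing the hypothetical orders table from the previous section:

INSERT INTO orders (client_id, ordered_at) VALUES (42, NOW());
SELECT LAST_INSERT_ID(); -- the order_id value just assigned by AUTO_INCREMENT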

Immutability

Primary key columns should generally be (virtually) immutable; that is, under no circumstances should you have to modify their values once they have been inserted into the database.

In our books example, the ISBN number cannot be changed once a book has been published. The same would apply for a car's chassis number.

Technically, of course, they can be changed after their creation. However, this will be very difficult to perform in practice once the original value has been used to establish foreign key relationships between tables. In these cases, you will often have to resort to complicated and even unsafe methods (risking data inconsistencies) to perform the changes.
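As a hedged illustration, suppose a hypothetical order_items table declares a foreign key on books.isbn; a straightforward correction of a mistyped ISBN is then rejected outright:

-- order_items.isbn REFERENCES books(isbn); both tables are hypothetical
UPDATE books SET isbn = '9780000000002' WHERE isbn = '9780000000001';
-- ERROR 1451 (23000): Cannot delete or update a parent row:
-- a foreign key constraint fails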

Moreover, as the primary key is stored as a clustered key in InnoDB, changing its value will require the whole record—including all columns—to be moved to its new location on disk, causing additional disk I/O.

Note that sometimes columns that may at first seem constant over time really are not. For example, consider a person's social security number. It is designed to be unique and can never change or be reassigned to a different human being. Consequently, it would seem like a good choice for a primary key in a table of people.

But consider that in most cases data will be entered into the database manually—be it through forms, text file imports, or other channels. In some form or another, someone will have typed it in through a keyboard.

Manual input is by definition an error-prone process. So you might end up with a person's record that has two digits transposed in their primary key social security number without immediately knowing it. Gradually, this wrong value will spread through your database—it will be used in foreign key relationships, forming complex data structures. When you later find out about the error—for example, because another person who really owns that number needs to be inserted—you are facing a real problem.

Unless you are absolutely and positively sure a value can never change once it has been assigned to a record, you should consider adding a new column to your table and use a surrogate key, for example, an auto-incrementing number.

Key length

There are several reasons for keeping keys as short as possible. InnoDB uses a single large area of memory—the buffer pool—for its caching purposes. It is used for both row and index data, which are stored as memory-cached copies of individual pages straight from the tablespace data files. The shorter each key value is, the more of them fit into a single data page (the default size is 16 KB). For an index with 16 bytes per index value, a single page will contain about a thousand index entries. For an index with only 8 bytes per entry, twice as many values can be cached in the same amount of space. So to get the most out of memory-based caching, smaller indexes are better.

For the data record as a whole, there might not be much of a difference between 8 or 16 bytes compared with the overall record length. But remember (or refer to the article introduction if you don't) that the primary key length is added to each secondary index's length again. For example, a secondary index entry on an 8-byte field will actually be 16 bytes long if the primary key also has 8 bytes per entry. A 16 KB data page then provides space for roughly 1,000 index entries. If the primary key is 16 bytes long, the same page is only sufficient for about 680 entries, reducing the effectiveness of cache memory.
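The arithmetic is easy to check; ignoring per-page and per-record overhead (which lowers the real figures somewhat), MySQL itself can do it:

SELECT 16384 DIV (8 + 8)  AS entries_with_8_byte_pk,  -- about 1,000 entries per page
       16384 DIV (8 + 16) AS entries_with_16_byte_pk; -- about 680 entries per page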

Single column keys

Depending on the data you intend to store in an InnoDB table, you might consider using a composite primary key. This means that no single column's value alone uniquely identifies a single record; only the combination of several independent columns allows uniqueness. From a technical point of view, this is perfectly feasible and might even be a good choice from a semantic point of view.

However, you should very carefully weigh the alternatives, because composite keys can quickly become a burden. The more secondary tables define foreign key relationships with a table using a composite primary key, the more complicated your queries will become, because whenever you join the two, you have to define a join condition on every column of the composite key. For more complex queries with multiple joins, this quickly becomes very hard to maintain and therefore carries a great risk of errors that might be hard to find.

In addition, you also have to consider the increased size of each key entry, because the sizes of all the participating columns add up.

As general advice, you should definitely consider using a surrogate key when you cannot find any candidate that fulfills the other criteria just discussed.

Clustered Index

As data is physically stored on disk in the order of the clustered key, similar key values end up in neighboring locations. This makes clustered indexes very efficient for queries that retrieve ranges of records by this key. If, for example, the clustered key is a timestamp of some sort, retrieving all records within a contiguous timespan is likely to require relatively little physical disk I/O, because ideally all requested result rows are stored in the same data page, therefore needing only a single read operation (which might even be cached). Even if multiple pages have to be read, this requires only a sequential read operation, which leverages linear disk read performance best.
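For instance, assuming a hypothetical access_log table whose primary (and therefore clustered) key leads with a timestamp column, a range query like the following reads mostly adjacent pages:

SELECT *
FROM access_log  -- hypothetical table, clustered key leading with logged_at
WHERE logged_at BETWEEN '2010-06-01 00:00:00'
                    AND '2010-06-01 23:59:59';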

Unfortunately, InnoDB does not allow a non-primary key to be clustered—other DBMS do—so you have to weigh the alternatives and maybe live with a compromise when deciding on the primary key for your InnoDB tables.

Speeding up searches for (sub)domains

In a column storing e-mail addresses, searching for all addresses of a given domain is a non-trivial task performance-wise. With the complete address stored in a single column, the only way to find all addresses @gmail.com is to use a LIKE query with a leading wildcard:

SELECT * FROM clients WHERE email LIKE '%@gmail.com';

Of course, storing the address and domain parts in separate columns would solve this particular problem. But as soon as you were asked for a quick way to find all clients with an e-mail address from a British provider, you would be out of luck again, resorting to:

SELECT * FROM clients WHERE maildomain LIKE '%.co.uk';

Both queries would cause a full table scan, because no index can support a search term that begins with a wildcard.
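You can verify this with EXPLAIN; in its output, key: NULL combined with type: ALL indicates a full table scan (the exact output varies with version and data):

EXPLAIN SELECT * FROM clients WHERE maildomain LIKE '%.co.uk';
-- type: ALL, key: NULL -> no index can be used, every row is examined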

In this recipe, you will be given a simple but effective approach to enable the use of indexes for both of the problems just presented. Note that you will need to make minor adjustments to the queries sent against the database, which might involve some code changes in your application.

Getting ready

To implement the ideas presented here, you will need to connect to the database server with your administrative account for the schema modifications. Apart from the database modifications, application program code changes will be necessary as well, so you should involve an application developer.

How to do it...

  1. Identify which column is currently used to store domain-related data. In the example, we will be using the maildomain column of the clients table.
  2. Update this column and reverse the contents of the field like this:

    UPDATE clients SET maildomain=REVERSE(maildomain);

  3. If not already set up, add an index to the column:

    ALTER TABLE clients ADD INDEX IDXR_MAILDOMAIN(maildomain);
  4. Change all queries in your application as follows:

Before:

    SELECT name, maildomain FROM clients WHERE maildomain LIKE '%.co.uk';

    After:

    SELECT name, REVERSE(maildomain) AS maildomain FROM clients WHERE maildomain LIKE REVERSE('%.co.uk');

The point here is to reverse the search condition as well as the column in the column list. SELECT statements using the star placeholder instead of explicit column names need to be rewritten so that the maildomain column can be returned in reversed form.

How it works...

Indexes are designed to speed up queries by sorting the relevant column contents, which makes finding records with a given search prefix easy.

Searching for all people whose name starts with an "S", for example, is supported by this technique. The more characters you provide, the more specific the search gets, again ideally supported by an index.

Domain names are a different story, however, because those belonging together do not share a common prefix but a common suffix. There is no immediate way of telling MySQL to create an index supporting this kind of data.

The first idea that comes to mind to work around this would be to use a query along the lines of:

SELECT * FROM clients
WHERE REVERSE(maildomain) LIKE 'ku.oc.%';

Unfortunately, MySQL—in contrast to other DBMS—can neither use indexes in conjunction with functions like REVERSE() nor create an index based on a function in the first place. Instead, it resorts to full-table scans as soon as it encounters a function call applied to a column in a query's WHERE clause. In this case, the REVERSE() function is applied to the maildomain column.

With a minor adjustment to the way data is stored, this limitation can be alleviated, however: store the data backwards in the first place!

When inserting new data into the table, we reverse it first:

INSERT INTO clients (maildomain, …)
VALUES (REVERSE('example.co.uk'), …);

When retrieving data later, we just need to reapply the same function to get back at the original data:

SELECT REVERSE(maildomain) FROM clients
WHERE maildomain LIKE REVERSE('%.co.uk');

As now the query condition does not contain a function call on a column anymore, MySQL is happy to use an index on the maildomain column to speed up the search.

It might seem odd at first that even with two calls to the REVERSE() function this query can in fact use an index.

The key point is that MySQL never has to apply the function to any column data while searching, only to the constant condition (the '%.co.uk' string) and later—once the rows have already been fetched—to the retrieved, reversed content of the maildomain column. Neither of these interferes with index use.

The query is really executed in two phases. In the first phase, MySQL looks at the condition and checks whether it can replace any function call with a constant. So, when we write:

SELECT REVERSE(maildomain) FROM clients
WHERE maildomain LIKE REVERSE('%.co.uk');

After the first phase, internally the query looks like this:

SELECT REVERSE(maildomain) FROM clients
WHERE maildomain LIKE 'ku.oc.%';

In this query, there is no function call left in the condition, so the index on the maildomain column can be used, speeding up the execution as desired.

There's more...

If your application typically issues queries that need to retrieve contiguous ranges of domains—as in the preceding example—you might consider using the reversed domain name as primary (and therefore clustered) key.

The advantage would be that the related records would be stored closely together on disk, in the same or adjacent data pages.

However, updating an existing table's primary key column can be both very time consuming, as all data rows need to be physically rearranged, and complicated when foreign key constraints are in place.
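A hedged sketch of such a migration, assuming the reversed maildomain values are unique, no foreign keys reference the old primary key, and no AUTO_INCREMENT column depends on it:

ALTER TABLE clients
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (maildomain); -- the column already stores REVERSE()d values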


Finding duplicate indexes

Over time, database schemata are subject to changes such as index additions and deletions. It is not uncommon to end up with multiple indexes that are equivalent in terms of query execution, but defined with different names or even different columns.

This duplication of indexes has negative consequences for your database:

  • Increased size: The more indexes there are, the bigger the database gets.
  • Lower performance: Each index has to be updated on modifications of the respective table, wasting precious I/O and CPU resources.
  • Increased schema complexity: Schema maintenance and understanding of the tables and relationships become more complicated.

For those reasons, you should be concerned about superfluous indexes.

In this recipe, we will present a way to quickly find out which indexes can be dropped from a table as they are functionally equivalent (if not necessarily formally identical) to another one.

Getting ready

In order to run the program presented here, you will need a Java Runtime Environment (JRE or just Java) installed. You can download it for free from http://www.java.com.

Download the Index Analyzer for MySQL from the book's website.

You will also need login credentials with administrative privileges for the server and the database you want to analyze.

How to do it...

  1. Launch the downloaded application by double-clicking its icon. The connection window will appear.

  2. Enter the connection data for your MySQL server and specify the database to check. If you like, you can store these settings for later use.
  3. Hit the Connect button. The analysis will begin. Stand by—this might take a minute or two, depending on the number of tables, columns, and indexes in that database.
  4. When the analysis is complete, review the changes the tool proposes. Apart from the tree-like display, you can use the Generate SQL button to copy the ALTER TABLE statements that would apply the suggested changes to either the clipboard or a file.

How it works...

The Index Analyzer for MySQL tool connects to your database and retrieves information about the indexes defined in the database you specified. It then checks for indexes that are redundant compared with one or more of the others. It will detect the following situations:

  1. Two indexes are completely identical.
  2. One index is a prefix of a second, longer one. As MySQL can use the second one for the same queries (ignoring the superfluous columns), the shorter index is redundant.
  3. On an InnoDB table, an index is explicitly defined to end with the primary key column(s). As MySQL internally appends the primary key columns to every secondary index anyway, they should be removed from the explicit definition.

The tree display's root node is the database you selected, followed by the tables with redundant indexes. For each table, one or more detail nodes describe the analysis results in terms of which index is made obsolete by which other.
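For illustration, all three situations can be reproduced with a single hypothetical table definition:

CREATE TABLE employees_demo (
  emp_no    INT UNSIGNED NOT NULL,
  last_name VARCHAR(30)  NOT NULL,
  PRIMARY KEY (emp_no),
  INDEX idx_name     (last_name),         -- prefix of idx_name_no: redundant (situation 2)
  INDEX idx_name_dup (last_name),         -- identical to idx_name (situation 1)
  INDEX idx_name_no  (last_name, emp_no)  -- explicitly ends with the PK column (situation 3)
) ENGINE=InnoDB;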

Each node also contains a rough estimate of how much space could be saved by dropping the redundant index. Note that this is just a ballpark figure to give you an idea; in the following example, the actual savings differ from the estimates, according to the statistics MySQL offers via the SHOW TABLE STATUS command:



Table / Index           Index Size before   Estimated Savings   Actual Savings
dept_emp / emp_no       10MB                5.5MB               4.5MB
dept_manager / emp_no   32k                 384 bytes           16k
salaries / emp_no       34.6MB              59.7MB              35MB
titles / emp_no         11MB                13.5MB              11MB
All tables were defined with an extra index on the emp_no column, which was made obsolete by the primary key. Note that the difference between estimated and actual savings can be significant. This is because MySQL's estimates are based on multiples of the data page size—16 KB—while the Index Analyzer application uses average column lengths.

There's more...

Apart from the Index Analyzer for MySQL available from this book's website, there are other tools for the same purpose as well. If you do not want to, or cannot, install a Java Runtime Environment, you might prefer Maatkit's mk-duplicate-key-checker. It is a free command-line tool based on Perl and can be used on a variety of platforms. You can get it, including the full documentation, from http://www.maatkit.org.

In this article series, we covered:

  • Adding indexes to tables
  • Adding a fulltext index
  • Creating a normalized text search column
  • Removing indexes from tables
  • Estimating InnoDB index space requirements
  • Using prefix primary keys
  • Choosing InnoDB primary key columns
  • Speeding up searches for (sub)domains
  • Finding duplicate indexes

About the Authors

Daniel Schneller

Daniel Schneller works as a software developer, database administrator, and general IT professional for an independent software vendor in the retail sector. After successfully graduating from the University of Cooperative Education in Heidenheim, Germany with a degree in Business Computer Science, he started his career as a professional software developer, focused on the Microsoft technology stack. In 2002 he started focusing on enterprise-level Java development and has since gained extensive knowledge and experience implementing large scale systems based on Java EE and relational databases, especially MySQL since version 4.0.

Currently he is mostly involved with the ongoing development of framework-level functionality, including customization and extension of an ORM-based persistence layer. He is involved in different open source projects such as FindBugs, Eclipse, and Checkstyle and infrequently blogs about Java, MySQL, Windows, Linux and other insanities at http://www.danielschneller.com.

Udo Schwedt

Udo Schwedt is head of the Java architecture team and deputy head of the Java development department at the IT service provider for Germany's market leader in the Do-It-Yourself sector. After finishing his studies at RWTH Aachen, Germany with a degree in Computer Science, he started his career as a professional Java developer in a leading software company in the banking sector. In 2003, he joined his current company as a framework developer, subsequently taking the role of a lead architect.

Both authors are involved in a large scale international software project, which encompasses development of a Java-based merchandise management software solution. This decentralized multi-platform environment is based on more than 500 decentralized MySQL server instances with more than 5,500 replication slaves. Daniel and Udo are responsible for configuration and management of this infrastructure.
