MySQL 5.1 Plugin: HTML Storage Engine—Reads and Writes

Exclusive offer: get 50% off this eBook here
MySQL 5.1 Plugin Development

MySQL 5.1 Plugin Development — Save 50%

Extend MySQL to suit your needs with this unique guide into the world of MySQL plugins

$26.99    $13.50
by Sergei Golubchik | August 2010 | MySQL Open Source Web Development

In this article by Sergei Golubchik, co-author of MySQL 5.1 Plugin Development, we will consider a complex example, a complete read-write storage engine, but with no support for indexes. Let's say, it will be an "html" engine—an engine that stores tables in HTML files. Such a file can be later loaded in a web browser and the table will be shown as an HTML table. Thus Reads and Writes presents a storage engine plugin that keeps table data in HTML tables and uses it to explain how to implement an updatable data stores.

(For more resources on MySQL, see here.)

An idea of the HTML engine

Ever thought about what your tables might look like? Why not represent a table as a <TABLE>? You would be able to see it, visually, in any browser. Sounds cool. But how could we make it work?

We want a simple engine, not an all-purpose Swiss Army Knife HTML-to-SQL converter, which means we will not need any existing universal HTML or XML parsers, but can rely on a fixed file format. For example, something like this:

<html><head><title>t1</title></head><body><table border=1>
<tr><th>col1</th><th>other col</th><th>more cols</th></tr>
<tr><td>data</td><td>more data</td><td>more data</td></tr>
<!-- this row was deleted ... -->
<tr><td>data</td><td>more data</td><td>more data</td></tr>
... and so on ...
</table></body></html>

But even then this engine is way more complex than the previous example, and it makes sense to split the code. The engine could stay, as usual, in the ha_html.cc file, the declarations in ha_html.h, and if we need any utility functions to work with HTML we can put them in the htmlutils.cc file.

Flashback

A storage engine needs to declare a plugin and an initialization function that fills a handlerton structure. Again, the only handlerton method that we need here is a create() method.

#include "ha_html.h"
static handler* html_create_handler(handlerton *hton,
TABLE_SHARE *table, MEM_ROOT *mem_root)
{
return new (mem_root) ha_html(hton, table);
}
static int html_init(void *p)
{
handlerton *html_hton = (handlerton *)p;
html_hton->create = html_create_handler;
return 0;
}
struct st_mysql_storage_engine html_storage_engine =
{ MYSQL_HANDLERTON_INTERFACE_VERSION };
mysql_declare_plugin(html)
{
MYSQL_STORAGE_ENGINE_PLUGIN,
&html_storage_engine,
"HTML",
"Sergei Golubchik",
"An example HTML storage engine",
PLUGIN_LICENSE_GPL,
html_init,
NULL,
0x0001,
NULL,
NULL,
NULL
}
mysql_declare_plugin_end;

Now we need to implement all of the required handler class methods. Let's start with simple ones:

const char *ha_html::table_type() const
{
return "HTML";
}
const char **ha_html::bas_ext() const
{
static const char *exts[] = { ".html", 0 };
return exts;
}
ulong ha_html::index_flags(uint inx, uint part, bool all_parts) const
{
return 0;
}
ulonglong ha_html::table_flags() const
{
return HA_NO_TRANSACTIONS | HA_REC_NOT_IN_SEQ | HA_NO_BLOBS;
}
THR_LOCK_DATA **ha_html::store_lock(THD *thd,
THR_LOCK_DATA **to, enum thr_lock_type lock_type)
{
if (lock_type != TL_IGNORE && lock.type == TL_UNLOCK)
lock.type = lock_type;
*to ++= &lock;
return to;
}

These methods are familiar to us. They say that the engine is called "HTML", it stores the table data in files with the .html extension, the tables are not transactional, the position for ha_html::rnd_pos() is obtained by calling ha_html::position(), and that it does not support BLOBs. Also, we need a function to create and initialize an HTML_SHARE structure:

static HTML_SHARE *find_or_create_share(
const char *table_name, TABLE *table)
{
HTML_SHARE *share;
for (share = (HTML_SHARE*)table->s->ha_data;
share; share = share->next)
if (my_strcasecmp(table_alias_charset,
table_name, share->name) == 0)
return share;
share = (HTML_SHARE*)alloc_root(&table->s->mem_root,
sizeof(*share));
bzero(share, sizeof(*share));
share->name = strdup_root(&table->s->mem_root, table_name);
share->next = (HTML_SHARE*)table->s->ha_data;
table->s->ha_data = share;
return share;
}

It is exactly the same function, only the structure is now called HTML_SHARE, not STATIC_SHARE.

Creating, opening, and closing the table

Having done the basics, we can start working with the tables. The first operation, of course, is the table creation. To be able to read, update, or even open the table we need to create it first, right? Now, the table is just an HTML file and to create a table we only need to create an HTML file with our header and footer, but with no data between them. We do not need to create any TABLE or Field objects, or anything else—MySQL does it automatically. To avoid repeating the same HTML tags over and over we will define the header and the footer in the ha_html.h file as follows:

#define HEADER1 "<html><head><title>"
#define HEADER2 "</title></head><body><table border=1>\n"
#define FOOTER "</table></body></html>"
#define FOOTER_LEN ((int)(sizeof(FOOTER)-1))

As we want a header to include a table name we have split it in two parts. Now, we can create our table:

int ha_html::create(const char *name, TABLE *table_arg,
HA_CREATE_INFO *create_info)
{
char buf[FN_REFLEN+10];
strcpy(buf, name);
strcat(buf, *bas_ext());

We start by generating a filename. The "table name" that the storage engine gets is not the original table name, it is converted to be a safe filename. All "troublesome" characters are encoded, and the database name is included and separated from the table name with a slash. It means we can safely use name as the filename and all we need to do is to append an extension. Having the filename, we open it and write our data:

FILE *f = fopen(buf, "w");
if (f == 0)
return errno;
fprintf(f, HEADER1);
write_html(f, table_arg->s->table_name.str);
fprintf(f, HEADER2 "<tr>");

First, we write the header and the table name. Note that we did not write the value of the name argument into the header, but took the table name from the TABLE_SHARE structure (as table_arg->s->table_name.str), because name is mangled to be a safe filename, and we would like to see the original table name in the HTML page title. Also, we did not just write it into the file, we used a write_html() function—this is our utility method that performs the necessary entity encoding to get a well-formed HTML. But let's not think about it too much now, just remember that we need to write it, it can be done later.

Now, we iterate over all fields and write their names wrapped in <th>...</th> tags. Again, we rely on our write_html() function here:

for (uint i = 0; i < table_arg->s->fields; i++) {
fprintf(f, "<th>");
write_html(f, table_arg->field[i]->field_name);
fprintf(f, "</th>");
}
fprintf(f, "</tr>");
fprintf(f, FOOTER);
fclose(f);
return 0;
}

Done, an empty table is created.

Opening it is easy too. We generate the filename and open the file just as in the create() method. The only difference is that we need to remember the FILE pointer to be able to read the data later, and we store it in fhtml, which has to be a member of the ha_html object:

int ha_html::open(const char *name, int mode,
uint test_if_locked)
{
char buf[FN_REFLEN+10];
strcpy(buf, name);
strcat(buf, *bas_ext());
fhtml = fopen(buf, "r+");
if (fhtml == 0)
return errno;

When parsing an HTML file we will often need to skip over known patterns in the text. Instead of using a special library or a custom pattern parser for that, let's try to use scanf()—it exists everywhere, has a built-in pattern matching language, and it is powerful enough for our purposes. For convenience, we will wrap it in a skip_html() function that takes a scanf() format and returns the number of bytes skipped. Assuming we have such a function, we can finish opening the table:

skip_html(fhtml, HEADER1 "%*[^<]" HEADER2 "<tr>");
for (uint i = 0; i < table->s->fields; i++) {
skip_html(fhtml, "<th>%*[^<]</th>");
}
skip_html(fhtml, "</tr>");
data_start = ftell(fhtml);

We skip the first part of the header, then "everything up to the opening angle bracket", which eats up the table name, and the second part of the header. Then we skip individual row headers in a loop and the end of row </tr> tag. In order not to repeat this parsing again we remember the offset where the row data starts. At the end we allocate an HTML_SHARE and initialize lock objects:

share = find_or_create_share(name, table);
if (share->use_count++ == 0)
thr_lock_init(&share->lock);
thr_lock_data_init(&share->lock,&lock,NULL);
return 0;
}

Closing the table is simple, and should not come as a surprise to us:

int ha_html::close(void)
{
fclose(fhtml);
if (--share->use_count == 0)
thr_lock_delete(&share->lock);
return 0;
}

MySQL 5.1 Plugin Development Extend MySQL to suit your needs with this unique guide into the world of MySQL plugins
Published: August 2010
eBook Price: $26.99
Book Price: $44.99
See more
Select your format and quantity:

Read more about this book

(For more resources on MySQL, see here.)

Reading data

An indexless table can be read using two access patterns—either sequential, with rnd_init() and rnd_next() methods, or a random one with position() and rnd_pos() methods. Let's start with the former:

int ha_html::rnd_next(unsigned char *buf)
{
fseek(fhtml, current_row_end, SEEK_SET);
for (;;) {

This is one of the most complex methods in our storage engine. Let's analyze it line by line. We started by positioning the stream at the end of the last read row, and will be looking for the first non-deleted row (remember that a row starts with a <tr> tag, while a deleted row starts with <!--).

But first we save the offset of the row that we will read, as it may be needed for position(), and check it against the end of data offset:

current_row_start= ftell(fhtml);
if (current_row_start >= data_end)
return HA_ERR_END_OF_FILE;

This check allows us to skip any processing and return an error at once if we have read all of the rows. It is a minor optimization for a read-only table, but for an updatable table it is of paramount importance. Imagine that we are doing a table scan and the same SQL statement adds new data at the end of the table (which can easily happen in an UPDATE statement). Our table scan should ignore all of the rows that were added in the same statement, and if we remember where the row data ends in the file, we can stop reading at that offset. For this to work, data_end needs to be recalculated at the beginning of every statement. Indeed, as the table may be growing, it is not enough to do it in the open() method. We will talk about statement boundaries later. On the other hand, data_start can never change, we can determine it only once.

if (skip_html(fhtml, "<tr>") == 4)
break;

Now, we can start reading the data. First, we read and skip over the <tr> tag. If it was successful, and we have skipped over four characters, we have found our row, and can break out of the loop.

if (skip_html(fhtml, "!--%*[^-]-->\n") >=7)
continue;

Otherwise we try to match a deleted row. As the opening angle bracket was already matched and consumed by the previous skip_html() we omit it from the pattern. If there was a match, we restart the loop.

return HA_ERR_CRASHED;
}

Otherwise we return an error, complaining that the table is corrupted.

my_bitmap_map *old_map =
dbug_tmp_use_all_columns(table, table->write_set);
my_ptrdiff_t offset = (my_ptrdiff_t)(buf-table->record[0]);
for (uint i = 0; i < table->s->fields; i++) {
Field *field = table->field[i];
field->move_field_offset(offset);

This loop over all fields and the two assignments before it are the same.

if (skip_html(fhtml, "<%*[td/]>") == 5)
field->set_null();
else {

Here we skip "an angle bracket, any sequence of characters t, d, /, and a closing angle bracket". If we have skipped five characters, it was a <td/> tag, which stands for a NULL value, otherwise it was an opening <td> tag, and we need to read the value of the field:

char *buf = (char*)
malloc(field->max_display_length() + 1);
int buf_len = read_html(fhtml, buf);
field->set_notnull();
field->store(buf, buf_len, &my_charset_bin);
skip_html(fhtml, "</td>");
free(buf);

To read the value we allocate a buffer, big enough to hold the string representation of the field value, and read the value using our read_html() function—a counterpart of write_html() that converts HTML entities back into characters. Then we mark the field as holding a NOT NULL value, and store the string value in the field. The field—an object of the Field class—automatically performs all of the necessary conversion from a string to a number, a date, or whatever the field type is. Then we skip the closing <\td> tag and free the buffer.

An inquisitive reader may have noticed that calling malloc() and free() for every field in every row during a table scan does not indicate performance-conscious programming. He would be right—it would be better to allocate the buffer once, big enough to hold a value of any field. It could have been done in the ha_html::open() method, because field lengths cannot change after a table is opened.

}
field->move_field_offset(-offset);
}
skip_html(fhtml, "</tr>\n");
dbug_tmp_restore_column_map(table->write_set, old_map);
current_row_end = ftell(fhtml);
return 0;
}

We finish reading a row by restoring field offsets and a write_set bitmap, skipping the closing </tr> tag, and remembering the offset where the row ended.

int ha_html::rnd_init(bool scan)
{
current_row_start = 0;
current_row_end = data_start;
return 0;
}

In this function, we need to prepare for a sequential table scan. All we need to do is to initialize current_row_start and current_row_end. As we start reading a new row from the current_row_end offset, we should set it here to the data_start offset, that is to the beginning of the very first row. Additionally, we reset current_row_start to indicate that no row has been read yet.

void ha_html::position(const uchar *record)
{
*(ulong*)ref = current_row_start;
}

This method is very simple. A unique row identifier, in our case, is the file offset to the row data. That is, all we need to do here is to store the offset of the current row at the ref pointer.

int ha_html::rnd_pos(uchar * buf, uchar *pos)
{
memcpy(&current_row_end, pos, sizeof(current_row_end));
return rnd_next(buf);
}

Reading a row at the given position is easy too. We only restore the position into current_row_end and let our rnd_next() method to do the rest of the job.

Updating the table

There are three primary methods that modify the table data. They are write_row(), delete_row(), and update_row(), which are used by the INSERT, DELETE, and UPDATE statements accordingly. In our engine, write_row() is the most complex one.

int ha_html::write_row(uchar *buf)
{
if (table->timestamp_field_type &
TIMESTAMP_AUTO_SET_ON_INSERT)
table->timestamp_field->set_time();
if (table->next_number_field && buf == table->record[0]) {
int error;
if ((error= update_auto_increment()))
return error;
}

Almost every engine's write_row() method starts with these lines. They update the values of the TIMESTAMP and AUTO_INCREMENT fields, if necessary. Strictly speaking, the second—AUTO_INCREMENT—block is not needed here. Our engine does not support indexes, that is, it can never have an AUTO_INCREMENT field.

fseek(fhtml, -FOOTER_LEN, SEEK_END);
fprintf(fhtml, "<tr>");

We write a new row at the end of the file. That is, we position the stream at the end, just before the footer, and start a new row with an opening <tr> tag.

my_bitmap_map *old_map =
dbug_tmp_use_all_columns(table, table->read_set);
my_ptrdiff_t offset = (my_ptrdiff_t)(buf-table->record[0]);
for (uint i = 0; i < table->s->fields; i++) {
Field *field = table->field[i];
field->move_field_offset(offset);

Now we iterate over the fields in a loop very similar to the one in rnd_next(), only it modifies the read_set bitmap, not the write_set.

if (field->is_null())
fprintf(fhtml, "<td/>");

If the field value is NULL we write it down as <td/> to be able to distinguish it from the empty string (which is written as <td><td/>), otherwise we take the field value and write it to the file:

else {
char tmp_buf[1024];
String tmp(tmp_buf, sizeof(tmp_buf), &my_charset_bin);
String *val = field->val_str(&tmp, &tmp);

The val_* family of Field methods—val_str(), val_int(), val_real(), and val_decimal() return the value of the Field, converted to a corresponding type. To store it in the HTML file, obviously, we want the value converted to a string, that is, we need to use val_str(). This method takes two arguments, pointers to String objects. The Field::val_str() method may use either the first or a second, depending on whether it needs a memory buffer or not. However, as a caller we can simply pass the same String in both arguments.

The class String is a utility class for working, well, with strings. An object of the String class represents a string as a pointer, string length in bytes, and string character set. It knows whether the string was allocated or not. It can reallocate it as the string grows and as expected, it will free the allocated memory on destruction. What is important for us is that it can start off from a fixed buffer and allocate the memory automatically if the string value does not fit. It means that if we create a buffer on the stack we can hope to avoid memory allocations in most cases. Indeed, the above declarations with tmp_buf and String tmp show a typical String usage pattern that can be seen everywhere in the MySQL code.

fprintf(fhtml, "<td>");
write_html(fhtml, val->c_ptr());
fprintf(fhtml, "</td>");

Here we have used the String::c_ptr() method. It returns the string value as a zero terminated string, appropriate for passing to the fprintf() function.

}
field->move_field_offset(-offset);
}
dbug_tmp_restore_column_map(table->read_set, old_map);
if (fprintf(fhtml, "</tr>\n" FOOTER) < 6 + FOOTER_LEN)
return errno;
else
return 0;
}

Having written all of the fields, we restore field offsets and read_set, write the closing </tr> tag and the file footer, and return. Just in case the disk was full we verify that the footer was written in whole.

int ha_html::delete_row(const uchar *buf)
{
assert(current_row_start);
fseek(fhtml, current_row_start, SEEK_SET);
fprintf(fhtml, "<!--");
fseek(fhtml, current_row_end-4, SEEK_SET);
fprintf(fhtml, "-->\n");
return 0;
}

Compared to write_row(), our delete_row() method is much simpler. MySQL can only call delete_row() for the current row. And because we know where it starts—at the current_row_start offset, we can even assert this fact—and we know where it ends, we can easily comment the complete row out, "deleting" it from the HTML table.

int ha_html::update_row(const uchar *old_data,
uchar *new_data)
{
assert(current_row_start);
if (table->timestamp_field_type &
TIMESTAMP_AUTO_SET_ON_UPDATE)
table->timestamp_field->set_time();
delete_row(old_data);
table->timestamp_field_type = (timestamp_auto_set_type)
(table->timestamp_field_type &
~TIMESTAMP_AUTO_SET_ON_INSERT);
return write_row(new_data);
}

The update_row() method is simple too. Just like write_row() it starts by updating the TIMESTAMP field, but after that we can simply delete the old row and insert the new one at the end of the table. We just need to remember to remove the TIMESTAMP_AUTO_SET_ON_INSERT bit to avoid unnecessary TIMESTAMP updates; MySQL will restore it for the next statement automatically. Because we have remembered the original "end of table" offset at the beginning of the statement, we do not need to worry that our table scan will read newly inserted data.

Summary

In this article, we have learned the methods needed to create a storage engine with a full read/write support.


Further resources on this subject:


MySQL 5.1 Plugin Development Extend MySQL to suit your needs with this unique guide into the world of MySQL plugins
Published: August 2010
eBook Price: $26.99
Book Price: $44.99
See more
Select your format and quantity:

About the Author :


Sergei Golubchik

Sergei Golubchik started on modifying MySQL source code in 1998, and has continued doing it as a MySQL AB employee since 2000. Working professionally with the MySQL sources he had a chance to get to know and extend almost every part of the server code – from the SQL core to the utility functions. He was one of the primary architects of the Plugin API. After working for ten years in the ever-growing MySQL AB, and later in Sun Microsystems as a Principal Software Developer, he resigned to join a small startup company that works on MariaDB – an extended version of the MySQL server, where he continues to do what he likes the most – hack on MySQL, architecting and developing the MySQL/MariaDB Plugin API, making it even more powerful, safer, and easier to use.
He works and lives in Germany, near Cologne, with his lovely wife and two kids.

Books From Packt


MySQL Admin Cookbook
MySQL Admin Cookbook

Mastering phpMyAdmin 3.1 for Effective MySQL Management
Mastering phpMyAdmin 3.1 for Effective MySQL Management

High Availability MySQL Cookbook
High Availability MySQL Cookbook

MySQL for Python: Database Access Made Easy
MySQL for Python: Database Access Made Easy

Creating your MySQL Database: Practical Design Tips and Techniques
Creating your MySQL Database: Practical Design Tips and Techniques

Alfresco 3 Web Content Management
Alfresco 3 Web Content Management

WordPress Top Plugins
WordPress Top Plugins

TYPO3 4.3 Multimedia Cookbook
TYPO3 4.3 Multimedia Cookbook


No votes yet

Post new comment

CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
w
h
Y
t
h
x
Enter the code without spaces and pay attention to upper/lower case.
Code Download and Errata
Packt Anytime, Anywhere
Register Books
Print Upgrades
eBook Downloads
Video Support
Contact Us
Awards Voting Nominations Previous Winners
Judges Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software
Resources
Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software