|Read more about this book|
(For more resources on Search Engine, see here.)
Sphinx is a full-text search engine. So, before going any further, we need to understand what full-text search is and how it excels over the traditional searching.
What is full-text search?
Full-text search is one of the techniques for searching a document or database stored on a computer. While searching, the search engine goes through and examines all of the words stored in the document and tries to match the search query against those words. A complete examination of all the words (text) stored in the document is undertaken and hence it is called a full-text search.
Full-text search excels in searching large volumes of unstructured text quickly and effectively. It returns pages based on how well they match the user's query.
To understand the difference between a normal search and full-text search, let's take an example of a MySQL database table and perform searches on it.
It is assumed that MySQL Server and phpMyAdmin are already installed on your system.
Time for action – normal search in MySQL
- Open phpMyAdmin in your browser and create a new database called myblog.
- Select the myblog database:
- Create a table by executing the following query:
CREATE TABLE `posts` (
`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`title` VARCHAR( 255 ) NOT NULL ,
`description` TEXT NOT NULL ,
`created` DATETIME NOT NULL ,
`modified` DATETIME NOT NULL
) ENGINE = MYISAM;
Queries can be executed from the SQL page in phpMyAdmin. You can find the link to that page in the top menu.
- Populate the table with some records:
INSERT INTO `posts`(`id`, `title`, `description`, `created`,
(1, 'PHP scripting language', 'PHP is a web scripting language
originally created by Rasmus Lerdorf', NOW(), NOW()),
(2, 'Programming Languages', 'There are many languages available
to cater any kind of programming need', NOW(), NOW()),
(3, 'My Life', 'This post is about my life which in a sense is
beautiful', NOW(), NOW()),
(4, 'Life on Mars', 'Is there any life on mars?', NOW(), NOW());
- Next, run the following queries against the table:
SELECT * FROM posts WHERE title LIKE 'programming%';
The above query returns row 2.
SELECT * FROM posts WHERE description LIKE '%life%';
The above query return rows 3 and 4.
SELECT * FROM posts WHERE description LIKE '%scripting language%';
The above query returns row 1.
SELECT * FROM posts WHERE description LIKE '%beautiful%' OR
description LIKE '%programming%';
The above query returns rows 2 and 3.
To administer MySQL database, I highly recommend using a GUI interface tool like phpMyAdmin (http://www.phpmyadmin.net). All the above mentioned queries can easily be executed
What just happened?
We first created a table posts to hold some data. Each post has a title and a description. We then populated the table with some records.
With the first SELECT query we tried to find all posts where the title starts with the word programming. This correctly gave us the row number 2. But what if you want to search for the word anywhere in the field and not just at that start? For this we fired the second query, wherein we searched for the word life anywhere in the description of the post. Again this worked pretty well for us and as expected we got the result in the form of row numbers 3 and 4.
Now what if we wanted to search for multiple words? For this we fired the third query where we searched for the words scripting language. As row 1 has those words in its description, it was returned correctly.
Until now everything looked fine and we were able to perform searches without any hassle. The query gets complex when we want to search for multiple words and those words are not necessarily placed consecutively in a field, that is, side by side. One such example is shown in the form of our fourth query where we tried to search for the words programming and beautiful in the description of the posts. Since the number of words we need to search for increases, this query gets complicated, and moreover, slow in execution, since it needs to match each word individually.
The previous SELECT queries and their output also don't give us any information about the relevance of the search terms with the results found. Relevance can be defined as a measure of how closely the returned database records match the user's search query. In other words, how pertinent the result set is to the search query.
Relevance is very important in the search world because users want to see the items with highest relevance at the top of their search results. One of the major reasons for the success of Google is that their search results are always sorted by relevance.
MySQL full-text search
This is where full-text search comes to the rescue. MySQL has inbuilt support for full-text search and you only need to add FULLTEXT INDEX to the field against which you want to perform your search.
Continuing the earlier example of the posts table, let's add a full-text index to the description field of the table. Run the following query:
ALTER TABLE `posts` ADD FULLTEXT (
The query will add an INDEX of type FULLTEXT to the description field of the posts table.
Only MyISAM Engine in MySQL supports the full-text indexes.
Now to search for all the records which contain the words programming or beautiful anywhere in their description, the query would be:
SELECT * FROM posts WHERE
MATCH (description) AGAINST ('beautiful programming');
This query will return rows 2 and 3, and the returned results are sorted by relevance. One more thing to note is that this query takes less time than the earlier query, which used LIKE for matching.
By default, the MATCH() function performs a natural language search, it attempts to use natural language processing to understand the nature of the query and then search accordingly.
Full-text search in MySQL is a big topic in itself and we have only seen the tip of the iceberg. For a complete reference, please refer to the MySQL manual at http://dev.mysql.com/doc/.
Advantages of full-text search
The following points are some of the major advantages of full-text search:
- It is quicker than traditional searches as it benefits from an index of words that is used to look up records instead of doing a full table scan
- It gives results that can be sorted by relevance to the searched phrase or term, with sophisticated ranking capabilities to find the best documents or records
- It performs very well on huge databases with millions of records
- It skips the common words such as the, an, for, and so on
When to use a full-text search?
- When there is a high volume of free-form text data to be searched
- When there is a need for highly optimized search results
- When there is a demand for flexible search querying
|Read more about this book|
(For more resources on Search Engine, see here.)
Overview of Sphinx
Sphinx is an external solution for database search, which means that it runs outside the main database used for your application. It takes data from the database and creates indexes that are stored on a file system. These indexes are highly optimized for searching and your application uses an API to search the indexes.
Sphinx interacts with the database using a data source driver which comes along with Sphinx. You need to specify which data source driver should be used by Sphinx in its configuration file.
As shown at the beginning of this article, Sphinx is shipped with some binary programs which were installed at /usr/local/sphinx/bin directory. Let's take a look at two principal programs that are used by Sphinx for indexing and searching purposes.
- indexer: This program is used for indexing and re-indexing full-text indexes. By default, Sphinx reads the configuration file at /usr/local/sphinx/etc/sphinx.conf to know what and how to index.
- searchd: This is the daemon used for searching the indexes. It requires a client to access the Sphinx API. There are a number of searchd client API implementations available for Sphinx.
Enough talking about Sphinx, now let's see it in action…
Time for action – Sphinx in action
Let's see an example of how Sphinx works. We will create an index and then search it using the Sphinx command line utility as well as the PHP client implementation. So let's begin:
- Firstly, create a MySQL database named test, if it is not already there:
CREATE DATABASE test;Sphinx ships with a sample configuration file and a sample database table to be used for demo purposes. The SQL for the table is located at /usr/local/sphinx/etc/example.sql and it contains the following SQL:
You can copy the SQL and paste it in your phpMyAdmin interface to run the SQL or execute the following command to import the SQL from the command line in Linux:
DROP TABLE IF EXISTS test.documents;
CREATE TABLE test.documents
id INTEGER PRIMARY KEY NOT NULL AUTO_INCREMENT,
group_id INTEGER NOT NULL,
group_id2 INTEGER NOT NULL,
date_added DATETIME NOT NULL,
title VARCHAR(255) NOT NULL,
content TEXT NOT NULL
REPLACE INTO test.documents ( id, group_id, group_id2, date_added,
title, content ) VALUES
( 1, 1, 5, NOW(), 'test one', 'this is my test document number
one. also checking search within phrases.' ),
( 2, 1, 6, NOW(), 'test two', 'this is my test document number
( 3, 2, 7, NOW(), 'another doc', 'this is another group' ),
( 4, 2, 8, NOW(), 'doc number four', 'this is to test groups'
DROP TABLE IF EXISTS test.tags;
CREATE TABLE test.tags
docid INTEGER NOT NULL,
tagid INTEGER NOT NULL,
INSERT INTO test.tags VALUES
(1,1), (1,3), (1,5), (1,7),
(2,6), (2,4), (2,2),
$ mysql -u root < /usr/local/sphinx/etc/example.sql
- Next, create the configuration file (you may need the permissions to create the file):
$ cd /usr/local/sphinx/etcNow edit sphinx.conf in your favorite editor (you may need to change the permissions of the file to be able to modify it).
$ cp sphinx-min.conf.dist sphinx.conf
The first block of the file looks something like this:
type = mysql
sql_host = localhost
sql_user = test
sql_db = test
sql_port = 3306 # optional, default is 3306
sql_query = \
SELECT id, group_id, UNIX_TIMESTAMP(date_added)
AS date_added, title, content \
sql_attr_uint = group_id
sql_attr_timestamp = date_added
sql_query_info = SELECT * FROM documents WHERE id=$id
- Change the value of sql_host, sql_user, sql_pass and sql_db as per your system:
sql_host = localhostIf you have not installed Sphinx at /usr/local/sphinx then you will need to modify the paths of the following options as well:
sql_user = myuser
sql_pass = mypass
sql_db = test
- Now run the indexer:
$ /usr/local/sphinx/bin/indexer --allThis will give output as shown in the following screenshot:
If you have installed Sphinx at a location other than /usr/local/sphinx, then you need to use the -c /path/to/sphinx.conf option in the previous command.
- Next, let's query the index to see if it works:
$ /usr/local/sphinx/bin/search test
To query the index from our PHP scripts, we first need to start the searchd daemon:
To run searchd commands, you need to be the root user. You can either switch to root user using the su - command, or you could prefix all searchd commands with sudo.
- Now, go to the directory where you extracted the Sphinx tarball during installation and run the command as shown here:
$ cd /path/to/sphinx-0.9.9The command will output the search results, which confirms that searchd is working properly and we can search from our applications using the client API.
$ php api/test.php test
What just happened?
We created an index from the data stored in a MySQL table. We then used Sphinx's search utility to search for the test term in the index. The results showed that Sphinx is working properly and that the index we created was fine.
The major difference between search results by MySQL and Sphinx is that Sphinx does not return the actual data but only the document id. Using these document IDs, we need to fetch the actual data (from its source) to display it. Along with the document id, Sphinx also returns all the attributes and weight of each document. The higher the weight, the higher the relevance of that document with the search query.
We then used the PHP implementation of the Sphinx Client API to search for the same test term, but this time from within a PHP script.
Data to be indexed
The first thing we did was to create a MySQL database and then import the sample data into it. This gave us the data as shown in the following screenshot:
Creating the Sphinx configuration file
Sphinx creates an index based on the options defined in the Sphinx configuration file sphinx.conf. This file is divided into different sections:
- source: This section holds all the settings related to the source of the data to be indexed, which in our case is a MySQL database.
- index: This section holds options which tell Sphinx where and how to save the index. These options are used during indexing-time.
- indexer: This section holds options for the indexer program.
- searchd: This section holds the options used while searching the index.
In this article we will not go into great detail about all the options used in the configuration file. However, a few options to look for are:
- sql_*: These options are there to tell Sphinx about different MySQL settings; such as username, password, the database to use, and the port to use.
- sql_query: This option holds the query that will be fired in order to get the data from the MySQL database.
Once the configuration file is ready, the index can be created by issuing the following command:
$ /usr/local/sphinx/bin/indexer -all
During the indexing operation, some information is displayed in the console such as what configuration file is being used by the indexer, how many documents were found, how much time it took to index, and other related information.
To run indexer commands, you need to be the root user. You can either switch to root user using the su - command, or you could prefix all indexer commands with sudo.
Searching the index
Sphinx provides a command-line utility search which comes in handy to quickly query the index that we created earlier. However, this utility should only be used for testing purposes. In the production environment one should always use the searchd and its client API implementation.
$ /usr/local/sphinx/bin/search test
The output of the search command gives us the results that matched the search term test. The result shows us the document id and weight, amongst other information for the queried term.
Similar information is displayed when we use the PHP client API to search.
Have a go hero
We created a very basic example to see how Sphinx works; however, you can extend and explore this by:
- Adding a few more records to the documents table
- Re-indexing the documents table
- Searching with different search phrases and examining the returned results and their weights
Why use Sphinx for full-text searching?
If you're looking for a good Database Management System (DBMS), there are plenty of options available with support for full-text indexing and searches, such as MySQL, PostgreSQL, and SQL Server. There are also external full-text search engines, such as Lucene and Solr. Let's see the advantages of using Sphinx over the DBMS's full-text searching capabilities and other external search engines:
- It has a higher indexing speed. It is 50 to 100 times faster than MySQL FULLTEXT and 4 to 10 times faster than other external search engines.
- It also has higher searching speed since it depends heavily on the mode, Boolean vs. phrase, and additional processing. It is up to 500 times faster than MySQL FULLTEXT in cases involving a large result set with GROUP BY. It is more than two times faster in searching than other external search engines available.
- As mentioned earlier, relevancy is among the key features one expects when using a search engine, and Sphinx performs very well in this area. It has phrase-based ranking in addition to classic statistical BM25 ranking.
- Last but not the least, Sphinx has better scalability. It can be scaled vertically (utilizing many CPUs, many HDDs) or horizontally (utilizing many servers), and this comes out of the box with Sphinx. One of the biggest known Sphinx cluster has over 3 billion records with more than 2 terabytes of size.
In one of his presentations, Andrew Aksyonoff (creator of Sphinx) presented the following benchmarking results. Approximately 3.5 Million records with around 5 GB of text were used for the purpose.
|Indexing time, min||1627||176||84|
|Index size, MB||3011||6328||2850|
|Match all, ms/q||286||30||22|
|Match phrase, ms/q||3692||29||21|
|Match bool top-20, ms/q||24||29||13|
Apart from a basic search, there are many features that make Sphinx a better solution for searching. These features include multivalve attributes, tokenizing settings, wordforms, HTML processing, geosearching, ranking, and many others.
In this article:
- We learned how to check whether Sphinx was installed properly or not. We saw the directory structure Sphinx creates to store its binary files, configuration files, and other data.
- We then learned what full-text search is and what its advantages over normal search are. We also saw how full-text search has been implemented in MySQL with an example. We saw the syntax of an SQL query used to search a full-text indexed field in MySQL.
- We have also seen why to use Sphinx, an external search engine, instead of database's native full-text support. We saw how Sphinx excels in many ways, and outperforms most of the databases and external search engines available today.
- Lastly we saw how to create an index using the indexer utility, and then how to search that index from the command line as well as other applications using client API implementations.
- Sphinx: Index Searching [Article]
- Search Engine Optimization in Joomla! [Article]
- Blogger: Improving Your Blog with Google Analytics and Search Engine Optimization [Article]
- Drupal 6 Search Engine Optimization [Book]