Getting Started with Sphinx Search

Exclusive offer: get 50% off this eBook here
Sphinx Search Beginner's Guide

Sphinx Search Beginner's Guide — Save 50%

Implement full-text search with lightning speed and accuracy using Sphinx

$23.99    $12.00
by Abbas Ali | March 2011 | Open Source

Sphinx is a full-text search engine (generally standalone) which provides fast, relevant, efficient full-text search functionality to third-party applications. It was especially created to facilitate searches on SQL databases and integrates very well with scripting languages; such as PHP, Python, Perl, Ruby, and Java. This article demonstrates some basic usage of Sphinx in order to test its installation. It also discusses full-text search and gives the reader an overview of Sphinx.

In this article by Abbas Ali, author of Sphinx Search Beginner's Guide, we will take a dive into full-text search and look at different advantages of it. We will then see how Sphinx utilizes full-text search and also learn about indexer, search and searchd utilities that come along with Sphinx. We will also see a very basic example of how Sphinx works.

 

Sphinx Search Beginner's Guide

Sphinx Search Beginner's Guide

Implement full-text search with lightning speed and accuracy using Sphinx

        Read more about this book      

(For more resources on Search Engine, see here.)

Full-text search

Sphinx is a full-text search engine. So, before going any further, we need to understand what full-text search is and how it excels over the traditional searching.

What is full-text search?

Full-text search is one of the techniques for searching a document or database stored on a computer. While searching, the search engine goes through and examines all of the words stored in the document and tries to match the search query against those words. A complete examination of all the words (text) stored in the document is undertaken and hence it is called a full-text search.

Full-text search excels in searching large volumes of unstructured text quickly and effectively. It returns pages based on how well they match the user's query.

Traditional search

To understand the difference between a normal search and full-text search, let's take an example of a MySQL database table and perform searches on it.

It is assumed that MySQL Server and phpMyAdmin are already installed on your system.

Time for action – normal search in MySQL

  1. Open phpMyAdmin in your browser and create a new database called myblog.

    Sphinx Search Beginner's Guide

  2. Select the myblog database:

    Sphinx Search Beginner's Guide

  3. Create a table by executing the following query:
    CREATE TABLE `posts` (
    `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
    `title` VARCHAR( 255 ) NOT NULL ,
    `description` TEXT NOT NULL ,
    `created` DATETIME NOT NULL ,
    `modified` DATETIME NOT NULL
    ) ENGINE = MYISAM;

    Queries can be executed from the SQL page in phpMyAdmin. You can find the link to that page in the top menu.

  4. Populate the table with some records:
    INSERT INTO `posts`(`id`, `title`, `description`, `created`,
    `modified`) VALUES
    (1, 'PHP scripting language', 'PHP is a web scripting language
    originally created by Rasmus Lerdorf', NOW(), NOW()),
    (2, 'Programming Languages', 'There are many languages available
    to cater any kind of programming need', NOW(), NOW()),
    (3, 'My Life', 'This post is about my life which in a sense is
    beautiful', NOW(), NOW()),
    (4, 'Life on Mars', 'Is there any life on mars?', NOW(), NOW());
  5. Next, run the following queries against the table:
    SELECT * FROM posts WHERE title LIKE 'programming%';

    Sphinx Search Beginner's Guide

    The above query returns row 2.

    SELECT * FROM posts WHERE description LIKE '%life%';

    Sphinx Search Beginner's Guide

    The above query return rows 3 and 4.

    SELECT * FROM posts WHERE description LIKE '%scripting language%';

    Sphinx Search Beginner's Guide

    The above query returns row 1.

    SELECT * FROM posts WHERE description LIKE '%beautiful%' OR
    description LIKE '%programming%';

    Sphinx Search Beginner's Guide

    The above query returns rows 2 and 3.

phpMyAdmin
To administer MySQL database, I highly recommend using a GUI interface tool like phpMyAdmin (http://www.phpmyadmin.net). All the above mentioned queries can easily be executed

What just happened?

We first created a table posts to hold some data. Each post has a title and a description. We then populated the table with some records.

With the first SELECT query we tried to find all posts where the title starts with the word programming. This correctly gave us the row number 2. But what if you want to search for the word anywhere in the field and not just at that start? For this we fired the second query, wherein we searched for the word life anywhere in the description of the post. Again this worked pretty well for us and as expected we got the result in the form of row numbers 3 and 4.

Now what if we wanted to search for multiple words? For this we fired the third query where we searched for the words scripting language. As row 1 has those words in its description, it was returned correctly.

Until now everything looked fine and we were able to perform searches without any hassle. The query gets complex when we want to search for multiple words and those words are not necessarily placed consecutively in a field, that is, side by side. One such example is shown in the form of our fourth query where we tried to search for the words programming and beautiful in the description of the posts. Since the number of words we need to search for increases, this query gets complicated, and moreover, slow in execution, since it needs to match each word individually.

The previous SELECT queries and their output also don't give us any information about the relevance of the search terms with the results found. Relevance can be defined as a measure of how closely the returned database records match the user's search query. In other words, how pertinent the result set is to the search query.

Relevance is very important in the search world because users want to see the items with highest relevance at the top of their search results. One of the major reasons for the success of Google is that their search results are always sorted by relevance.

MySQL full-text search

This is where full-text search comes to the rescue. MySQL has inbuilt support for full-text search and you only need to add FULLTEXT INDEX to the field against which you want to perform your search.

Continuing the earlier example of the posts table, let's add a full-text index to the description field of the table. Run the following query:

ALTER TABLE `posts` ADD FULLTEXT (
`description`
);

The query will add an INDEX of type FULLTEXT to the description field of the posts table.

Only MyISAM Engine in MySQL supports the full-text indexes.

Now to search for all the records which contain the words programming or beautiful anywhere in their description, the query would be:

SELECT * FROM posts WHERE
MATCH (description) AGAINST ('beautiful programming');

Sphinx Search Beginner's Guide

This query will return rows 2 and 3, and the returned results are sorted by relevance. One more thing to note is that this query takes less time than the earlier query, which used LIKE for matching.

By default, the MATCH() function performs a natural language search, it attempts to use natural language processing to understand the nature of the query and then search accordingly.

Full-text search in MySQL is a big topic in itself and we have only seen the tip of the iceberg. For a complete reference, please refer to the MySQL manual at http://dev.mysql.com/doc/.

Advantages of full-text search

The following points are some of the major advantages of full-text search:

  • It is quicker than traditional searches as it benefits from an index of words that is used to look up records instead of doing a full table scan
  • It gives results that can be sorted by relevance to the searched phrase or term, with sophisticated ranking capabilities to find the best documents or records
  • It performs very well on huge databases with millions of records
  • It skips the common words such as the, an, for, and so on

When to use a full-text search?

  • When there is a high volume of free-form text data to be searched
  • When there is a need for highly optimized search results
  • When there is a demand for flexible search querying

 

Sphinx Search Beginner's Guide Implement full-text search with lightning speed and accuracy using Sphinx
Published: March 2011
eBook Price: $23.99
Book Price: $39.99
See more
Select your format and quantity:

 

        Read more about this book      

(For more resources on Search Engine, see here.)

Overview of Sphinx

Sphinx is an external solution for database search, which means that it runs outside the main database used for your application. It takes data from the database and creates indexes that are stored on a file system. These indexes are highly optimized for searching and your application uses an API to search the indexes.

Sphinx interacts with the database using a data source driver which comes along with Sphinx. You need to specify which data source driver should be used by Sphinx in its configuration file.

Primary programs

As shown at the beginning of this article, Sphinx is shipped with some binary programs which were installed at /usr/local/sphinx/bin directory. Let's take a look at two principal programs that are used by Sphinx for indexing and searching purposes.

  • indexer: This program is used for indexing and re-indexing full-text indexes. By default, Sphinx reads the configuration file at /usr/local/sphinx/etc/sphinx.conf to know what and how to index.
  • searchd: This is the daemon used for searching the indexes. It requires a client to access the Sphinx API. There are a number of searchd client API implementations available for Sphinx.

Enough talking about Sphinx, now let's see it in action…

Time for action – Sphinx in action

Let's see an example of how Sphinx works. We will create an index and then search it using the Sphinx command line utility as well as the PHP client implementation. So let's begin:

  1. Firstly, create a MySQL database named test, if it is not already there:
    CREATE DATABASE test;

    Sphinx ships with a sample configuration file and a sample database table to be used for demo purposes. The SQL for the table is located at /usr/local/sphinx/etc/example.sql and it contains the following SQL:

    DROP TABLE IF EXISTS test.documents;

    CREATE TABLE test.documents
    (
    id INTEGER PRIMARY KEY NOT NULL AUTO_INCREMENT,
    group_id INTEGER NOT NULL,
    group_id2 INTEGER NOT NULL,
    date_added DATETIME NOT NULL,
    title VARCHAR(255) NOT NULL,
    content TEXT NOT NULL
    );

    REPLACE INTO test.documents ( id, group_id, group_id2, date_added,
    title, content ) VALUES
    ( 1, 1, 5, NOW(), 'test one', 'this is my test document number
    one. also checking search within phrases.' ),
    ( 2, 1, 6, NOW(), 'test two', 'this is my test document number
    two' ),
    ( 3, 2, 7, NOW(), 'another doc', 'this is another group' ),
    ( 4, 2, 8, NOW(), 'doc number four', 'this is to test groups'
    );

    DROP TABLE IF EXISTS test.tags;

    CREATE TABLE test.tags
    (
    docid INTEGER NOT NULL,
    tagid INTEGER NOT NULL,
    UNIQUE(docid,tagid)
    );

    INSERT INTO test.tags VALUES
    (1,1), (1,3), (1,5), (1,7),
    (2,6), (2,4), (2,2),
    (3,15),
    (4,7), (4,40);

    You can copy the SQL and paste it in your phpMyAdmin interface to run the SQL or execute the following command to import the SQL from the command line in Linux:

    $ mysql -u root < /usr/local/sphinx/etc/example.sql
  2. Next, create the configuration file (you may need the permissions to create the file):
    $ cd /usr/local/sphinx/etc
    $ cp sphinx-min.conf.dist sphinx.conf

    Now edit sphinx.conf in your favorite editor (you may need to change the permissions of the file to be able to modify it).
    The first block of the file looks something like this:

    source src1
    {
    type = mysql

    sql_host = localhost
    sql_user = test
    sql_pass =
    sql_db = test
    sql_port = 3306 # optional, default is 3306

    sql_query = \
    SELECT id, group_id, UNIX_TIMESTAMP(date_added)
    AS date_added, title, content \
    FROM documents

    sql_attr_uint = group_id
    sql_attr_timestamp = date_added

    sql_query_info = SELECT * FROM documents WHERE id=$id
    }

  3. Change the value of sql_host, sql_user, sql_pass and sql_db as per your system:
    sql_host = localhost
    sql_user = myuser
    sql_pass = mypass
    sql_db = test

    If you have not installed Sphinx at /usr/local/sphinx then you will need to modify the paths of the following options as well:

    • path
    • log
    • query_log
    • pid_file
  4. Now run the indexer:
    $ /usr/local/sphinx/bin/indexer --all

    This will give output as shown in the following screenshot:

    Sphinx Search Beginner's Guide

    If you have installed Sphinx at a location other than /usr/local/sphinx, then you need to use the -c /path/to/sphinx.conf option in the previous command.

  5. Next, let's query the index to see if it works:
    $ /usr/local/sphinx/bin/search test

    Sphinx Search Beginner's Guide

    To query the index from our PHP scripts, we first need to start the searchd daemon:

    $ /usr/local/sphinx/bin/searchd

    Sphinx Search Beginner's Guide

    To run searchd commands, you need to be the root user. You can either switch to root user using the su - command, or you could prefix all searchd commands with sudo.

  6. Now, go to the directory where you extracted the Sphinx tarball during installation and run the command as shown here:
    $ cd /path/to/sphinx-0.9.9
    $ php api/test.php test

    The command will output the search results, which confirms that searchd is working properly and we can search from our applications using the client API.

    Sphinx Search Beginner's Guide

What just happened?

We created an index from the data stored in a MySQL table. We then used Sphinx's search utility to search for the test term in the index. The results showed that Sphinx is working properly and that the index we created was fine.

The major difference between search results by MySQL and Sphinx is that Sphinx does not return the actual data but only the document id. Using these document IDs, we need to fetch the actual data (from its source) to display it. Along with the document id, Sphinx also returns all the attributes and weight of each document. The higher the weight, the higher the relevance of that document with the search query.

We then used the PHP implementation of the Sphinx Client API to search for the same test term, but this time from within a PHP script.

Data to be indexed

The first thing we did was to create a MySQL database and then import the sample data into it. This gave us the data as shown in the following screenshot:

Sphinx Search Beginner's Guide

Creating the Sphinx configuration file

Sphinx creates an index based on the options defined in the Sphinx configuration file sphinx.conf. This file is divided into different sections:

  • source: This section holds all the settings related to the source of the data to be indexed, which in our case is a MySQL database.
  • index: This section holds options which tell Sphinx where and how to save the index. These options are used during indexing-time.
  • indexer: This section holds options for the indexer program.
  • searchd: This section holds the options used while searching the index.

In this article we will not go into great detail about all the options used in the configuration file. However, a few options to look for are:

  • sql_*: These options are there to tell Sphinx about different MySQL settings; such as username, password, the database to use, and the port to use.
  • sql_query: This option holds the query that will be fired in order to get the data from the MySQL database.

Once the configuration file is ready, the index can be created by issuing the following command:

$ /usr/local/sphinx/bin/indexer -all

During the indexing operation, some information is displayed in the console such as what configuration file is being used by the indexer, how many documents were found, how much time it took to index, and other related information.

To run indexer commands, you need to be the root user. You can either switch to root user using the su - command, or you could prefix all indexer commands with sudo.

Searching the index

Sphinx provides a command-line utility search which comes in handy to quickly query the index that we created earlier. However, this utility should only be used for testing purposes. In the production environment one should always use the searchd and its client API implementation.

$ /usr/local/sphinx/bin/search test

The output of the search command gives us the results that matched the search term test. The result shows us the document id and weight, amongst other information for the queried term.

Similar information is displayed when we use the PHP client API to search.

Have a go hero

We created a very basic example to see how Sphinx works; however, you can extend and explore this by:

  • Adding a few more records to the documents table
  • Re-indexing the documents table
  • Searching with different search phrases and examining the returned results and their weights

Why use Sphinx for full-text searching?

If you're looking for a good Database Management System (DBMS), there are plenty of options available with support for full-text indexing and searches, such as MySQL, PostgreSQL, and SQL Server. There are also external full-text search engines, such as Lucene and Solr. Let's see the advantages of using Sphinx over the DBMS's full-text searching capabilities and other external search engines:

  • It has a higher indexing speed. It is 50 to 100 times faster than MySQL FULLTEXT and 4 to 10 times faster than other external search engines.
  • It also has higher searching speed since it depends heavily on the mode, Boolean vs. phrase, and additional processing. It is up to 500 times faster than MySQL FULLTEXT in cases involving a large result set with GROUP BY. It is more than two times faster in searching than other external search engines available.
  • As mentioned earlier, relevancy is among the key features one expects when using a search engine, and Sphinx performs very well in this area. It has phrase-based ranking in addition to classic statistical BM25 ranking.
  • Last but not the least, Sphinx has better scalability. It can be scaled vertically (utilizing many CPUs, many HDDs) or horizontally (utilizing many servers), and this comes out of the box with Sphinx. One of the biggest known Sphinx cluster has over 3 billion records with more than 2 terabytes of size.

In one of his presentations, Andrew Aksyonoff (creator of Sphinx) presented the following benchmarking results. Approximately 3.5 Million records with around 5 GB of text were used for the purpose.

 

MySQL Lucene Sphinx
Indexing time, min 1627 176 84
Index size, MB 3011 6328 2850
Match all, ms/q 286 30 22
Match phrase, ms/q 3692 29 21
Match bool top-20, ms/q 24 29 13

Apart from a basic search, there are many features that make Sphinx a better solution for searching. These features include multivalve attributes, tokenizing settings, wordforms, HTML processing, geosearching, ranking, and many others.

Summary

In this article:

  • We learned how to check whether Sphinx was installed properly or not. We saw the directory structure Sphinx creates to store its binary files, configuration files, and other data.
  • We then learned what full-text search is and what its advantages over normal search are. We also saw how full-text search has been implemented in MySQL with an example. We saw the syntax of an SQL query used to search a full-text indexed field in MySQL.
  • We have also seen why to use Sphinx, an external search engine, instead of database's native full-text support. We saw how Sphinx excels in many ways, and outperforms most of the databases and external search engines available today.
  • Lastly we saw how to create an index using the indexer utility, and then how to search that index from the command line as well as other applications using client API implementations.

Further resources on this subject:


Sphinx Search Beginner's Guide Implement full-text search with lightning speed and accuracy using Sphinx
Published: March 2011
eBook Price: $23.99
Book Price: $39.99
See more
Select your format and quantity:

About the Author :


Abbas Ali

Abbas Ali has over 6 years of experience in PHP Development and is a Zend Certified PHP 5 Engineer. A Mechanical Engineer by education, Abbas turned to software development just after finishing his engineering degree. He is a member of the core development team for the Coppermine Photo Gallery, an open source project which is one of the most popular photo gallery applications in the world.

Fascinated with both machines and knowledge, Abbas is always learning new programming techniques. He got acquainted with Sphinx in 2009 and has been using it in most of his commercial projects ever since. He loves open source and believes in contributing back to the community.

Abbas is married to Tasneem and has a cute little daughter, Munira. He lives at Nagpur (India) since his birth and is in no rush to move to any other city in the world. In his free time he loves to watch movies and television. He is also an amateur photographer and cricketer.

Abbas is currently working as Chief Operating Officer and Technical Manager at SANIsoft Technologies Private Limited, Nagpur, India. The company specializes in development of large, high performance and scalable PHP applications.

Books From Packt


Drupal E-commerce with Ubercart 2.x
Drupal E-commerce with Ubercart 2.x

TYPO3 4.2 E-Commerce
TYPO3 4.2 E-Commerce

PHP 5 E-commerce Development
PHP 5 E-commerce Development

Django 1.2 E-commerce
Django 1.2 E-commerce

PostgreSQL 9.0 High Performance
PostgreSQL 9.0 High Performance

WordPress 3.0 Search Engine Optimization
WordPress 3.0 Search Engine Optimization

OpenCart 1.4 Beginner's Guide
OpenCart 1.4 Beginner's Guide

Solr 1.4 Enterprise Search Server
Solr 1.4 Enterprise Search Server


Code Download and Errata
Packt Anytime, Anywhere
Register Books
Print Upgrades
eBook Downloads
Video Support
Contact Us
Awards Voting Nominations Previous Winners
Judges Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software
Resources
Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software