Search is by far the most important feature of an application where data is stored and retrieved. If it hadn't been for search, Google wouldn't exist, so we can imagine the importance of search in the computing world.
Search can be found in the following types of applications:
For desktop applications, search is a quick way of locating files. Most desktop applications are not data-oriented, that is, they are not meant to organize and display information. They are rather meant to perform certain tasks, making search a secondary feature.
When using a web application, more often than not, search becomes a means to navigate the website and look for things that we are interested in, things which are otherwise hidden deep inside the site's structure. Search becomes more important if the web application is full of rich-text content such as blogs, articles, and knowledge bases, where a user needs the search functionality to find a particular piece of information.
In this chapter we will:
Discuss different ways to search for data
See how Sphinx helps us in achieving our goal
Learn how to install Sphinx
So let's get on with it...
Searching can be done in different ways but here we will take a look at the two most commonly used methods.
Whenever your application is dealing with some kind of data, a database is generally involved. There are many databases (both free and commercial) available in the market. MySQL and PostgreSQL, both of which Sphinx supports natively, are two of the free and open source database servers available.
A live database is one that is actively updated with the latest version of data. At times you may use one database for reading and another for writing, and in such cases you will sync both the databases occasionally. We cannot call such a database 'live', because when reading from one database, while data is being written to the other database, you won't be reading the latest data.
On the other hand, whenever reading from and writing to the database takes place in real-time, we call it a live database.
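The difference between a live and a non-live database can be sketched with a toy model. The class below is purely illustrative (the names ReplicatedStore, primary, and replica are made up for this example): writes go to one store, and a second store only catches up when it is synced.

```python
class ReplicatedStore:
    """Toy model: writes go to a primary store, reads may come from a replica."""

    def __init__(self):
        self.primary = {}   # always holds the latest data
        self.replica = {}   # only as fresh as the last sync

    def write(self, key, value):
        self.primary[key] = value

    def sync(self):
        # Copy the primary's current state over to the replica.
        self.replica = dict(self.primary)

    def read_live(self, key):
        # Reading where the writes happen: a "live" read.
        return self.primary.get(key)

    def read_replica(self, key):
        # Reading from the copy: may return stale or missing data.
        return self.replica.get(key)


store = ReplicatedStore()
store.write("user:1", "Alice")
print(store.read_live("user:1"))     # Alice
print(store.read_replica("user:1"))  # None, until sync() is called
store.sync()
print(store.read_replica("user:1"))  # Alice
```

Reading from the replica before the sync misses the new record, which is exactly why such a setup is not considered live.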
Let's take an example to understand how search works in the case of a live database.
Assume that we have two database tables in our MySQL database:
users
addresses
The users table holds data such as a user's name, e-mail, and password. The addresses table holds the addresses belonging to users. Each user can have multiple addresses, so the users and addresses tables are related to each other.
Let's say we want to search for users based on their name and address. The entered search term can be either the name or part of the address. While performing a search directly on the database, our MySQL query would look something like:
SELECT u.id, u.name
FROM users AS u
LEFT JOIN addresses AS a ON u.id = a.user_id
WHERE u.name LIKE '%search_term%'
   OR a.address LIKE '%search_term%'
GROUP BY u.id;
The given query will directly search the specified database tables and get the results. The main advantage of using this approach is that we are always performing a search on the latest version of the available data. Hence, if a new user's data has been inserted just before you initiated the search, you will see that user's data in your search results if it matches your search query.
However, one major disadvantage of this approach is that an SQL query to perform such a search is fired every time a search request comes in, and this becomes an issue when the number of records in the users table increases. With each search query, two tables are joined. This adds overhead and further hinders the performance of the query.
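To see the live-database approach in action, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for MySQL so that it runs without a server; the sample rows and the search_users helper are made up for illustration, and the password column is left out for brevity.

```python
import sqlite3


def search_users(conn, term):
    """Search users by name or address directly against the live tables."""
    pattern = f"%{term}%"
    return conn.execute(
        """
        SELECT u.id, u.name
        FROM users AS u
        LEFT JOIN addresses AS a ON u.id = a.user_id
        WHERE u.name LIKE ? OR a.address LIKE ?
        GROUP BY u.id
        ORDER BY u.id
        """,
        (pattern, pattern),
    ).fetchall()


# SQLite stands in for MySQL here; the schema mirrors the two tables above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE addresses (id INTEGER PRIMARY KEY, user_id INTEGER, address TEXT);
    INSERT INTO users (id, name, email)
        VALUES (1, 'Alice Smith', 'alice@example.com');
    INSERT INTO addresses (user_id, address)
        VALUES (1, '12 Baker Street, London');
    """
)

print(search_users(conn, "Baker"))  # [(1, 'Alice Smith')]

# A user inserted just before the next search shows up immediately.
conn.execute(
    "INSERT INTO users (id, name, email) VALUES (2, 'Bob Baker', 'bob@example.com')"
)
print(search_users(conn, "Baker"))  # [(1, 'Alice Smith'), (2, 'Bob Baker')]
```

Every call to search_users runs the join and the pattern match again, which is the per-request cost described above.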
In the second approach, a query is not fired directly on a database table. Rather, an index is created from the data stored in the database. This index contains data from all the related tables. The index can itself be stored in a database or on a file system.
The advantage of using this approach is that we need not join tables in SQL queries each time a search request comes in, and the search request would not scan every row stored in the database. The search request is directed towards the index which is highly optimized for searching.
The disadvantage would be the additional storage required to store the index and the time required to build the index. However, these are traded off for the time saved during an actual search request.
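The idea of searching an index instead of the tables can be sketched with a tiny inverted index. This is only an illustration of the concept, not how Sphinx stores its indexes; the sample rows and the function names are assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical sample rows, already joined from the users and addresses tables.
ROWS = [
    (1, "Alice Smith", "12 Baker Street, London"),
    (2, "Bob Baker", "7 Hill Road, Leeds"),
    (2, "Bob Baker", "3 Lake View, London"),
]


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(rows):
    """Map each word to the set of user ids whose name or address contains it."""
    index = defaultdict(set)
    for user_id, name, address in rows:
        for word in tokenize(name) + tokenize(address):
            index[word].add(user_id)
    return index


def search(index, term):
    """Return ids of users matching every word in term, using only the index."""
    words = tokenize(term)
    if not words:
        return []
    ids = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(ids)


# Building the index costs time and storage once, up front.
index = build_index(ROWS)

# Each search is then a lookup; no join or table scan happens here.
print(search(index, "london"))  # [1, 2]
print(search(index, "baker"))   # [1, 2]
print(search(index, "leeds"))   # [2]
```

Note that this sketch matches whole words only, unlike the LIKE query shown earlier, and that the index has to be rebuilt or updated when the underlying data changes.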
No, we will not discuss The Great Sphinx of Giza here, we're talking about the other Sphinx, popular in the computing world. Sphinx stands for SQL Phrase Index.
Sphinx is a full-text search engine (generally standalone) which provides fast, relevant, efficient full-text search functionality to third-party applications. It was especially created to facilitate searches on SQL databases and integrates very well with languages such as PHP, Python, Perl, Ruby, and Java.
At the time of writing this book, the latest stable release of Sphinx was v0.9.9.
Some of the major features of Sphinx include (taken from http://sphinxsearch.com):
High indexing speed (up to 10 MB/sec on modern CPUs)
High search speed (average query is under 0.1 sec on 2 to 4 GB of text collection)
High scalability (up to 100 GB of text, up to 100 million documents on a single CPU)
Supports distributed searching (since v.0.9.6)
Supports MySQL (MyISAM and InnoDB tables are both supported) and PostgreSQL natively
Supports phrase searching
Supports phrase proximity ranking, providing good relevance
Supports English and Russian stemming
Supports any number of document fields (weights can be changed on the fly)
Supports document groups
Supports stopwords, that is, words from a given list are left out of the index so that only the more relevant words are indexed
Supports different search modes ("match extended", "match all", "match phrase" and "match any" as of v.0.9.5)
Generic XML interface which greatly simplifies custom integration
Pure-PHP (that is, NO module compiling and so on) search client API
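The names of the search modes in the list above hint at what each one does. The following sketch shows one plain reading of "match all", "match any", and "match phrase" on a single document. It is a simplified illustration under our own assumptions, not Sphinx's implementation or API, and "match extended" is left out.

```python
def tokenize(text):
    """Split text into lowercase words on whitespace."""
    return text.lower().split()


def match_all(query, document):
    """True if every query word occurs somewhere in the document."""
    doc_words = set(tokenize(document))
    return all(word in doc_words for word in tokenize(query))


def match_any(query, document):
    """True if at least one query word occurs in the document."""
    doc_words = set(tokenize(document))
    return any(word in doc_words for word in tokenize(query))


def match_phrase(query, document):
    """True if the query words occur in the document as one consecutive run."""
    q = tokenize(query)
    d = tokenize(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc = "the quick brown fox jumps over the lazy dog"
print(match_all("fox quick", doc))       # True
print(match_any("cat dog", doc))         # True
print(match_phrase("quick brown", doc))  # True
print(match_phrase("brown quick", doc))  # False
```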
Back in 2001, there weren't many good solutions for searching in web applications. Andrew Aksyonoff, a Russian developer, was facing difficulties in finding a search engine with features such as good search quality (relevance), high searching speed, and low resource requirements - for example, disk usage and CPU.
He tried a few available solutions and even modified them to suit his needs, but in vain. Eventually he decided to come up with his own search engine, which he later named Sphinx.
After the first few releases of Sphinx, Andrew received good feedback from users. Over a period of time, he decided to continue developing Sphinx and founded Sphinx Technologies Inc.
Today Andrew is the primary developer for Sphinx, along with a few others who joined the project. At the time of writing, Sphinx was under heavy development, with regular releases.
Sphinx is free and open source software which can be distributed or modified under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation, either version 2 or any later version.
However, if you intend to use or embed Sphinx in a project but do not want to disclose the source code as required by GPL, you will need to obtain a commercial license by contacting Sphinx Technologies Inc. at http://sphinxsearch.com/contacts.html
Enough talking, let's get on to some real action. The first step is to install Sphinx itself.
Sphinx was developed and tested mostly on UNIX based systems. All modern UNIX based operating systems with an ANSI compliant compiler should be able to compile and run Sphinx without any issues. Sphinx is known to run without any issues on the following operating systems:
Linux (Kernel 2.4.x and 2.6.x of various distributions)
Microsoft Windows 2000 and XP
FreeBSD 4.x, 5.x, 6.x
NetBSD 1.6, 3.0
Solaris 9, 11
Mac OS X
Note: The Windows version of Sphinx is not meant to be used on production servers. It should only be used for testing and debugging. This is the primary reason that all examples given in this book will be for Linux-based systems.
If you intend to install Sphinx on a UNIX based system, then you need to check that the following are available:
C++ compiler (GNU GCC works fine)
A make program (GNU make works fine)
The XML libraries libexpat1 and libexpat1-dev, if you intend to use the xmlpipe2 data source (the package names may be different on non-Ubuntu distributions)
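Before starting the build, it can help to confirm that the compiler and make are actually on your PATH. The following is a small sketch of such a check; the helper name find_tools is our own, and the tool names g++ and make assume the GNU toolchain mentioned in the list above.

```python
import shutil


def find_tools(tools=("g++", "make")):
    """Return a dict mapping each tool name to its full path, or None if absent."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found'}")
```

This only checks for the executables; it does not verify that the XML libraries or their headers are installed.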
1. Download the latest stable version of the sphinx source from http://sphinxsearch.com/downloads.html.
2. Extract it anywhere on your file system and go inside the extracted sphinx directory:
$ tar -xzvf sphinx-0.9.9.tar.gz
$ cd sphinx-0.9.9
3. Run the configure utility:
$ ./configure --prefix=/usr/local/sphinx
4. Build from the source:
$ make
5. Install the application (run as root):
$ make install
We downloaded the latest release of Sphinx and extracted it using the tar command. We then ran the configure command, which gets the details of our machine and also checks for all dependencies. If any of the dependencies is missing, it will throw an error. We will take a look at possible dependency issues in a while.
Once we are done with configure, the make command will build (compile) the source code. After that, make install will actually install the binaries to the location specified in the --prefix option to configure.
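Once make install has finished, a quick way to confirm the result is to look for the binaries under the chosen prefix. The sketch below assumes that the binaries land in a bin directory under the prefix and uses the program names mentioned in this chapter (indexer, searchd, and search); the helper missing_binaries is our own and not part of Sphinx.

```python
from pathlib import Path


def missing_binaries(prefix="/usr/local/sphinx",
                     names=("indexer", "searchd", "search")):
    """Return the names of expected binaries not found under <prefix>/bin."""
    bin_dir = Path(prefix) / "bin"
    return [name for name in names if not (bin_dir / name).is_file()]


missing = missing_binaries()
if missing:
    print("Missing binaries:", ", ".join(missing))
else:
    print("All expected binaries are present.")
```

If you configured Sphinx with a different prefix, pass that path as the first argument.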
There are many options that can be passed to the configure command but we will take a look at a few important ones:
--prefix=/path: This option specifies the path to install the Sphinx binaries. In this book it is assumed that Sphinx was configured with --prefix=/usr/local/sphinx, so it is recommended that you configure your path with the same prefix.
--with-mysql=/path: Sphinx needs to know where to find MySQL's include and library files. It auto-detects this most of the time but if for any reason it fails, you can supply the path here.
--with-pgsql=/path: Same as --with-mysql but for PostgreSQL.
Most of the common errors you would find while configuring sphinx are related to missing MySQL include files.

This happens either because Sphinx's auto-detection of the MySQL include path failed, or because MySQL's devel package has not been installed on your machine. If MySQL's devel package is not installed, you can install it using the software package manager (apt or yum) of your operating system. In the case of Ubuntu, the package is called libmysqlclient16-dev.
Note: If you intend to use Sphinx without MySQL then you can use the configure option --without-mysql.
You need to follow pretty much the same steps if PostgreSQL include files are missing. In this book we will be primarily using MySQL for all examples.
Listed next are a few errors or issues that may arise during Sphinx's installation.
make can sometimes fail with the following error:
/bin/sh: g++: command not found
make[1]: *** [libsphinx_a-sphinx.o] Error 127
This may be because of a missing gcc-c++ package. Try installing it.
At times you might get compile-time errors like:
sphinx.cpp:67: error: invalid application of `sizeof' to
incomplete type `Private::SizeError<false>'
To fix the above error, try editing sphinx.h and replacing off_t with DWORD in the typedef for SphOffset_t:
#define STDOUT_FILENO fileno(stdout)
#else
typedef DWORD SphOffset_t;
#endif
One drawback of doing this would be that you won't be able to use full-text indexes larger than 2 GB.
1. Download the Win32 binaries of Sphinx from http://www.sphinxsearch.com/downloads.html. Choose the binary depending on whether you want MySQL support, or PostgreSQL support, or both.
2. Extract the downloaded ZIP to any suitable location. Let's assume it is extracted to C:\sphinx.
3. Install the searchd system as a Windows service by issuing the following command in the Command Prompt:
C:\sphinx\bin\searchd --install --config C:\sphinx\sphinx.conf --servicename SphinxSearch
This will install searchd as a service but it won't be started yet. Before starting the Sphinx service we need to create the sphinx.conf file and create indexes. This will be done in the next few chapters.
Installing Sphinx on Windows is a straightforward task. We have pre-compiled binaries for the Windows platform, which can be used directly.
After extracting the ZIP, we installed the Sphinx service. We need not install anything else since binaries for indexer and search are readily available in the C:\sphinx\bin directory.
The use of binaries to create indexes and the use of the searchd service to search will be covered in the next few chapters.
1. Download the latest stable version of the Sphinx source from http://sphinxsearch.com/downloads.html, extract it, and go inside the extracted directory:
$ tar -xzvf sphinx-0.9.9.tar.gz
$ cd sphinx-0.9.9
2. Run the configure utility:
$ ./configure --prefix=/usr/local/sphinx
3. If you are on a 64 bit Mac then use the following command to configure instead:
$ LDFLAGS="-arch x86_64" ./configure --prefix=/usr/local/sphinx
4. Next, run the make command:
$ make
5. Finally, run the following command to complete your installation:
$ sudo make install
We downloaded the Sphinx source and extracted it using the tar command. We then configured Sphinx and built it using the make command. The options to configure are the same as we used while installing Sphinx in Linux.
The only notable difference between installation on Linux and Mac is that if your Mac is 64 bit, your configure command is changed slightly as given above.
In this chapter:
We saw the different ways to perform search
We got to know about Sphinx and how it helps in performing searches
We took a look at some of Sphinx's features and its brief history
We learned how to install Sphinx on different operating systems
By now you should have installed Sphinx on your system and laid the foundation for Chapter 2, Getting Started, where we will get started with Sphinx and some basic usage.