Sphinx: Index Searching

by Abbas Ali | March 2011 | Open Source

Search is by far the most important feature of an application where data is stored and retrieved. If it hadn't been for search, Google wouldn't exist, so we can imagine the importance of search in the computing world. A reliable search engine like Sphinx Search can be the difference between running a successful and unsuccessful business.

In this article by Abbas Ali, author of Sphinx Search Beginner's Guide, we will learn how to use the Sphinx API to issue search queries from your PHP applications. We will examine different query syntaxes and learn about weighting, sorting, and grouping our search results.

 


Client API implementations for Sphinx

Sphinx comes with a number of native searchd client API implementations. Some third-party open source implementations for Perl, Ruby, and C++ are also available.

All APIs provide the same set of methods and implement the same network protocol. As a result, they all work in much the same way.

All examples in this article use the PHP implementation of the Sphinx API. However, you can just as easily use other programming languages. Sphinx is used with PHP more widely than with any other language.

Search using client API

Let's see how we can use the native PHP implementation of the Sphinx API to search. We will add the searchd-related configuration and then create a PHP file that searches the index using the Sphinx client API implementation for PHP.

Time for action – creating a basic search script

  1. Add the searchd config section to /usr/local/sphinx/etc/sphinx-blog.conf:

    source blog {
    # source options
    }

    index posts {
    # index options
    }

    indexer {
    # indexer options
    }

    # searchd options (used by search daemon)
    searchd
    {
    listen = 9312
    log = /usr/local/sphinx/var/log/searchd.log
    query_log = /usr/local/sphinx/var/log/query.log
    max_children = 30
    pid_file = /usr/local/sphinx/var/log/searchd.pid
    }

  2. Start the searchd daemon (as root user):
    $ sudo /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx-blog.conf

    [Screenshot: searchd startup output]

  3. Copy the sphinxapi.php file (the class with PHP implementation of Sphinx API) from the sphinx source directory to your working directory:
    $ mkdir /path/to/your/webroot/sphinx
    $ cd /path/to/your/webroot/sphinx
    $ cp /path/to/sphinx-0.9.9/api/sphinxapi.php ./
  4. Create a simple_search.php script that uses the PHP client API class to search the Sphinx-blog index, and execute it in the browser:
    <?php
    require_once('sphinxapi.php');
    // Instantiate the sphinx client
    $client = new SphinxClient();
    // Set search options
    $client->SetServer('localhost', 9312);
    $client->SetConnectTimeout(1);
    $client->SetArrayResult(true);

    // Query the index
    $results = $client->Query('php');

    // Output the matched results in raw format
    print_r($results['matches']);
  5. The output of the given code, as seen in a browser, will be similar to what's shown in the following screenshot:

    [Screenshot: raw print_r() output of the matches]

What just happened?

Firstly, we added the searchd configuration section to our sphinx-blog.conf file. The following options were added to searchd section:

  • listen: This option specifies the IP address and port that searchd will listen on. It can also specify a Unix-domain socket path. This option was introduced in v0.9.9 and should be used instead of the deprecated port option. If the port part is omitted, the default port, 9312, is used.

    Examples:

    • listen = localhost
    • listen = 9312
    • listen = localhost:9898
    • listen = 192.168.1.25:4000
    • listen = /var/run/sphinx.s
  • log: Name of the file where all searchd runtime events will be logged. This is an optional setting and the default value is "searchd.log".
  • query_log: Name of the file where all search queries will be logged. This is an optional setting and the default value is empty, that is, do not log queries.
  • max_children: The maximum number of concurrent searches to run in parallel. This is an optional setting and the default value is 0 (unlimited).
  • pid_file: Filename of the searchd process ID. This is a mandatory setting. The file is created on startup and it contains the head daemon process ID while the daemon is running. The pid_file becomes unlinked when the daemon is stopped.
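
To make the listen examples above concrete, here is a small PHP sketch that splits each form into an address, a port, or a socket path. The function name parse_listen_value() is made up for illustration; it is not part of Sphinx, it only covers the plain forms listed above, and it is not a description of how searchd itself parses the option.

```php
<?php
// Hypothetical helper (not part of Sphinx): split a "listen" value from the
// searchd section into its parts, following the forms listed above.
function parse_listen_value($value)
{
    $value = trim($value);

    // A leading slash means a Unix-domain socket path
    if ($value !== '' && $value[0] === '/') {
        return array('socket' => $value, 'host' => null, 'port' => null);
    }

    // Digits only: a port with no specific address
    if (preg_match('/^[0-9]+$/', $value) === 1) {
        return array('socket' => null, 'host' => null, 'port' => (int) $value);
    }

    // host:port, or a bare host that falls back to the default port 9312
    $parts = explode(':', $value, 2);
    $port  = isset($parts[1]) ? (int) $parts[1] : 9312;

    return array('socket' => null, 'host' => $parts[0], 'port' => $port);
}

print_r(parse_listen_value('192.168.1.25:4000'));
```

A helper like this can be handy when a deployment script needs to know which host and port to hand to SetServer().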

Once we had added the searchd configuration options, we started the searchd daemon as the root user. We passed the path of the configuration file as an argument to searchd. If no path is given, the default configuration file, /usr/local/sphinx/etc/sphinx.conf, is used.

After a successful startup, searchd listens on port 9312 on all network interfaces configured on the server. If we want searchd to listen on a specific interface, we can specify its hostname or IP address in the value of the listen option:

listen = 192.168.1.25:9312

The listen setting defined in the configuration file can be overridden in the command line while starting searchd by using the -l command line argument.

There are other (optional) arguments that can be passed to searchd as seen in the following screenshot:

[Screenshot: searchd command line options]

searchd needs to be running all the time when we are using the client API. The first thing you should always check is whether searchd is running or not, and start it if it is not running.
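
As a quick way to perform that check from PHP, the sketch below tries to open a TCP connection to the port searchd is expected to listen on. The function name is_port_open() is our own, and the check only proves that something is listening on the port, not that the listener is really searchd.

```php
<?php
// Returns true if something accepts TCP connections on $host:$port.
// This only proves the port is open, not that the listener is searchd.
function is_port_open($host, $port, $timeout = 1.0)
{
    $errno  = 0;
    $errstr = '';
    // Suppress the warning on failure; we report through the return value
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}

if (!is_port_open('localhost', 9312)) {
    print "searchd does not seem to be listening on port 9312\n";
}
```

If the check fails, start searchd as shown earlier before running any of the search scripts.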

We then created a PHP script to search the sphinx-blog index. To search the Sphinx index, we need to use the Sphinx client API. As we are working with a PHP script, we copied the PHP client implementation class (sphinxapi.php), which comes with the Sphinx source, to our working directory so that we can include it in our script. However, you can keep this file anywhere on the file system as long as you can include it in your PHP script.

Throughout this article we will be using /path/to/your/webroot/sphinx as the working directory and we will create all PHP scripts in that directory. We will refer to this directory simply as webroot.

We instantiated the SphinxClient class and then used the following class methods to set up the Sphinx client API:

  • SphinxClient::SetServer($host, $port)—This method sets the searchd hostname and port. All subsequent requests use these settings unless this method is called again with some different parameters. The default host is localhost and port is 9312.
  • SphinxClient::SetConnectTimeout($timeout)—This is the maximum time allowed to spend trying to connect to the server before giving up.
  • SphinxClient::SetArrayResult($arrayresult)—This is a PHP client API-specific method. It specifies whether the matches should be returned as an array or a hash. The default value is false, which means that matches will be returned in a PHP hash format, where document IDs will be the keys, and other information (attributes, weight) will be the values. If $arrayresult is true, then the matches will be returned in plain arrays with complete per-match information.
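
The difference between the two result shapes is easier to see side by side. The following sketch uses made-up sample data (the IDs, weights, and the author_id attribute are assumptions, not output from the article's index) and a hypothetical helper, matches_hash_to_list(), that converts the hash shape into the plain-array shape.

```php
<?php
// Made-up sample in the hash shape (SetArrayResult(false)):
// document IDs are the array keys
$hashMatches = array(
    5 => array('weight' => 2, 'attrs' => array('author_id' => 1)),
    2 => array('weight' => 1, 'attrs' => array('author_id' => 3)),
);

// Convert to the plain-array shape (SetArrayResult(true)):
// the document ID moves inside each match as 'id'
function matches_hash_to_list(array $hashMatches)
{
    $list = array();
    foreach ($hashMatches as $docId => $match) {
        $list[] = array('id' => $docId) + $match;
    }
    return $list;
}

$listMatches = matches_hash_to_list($hashMatches);
print $listMatches[0]['id']; // prints 5
```

The plain-array shape keeps the matches in result order under sequential keys, which is convenient when looping over them.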

After that, the actual querying of index was pretty straightforward using the SphinxClient::Query($query) method. It returned an array with matched results, as well as other information such as error, fields in index, attributes in index, total records found, time taken for search, and so on. The actual results are in the $results['matches'] variable. We can run a loop on the results, and it is a straightforward job to get the actual document's content from the document ID and display it.
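
Before looping over the matches, it is worth checking whether the query succeeded at all. The sketch below assumes that Query() returns false on failure and that the error text comes from SphinxClient::GetLastError(); the helper name summarize_results() and the sample data are our own, for illustration only.

```php
<?php
// Summarize the return value of SphinxClient::Query().
// $results is false on failure; $error would come from $client->GetLastError().
function summarize_results($results, $error = '')
{
    if ($results === false) {
        return "Query failed: $error";
    }
    // Collect the document IDs (the 'matches' key is absent when nothing matched)
    $ids = array();
    if (isset($results['matches'])) {
        foreach ($results['matches'] as $match) {
            $ids[] = $match['id'];
        }
    }
    return sprintf(
        '%d of %d matches in %s sec, ids: %s',
        count($ids),
        $results['total_found'],
        $results['time'],
        implode(', ', $ids)
    );
}

// Made-up result in the SetArrayResult(true) shape, for illustration only
$sample = array(
    'total_found' => 2,
    'time' => '0.001',
    'matches' => array(
        array('id' => 5, 'weight' => 2, 'attrs' => array()),
        array('id' => 2, 'weight' => 1, 'attrs' => array()),
    ),
);
print summarize_results($sample);
// prints: 2 of 2 matches in 0.001 sec, ids: 5, 2
```

In a real script the call would look like summarize_results($client->Query('php'), $client->GetLastError()).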

Matching modes

When a full-text search is performed on the Sphinx index, different matching modes can be used by Sphinx to find the results. The following matching modes are supported by Sphinx:

  • SPH_MATCH_ALL—This is the default mode and it matches all query words, that is, only records that match all of the queried words will be returned.
  • SPH_MATCH_ANY—This matches any of the query words.
  • SPH_MATCH_PHRASE—This matches query as a phrase and requires a perfect match.
  • SPH_MATCH_BOOLEAN—This matches query as a Boolean expression.
  • SPH_MATCH_EXTENDED—This matches query as an expression in Sphinx internal query language.
  • SPH_MATCH_EXTENDED2—This matches query using the second version of Extended matching mode. This supersedes SPH_MATCH_EXTENDED as of v0.9.9.
  • SPH_MATCH_FULLSCAN—In this mode the query terms are ignored and no text-matching is done, but filters and grouping are still applied.
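
To get a feel for the first three modes before running them against the index, here is a rough emulation in plain PHP. It works on a raw string rather than an index, the function name emulate_match() and the sample sentence are made up, and it is only meant to illustrate the rules described above, not how Sphinx actually matches.

```php
<?php
// Illustrative emulation only; Sphinx matches against its index, not raw strings.
function emulate_match($text, $query, $mode)
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $terms = preg_split('/\W+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    // Query words that also occur in the text
    $found = array_intersect($terms, $words);

    switch ($mode) {
        case 'all':    // every query word must be present
            return count($found) === count($terms);
        case 'any':    // at least one query word must be present
            return count($found) > 0;
        case 'phrase': // query words must appear consecutively, in order
            $haystack = ' ' . implode(' ', $words) . ' ';
            $needle   = ' ' . implode(' ', $terms) . ' ';
            return strpos($haystack, $needle) !== false;
        default:
            throw new InvalidArgumentException("Unknown mode: $mode");
    }
}

$text = 'PHP is a popular scripting language for web programming';
var_dump(emulate_match($text, 'php programming', 'all'));       // bool(true)
var_dump(emulate_match($text, 'php programming', 'phrase'));    // bool(false)
var_dump(emulate_match($text, 'scripting language', 'phrase')); // bool(true)
```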

Time for action – searching with different matching modes

  1. Create a PHP script display_results.php in your webroot with the following code:

    <?php
    // Database connection credentials
    $dsn = 'mysql:dbname=myblog;host=localhost';
    $user = 'root';
    $pass = '';

    // Instantiate the PDO (PHP 5 specific) class
    try {
        $dbh = new PDO($dsn, $user, $pass);
    } catch (PDOException $e) {
        // Stop here: the statements below need a working connection
        die('Connection failed: ' . $e->getMessage());
    }
    // PDO statement to fetch the post data
    $query = "SELECT p.*, a.name FROM posts AS p " .
        "LEFT JOIN authors AS a ON p.author_id = a.id " .
        "WHERE p.id = :post_id";
    $post_stmt = $dbh->prepare($query);

    // PDO statement to fetch the post's categories
    $query = "SELECT c.name FROM posts_categories AS pc " .
        "LEFT JOIN categories AS c ON pc.category_id = c.id " .
        "WHERE pc.post_id = :post_id";
    $cat_stmt = $dbh->prepare($query);

    // Function to display the results in a nice format
    function display_results($results, $message = null)
    {
        global $post_stmt, $cat_stmt;
        if ($message) {
            print "<h3>$message</h3>";
        }
        if (!isset($results['matches'])) {
            print "No results found<hr />";
            return;
        }
        foreach ($results['matches'] as $result) {
            // Get the data for this document (post) from db
            $post_stmt->bindParam(':post_id',
                $result['id'],
                PDO::PARAM_INT);
            $post_stmt->execute();
            $post = $post_stmt->fetch(PDO::FETCH_ASSOC);

            // Get the categories of this post
            $cat_stmt->bindParam(':post_id',
                $result['id'],
                PDO::PARAM_INT);
            $cat_stmt->execute();
            $categories = $cat_stmt->fetchAll(PDO::FETCH_ASSOC);

            // Output title, author and categories
            print "Id: {$post['id']}<br />" .
                "Title: {$post['title']}<br />" .
                "Author: {$post['name']}";
            $cats = array();
            foreach ($categories as $category) {
                $cats[] = $category['name'];
            }
            if (count($cats)) {
                print "<br />Categories: " . implode(', ', $cats);
            }
            print "<hr />";
        }
    }

  2. Create a PHP script search_matching_modes.php in your webroot with the following code:
    <?php
    // Include the api class
    require_once('sphinxapi.php');
    // Include the file which contains the function to display results
    require_once('display_results.php');

    $client = new SphinxClient();
    // Set search options
    $client->SetServer('localhost', 9312);
    $client->SetConnectTimeout(1);
    $client->SetArrayResult(true);

    // SPH_MATCH_ALL mode will be used by default
    // and we need not set it explicitly
    display_results(
    $client->Query('php'),
    '"php" with SPH_MATCH_ALL');

    display_results(
    $client->Query('programming'),
    '"programming" with SPH_MATCH_ALL');

    display_results(
    $client->Query('php programming'),
    '"php programming" with SPH_MATCH_ALL');

    // Set the mode to SPH_MATCH_ANY
    $client->SetMatchMode(SPH_MATCH_ANY);

    display_results(
    $client->Query('php programming'),
    '"php programming" with SPH_MATCH_ANY');

    // Set the mode to SPH_MATCH_PHRASE
    $client->SetMatchMode(SPH_MATCH_PHRASE);

    display_results(
    $client->Query('php programming'),
    '"php programming" with SPH_MATCH_PHRASE');

    display_results(
    $client->Query('scripting language'),
    '"scripting language" with SPH_MATCH_PHRASE');

    // Set the mode to SPH_MATCH_FULLSCAN
    $client->SetMatchMode(SPH_MATCH_FULLSCAN);

    display_results(
    $client->Query('php'),
    '"php" with SPH_MATCH_FULLSCAN');
  3. Execute search_matching_modes.php in a browser (http://localhost/sphinx/search_matching_modes.php).

 


What just happened?

The first thing we did was create a script, display_results.php, which connects to the database and gathers additional information on related posts. This script has a function, display_results(), that outputs the results returned by Sphinx in a nice format. The code is pretty much self-explanatory.

Next, we created the PHP script that actually performs the search. We used the following matching modes and queried using different search terms:

  • SPH_MATCH_ALL (Default mode which doesn't need to be explicitly set)
  • SPH_MATCH_ANY
  • SPH_MATCH_PHRASE
  • SPH_MATCH_FULLSCAN

Let's see what the output of each query was and try to understand it:

display_results(
$client->Query('php'),
'"php" with SPH_MATCH_ALL');

display_results(
$client->Query('programming'),
'"programming" with SPH_MATCH_ALL');

The output for these two queries can be seen in the following screenshot:

[Screenshot: results for "php" and "programming" with SPH_MATCH_ALL]

The first two queries returned all posts containing the words "php" and "programming" respectively. We got posts with id 2 and 5 for "php", and 5 and 8 for "programming".

The third query was for posts containing both words, that is "php programming", and it returned the following result:

[Screenshot: result for "php programming" with SPH_MATCH_ALL]

This time we got only the post with id 5, as it was the only post containing both words of the query "php programming".

We used SPH_MATCH_ANY to search for any words of the search phrase:

// Set the mode to SPH_MATCH_ANY
$client->SetMatchMode(SPH_MATCH_ANY);

display_results(
$client->Query('php programming'),
'"php programming" with SPH_MATCH_ANY');

The function call returns the following output (results):

[Screenshot: results for "php programming" with SPH_MATCH_ANY]

As expected, we got posts with ids 5, 2, and 8. All these posts contain either "php" or "programming" or both.

Next, we tried our hand at SPH_MATCH_PHRASE, which returns only those records that match the search phrase exactly, that is, all words in the search phrase appear in the same order and consecutively in the index:

// Set the mode to SPH_MATCH_PHRASE
$client->SetMatchMode(SPH_MATCH_PHRASE);

display_results(
$client->Query('php programming'),
'"php programming" with SPH_MATCH_PHRASE');

display_results(
$client->Query('scripting language'),
'"scripting language" with SPH_MATCH_PHRASE');

The previous two function calls return the following results:

[Screenshot: results for the two SPH_MATCH_PHRASE queries]

The query "php programming" didn't return any results because no post matches that exact phrase. However, the post with id 2 matched the next query: "scripting language".

The last matching mode we used was SPH_MATCH_FULLSCAN. When this mode is used, the search phrase is completely ignored (in our case, "php" was ignored) and Sphinx returns all records from the index:

// Set the mode to SPH_MATCH_FULLSCAN
$client->SetMatchMode(SPH_MATCH_FULLSCAN);

display_results(
$client->Query('php'),
'"php" with SPH_MATCH_FULLSCAN');

The function call returns the following result (for brevity only a part of the output is shown in the following image):

[Screenshot: partial output for SPH_MATCH_FULLSCAN]

SPH_MATCH_FULLSCAN mode is used automatically if an empty string is passed to the SphinxClient::Query() method.

In SPH_MATCH_FULLSCAN mode every indexed document matches and no full-text searching is performed, but filters, sorting, and grouping are still applied. This is particularly useful when we only want to apply filters and don't want any full-text matching (for example, listing all blog posts in a given category).
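
A minimal sketch of that filter-only use case follows. It assumes sphinxapi.php has already been included (as in the scripts above) and that the index has a category_id attribute, which is an assumption about the index definition and not something shown in this article. The function name find_posts_in_category() is our own.

```php
<?php
// Assumes sphinxapi.php is already included and that the index has a
// "category_id" attribute (hypothetical; adjust to your own index).
function find_posts_in_category($client, $categoryId)
{
    // No full-text matching: only filters, sorting and grouping apply
    $client->SetMatchMode(SPH_MATCH_FULLSCAN);
    // Keep only documents whose category_id is in the given list
    $client->SetFilter('category_id', array((int) $categoryId));
    // The query text is ignored in this mode, so pass an empty string
    return $client->Query('');
}
```

With the client set up as in search_matching_modes.php, the call could be display_results(find_posts_in_category($client, 2), 'Posts in category 2').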

Boolean query syntax

Boolean mode queries allow expressions to make use of a complex set of Boolean rules to refine their searches. These queries are very powerful when applied to full-text searching. When using Boolean query syntax, certain characters have special meaning, as given in the following list:

  • &: Explicit AND operator
  • |: OR operator
  • -: NOT operator
  • !: NOT operator (alternate)
  • (): Grouping
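
Because these characters are operators, text typed by a user (for example "c-sharp") can change the meaning of a query. The sketch below shows one way to backslash-escape them before building a query string. The function name escape_query_text() and the exact character list are assumptions made for illustration; if your copy of sphinxapi.php provides an EscapeString() method, that is probably the better choice.

```php
<?php
// Sketch of escaping user input so operator characters are treated as text.
// The character list is an assumption; adjust it to the syntax you use.
function escape_query_text($text)
{
    $special = array('\\', '(', ')', '|', '-', '!', '&');
    $map = array();
    foreach ($special as $char) {
        $map[$char] = '\\' . $char;
    }
    // strtr() with an array replaces in a single pass, so the backslashes
    // added here are never escaped a second time
    return strtr($text, $map);
}

print escape_query_text('c-sharp | (php & mysql)');
// prints: c\-sharp \| \(php \& mysql\)
```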

Let's try to understand each of these operators using an example.

Time for action – searching using Boolean query syntax

  1. Create a PHP script search_boolean_mode.php in your webroot with the following code:
    <?php
    // Include the api class
    require_once('sphinxapi.php');
    // Include the file which contains the function to display results
    require_once('display_results.php');

    $client = new SphinxClient();
    // Set search options
    $client->SetServer('localhost', 9312);
    $client->SetConnectTimeout(1);
    $client->SetArrayResult(true);

    display_results(
    $client->Query('php programming'),
    '"php programming" (default mode)');

    // Set the mode to SPH_MATCH_BOOLEAN
    $client->SetMatchMode(SPH_MATCH_BOOLEAN);

    // Search using AND operator
    display_results(
    $client->Query('php & programming'),
    '"php & programming"');

    // Search using OR operator
    display_results(
    $client->Query('php | programming'),
    '"php | programming"');

    // Search using NOT operator
    display_results(
    $client->Query('php -programming'),
    '"php -programming"');

    // Search by grouping terms
    display_results(
    $client->Query('(php & programming) | (leadership & success)'),
    '"(php & programming) | (leadership & success)"');

    // Demonstrate how OR precedence is higher than AND
    display_results(
    $client->Query('development framework | language'),
    '"development framework | language"');

    // This won't work
    display_results($client->Query('-php'), '"-php"');

  2. Execute the script in a browser (the output is explained in the next section).

What just happened?

We created a PHP script to see how different Boolean operators work. Let's understand the working of each of them.

The first search query, "php programming", did not use any operator. There is always an implicit AND operator, so the query "php programming" actually means "php & programming". In the second search query we explicitly used the & (AND) operator. Thus the output of both queries was exactly the same, as shown in the following screenshot:

[Screenshot: results for "php programming" and "php & programming"]

Our third search query used the OR operator. With OR, a document is returned if either of the terms matches. Thus "php | programming" will return all documents that match either "php" or "programming", as seen in the following screenshot:

[Screenshot: results for "php | programming"]

The fourth search query used the NOT operator. In this case, the word that comes just after the NOT operator must not be present in the matched results. So "php -programming" will return all documents that match "php" but do not match "programming". We get the results seen in the following screenshot:

[Screenshot: results for "php -programming"]

Next, we used the grouping operator. This operator is used to group other operators. We searched for "(php & programming) | (leadership & success)", and this returned all documents that matched either "php" and "programming", or "leadership" and "success", as seen in the next screenshot:

[Screenshot: results for the grouped query]

After that, we ran a query to see how OR takes precedence over AND. The query "development framework | language" is treated by Sphinx as "(development) & (framework | language)". Hence we got documents matching either "development & framework" or "development & language", as shown here:

[Screenshot: results for "development framework | language"]
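
The grouping described above can be written out as an ordinary PHP condition. This is only an emulation on a raw string, with a made-up function name and made-up sample sentences, to show why a document containing "language" but not "development" is not returned.

```php
<?php
// Emulates the grouping described above for
// "development framework | language": development & (framework | language)
function matches_dev_query($text)
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $has = function ($word) use ($words) {
        return in_array($word, $words, true);
    };
    return $has('development') && ($has('framework') || $has('language'));
}

var_dump(matches_dev_query('Web development with a PHP framework')); // bool(true)
var_dump(matches_dev_query('A scripting language'));                 // bool(false)
```

Had AND bound more tightly than OR, the second sentence would have matched on "language" alone.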

Lastly, we saw how a query like "-php" does not return anything. Ideally it should have returned all documents which do not match "php", but for technical and performance reasons such a query is not evaluated. When this happens we get the following output:

[Screenshot: empty result for "-php"]

Extended query syntax

Apart from the Boolean operators, there are some more specialized operators and modifiers that can be used when using the extended matching mode.

Let's understand this with an example.

Time for action – searching with extended query syntax

  1. Create a PHP script search_extended_mode.php in your webroot with the following code:
    <?php
    // Include the api class
    require_once('sphinxapi.php');
    // Include the file which contains the function to display results
    require_once('display_results.php');

    $client = new SphinxClient();
    // Set search options
    $client->SetServer('localhost', 9312);
    $client->SetConnectTimeout(1);
    $client->SetArrayResult(true);

    // Set the mode to SPH_MATCH_EXTENDED2
    $client->SetMatchMode(SPH_MATCH_EXTENDED2);

    // Returns documents whose title matches "php" and
    // content matches "significant"
    display_results(
    $client->Query('@title php @content significant'),
    'field search operator');

    // Returns documents where "development" occurs
    // within the first 8 positions of the content field
    display_results(
    $client->Query('@content[8] development'),
    'field position limit modifier');

    // Returns only those documents where "php" and "namespaces"
    // are both found in the title and content fields (combined)
    display_results(
    $client->Query('@(title,content) php namespaces'),
    'multi-field search operator');

    // Returns documents where any of the field
    // matches "games"
    display_results(
    $client->Query('@* games'),
    'all-field search operator');

    // Returns documents where "development framework"
    // phrase matches exactly
    display_results(
    $client->Query('"development framework"'),
    'phrase search operator');

    // Returns documents where "people" and "passion" occur
    // within a span of fewer than five words
    display_results(
    $client->Query('"people passion"~3'),
    'proximity search operator');

    // Returns documents where at least two of the
    // four words in the quoted list match
    display_results(
    $client->Query('"people development passion framework"/2'),
    'quorum search operator');
  2. Execute the script in a browser (the output is explained in the next section).

What just happened?

For using extended query syntax, we set the match mode to SPH_MATCH_EXTENDED2:

$client->SetMatchMode(SPH_MATCH_EXTENDED2);

The first operator we used was field search operator. Using this operator we can tell Sphinx which fields to search against (instead of searching against all fields). In our example we searched for all documents whose title matches "php" and whose content matches "significant". As an output, we got posts (documents) with the id 5, which was the only document that satisfied this matching condition as shown below:

@title php @content significant

The search for that term returns the following result:

[Screenshot: result for the field search operator]

Following this, we used the field position limit modifier. The modifier limits the search to the first eight positions within the given field, so Sphinx selects only those documents where "development" occurs within the first eight positions of the content field.

@content[8] development

And we get the following result:

[Screenshot: result for the field position limit modifier]

Next, we used the multi-field search operator. With this operator you can specify which fields (combined) should match the queried terms. In our example, a document matches only when both "php" and "namespaces" are found within the title and content fields taken together; each word may occur in either field.

@(title,content) php namespaces

This gives the following result:

[Screenshot: result for the multi-field search operator]

The all-field search operator was used next. In this case the query is matched against all fields.

@* games

This search term gives the following result:

[Screenshot: result for the all-field search operator]

The phrase search operator works exactly the same way as setting the matching mode to SPH_MATCH_PHRASE. So, a search for the phrase "development framework" returns the post with id 7, since the exact phrase appears in its content.

"development framework"

The search term returns the following result:

[Screenshot: result for the phrase search operator]

Next, we used the proximity search operator. The proximity distance is specified in words, adjusted for word count, and applies to all words within quotes. So, "people passion"~3 means there must be a span of fewer than five words that contains both "people" and "passion". We get the following result:

[Screenshot: result for the proximity search operator]
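
The span rule can be emulated in a few lines of plain PHP. The sketch below follows the description above (a window shorter than the distance plus the number of query words); the function name, the sample sentences, and the decision to ignore word order inside the window are our own assumptions, and the code runs on a raw string rather than on a Sphinx index.

```php
<?php
// Illustrative emulation of the proximity rule described above: all query
// words must fall inside a window shorter than (distance + word count) words.
// Word order inside the window is not checked (an assumption of this sketch).
// $terms are expected in lowercase.
function emulate_proximity($text, array $terms, $distance)
{
    $words  = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $window = $distance + count($terms) - 1; // largest allowed span, in words

    for ($start = 0; $start < count($words); $start++) {
        $slice = array_slice($words, $start, $window);
        if (count(array_diff($terms, $slice)) === 0) {
            return true; // every term occurs within this window
        }
    }
    return false;
}

$terms = array('people', 'passion');
var_dump(emulate_proximity('People with great passion', $terms, 3));      // bool(true)
var_dump(emulate_proximity('People who have a real passion', $terms, 3)); // bool(false)
```

In the first sentence the two words span four positions, which is within the limit; in the second they span six.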

The last operator we used is called the quorum operator. With it, Sphinx returns only those documents that match at least the given threshold of words. "people development passion framework"/2 matches those documents where at least two of the four words in the query match. Our query returns the following result:

[Screenshot: result for the quorum search operator]
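
The quorum rule boils down to counting how many of the query words occur in a document. Here is a small emulation on raw strings; the function name and the sample sentences are made up for illustration, and Sphinx itself evaluates the operator against the index.

```php
<?php
// Illustrative emulation of the quorum rule: a document matches when at
// least $threshold of the distinct query words occur in it.
// $terms are expected in lowercase.
function emulate_quorum($text, array $terms, $threshold)
{
    $words   = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $matched = array_intersect(array_unique($terms), $words);
    return count($matched) >= $threshold;
}

$quorumTerms = array('people', 'development', 'passion', 'framework');
var_dump(emulate_quorum('A framework for rapid development', $quorumTerms, 2)); // bool(true)
var_dump(emulate_quorum('People skills matter', $quorumTerms, 2));             // bool(false)
```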

Using what we have learnt above, you can create complex search queries by combining any of the previously listed search operators. For example:

@title programming "photo gallery" -(asp|jsp) @* opensource

The query means that:

  • The document's title field should match 'programming'
  • The same document must also contain the words 'photo' and 'gallery' adjacently in any of the fields
  • The same document must not contain the words 'asp' or 'jsp'
  • The same document must contain the word 'opensource' in any of its fields

There are a few more operators in the extended query syntax, and you can see examples of them at http://sphinxsearch.com/docs/manual-0.9.9.html#extended-syntax.

Summary

In this article, we saw how to use the Sphinx API to search from within an application. Working with the index, we:

  • Wrote different search queries
  • Saw how the PHP implementation of the Sphinx client API can be used in PHP applications to issue some powerful search queries

About the Author


Abbas Ali

Abbas Ali has over 6 years of experience in PHP Development and is a Zend Certified PHP 5 Engineer. A Mechanical Engineer by education, Abbas turned to software development just after finishing his engineering degree. He is a member of the core development team for the Coppermine Photo Gallery, an open source project which is one of the most popular photo gallery applications in the world.

Fascinated with both machines and knowledge, Abbas is always learning new programming techniques. He got acquainted with Sphinx in 2009 and has been using it in most of his commercial projects ever since. He loves open source and believes in contributing back to the community.

Abbas is married to Tasneem and has a cute little daughter, Munira. He has lived in Nagpur (India) since birth and is in no rush to move to any other city in the world. In his free time he loves to watch movies and television. He is also an amateur photographer and cricketer.

Abbas is currently working as Chief Operating Officer and Technical Manager at SANIsoft Technologies Private Limited, Nagpur, India. The company specializes in development of large, high performance and scalable PHP applications.
