The scenario addressed in this article is a typical one: you are building a database-backed web application, and it needs a text-search capability. It is an increasingly common requirement, and there are a variety of options for implementing it. This article is mostly about two of them: using the text-search features built into your database versus using Apache Solr, an open-source search platform built on the established Apache Lucene. In this article by David Smiley, we discuss database text search in general rather than any particular database, because there are many databases and they vary widely. Likewise, much of this article should also be useful when considering one of Solr's competitors.
Text Search Features
If you think that text search is a basic capability of simply returning results that match words in a user query, then think again! A good search implementation gives you control over many technical details that affect how well that fundamental capability works, such as text analysis and relevancy ranking. There are also a variety of ancillary features that make a big difference, such as result highlighting and faceting.
- Text analysis: This is the processing of the original text into indexed terms, and there's a lot to it. Configurable tokenization of words could mean that a search for “Mac” matches text containing “MacBook”. Then there's synonym processing, so that users can search for similar words; you might want both a common-language dictionary and hand-picked synonyms for your data. There's the ability to handle desired languages intelligently instead of assuming English. And there's stemming, which normalizes word variations so that, for example, “working” and “work” are indexed as the same term. Yet another variation of text analysis is phonetic indexing, to find words that sound like the search query.
- Relevancy ranking: This is the logic behind ranking the search results that most closely match the query. In Lucene/Solr, a variety of factors feed into an overall equation, with the opportunity to adjust those factors for matches in certain fields or certain documents, or to use field values in a configurable equation. By comparison, the commercial Endeca platform allows configuration of a variety of matching rules that behave more like a strict sort.
- Query features & syntax: From boolean logic to grouping to phrases to fuzzy searches to score boosting, there are a variety of queries that you might perform and combine. Many applications prefer to hide this syntax from users, but some may wish to expose it for “advanced” searches.
- Result highlighting: Displaying a text snippet of a matched document containing the matched word in context is clearly very useful. We have all seen this in Google.
- Query spell correction (i.e. “did you mean”): Instead of unhelpfully returning no results (or very few), the search implementation should be able to suggest variation(s) of the search that will yield more results. This feature is customarily based on the actual indexed words, not a language dictionary. The variations might be based on the so-called edit distance, which is the number of single-character alterations needed to turn one word into another, or they might be based on phonetic matching.
- Faceted navigation: This is a must-have feature which enables search results to include aggregated values for designated fields that users can subsequently choose to filter the results on. It is commonly used on e-commerce sites to the left of the results to navigate products by various attributes like price ranges, vendors, etc.
- Term-suggest (AKA search auto-complete): As seen on Google, as you start typing a word in the search box, it suggests possible completions of the word. These are relevancy sorted and filtered to those that are also found with any words prior to the word you are typing.
- Sub-string indexing: In some cases, you need to match arbitrary sub-strings of words instead of being limited to complete words. Unlike with an SQL “like” clause, the data is indexed in such a way that this is quick.
- Geo-location search: Given records containing coordinates for locations on the globe, you should be able to search for records near a user-specified coordinate. An extension to Solr allows a radial search with appropriate ranking, but it is also straightforward to bound the search by a latitude & longitude box.
- Field/facet suggestions: The Endeca search platform can determine that your search query matches some field values used for faceting and then offer a convenient filter for them. For example, given a data set of employees, the search box could have a pop-up suggestion that the word in the search box matches a department code and then offer the choice of navigating to those matching records. This can be easier and faster than choosing facets to filter on, especially if there are a great number of facet-able fields. Solr doesn't have this feature but it would not be a stretch to implement it based on its existing foundation.
- Clustering: This is another aid to navigating search results, besides faceting. Search-result clustering dynamically divides the results into multiple groups called clusters, based on statistical correlation of terms in common. It is a bit of an exotic feature, but it is useful when there are many results with a lot of text, after any faceted navigation has been applied.
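To make the text-analysis bullet above concrete, here is a minimal sketch of an analysis chain in Python. The crude suffix-stripping is only a stand-in for a real stemmer (Solr configures a proper one, such as a Porter stemmer, per field type); everything here is illustrative, not Solr's actual code.

```python
import re

def analyze(text):
    """Toy analysis chain: tokenize, lowercase, strip common suffixes.
    A real analyzer chains configurable tokenizers and filters instead."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stemmed = []
    for tok in tokens:
        # Naive stand-in for a real stemmer such as Porter's:
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed

# "working" and "work" normalize to the same indexed term:
print(analyze("Working hard at work!"))  # ['work', 'hard', 'at', 'work']
```

Because the same chain runs at both index time and query time, a query for "work" finds documents that originally said "working".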
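Similarly, the edit distance mentioned under spell correction can be computed with the classic Levenshtein dynamic program. The suggester below is hypothetical: it linearly compares a misspelled query term against the indexed vocabulary, whereas a real implementation indexes the vocabulary for speed.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(term, indexed_terms, max_distance=2):
    """Offer 'did you mean' candidates from the indexed vocabulary,
    closest first, skipping exact matches."""
    scored = [(levenshtein(term, t), t) for t in indexed_terms]
    return [t for d, t in sorted(scored) if 0 < d <= max_distance]

print(suggest("sorl", ["solr", "lucene", "search", "sort"]))  # ['sort', 'solr']
```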
So that's quite a list and there are other features you may find too. This should give you a list of features to look for in whatever you choose. Some features are obviously more important to you than others.
How NOT to implement text search: the SQL “like” clause
Perhaps “back in the day” you implemented search by simply adding like “%searchword%” to the where clause of your SQL (the author is guilty as charged!). But of course, this has serious problems such as:
- It is very slow, especially given a data set of decent size and any amount of load on the server, because the database performs a simple brute-force scan of every row.
- There is no concept of relevancy (e.g. a match score); a record simply matches or it doesn't. You are forced to sort on one of your columns instead.
- It is too literal. Even if the search is case insensitive, any extra punctuation can screw it up, or it may match parts of words when you only wanted to match whole words.
So the bottom line is: don't do it! There are smarter approaches to this problem. About the only situation where this approach is defensible is a particular indexed database column holding a limited number of short values; there, searches go quickly and the approach is very easy to implement.
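For the record, the problems above are easy to demonstrate with Python's built-in sqlite3 module (the table and rows here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO articles (body) VALUES (?)", [
    ("The catalog lists every product.",),
    ("My cat sleeps all day.",),
])

# Every row is scanned, and "catalog" matches too: LIKE knows
# nothing about word boundaries, stemming, or relevancy.
ids = [row[0] for row in conn.execute(
    "SELECT id FROM articles WHERE body LIKE ? ORDER BY id", ("%cat%",))]
print(ids)  # [1, 2], not just the row about the cat
```

There is no score to sort by, and every row's text is examined on every query; both problems get worse as the table grows.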
Databases with Built-In Text Indexing Features
In relatively recent database history, it has become common for databases to have built-in text indexing features. It is tempting to use them since, after all, you are already using your database, so why not use it for text search too? I'll get to that in a bit. This is not a one-sided article; there are reasons why you might prefer to stick with your database's text-search solution. Here are the notable ones:
No architectural changes

Since you are already using your database, choosing its text indexing features represents no change, architecturally speaking: there is no additional system to install and maintain. There is, however, some non-trivial education necessary on your part to understand and use the text indexing features, but that is also true of Solr, perhaps more so.
No synchronization of data

Probably the biggest obstacle to using Solr for the text index is that you need to synchronize data between your database and Solr, assuming the data isn't going to live in Solr alone, an option discussed later. If it is satisfactory to fully rebuild the Solr index on occasion, perhaps nightly, then this should be relatively easy. Getting a little more complex is augmenting that scheme to synchronize just the changes (creates, updates, deletes) on a more frequent basis. Perhaps most difficult is attempting to update the index automatically the moment changes occur in the database. Solr does not yet support “near real time search”, which would allow changes to be searchable almost immediately after they are submitted to Solr. However, depending on data size and performance targets, changes can often be made searchable less than a minute after submission, if you need it that fast.
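One common middle ground for synchronizing just the changes is a timestamp high-water mark: each run re-indexes only rows modified since the previous run. A toy sketch, where an in-memory list and dict stand in for your RDBMS rows and the Solr client respectively:

```python
def sync_changes(db_rows, index, last_sync):
    """Copy rows modified since last_sync into the stand-in index and
    return the new high-water mark to remember for the next run."""
    high_water = last_sync
    for row in db_rows:
        if row["updated_at"] > last_sync:
            index[row["id"]] = row   # stand-in for posting a doc to Solr
            high_water = max(high_water, row["updated_at"])
    return high_water

index = {}
rows = [
    {"id": 1, "updated_at": 10, "title": "old, already indexed"},
    {"id": 2, "updated_at": 95, "title": "recently edited"},
]
mark = sync_changes(rows, index, last_sync=50)
print(sorted(index), mark)  # [2] 95: only the changed row was pushed
```

Note that deletes are invisible to a timestamp query, so deletions typically need tombstone rows, explicit delete notifications, or a periodic full rebuild.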
Solr includes a module called the “Data Import Handler”, which is mostly for pulling in data from databases; it can even handle incremental updates if the records contain a last-modified date. But getting data into Solr “just-in-time” will, in all but one circumstance, require some coordination work on your part. The exception is the “acts_as_solr” plugin available for the Ruby on Rails framework, which synchronizes data-model changes automatically. The closest thing in the Java landscape is Hibernate Search, an extension to Hibernate that uses Lucene; since it is not based on Solr, some notable features like faceting are not available.
No relational schema mapping
Solr has a flat schema and does not have any relational search capabilities. It does at least support multi-valued fields, which in most database schemas are modeled with additional tables. Your relational database schema is, of course, relational, and you will have to devise a mapping to put data into Solr. Often this is a straightforward mapping, perhaps requiring some de-normalization (i.e. in-lining related data, causing some duplication). However, there are cases that are particularly difficult or impossible to map, such as a one-to-many or many-to-many relationship with multiple searchable fields. If you suspect challenges in mapping your relational database schema to Solr, then be sure to pose your scenario to email@example.com for advice.
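As a tiny illustration of such de-normalization, here is a hypothetical mapping of a one-to-many relationship (products to vendors) into the flat, multi-valued document shape Solr expects; the table and field names are invented:

```python
def flatten(product, vendor_rows):
    """In-line a product's related vendor rows as a multi-valued field,
    producing the kind of flat document Solr expects."""
    return {
        "id": product["id"],
        "name": product["name"],
        # The one-to-many relation becomes a multi-valued field:
        "vendor_names": [v["name"] for v in vendor_rows],
    }

doc = flatten(
    {"id": 7, "name": "laptop"},
    [{"name": "Acme"}, {"name": "Globex"}],
)
print(doc)  # {'id': 7, 'name': 'laptop', 'vendor_names': ['Acme', 'Globex']}
```

The mapping breaks down when the related rows have several searchable fields of their own, because flattening loses which values belonged to the same row.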
Why use Solr instead of database-search
On the surface, it might seem that these benefits would relegate Lucene and Solr to niche uses, but that is not the case. Here's why:
Text search is not standardized

You probably already know that, in practice, SQL diverges into different dialects across database vendors. Text-search features differ especially from vendor to vendor because they are not governed by SQL or any other standard. Even if SQL standardized the query syntax, it wouldn't be enough, because it wouldn't govern tokenization, stop words, and various other configuration aspects of search. Consider also that databases are becoming increasingly commoditized and standardized, such that more and more applications support many databases; Object Relational Mapping (ORM) frameworks like Hibernate and ActiveRecord help make this possible. It's an easier proposition to add Lucene or Solr onto such database-agnostic systems than to propose that the system contort itself to each vendor's text-index features, perhaps in a least-common-denominator way.
Text search is Solr's focus

This is the main reason to avoid database search: to a database, text search simply isn't as important as everything else a database does, whereas search is all Solr is. You can count on your database for the basics, fast term search with some basic query types and relevancy-ranked results, but you probably have very little control over text analysis or the extensive advanced query features already mentioned. Furthermore, Solr has a variety of extensibility points to customize it to your needs, and it is open source if you'd prefer to tweak existing functionality that is close to what you want. This happens frequently in the Solr community; it is not just a theoretical point. Conversely, modifying one's open-source database to suit a project's needs is practically unheard of (well, rare, and certainly requiring more effort than modifying Solr). And if you're a Java programmer and your database is not Java-based, Solr is all the more approachable to modify.
Faceting

Faceting is a headline feature in Solr because it is a powerful method of navigating search results and because it is not a widely available capability, especially in open source. It may not seem hard at first, but it turns out to be very difficult to derive the right SQL queries to generate facets from a database. And if you're an SQL pro and figure it out, it will be an anticlimactic accomplishment when, in all likelihood, you discover that your queries run slowly, especially for sizable data sets. Comparatively, Solr includes highly optimized code both to filter on chosen facets and to generate facets with their counts. By the way, this is a capability of Solr, not Lucene, making it a reason to use Solr instead of other Lucene-based solutions (e.g. Hibernate Search, Compass, ...).
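Conceptually, generating a facet is just counting every value of a field across the matching results, as this hypothetical sketch shows; Solr's real implementation is heavily optimized with cached field data rather than naive per-query counting:

```python
from collections import Counter

def facet_counts(docs, field):
    """Count every value of `field` across the result set,
    most frequent first: exactly what a facet display shows."""
    counts = Counter()
    for doc in docs:
        for value in doc.get(field, []):
            counts[value] += 1
    return counts.most_common()

results = [
    {"id": 1, "vendor": ["Acme"]},
    {"id": 2, "vendor": ["Acme", "Globex"]},
    {"id": 3, "vendor": ["Globex"]},
]
print(facet_counts(results, "vendor"))  # [('Acme', 2), ('Globex', 2)]
```

The hard part in SQL is doing this for every facet field over the filtered result set in one round trip, which is what Solr's facet parameters give you for free.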
Scalability and non-database content

If you are getting vast amounts of text from different sources (not databases under your direct control) and you want to do text search, a database would simply be inappropriate. With Solr, you can choose to index the data without storing it, saving perhaps 75% of the storage a database would require. Solr is also optimized for fast, very fast, queries (indexing is no slouch either), and it is fairly straightforward to scale out to more servers, whereas your database is built for on-line transaction processing (OLTP) or data warehousing.
Should you use Solr instead of a Database?
You might consider this a radical idea depending on your application, but consider using Solr instead of your database. For most situations this is not a good idea, but for some it can be perfectly fine. The problem is that Solr is not a general-purpose data storage engine, ready to be queried in any way you might see fit. Relational databases are well suited to the vast majority of data storage needs and can be queried in arbitrary ways thanks to SQL; of course there are limits to this generalization, but it is basically true. Solr, by comparison, is purpose-built for search. However, if your needs are satisfied within the confines of Solr, then some of the arguments made for a database-only implementation hold for using Solr alone too: namely, you have a simpler architecture and you needn't worry about synchronizing data. The most ideal candidates for this are those where the data comes from other systems and can simply be re-gathered. Of course, this doesn't have to be an either-or decision: some of your data might reside in your database alone, while some could reside in Solr alone. Consider asking the Solr community at firstname.lastname@example.org for help making this decision.
Should You Use a Database (with Text Indexing) or Solr?
By now, you should have some idea of what both options offer. If a database-based index is of interest to you, it would be prudent to investigate the extent of your database's features beyond the basics; I had to make some broad generalizations here. The list of features offered by Solr in this article should give you a sense of what to look for in your database's text-search capability. If you are leaning towards just using your database, then ask yourself: is search a very important part of your application? Maybe it isn't now, but do you think it will be? To get great search, your database's text search just won't do; it doesn't have Solr's extensive feature set, nor under-the-hood access to tweak things from analysis to scoring, which you will want if search is indeed important to you. Of course, most applications have search, and I don't mean to ask whether to have it in the first place; I am making a point about priorities and focus. If search is not a very important feature of your application and the cost-benefit doesn't seem worth it, then you've made up your mind. There is no one-size-fits-all solution to text indexing.
About the Author
Born to code, David Smiley is a senior software engineer, book author, conference speaker, and instructor. He has 12 years of experience in the defense industry at MITRE, specializing in Java and web technologies. David is the principal author of "Solr 1.4 Enterprise Search Server", the first book on Solr, published by Packt in 2009. He also developed and taught a two-day course on Solr for MITRE. David plays a lead technical role in a large-scale Solr project in which he has implemented geospatial search based on geohash prefixes, wildcard ngram query parsing, searching multiple multi-valued fields at coordinated positions, part-of-speech search using Lucene payloads, and other things. David consults as a Solr expert on numerous projects for MITRE and its government sponsors. He has contributed code to Lucene and Solr and is active in the open-source community. David first used Lucene back in 2000, and has since used Hibernate Search and Compass as well. He has also used the competing commercial Endeca product, but hopes never to use it again.