You're reading from Mastering Apache Solr 7.x
Nowadays, the search engine plays a central role in any search application. End users expect accurate, efficient, and fast results, and the job of a search engine is to fulfill that requirement easily and quickly. To achieve the expected level of search accuracy, Solr executes multiple processes sequentially behind the scenes: it examines the input string, normalizes the text, generates the token stream, builds the indexes, and so on. The set of all of these processes is called text analysis. Let's explore text analysis in detail.
Text analysis is a Solr mechanism that takes place in two phases:
- During index time, it optimizes the input terms, feeds the information, generates the token stream, and builds the indexes
- During query time, it optimizes the query terms, generates the token stream, matches it against the terms generated at index time, and provides results
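The two phases can be configured independently on a field type by declaring separate index-time and query-time analyzers. Here is a minimal sketch; the field type name is illustrative, while the tokenizer and filter factories are standard Solr classes:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <!-- Index-time chain: tokenize, lowercase, then stem -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- Query-time chain: same steps, plus synonym expansion on the query terms -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because both chains share the same tokenizer and stemmer, a query term and an indexed term that differ only in case or inflection still reduce to the same token and match.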
Let’s dive deeper and understand:
- How exactly Solr works to build...
We have seen an overview of text analysis. Now let's dive deeper and understand the core processes running behind the scenes of analysis. As we have seen previously, the analyzer, the tokenizer, and the filter are the three main components Solr uses for text analysis. Let's explore the analyzer first.
An analyzer examines the text of fields and generates a token stream. Normally, only fields of type solr.TextField will specify an analyzer. An analyzer is defined as a child element of the <fieldType> element in the managed-schema.xml file. Here is a simple analyzer configuration:

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
```
Here, we have defined a single <analyzer> element. This is the simplest way to define an analyzer. We've already seen the positionIncrementGap attribute, which adds a positional gap between the values of a multivalued...
We have previously seen that an analyzer may be a single class or a set of defined tokenizer and filter classes.
The analyzer executes the analysis process in two steps:
- Tokenization (parsing): Using configured tokenizer classes
- Filtering (transformation): Using configured filter classes
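The two steps map directly onto the order of child elements inside <analyzer>: exactly one tokenizer, followed by zero or more filters, applied in the order they are listed. A sketch using standard Solr factories (the stopwords file path is illustrative):

```xml
<analyzer>
  <!-- Step 1: tokenization -- split the raw text into tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Step 2: filtering -- each filter consumes the previous filter's token stream, in listed order -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
```

Listing the LowerCaseFilterFactory before the StopFilterFactory matters: it ensures stop words are compared case-insensitively against the already-lowercased tokens.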
We can also preprocess the character stream before tokenization with the help of CharFilters (we will see this later in the chapter). An analyzer knows the field it is configured for, but a tokenizer has no idea about the field. The job of the tokenizer is only to read from a character stream, apply a tokenization mechanism based on its behavior, and produce a new token stream.
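As a preview, a CharFilter is declared before the tokenizer and operates on raw characters rather than tokens. A common sketch is stripping HTML markup before tokenization, using factories that ship with Solr:

```xml
<analyzer>
  <!-- CharFilter runs first, on the raw character stream, before tokenization -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```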
We have seen that the analyzer chains tokenizer and filter classes together to transform the input string into a token stream, which Solr uses for indexing. The job of the filter differs from that of the tokenizer: the tokenizer mostly splits the input string at delimiters and generates a token stream, while the filter transforms that stream and emits a new token stream. The input to a filter is therefore a token stream, not a raw input string as at tokenization time. The entire token stream generated through tokenization is passed to the first filter class in the list, whose output feeds the next filter, and so on. Let's cover filters in detail.
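To make the hand-off concrete, here is a sketch of how the input The Quick Foxes would typically flow through a chain of StandardTokenizerFactory, LowerCaseFilterFactory, and PorterStemFilterFactory (actual output always depends on the exact configured classes):

```
Input string : "The Quick Foxes"
Tokenizer    : [The] [Quick] [Foxes]
LowerCase    : [the] [quick] [foxes]
PorterStem   : [the] [quick] [fox]
```

Each row is the token stream consumed by the next stage; only the final stream is written to the index.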
So far, we have concentrated on Solr text analysis (analyzers, tokenizers, and filters) irrespective of any language. Solr supports multi-language search, a feature that places it among the leading search engines. Let's understand how Solr handles searches across multiple languages.
So far, all the examples we have covered are in English. The tokenization and filtering rules for English are simple and straightforward: splitting at whitespace or other delimiters, stemming, and so on. But once we start focusing on other languages, these rules may differ. Solr is already prepared to meet such analysis requirements with stemmers, synonym filters, stop word filters, character normalization, query correction capabilities, language identifiers, and so on. Some languages require their own tokenizers because of the complexity of parsing the language, some require their own stemming filters, and some require multiple filters as per the language...
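For example, a German field type typically swaps in German-specific filters. A sketch using factories that ship with Solr (the stopwords file path follows the convention used in Solr's sample configsets):

```xml
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt" ignoreCase="true"/>
    <!-- Folds umlauts and the sharp s per German orthography before stemming -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The tokenizer stays generic here; only the stop word list, normalization, and stemming are language-specific. Languages such as Japanese or Chinese would instead need their own tokenizers.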
Phonetic matching algorithms are used to match different spellings that are pronounced similarly by encoding them. Some examples are Sandeep and Sandip; Taylor, Tailer, and Tailor; and so on. Solr provides several filters for phonetic matching.
Beider-Morse Phonetic Matching (BMPM) helps you search for personal names or surnames. It is a more intelligent algorithm than Soundex, Metaphone, Caverphone, and so on. Its purpose is to match names that are phonetically equivalent to the expected name. BMPM does not split spellings and does not generate false hits; it extracts only names that are phonetically equivalent.
It executes these steps to extract names that are phonetically equivalent:
- Determines the language from the spelling of the name
- Applies the phonetic rules for that language to translate the name into a phonetic alphabet
- In the case of a language not identified from the name, it applies generic phonetics...
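In Solr, BMPM is applied through the BeiderMorseFilterFactory. A sketch of a field type using it (the field type name is illustrative; the nameType, ruleType, concat, and languageSet attributes are the ones the filter factory accepts):

```xml
<fieldType name="text_bm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- nameType: GENERIC, ASHKENAZI, or SEPHARDIC; ruleType: APPROX or EXACT;
         languageSet="auto" lets the filter infer the language from the spelling -->
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX"
            concat="true" languageSet="auto"/>
  </analyzer>
</fieldType>
```

Applying the same chain at both index and query time means Sandeep and Sandip are encoded to the same phonetic tokens and therefore match each other.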
In this chapter, we saw an overview of text analysis, analyzers, tokenizers, and filters, and how to configure an analyzer along with tokenizers and filters. We also saw the implementation approach for putting tokenizers and filters together. Then we moved on to multi-language search. Here, we explored how Solr determines a language, two approaches to multi-language search (separate fields per language and separate indexes per language), and the pros and cons of each. Finally, we understood Solr's phonetic matching mechanics using the BMPM algorithm.
In the next chapter, we will see how to index documents using client APIs, upload data using index handlers, upload data using Apache Tika with Solr Cell, and detect languages while indexing.