
You're reading from ElasticSearch Cookbook

Product type: Book
Published in: Dec 2013
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781782166627
Edition: 1st Edition
Author (1)
Alberto Paro

Alberto Paro is an engineer, manager, and software developer. He currently works as technology architecture delivery associate director of the Accenture Cloud First data and AI team in Italy. He loves to study emerging solutions and applications, mainly related to cloud and big data processing, NoSQL, Natural language processing (NLP), software development, and machine learning. In 2000, he graduated in computer science engineering from Politecnico di Milano. Then, he worked with many companies, mainly using Scala/Java and Python on knowledge management solutions and advanced data mining products, using state-of-the-art big data software. A lot of his time is spent teaching how to effectively use big data solutions, NoSQL data stores, and related technologies.

Chapter 5. Search, Queries, and Filters

In this chapter, we will cover the following topics:

  • Executing a search

  • Sorting a search

  • Highlighting results

  • Executing a scan query

  • Suggesting a correct query

  • Counting

  • Deleting by query

  • Matching all the documents

  • Querying/filtering for term

  • Querying/filtering for terms

  • Using a prefix query/filter

  • Using a Boolean query/filter

  • Using a range query/filter

  • Using span queries

  • Using the match query

  • Using the IDS query/filter

  • Using the has_child query

  • Using the top_children query

  • Using the has_parent query/filter

  • Using a regexp query/filter

  • Using exists and missing filters

  • Using and/or/not filters

  • Using the geo_bounding_box filter

  • Using the geo_polygon filter

  • Using the geo_distance filter

Introduction


With the mappings set and the data inserted in the indices, we can now enjoy searching.

In this chapter, we will cover the different types of search queries and filters, how to validate queries, and how to return highlights and limit the returned fields. This chapter is the core of the book; in it, the reader will learn the difference between a query and a filter, and how to improve both the quality and the speed of searches. ElasticSearch provides a rich DSL that covers all common needs, from the standard term query to complex GeoShape filtering.

This chapter is divided into two parts: the first shows some search-related API calls; the second goes deeper into the query DSL.

To prepare a good base for searching, the online code includes scripts to set up the indices and data used in the following recipes.

Highlighting results


ElasticSearch does a good job of finding results even in large text documents. While this makes it very useful for searching very large blocks of text, to improve the user experience it is sometimes necessary to show an abstract: a small portion of the text that matched the query. The highlight functionality in ElasticSearch is designed for this job.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For searching and highlighting the results, we need to perform the following steps:

  1. From the command line, we can execute a search with a highlight section as follows:

    curl -XGET 'http://127.0.0.1:9200/test-index/_search?from=0&size=10' -d '{
        "query": {"query_string": {"query": "joe"}},
        "highlight": {
            "pre_tags": ["<b>"],
            "post_tags": ["</b>"],
            "fields": {
                "parsedtext": {"order": "score"},
                "name": {"order": "score"}
            }
        }
    }'
  2. If everything is all right, the command will...

Executing a scan query


Every time a query is executed, the results are calculated and returned to the user. In ElasticSearch there is no standard order for records, so paginating over a large block of values can produce inconsistent results as documents are added and deleted. The scan query addresses this kind of problem by providing a special cursor that allows all the documents to be iterated exactly once. It is often used to back up documents or to reindex them.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For executing a scan query, we need to perform the following steps:

  1. From the command line, we can execute a search of the scan type as follows:

    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?search_type=scan&scroll=10m&size=50' -d '{"query":{"match_all":{}}}'
    
  2. If everything is all right, the command will return the following result:

    {
      "_scroll_id" : "c2Nhbjs1OzQ1Mzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1Njp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1Nzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1NDp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1NTp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzE7dG90YWxfaGl0czozOw...

Suggesting a correct query


It's very common for users to make a typing error or to need help completing the words they are typing. ElasticSearch addresses this scenario with the suggest functionality.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For suggesting a correct term by query, we need to perform the following steps:

  1. From the command line, we can execute the following suggest call:

    curl -XGET 'http://127.0.0.1:9200/test-index/_suggest' -d ' {
      "suggest1" : {
        "text" : "we find tester",
        "term" : {
          "field" : "parsedtext"
        }
      }
    }'
    
  2. The result returned by ElasticSearch, if everything is all right, should be as follows:

    {
        "_shards": {
            "failed": 0,
            "successful": 5,
            "total": 5
        },
        "suggest1": [
            {
                "length": 2,
                "offset": 0,
                "options": [],
                "text": "we"
            },
            {
                "length": 4,
               ...

Counting


It is often required to return only the count of the matched results and not the results themselves.

There are many scenarios that involve counting; some of them are as follows:

  • To return a number (for example, how many posts for a blog, how many comments for a post)

  • Validating whether some items are available: Are there posts? Are there comments?

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For executing a counting query, we need to perform the following steps:

  1. From the command line, we will execute the following count query:

    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_count' -d '{"match_all":{}}'
    
  2. The result returned by ElasticSearch, if everything is all right, should be as follows:

    {
      "count" : 3,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      }
    }

The result is composed of the count value (a long) and the shard status at the time of the query.

How it works...

Deleting by query


In the previous chapter, we saw how to delete a document. Deleting a document is very fast, but it requires knowing the document ID.

ElasticSearch provides a call to delete all the documents that match a query.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For executing a delete by query, we need to perform the following steps:

  1. From the command line, we need to execute the following query:

    curl -XDELETE 'http://127.0.0.1:9200/test-index/test-type/_query' -d '{"match_all":{}}'
    
  2. The result returned by ElasticSearch, if everything is all right, should be as follows:

    {
      "ok" : true,
      "_indices" : {
        "test-index" : {
          "_shards" : {
            "total" : 5,
            "successful" : 5,
            "failed" : 0
          }
        }
      }
    }
  3. The result is composed of the ok field (a Boolean) and the shard status at the time of the delete by query.

How it works...

The query is interpreted as it is done for searching...

Matching all the documents


One of the most commonly used queries, usually in conjunction with a filter, is the match_all query. This kind of query returns all the documents.
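Because match_all is so often paired with a filter, the body usually ends up nested inside a filtered query. The following Python sketch builds both forms of the body (the helper name and the uuid filter value are illustrative, not from the book):

```python
import json

def match_all_body(filter_clause=None):
    """Build a match_all body, optionally wrapped in a filtered query."""
    if filter_clause is None:
        return {"query": {"match_all": {}}}
    return {
        "query": {
            "filtered": {
                "filter": filter_clause,       # restricts the documents, no scoring
                "query": {"match_all": {}},    # scores every remaining document as 1.0
            }
        }
    }

body = match_all_body({"term": {"uuid": "11111"}})
print(json.dumps(body, indent=2))
```

The resulting JSON is what you would pass as the `-d` payload of the curl call shown in the next step.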

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For executing the match_all query, we need to perform the following steps:

  1. From the command line, we execute the following query:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{"query":{"match_all":{}}}'
    
  2. The result returned by ElasticSearch, if everything is all right, should be as follows:

    {
      "took" : 52,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 3,
        "max_score" : 1.0,
        "hits" : [ {
          "_index" : "test-index",
          "_type" : "test-type",
          "_id" : "1",
          "_score" : 1.0, "_source" : {"position": 1, "parsedtext": "Joe Testere nice guy", "name": "Joe Tester", "uuid": "11111...

Querying/filtering for term


Searching or filtering for a particular term is a very frequent operation. The term query and filter work with exact values and are generally very fast.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For executing a term query/filter, we need to perform the following steps:

  1. We execute a term query from the command line as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "term": {
                "uuid": "33333"
            }
        }
    }'
  2. The result returned by ElasticSearch, if everything is all right, should be as follows:

    {
      "took" : 58,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 0.30685282,
        "hits" : [ {
          "_index" : "test-index",
          "_type" : "test-type",
          "_id" : "3",
          "_score" : 0.30685282, "_source" : {"position": 3, "parsedtext...

Querying/filtering for terms


The previous type of search works very well for a single-term search. If you want to achieve a multiterm search, you can proceed in two ways: by using an and/or filter, or by using the terms query.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For executing a terms query/filter, we need to perform the following steps:

  1. We execute a terms query from the command line as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "terms": {
                "uuid": ["33333", "32222"]
            }
        }
    }'

    The result returned by ElasticSearch is the same as in the previous recipe.

  2. If you want to use the terms query in a filter, the query should be as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "filtered": {
                "filter": {
                    "terms": {
                        "uuid": ["33333", "32222...

Using a prefix query/filter


The prefix query/filter is used only when the starting part of a term is known. It allows completing truncated or partial terms.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For executing a prefix query/filter, we need to perform the following steps:

  1. We execute a prefix query from the command line as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "prefix": {
                "uuid": "333"
            }
        }
    }'
  2. The result returned by ElasticSearch is the same as in the previous recipe.

  3. If you want to use the prefix query in a filter, the query should be as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "filtered": {
                "filter": {
                    "prefix": {
                        "uuid": "333"
                    }
                },
                "query": {
                    "match_all": {...

Using a Boolean query/filter


Anyone who has used a search engine has probably, at some point, used the minus (-) and plus (+) syntax to include or exclude query terms. The Boolean query/filter allows you to programmatically define queries whose clauses must be included, must be excluded, or may optionally be included (should).

This kind of query/filter is one of the most important ones, because it allows you to aggregate many of the simple queries/filters covered in this chapter into a single, complex one.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in online code.

How to do it...

For executing a Boolean query/filter, we need to perform the following steps:

  1. We execute a Boolean query from the command line as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "bool" : {
            "must" : {
                "term" : { "parsedtext" : "joe" }
            },
            "must_not" : {
                "range" : {
                    "position...

Using a range query/filter


Searching/filtering by range is a very common scenario in a real world application. Some standard cases are as follows:

  • Filtering by a numeric range (for example, price, size, age, and so on)

  • Filtering by date (for example, events on 03/07/12 can be expressed as a range query from 03/07/12 00:00:00 to 03/07/12 23:59:59)

  • Filtering by term (for example, from A to D)

Getting ready

You need a working ElasticSearch cluster, an index "test" (refer to the next chapter to learn how to create an index), and basic knowledge of JSON.

How to do it...

For executing a range query/filter, we need to perform the following step:

  1. Consider the sample data from the previous examples, which contains an integer field position. Using it to execute a query that filters positions between 3 and 5, we will have:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "filtered": {
                "filter": {
                    "range" : {
                      "position" : { 
                  ...
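The range body above is truncated; a complete sketch of the filter for positions between 3 and 5 follows (built in Python for clarity; the from/to bounds with `include_lower`/`include_upper` follow the 0.90-era DSL, and whether the book's example includes the bounds is an assumption):

```python
import json

body = {
    "query": {
        "filtered": {
            "filter": {
                "range": {
                    "position": {
                        "from": 3,
                        "to": 5,
                        "include_lower": True,   # position >= 3
                        "include_upper": True,   # position <= 5
                    }
                }
            },
            "query": {"match_all": {}},
        }
    }
}
print(json.dumps(body, indent=2))
```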

Using span queries


The big difference between standard systems (SQL, but also many NoSQL technologies such as MongoDB, Riak, or CouchDB) and ElasticSearch is the number of facilities available for expressing text queries.

The span query family is a group of queries that control a sequence of text tokens. They allow you to define the following kinds of queries:

  • Exact phrase query

  • Exact fragment query (that is, Take off, give up)

  • Partial exact phrase with a slop parameter (other tokens between the searched terms, that is, "the man" with slop 2 can also match "the strong man", "the old wise man", and so on)

Getting ready

You need a working ElasticSearch cluster.

How to do it...

For executing span queries, we need to perform the following steps:

  1. The main element in span queries is the span_term parameter, whose usage is similar to that of the term query.

    One or more span_term parameters can be aggregated to formulate a span query.

    The span_first query defines a query in which the span_term parameter in the first token or near...
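To illustrate how span_term clauses compose, here is a hedged Python sketch of two such bodies: a span_first query (the term must occur within the first n tokens) and a span_near query (terms within a given slop of each other). The field names reuse the sample data; the exact values are illustrative:

```python
import json

# span_first: "joe" must appear within the first 3 tokens of parsedtext
span_first = {
    "query": {
        "span_first": {
            "match": {"span_term": {"parsedtext": "joe"}},
            "end": 3,
        }
    }
}

# span_near: "nice" and "guy" at most 1 token apart, in order
span_near = {
    "query": {
        "span_near": {
            "clauses": [
                {"span_term": {"parsedtext": "nice"}},
                {"span_term": {"parsedtext": "guy"}},
            ],
            "slop": 1,
            "in_order": True,
        }
    }
}
print(json.dumps(span_first), json.dumps(span_near))
```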

Using the match query


ElasticSearch provides a helper to build complex span queries, which depends on simple preconfigured settings. This helper is called the match query.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

For executing a match query, we need to perform the following steps:

  1. The standard usage of a match query simply requires the field name and the query text. For example:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "match" : {
                "parsedtext" : "nice guy",
                "operator": "and"
            }
        }
    }'
  2. If you need to execute the same query as a phrase query, the type changes from match to match_phrase, as given in the following code:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "match_phrase" : {
                "parsedtext" : "nice guy"
            }
        }
    }'
  3. An extension of the previous query used in text completion or the "search as you type" functionality...
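The step above is cut off in this excerpt; the "search as you type" variant it refers to is match_phrase_prefix, which treats the last term of the query text as a prefix. A hedged sketch of such a body (the query text and `max_expansions` value are illustrative assumptions):

```python
import json

body = {
    "query": {
        "match_phrase_prefix": {
            "parsedtext": {
                "query": "nice gu",      # the last token "gu" is matched as a prefix
                "max_expansions": 10,    # caps how many terms the prefix may expand to
            }
        }
    }
}
print(json.dumps(body, indent=2))
```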

Using the IDS query/filter


The IDs query and filter allow matching documents by their IDs.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

For executing IDS queries/filters, we need to perform the following steps:

  1. The ids query for fetching IDs 1, 2, 3 of the test-type type is in the following form:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "ids" : {
                "type" : "test-type",
                "values" : ["1", "2", "3"]
            }
        }
    }'
  2. The same query can be converted in a filter query similar to the following one:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "filtered": {
                "filter": {
                    "ids" : {
                        "type" : "test-type",
                        "values" : ["1", "2", "3"]
                    }
                },
                "query": {
                    "match_all": {}
                }
            }
        }
    }'

How it works...

Query...

Using the has_child query/filter


ElasticSearch does not only handle simple documents; it also lets you define a hierarchy based on parents and children. The has_child query allows you to query for parent documents whose children match a given query.

Getting ready

You need a working ElasticSearch cluster and the data populated with the populate script.

How to do it...

For executing the has_child query/filter, we need to perform the following steps:

  1. We need to search for the test-type parents whose test-type2 children have the term value1 in the value field. We can create this kind of query as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "has_child" : {
                "type" : "test-type2",
                "query" : {
                    "term" : {
                        "value" : "value1"
                    }
                }
            }
        }
    }'
  2. If scoring is not important, it's better to reformulate the query as a...
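That truncated reformulation would wrap has_child in a filtered query, along the lines of the following Python sketch (the 0.90-era has_child filter takes the same type/query parameters as the query form; the structure mirrors the filter examples elsewhere in this chapter):

```python
import json

body = {
    "query": {
        "filtered": {
            "filter": {
                "has_child": {
                    "type": "test-type2",
                    "query": {"term": {"value": "value1"}},
                }
            },
            "query": {"match_all": {}},
        }
    }
}
print(json.dumps(body, indent=2))
```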

Using the top_children query


In the previous recipe, we saw that the has_child query consumes a large amount of memory, because it needs to fetch the IDs of all the children. To bypass this limitation in big data contexts, the top_children query allows fetching only the top children results.

Getting ready

You need a working ElasticSearch cluster and the data populated with the populate script.

How to do it...

For executing the top_children query, we need to perform the following steps:

  1. We need to search for the test-type parents whose top test-type2 children have the term value1 in the value field. We can create this kind of query as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
                "top_children" : {
                    "type" : "test-type2",
                    "query" : {
                        "term" : {
                            "value" : "value1"
                        }
                    },
                    "score" : "max",
                    "factor" : 5,
        ...

Using the has_parent query/filter


In the previous recipes, we saw the has_child query. ElasticSearch also provides the has_parent query, to search for child documents based on a query against their parent.

Getting ready

You need a working ElasticSearch cluster and the data populated with the populate script.

How to do it...

For executing the has_parent query/filter, we need to perform the following steps:

  1. We want to search for the test-type2 children whose test-type parents have the term joe in the parsedtext field. We can create this kind of query as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type2/_search' -d '{
        "query": {
            "has_parent" : {
                "type" : "test-type",
                "query" : {
                    "term" : {
                        "parsedtext" : "joe"
                    }
                }
            }
        }
    }'
  2. If scoring is not important, it's better to reformulate the query as a filter in the following way:

    curl -XPOST 'http://127.0.0.1:9200/test...
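The filter reformulation in step 2 is cut off; it would follow the same filtered-query pattern as the other recipes. A hedged Python sketch of the body (the has_parent filter here mirrors the query form's type/query parameters, per the 0.90-era DSL):

```python
import json

body = {
    "query": {
        "filtered": {
            "filter": {
                "has_parent": {
                    "type": "test-type",
                    "query": {"term": {"parsedtext": "joe"}},
                }
            },
            "query": {"match_all": {}},
        }
    }
}
print(json.dumps(body, indent=2))
```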

Using a regexp query/filter


In the previous recipes, we have seen different term-level queries (term, fuzzy, and prefix); another powerful one is the regexp (regular expression) query.

Getting ready

You need a working ElasticSearch cluster and the data populated with the populate script.

How to do it...

For executing the regexp query/filter, we need to perform the following steps:

  1. We can execute a regexp term query from the command line as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "regexp": {
                "parsedtext": "j.*",
                "flags" : "INTERSECTION|COMPLEMENT|EMPTY"
            }
        }
    }'
  2. If scoring is not important, it's better to reformulate the query as a filter in the following way:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "filtered": {
                "filter": {
                    "regexp": {
                        "parsedtext": "j.*"
                    }
                },
                "query...

Using exists and missing filters


One of the main characteristics of ElasticSearch is its schemaless storage. Due to this schemaless nature, two kinds of filters are required: one to check whether a field exists in a document (the exists filter), and one to check whether it is missing (the missing filter).

Getting ready

You need a working ElasticSearch cluster and the data populated with the populate script.

How to do it...

For executing existing and missing filters, we need to perform the following steps:

  1. To search all the test-type documents that have a field called parsedtext, the query will be as follows:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "filtered": {
                "filter": {
                    "exists": {
                        "field":"parsedtext"
                    }
                },
                "query": {
                    "match_all": {}
                }
            }
        }
    }'
  2. To search all the test-type documents that do not have a field called parsedtext, the query will be as follows...
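The truncated step 2 uses the missing filter; its body is the mirror image of the exists filter shown above. A Python sketch (same field as above; note that in this era of ElasticSearch, fields mapped with a null_value can affect what counts as "missing"):

```python
import json

body = {
    "query": {
        "filtered": {
            "filter": {"missing": {"field": "parsedtext"}},
            "query": {"match_all": {}},
        }
    }
}
print(json.dumps(body, indent=2))
```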

Using and/or/not filters


While building complex queries, some typical Boolean operation filters are required, as they allow you to construct complex filter relations as in the traditional relational world.

No query DSL would be complete without the and, or, and not filters.

Getting ready

You need a working ElasticSearch cluster and the data populated with the populate script.

How to do it...

For executing and/or/not filters, we need to perform the following steps:

  1. Searching for documents with parsedtext equal to joe and uuid equal to 11111 is done by using the following code:

    curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{
        "query": {
            "filtered": {
                "filter": {
                    "and": [
                    {
                        "term": {
                            "parsedtext":"joe"    
                        }
                        
                    },
                    {
                        "term": {
                            "uuid":"11111"
                        }
                ...
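The and filter above is cut off in this excerpt; the or and not filters follow the same shape. A hedged Python sketch of both bodies (field values reuse the sample data):

```python
import json

# or: match documents where either term filter matches
or_body = {
    "query": {
        "filtered": {
            "filter": {
                "or": [
                    {"term": {"parsedtext": "joe"}},
                    {"term": {"uuid": "11111"}},
                ]
            },
            "query": {"match_all": {}},
        }
    }
}

# not: exclude documents matching the wrapped filter
not_body = {
    "query": {
        "filtered": {
            "filter": {"not": {"term": {"uuid": "11111"}}},
            "query": {"match_all": {}},
        }
    }
}
print(json.dumps(or_body), json.dumps(not_body))
```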

Using the geo_bounding_box filter


One of the most common operations in geolocalization is searching within a bounding box (a rectangular area).

Getting ready

You need a working ElasticSearch cluster and the data populated with the geo populate script.

How to do it...

A search to filter documents within a bounding box with corners at (40.03, 72.0) and (40.717, 70.99) can be done using a query similar to the following:

curl -XGET http://127.0.0.1:9200/test-mindex/_search -d '{
    "query": {
        "filtered": {
            "filter": {
                "geo_bounding_box": {
                    "pin.location": {
                        "bottom_right": {
                            "lat": 40.03,
                            "lon": 72.0
                        },
                        "top_left": {
                            "lat": 40.717,
                            "lon": 70.99
                        }
                    }
                }
            },
            "query": {
                "match_all": {}
   ...

Using the geo_polygon filter


The previous recipe, Using the geo_bounding_box filter, shows how to filter on a rectangular section, which is the most common case. ElasticSearch also provides a way to filter on user-defined polygonal shapes via the geo_polygon filter.

Getting ready

You need a working ElasticSearch cluster and the data populated with the geo populate script.

How to do it...

Searching for documents in which pin.location falls within a triangle (a shape made up of three GeoPoints) is done using a query similar to the following:

curl -XGET http://127.0.0.1:9200/test-mindex/_search -d '{
    "query": {
        "filtered": {
            "filter": {
                "geo_bounding_box": {
                    "pin.location": {
                        "points": [
                            {
                                "lat": 50,
                                "lon": -30
                            },
                            {
                                "lat": 30,
                          ...
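The triangle's point list above is cut off in this excerpt; a complete geo_polygon body would look like the following Python sketch (the three GeoPoints are illustrative placeholders, not necessarily the book's exact coordinates):

```python
import json

body = {
    "query": {
        "filtered": {
            "filter": {
                "geo_polygon": {
                    "pin.location": {
                        # the polygon's vertices; three points define a triangle
                        "points": [
                            {"lat": 50, "lon": -30},
                            {"lat": 30, "lon": -80},
                            {"lat": 80, "lon": -90},
                        ]
                    }
                }
            },
            "query": {"match_all": {}},
        }
    }
}
print(json.dumps(body, indent=2))
```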

Using the geo_distance filter


When you are working with geo locations, one of the most common tasks is filtering results based on their distance from a location. The geo_distance filter is used to achieve this goal.

Getting ready

You need a working ElasticSearch cluster and the data populated with the geo populate script.

How to do it...

Searching for documents in which pin.location is within 200 km of the point with lat 40 and lon 70 is done using a query similar to the following:

curl -XGET 'http://127.0.0.1:9200/test-mindex/_search' -d '{
    "query": {
        "filtered": {
            "filter": {
                "geo_distance": {
                    "pin.location": {
                        "lat": 40,
                        "lon": 70
                    },
                    "distance": "200km",
                    "optimize_bbox": "memory"
                }
            },
            "query": {
                "match_all": {}
            }
        }
    }
}'

How it works...

As we discussed in the...
