Mastering ElasticSearch


  • Learn about Apache Lucene and ElasticSearch design and architecture to fully understand how this great search engine works
  • Design, configure, and distribute your index, coupled with a deep understanding of the workings behind it
  • Learn about the advanced features in an easy-to-read book with detailed examples that will help you understand and use the sophisticated features of ElasticSearch

Book Details

Language : English
Paperback : 386 pages [ 235mm x 191mm ]
Release Date : October 2013
ISBN : 178328143X
ISBN 13 : 9781783281435
Author(s) : Rafał Kuć, Marek Rogoziński
Topics and Technologies : All Books, Big Data and Business Intelligence, Open Source

Table of Contents

Preface
Chapter 1: Introduction to ElasticSearch
Chapter 2: Power User Query DSL
Chapter 3: Low-level Index Control
Chapter 4: Index Distribution Architecture
Chapter 5: ElasticSearch Administration
Chapter 6: Fighting with Fire
Chapter 7: Improving the User Search Experience
Chapter 8: ElasticSearch Java APIs
Chapter 9: Developing ElasticSearch Plugins
Index
  • Chapter 1: Introduction to ElasticSearch
    • Introducing Apache Lucene
      • Getting familiar with Lucene
      • Overall architecture
      • Analyzing your data
        • Indexing and querying
      • Lucene query language
        • Understanding the basics
        • Querying fields
        • Term modifiers
        • Handling special characters
    • Introducing ElasticSearch
      • Basic concepts
        • Index
        • Document
        • Mapping
        • Type
        • Node
        • Cluster
        • Shard
        • Replica
        • Gateway
      • Key concepts behind ElasticSearch architecture
      • Working of ElasticSearch
        • The bootstrap process
        • Failure detection
        • Communicating with ElasticSearch
    • Summary
  • Chapter 2: Power User Query DSL
    • Default Apache Lucene scoring explained
      • When a document is matched
      • The TF/IDF scoring formula
        • The Lucene conceptual formula
        • The Lucene practical formula
      • The ElasticSearch point of view
    • Query rewrite explained
      • Prefix query as an example
      • Getting back to Apache Lucene
      • Query rewrite properties
    • Rescore
      • Understanding rescore
      • Example Data
      • Query
      • Structure of the rescore query
      • Rescore parameters
      • To sum up
    • Bulk Operations
      • MultiGet
      • MultiSearch
    • Sorting data
      • Sorting with multivalued fields
      • Sorting with multivalued geo fields
      • Sorting with nested objects
    • Update API
      • Simple field update
      • Conditional modifications using scripting
      • Creating and deleting documents using the Update API
    • Using filters to optimize your queries
      • Filters and caching
        • Not all filters are cached by default
        • Changing ElasticSearch caching behavior
        • Why bother naming the key for the cache?
        • When to change the ElasticSearch filter caching behavior
      • The terms lookup filter
        • How does it work?
        • Performance considerations
        • Loading terms from inner objects
        • Terms lookup filter cache settings
    • Filter and scopes in ElasticSearch faceting mechanism
      • Example data
      • Faceting and filtering
      • Filter as a part of the query
      • The Facet filter
      • Global scope
    • Summary
  • Chapter 3: Low-level Index Control
    • Altering Apache Lucene scoring
      • Available similarity models
      • Setting per-field similarity
    • Similarity model configuration
      • Choosing the default similarity model
      • Configuring the chosen similarity models
        • Configuring TF/IDF similarity
        • Configuring Okapi BM25 similarity
        • Configuring DFR similarity
        • Configuring IB similarity
    • Using codecs
      • Simple use cases
      • Let's see how it works
      • Available posting formats
      • Configuring the codec behavior
        • Default codec properties
        • Direct codec properties
        • Memory codec properties
        • Pulsing codec properties
        • Bloom filter-based codec properties
    • NRT, flush, refresh, and transaction log
      • Updating index and committing changes
        • Changing the default refresh time
      • The transaction log
        • The transaction log configuration
      • Near Real Time GET
    • Looking deeper into data handling
      • Input is not always analyzed
      • Example usage
      • Changing the analyzer during indexing
      • Changing the analyzer during searching
      • The pitfall and default analysis
    • Segment merging under control
      • Choosing the right merge policy
        • The tiered merge policy
        • The log byte size merge policy
        • The log doc merge policy
      • Merge policies configuration
        • The tiered merge policy
        • The log byte size merge policy
        • The log doc merge policy
      • Scheduling
        • The concurrent merge scheduler
        • The serial merge scheduler
        • Setting the desired merge scheduler
    • Summary
  • Chapter 4: Index Distribution Architecture
    • Choosing the right amount of shards and replicas
      • Sharding and over allocation
      • A positive example of over allocation
      • Multiple shards versus multiple indices
      • Replicas
    • Routing explained
      • Shards and data
      • Let's test routing
        • Indexing with routing
      • Indexing with routing
        • Querying
      • Aliases
      • Multiple routing values
    • Altering the default shard allocation behavior
      • Introducing ShardAllocator
      • The even_shard ShardAllocator
      • The balanced ShardAllocator
      • The custom ShardAllocator
      • Deciders
        • SameShardAllocationDecider
        • ShardsLimitAllocationDecider
        • FilterAllocationDecider
        • ReplicaAfterPrimaryActiveAllocationDecider
        • ClusterRebalanceAllocationDecider
        • ConcurrentRebalanceAllocationDecider
        • DisableAllocationDecider
        • AwarenessAllocationDecider
        • ThrottlingAllocationDecider
        • RebalanceOnlyWhenActiveAllocationDecider
        • DiskThresholdDecider
    • Adjusting shard allocation
      • Allocation awareness
        • Forcing allocation awareness
      • Filtering
        • But what those properties mean?
      • Runtime allocation updating
        • Index-level updates
        • Cluster-level updates
      • Defining total shards allowed per node
        • Inclusion
        • Requirements
        • Exclusion
      • Additional shard allocation properties
    • Query execution preference
      • Introducing the preference parameter
    • Using our knowledge
      • Assumptions
        • Data volume and queries specification
      • Configuration
        • Node-level configuration
        • Indices configuration
        • The directories layout
        • Gateway configuration
        • Recovery
        • Discovery
        • Logging slow queries
        • Logging garbage collector work
        • Memory setup
        • One more thing
      • Changes are coming
        • Reindexing
        • Routing
        • Multiple Indices
    • Summary
  • Chapter 5: ElasticSearch Administration
    • Choosing the right directory implementation – the store module
      • Store type
        • The simple file system store
        • The new IO filesystem store
        • The MMap filesystem store
        • The memory store
        • The default store type
    • Discovery configuration
      • Zen discovery
        • Multicast
        • Unicast
        • Minimum master nodes
        • Zen discovery fault detection
      • Amazon EC2 discovery
        • EC2 plugin's installation
        • Gateway and recovery configuration
        • Gateway recovery process
        • Configuration properties
        • Expectations on nodes
      • Local gateway
        • Backing up the local gateway
      • Recovery configuration
        • Cluster-level recovery configuration
        • Index-level recovery settings
    • Segments statistics
      • Introducing the segments API
        • The response
      • Visualizing segments information
    • Understanding ElasticSearch caching
      • The filter cache
        • Filter cache types
        • Index-level filter cache configuration
        • Node-level filter cache configuration
      • The field data cache
        • Index-level field data cache configuration
        • Node-level field data cache configuration
        • Filtering
      • Clearing the caches
        • Index, indices, and all caches clearing
        • Clearing specific caches
        • Clearing fields-related caches
    • Summary
  • Chapter 6: Fighting with Fire
    • Knowing the garbage collector
      • Java memory
        • The life cycle of Java object and garbage collections
      • Dealing with garbage collection problems
        • Turning on logging of garbage collection work
        • Using JStat
        • Creating memory dumps
        • More information on garbage collector work
        • Adjusting garbage collector work in ElasticSearch
      • Avoiding swapping on Unix-like systems
    • When it is too much for I/O – throttling explained
      • Controlling I/O throttling
      • Configuration
        • Throttling type
        • Maximum throughput per second
        • Node throttling defaults
        • Configuration example
    • Speeding up queries using warmers
      • Reason for using warmers
      • Manipulating warmers
        • Using the PUT Warmer API
        • Adding warmers during index creation
        • Adding warmers to templates
        • Retrieving warmers
        • Deleting warmers
        • Disabling warmers
      • Testing the warmers
        • Querying without warmers present
        • Querying with warmer present
    • Very hot threads
      • Hot Threads API usage clarification
      • Hot Threads API response
    • Real-life scenarios
      • Slower and slower performance
      • Heterogeneous environment and load imbalance
      • My server is under fire
    • Summary
  • Chapter 7: Improving the User Search Experience
    • Correcting user spelling mistakes
      • Test data
      • Getting into technical details
        • Suggesters
        • Using the _suggest REST endpoint
        • Including suggestions requests in a query
        • The term suggester
        • The phrase suggester
      • Completion suggester
        • The logic behind completion suggester
        • Using completion suggester
    • Improving query relevance
      • The data
      • The quest for improving relevance
        • The standard query
        • The Multi match query
        • Phrases comes into play
        • Let's throw the garbage away
        • And now we boost
        • Making a misspelling-proof search
        • Drill downs with faceting
    • Summary
  • Chapter 8: ElasticSearch Java APIs
    • Introducing the ElasticSearch Java API
    • The code
    • Connecting to your cluster
      • Becoming the ElasticSearch node
      • Using the transport connection method
      • Choosing the right connection method
    • Anatomy of the API
    • CRUD operations
      • Fetching documents
        • Handling errors
      • Indexing documents
      • Updating documents
      • Deleting documents
    • Querying ElasticSearch
      • Preparing a query
      • Building queries
        • Using the match all documents query
        • The match query
        • Using the geo shape query
      • Paging
      • Sorting
      • Filtering
      • Faceting
      • Highlighting
      • Suggestions
      • Counting
      • Scrolling
    • Performing multiple actions
      • Bulk
      • The delete by query
      • Multi GET
      • Multi Search
    • Percolator
      • ElasticSearch 1.0 and higher
    • The explain API
    • Building JSON queries and documents
    • The administration API
      • The cluster administration API
        • The cluster and indices health API
        • The cluster state API
        • The update settings API
        • The reroute API
        • The nodes information API
        • The node statistics API
        • The nodes hot threads API
        • The nodes shutdown API
        • The search shards API
      • The Indices administration API
        • The index existence API
        • The Type existence API
        • The indices stats API
        • Index status
        • Segments information API
        • Creating an index API
        • Deleting an index
        • Closing an index
        • Opening an index
        • The Refresh API
        • The Flush API
        • The Optimize API
        • The put mapping API
        • The delete mapping API
        • The gateway snapshot API
        • The aliases API
        • The get aliases API
        • The aliases exists API
        • The clear cache API
        • The update settings API
        • The analyze API
        • The put template API
        • The delete template API
        • The validate query API
        • The put warmer API
        • The delete warmer API
    • Summary
  • Chapter 9: Developing ElasticSearch Plugins
    • Creating the Apache Maven project structure
      • Understanding the basics
      • Structure of the Maven Java project
      • The idea of POM
      • Running the build process
      • Introducing the assembly Maven plugin
    • Creating a custom river plugin
      • Implementation details
        • Implementing the URLChecker class
        • Implementing the JSONRiver class
        • Implementing the JSONRiverModule class
        • Implementing the JSONRiverPlugin class
        • Informing ElasticSearch about the JSONRiver plugin class
      • Testing our river
        • Building our river
        • Installing our river
        • Initializing our river
        • Checking if our JSON river works
    • Creating custom analysis plugin
      • Implementation details
        • Implementing TokenFilter
        • Implementing the TokenFilter factory
        • Implementing custom analyzer
        • Implementing analyzer provider
        • Implementing analysis binder
        • Implementing analyzer indices component
        • Implementing analyzer module
        • Implementing analyzer plugin
        • Informing ElasticSearch about our custom analyzer
      • Testing our custom analysis plugin
        • Building our custom analysis plugin
        • Installing the custom analysis plugin
        • Checking if our analysis plugin works
    • Summary

                    Rafał Kuć

                    Rafał Kuć is a born team leader and software developer. He currently works as a consultant and a software engineer at Sematext Group, Inc., where he concentrates on open source technologies such as Apache Lucene and Solr, Elasticsearch, and the Hadoop stack. He has more than 12 years of experience in various branches of software, from banking software to e-commerce products. He focuses mainly on Java but is open to every tool and programming language that will make the achievement of his goal easier and faster. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people with the problems they face with Solr and Lucene. He has also been a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, and Lucene Revolution.

                    Rafał began his journey with Lucene in 2002, and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then, Solr came along and this was it. He started working with Elasticsearch in the middle of 2010. Currently, Lucene, Solr, Elasticsearch, and information retrieval are his main points of interest.

                    Rafał is also the author of Apache Solr 3.1 Cookbook, and the update to it, Apache Solr 4 Cookbook. Also, he is the author of the previous edition of this book and Mastering ElasticSearch. All these books have been published by Packt Publishing.


                    Marek Rogoziński

                    Marek Rogoziński is a software architect and consultant with more than 10 years of experience. He has specialized in solutions based on open source search engines such as Solr and Elasticsearch, and also the software stack for Big Data analytics including Hadoop, HBase, and Twitter Storm.

                    He is also the cofounder of the solr.pl site, which publishes information and tutorials about Solr and the Lucene library. He is also the co-author of some books published by Packt Publishing.

                    Currently, he holds the position of the Chief Technology Officer in a new company, designing architecture for a set of products that collect, process, and analyze large streams of input data.

                    Errata

                    - 2 submitted: last submission 19 May 2014

                    Errata type: Technical | PgNo: 322

                    It is: Plugin definition class, which extends the AbstractPlugin class from the
                    org.elasticsearch.plugins package; we will call it JSONRiverModule

                    It should be: Plugin definition class, which extends the AbstractPlugin class from the
                    org.elasticsearch.plugins package; we will call it JSONRiverPlugin

                    Errata type: Technical | PgNo: 327

                    It is: The AbstractRiverComponent class allows us to use ElasticSearch logging capabilities and settings without the need of having them initialized, and the River interface forces us to implement the start() and stop() methods, which are called when the river is starting (the start() method) and when it is being stopped (the stop() method).

                    It should be: The AbstractRiverComponent class allows us to use ElasticSearch logging capabilities and settings without the need of having them initialized, and the River interface forces us to implement the start() and close() methods, which are called when the river is starting (the start() method) and when it is being stopped (the close() method).

                    What you will learn from this book

                    • Understand how Apache Lucene works
                    • Use and configure different scoring models to alter the default scoring mechanism
                    • Exploit query rescore to recalculate the score of top N documents
                    • Choose the right number of shards and replicas for your deployment
                    • Use shard allocation wisely and understand its internals
                    • Alter the index format by using a different postings format
                    • Use your knowledge to create scalable, efficient, and fault-tolerant clusters
                    • Monitor your cluster by using and understanding the ElasticSearch API
                    • Learn to control segment merging and why ElasticSearch uses merging at all
                    • Overcome problems with garbage collection, threading, and I/O
                    • Improve the user search experience by using ElasticSearch functionality
                    • Develop an application using the ElasticSearch Java API and develop custom ElasticSearch plugins

                    In Detail

                    ElasticSearch is a fast, distributed, scalable search engine written in Java. It leverages Apache Lucene capabilities to provide a new level of control over how you index and search even the largest data sets.

                    "Mastering ElasticSearch" covers the intermediate and advanced functionality of ElasticSearch. It not only helps you understand how ElasticSearch works but also guides you through its internals, such as caches, the Apache Lucene library, monitoring capabilities, and the Java API. In addition, you'll see the practical usage of ElasticSearch configuration parameters and the monitoring API, along with easy-to-use examples that show how to extend ElasticSearch by writing your own plugins.

                    "Mastering ElasticSearch" starts by showing you how Apache Lucene works and what the ElasticSearch architecture looks like. It covers advanced querying capabilities, index configuration control, index distribution, and ElasticSearch administration and troubleshooting. Finally, you'll see how to improve the user's search experience, use the provided Java API, and develop your own custom plugins.

                    It will help you learn how Apache Lucene works in terms of both querying and indexing. You'll also learn how to use different scoring models, rescore documents using other queries, and alter how the index is written by using custom postings formats, as well as what segment merging is and how to configure it to your needs. You'll optimize your queries by modifying them to use filters, and you'll see why this is important. The book describes in detail how to use the shard allocation mechanisms present in ElasticSearch, such as forced awareness.
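
To make the idea of rescoring with a second query more concrete, here is a minimal, library-free Python sketch. It is not taken from the book and does not call ElasticSearch. The function name rescore_top_n, the document IDs, and all scores are hypothetical. The parameter names window_size, query_weight, and rescore_query_weight are borrowed from the ElasticSearch rescore feature, and the sketch assumes the two scores are simply added after weighting.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, score)


def rescore_top_n(
    hits: List[Hit],
    rescore_fn: Callable[[str], float],
    window_size: int = 2,
    query_weight: float = 1.0,
    rescore_query_weight: float = 1.0,
) -> List[Hit]:
    """Re-rank only the first `window_size` hits; leave the rest untouched."""
    # Order hits by the cheap first-pass score, best first.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]

    # Combine the original score with the (more expensive) secondary score,
    # but only for documents inside the window.
    rescored = [
        (doc_id, query_weight * score + rescore_query_weight * rescore_fn(doc_id))
        for doc_id, score in window
    ]
    rescored.sort(key=lambda h: h[1], reverse=True)

    # Documents outside the window keep their first-pass scores and order.
    return rescored + rest


# Hypothetical first-pass scores from a cheap query.
first_pass = [("doc-a", 3.0), ("doc-b", 2.5), ("doc-c", 2.0), ("doc-d", 0.5)]

# Hypothetical scores that a costlier secondary query would assign.
secondary: Dict[str, float] = {"doc-a": 0.1, "doc-b": 1.5, "doc-c": 9.0, "doc-d": 9.0}

result = rescore_top_n(first_pass, lambda d: secondary.get(d, 0.0), window_size=2)
print([doc_id for doc_id, _ in result])
```

Because only the top two hits fall inside the window, doc-c and doc-d are never passed to the secondary scorer, even though it would rate them highly. That is the trade-off of rescoring: the expensive query runs on a small, bounded set of documents.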

                    "Mastering ElasticSearch" will open your eyes to the practical use of the statistics and information APIs available at the index, node, and cluster levels, so you are not surprised by what ElasticSearch does while you are not looking. You'll also see how to troubleshoot by understanding how the Java garbage collector works, how to control I/O throttling, and how to see what threads are being executed at any given moment. If user spelling mistakes are making you lose sleep at night, don't worry anymore: the book will show you how to configure and use the ElasticSearch spell checker and improve the relevance of your queries. Last but not least, you'll see how to use the ElasticSearch Java API to access the ElasticSearch cluster from your JVM-based application, and you'll extend ElasticSearch by writing your own custom plugins.
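
As a rough illustration of the spell-checking support mentioned above, the following Python sketch builds the JSON body for a term suggester request. It only constructs and prints the body and makes no network call. The helper name term_suggest_body, the suggestion name "spelling", the field "title", and the misspelled input are all assumptions made for this example. The exact request shape and endpoint vary between ElasticSearch versions, so check it against the version you run.

```python
import json


def term_suggest_body(text: str, field: str, name: str = "spelling") -> dict:
    """Build a search request body that asks for term suggestions.

    `name` is an arbitrary label for the suggestion section in the response;
    `field` is the indexed field whose terms are used as correction candidates.
    """
    return {
        "suggest": {
            name: {
                "text": text,
                "term": {"field": field},
            }
        }
    }


# Hypothetical misspelled user input and field name.
body = term_suggest_body("elasticsaerch servr", "title")
print(json.dumps(body, indent=2))
```

The resulting body would typically be sent alongside (or instead of) a regular query, and the response would list candidate corrections for each input term under the chosen suggestion name.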

                    If you are looking for a book that will allow you to easily extend your basic knowledge of ElasticSearch, or you want to go deeper into the world of full-text search using ElasticSearch, then this book is for you.

                    Approach

                    A practical tutorial that covers the difficult design, implementation, and management of search solutions.

                    Who this book is for

                    Mastering ElasticSearch is aimed at intermediate users who want to extend their knowledge of ElasticSearch. The topics described in the book are detailed, but we assume that you already know the basics, such as the query DSL or data indexing. Advanced users will also find this book useful, as the examples go deep into the internals where needed.
