Overview of Elasticsearch 7

Welcome to Advanced Elasticsearch 7.0. Elasticsearch evolved quickly from version 1.0.0, released in February 2014, to version 6.0.0 GA, released in November 2017. In this book, we will use the 7.0.0 release as our base. Without making any assumptions about your knowledge of Elasticsearch, this opening chapter provides setup instructions for the Elasticsearch development environment. To help beginners try out some basic features within a few minutes, it walks through the steps to launch the new version of the Elasticsearch server. An architectural overview and some core concepts will help you understand the workflow within Elasticsearch and straighten your learning path.

Keep in mind that you can learn about potential benefits by reading the API conventions section and becoming familiar with it. The New features section that follows lists the new features you can explore in this release. Because major changes are often introduced between major versions, you must check whether an upgrade breaks compatibility and affects your application. Go through the Migration between versions section to find out how to minimize the impact on your upgrade project.

In this chapter, you'll learn about the following topics:

  • Preparing your environment
  • Running Elasticsearch
  • Talking to Elasticsearch
  • Elasticsearch architectural overview
  • Key concepts
  • API conventions
  • New features
  • Breaking changes
  • Migration between versions

Preparing your environment

The first step for a novice is to set up the Elasticsearch server, while an experienced user may only need to upgrade the server to the new version. If you are going to upgrade your server software, read through the Breaking changes and Migration between versions sections to discover the changes that require your attention.

Elasticsearch is developed in Java. As of the writing of this book, it is recommended that you use a specific Oracle JDK, version 1.8.0_131. By default, Elasticsearch will use the Java version defined by the JAVA_HOME environment variable. Before installing Elasticsearch, please check the installed Java version.

Elasticsearch is supported on many popular operating systems such as RHEL, Ubuntu, Windows, and Solaris. For information on supported operating systems and product compatibility, see the Elastic Support Matrix at https://www.elastic.co/support/matrix. The installation instructions for all the supported platforms can be found in the Installing Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/7.0/install-elasticsearch.html). Although there are many ways to properly install Elasticsearch on different operating systems, for novices it's simplest to run Elasticsearch from the command line. Please follow the instructions on the official download site (https://www.elastic.co/downloads/past-releases/elasticsearch-7-0-0). In this book, we'll use the Ubuntu 16.04 operating system to host the Elasticsearch service. For example, use the following command to check the Java version on Ubuntu 16.04:

java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

The following is a step-by-step guide to installing the 7.0.0 release from the official download site:

  1. Select the correct package for your operating system (WINDOWS, MACOS, LINUX, DEB, RPM, or MSI (BETA)) and download the 7.0.0 release. For Linux, the filename is elasticsearch-7.0.0-linux-x86_64.tar.gz.
  2. Extract the GNU zipped file into the target directory, which will generate a folder called elasticsearch-7.0.0 using the following command:
tar -zxvf elasticsearch-7.0.0-linux-x86_64.tar.gz
  3. Go to the folder and run Elasticsearch with the -p parameter to create a pid file at the specified path:
cd elasticsearch-7.0.0
./bin/elasticsearch -p pid

Elasticsearch runs in the foreground when started from the command line as above. If you want to shut it down, you can stop it by pressing Ctrl + C, or you can use the process ID from the pid file in the working directory to terminate the process:

kill -15 `cat pid`

Check the log file to make sure the process has closed. You will see text such as Native controller process has stopped, stopped, closing, and closed near the end of the file:

tail logs/elasticsearch.log

To run Elasticsearch as a daemon in background mode, specify -d on the command line:

./bin/elasticsearch -d -p pid

In the next section, we will show you how to run an Elasticsearch instance.

Running Elasticsearch

Elasticsearch does not start automatically after installation. On Windows, to start it automatically at boot time, you can install Elasticsearch as a service. On Ubuntu, it's best to use the Debian package, which installs everything you need to run Elasticsearch as a service. If you're interested, please refer to the official website (https://www.elastic.co/guide/en/elasticsearch/reference/master/deb.html).
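As a minimal sketch of the Debian package route (the package URL and service name follow the 7.0.0 conventions; verify them against the download page before use):

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.0.0-amd64.deb
sudo dpkg -i elasticsearch-7.0.0-amd64.deb
sudo systemctl daemon-reload
sudo systemctl enable elasticsearch.service
sudo systemctl start elasticsearch.service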

Basic Elasticsearch configuration

Elasticsearch 7.0 has several configuration files located in the config directory, shown as follows. Basically, it provides good defaults, and it requires very little configuration from developers:

ls config

The output will be similar to the following:

elasticsearch.keystore  elasticsearch.yml  jvm.options  log4j2.properties  role_mapping.yml  roles.yml  users  users_roles

Let's take a quick look at the elasticsearch.yml, jvm.options, and log4j2.properties files:

  • elasticsearch.yml: The main configuration file. This configuration file contains settings for the cluster, node, and paths. To specify an item, uncomment the corresponding line by removing the leading # character. We'll explain the terminology in the Elasticsearch architectural overview section:
# -------------------------- Cluster ---------------------------
# Use a descriptive name for your cluster:
#cluster.name: my-application
# -------------------------- Node ------------------------------
# Use a descriptive name for the node:
#node.name: node-1
# -------------------------- Network ---------------------------
# Set the bind address to a specific IP (IPv4 or IPv6):
#network.host: 192.168.0.1
# Set a custom port for HTTP:
#http.port: 9200
# --------------------------- Paths ----------------------------
# Path to directory where to store the data (separate multiple
# locations by comma):
#path.data: /path/to/data
# Path to log files:
#path.logs: /path/to/logs
  • jvm.options: Recalling that Elasticsearch is developed in Java, this file is the preferred place to set the JVM options, as shown in the following code block:
# IMPORTANT: JVM heap size
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms1g
-Xmx1g
You rarely need to change the Java Virtual Machine (JVM) options unless the Elasticsearch server is moved to production. These settings can be used to improve performance. When configuring heap memory, keep in mind that the Xmx setting should be at most 32 GB and no more than 50% of the available RAM.
  • log4j2.properties: Elasticsearch uses Log4j 2 for logging. The log file location is composed from three properties, ${sys:es.logs.base_path}, ${sys:es.logs.cluster_name}, and ${sys:es.logs.node_name}, in the log4j2.properties file, as shown in the following code block:
appender.rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}.log

For example, our installation directory is ~/elasticsearch-7.0.0. Since no base path is specified, the default value of ~/elasticsearch-7.0.0/logs is used. Since no cluster name is specified, the default value of elasticsearch is used. The log file location setting appender.rolling.fileName will therefore generate a log file named ~/elasticsearch-7.0.0/logs/elasticsearch.log.

Important system configuration

Elasticsearch has two working modes, development mode and production mode. You'll work in development mode with a fresh installation. If you reconfigure a setting such as network.host, it will switch to production mode. In production mode, a number of settings must be taken care of; you can check them in the Elasticsearch Reference at https://www.elastic.co/guide/en/elasticsearch/reference/master/system-config.html. We will discuss the file descriptors and virtual memory settings below:

  • File descriptors: Elasticsearch uses a large number of file descriptors. Running out of file descriptors can result in data loss. Use the ulimit command to set the maximum number of open files for the current session or in a runtime script file:
ulimit -n 65536

If you want to set the value permanently, add the following line to the /etc/security/limits.conf file:

elasticsearch - nofile 65536

Ubuntu ignores the limits.conf file for processes started by init.d. You can uncomment the following line (remove the leading #) to enable the ulimit feature:

# Sets up user limits according to /etc/security/limits.conf
# (Replaces the use of /etc/limits in old login)
#session required pam_limits.so
  • Virtual memory: By default, Elasticsearch uses the mmapfs directory to store its indices; however, the default operating system limit on mmap counts is low. If the setting is below the standard, increase the limit to 262144 or higher:
sudo sysctl -w vm.max_map_count=262144
sudo sysctl -p
cat /proc/sys/vm/max_map_count
262144

By default, the Elasticsearch security features are disabled for open source downloads or basic licensing. Since Elasticsearch binds to localhost only by default, it is safe to run the installed server as a local development server. The changed setting only takes effect after the Elasticsearch server instance has been restarted. In the next section, we will discuss several ways to communicate with Elasticsearch.

Talking to Elasticsearch

Many programming languages (including Java, Python, and .NET) have official clients written and supported by Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/client/index.html). However, by default, only two protocols are really supported, HTTP (via a RESTful API) and native. You can talk to Elasticsearch via one of the following ways:

  • Transport client: One of the native ways to connect to Elasticsearch.
  • Node client: Similar to the transport client. In most cases, if you're using Java, you should choose the transport client instead of the node client.
  • HTTP client: For most programming languages, HTTP is the most common way to connect to Elasticsearch.
  • Other protocols: It's possible to create a new client interface to Elasticsearch simply by writing a plugin.
Transport clients (that is, the Java API) are scheduled to be deprecated in Elasticsearch 7.0 and completely removed in 8.0. Java users should use the Java High Level REST Client instead.

You can communicate with Elasticsearch via the default port 9200 using the RESTful API. An example of using the curl command to communicate with Elasticsearch from the command line is shown in the following code block. Before running the command, make sure the installed Elasticsearch server is running. You should see the instance details and the cluster information in the response: the machine's hostname is wai, the default Elasticsearch cluster name is elasticsearch, the version of Elasticsearch that is running is 7.0.0, the downloaded Elasticsearch software is in TAR format, and the version of Lucene used is 8.0.0:

curl -XGET 'http://localhost:9200'
{
"name" : "wai",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "7-fjLIFkQrednHgFh0Ufxw",
"version" : {
"number" : "7.0.0",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "a30e8c2",
"build_date" : "2018-12-17T12:33:32.311168Z",
"build_snapshot" : false,
"lucene_version" : "8.0.0",
"minimum_wire_compatibility_version" : "6.6.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}

Using Postman to work with the Elasticsearch REST API

The Postman app is a handy tool for testing the REST API. In this book, we'll use Postman to illustrate the examples. The following are step-by-step instructions for installing Postman from the official download site (https://www.getpostman.com/apps):

  1. Select the package (Windows, macOS, or Linux) and download the appropriate 32-/64-bit version for your operating system. For the 64-bit Linux package, the filename is Postman-linux-x64-6.6.1.tar.gz.
  2. Extract the GNU zipped file into your target directory, which will generate a folder called Postman:
tar -zxvf Postman-linux-x64-6.6.1.tar.gz
  3. Go to the folder and run Postman, and you'll see a pop-up window:
cd Postman
./Postman
  4. In the pop-up window, use the same URL as in the previous curl command and press the Send button. You will get the same output as with the previous curl command.

In the next section, let's dive into the architectural overview of Elasticsearch.

Elasticsearch architectural overview

The story of how the ELK Stack (Elasticsearch, Logstash, and Kibana) became the Elastic Stack is a pretty long one (https://www.elastic.co/about/history-of-elasticsearch). At Elastic{ON} 2015 in San Francisco, Elasticsearch Inc. was renamed Elastic, and the next evolution, the Elastic Stack, was announced. Elasticsearch will still play an important role, no matter what happens.

Elastic Stack architecture

Elastic Stack is an end-to-end software stack for search and analysis solutions. It is designed to help users get data from any type of source in any format to allow for searching, analyzing, and visualizing data in real time. The full stack consists of the following:

  • Beats: A lightweight data shipper that can send data directly to Elasticsearch or via Logstash
  • APM server: Used for measuring and monitoring the performance of applications
  • Elasticsearch: A highly scalable full-text search and analytics engine
  • Elasticsearch-Hadoop: A two-way fast data mover between Apache Hadoop and Elasticsearch
  • Kibana: A tool for data exploration, visualization, and dashboarding
  • Logstash: A data-collection engine with real-time pipelining capabilities

Each individual product has its own purpose and features, as shown in the following diagram:

Elasticsearch architecture

Elasticsearch is a real-time distributed search and analytics engine with high availability. It is used for full-text search, structured search, analytics, or all three in combination. It is built on top of the Apache Lucene library. It is a schema-free, document-oriented data store. However, unless you fully understand your use case, the general recommendation is not to use it as the primary data store. One of the advantages is that the RESTful API uses JSON over HTTP, which allows you to integrate, manage, and query index data in a variety of ways.

An Elasticsearch cluster is a group of one or more Elasticsearch nodes that are connected together. Let's first outline how it is laid out, as shown in the following diagram:

Although each node has its own purpose and responsibility, each node can forward client requests (coordination) to the appropriate nodes. The following node types are used in an Elasticsearch cluster:

  • Master-eligible node: The master node is primarily responsible for lightweight cluster-wide operations, including creating or deleting an index, tracking the cluster nodes, and determining the location of the allocated shards. By default, the master-eligible role is enabled. A master-eligible node can be elected to become the master node (the node with the asterisk) by the master-election process. You can disable this role for a node by setting node.master to false in the elasticsearch.yml file.
  • Data node: A data node holds the shards that contain the indexed documents. It handles related operations such as CRUD, search, and aggregation. By default, the data node role is enabled, and you can disable this role for a node by setting node.data to false in the elasticsearch.yml file.
  • Ingest node: An ingest node pre-processes documents through an ingest pipeline before indexing them. By default, the ingest node role is enabled; you can disable this role for a node by setting node.ingest to false in the elasticsearch.yml file.
  • Coordinating-only node: If all three roles (master-eligible, data, and ingest) are disabled, the node will only act as a coordinating node, which routes requests, handles the search reduce phase, and distributes work for bulk indexing.

When you launch an instance of Elasticsearch, you actually launch an Elasticsearch node. In our installation, we are running a single node of Elasticsearch, so we have a cluster with one node. Let's retrieve the information for all nodes from our installed server using the Elasticsearch cluster nodes info API, as shown below:
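A command-line sketch of the same call follows; the response is abridged to the fields discussed next:

curl -XGET 'http://localhost:9200/_nodes?pretty'
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "V1P0a-tVR8afUqJW86Hnrw" : {
      "name" : "wai",
      "version" : "7.0.0",
      "roles" : [ "master", "data", "ingest" ],
      ...
    }
  }
}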

The cluster name is elasticsearch. The total number of nodes is 1. The node ID is V1P0a-tVR8afUqJW86Hnrw. The node name is wai. The wai node has three roles, which are master, data, and ingest. The Elasticsearch version running on the node is 7.0.0.

Between the Elasticsearch index and the Lucene index

The data in Elasticsearch is organized into indices. Each index is a logical namespace for organizing data, and the document is the basic unit of data in Elasticsearch. An inverted index is created by tokenizing the terms in the document, creating a sorted list of all unique terms, and associating each term with the list of documents where it can be found. An index consists of one or more shards, and a shard is a Lucene index that uses this inverted-index data structure to store data. Each shard can have zero or more replicas. Elasticsearch ensures that the primary and the replica of the same shard are never collocated on the same node, as shown in the following screenshot, where Data Node 1 contains primary shard 1 of Index 1 (I1P1), primary shard 2 of Index 2 (I2P2), replica shard 2 of Index 1 (I1R2), and replica shard 1 of Index 2 (I2R1).
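To make these terms concrete, the following is a minimal sketch (the index name index1 is hypothetical) that creates an index with two primary shards, each with one replica:

curl -XPUT 'http://localhost:9200/index1?pretty' -H 'Content-Type: application/json' -d '{
  "settings" : {
    "number_of_shards" : 2,
    "number_of_replicas" : 1
  }
}'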

A Lucene index consists of one or more immutable index segments, and a segment is a fully functional inverted index. Because segments are immutable, Lucene can incrementally add new documents to the index without rebuilding it. To keep the number of segments manageable, Elasticsearch merges small segments together into a larger segment, commits the new merged segment to disk, and eliminates the old smaller segments at the appropriate time. For each search request, all Lucene segments of a given shard of an Elasticsearch index will be searched. Let's examine the query process in a cluster, as shown in the following diagram:

In the next section, let's drill down into the key concepts.

Key concepts

In the previous section, we learned some core concepts such as clusters, nodes, shards, replicas, and so on. We will briefly introduce the other key concepts in this section. Then, we'll drill down into the details in subsequent chapters.

Mapping concepts across SQL and Elasticsearch

In the early stages of Elasticsearch, mapping types were a way to divide the documents in the same index into different logical groups, which meant that an index could have any number of types. In the past, an index in Elasticsearch was commonly compared to a database in SQL, and a mapping type to a table. According to the official Elastic website (https://www.elastic.co/guide/en/elasticsearch/reference/5.6/removal-of-types.html), the removal of mapping types was announced in the documentation of version 5.6. Later, in Elasticsearch 6.0.0, indices were allowed to contain only one mapping type, and mapping types were completely removed in Elasticsearch 7.0.0. The main reason is that tables are independent of each other in an SQL database, whereas in an Elasticsearch index, fields with the same name in different mapping types are internally backed by the same Lucene field.

Let's take a look at the terminology of SQL and Elasticsearch in the following table (https://www.elastic.co/guide/en/elasticsearch/reference/master/_mapping_concepts_across_sql_and_elasticsearch.html), which shows how the data is organized:

  • Column (SQL) / Field (Elasticsearch): A column is a set of data values of the same data type, with one value for each row of the database; Elasticsearch refers to this as a field. A field is the smallest unit of data in Elasticsearch and can contain a list of multiple values of the same type.
  • Row (SQL) / Document (Elasticsearch): A row represents a structured data item, which contains a series of data values from each column of the table. A document is like a row in that it groups fields (columns in SQL); a document is a JSON object in Elasticsearch.
  • Table (SQL) / Index (Elasticsearch): A table consists of columns and rows. An index is the largest unit of data in Elasticsearch. Compared to a database in SQL, an index is a logical partition of the indexed documents and the target against which the search queries are executed.
  • Schema (SQL) / implicit (Elasticsearch): In a relational database management system (RDBMS), a schema contains schema objects, which can be tables, columns, data types, views, and so on. A schema is typically owned by a database user. Elasticsearch does not provide an equivalent concept.
  • Catalog/database (SQL) / Cluster (Elasticsearch): In SQL, a catalog or database represents a set of schemas. In Elasticsearch, a cluster contains a set of indices.

Mapping

A schema could mean an outline, diagram, or model, and is often used to describe the structure of different types of data. Elasticsearch is reputed to be schema-less, in contrast to traditional relational databases. In traditional relational databases, you must explicitly specify tables, fields, and field types. In Elasticsearch, schema-less simply means that documents can be indexed without specifying a schema in advance. Under the hood, though, when no explicit static mapping is specified, Elasticsearch dynamically derives a schema from the structure of the first indexed document and decides how to index its fields. Elasticsearch's term for a schema is mapping, which defines how Lucene stores the indexed documents and the fields they contain. When you add a new field to your document, the mapping is also automatically updated.

Starting from Elasticsearch 6.0.0, only one mapping type is allowed for each index. The mapping type has fields defined by data types and meta fields. Elasticsearch supports many different data types for fields in a document. Each document has meta-fields associated with it. We can customize the behavior of the meta-fields when creating a mapping type. We'll cover this in Chapter 4, Mapping APIs.
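As a brief illustration of dynamic mapping (the index name my-index and its fields are hypothetical), indexing a first document and then retrieving the mapping shows the schema that Elasticsearch derived; with the default dynamic mapping rules, title becomes a text field (with a keyword sub-field) and views becomes a long:

curl -XPUT 'http://localhost:9200/my-index/_doc/1?pretty' -H 'Content-Type: application/json' -d '{"title": "Hello Elasticsearch", "views": 10}'
curl -XGET 'http://localhost:9200/my-index/_mapping?pretty'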

Analyzer

Elasticsearch comes with a variety of built-in analyzers that can be used in any index without further configuration. If the built-in analyzers are not suitable for your use case, you can create a custom analyzer. Whether it is a built-in analyzer or a custom analyzer, an analyzer is just a package of the following three lower-level building blocks:

  • Character filter: Receives the raw text as a stream of characters and can transform the stream by adding, removing, or changing its characters
  • Tokenizer: Splits the given stream of characters into a token stream
  • Token filters: Receives the token stream and may add, remove, or change tokens

The same analyzer should normally be used both at index time and at search time, but you can set search_analyzer in the field mapping to perform different analyses while searching.

Standard analyzer

The standard analyzer is the default analyzer, which is used if none is specified. A standard analyzer consists of the following:

  • Character filter: None
  • Tokenizer: Standard tokenizer
  • Token filters: Lowercase token filter and stop token filter (disabled by default)

A standard tokenizer provides grammar-based tokenization. A lowercase token filter normalizes the token text to lowercase, while a stop token filter removes stop words from token streams. For a list of English stop words, you can refer to https://www.ranks.nl/stopwords. Let's test the standard analyzer with the input text You'll love Elasticsearch 7.0.

Since it is a POST request, you need to set the Content-Type to application/json:

The URL is http://localhost:9200/_analyze and the request body is a raw JSON string, {"text": "You'll love Elasticsearch 7.0."}. You can see that the response has four tokens: you'll, love, elasticsearch, and 7.0, all in lowercase, which is due to the lowercase token filter:
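The same request can be reproduced with curl; the response below is abridged to the token values (when no analyzer is specified, the _analyze API falls back to the standard analyzer):

curl -XPOST 'http://localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d "{\"text\": \"You'll love Elasticsearch 7.0.\"}"
{
  "tokens" : [
    { "token" : "you'll", ... },
    { "token" : "love", ... },
    { "token" : "elasticsearch", ... },
    { "token" : "7.0", ... }
  ]
}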

In the next section, let's get familiar with the API conventions.

API conventions

We will only discuss some of the major conventions. For others, please refer to the Elasticsearch reference (https://www.elastic.co/guide/en/elasticsearch/reference/master/api-conventions.html). The following list can be applied throughout the REST API:

  • Access across multiple indices: This convention cannot be used in single document APIs:
    • _all: For all indices
    • comma: A separator between two indices
    • Wildcards (*, -): The asterisk character, *, matches any sequence of characters in an index name, while prefixing an index pattern with a minus sign, -, excludes the indices it matches
  • Common options:
    • Boolean values: false means the mentioned value is false; true means the value is true.
    • Number values: A number can be passed as a string in addition to the native JSON number type.
    • Time unit for duration: The supported time units are d for days, h for hours, m for minutes, s for seconds, ms for milliseconds, micros for microseconds, and nanos for nanoseconds.
    • Byte size unit: The supported data units are b for bytes, kb for kilobytes, mb for megabytes, gb for gigabytes, tb for terabytes, and pb for petabytes.
    • Distance unit: The supported distance units are mi for miles, yd for yards, ft for feet, in for inches, km for kilometers, m for meters, cm for centimeters, mm for millimeters, and nmi or NM for nautical miles.
    • Unit-less quantities: If the value is large enough, a multiplier suffix can be used. The supported quantities are k for kilo, m for mega, g for giga, t for tera, and p for peta. For instance, 10m represents the value 10,000,000.
    • Human-readable output: Values can be converted to human-readable form, such as 1h for 1 hour and 1kb for 1,024 bytes. This option can be turned on by adding ?human=true to the query string. The default value is false.
    • Pretty result: If you append ?pretty=true to the request URL, the JSON string in the response will be in pretty format.
    • REST parameters: Follow the convention of using underscore delimiting.
    • Content type: The type of content in the request body must be specified in the request header using the Content-Type key name. Check the reference as to whether the content type you use is supported. In all our POST/UPDATE/PATCH request examples, application/json is used.
    • Request body in query string: If the client library does not accept a request body for non-POST requests, you can use the source query string parameter to pass the request body and specify the source_content_type parameter with a supported media type.
    • Stack traces: If the error_trace=true request URL parameter is set, the error stack trace will be included in the response when an exception is raised.
  • Date math in a formatted date value: In range queries or in date range aggregations, you can format date fields using date math (see the sketch after this list):
    • A date math expression starts with an anchor date (now, or a date string ending with a double vertical bar, ||), followed by one or more sub-expressions such as +1h, -1d, or /d.
    • The supported time units differ from the time units for durations in the previously mentioned Common options list: y is for years, M is for months, w is for weeks, d is for days, h or H is for hours, m is for minutes, and s is for seconds; + is for addition, - is for subtraction, and / is for rounding down to the nearest time unit, so /d means rounding down to the nearest day.
For the following discussion of these date parameters, assume that the current system time now is 2019.01.03 01:20:00. Then now+1h is 2019.01.03 02:20:00, now-1d is 2019.01.02 01:20:00, now/d is 2019.01.03 00:00:00, now/M is 2019.01.01 00:00:00, 2019.01.03 01:20:00||+1h is 2019.01.03 02:20:00, and so forth.
  • Date math in index name: If you want to index time series data, such as logs, you can use a pattern with different date fields as the index names to manage daily logging information. Date math then gives you a way to search through a series of time series indices. The date math syntax for the index name is as follows:
<static_name{date_math_expr{date_format|time_zone}}>

The following are the terms used in the preceding syntax:

    • static_name: The unchanged text portion of the index name.
    • date_math_expr: The changing text portion of the index name, computed by the date math expression.
    • date_format: The default value is YYYY.MM.dd, where YYYY stands for the year, MM for the month, and dd for the day.
    • time_zone: The time zone offset; the default time zone is UTC. For instance, the time zone offset for PST is -08:00.
      Given that the current system time is 1:00 PM, January 3, 2019, the index name interpreted from the date math expression <logstash-{now/d{YYYY.MM.dd|+12:00}}> is logstash-2019.01.04, where now/d means the current system time rounded down to the nearest day.
  • URL-based access control: There are many APIs in Elasticsearch that allow you to specify the index in the request body, such as multi-search, multi-get, and a Bulk request. By default, the index specified in the request body will override the index parameter specified in the URL. If you use a proxy with URL-based access control to protect access to Elasticsearch indices, you can add the following setting to the elasticsearch.yml configuration file to disable the default action:
rest.action.multi.allow_explicit_index: false
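The following hedged sketch shows both flavors of date math; the index name my-index and the timestamp field are hypothetical, and %3Clogstash-%7Bnow%2Fd%7D%3E is the URL-encoded form of <logstash-{now/d}>:

curl -XGET 'http://localhost:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d '{
  "query" : {
    "range" : {
      "timestamp" : { "gte" : "now-1d/d", "lt" : "now/d" }
    }
  }
}'
curl -XGET 'http://localhost:9200/%3Clogstash-%7Bnow%2Fd%7D%3E/_search?pretty'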

For other concerns or detailed usage, check out the official Elasticsearch reference (https://www.elastic.co/guide/en/elasticsearch/reference/master/api-conventions.html). In the next section, we will review the new features in version 7.0.0.

New features

New features are introduced and documented in the 7.0.0 release notes (https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-7.0.0.html). There are many new features in the new release; however, some of them are not of interest to us, and some are beyond the scope of this book. Therefore, we present them in two sub-sections. The first sub-section lists the new features discussed in later chapters. The second sub-section lists the remaining new features with a description and their issue numbers.

New features to be discussed

The new features to be discussed include the following:

  • Analysis (see the examples in Chapter 15, Working with Elasticsearch Analysis Plugin):
    • Added support for inlined user dictionary in Nori's tokenizer
    • Added a prebuilt ICU analyzer
  • Geo (see the examples at https://www.elastic.co/guide/en/elasticsearch/reference/master/geo-shape.html):
    • Integrated Lucene's LatLonShape (BKD-backed geoshapes) as the default geo_shape indexing approach
  • Java High Level REST Client (see the examples in Chapter 11, Elasticsearch from Java Programming):
    • Added rollup search
  • Java Low Level REST Client (see the examples in Chapter 11, Elasticsearch from Java Programming):
    • Made warning behavior pluggable for each request
    • Added PreferHasAttributeNodeSelector
  • Machine learning (see the examples in Chapter 16, Machine Learning with Elasticsearch):
    • Added a delayed datacheck to the datafeed job runner
  • Mapping (see the examples in Chapter 4, Mapping APIs):
    • Made typeless APIs usable with indices whose type name is different from _doc
    • Added nanosecond field mapper date_nanos for the Date datatype
    • Added rank_feature and rank_features datatype to expose Lucene’s FeatureField
  • Search (see the examples in Chapter 6, Search APIs):
    • Added intervals query
    • Added soft limit to open scroll contexts
    • Added took timing info to response for the _msearch/template request
    • Added allow_partial_search_results flag to search requests with default setting
    • Introduced ability to minimize round-trips parameter ccs_minimize_roundtrips in cross-cluster search requests
    • Added script filter query to intervals query
    • Added track_total_hits parameter to enable the setting of the number of hits to track accurately
  • SQL (see the examples in Chapter 14, Working with Elasticsearch SQL):
    • Introduced the HISTOGRAM grouping function
    • Introduced DATABASE() and USER() system functions
    • Introduced INTERVAL query
    • Introduced SQL DATE data type
    • Introduced FIRST and LAST aggregate function

New features with description and issue number

These new features, together with a description and their issue numbers, can be found in the official release notes (https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-7.0.0.html).

In the next section, let's pay attention to the breaking changes in version 7.0.0.

Breaking changes

Aggregations changes

The changes related to aggregation are as follows:

  • The execution hints (global_ordinals_hash and global_ordinals_low_cardinality) for the terms aggregation are eliminated.
  • The maximum number of buckets allowed in a single response for bucket aggregations is controlled by the search.max_buckets cluster setting, with a default value of 10,000. A request that tries to return more buckets than the limit will fail with an exception.
  • You should use the missing_bucket option instead of the missing option of the sources parameter in the composite aggregation to include documents without a value in the response. The deprecated missing option is eliminated.
  • The params._agg and params._aggs script parameters in the scripted metric aggregation should be replaced by the new state and states script context variables.
  • In previous versions, the map_script parameter was the only required parameter in the scripted metric aggregation. Now, the combine_script and reduce_script parameters are also required.
  • The percentiles and percentile_ranks aggregations will return null instead of NaN in the response if their input is empty.
  • The stats and extended_stats aggregations will return 0 instead of null in the response if their input is empty.

Analysis changes

The changes related to analysis are as follows:

  • The maximum number of tokens that can be produced by the _analyze API is 10,000.
  • The maximum number of input characters analyzed during highlighting is 1,000,000.
  • Use the delimited_payload token filter instead of the deprecated delimited_payload_filter. For existing pre-7.0 indices, a deprecation warning is logged; for new indices, using the old name will fail with an exception.
  • The standard token filter is eliminated.
  • The standard_html_strip analyzer is deprecated.
  • Using the deprecated nGram and edgeNGram token filters will throw an error. Use the names ngram and edge_ngram, respectively, instead.

API changes

The changes related to APIs are as follows:

  • The internal versioning support for optimistic concurrency control is eliminated.
  • In the document bulk API, use the retry_on_conflict parameter instead of _retry_on_conflict; use routing instead of _routing; use version instead of _version; and use version_type instead of _version_type. Use the join meta-field instead of the _parent in mapping. All previous underscore parameters are eliminated. The camel-case parameters such as opType, versionType, and _versionType have been eliminated.
  • The cat thread pool API has renamed some field names from 6.x to 7.0 to align the meaning in the fixed thread pools and scaling thread pools. Use pool_size instead of the original size and core instead of the original min. For the corresponding alias, use psz instead of s, and cr instead of mi. In addition, the alias for max has changed from ma to mx. A new size field that represents the configured fixed number of active threads allowed in the current thread pool is introduced.
  • For the bulk request and update request, if a request contains an unknown parameter, a Bad Request (400) response will be returned.
  • The separate suggest statistics, formerly returned by the search statistics operation of the indices stats (_stats) API, are eliminated.
  • The copy_settings parameter in the split index operation will be removed in 8.0.0. The settings are copied by default during the operation.
  • Instead of using the stored search template _search API, you must use the stored script _scripts API to register search templates. The search template name must be provided.
  • Previously, the response status of the index alias API depended on whether the security feature was turned on or off. Now, an empty response with a status of OK (200) is always returned.
  • The feature for the response object to create a user using the /_xpack/security/user API with an additional field created outside the user field is eliminated.
  • Use the corrected URL _source_excludes and _source_includes parameters instead of the original _source_exclude and _source_include parameters in the query.
  • Unknown keys in the multi search _msearch API were ignored before, but will fail with an exception now.
  • The graph /_graph/_explore API is eliminated.
  • Term vectors can be used to return information and statistics for specific document fields in the document APIs. Use the corrected plural form, _termvectors, instead of the singular form, _termvector.
  • The Index Monitoring APIs are not authorized implicitly anymore. The privileges must be granted explicitly.
  • The deprecated parameter fields of the bulk request is eliminated.
  • If the document is missing when the put document API is used with version number X, the error message differs from previous versions. The new message is shown in the code block below:
document does not exist (expected version [X]).
  • The compressed_size and compressed_size_in_bytes fields are removed from the Cluster State API response.
  • The Migration Assistance API is removed.
  • When the cluster is configured as read-only, 200 status will be returned for a GET request.
  • The clear cache API previously supported both POST and GET requests. Support for GET requests is eliminated.

Cluster changes

The changes related to cluster are as follows:

  • The colon (:) is not a valid character for the cluster name anymore due to cross-cluster search support.
  • The number of allocated shards (wait_for_active_shards) that must be ready before the open index API can proceed has been increased from 0 to 1.
  • The shard preferences in the search APIs, including _primary, _primary_first, _replica, and _replica_first, are eliminated.
  • The cluster-wide shard limit used to prevent user error now depends on the value of max_shards_per_node * number_of_nodes.

Discovery changes

The changes related to Discovery are as follows:

  • The cluster.initial_master_nodes setting must be set before cluster bootstrapping is performed.
  • If half or more of the master-eligible nodes are going to be removed from a cluster, the affected nodes must first be excluded from the voting configuration using the _cluster/voting_config_exclusions API.
  • At least one of the following settings must be specified in the elasticsearch.yml configuration file (see the sketch after this list):
    • discovery.seed_hosts
    • discovery.seed_providers
    • cluster.initial_master_nodes
    • discovery.zen.ping.unicast.hosts
    • discovery.zen.hosts_provider
  • Use the setting name cluster.no_master_block instead of discovery.zen.no_master_block, which is deprecated.
  • The default timeout for the fault-detection ping between cluster nodes is 10 seconds instead of 30 seconds.
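A minimal elasticsearch.yml sketch for bootstrapping a new three-node cluster; the host and node names are hypothetical:

discovery.seed_hosts: ["node-1.example.com", "node-2.example.com", "node-3.example.com"]
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]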

High-level REST client changes

The changes related to the high-level REST client are as follows:

  • Methods that accept headers as the header varargs argument have been eliminated from the RestHighLevelClient class.
  • Previously, the cluster health API was computed at the shard level by default; its default level is now the cluster level.

Low-level REST client changes

The changes related to low-level REST client are as follows:

  • The maxRetryTimeout setting of the RestClient and RestClientBuilder class is eliminated.
  • Methods that do not take Request objects, such as performRequest and performRequestAsync, have been eliminated from the RestClient class.
  • The setHosts method is removed from the RestClient class.
  • The minimum compiler version is bumped to JDK 8.

Indices changes

The changes related to indices are as follows:

  • By default, each index in Elasticsearch is now allocated 1 primary shard and 1 replica (previous versions defaulted to 5 primary shards).
  • The colon (:) is no longer a valid character in the index name anymore due to the cross-cluster search support.
  • Negative values for index.unassigned.node_left.delayed_timeout settings are treated as zero.
  • The undocumented side effects from a _flush or a _force_merge operation have been fixed.
  • The difference between max_ngram and min_ngram in NGramTokenFilter and NGramTokenizer was previously limited to 1. This default limit can be changed with the index.max_ngram_diff index setting. Exceeding the limit now fails with an exception.
  • The difference between max_shingle_size and min_shingle_size in ShingleTokenFilter was previously limited to 3. This default limit can be changed with the index.max_shingle_diff index setting. Exceeding the limit now fails with an exception.
  • New indices created in version 7.0.0 will have a default value for the number_of_routing_shards parameter, which the split index API relies on. In order to maintain exactly the same distribution as a pre-7.0.0 index, you must make sure the value used by the split index API and the value set at index creation time are the same.
  • Background refreshing of search-idle shards is disabled: if you don't set the value of index.refresh_interval explicitly, no background refresh operation will be performed on shards that are search-idle.
  • The clear cache API allows you to clear all caches, or just specific ones. The original specific cache names are eliminated: use query instead of query_cache or filter_cache, request instead of request_cache, and fielddata instead of field_data (see the sketch after this list).
  • The network.breaker.inflight_requests.overhead setting has changed from 1 to 2. The estimated memory usage limit of all currently active incoming requests at transport or HTTP level on a node has been increased.
  • The parent circuit breaker defines a new setting indices.breaker.total.use_real_memory. The starting limit for the overall parent breaker indices.breaker.total.limit is 95% of the JVM heap if it is true (default), otherwise it is 70%.
  • The field data limit for the circuit breaker of index indices.breaker.fielddata.limit has been reduced from 60% to 40% of the maximum JVM heap by default.
  • The fix option of the index setting index.shard.check_on_startup, which checks the corruption of shard, has been eliminated.
  • The elasticsearch-translog tool has been eliminated. Use the elasticsearch-shard tool instead.
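As a hedged sketch of the renamed cache parameters (the index name my-index is hypothetical):

curl -XPOST 'http://localhost:9200/my-index/_cache/clear?query=true&pretty'
curl -XPOST 'http://localhost:9200/my-index/_cache/clear?request=true&pretty'
curl -XPOST 'http://localhost:9200/my-index/_cache/clear?fielddata=true&pretty'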

Java API changes

The changes related to the Java API are as follows:

  • Use the isShardsAcknowledged() method instead of the isShardsAcked() method in the CreateIndexResponse, RolloverResponse, and CreateIndexClusterStateUpdateResponse classes. The isShardsAcked() method is eliminated.
  • Some classes in the aggregation framework have been moved up a package level. The classes formerly in the org.elasticsearch.search.aggregations.metrics.* packages are now under the org.elasticsearch.search.aggregations.metrics package. The classes formerly in the org.elasticsearch.search.aggregations.pipeline.* packages are now under the org.elasticsearch.search.aggregations.pipeline package. The org.elasticsearch.search.aggregations.pipeline.PipelineAggregationBuilders class is now under the org.elasticsearch.search.aggregations package.
  • Regarding the org.elasticsearch.action.bulk.Retry class, the withBackoff() method usage with the Settings field is eliminated.
  • Regarding the Java client class, use the method name of the plural form, termVectors(), instead of the singular form, termVector().
  • The prepareExecute() method has also been eliminated.
  • The deprecated constructor AbstractLifeCycleComponent(Settings settings) is eliminated.

Mapping changes

The changes related to mapping are as follows:

  • The original indexing meta field, _all, which indexed the values of all fields, has been eliminated.
  • The original indexing meta field, _uid, which combined _type and _id, has been eliminated.
  • The original default mapping meta field, _default_, which was used as the base mapping for any new mapping type, has been eliminated.
  • For search and highlighting purposes, the index_options parameter controls what information is added to the inverted index. It is no longer supported on numeric fields.
  • The maximum number of nested JSON objects within a single document, across all nested fields, is 10,000.
  • In the past, the update_all_types parameter caused a mapping update to be applied to all fields with the same name across all _types of the same index. It has been eliminated.
  • The classic similarity feature, which is based on the TF/IDF to define how matching documents are scored, has been eliminated since it is no longer supported by Lucene.
  • Providing unknown similarity parameters in a request will now fail with an exception.
  • The geo_shape datatypes in the indexing strategy now defaults to using a vector-indexing approach based on Lucene's new LatLonShape field type.
  • Most options of the geo_shape mapping will be eliminated in a future version. They are tree, precision, tree_levels, strategy, distance_error_pct, and points_only.
  • The maximum number of completion contexts is 10. A deprecation warning will be logged if the limit is exceeded.
  • The default value of include_type_name has changed from true to false.
If you use tree as a mapping option for the geo_shape mapping and also use a time-based index created from a template, you must explicitly set geohash or quadtree as the option to ensure compatibility with your previously created indices.

ML changes

The change related to machine learning is as follows:

  • The types parameter is eliminated from the datafeed configuration

Packaging changes

The changes related to packaging are as follows:

  • If you use the RPM or DEB package, overrides of the settings for the systemd elasticsearch service should be made in /etc/systemd/system/elasticsearch.service.d/override.conf (see the sketch after this list)
  • The TAR package no longer includes the bin directory files for the Windows platform
  • Support for Ubuntu 14.04 has been dropped
  • Passing secrets via command-line input is no longer supported
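A minimal sketch of such a systemd override (the LimitNOFILE value mirrors the file descriptor discussion earlier in this chapter; adjust to your needs):

sudo mkdir -p /etc/systemd/system/elasticsearch.service.d
sudo tee /etc/systemd/system/elasticsearch.service.d/override.conf <<EOF
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload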

Search changes

The changes related to search are as follows:

  • By default, adaptive replica selection (cluster.routing.use_adaptive_replica_selection) is enabled, routing each search request to the best available copy of the data. You may disable it to use the old round-robin method of 6.x.
  • In the following error situations, a bad request (400) will be returned instead of an internal server error (500):
    • The resulting window is too large, from + size must be less than or equal to: [x] but was [y].
    • Cannot use the [sort] option in conjunction with [rescore].
    • The rescore window, [x], is too large.
    • The number of slices, [x], is too large.
    • Keep alive for scroll, [x], is too large.
    • In adjacency matrix aggregation, the number of filters exceeds the max limit.
    • An org.elasticsearch.script.ScriptException compile error.
  • The request_cache setting in the scroll search is eliminated. A bad request (400) will be returned if you still use it.
  • The method of including a rescore clause on a query to create a scroll search is eliminated. A bad request (400) will be returned if you still use it.
  • Use the corrected name, levenshtein, instead of levenstein, and jaro_winkler instead of jarowinkler, for the string_distance term suggest options in the term suggester.
  • The suggest_mode=popular option in the term and phrase suggesters now uses the document frequency of the input terms to compute the frequency threshold for candidate suggestions.
  • Search requests that contain extra tokens after the main object will fail with a parsing exception.
  • The completion suggester provides an auto-complete/search-as-you-type feature. When indexing and querying a context-enabled completion field, you must provide a context.
  • The semantics of max_concurrent_shard_requests has changed from cluster level to node level. The default number of concurrent shard requests per node is 5.
  • The format of the total number of documents that match the search criteria in the response has changed from a plain value to an object with a value and a relation (see the sketch after this list).
  • When track_total_hits is set to false in the search request, the total number of matching documents (hits.total) in the response will be null instead of -1. You may set rest_total_hits_as_int=true in the request to return to the old format.
  • track_total_hits defaults to counting up to 10,000 documents in the search response.
  • The default format for doc-value fields is switched back to the 6.x style. Date fields can take any date format, and numeric fields can take a DecimalFormat pattern.
  • For geo context completion suggester, the context is only accepted if the path parameter points to a field with geo_point type.
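An abridged sketch of the new hits.total object in a search response; a relation of "gte" means the value is a lower bound, while "eq" means it is exact:

"hits" : {
  "total" : {
    "value" : 10000,
    "relation" : "gte"
  },
  ...
}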

Query DSL changes

The changes related to query DSL are as follows:

  • The default value of the transposition parameter in a fuzzy query is changed from false to true.
  • The query string query options of use_dis_max, split_on_whitespace, all_fields, locale, auto_generate_phrase_queries, and lowercase_expanded_terms have all been eliminated.
  • If a bool query has the must_not clause, a score of 0 for all documents is returned instead of 1 because the scoring is ignored.
  • Geohashes are treated as grid cells, instead of just points, when they are used to specify the edges in the geo_bounding_box query.
  • A multi-term query (a wildcard, fuzzy, prefix, range, or regex query) against non-text fields with a custom analyzer will now throw an exception.
  • If the resulting polygon crosses the dateline, the GeoJSON standard will be applied to the geo_shape query to disambiguate the misleading results.
  • Boost settings are not allowed on complex inner span queries.
  • The number of terms in a terms query (index.max_terms_count) is limited to 65,536.
  • The maximum length of a regex string (index.max_regex_length) allowed in a regex query is limited to 1,000.
  • No more than 1,024 fields can be queried at a time. This also limits the automatic expansion of fields in the query_string, simple_query_string, and multi_match queries.
  • When a score cannot be tracked, the return value of max_score will return null instead of 0.
  • Boosting is a process that adjusts document relevance. In the past, a matching document could be given a negative boost value to move it toward the last position of the results. Negative boosting support is eliminated.
  • The score generated by the script_score_function or field_value_factor must be non-negative, otherwise it will fail with an exception.
  • The difference between the query and filter context in QueryBuilders is eliminated. Therefore, bool queries with should clauses that don't need access to scores no longer need to set minimum_should_match to 1.
  • More constraints on the scores value. It must not be negative, must not decrease when term freq increases, and must not increase when norm increases.
  • Negative support for the weight parameters for the function_score query is eliminated.

Settings changes

The changes related to settings are as follows:

  • The default node.name is the hostname instead of the first eight characters of the node _id.
  • Use the index.percolator.map_unmapped_fields_as_text setting instead of the deprecated index.percolator.map_unmapped_fields_as_string setting to force unmapped fields to be handled as strings in a percolate query.
  • Since the indexing thread pool no longer exists, the thread_pool.index.size and thread_pool.index.queue_size settings have been removed.
  • The thread_pool.bulk.size, thread_pool.bulk.queue_size, and es.thread_pool.write.use_bulk_as_display_name settings, which were supported as the fallback settings have been removed.
  • Use node.store.allow_mmap instead of node.store.allow_mmapfs to restrict the use of the mmapfs or the hybridfs store type of indices.
  • The HTTP on/off switch setting http.enabled has been eliminated.
  • The http.pipelining setting has been eliminated, since HTTP pipelining is now always supported. The http.pipelining.max_events setting remains the same as in the previous version.
  • The setting namespace search.remote.*, used to configure cross-cluster search, was renamed to cluster.remote.*. The previous setting names still work as fallbacks in version 7.0.0 and will be removed in version 8.0.0.
  • To audit local node information in security settings, you must use xpack.security.audit.logfile.emit_node_host_address instead of the deprecated xpack.security.audit.logfile.prefix.emit_node_host_address, xpack.security.audit.logfile.emit_node_host_name instead of the deprecated xpack.security.audit.logfile.prefix.emit_node_host_name, and xpack.security.audit.logfile.emit_node_name instead of the deprecated xpack.security.audit.logfile.prefix.emit_node_name. In addition, the default value of xpack.security.audit.logfile.emit_node_name has changed from true to false.
  • For all security realm settings, instead of using the explicit type setting, the realm type must be part of the setting name. Consider the following for instance:
xpack.security.authc.realms:
realm1:
type: ldap
order: 0
...
realm2:
type: native
...

This must be updated as follows:

xpack.security.authc.realms:
ldap.realm1:
order: 0
...
native.realm2:
...
  • The default TLS/SSL settings are removed.
  • TLS v1.0 is disabled by default.
  • Security is enabled only if xpack.security.enabled is true, or if xpack.security.enabled is not set and a gold or platinum license is installed.
  • Some of the security settings' names have changed in favor of their secure keystore variants. You must use:
    • xpack.notification.email.account.<id>.smtp.secure_password instead of xpack.notification.email.account.<id>.smtp.password
    • xpack.notification.hipchat.account.<id>.secure_auth_token instead of xpack.notification.hipchat.account.<id>.auth_token
    • xpack.notification.jira.account.<id>.secure_url instead of xpack.notification.jira.account.<id>.url
    • xpack.notification.jira.account.<id>.secure_user instead of xpack.notification.jira.account.<id>.user
    • xpack.notification.jira.account.<id>.secure_password instead of xpack.notification.jira.account.<id>.password
    • xpack.notification.pagerduty.account.<id>.secure_service_api_key instead of xpack.notification.pagerduty.account.<id>.service_api_key
    • xpack.notification.slack.account.<id>.secure_url instead of xpack.notification.slack.account.<id>.url
  • The settings under the xpack.security.audit.index and xpack.security.audit.outputs namespaces have been removed.
  • The ecs setting for the user agent ingest processor now defaults to true.
  • The action.master.force_local setting is removed.
  • The limit of cluster-wide shard number is now enforced, not optional.
  • If http.max_content_length is set to Integer.MAX, it will not be reset to 100mb.

Scripting changes

The changes related to scripting are as follows:

  • The getter methods for the date class have been eliminated. Use .value instead of .date on the date fields. For instance, use doc['start_time'].value.minuteOfHour instead of doc['start_time'].date.minuteOfHour.
  • Accessing the missing field of the document will fail with an exception. To check if a document is missing values, you can use doc['field_name'].size() == 0.
  • A bad request (400) instead of an internal error (500) is returned for malformed scripts in search templates, ingest pipelines, and search requests.
  • The deprecated getValues() method of the ScriptDocValues class has been eliminated. Use doc["field_name"] instead of doc["field_name"].values.

If an upgrade is needed, follow the advice for migrating between versions in the next section.

Migration between versions

The rules of thumb when migrating your application between Elasticsearch versions are as follows:

  • When migrating between minor versions (for example, 7.x to 7.y), we can upgrade one node at a time.
  • Migrating between two subsequent major versions (for example, 6.x to 7.x) requires a full cluster restart.
  • Migrating between two non-subsequent major versions (for example, 5.x to 7.x) requires reindexing documents from Elasticsearch 5.x to Elasticsearch 6.x. Then, you follow the procedures for migrating between two subsequent major versions.

The reindex API can be used to convert multi-type indices to single-type indices during migration. See the example in Chapter 3, Document APIs.

Elasticsearch 7.0 will not start on a node with documents indexed prior to 6.0.

Summary

So far, we have run the Elasticsearch server and performed some simple tests. We have also familiarized ourselves with some basic concepts from an architectural point of view to flatten the learning curve. We listed and briefly discussed the new features and major changes in the new release, and we covered the best way to handle migration between major versions.

In the next chapter, we'll delve into index APIs. An index is a logical namespace for organizing data in Elasticsearch. Becoming familiar with the index infrastructure is the first step in solving Elasticsearch performance issues. We'll first understand the APIs for index management, index settings, and index templates. After we practice some advanced examples of index status management operations, we will also discuss monitoring indices statistics.
