Downloading and Setting Up ElasticSearch

by Alberto Paro | December 2013 | Open Source

This article by Alberto Paro, author of the book ElasticSearch Cookbook, covers the basic steps to start using ElasticSearch, from a simple install to cloud-based setups.

In this article, we will cover the following topics:

  • Downloading and installing ElasticSearch
  • Networking setup
  • Setting up a node
  • Setting up ElasticSearch for Linux systems (advanced)
  • Setting up different node types (advanced)
  • Installing a plugin
  • Installing a plugin manually
  • Removing a plugin
  • Changing logging settings (advanced)


Downloading and installing ElasticSearch

ElasticSearch has an active community and the release cycles are very fast.

Because ElasticSearch depends on many common Java libraries (Lucene, Guice, and Jackson are the most famous ones), the ElasticSearch community tries to keep them updated and fix bugs that are discovered in them and in ElasticSearch core.

If possible, the best practice is to use the latest available release (usually also the most stable one).

Getting ready

An operating system supported by ElasticSearch (Linux/Mac OS X/Windows) with Java JVM 1.6 or above installed is required. A web browser is required to download the ElasticSearch binary release.

How to do it...

To download and install an ElasticSearch server, we will perform the following steps:

  1. Download ElasticSearch from the Web.

    The latest version is always downloadable from the web address http://www.elasticsearch.org/download/.

    There are versions available for different operating systems:

    • elasticsearch-{version-number}.zip: This is for both Linux/Mac OS X and Windows operating systems
    • elasticsearch-{version-number}.tar.gz: This is for Linux/Mac
    • elasticsearch-{version-number}.deb: This is for Debian-based Linux distributions (this also covers the Ubuntu family)

    These packages contain everything to start ElasticSearch.

    At the time of writing this book, the latest and most stable version of ElasticSearch was 0.90.7. To check whether this is still the latest available version, please visit http://www.elasticsearch.org/download/.

  2. Extract the binary content.

    After downloading the correct release for your platform, the installation consists of expanding the archive in a working directory.

    Choose a working directory that is safe from charset problems and doesn't have a long path, to prevent problems when ElasticSearch creates its directories to store the index data.

    On the Windows platform, a good directory could be c:\es; on Unix and Mac OS X, /opt/es.

    To run ElasticSearch, you need a Java Virtual Machine 1.6 or above installed. For better performance, I suggest you use the Sun/Oracle 1.7 version.

  3. Start ElasticSearch to check that everything is working.

    To start your ElasticSearch server, just go into the install directory and type:

    # bin/elasticsearch -f (for Linux and Mac OS X)

    or

    # bin\elasticsearch.bat -f (for Windows)

    Now your server should start, printing its startup log to the console.
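
    To check that the server answers, you can query its HTTP REST port with curl (a minimal sketch; this assumes the default port 9200 on the local machine):

    # curl http://localhost:9200/

    The response is a small JSON object containing the node name and version information.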

How it works...

The ElasticSearch package contains three directories:

  • bin: This contains the scripts to start and manage ElasticSearch. The most important ones are:
    • elasticsearch(.bat): This is the main script to start ElasticSearch
    • plugin(.bat): This is a script to manage plugins
  • config: This contains the ElasticSearch configs. The most important ones are:
    • elasticsearch.yml: This is the main config file for ElasticSearch
    • logging.yml: This is the logging config file
  • lib: This contains all the libraries required to run ElasticSearch
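
A quick way to verify this layout after extraction is to list the install directory (a sketch; this assumes the /opt/es directory suggested earlier, and the exact file names vary by version):

    # ls /opt/es
    bin  config  lib
    # ls /opt/es/bin
    elasticsearch  elasticsearch.in.sh  plugin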

There's more...

During ElasticSearch startup a lot of events happen:

  • A node name is chosen automatically (that is Akenaten in the example) if not provided in elasticsearch.yml.
  • A node name hash is generated for this node (that is, whqVp_4zQGCgMvJ1CXhcWQ).
  • If there are plugins (internal or sites), they are loaded. In the previous example there are no plugins.
  • If not configured otherwise, ElasticSearch automatically binds two ports on all available addresses:
    • 9300: internal, intra-node communication, used for discovering other nodes
    • 9200: the HTTP REST API port
  • After startup, if indices are available, they are checked and put online, ready to be used.

There are more events which are fired during ElasticSearch startup. We'll see them in detail in other recipes.

Networking setup

Correctly setting up networking is very important for your node and cluster.

As there are a lot of different install scenarios and networking issues, in this recipe we will cover two kinds of networking setup:

  • Standard installation with autodiscovery working configuration
  • Forced IP configuration; used if it is not possible to use autodiscovery

Getting ready

You need a working ElasticSearch installation and to know your current networking configuration (that is, IP).

How to do it...

For configuring networking, we will perform the steps as follows:

  1. Open the ElasticSearch configuration file with your favorite text editor.

    With the standard ElasticSearch configuration file (config/elasticsearch.yml), your node is configured to bind to all your machine's interfaces and performs autodiscovery by broadcasting events; that is, it sends "signals" to every machine in the current LAN and waits for a response. If a node responds, they can join to form a cluster.

    If another node is available in the same LAN, they join the cluster.

    Only nodes with the same ElasticSearch version and same cluster name (cluster.name option in elasticsearch.yml) can join each other.

  2. To customize the network preferences, you need to change some parameters in the elasticsearch.yml file, such as:

    cluster.name: elasticsearch
    node.name: "My wonderful server"
    network.host: 192.168.0.1
    discovery.zen.ping.unicast.hosts: ["192.168.0.2","192.168.0.3[9300-9400]"]

    This configuration sets the cluster name to elasticsearch, the node name, and the network address to bind to; the node will then try to contact the hosts listed in the discovery section to join the cluster.

  3. We can check the configuration during node loading.

    We can now start the server and check if the network is configured:

    [INFO ][node           ] [Aparo] version[0.90.3], pid[16792], build[5c38d60/2013-08-06T13:18:31Z]
    [INFO ][node           ] [Aparo] initializing ...
    [INFO ][plugins        ] [Aparo] loaded [transport-thrift, river-twitter, mapper-attachments, lang-python, jdbc-river, lang-javascript], sites [bigdesk, head]
    [INFO ][node           ] [Aparo] initialized
    [INFO ][node           ] [Aparo] starting ...
    [INFO ][transport      ] [Aparo] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.5:9300]}
    [INFO ][cluster.service] [Aparo] new_master [Angela Cairn] [yJcbdaPTSgS7ATQszgpSow][inet[/192.168.1.5:9300]], reason: zen-disco-join (elected_as_master)
    [INFO ][discovery      ] [Aparo] elasticsearch/yJcbdaPTSgS7ATQszgpSow
    [INFO ][http           ] [Aparo] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.5:9200]}
    [INFO ][node           ] [Aparo] started

    In this case, we have:

    • The transport layer binds to 0:0:0:0:0:0:0:0:9300 and 192.168.1.5:9300
    • The REST HTTP interface binds to 0:0:0:0:0:0:0:0:9200 and 192.168.1.5:9200

How it works...

It works as follows:

  • cluster.name: This sets up the name of the cluster (only nodes with the same name can join).
  • node.name: If this is not defined, it is automatically generated by ElasticSearch. It allows defining a name for the node. If you have a lot of nodes on different machines, it is useful to set this name to something meaningful, to easily locate them. A valid name is easier to remember than a generated one, such as whqVp_4zQGCgMvJ1CXhcWQ.
  • network.host: This defines the IP address of your machine to be used to bind the node. If your server is on different LANs, or you want to limit the bind to only one LAN, you must set this value to your server's IP address.
  • discovery.zen.ping.unicast.hosts: This allows you to define a list of hosts (with ports or port ranges) used to discover other nodes to join the cluster. This setting allows using the node in a LAN where broadcasting is not allowed or autodiscovery is not working (that is, behind packet-filtering routers). The port referred to is the transport one, usually 9300. The addresses in the hosts list can be a mix of the following (see the sketch after this list):
    • a host name, for example, myhost1
    • an IP address, for example, 192.168.1.2
    • an IP address or host name with a port, for example, myhost1:9300 and 192.168.1.2:9300
    • an IP address or host name with a range of ports, for example, myhost1[9300-9400] and 192.168.1.2[9300-9400]
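
Putting these forms together, a unicast-only discovery configuration in elasticsearch.yml might look like the following sketch (the host names are hypothetical; disabling multicast is only needed where broadcasting is not allowed):

    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["myhost1", "192.168.1.2:9300", "myhost2[9300-9400]"]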

Setting up a node

ElasticSearch allows you to customize several parameters in an installation. In this recipe, we'll see the most used ones, to define where to store our data and to improve general performance.

Getting ready

You need a working ElasticSearch installation.

How to do it...

The steps required for setting up a simple node are as follows:

  • Open the config/elasticsearch.yml file with an editor of your choice.
  • Set up the directories that store your server data:

    path.conf: /opt/data/es/conf
    path.data: /opt/data/es/data1,/opt2/data/data2
    path.work: /opt/data/work
    path.logs: /opt/data/logs
    path.plugins: /opt/data/plugins

  • Set up parameters to control the standard index creation. These parameters are:

    index.number_of_shards: 5
    index.number_of_replicas: 1

How it works...

The path.conf setting defines the directory that contains your configuration files: mainly elasticsearch.yml and logging.yml. The default location is $ES_HOME/config, where ES_HOME is your install directory.

It's useful to set up the config directory outside your application directory so you don't need to copy configuration files every time you update the version or change the ElasticSearch installation directory.

The path.data setting is the most important one: it allows defining one or more directories where you store index data. When you define more than one directory, they are managed similarly to a RAID 0 configuration (the total space is the sum of all the data directory entry points), favoring locations with the most free space.

The path.work setting is a location where ElasticSearch puts temporary files.

The path.logs setting is where log files are put. How to log is controlled in logging.yml.

The path.plugins setting allows overriding the plugins path (the default is $ES_HOME/plugins). This is useful for installing "system wide" plugins.

The main parameters used to control indices and shards are index.number_of_shards, which sets the default number of shards for a newly created index, and index.number_of_replicas, which sets the initial number of replicas.
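
These defaults can also be overridden per index at creation time through the REST API. A minimal sketch with curl (this assumes a local node on the default 9200 port; the index name myindex is hypothetical):

    curl -XPUT 'http://localhost:9200/myindex' -d '{
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2
      }
    }'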

There's more...

There are a lot of other parameters that can be used to customize your ElasticSearch installation and new ones are added with new releases. The most important ones are described in this recipe and in the next one.


Setting up ElasticSearch for Linux systems (advanced)

If you are using a Linux system, typically on a server, you need some extra setup to gain performance and to resolve production problems with many indices.

Getting ready

You need a working ElasticSearch installation.

How to do it...

To improve performance on Linux systems, we will perform the following steps:

  1. First, you need to change the current limits for the user who runs the ElasticSearch server. In these examples, we will call this user elasticsearch.
  2. To allow elasticsearch to manage a large number of files, you need to increase the number of file descriptors (the number of open files) that the user can have. To do so, edit your /etc/security/limits.conf file and add the following lines at the end:

    elasticsearch - nofile 999999
    elasticsearch - memlock unlimited

    Then, a machine restart is required to be sure that the changes are applied.

  3. For controlling the memory swapping, you need to set up this parameter in elasticsearch.yml:

    bootstrap.mlockall: true

  4. To fix the memory size used by the ElasticSearch server, set ES_MIN_MEM and ES_MAX_MEM to the same value in $ES_HOME/bin/elasticsearch.in.sh. Alternatively, you can set ES_HEAP_SIZE, which automatically initializes ES_MIN_MEM and ES_MAX_MEM to the provided value, as shown in the sketch that follows.
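
    As an alternative to editing elasticsearch.in.sh, a minimal sketch of fixing the heap size from the environment before starting the server (the 2g value is only an example; size it to your machine):

    # export ES_HEAP_SIZE=2g
    # bin/elasticsearch -f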

How it works...

The standard limit of file descriptors (max number of open files for a user) is typically 1024. When you store a lot of records in several indices, you run out of file descriptors very quickly, so your ElasticSearch server becomes unresponsive and your indices may become corrupted, losing your data.

Changing the limit to a very high number means that your ElasticSearch doesn't hit the maximum number of open files.

The other settings prevent ElasticSearch memory from being swapped, giving a performance boost in a production environment. They matter because, during indexing and searching, ElasticSearch creates and destroys a lot of objects in memory, and such a heavily used heap is a prime candidate for swapping. If you don't set bootstrap.mlockall: true, the operating system may page parts of the ElasticSearch memory out to disk and back again, which is very slow; with this setting, the memory is locked in RAM, giving a huge performance boost.

There's more...

This recipe covers two common errors that happen in production:

  • "Too many open files", that can corrupt your indices and your data
  • Slow performance in search and indexing due to garbage collector

Setting up different node types (advanced)

ElasticSearch is natively designed for the cloud, so when you need to release a production environment with a huge number of records, and you need high availability and good performance, you need to aggregate more nodes in a cluster.

ElasticSearch allows defining different types of nodes to balance and improve overall performance.

Getting ready

You need a working ElasticSearch installation.

How to do it...

For an advanced cluster setup, there are some parameters that must be configured to define different node types. These parameters are in config/elasticsearch.yml and they can be set with the following steps:

  1. Set up whether the node can be a master or not:

    node.master: true

  2. Set up whether the node must contain data or not:

    node.data: true

How it works...

The different node types work as follows:

  • node.master: This parameter defines whether the node can become a master for the cloud. The default value for this parameter is true.

    A master node is an arbiter for the cloud: it makes decisions about shard management, keeps the cluster status, and is the main controller of every index action.

  • node.data: This allows you to store data in the node. The default value for this parameter is true. This node will be a worker that indexes and searches data.

Mixing these two parameters, it's possible to have different node types:

node.master | node.data | Node description
------------|-----------|-----------------
true        | true      | This is the default node. It can act as a master and contains data.
false       | true      | This node never becomes a master node; it only holds data. It can be defined as the "workhorse" of your cluster.
true        | false     | This node only serves as a master, so as not to store any data and to have free resources. This will be the "coordinator" of your cluster.
false       | false     | This node acts as a "search load balancer" (fetching data from nodes, aggregating results, and so on).
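
For example, a dedicated master node (the "coordinator" from the third row) would be configured in config/elasticsearch.yml as follows:

    node.master: true
    node.data: false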

The most frequently used node type is the first one, but if you have a very big cluster or special needs, you can differentiate the scopes of your nodes to better serve searches and aggregations.


Installing a plugin

One of the main features of ElasticSearch is the possibility to extend it with plugins. Plugins extend ElasticSearch features and functionalities in several ways. There are two kinds of plugins:

  • Site plugins: These are used to serve static content from their entry points. They are mainly used to create management applications for the monitoring and administration of a cluster.
  • Binary plugins: These are jar files that contain application code. They are used for:
    • Rivers (plugins that allow importing data from DBMS or other sources)
    • ScriptEngine (JavaScript, Python, Scala, and Ruby)
    • Custom analyzers and tokenizers
    • REST entry points
    • Supporting new protocols (Thrift, memcache, and so on)
    • Supporting new storages (Hadoop)

Getting ready

You need an installed working ElasticSearch server.

How to do it...

ElasticSearch provides a script in the bin/ directory, called plugin, for automatically downloading and installing plugins.

The steps required to install a plugin are:

  1. Call the plugin script with the install command and the plugin name reference.

    To install an administrative interface for ElasticSearch, simply call:

    • on Linux/Mac:

      plugin -install mobz/elasticsearch-head

    • on Windows:

      plugin.bat -install mobz/elasticsearch-head

  2. Start the node and check that the plugin is correctly loaded.

When the server starts, the installed plugin should be listed in the loaded plugins line of the startup output.

Remember that a plugin installation requires a restart of the ElasticSearch server.

How it works...

The plugin[.bat] script is a wrapper for the ElasticSearch Plugin Manager. It can be used to install a plugin, or to remove one with the -remove option.

To install a plugin, there are two kinds of options:

  • Pass the URL of the plugin (zip archive) with the -url parameter, for example, bin/plugin -url http://mywonderfulserver.com/plugins/awesome-plugin.zip

  • Use the -install parameter with the GitHub repository reference of the plugin.

    The install parameter that must be given is formatted in this way:

    <username>/<repo>[/<version>]

In the previous example:

  • <username> was mobz
  • <repo> was elasticsearch-head
  • <version> was not given so master/trunk was used

During the install process, ElasticSearch Plugin Manager is able to:

  • Download the plugin
  • Create a plugins directory in ES_HOME if it's missing
  • Unzip the plugin content in the plugin directory
  • Remove temporary files

There's more...

There are some hints to remember while installing plugins. The first and most important is that the plugin must be certified for your current ElasticSearch version: some releases can break your plugins. Typically, the plugin developer's page lists the ElasticSearch versions supported by the plugin.

For example, if you look at the Python language plugin page (https://github.com/elasticsearch/elasticsearch-lang-python), you'll see a reference table similar to the following table:

Python Plugin | ElasticSearch
--------------|----------------
master        | 0.90 -> master
1.2.0         | 0.90 -> master
1.1.0         | 0.19 -> 0.20
1.0.0         | 0.18

You must choose the version working with your current ElasticSearch version.
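
For example, to install a specific version of the Python language plugin using the <username>/<repo>/<version> form described earlier (a sketch; pick the version matching your server from the table above):

    plugin -install elasticsearch/elasticsearch-lang-python/1.2.0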

Updating plugins in a multinode environment can cause malfunctions due to different plugin versions running on different nodes. If you have a big cluster, for safety it's better to check the update in a separate environment to prevent problems.

Note that updating an ElasticSearch server could also break your custom binary plugins due to some internal API changes.

Installing a plugin manually

Sometimes your plugin is not available online or the standard installation fails, so you need to install your plugin manually.

Getting ready

You need an installed ElasticSearch server.

How to do it...

We assume that your plugin is named awesome and it's packed in a file called awesome.zip.

The steps required to install a plugin manually are as follows (a shell sketch follows the list):

  1. If a directory named plugins doesn't exist in your ElasticSearch home installation, create it.
  2. Copy your zip file into the plugins directory.
  3. Extract the zip content into a directory named after your plugin (that is, plugins/awesome).
  4. Remove the zip archive to clean up unused files.
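
A minimal sketch of these steps on a Unix shell (this assumes the /opt/es install directory suggested earlier and the example awesome plugin):

    # mkdir -p /opt/es/plugins/awesome
    # cp awesome.zip /opt/es/plugins/awesome/
    # cd /opt/es/plugins/awesome
    # unzip awesome.zip
    # rm awesome.zip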

How it works...

Every ElasticSearch plugin is contained in a directory (usually named after the plugin).

If the plugin is a site plugin, it should contain a directory called _site, which holds the static files that must be served by the server. If it is a binary plugin, the plugin directory should contain one or more jar files.

When ElasticSearch starts, it scans the plugins directory and loads them. If a plugin is corrupted or broken, the server doesn't start.

Removing a plugin

You have installed some plugins, and now you need to remove a plugin because it's no longer required. Removing an ElasticSearch plugin is easy if everything goes right; otherwise, you need to remove it manually.

This recipe covers both cases.

Getting ready

You need an installed working ElasticSearch server with an installed plugin. Stop the ElasticSearch server in order to safely remove the plugin.

How to do it...

ElasticSearch Plugin Manager, which comes with its script wrapper (plugin), provides a command to automatically remove a plugin.

  • On Linux and MacOSX, call:

    plugin -remove mobz/elasticsearch-head

    Or

    plugin -remove head

  • On Windows, call:

    plugin.bat -remove mobz/elasticsearch-head

    Or

    plugin.bat -remove head

How it works...

The Plugin Manager's -remove command tries to detect the correct name of the plugin and removes the directory of the installed plugin.

If there are undeletable files in your plugin directory (or a strange astronomical event hits your server), the plugin script may fail; to manually remove a plugin, go into the plugins directory and remove the directory with your plugin's name.
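
For example, to manually remove the head plugin from the install directory suggested earlier (a sketch; adjust the path to your own installation):

    # rm -rf /opt/es/plugins/head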

Changing logging settings (advanced)

Standard logging settings work very well for general usage.

If you need to debug your ElasticSearch server or change how the logging works (that is, sending events to a remote machine), you need to change the logging.yml parameters.

Getting ready

You need an installed working ElasticSearch server.

How to do it...

In the config directory of your ElasticSearch install directory, there is a logging.yml file that controls the logging settings. The steps required for changing the logging settings are:

  1. To emit every kind of logging ElasticSearch has, change the root-level logging in logging.yml from:

     rootLogger: INFO, console, file

     to:

     rootLogger: DEBUG, console, file

  2. Now, if you start ElasticSearch from the command line (with bin/elasticsearch -f), you should see a lot of verbose output.

How it works...

The ElasticSearch logging system is based on the log4j library (http://logging.apache.org/log4j/).

Changing the log level can be useful for checking for bugs or understanding malfunctions due to bad configuration or strange plugin behaviors. A verbose log can also be shared with the ElasticSearch community to help track down problems.

log4j is a powerful library for managing logging, and covering all of its functionality is outside the scope of this book; if you need advanced usage, there are a lot of books and articles on the Internet for reference.

Summary

In this article, we learned the installation process and configuration, from a single developer machine to a big cluster, with hints on how to improve performance and avoid misconfiguration errors. We also learned how to manage ElasticSearch plugins: installing, configuring, updating, and removing them.

About the Author


Alberto Paro

Alberto Paro is an engineer, a project manager, and a software developer. He currently works as the CTO at The Net Planet Europe and as a freelance software-engineering consultant on Big Data and NoSQL solutions. He loves studying emerging solutions and applications, mainly related to Big Data processing, NoSQL, natural language processing, and neural networks. He started programming in BASIC on a Sinclair Spectrum when he was eight years old, and in his life he has gained a lot of experience using different operating systems, applications, and programming.

In 2000, he completed a degree in Computer Science Engineering at Politecnico di Milano, with a thesis on designing multiuser and multidevice web applications. He worked as a teaching assistant at the university for about a year. Then, after coming in contact with The Net Planet company and loving their innovative ideas, he started working on knowledge management solutions and advanced data-mining products.

In his spare time, when he is not playing with his children, he likes working on open source projects. When he was in high school, he started contributing to projects related to the GNOME environment (gtkmm). One of his preferred programming languages is Python, and he wrote one of the first NoSQL backends for Django, for MongoDB (django-mongodb-engine). In 2010, he started using ElasticSearch to provide search capabilities for some Django e-commerce sites and developed PyES (a Pythonic client for ElasticSearch) and the initial part of the ElasticSearch MongoDB river.
