You're reading from Elasticsearch 5.x Cookbook - Third Edition

Product type: Book
Published in: Feb 2017
ISBN-13: 9781786465580
Edition: 3rd Edition
Author: Alberto Paro

Alberto Paro is an engineer, manager, and software developer. He currently works as technology architecture delivery associate director of the Accenture Cloud First data and AI team in Italy. He loves to study emerging solutions and applications, mainly related to cloud and big data processing, NoSQL, Natural language processing (NLP), software development, and machine learning. In 2000, he graduated in computer science engineering from Politecnico di Milano. Then, he worked with many companies, mainly using Scala/Java and Python on knowledge management solutions and advanced data mining products, using state-of-the-art big data software. A lot of his time is spent teaching how to effectively use big data solutions, NoSQL data stores, and related technologies.

Chapter 11. Backup and Restore

In this chapter, we will cover the following recipes:

  • Managing repositories

  • Executing a snapshot

  • Restoring a snapshot

  • Setting up an NFS share for backup

  • Reindexing from a remote cluster

Introduction


Elasticsearch is very commonly used as a datastore for logs and other kinds of data. If you store valuable data, you also need tools to back it up and restore it to support disaster recovery.

In the first versions of Elasticsearch, the only viable solution was to dump your data with a complete scan and then reindex it. As Elasticsearch matured into a complete product, it gained native functionality to back up and restore data.

In this chapter, we'll see how to configure a shared storage via NFS for storing your backups, and how to execute and restore a backup.

In the last recipe of the chapter, we will see how to use the reindex functionality to clone data between different Elasticsearch clusters. This approach is very useful when you cannot use the standard backup/restore functionality, for example, when moving from an old Elasticsearch version to a new one.

Managing repositories


Elasticsearch provides a built-in system to rapidly back up and restore your data. When working with live data, keeping a backup is complex, due to the large number of concurrency problems.

Elasticsearch's snapshot feature allows you to create snapshots of individual indices (or aliases), or of an entire cluster, in a remote repository.

Before executing a snapshot, a repository must be created: this is where your backups/snapshots will be stored.

Getting ready

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl via the command line you need to install curl for your operating system.

We need to edit config/elasticsearch.yml and add the directory of your backup repository:

path.repo: /tmp/

For our examples, we'll be using the /tmp directory available in every Unix system. Generally, in a production cluster, this directory should be a shared repository...
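The recipe text is truncated here; as a minimal sketch, registering a filesystem repository named my_repository could look like the following (the location value is an assumption pointing under the path.repo directory configured above):

```shell
# Repository definition: "fs" is the shared-filesystem repository type.
REPO_BODY='{"type":"fs","settings":{"location":"/tmp/my_repository","compress":true}}'

# Register the repository (assumes Elasticsearch listening on localhost:9200);
# the fallback echo keeps the sketch harmless when no cluster is running.
curl -XPUT "http://localhost:9200/_snapshot/my_repository" -d "$REPO_BODY" || echo "cluster not reachable"
```

The location must live under one of the paths listed in path.repo, otherwise Elasticsearch rejects the repository registration.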

Executing a snapshot


In the previous recipe, we defined a repository: the place where we will store the backups. Now we can create snapshots of indices: a full backup of one or more indices, taken at the exact instant the command is called.

For every repository it's possible to define multiple snapshots.

Getting ready

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl via the command line you need to install curl for your operating system.

To correctly execute the following command, the repository created in the previous recipe is required.

How to do it...

To manage a snapshot, we will perform the following steps:

  1. To create a snapshot called snap_1 for the test and test1 indices, the HTTP method is PUT and the curl command is as follows:

        curl -XPUT "http://localhost:9200/_snapshot/my_repository/snap_1?wait_for_completion=true" -d '{
        "indices...
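The request body is truncated in this excerpt; a complete sketch of the same call follows, where the option values beyond the index names are assumptions based on the standard snapshot API:

```shell
# Snapshot body: back up only the test and test1 indices, skip missing
# indices, and exclude the cluster's global state from the snapshot.
SNAP_BODY='{"indices":"test,test1","ignore_unavailable":true,"include_global_state":false}'

# wait_for_completion=true makes the call block until the snapshot finishes;
# the fallback echo keeps the sketch harmless when no cluster is running.
curl -XPUT "http://localhost:9200/_snapshot/my_repository/snap_1?wait_for_completion=true" -d "$SNAP_BODY" || echo "cluster not reachable"
```

Without wait_for_completion=true, the call returns immediately and the snapshot proceeds in the background.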

Restoring a snapshot


Once you have snapshots of your data, they can be restored. The restore process is very fast: the indexed shard data is simply copied onto the nodes and activated.

Getting ready

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl via the command line, you need to install curl for your operating system.

To correctly execute the following command, the backup created in the previous recipe is required.

How to do it...

To restore a snapshot, we will perform the following steps:

  1. To restore a snapshot called snap_1, the HTTP method is POST and the curl command is:

        curl -XPOST "http://localhost:9200/_snapshot/my_repository/snap_1/_restore?pretty" -d '{
        "indices": "test-index,test-2",
        "ignore_unavailable": "true",
        "include_global_state": false,
        "rename_pattern": "test-(.+...
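The body is truncated above; a complete sketch of a restore call follows, where the rename_replacement value and the closing options are assumptions illustrating the standard restore API:

```shell
# Restore body: restore the two indices under new names, so the originals
# are not overwritten (test-index -> restored-index, test-2 -> restored-2).
RESTORE_BODY='{"indices":"test-index,test-2","ignore_unavailable":true,"include_global_state":false,"rename_pattern":"test-(.+)","rename_replacement":"restored-$1"}'

# The fallback echo keeps the sketch harmless when no cluster is running.
curl -XPOST "http://localhost:9200/_snapshot/my_repository/snap_1/_restore?pretty" -d "$RESTORE_BODY" || echo "cluster not reachable"
```

The rename_pattern/rename_replacement pair is a regular-expression rewrite applied to every restored index name.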

Setting up an NFS share for backup


Managing the repository is the most important issue in Elasticsearch backup management. Due to its native distributed architecture, snapshot and restore are designed to work cluster-wide.

During a snapshot, the shards are copied to the defined repository. If this repository is local to the nodes, the backup data is spread across all the nodes. For this reason, it's necessary to have shared repository storage if you have a multinode cluster.

A common approach is to use a Network File System (NFS) share, as it's very easy to set up and it's a very fast solution (standard Windows Samba shares can also be used).

Getting ready

We have a network with the following nodes:

  • Host server: 192.168.1.30 (where we will store the backup data)

  • Elasticsearch master node 1: 192.168.1.40

  • Elasticsearch data node 1: 192.168.1.50

  • Elasticsearch data node 2: 192.168.1.51

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe...
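As a sketch of the setup that follows (the export path /mnt/shared-backup is an assumption; the addresses come from the node list above), the host exports a directory over NFS and every Elasticsearch node mounts it at the same path:

```shell
# On the host server 192.168.1.30: export the backup directory.
# Append to /etc/exports:
#   /mnt/shared-backup 192.168.1.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On each Elasticsearch node (192.168.1.40, .50, .51): mount the share.
sudo mkdir -p /mnt/shared-backup
sudo mount -t nfs 192.168.1.30:/mnt/shared-backup /mnt/shared-backup

# Finally, point every node's config/elasticsearch.yml at the shared path:
#   path.repo: /mnt/shared-backup
```

Every node must see the repository at the same mount point, otherwise snapshot verification fails on the nodes that cannot reach it.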

Reindexing from a remote cluster


The snapshot and restore APIs are very fast and the preferred way to back up data, but they have some limitations, such as:

  • The backup is a safe Lucene index copy, so it depends on the Elasticsearch version used. If you are moving from an Elasticsearch version prior to 5.x, it's not possible to restore the old indices.

  • It's not possible to restore backups of a newer Elasticsearch version in an older version. The restore is only forward-compatible.

  • It's not possible to restore partial data from a backup.

To be able to copy data in this scenario, the solution is to use the reindex API with a remote server.

Getting ready

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl via the command line, you need to install curl for your operating system.

How to do it...

To copy an index from a remote server, we need to execute the following...
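The step above is truncated in this excerpt; as a sketch, a remote reindex call could look like the following. The remote address 192.168.1.227:9200 and the index names test-source/test-dest are placeholders, and the local node must allow the remote host via reindex.remote.whitelist in config/elasticsearch.yml:

```shell
# Reindex body: pull test-source from the remote cluster into the local
# test-dest index. The local cluster fetches the documents itself.
REINDEX_BODY='{"source":{"remote":{"host":"http://192.168.1.227:9200"},"index":"test-source"},"dest":{"index":"test-dest"}}'

# The fallback echo keeps the sketch harmless when no cluster is running.
curl -XPOST "http://localhost:9200/_reindex?pretty" -d "$REINDEX_BODY" || echo "cluster not reachable"
```

Because the documents travel as JSON rather than as Lucene segments, this works across major versions where snapshot/restore does not.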
