Chapter 18. Big Data Integration

In this chapter, we will cover the following recipes:

  • Installing Apache Spark

  • Indexing data via Apache Spark

  • Indexing data with meta via Apache Spark

  • Reading data with Apache Spark

  • Reading data using SparkSQL

  • Indexing data with Apache Pig

Introduction


Elasticsearch has become a common component in big data architectures because it provides several features:

  • It allows searching on massive amounts of data in a very fast way

  • For common aggregation operations, it provides real-time analytics on big data

  • An Elasticsearch aggregation is easier to use than a Spark one

  • If you need to move on to a fast data solution, starting from a subset of documents after a query is faster than doing a full rescan of all your data

The most common big data software used for processing data is now Apache Spark (http://spark.apache.org/), which is considered the evolution of the now-obsolete Hadoop MapReduce, as it moves the processing from disk to memory.

In this chapter, we will see how to integrate Elasticsearch with Spark, both for writing and reading data. At the end, we will see how to use Apache Pig to write data to Elasticsearch in a simple way.

Installing Apache Spark


To use Apache Spark, we must first install it. The process is very easy, because its requirements are not the traditional Hadoop ones, which require Apache ZooKeeper and Hadoop HDFS.

Apache Spark is able to work in a standalone node installation, similar to an Elasticsearch one.

Getting ready

You need a Java virtual machine installed; generally, version 8.x or above is used.

How to do it...

To install Apache Spark, we will perform the following steps:

  1. We will download a binary distribution from http://spark.apache.org/downloads.html. For generic usage, I suggest you download a standard version via:

            wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
    
  2. Now we can extract the Spark distribution via:

            tar xfvz spark-2.1.0-bin-hadoop2.7.tgz
    
  3. Now, we can check whether Apache Spark is working by executing an example:

            cd spark-2.1.0-bin-hadoop2.7 
            ./bin/run-example SparkPi 
    
  4. The result will be similar...
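If the installation works, among Spark's log lines you should see the computed value printed; the exact digits vary from run to run, but the line should look roughly like this (the value below is illustrative):

            Pi is roughly 3.14155511457...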

Indexing data via Apache Spark


After having installed Apache Spark, we can configure it to work with Elasticsearch and write some data in it.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

You also need a working installation of Apache Spark.

How to do it...

To configure Apache Spark to communicate with Elasticsearch, we will perform the following steps:

  1. We need to download the Elasticsearch Spark JAR:

            wget http://download.elastic.co/hadoop/elasticsearch-hadoop-5.1.1.zip 
            unzip elasticsearch-hadoop-5.1.1.zip 
    
  2. A quick way to make Elasticsearch accessible from the Spark shell is to copy the required Elasticsearch Hadoop file into Spark's jars directory (see the copy command after this list). The file that must be copied is elasticsearch-spark-20_2.11-5.1.1.jar.

    The version of Scala used by both Apache Spark and Elasticsearch Spark must match!
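A possible copy command, assuming both archives were extracted into the current directory (the dist path inside the Elasticsearch Hadoop archive is an assumption based on its usual layout):

            cp elasticsearch-hadoop-5.1.1/dist/elasticsearch-spark-20_2.11-5.1.1.jar \
               spark-2.1.0-bin-hadoop2.7/jars/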

For storing data in Elasticsearch via Apache...
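As a minimal sketch of what writing data from the Spark shell looks like with the connector, assuming Elasticsearch runs on localhost and the shell was restarted so the JAR is on the classpath (the sample documents are invented for illustration; spark/persons is the index/type read back in the following recipes):

            import org.elasticsearch.spark._

            // Two illustrative documents expressed as simple Scala maps
            val persons = Seq(
              Map("username" -> "admin", "name" -> "Administrator", "age" -> 30),
              Map("username" -> "guest", "name" -> "Guest", "age" -> 20))

            // saveToEs comes from the org.elasticsearch.spark._ implicits and
            // writes every map as a JSON document into the spark/persons index
            sc.makeRDD(persons).saveToEs("spark/persons")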

Indexing data with meta via Apache Spark


Using a simple map for ingesting data is only suitable for simple jobs. The best practice in Spark is to use a case class, so that you have fast serialization and are able to manage complex type checking. During indexing, providing custom IDs can be very handy. In this recipe, we will see how to cover these issues.

Getting ready

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

You also need a working installation of Apache Spark.

How to do it...

To store data in Elasticsearch via Apache Spark, we will perform the following steps:

  1. We need to start the Spark shell:

            ./bin/spark-shell
    
  2. We will import the required classes:

            import org.apache.spark.SparkContext 
            import org.elasticsearch.spark.rdd.EsSpark         
    
  3. We will create a case class Person:

           case class Person(username:String, name:String, age:Int)
  4. We create...
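A minimal sketch of how the rest of the flow might look, assuming we want the username field to become the Elasticsearch document ID (the sample Person values are invented for illustration):

            // With saveToEsWithMeta, the key of each pair in the RDD is used
            // as the document ID in Elasticsearch
            val persons = Seq(
              Person("admin", "Administrator", 30),
              Person("guest", "Guest", 20))

            val rdd = sc.makeRDD(persons.map(p => (p.username, p)))
            EsSpark.saveToEsWithMeta(rdd, "spark/persons")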

Reading data with Apache Spark


In Spark, you can read data from a lot of sources, but in general, with NoSQL datastores such as HBase, Accumulo, and Cassandra, you have a limited query subset, and you often need to scan all the data to read only what is required. Using Elasticsearch, you can retrieve the subset of documents that matches your Elasticsearch query.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

You also need a working installation of Apache Spark and the data indexed in the previous example.

How to do it...

To read data from Elasticsearch via Apache Spark, we will perform the following steps:

  1. We need to start the Spark shell:

            ./bin/spark-shell
    
  2. We import the required classes:

            import org.elasticsearch.spark._         
    
  3. Now we can create an RDD by reading data from Elasticsearch:

            val rdd=sc.esRDD("spark/persons") 
    
  4. We can watch...
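A sketch of what inspecting the result might look like (the extra query parameter in the last line is an illustrative assumption; esRDD also accepts a query to restrict which documents are read):

            // esRDD yields (document ID, document source as a Map) pairs
            rdd.count()
            rdd.collect().foreach(println)

            // Optionally, pass a query so only matching documents are read
            val admins = sc.esRDD("spark/persons", "?q=username:admin")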

Reading data using SparkSQL


Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. The Elasticsearch Spark integration allows us to read data via SQL queries.

Note

Spark SQL works with structured data; in other words, all entries are expected to have the same structure (the same number of fields, of the same type and name). Using unstructured data (documents with different structures) is not supported and will cause problems.

Getting ready

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

You also need a working installation of Apache Spark and the data indexed in the Indexing data via Apache Spark recipe of this chapter.

How to do it...

To read data from Elasticsearch via Apache Spark SQL and via DataFrames, we will perform the following steps:

  1. We need to start the Spark shell...
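A minimal sketch of the DataFrame-based read, assuming the spark/persons data indexed earlier and a Spark 2.x shell, where the spark session object is predefined (the filter is illustrative):

            val df = spark.read
              .format("org.elasticsearch.spark.sql")
              .load("spark/persons")

            // The DataFrame schema is derived from the Elasticsearch mapping
            df.printSchema()

            // Standard DataFrame operations; filters are pushed down to
            // Elasticsearch where possible
            df.filter(df("age").gt(20)).show()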

Indexing data with Apache Pig


Apache Pig (https://pig.apache.org/) is a tool frequently used to store and manipulate data in datastores. It can be very handy if you need to import a CSV into Elasticsearch in a very fast way.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

You also need a working Pig installation. Depending on your operating system, you should follow the instructions at http://pig.apache.org/docs/r0.16.0/start.html.

If you are using Mac OS X with Homebrew, you can install it with brew install pig.

How to do it...

We want to read a CSV and write the data to Elasticsearch. We will perform the following steps:

  1. We will download a CSV dataset from the GeoNames site: all the GeoNames locations of Great Britain. We can quickly download and unzip them via:

            wget http://download.geonames.org/export/dump/GB.zip 
            unzip GB.zip 
    
  2. We can write es.pig that contains...
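As a hedged sketch of what such an es.pig script might contain (the declared field subset, the geoname/gb index name, and the JAR path are illustrative assumptions; the full GeoNames dump has 19 tab-separated columns):

            REGISTER elasticsearch-hadoop-5.1.1/dist/elasticsearch-hadoop-5.1.1.jar;

            -- Load the tab-separated GeoNames file; only the first columns
            -- are declared here for brevity
            geonames = LOAD 'GB.txt' USING PigStorage('\t')
                AS (geonameid:long, name:chararray, asciiname:chararray,
                    alternatenames:chararray, latitude:double, longitude:double);

            -- EsStorage ships with elasticsearch-hadoop and indexes each
            -- tuple as a document
            STORE geonames INTO 'geoname/gb'
                USING org.elasticsearch.hadoop.pig.EsStorage();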
