
You're reading from Advanced Elasticsearch 7.0

Product type: Book
Published in: Aug 2019
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789957754
Edition: 1st Edition
Author: Wai Tak Wong

Wai Tak Wong is a faculty member in the Department of Computer Science at Kean University, NJ, USA. He has more than 15 years of professional experience in cloud software design and development. He obtained his PhD in computer science at NJIT, NJ, USA. Wai Tak has served as an associate professor in the Information Management Department of Chung Hua University, Taiwan. A co-founder of Shanghai Shellshellfish Information Technology, Wai Tak acted as the Chief Scientist of the R&D team, and he has published more than a dozen algorithms in prestigious journals and conferences. Wai Tak began his search and analytics technology career with Elasticsearch in the real estate market and later applied this to data management and FinTech data services.

Spark and Elasticsearch for Real-Time Analytics

In the previous chapter, we looked at the machine learning feature of the Elastic Stack. We used a single metric job in Kibana to detect anomalies in one-dimensional data (the volume field of the cf_rfem_hist_price index). We also introduced the scikit-learn Python package and performed the same anomaly detection in Python, this time on three-dimensional data (adding two more fields: changePercent and changeOverTime).

In this chapter, we will look at another advanced feature, Elasticsearch for Apache Hadoop (ES-Hadoop). ES-Hadoop covers two major areas. The first is the integration of Elasticsearch with Hadoop distributed computing environments, such as Apache Spark, Apache Storm, and Hive. The second is the integration of Elasticsearch to use the Hadoop filesystem...

Overview of ES-Hadoop

As we mentioned, ES-Hadoop covers two major areas: distributed computing and distributed storage. Its main goal is to seamlessly connect Elasticsearch and Hadoop so that each benefits from the other's strengths in distributed computing, distributed storage, searching, analytics, visualization, and more. We can import Hadoop Distributed File System (HDFS) data into Elasticsearch for search and analysis, and export Elasticsearch data to HDFS for snapshot and restore. ES-Hadoop supports the main processing frameworks in the Hadoop ecosystem, including Spark, Hive, Pig, Storm, Cascading, and standard MapReduce. Let's take a look at the data flow between Elasticsearch, ES-Hadoop, and components in the Hadoop ecosystem, as shown in the following figure:

In short, we can think of ES-Hadoop as a data bridge between Elasticsearch and the Hadoop big data ecosystem...
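To make the data-bridge idea concrete, here is a minimal sketch of how a Spark job might point the ES-Hadoop connector at an index. The es.nodes, es.port, and es.resource settings and the org.elasticsearch.spark.sql data source name come from the ES-Hadoop connector. The helper names (es_options, read_index, write_index) and the localhost defaults are assumptions made for illustration, and the sketch also assumes the connector JAR is already on the Spark classpath.

```python
def es_options(index, host="localhost", port=9200):
    """Build ES-Hadoop connector options for one index.

    The host and port defaults are assumptions for a local test cluster.
    """
    return {
        "es.nodes": host,       # Elasticsearch node(s) to connect to
        "es.port": str(port),   # connector options are passed as strings
        "es.resource": index,   # index to read from or write to
    }


def read_index(spark, index, **kwargs):
    """Load an Elasticsearch index as a Spark DataFrame.

    `spark` is expected to be an existing SparkSession.
    """
    reader = spark.read.format("org.elasticsearch.spark.sql")
    return reader.options(**es_options(index, **kwargs)).load()


def write_index(df, index, **kwargs):
    """Append the rows of a Spark DataFrame to an Elasticsearch index."""
    writer = df.write.format("org.elasticsearch.spark.sql")
    writer.options(**es_options(index, **kwargs)).mode("append").save()
```

With a running cluster, a call such as read_index(spark, "cf_rfem_hist_price") would return the index from the previous chapter as a DataFrame, ready for Spark SQL queries.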

Apache Spark support

Apache Spark is one of the most popular big data tools. It is a second-generation computing engine that works with Hadoop as an alternative to MapReduce. It provides in-memory computing capabilities to achieve high-performance analytics. The major components in Spark include Spark SQL, Spark Streaming, SparkR, Machine Learning Library (MLlib), and GraphX. Spark is built on the Scala programming language and also supports APIs for Java, Python, and R. The following diagram depicts the ecosystem of Spark:

Spark provides a hybrid processing framework, which means it supports both batch processing and stream processing. Let's look at these brief descriptions of each type of processing:

  • Batch processing: This usually applies to blocks of data that have been stored for a period of time, and the processing takes a long time to complete. Spark handles all...
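The difference between the two processing styles can be illustrated without Spark at all. The following plain-Python sketch is not Spark code; it only contrasts computing a result over data that is already stored (batch) with updating a result as each record arrives (stream). The names batch_mean and RunningMean are hypothetical.

```python
from statistics import mean


def batch_mean(stored_values):
    """Batch style: the whole dataset is available before processing starts."""
    return mean(stored_values)


class RunningMean:
    """Stream style: the result is updated as each new value arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Fold one new value into the mean and return the current result."""
        self.count += 1
        self.total += value
        return self.total / self.count


if __name__ == "__main__":
    print(batch_mean([10, 20, 30]))

    stream = RunningMean()
    for value in [10, 20, 30]:
        print(stream.update(value))
```

Both styles arrive at the same final answer for the same data; the stream version simply has an up-to-date answer available after every record.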

Real-time analytics using Elasticsearch and Apache Spark

We will use the same anomaly detection example from the previous chapter. Instead of using the simple k-means clustering library function provided by scikit-learn, we will use ES-Hadoop, Spark SQL, and Spark MLlib to solve the problem. Our GitHub download site (https://github.com/PacktPublishing/Mastering-Elasticsearch-7.0/tree/master/Chapter17/eshadoop) has a project that uses a Python 3.6 virtual environment (virtualenv) to create an isolated environment, so that the project keeps its own dependencies in its own site packages. The project is self-contained and can be run in its own working environment. The following section is a reference for you if you're interested in building a virtual environment.
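As a rough idea of what the Spark MLlib side of such a project can look like, here is a minimal sketch that trains a k-means model on the three fields used in the previous chapter. It is not the code from the GitHub project: the function name train_kmeans and the defaults k=2 and seed=1 are assumptions, and the input DataFrame is assumed to have already been loaded from Elasticsearch with numeric, non-null values in those fields.

```python
# Fields named in the previous chapter's three-dimensional example.
FEATURE_COLUMNS = ["volume", "changePercent", "changeOverTime"]


def train_kmeans(df, k=2, seed=1):
    """Fit a k-means model on FEATURE_COLUMNS of a Spark DataFrame.

    Returns the fitted model and the DataFrame with an added
    "prediction" column holding each row's cluster index.
    """
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # MLlib estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
    features = assembler.transform(df)

    model = KMeans(k=k, seed=seed, featuresCol="features").fit(features)
    return model, model.transform(features)
```

The fitted model exposes its centroids through model.clusterCenters(), which a later anomaly check can compare new data points against.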

...

Summary

Unbelievable! We have completed the study of Spark and Elasticsearch for real-time analytics via ES-Hadoop. We started with the basic concepts of Apache Hadoop. We learned how to configure ES-Hadoop for Apache Spark support. We read the data from Elasticsearch, processed it, and then wrote it back to Elasticsearch. We learned about the find_anomalies() function, a real-time anomaly detection routine based on a k-means model that was created from past data using Spark MLlib. It tells you whether the input data is an anomaly.
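One common way to turn a k-means model into an anomaly check is to flag points that lie far from every centroid. The sketch below shows that idea in plain Python. It is not the book's find_anomalies() implementation, which is not reproduced here; the names nearest_centroid_distance and is_anomaly, and the idea of a fixed distance threshold, are assumptions made for illustration.

```python
import math


def nearest_centroid_distance(point, centroids):
    """Return the Euclidean distance from point to its closest centroid."""
    return min(
        math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
        for centroid in centroids
    )


def is_anomaly(point, centroids, threshold):
    """Flag a point whose distance to every centroid exceeds the threshold."""
    return nearest_centroid_distance(point, centroids) > threshold


if __name__ == "__main__":
    # Hypothetical centroids in (volume, changePercent, changeOverTime) space.
    centroids = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
    print(is_anomaly((1.0, 0.0, 0.0), centroids, threshold=2.0))
    print(is_anomaly((5.0, 5.0, 5.0), centroids, threshold=2.0))
```

In practice, the threshold would be chosen from the distances observed in the past data, and the features would usually be scaled first so that no single field dominates the distance.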

The next chapter is the final chapter of this book. We will use Spring Boot to build a RESTful API to provide search and analytics backed by Elasticsearch. We will revisit what we have learned before and glue it together to make a real-world use case project. Finally, we will visualize the results produced by the project by using...
