Learning Apache Apex


Product type: Book
Published in: Nov 2017
ISBN-13: 9781788296403
Pages: 290
Edition: 1st
Authors (5): Thomas Weise, Ananth Gundabattula, Munagala V. Ramanath, David Yan, Kenneth Knowles

Table of Contents (17 Chapters)

Title Page
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
Introduction to Apex
Getting Started with Application Development
The Apex Library
Scalability, Low Latency, and Performance
Fault Tolerance and Reliability
Example Project – Real-Time Aggregation and Visualization
Example Project – Real-Time Ride Service Data Processing
Example Project – ETL Using SQL
Introduction to Apache Beam
The Future of Stream Processing

Chapter 6. Example Project – Real-Time Aggregation and Visualization

In the previous chapters, we introduced the API for developing Apex applications and the library of connectors and transformations that are available out of the box. We also covered the scalability, performance, and fault tolerance of the Apex platform. This chapter is the first of three example project chapters dedicated to developing applications that are representative of scenarios and use cases readers may encounter in various verticals.

This chapter will cover the following topics:

  • Streaming ETL and beyond
  • The application pattern in a real-world use case
  • Analyzing Twitter feed
  • Running the application
  • The Pub/Sub server
  • Grafana visualization

Streaming ETL and beyond


This first application will be an example of processing live streaming data with windowing and real-time visualization. The data source will be Twitter; processing the tweet stream will compute the top hashtags in a time window, as well as some counts that can be visualized as time series. The pattern is applicable to many similar use cases: data is continuously consumed from a streaming source and aggregated. Traditionally, the results of such computation would land in a storage system (files, databases, and so on). Such processing can be broadly categorized as extract-transform-load (ETL) in a streaming fashion. However, the focus here will be on stream processing that goes beyond the realm of general-purpose ETL tools and can support streaming analytics use cases.
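To make the aggregation step concrete, the following is a minimal sketch in plain Java of the per-window computation the pipeline performs: extract hashtags from tweet text, count them within a window, and emit the top N when the window closes. This is only the core logic; in the actual application, operator wiring, windowing, and fault tolerance are handled by Apex, and the class and method names here are illustrative, not taken from the book's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopHashtags {
    // Counts accumulated for the current time window.
    private final Map<String, Integer> counts = new HashMap<>();

    // Called for every tweet that arrives within the current window.
    public void process(String tweetText) {
        for (String token : tweetText.split("\\s+")) {
            if (token.startsWith("#") && token.length() > 1) {
                counts.merge(token.toLowerCase(), 1, Integer::sum);
            }
        }
    }

    // Called when the window ends: return the n most frequent hashtags.
    public List<Map.Entry<String, Integer>> topN(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        TopHashtags agg = new TopHashtags();
        agg.process("Learning #Apex with #streaming data");
        agg.process("More on #streaming analytics");
        System.out.println(agg.topN(2)); // most frequent hashtag first
    }
}
```

In the real pipeline, `process` and `topN` correspond to the tuple-processing and end-of-window callbacks of an Apex operator, with the window boundaries driven by the platform rather than by the caller.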

Stream processing needs a source of data, so every pipeline will involve the E of ETL with connector(s) to extract or ingest data (with Kafka being a common streaming source and files for batch use cases)...

The application pattern in a real-world use case


One of the case studies mentioned in the introductory chapter was an ad-tech use case. The Apex-based system powering that use case went into production in 2014 and fits the same pattern as the example that we will build in this chapter. Summarized at a high level, the system consumes log records from Apache Kafka, aggregates these records into a dimensional model, and feeds the aggregates to a real-time dashboard that gives users actionable insights into the performance of advertising campaigns. Time to insight is critical, as the ability to make timely adjustments often translates to incremental revenue or reduced cost (or both) for the business:

The Apex-based implementation replaced a batch system that involved several hours of latency with a streaming pipeline that reduced latency to seconds (end-to-end, including data collection and ingestion). The dimension store holds the aggregated data in memory (conceptually,...

Analyzing Twitter feed


The example we are going to build will show how an application similar to the ad-tech use case can be built using the Apex library. Instead of processing a stream of ad-impression events from Kafka clusters, we will use a stream of tweets retrieved via the Twitter developer API. Instead of a dimensional model, we will compute windowed aggregates for selected metrics. The data visualization will take advantage of Grafana with a Pub/Sub plugin instead of a custom portal and frontend server. The goal will be to introduce the relevant building blocks and enable the reader to derive a similar application for their domain or use case.

The following sections will walk through the application functionality, its components, and a few selected implementation details. The full code, ready to run, is available at the following link: https://github.com/tweise/apex-samples/tree/master/twitter.

The following is the DAG of the example application:

The input operator reads tweets from...

Running the application


This section assumes that you have already set up the development environment as explained in Chapter 2, Getting Started with Application Development. All components of the Twitter example can run on the host OS. There is no need for a Hadoop cluster, although it would also be possible to run the Apex application in a Docker container. We will instead run it as a JUnit test, as this makes it easier to modify and experiment with.

  1. Check out the code using the following command:
git clone https://github.com/tweise/apex-samples.git
  2. Then, import the Twitter project into your IDE and run the JUnit test:
TwitterStatsAppTest.testApplication

Alternatively, you can run it from the command line:

cd twitter; mvn test -Dtest=TwitterStatsAppTest

By default, the test runs the application with a file source of sample tweets (instead of connecting to the Twitter API) and writes results to the console (instead of a WebSocket).

  3. To configure the application for live input and visualization of results...

The Pub/Sub server


Once the WebSocket output is enabled and the application is running, you will see connection exceptions from attempts to reach the Pub/Sub server (java.net.ConnectException: Connection refused: localhost/127.0.0.1:8890). To launch the server, open a new terminal window and run the following commands to fetch the source code and start the server:

git clone https://github.com/atrato/pubsub-server.git 
cd pubsub-server 
mvn compile exec:java 

The first log lines on the console will show the listening address of the server:

[INFO] --- exec-maven-plugin:1.5.0:java (default-cli) @ atrato-pubsub-server ---
 2017-06-16 14:18:29,408 [io.atrato.pubsubserver.PubsubServer.main()] INFO  util.log initialized - Logging initialized @4190ms
 2017-06-16 14:18:29,487 [io.atrato.pubsubserver.PubsubServer.main()] INFO  server.Server doStart - jetty-9.1.6.v20160112
 2017-06-16 14:18:30,055 [io.atrato.pubsubserver.PubsubServer.main()] INFO  handler.ContextHandler doStart - Started o.e.j.s.ServletContextHandler...
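With the server running, clients exchange small JSON messages with it over WebSocket: a subscriber registers interest in a topic, and the application publishes data to that topic. The sketch below builds such messages with plain strings; the field names (`type`, `topic`, `data`) follow the Apex Pub/Sub message format but should be treated as an assumption here, so check the pubsub-server sources for the authoritative protocol.

```java
public class PubSubMessages {
    // Escape backslashes and quotes so topic names are safe inside JSON strings.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Message a client sends to start receiving data for a topic.
    public static String subscribe(String topic) {
        return String.format("{\"type\":\"subscribe\",\"topic\":\"%s\"}", escape(topic));
    }

    // Message the application sends to push a payload (already JSON) to a topic.
    public static String publish(String topic, String jsonData) {
        return String.format("{\"type\":\"publish\",\"topic\":\"%s\",\"data\":%s}",
                escape(topic), jsonData);
    }

    public static void main(String[] args) {
        System.out.println(subscribe("twitter/topHashtags"));
        System.out.println(publish("twitter/topHashtags", "{\"#apex\":42}"));
    }
}
```

A real client would send these frames over a WebSocket connection to the listening address shown in the log above (port 8890 by default in this example).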

Grafana visualization


Grafana is an open source metric visualization suite that can be used to build dashboards from various widgets and data sources. Grafana is traditionally used for visualizing time series data for infrastructure and application monitoring, but it is generic and can be used with virtually any data source that produces time series or tabular data. Thanks to its monitoring orientation, Grafana is well suited to low-latency visualization of frequently changing data. We will use this capability to display the top hashtags and counts, and have them updated as changes occur in the application (with perhaps one to two seconds of end-to-end latency).

In order to use Grafana, we need to tap into the data that is available from the Pub/Sub server. The Pub/Sub server has an HTTP interface, but the data format of the queryable state protocol of the Apex application needs to be adapted to what Grafana requires.

There are two options to accomplish this:

  • Implementing a new datasource that...
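Whichever option is chosen, the adaptation boils down to reshaping the application's output into the response format a Grafana JSON-style datasource (such as the SimpleJson plugin) expects for time series queries: a list of `{"target": ..., "datapoints": [[value, timestampMs], ...]}` objects. The following is a minimal sketch of that conversion; the class and method names are illustrative, not from the book's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GrafanaAdapter {
    // Convert (timestampMs -> value) samples into a Grafana time series response.
    public static String toTimeSeries(String target, Map<Long, Number> samples) {
        StringBuilder sb = new StringBuilder();
        sb.append("[{\"target\":\"").append(target).append("\",\"datapoints\":[");
        boolean first = true;
        for (Map.Entry<Long, Number> e : samples.entrySet()) {
            if (!first) sb.append(",");
            // Grafana expects each datapoint as [value, timestampInMilliseconds]
            sb.append("[").append(e.getValue()).append(",").append(e.getKey()).append("]");
            first = false;
        }
        return sb.append("]}]").toString();
    }

    public static void main(String[] args) {
        Map<Long, Number> samples = new LinkedHashMap<>();
        samples.put(1497650000000L, 12);
        samples.put(1497650060000L, 15);
        System.out.println(toTimeSeries("tweet-count", samples));
        // → [{"target":"tweet-count","datapoints":[[12,1497650000000],[15,1497650060000]]}]
    }
}
```

In practice, this conversion would sit in the proxy or datasource layer between Grafana and the Pub/Sub server, translating each query into a subscription and each update into a response of this shape.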

Summary


The project that we developed in this chapter is an example of streaming analytics. The incoming tweet stream is processed to compute aggregates, which are visualized in real time on a Grafana dashboard. It shows how continuously generated data (in this case, tweets) can be analyzed and used to generate immediate insights. We have seen how existing building blocks from the Apex library (connectors, windowing) are used to accelerate application development, and how integration with other infrastructure for data visualization can be accomplished.

The pipeline pattern is broadly applicable. Similar to the introductory ad-tech use case, it can be applied to other domains with data streams such as mobile, sensor, or financial transaction data. Instead of simple functionality (top words and counters), real-world applications may perform sentiment analysis, fraud detection, device health monitoring, and other complex processing.

The next chapter will go into more depth with windowing and...
