Learning Apache Apex


Product type: Book
Published in: Nov 2017
ISBN-13: 9781788296403
Pages: 290
Edition: 1st
Authors (5): Thomas Weise, Ananth Gundabattula, Munagala V. Ramanath, David Yan, Kenneth Knowles

Table of Contents (17 Chapters)

Title Page
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
Introduction to Apex
Getting Started with Application Development
The Apex Library
Scalability, Low Latency, and Performance
Fault Tolerance and Reliability
Example Project – Real-Time Aggregation and Visualization
Example Project – Real-Time Ride Service Data Processing
Example Project – ETL Using SQL
Introduction to Apache Beam
The Future of Stream Processing

Chapter 6. Example Project – Real-Time Aggregation and Visualization

In the previous chapters, we introduced the API for developing Apex applications and the library of connectors and transformations that are available out of the box. We also covered the scalability, performance, and fault tolerance of the Apex platform. This chapter is the first of three example project chapters dedicated to developing applications that are representative of scenarios and use cases readers may encounter in various verticals.

This chapter will cover the following topics:

  • Streaming ETL and beyond
  • The application pattern in a real-world use case
  • Analyzing Twitter feed
  • Running the application
  • The Pub/Sub server
  • Grafana visualization

Streaming ETL and beyond


This first application will be an example of processing live streaming data with windowing and real-time visualization. The data source will be Twitter; processing the tweet stream will compute the top hashtags in a time window, as well as some counts that can be visualized as time series. The pattern is applicable to many similar use cases: data is continuously consumed from a streaming source and aggregated. Traditionally, the results of such computation would land in a storage system (files, databases, and so on). Such processing can be broadly categorized as extract-transform-load (ETL) in a streaming fashion. However, the focus here will be on stream processing that goes beyond the realm of general-purpose ETL tools and can support streaming analytics use cases.
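To make the aggregation step concrete, the following is a minimal sketch in plain Java of the per-window computation the pipeline performs: extract hashtags from tweet text, count them within a window, and emit the top N when the window closes. This is only the core logic; in the actual application, operator wiring, windowing, and fault tolerance are handled by Apex, and the class and method names here are illustrative, not taken from the book's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopHashtags {
    // Counts accumulated for the current time window.
    private final Map<String, Integer> counts = new HashMap<>();

    // Called for every tweet that arrives within the current window.
    public void process(String tweetText) {
        for (String token : tweetText.split("\\s+")) {
            if (token.startsWith("#") && token.length() > 1) {
                counts.merge(token.toLowerCase(), 1, Integer::sum);
            }
        }
    }

    // Called when the window ends: return the n most frequent hashtags.
    public List<Map.Entry<String, Integer>> topN(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        TopHashtags agg = new TopHashtags();
        agg.process("Learning #Apex with #streaming data");
        agg.process("More on #streaming analytics");
        System.out.println(agg.topN(2)); // most frequent hashtag first
    }
}
```

In the real pipeline, `process` and `topN` correspond to the tuple-processing and end-of-window callbacks of an Apex operator, with the window boundaries driven by the platform rather than by the caller.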

Stream processing needs a source of data, so every pipeline will involve the E of ETL with connector(s) to extract or ingest data (with Kafka being a common streaming source and files for batch use cases)...

The application pattern in a real-world use case


One of the case studies mentioned in the introductory chapter was an ad-tech use case. The Apex-based system powering that use case went into production in 2014 and fits the same pattern as the example that we will build in this chapter. Summarized at a high level, the system consumes log records from Apache Kafka, aggregates these records into a dimensional model, and feeds the aggregates to a real-time dashboard that gives users actionable insights into the performance of advertising campaigns. Time to insight is critical, as the ability to make timely adjustments often translates to incremental revenue or reduced cost (or both) for the business:

The Apex-based implementation replaced a batch system that involved several hours of latency with a streaming pipeline that reduced latency to seconds (end-to-end, including data collection and ingestion). The dimension store holds the aggregated data in memory (conceptually,...

Analyzing Twitter feed


The example we are going to build will show how an application similar to the ad-tech use case can be built using the Apex library. Instead of processing a stream of ad-impression events from Kafka clusters, we will use a stream of tweets retrieved via the Twitter developer API. Instead of a dimensional model, we will compute windowed aggregates for selected metrics. The data visualization will take advantage of Grafana with a Pub/Sub plugin instead of a custom portal and frontend server. The goal will be to introduce the relevant building blocks and enable the reader to derive a similar application for their domain or use case.

The following sections will walk through the application functionality, its components, and a few selected implementation details. The full code, ready to run, is available at the following link: https://github.com/tweise/apex-samples/tree/master/twitter.

The following is the DAG of the example application:

The input operator reads tweets from...

Running the application


This section assumes that you have already set up the development environment as explained in Chapter 2, Getting Started with Application Development. All components of the Twitter example can run on the host OS. There is no need for a Hadoop cluster, although it would also be possible to run the Apex application in a Docker container. We will instead run it as a JUnit test, as this makes it easier to modify and experiment with.

  1. Check out the code using the following command:
git clone https://github.com/tweise/apex-samples.git
  2. Then, import the Twitter project into your IDE and run the JUnit test:
TwitterStatsAppTest.testApplication

Alternatively, you can run it from the command line:

cd twitter; mvn test -Dtest=TwitterStatsAppTest

By default, the test runs the application with a file source of sample tweets (instead of connecting to the Twitter API) and writes results to the console (instead of a WebSocket).

  3. To configure the application for live input and visualization of results...

The Pub/Sub server


Once the WebSocket output is enabled and the application is running, you will see connection exceptions from attempts to reach the Pub/Sub server (java.net.ConnectException: Connection refused: localhost/127.0.0.1:8890). To launch the server, open a new terminal window and run the following commands to fetch the source code and start the server:

git clone https://github.com/atrato/pubsub-server.git 
cd pubsub-server 
mvn compile exec:java 

The first log lines on the console will show the listening address of the server:

[INFO] --- exec-maven-plugin:1.5.0:java (default-cli) @ atrato-pubsub-server ---
 2017-06-16 14:18:29,408 [io.atrato.pubsubserver.PubsubServer.main()] INFO  util.log initialized - Logging initialized @4190ms
 2017-06-16 14:18:29,487 [io.atrato.pubsubserver.PubsubServer.main()] INFO  server.Server doStart - jetty-9.1.6.v20160112
 2017-06-16 14:18:30,055 [io.atrato.pubsubserver.PubsubServer.main()] INFO  handler.ContextHandler doStart - Started o.e.j.s.ServletContextHandler...
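With the server running, clients exchange small JSON messages with it over WebSocket: a subscriber registers interest in a topic, and the application publishes data to that topic. The sketch below builds such messages with plain strings; the field names (`type`, `topic`, `data`) follow the Apex Pub/Sub message format but should be treated as an assumption here, so check the pubsub-server sources for the authoritative protocol.

```java
public class PubSubMessages {
    // Escape backslashes and quotes so topic names are safe inside JSON strings.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Message a client sends to start receiving data for a topic.
    public static String subscribe(String topic) {
        return String.format("{\"type\":\"subscribe\",\"topic\":\"%s\"}", escape(topic));
    }

    // Message the application sends to push a payload (already JSON) to a topic.
    public static String publish(String topic, String jsonData) {
        return String.format("{\"type\":\"publish\",\"topic\":\"%s\",\"data\":%s}",
                escape(topic), jsonData);
    }

    public static void main(String[] args) {
        System.out.println(subscribe("twitter/topHashtags"));
        System.out.println(publish("twitter/topHashtags", "{\"#apex\":42}"));
    }
}
```

A real client would send these frames over a WebSocket connection to the listening address shown in the log above (port 8890 by default in this example).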

Grafana visualization


Grafana is an open source metric visualization suite that can be used to build dashboards from various widgets and data sources. Grafana is traditionally used for visualizing time series data for infrastructure and application monitoring, but it is generic and can be used with virtually any data source that produces time series or tabular data. Thanks to its monitoring orientation, Grafana is well suited to low-latency visualization of frequently changing data. We will use this capability to display the top hashtags and counts, and have them updated as changes occur in the application (with perhaps one to two seconds of end-to-end latency).

In order to use Grafana, we need to tap into the data that is available from the Pub/Sub server. The Pub/Sub server has an HTTP interface, but the data format of the queryable state protocol of the Apex application needs to be adapted to what Grafana requires.

There are two options to accomplish this:

  • Implementing a new datasource that...
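Whichever option is chosen, the adaptation boils down to reshaping the application's output into the response format a Grafana JSON-style datasource (such as the SimpleJson plugin) expects for time series queries: a list of `{"target": ..., "datapoints": [[value, timestampMs], ...]}` objects. The following is a minimal sketch of that conversion; the class and method names are illustrative, not from the book's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GrafanaAdapter {
    // Convert (timestampMs -> value) samples into a Grafana time series response.
    public static String toTimeSeries(String target, Map<Long, Number> samples) {
        StringBuilder sb = new StringBuilder();
        sb.append("[{\"target\":\"").append(target).append("\",\"datapoints\":[");
        boolean first = true;
        for (Map.Entry<Long, Number> e : samples.entrySet()) {
            if (!first) sb.append(",");
            // Grafana expects each datapoint as [value, timestampInMilliseconds]
            sb.append("[").append(e.getValue()).append(",").append(e.getKey()).append("]");
            first = false;
        }
        return sb.append("]}]").toString();
    }

    public static void main(String[] args) {
        Map<Long, Number> samples = new LinkedHashMap<>();
        samples.put(1497650000000L, 12);
        samples.put(1497650060000L, 15);
        System.out.println(toTimeSeries("tweet-count", samples));
        // → [{"target":"tweet-count","datapoints":[[12,1497650000000],[15,1497650060000]]}]
    }
}
```

In practice, this conversion would sit in the proxy or datasource layer between Grafana and the Pub/Sub server, translating each query into a subscription and each update into a response of this shape.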

Summary


The project that we developed in this chapter is an example of streaming analytics. The incoming tweet stream is processed to compute aggregates, which are visualized in real time on a Grafana dashboard. It shows how continuously generated data (in this case, tweets) can be analyzed and used to generate immediate insights. We have seen how existing building blocks from the Apex library (connectors, windowing) are used to accelerate application development, and how integration with other infrastructure for data visualization can be accomplished.

The pipeline pattern is broadly applicable. Similar to the introductory ad-tech use case, it can be applied to other domains with data streams such as mobile, sensor, or financial transaction data. Instead of simple functionality (top words and counters), real-world applications may perform sentiment analysis, fraud detection, device health monitoring, and other complex processing.

The next chapter will go into more depth with windowing and...
