You're reading from Learning Apache Apex
1st Edition, published November 2017 | Reading level: Intermediate | ISBN-13: 9781788296403
Authors (5):
Thomas Weise

Thomas Weise is the Apache Apex PMC Chair and cofounder at Atrato. Earlier, he worked at a number of other technology companies in the San Francisco Bay Area, including DataTorrent, where he was a cofounder of the Apex project. Thomas is also a committer to Apache Beam and has contributed to several other ecosystem projects. He has been working on distributed systems for 20 years and has spoken at international big data conferences. Thomas received the degree of Diplom-Informatiker (MSc in computer science) from TU Dresden, Germany. He can be reached on Twitter at @thweise.

Ananth Gundabattula

Ananth is a senior application architect in the Decisioning and Advanced Analytics architecture team at the Commonwealth Bank of Australia (CBA). Ananth holds a PhD in computer science (security) and is interested in all things data, including low-latency distributed processing systems, machine learning, and data engineering. He holds three patents granted by the USPTO and has one application pending. Prior to joining CBA, he was an architect at Threatmetrix and a member of the core team that scaled the Threatmetrix architecture to 100 million transactions per day at very low latencies using Cassandra, ZooKeeper, and Kafka. He also migrated the Threatmetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Before Threatmetrix, he worked for the IBM software labs and IBM CIO labs, enabling some of the first IBM CIO projects to onboard the HBase, Hadoop, and Mahout stack. Ananth is a committer for Apache Apex and is currently working on the next-generation architectures for the CBA fraud platform and the Advanced Analytics Omnia platform at CBA.

Munagala V. Ramanath

Dr. Munagala V. Ramanath received his PhD in Computer Science from the University of Wisconsin, USA, and an MSc in Mathematics from Carleton University, Ottawa, Canada. He then taught computer science as an Assistant/Associate Professor at the University of Western Ontario in Canada for a few years before transitioning to the corporate sphere. Since then, he has worked as a senior software engineer at a number of technology companies in California, including SeeBeyond, EMC, Sun Microsystems, DataTorrent, and Cloudera. He has published papers in peer-reviewed journals in several areas, including code optimization, graph theory, and image processing.

David Yan

David Yan is based in Silicon Valley, California. He is a senior software engineer at Google. Prior to Google, he worked at DataTorrent, Yahoo!, and the Jet Propulsion Laboratory. David holds a Master of Science in Computer Science from Stanford University and a Bachelor of Science in Electrical Engineering and Computer Science from the University of California, Berkeley.

Kenneth Knowles

Kenneth Knowles is a founding PMC member of Apache Beam. Kenn has been working on Google Cloud Dataflow—Google's Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in...


Chapter 2. Getting Started with Application Development

The previous chapter introduced Apex concepts, use cases, and the application model. This chapter is about application development and guides the reader through the process of building and running a first application. The examples will be simple; more comprehensive applications will be covered in subsequent chapters.

In this chapter we will cover the following topics:

  • Development process and methodology
  • Setting up the development environment
  • Creating a new Maven project
  • Custom operator development
  • Testing within the IDE
  • Running the application on the cluster

Development process and methodology


Development of an Apex application starts with mapping the functional specification to operators (smaller functional building blocks), which can then be composed into a DAG to collectively provide the functionality required for the use case.

This involves identifying the data sources, formats, transformations, and sinks for the application, and finding matching operators in the Apex library (which will be covered in the next chapter). In most cases, the required connectors are available in the library, which supports common sources such as files and Kafka, along with many other external systems from the Apache big data ecosystem.

With the comprehensive operator library and a set of examples covering frequently used I/O cases and transformations, it is often possible to quickly assemble a preliminary end-to-end flow that covers a subset of the functionality, before building out the complete business logic in detail.

Note

Examples that show...

Setting up the development environment


Development of Apex applications requires a Java development environment with the following:

  • Java Development Kit (JDK): Apex applications are mostly written in Java, and Apex itself is implemented in Java. Other Java Virtual Machine (JVM) languages such as Scala can also be used, but this is outside the scope of this book.
  • Maven: Apex comes with a Maven archetype to bootstrap new projects, and the Apex project itself uses Maven as its build tool.

In addition to the above, it is recommended to have an IDE with Maven support such as IntelliJ or Eclipse. Apex provides code style settings for these IDEs that can optionally be used.

It is further recommended to have Git installed. Git is not required to build an application, but it is a convenient way to fetch the Apex source code and is especially useful for easily navigating the full operator library (apex-malhar) project within the IDE when working on operator customizations.
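
For example, the sources of the Apex engine and the operator library can be fetched with Git and then imported into the IDE as Maven projects:

git clone https://github.com/apache/apex-core.git
git clone https://github.com/apache/apex-malhar.git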

Note

For the latest details on...

Creating a new Maven project


Apex applications are packaged in a special ZIP file format that contains everything needed to launch the application on a cluster (dependency JARs, configuration files, and so on). It is roughly comparable to the uber JAR approach that some other frameworks employ, with the difference that dependencies in the Apex package remain individual JAR files rather than being flattened into a single JAR.

Note

More information about Apex application packages can be found at http://apex.apache.org/docs/apex/application_packages/#apache-apex-packages.
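
Since an application package is a plain ZIP file, its layout can be inspected directly; for example (the file name below is a placeholder for whatever mvn package produces in your project):

unzip -l target/myapexapp-1.0-SNAPSHOT.apa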

Setting up a new Maven project from scratch would be a rather involved task. The Apex application archetype simplifies the process by creating an application skeleton with the expected artifact structure. Here is an example of the Maven command (the groupId, artifactId, package, and version values are placeholders to replace with your own project coordinates):

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.apex \
  -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=RELEASE \
  -DgroupId=com.example -Dpackage=com.example.myapexapp \
  -DartifactId=myapexapp -Dversion=1.0-SNAPSHOT

Application specifications


Let's start by transforming this placeholder application into an application that counts words – the Hello World equivalent for big data processing frameworks. The functionality is easy to understand and not very important, as our focus here is on the development process.

The full source code of the modified application is available at https://github.com/tweise/apex-samples/tree/master/wordcount. Here is the modified application assembly in Application.java:

@Override
public void populateDAG(DAG dag, Configuration conf)
{
  LineByLineFileInputOperator lineReader = dag.addOperator("input",
      new LineByLineFileInputOperator());
  LineSplitter parser = dag.addOperator("parser", new LineSplitter());
  UniqueCounter counter = dag.addOperator("counter", new UniqueCounter());
  GenericFileOutputOperator<Object> output = dag.addOperator("output",
      new GenericFileOutputOperator<>());
  // ToStringConverter (defined in the sample) turns tuples into bytes for the file writer
  output.setConverter(new ToStringConverter());
  // wire the operators into a linear DAG: read -> split -> count -> write
  dag.addStream("lines", lineReader.output, parser.input);
  dag.addStream("words", parser.output, counter.data);
  dag.addStream("counts", counter.count, output.input);
}

Custom operator development


Our example application contains the LineSplitter operator, which is not part of the Apex library, so we will use it to illustrate the process of developing a custom operator.

Splitting a line into words is, of course, a simple stateless operation. Connectors and stateful transformations will be more involved, and there are many examples in the Apex library to look at for this.

Here is the line splitter:

public class LineSplitter extends BaseOperator
{
  // default pattern for word-separators
  private static final Pattern nonWordDefault = Pattern.compile("[\\p{Punct}\\s]+");

  private String nonWordStr;              // configurable regex
  private transient Pattern nonWord;      // compiled regex

  /**
   * Output port on which words from the current file are emitted
   */
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  /**
   * Input port on which lines from the current file are received
   */
  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String line)
    {
      // split the line and emit each non-empty word downstream
      for (String word : nonWord.split(line)) {
        if (!word.isEmpty()) {
          output.emit(word);
        }
      }
    }
  };

  @Override
  public void setup(Context.OperatorContext context)
  {
    // compile the configured pattern once, falling back to the default
    nonWord = (nonWordStr == null) ? nonWordDefault : Pattern.compile(nonWordStr);
  }

  // accessors that make the separator regex a configurable property
  public String getNonWordStr() { return nonWordStr; }
  public void setNonWordStr(String s) { nonWordStr = s; }
}
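
Before wiring it into the DAG, the operator can be exercised in isolation. The following is a minimal sketch (not from the book's sample code) that uses CollectorTestSink, a test utility from the Malhar library (com.datatorrent.lib.testbench), to capture the emitted tuples:

@Test
public void testLineSplitter()
{
  LineSplitter splitter = new LineSplitter();
  CollectorTestSink<Object> sink = new CollectorTestSink<>();
  splitter.output.setSink(sink);           // capture emitted words
  splitter.setup(null);                    // compiles the default pattern
  splitter.input.process("hello, world");  // feed a line directly into the port
  Assert.assertEquals(Arrays.asList("hello", "world"), sink.collectedTuples);
}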

Application configuration


In the previous sections, we saw how applications can be specified and how custom operators can be developed (with an example of a configurable property). Most operators have properties that need to be configured; for example, a file reader needs to be supplied with a directory path, and a Kafka consumer with the broker address and topic. Whoever deploys the application needs to know these properties and be able to supply values for them.

In addition to properties, which are directly related to the functionality of an operator, there is another category of settings called attributes, which control the behavior of the platform (as opposed to the functionality of operators); a short programmatic example follows the list of scopes below.

Attributes are defined for three different scopes:

  • Application: Platform behavior for the application as a whole, such as the streaming window interval, container JVM options, container heartbeat interval and timeout, and so on. See the complete list of attributes at https://ci.apache.org/projects/apex-core...
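
As a sketch (not from the book's sample code), an application-scoped attribute and an operator property can be set programmatically when assembling the DAG; in a deployed package, the same property would typically be supplied externally through the package's properties file, under a key such as dt.operator.parser.prop.nonWordStr:

@Override
public void populateDAG(DAG dag, Configuration conf)
{
  // application-scoped attribute: 500 ms streaming window interval
  dag.setAttribute(Context.DAGContext.STREAMING_WINDOW_SIZE_MILLIS, 500);

  LineSplitter parser = dag.addOperator("parser", new LineSplitter());
  // operator property: override the default word-separator regex
  parser.setNonWordStr("\\s+");
}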

Testing in the IDE


This section will show how the previously created example application can be configured and run as a JUnit test within the IDE. Setting up an integration test that can be executed after every change avoids a full package/deploy cycle on a cluster just to find basic issues. It allows for efficient debugging and will also come in handy when setting up continuous integration for the project.

Writing the integration test

The test covers the entire DAG and will run the application in embedded mode. In embedded mode, all operators and containers share the JUnit JVM. Containers are threads (instead of separate processes) but the data flow still behaves as if operators lived in separate processes. This means operators execute asynchronously as they would in a distributed cluster and data is transferred over the loopback interface (if that's how the streams are configured).

@Test
public void testApplication() throws Exception
{
  EmbeddedAppLauncher<?> launcher = Launcher.getLauncher(LaunchMode.EMBEDDED);
  Attribute.AttributeMap launchAttributes = new Attribute.AttributeMap.DefaultAttributeMap();
  // run asynchronously so the test can poll for the expected results
  launchAttributes.put(EmbeddedAppLauncher.RUN_ASYNC, true);
  Configuration conf = new Configuration(false);
  conf.set("dt.operator.input.prop.directory", "src/test/resources/data"); // illustrative test input path
  Launcher.AppHandle appHandle = launcher.launchApp(new Application(), conf, launchAttributes);
  // ... wait for the expected output files, then shut down ...
  appHandle.shutdown(Launcher.ShutdownMode.KILL);
}

Running the application on YARN


Once the application is functionally complete and passes its tests in embedded mode, it is time to take it for a test drive on the cluster. Compared to working within the IDE, execution in distributed mode requires a different approach and different tools for deployment, testing, and troubleshooting. In this section, we will introduce YARN as the execution layer and show how to set up and navigate the cluster for various tasks. Note that, as of release 3.6, Apex supports YARN as its cluster manager; support for other infrastructure is likely to follow in one of the next releases.
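
Assuming the application package was built with Maven, launching it on the cluster from the Apex CLI typically looks like the following (the package file name is a placeholder for your project's artifact):

mvn clean package -DskipTests
apex
apex> launch target/myapexapp-1.0-SNAPSHOT.apa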

Execution layer components

YARN (Yet Another Resource Negotiator) originated from an effort to separate resource management from the MapReduce application framework; the two were tightly coupled in the first version of Hadoop. Today, many of the big data processing frameworks, including Apache Spark, support YARN.

Note

The following blog is one of many resources that provide a good...

Working on the cluster


This section will cover some of the tools and techniques to monitor and debug the application in the distributed environment. We will also look at some of the options to apply changes to the application without rebuilding or packaging it. The tools we use in this section are standard components of Apex, Hadoop, and the operating system (nothing distribution or vendor specific).

Let's begin with some of the basic tools and commands that allow us to gather information. YARN provides a basic web interface for looking at information about running applications and container processes. The examples are based on the local Docker environment, which was discussed earlier. The tools are all standard and available with other cluster setups as well, although machine addresses and access may differ.
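
In addition to the web UI, the standard YARN command line can be used to gather the same information; for example (the application ID shown is a placeholder):

yarn application -list
yarn logs -applicationId application_1510000000000_0001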

YARN web UI

The following is the ResourceManager (RM) web UI. It provides information about the cluster as well as running and terminated applications. Here, there are two applications of type ApacheApex...

Summary


In this chapter, we introduced the end-to-end process of developing applications with Apex: how to create a new project, write a custom operator, assemble the DAG, run integration tests within the IDE, deploy the application package to the cluster, and navigate the distributed environment. The functionality was intentionally basic to keep the focus on the process.

Subsequent chapters will expand from here; we will cover the operators that are available in the Apex library, how to scale and tune applications, and how applications achieve fault tolerance with exactly-once processing guarantees, as well as provide comprehensive examples that put it all together.
