Storm Blueprints: Patterns for Distributed Real-time Computation


eBook: $25.50 (list price $29.99; save 15%)
Formats: PDF, PacktLib, ePub, and Mobi
Print + free eBook + free PacktLib access to the book: $75.49 (list price $79.98; save 6%)
Print cover price: $49.99
Free shipping to the UK, US, Europe, and selected countries in Asia.
  • Process high-volume log files in real time while learning the fundamentals of Storm topologies and system deployment.
  • Deploy Storm on Hadoop (YARN) and understand how the systems complement each other for online advertising and trade processing.
  • Follow along as each chapter presents a new problem and the architectural pattern, design, and implementation of a solution.

Book Details

Language : English
Paperback : 336 pages [ 235mm x 191mm ]
Release Date : March 2014
ISBN : 178216829X
ISBN 13 : 9781782168294
Author(s) : P. Taylor Goetz, Brian O'Neill
Topics and Technologies : All Books, Big Data and Business Intelligence, Open Source


Table of Contents

Preface
Chapter 1: Distributed Word Count
Chapter 2: Configuring Storm Clusters
Chapter 3: Trident Topologies and Sensor Data
Chapter 4: Real-time Trend Analysis
Chapter 5: Real-time Graph Analysis
Chapter 6: Artificial Intelligence
Chapter 7: Integrating Druid for Financial Analytics
Chapter 8: Natural Language Processing
Chapter 9: Deploying Storm on Hadoop for Advertising Analysis
Chapter 10: Storm in the Cloud
Index

  • Chapter 1: Distributed Word Count
    • Introducing elements of a Storm topology – streams, spouts, and bolts
      • Streams
      • Spouts
      • Bolts
    • Introducing the word count topology data flow
      • Sentence spout
        • Introducing the split sentence bolt
        • Introducing the word count bolt
        • Introducing the report bolt
    • Implementing the word count topology
      • Setting up a development environment
      • Implementing the sentence spout
      • Implementing the split sentence bolt
      • Implementing the word count bolt
      • Implementing the report bolt
      • Implementing the word count topology
    • Introducing parallelism in Storm
      • WordCountTopology parallelism
        • Adding workers to a topology
        • Configuring executors and tasks
    • Understanding stream groupings
    • Guaranteed processing
      • Reliability in spouts
      • Reliability in bolts
      • Reliable word count
    • Summary
  • Chapter 2: Configuring Storm Clusters
    • Introducing the anatomy of a Storm cluster
      • Understanding the nimbus daemon
      • Working with the supervisor daemon
      • Introducing Apache ZooKeeper
      • Working with Storm's DRPC server
      • Introducing the Storm UI
    • Introducing the Storm technology stack
      • Java and Clojure
      • Python
    • Installing Storm on Linux
      • Installing the base operating system
      • Installing Java
      • ZooKeeper installation
      • Storm installation
      • Running the Storm daemons
      • Configuring Storm
      • Mandatory settings
      • Optional settings
      • The Storm executable
      • Setting up the Storm executable on a workstation
      • The daemon commands
        • Nimbus
        • Supervisor
        • UI
        • DRPC
      • The management commands
        • Jar
        • Kill
        • Deactivate
        • Activate
        • Rebalance
        • Remoteconfvalue
      • Local debug/development commands
        • REPL
        • Classpath
        • Localconfvalue
    • Submitting topologies to a Storm cluster
    • Automating the cluster configuration
    • A rapid introduction to Puppet
      • Puppet manifests
      • Puppet classes and modules
      • Puppet templates
      • Managing environments with Puppet Hiera
      • Introducing Hiera
    • Summary
  • Chapter 3: Trident Topologies and Sensor Data
    • Examining our use case
    • Introducing Trident topologies
    • Introducing Trident spouts
    • Introducing Trident operations – filters and functions
      • Introducing Trident filters
      • Introducing Trident functions
    • Introducing Trident aggregators – Combiners and Reducers
      • CombinerAggregator
      • ReducerAggregator
      • Aggregator
    • Introducing the Trident state
      • The Repeat Transactional state
      • The Opaque state
    • Executing the topology
    • Summary
  • Chapter 4: Real-time Trend Analysis
    • Use case
    • Architecture
      • The source application
      • The logback Kafka appender
      • Apache Kafka
      • Kafka spout
      • The XMPP server
    • Installing the required software
      • Installing Kafka
      • Installing OpenFire
    • Introducing the sample application
      • Sending log messages to Kafka
    • Introducing the log analysis topology
      • Kafka spout
      • The JSON project function
      • Calculating a moving average
      • Adding a sliding window
      • Implementing the moving average function
      • Filtering on thresholds
      • Sending notifications with XMPP
    • The final topology
    • Running the log analysis topology
    • Summary
  • Chapter 5: Real-time Graph Analysis
    • Use case
    • Architecture
      • The Twitter client
      • Kafka spout
      • A titan-distributed graph database
    • A brief introduction to graph databases
      • Accessing the graph – the TinkerPop stack
      • Manipulating the graph with the Blueprints API
      • Manipulating the graph with the Gremlin shell
    • Software installation
      • Titan installation
    • Setting up Titan to use the Cassandra storage backend
      • Installing Cassandra
      • Starting Titan with the Cassandra backend
    • Graph data model
    • Connecting to the Twitter stream
      • Setting up the Twitter4J client
      • The OAuth configuration
        • The TwitterStreamConsumer class
        • The TwitterStatusListener class
    • Twitter graph topology
      • The JSONProjectFunction class
    • Implementing GraphState
      • GraphFactory
      • GraphTupleProcessor
      • GraphStateFactory
      • GraphState
      • GraphUpdater
    • Implementing GraphFactory
    • Implementing GraphTupleProcessor
    • Putting it all together – the TwitterGraphTopology class
      • The TwitterGraphTopology class
    • Querying the graph with Gremlin
    • Summary
  • Chapter 6: Artificial Intelligence
    • Designing for our use case
    • Establishing the architecture
      • Examining the design challenges
      • Implementing the recursion
        • Accessing the function's return values
        • Immutable tuple field values
        • Upfront field declaration
        • Tuple acknowledgement in recursion
        • Output to multiple streams
        • Read-before-write
      • Solving the challenges
    • Implementing the architecture
      • The data model
      • Examining the recursive topology
      • The queue interaction
      • Functions and filters
      • Examining the Scoring Topology
        • Addressing read-before-write
        • Enumerating the game tree
      • Distributed Remote Procedure Call (DRPC)
        • Remote deployment
    • Summary
  • Chapter 7: Integrating Druid for Financial Analytics
    • Use case
    • Integrating a non-transactional system
    • The topology
      • The spout
      • The filter
      • The state design
    • Implementing the architecture
      • DruidState
      • Implementing the StormFirehose object
      • Implementing the partition status in ZooKeeper
    • Executing the implementation
    • Examining the analytics
    • Summary
  • Chapter 8: Natural Language Processing
    • Motivating a Lambda architecture
    • Examining our use case
    • Realizing a Lambda architecture
    • Designing the topology for our use case
    • Implementing the design
      • TwitterSpout/TweetEmitter
      • Functions
        • TweetSplitterFunction
        • WordFrequencyFunction
        • PersistenceFunction
    • Examining the analytics
    • Batch processing / historical analysis
    • Hadoop
      • An overview of MapReduce
    • The Druid setup
      • HadoopDruidIndexer
    • Summary
  • Chapter 9: Deploying Storm on Hadoop for Advertising Analysis
    • Examining the use case
    • Establishing the architecture
      • Examining HDFS
      • Examining YARN
    • Configuring the infrastructure
      • The Hadoop infrastructure
      • Configuring HDFS
        • Configuring the NameNode
        • Configuring the DataNode
        • Configuring YARN
        • Configuring the NodeManager
    • Deploying the analytics
      • Performing a batch analysis with the Pig infrastructure
      • Performing a real-time analysis with the Storm-YARN infrastructure
    • Performing the analytics
      • Executing the batch analysis
      • Executing real-time analysis
    • Deploying the topology
    • Executing the topology
    • Summary
  • Chapter 10: Storm in the Cloud
    • Introducing Amazon Elastic Compute Cloud (EC2)
      • Setting up an AWS account
      • The AWS Management Console
        • Creating an SSH key pair
      • Launching an EC2 instance manually
        • Logging in to the EC2 instance
    • Introducing Apache Whirr
      • Installing Whirr
    • Configuring a Storm cluster with Whirr
      • Launching the cluster
    • Introducing Whirr Storm
      • Setting up Whirr Storm
        • Cluster configuration
        • Customizing Storm's configuration
        • Customizing firewall rules
    • Introducing Vagrant
      • Installing Vagrant
      • Launching your first virtual machine
        • The Vagrantfile and shared filesystem
        • Vagrant provisioning
        • Configuring multimachine clusters with Vagrant
    • Creating Storm-provisioning scripts
      • ZooKeeper
      • Storm
      • Supervisord
        • The Storm Vagrantfile
        • Launching the Storm cluster
    • Summary

P. Taylor Goetz

P. Taylor Goetz is an Apache Storm committer and release manager and has been involved with the usage and development of Storm since it was first released as open source in October of 2011. As an active contributor to the Storm user community, Taylor leads a number of open source projects that enable enterprises to integrate Storm into heterogeneous infrastructure.

Presently, he works at Hortonworks, where he leads the integration of Storm into the Hortonworks Data Platform (HDP). Prior to joining Hortonworks, he worked at Health Market Science, where he led the integration of Storm into HMS's next-generation Master Data Management platform with technologies including Cassandra, Kafka, Elasticsearch, and the Titan graph database.


Brian O'Neill

Brian O'Neill is a husband, hacker, hiker, and kayaker. He is a fisherman and father as well as a big data believer, innovator, and distributed computing dreamer.

He has been a technology leader for over 15 years and is recognized as an authority on big data. He has experience as an architect in a wide variety of settings, from start-ups to Fortune 500 companies. He believes in open source and contributes to numerous projects. He leads projects that extend Cassandra and integrate the database with indexing engines, distributed processing frameworks, and analytics engines. He won InfoWorld's Technology Leadership award in 2013. He authored the DZone reference card on Cassandra and was selected as a DataStax Cassandra MVP in 2012 and 2013.

In the past, he has contributed to expert groups within the Java Community Process (JCP) and has patents in artificial intelligence and context-based discovery. He is proud to hold a B.S. in Computer Science from Brown University.

Presently, Brian is Chief Technology Officer for Health Market Science (HMS), where he heads the development of their big data platform focused on data management and analysis for the healthcare space. The platform is powered by Storm and Cassandra and delivers real-time data management and analytics as a service.

What you will learn from this book

  • Learn the fundamentals of Storm
  • Install and configure Storm in pseudo-distributed and fully distributed mode
  • Familiarize yourself with the fundamentals of Trident and distributed state
  • Design patterns for data flows in a distributed system
  • Create integration patterns for persistence mechanisms such as Titan
  • Deploy and run Storm clusters by leveraging YARN
  • Achieve continuous availability and fault tolerance through distributed storage
  • Recognize centralized logging mechanisms and processing
  • Implement polyglot persistence and distributed transactions
  • Calculate the effectiveness of a campaign using click-through analysis

In Detail

Storm is the most popular framework for real-time stream processing. Storm provides the fundamental primitives and guarantees required for fault-tolerant distributed computing in high-volume, mission-critical applications. It is both an integration technology and a data flow and control mechanism, making it the core of many big data platforms. Storm is essential if you want to develop, deploy, and operate data processing flows capable of processing billions of transactions.

"Storm Blueprints: Patterns for Distributed Real-time Computation" covers a broad range of distributed computing topics, including not only design and integration patterns, but also the domains and applications in which the technology is immediately useful and commonly applied. This book introduces you to Storm using real-world examples, beginning with simple Storm topologies. The examples increase in complexity, introducing advanced Storm concepts as well as more sophisticated approaches to deployment and operational concerns.

This book covers the domains of real-time log processing, sensor data analysis, collective and artificial intelligence, financial market analysis, Natural Language Processing (NLP), graph analysis, polyglot persistence, and online advertising. While exploring distributed computing applications in each of those domains, the book covers advanced Storm topics such as Trident and distributed state, as well as integration patterns for Druid and Titan. The book also describes the deployment of Storm to YARN and the Amazon infrastructure, as well as other key operational concerns such as centralized logging.

By the end of the book, you will have gained an understanding of the fundamentals of Storm and Trident and be able to identify and apply those fundamentals to any suitable problem.

Approach

A blueprints book with 10 projects, one per chapter, that demonstrate the various use cases of Storm for both beginner and intermediate users, grounded in real-world example applications.

Who this book is for

Although the book focuses primarily on Java development with Storm, the patterns are more broadly applicable, and the tips, techniques, and approaches described in the book are useful to architects, developers, and operations teams.

Additionally, the book should provoke and inspire applications of distributed computing to other industries and domains. Hadoop enthusiasts will also find this book a good introduction to Storm, providing a potential migration path from batch processing to the world of real-time analytics.
