You're reading from Mastering Spark for Data Science

Product type: Book

Published in: Mar 2017

Publisher: Packt

ISBN-13: 9781785882142

Pages: 560

Edition: 1st Edition

Concepts: Data Science

Authors (4):

Andrew Morgan

Antoine Amend

Matthew Hallett

David George

Chapter 12. TrendCalculus

Long before the concept of what's trending became a popular topic of study for data scientists, there was an older one that is still not well served by data science: trends. Presently, the analysis of trends, if it can be called that, is primarily carried out by people "eyeballing" time series charts and offering interpretations. But what is it that people's eyes are doing?

This chapter describes an implementation in Apache Spark of a new algorithm for studying trends numerically, called TrendCalculus, invented by Andrew Morgan. The original reference implementation is written in the Lua language and was open-sourced in 2015; the code can be viewed at https://bitbucket.org/bytesumo/trendcalculus-public.

This chapter explains the core method, which delivers the fast extraction of trend change points on a time series; these are the moments when trends change direction. We will describe our TrendCalculus algorithm in detail while implementing it in Apache...

Studying trends

The dictionary definition of trend is a general direction in which something is developing or changing, but there are other more focused definitions that might be more helpful for guiding data science. Two such definitions are from Salomé Areias, who studies social trends, and Eurostat, the official statistical agency in the European Union:

"A trend is the slow variation over a longer period of time, usually several years, generally associated with the structural causes affecting the phenomenon being measured." - EUROSTAT, official statistical agency in the European Union (http://ec.europa.eu/eurostat/statistics-explained/index.php/Glossary:Trend)

"A Trend is defined by a shift in behavior or mentality that influences a significant amount of people." - Salomé Areias, social trend commentator (https://salomeareias.wordpress.com/what-is-a-trend/)

We generally think of trends as nothing more than a long rise or fall in stock market prices. However, trends can also refer to many...

The TrendCalculus algorithm

In this section, we will explain the details of the TrendCalculus implementation, using the Brent oil price data set seen in Chapter 5, Spark for Geographic Analysis, as an example use case.

Trend windows

In order to measure any type of change, we must first quantify it in some way. For trends, we are going to define this in the following manner:

Overall positive change (usually expressed as a value increase): higher highs and higher lows => +1

Overall negative change (usually expressed as a value decrease): lower highs and lower lows => -1

We must therefore translate our data into a time series of trend direction, being either +1 or -1. By splitting our data into a series of windows, size n, we can calculate the dated highs and lows for each of them:
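The windowing step described above can be sketched in a few lines of plain Python (not Spark, and not the chapter's own code). The sketch assumes non-overlapping windows of size n, and it assumes the two comparisons are combined as sign(sign(high change) + sign(low change)); the function names and the toy prices are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Point = Tuple[str, float]  # (date label, value)


def sign(x: float) -> int:
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def window_extremes(series: List[Point], n: int) -> List[Tuple[Point, Point]]:
    """Split the series into consecutive, non-overlapping windows of size n
    and return the dated (high, low) pair of each complete window."""
    extremes = []
    for start in range(0, len(series) - n + 1, n):
        window = series[start:start + n]
        high = max(window, key=lambda p: p[1])
        low = min(window, key=lambda p: p[1])
        extremes.append((high, low))
    return extremes


def trend_signs(extremes: List[Tuple[Point, Point]]) -> List[int]:
    """Compare each window with the previous one.
    Higher high and higher low gives +1, lower high and lower low gives -1.
    Assumption: mixed cases (for example higher high, lower low) come out as 0."""
    signs = []
    for (prev_high, prev_low), (high, low) in zip(extremes, extremes[1:]):
        high_move = sign(high[1] - prev_high[1])
        low_move = sign(low[1] - prev_low[1])
        signs.append(sign(high_move + low_move))
    return signs


# Toy data, invented for illustration only (not the Brent series).
prices = [("d1", 10.0), ("d2", 12.0), ("d3", 11.0),
          ("d4", 13.0), ("d5", 15.0), ("d6", 12.5),
          ("d7", 11.0), ("d8", 9.0), ("d9", 10.0)]

extremes = window_extremes(prices, 3)
signs = trend_signs(extremes)
print(extremes)
print(signs)
```

With three windows there are two comparisons, so the result holds one trend direction per pair of adjacent windows. Note that under the combination rule assumed here, a window pair with a higher high but a lower low yields 0 rather than +1 or -1.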

Since this type of windowing is a common practice in data science, it is reasonable to think there must be an implementation in Spark; if you have read Chapter 5, Spark for Geographic Analysis, you will have seen them...

Practical applications

Now that we have our algorithm coded, let's look at practical applications for this method on real data. We will start by understanding how the algorithm performs, so that we can determine where we might use it.

Algorithm characteristics

So, what are the characteristics of this algorithm? Below is a list of strengths and weaknesses.

Advantages

The advantages are as follows:

The algorithm is general, lending itself well to both stream-based and Spark implementations
The theory is simple, yet effective
The implementation is fast and efficient
The result is visual and interpretable
The method is stackable and allows for multi-scale studies; this is very simple when using Spark windows

Disadvantages

The disadvantages are as follows:

A lagging indicator: the algorithm finds trend reversals that occurred in the past, and cannot be used directly to predict a trend change as it happens
The lag accumulates for higher scales, meaning much more data (and thus time lag) is required to find...
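To make the lagging behaviour concrete, here is a minimal plain-Python sketch (not Spark, and not the chapter's own code). It assumes a list of per-window trend directions in which entry i compares window i with window i + 1, non-overlapping windows of size n, and that 0 values are simply skipped; that last choice is an assumption of this sketch and not necessarily how the zero case is resolved in the chapter. The function names are hypothetical.

```python
from typing import List


def find_reversals(signs: List[int]) -> List[int]:
    """Return the indices at which the trend direction flips relative to the
    last non-zero direction seen. Zeros are skipped (an assumption)."""
    reversals = []
    last = 0
    for i, s in enumerate(signs):
        if s == 0:
            continue
        if last != 0 and s != last:
            reversals.append(i)
        last = s
    return reversals


def observations_needed(sign_index: int, n: int) -> int:
    """Number of raw observations that must have arrived before the direction
    at sign_index can be computed: windows 0 .. sign_index + 1 must be complete."""
    return (sign_index + 2) * n


# Hypothetical directions for seven windows (six comparisons).
signs = [1, 1, 0, -1, -1, 1]
reversals = find_reversals(signs)
print(reversals)
print([observations_needed(i, 5) for i in reversals])
```

Because a flip can only be flagged once the later window is complete, every reversal is reported after the fact, and larger windows push the earliest possible report further back.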

Summary

In this chapter, we have introduced a method for analyzing trends with TrendCalculus. We have outlined the fact that despite analysis of trends being a very common use case, there are few tools to aid the data scientist in this cause apart from very general-purpose visualization software. We have guided the reader through the TrendCalculus algorithm, demonstrating how we implement an efficient and scalable realization of the theory in Spark. We have described the process of identifying the key output of the algorithm: trend reversals on a named scale. Having calculated reversals, we used D3.js to visualize time series data that has been summarized for one-week windows, and plotted trend reversals. The chapter continued with an explanation of how to overcome the main edge case: the zero values found during simple trend calculation. We have concluded with a brief outline of the algorithm characteristics and potential use cases, demonstrating how the method is elegant and can be easily...