Mastering Spark for Data Science


Product type Book
Published in Mar 2017
Publisher Packt
ISBN-13 9781785882142
Pages 560 pages
Edition 1st Edition
Authors (4): Andrew Morgan, Antoine Amend, Matthew Hallett, David George


Chapter 12. TrendCalculus

Long before the concept of what's trending became a popular topic of study for data scientists, there was an older one that is still not well served by data science: the trend. At present, the analysis of trends, if it can be called that, is carried out primarily by people "eyeballing" time series charts and offering interpretations. But what is it that people's eyes are doing?

This chapter describes an implementation in Apache Spark of a new algorithm for studying trends numerically, called TrendCalculus, invented by Andrew Morgan. The original reference implementation is written in the Lua language and was open-sourced in 2015; the code can be viewed at https://bitbucket.org/bytesumo/trendcalculus-public.

This chapter explains the core method, which delivers the fast extraction of trend change points on a time series; these are the moments when trends change direction. We will describe our TrendCalculus algorithm in detail while implementing it in Apache...

Studying trends


The dictionary definition of trend is a general direction in which something is developing or changing, but there are other more focused definitions that might be more helpful for guiding data science. Two such definitions are from Salomé Areias, who studies social trends, and Eurostat, the official statistical agency in the European Union:

"A trend is the slow variation over a longer period of time, usually several years, generally associated with the structural causes affecting the phenomenon being measured." - EUROSTAT, official statistical agency in the European Union (http://ec.europa.eu/eurostat/statistics-explained/index.php/Glossary:Trend)

"A Trend is defined by a shift in behavior or mentality that influences a significant amount of people." - Salomé Areias, social trend commentator (https://salomeareias.wordpress.com/what-is-a-trend/)

We generally think of trends as nothing more than a long rise or fall in stock market prices. However, trends can also refer to many...

The TrendCalculus algorithm


In this section we will explain the detail of the TrendCalculus implementation, using the Brent oil price data set seen in Chapter 5, Spark for Geographic Analysis, as an example use case.

Trend windows

In order to measure any type of change, we must first quantify it in some way. For trends, we are going to define this in the following manner:

  • Overall positive change (usually expressed as a value increase): higher highs and higher lows => +1

  • Overall negative change (usually expressed as a value decrease): lower highs and lower lows => -1

We must therefore translate our data into a time series of trend direction, each value being either +1 or -1. By splitting our data into a series of windows of size n, we can calculate the dated highs and lows for each of them.
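The windowing and direction rules above can be sketched in a few lines. This is a minimal plain-Python illustration, not the chapter's Spark implementation: the function names `window_extremes` and `trend_directions`, the toy price list, and the choice to emit 0 when neither rule applies are all assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[str, float]  # (date label, value); hypothetical record shape


def window_extremes(series: List[Point], n: int) -> List[Tuple[Point, Point]]:
    """Split the series into consecutive windows of size n and return the
    dated (high, low) of each window. A trailing partial window is dropped."""
    extremes = []
    for start in range(0, len(series) - n + 1, n):
        window = series[start:start + n]
        high = max(window, key=lambda p: p[1])
        low = min(window, key=lambda p: p[1])
        extremes.append((high, low))
    return extremes


def trend_directions(extremes: List[Tuple[Point, Point]]) -> List[int]:
    """Compare each window with the one before it:
    higher high and higher low => +1, lower high and lower low => -1.
    Anything else is reported as 0 (an assumption of this sketch)."""
    trends = []
    for (prev_high, prev_low), (high, low) in zip(extremes, extremes[1:]):
        if high[1] > prev_high[1] and low[1] > prev_low[1]:
            trends.append(1)
        elif high[1] < prev_high[1] and low[1] < prev_low[1]:
            trends.append(-1)
        else:
            trends.append(0)
    return trends


# Toy data, invented for illustration only.
prices = [("d1", 10.0), ("d2", 12.0), ("d3", 11.0), ("d4", 14.0),
          ("d5", 13.0), ("d6", 9.0), ("d7", 8.0), ("d8", 7.0)]

extremes = window_extremes(prices, 2)
directions = trend_directions(extremes)
print(directions)
```

With a window size of 2, the first window has no predecessor to compare against, so four windows yield three direction values.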

Since this type of windowing is a common practice in data science, it is reasonable to think there must be an implementation in Spark; if you have read Chapter 5, Spark for Geographic Analysis, you will have seen them...

Practical applications


Now that we have our algorithm coded, let's look at practical applications for this method on real data. We will start by understanding how the algorithm performs, so that we can determine where we might use it.

Algorithm characteristics

So, what are the characteristics of this algorithm? Below is a list of strengths and weaknesses.

Advantages

The advantages are as follows:

  • The algorithm is general, lending itself well to both stream-based and Spark implementations

  • The theory is simple, yet effective

  • The implementation is fast and efficient

  • The result is visual and interpretable

  • The method is stackable and allows for multi-scale studies; this is very simple when using Spark windows

Disadvantages

The disadvantages are as follows:

  • It is a lagging indicator: the algorithm finds trend reversals that occurred in the past, and cannot be used directly to predict a trend change as it happens

  • The lag accumulates for higher scales, meaning much more data (and thus time lag) is required to find...
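To make the lag concrete, the following plain-Python sketch measures how many observations must arrive after a peak before a reversal can be reported, for several window sizes. It rests on assumptions not confirmed by the excerpt: window size stands in for "scale", a reversal is taken to be a flip between consecutive non-zero trend values, and the function name `first_reversal_known_at` and the synthetic rise-then-fall series are invented for illustration.

```python
from typing import List, Optional


def first_reversal_known_at(values: List[float], n: int) -> Optional[int]:
    """Return the index of the last observation that must have arrived before
    the first trend reversal is detectable with windows of size n, or None."""
    # (high, low) for each complete, non-overlapping window of size n
    extremes = [(max(values[i:i + n]), min(values[i:i + n]))
                for i in range(0, len(values) - n + 1, n)]
    prev_trend = 0
    for w in range(1, len(extremes)):
        (prev_high, prev_low), (high, low) = extremes[w - 1], extremes[w]
        if high > prev_high and low > prev_low:
            trend = 1
        elif high < prev_high and low < prev_low:
            trend = -1
        else:
            trend = 0
        # Assumed reversal rule: the non-zero trend sign flips.
        if trend != 0 and prev_trend != 0 and trend != prev_trend:
            return (w + 1) * n - 1  # window w must be complete first
        if trend != 0:
            prev_trend = trend
    return None


# Synthetic series: rises from 1 to 12, then falls back to 0.
values = [float(v) for v in list(range(1, 13)) + list(range(11, -1, -1))]
peak_index = values.index(max(values))

for n in (2, 4, 6):
    known_at = first_reversal_known_at(values, n)
    print(f"window size {n}: lag of {known_at - peak_index} observations")
```

On this series the reversal at the peak is only reportable once the window that follows it has closed, so the delay grows with the window size.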

Summary


In this chapter, we have introduced a method for analyzing trends with TrendCalculus. We have outlined the fact that despite analysis of trends being a very common use case, there are few tools to aid the data scientist in this cause apart from very general-purpose visualization software. We have guided the reader through the TrendCalculus algorithm, demonstrating how we implement an efficient and scalable realization of the theory in Spark. We have described the process of identifying the key output of the algorithm: trend reversals on a named scale. Having calculated reversals, we used D3.js to visualize time series data that has been summarized for one-week windows, and plotted trend reversals. The chapter continued with an explanation of how to overcome the main edge case: the zero values found during simple trend calculation. We have concluded with a brief outline of the algorithm characteristics and potential use cases, demonstrating how the method is elegant and can be easily...
