
You're reading from Mastering Spark for Data Science

Product type: Book
Published in: Mar 2017
Publisher: Packt
ISBN-13: 9781785882142
Edition: 1st Edition
Authors (4):
Andrew Morgan

Andrew Morgan is a specialist in data strategy and its execution, with deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has designed systems for some of its most prestigious players and their global clients, often on large, complex, and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist and the inventor of the TrendCalculus algorithm, developed as part of his ongoing research into long-range predictions based on machine learning of the patterns found in drifting cultural, geopolitical, and economic trends. He sits on the Hadoop Summit EU data science selection committee, has spoken at many conferences on a variety of data topics, and enjoys participating in the Data Science and Big Data communities where he lives in London.

Antoine Amend

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. Graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data in the early days of Hadoop. He has embraced big data technologies ever since, and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (among more than 2,000 competitors), where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

Matthew Hallett

Matthew Hallett is a software engineer and computer scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission-critical environments comprising multi-thousand-node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures, in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top-four audit firm.

David George

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity. Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.


Chapter 5. Spark for Geographic Analysis

Geographic processing is a powerful use case for Spark, so the aim of this chapter is to explain how data scientists can process geographic data using Spark to produce powerful, map-based views of very large datasets. We will demonstrate how to process spatio-temporal datasets easily via Spark's integration with GeoMesa, which helps turn Spark into a sophisticated geographic processing engine. As the Internet of Things (IoT) and other location-aware datasets become ever more common, and as moving-object data volumes climb, Spark will become a critical tool in closing the gap between spatial functionality and processing scalability. This chapter shows how to conduct advanced geopolitical analysis of global news, with a view to leveraging the data to analyze and perform data science on oil prices.

In this chapter, we will cover the following topics:

  • Using Spark to ingest and preprocess geolocated data

  • Storing geodata...

GDELT and oil


The premise of this chapter is that we can manipulate GDELT data to predict, to a greater or lesser extent, the price of oil based on historic events. The accuracy of our predictor will depend on many variables, including the level of detail of our events, the number of events used, and our hypotheses about the nature of the relationship between oil and those events.

The oil industry is very complex and is driven by many factors. It has been found, however, that most major oil price fluctuations are largely explained by shifts in the demand for crude oil. The price also increases during times of greater demand for stock, and has historically been high in times of geopolitical tension in the Middle East. Political events in particular have a strong influence on the oil price, and it is this aspect that we will concentrate on.

Crude oil is produced by many countries around the world; there are, however, three main benchmarks used by producers for pricing:

  • Brent: Produced by various entities...

Formulating a plan of action


Having inspected the GDELT schemas, we now need to make some decisions about what data we are going to use, and to justify that usage based on our hypotheses. This is a critical stage, as there are many areas to consider; at the very least we need to:

  • Ensure that our hypotheses are clear so that we have a known starting point

  • Ensure that we are clear about how we are going to implement the hypotheses, and determine an action plan

  • Ensure that we use enough appropriate data to meet our action plan, and scope the data usage so that we can produce a conclusion within a given time frame. For example, using all GDELT data would be ideal, but it is probably not feasible unless a large processing cluster is available; on the other hand, using a single day is clearly not enough to gauge any patterns over time

  • Formulate a plan B in case our initial results are not conclusive

Our second hypothesis is about the detail of the events; for the purposes of clarity, in this chapter...

GeoMesa


GeoMesa is an open source product designed to leverage the distributed nature of storage systems, such as Accumulo and Cassandra, to hold a distributed spatio-temporal database. With this design, GeoMesa is capable of running the large-scale geospatial analytics required for very large datasets, including GDELT.

We are going to use GeoMesa to store GDELT data and run our analytics across a large proportion of that data; this should give us access to enough data to train our model so that we can predict future rises and falls in oil prices. GeoMesa will also enable us to plot large numbers of points on a map, so that we can visualize GDELT and any other useful data.
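As a rough sketch of what connecting to a GeoMesa-backed Accumulo store looks like in code, the snippet below uses the standard GeoTools DataStoreFinder. The parameter keys follow the GeoMesa 1.2.x convention, and all of the values (instance name, ZooKeeper quorum, credentials, catalog table) are placeholder assumptions; check both against the version you install.

    import scala.collection.JavaConverters._
    import org.geotools.data.{DataStore, DataStoreFinder}

    // Connection parameters for a GeoMesa Accumulo data store. The keys
    // follow the GeoMesa 1.2.x convention; the values are placeholders.
    val params = Map[String, java.io.Serializable](
      "instanceId" -> "accumulo",        // Accumulo instance name
      "zookeepers" -> "zoo1,zoo2,zoo3",  // ZooKeeper quorum
      "user"       -> "gdelt_user",
      "password"   -> "secret",
      "tableName"  -> "gdelt"            // root GeoMesa catalog table
    ).asJava

    // DataStoreFinder scans the classpath for a provider that accepts
    // these parameters; a null return means no provider matched.
    val dataStore: DataStore = DataStoreFinder.getDataStore(params)
    require(dataStore != null, "Could not connect - check parameters and classpath")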

Installing

There is a very good tutorial on the GeoMesa website (www.geomesa.org) that guides the user through the installation process. Therefore, it is not our intention here to produce another how-to guide; there are, however, a few points worth noting that may save you time in getting everything up and running...

Gauging oil prices


Now that we have a substantial amount of data in our data store (we can always add more using the preceding Spark job), we will proceed to query that data, using the GeoMesa API, to get the rows ready for application to our learning algorithm. We could, of course, use the raw GDELT files, but the following method is a useful tool to have available.

Using the GeoMesa query API

The GeoMesa query API enables us to query for results based upon spatio-temporal attributes, whilst also leveraging the parallelization of the data store, in this case Accumulo with its iterators. We can use the API to build SimpleFeatureCollections, which we can then parse to realize GeoMesa SimpleFeatures and ultimately the raw data that matches our query.
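As a minimal sketch of this in practice, the snippet below builds a combined spatio-temporal predicate with ECQL and streams the matching SimpleFeatures back from the store. The feature type name gdelt and the attribute names geom and SQLDATE are assumptions standing in for whatever schema was used at ingest time, and dataStore is the handle obtained in the earlier connection sketch.

    import org.geotools.data.Query
    import org.geotools.filter.text.ecql.ECQL
    import org.opengis.feature.simple.SimpleFeature

    // A combined spatial and temporal predicate: events falling inside a
    // Middle East bounding box during 2015. "geom" and "SQLDATE" are
    // assumed attribute names from our ingest schema.
    val filter = ECQL.toFilter(
      "BBOX(geom, 30.0, 10.0, 60.0, 40.0) AND " +
      "SQLDATE DURING 2015-01-01T00:00:00Z/2015-12-31T23:59:59Z")

    val query = new Query("gdelt", filter)

    // The iterator streams features back lazily, with the heavy filtering
    // pushed down into the data store's (here, Accumulo's) iterators.
    val features = dataStore.getFeatureSource("gdelt").getFeatures(query).features()
    try {
      while (features.hasNext) {
        val feature: SimpleFeature = features.next()
        // ... process each matching SimpleFeature here ...
      }
    } finally {
      features.close()
    }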

At this stage, we should build code that is generic, so that we can change it easily should we decide later that we have not used enough data, or if we need to change the output fields. Initially, we will extract a few fields: SQLDATE, Actor1Name...
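Since the exact field list may well change, one way to keep the extraction generic is a small projection helper like the illustrative sketch below (collectFields is our own hypothetical name, not a GeoMesa API): pass in the attribute names required, and changing the output fields later means changing only one list.

    import org.opengis.feature.simple.SimpleFeature

    // Illustrative helper: project a SimpleFeature onto a caller-supplied
    // list of attribute names, tolerating missing values.
    def collectFields(feature: SimpleFeature, fields: Seq[String]): Seq[String] =
      fields.map(name => Option(feature.getAttribute(name)).map(_.toString).getOrElse(""))

    // For example, for the fields mentioned above:
    // collectFields(feature, Seq("SQLDATE", "Actor1Name"))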

Summary


In this chapter, we have introduced the concept of storing data in a spatio-temporal way so that we can use GeoMesa and GeoServer to create and run queries. We have shown these queries executed both in the tools themselves and programmatically, leveraging GeoServer to display the results. Further, we have demonstrated how to merge different artifacts to create insights purely from the raw GDELT events, before any follow-on processing. Moving on from GeoMesa, we touched upon the highly complex world of oil pricing and worked on a simple algorithm to estimate weekly oil price changes. Whilst it is not reasonable to create an accurate model with the time and resources available, we have explored a number of areas of concern and attempted to address them, at least at a high level, to give an insight into the approaches that can be taken in this problem space.

Throughout the chapter, we have introduced a number of key Spark libraries and functions, the key area being...

