Mastering Spark for Data Science

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products

Andrew Morgan et al.

Book Details

ISBN 13: 9781785882142
Paperback: 560 pages

Book Description

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level, you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs.

This book takes a deep dive into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more.

You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Table of Contents

Chapter 1: The Big Data Science Ecosystem
  • Introducing the Big Data ecosystem
  • Overall architecture
  • Data technologies
  • Companion tools
  • Summary
Chapter 2: Data Acquisition
  • Data pipelines
  • Content registry
  • Quality assurance
  • Summary
Chapter 3: Input Formats and Schema
  • A structured life is a good life
  • GDELT dimensional modeling
  • Loading your data
  • Avro
  • Parquet
  • Summary
Chapter 4: Exploratory Data Analysis
  • The problem, principles and planning
  • Preparation
  • Exploring GDELT
  • Summary
Chapter 5: Spark for Geographic Analysis
  • GDELT and oil
  • Formulating a plan of action
  • GeoMesa
  • Gauging oil prices
  • Summary
Chapter 6: Scraping Link-Based External Data
  • Building a web scale news scanner
  • Named entity recognition
  • GIS lookup
  • Names de-duplication
  • News index dashboard
  • Summary
Chapter 7: Building Communities
  • Building a graph of persons
  • Using the Accumulo database
  • Community detection algorithm
  • GDELT dataset
  • Summary
Chapter 8: Building a Recommendation System
  • Different approaches
  • Uninformed data
  • Building a song analyzer
  • Building a recommender
  • Summary
Chapter 9: News Dictionary and Real-Time Tagging System
  • The mechanical Turk
  • Designing a Spark Streaming application
  • Consuming data streams
  • Processing Twitter data
  • Fetching HTML content
  • Using Elasticsearch as a caching layer
  • Classifying data
  • Our Twitter mechanical Turk
  • Summary
Chapter 10: Story De-duplication and Mutation
  • Detecting near duplicates
  • Building stories
  • Story mutation
  • Summary
Chapter 11: Anomaly Detection on Sentiment Analysis
  • Following the US elections on Twitter
  • Analysing sentiment
  • Using Timely as a time series database
  • Twitter and the Godwin point
  • A Small Step into sarcasm detection
  • Summary
Chapter 12: TrendCalculus
  • Studying trends
  • The TrendCalculus algorithm
  • Practical applications
  • Summary
Chapter 13: Secure Data
  • Data security
  • Authentication and authorization
  • Access
  • Encryption
  • Data disposal
  • Kerberos authentication
  • Security ecosystem
  • Your Secure Responsibility
  • Summary
Chapter 14: Scalable Algorithms
  • General principles
  • Spark architecture
  • Challenges
  • Plotting your course
  • Design patterns and techniques
  • Summary

What You Will Learn

  • Learn the design patterns that integrate Spark into industrialized data science pipelines
  • See how commercial data scientists design scalable, reusable code for data science services
  • Explore cutting-edge data science methods so that you can study trends and causality
  • Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs
  • Find out how Spark can be used as a universal ingestion engine and as a web scraper
  • Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
  • Get to know the best practices when performing Extended Exploratory Data Analysis, commonly used in commercial data science teams
  • Study advanced Spark concepts, solution design patterns, and integration architectures
  • Demonstrate powerful data science pipelines
