Scala: Guide for Data Science Professionals

Scala will be a valuable tool to have on hand during your data science journey, for everything from data cleaning to cutting-edge machine learning.

Pascal Bugnion, Arun Manivannan, Patrick R. Nicolas

Book Details

ISBN 13: 9781787282858
Paperback: 1100 pages

Book Description

Scala is especially well suited to analyzing large data sets, because the scale of the task has little impact on performance. Scala’s powerful functional libraries can interact with databases and build scalable frameworks, resulting in robust data pipelines.

The first module introduces you to Scala libraries to ingest, store, manipulate, process, and visualize data. Using real-world examples, you will learn how to design scalable architectures to process and model data, starting from simple concurrency constructs and progressing to actor systems and Apache Spark. After this, you will learn how to build interactive visualizations with web frameworks.

Once you are familiar with the tasks involved in data science, you will explore data analytics with Scala in the second module. You’ll see how Scala can be used to make sense of data through easy-to-follow recipes. You will learn about Bokeh bindings for exploratory data analysis and core machine learning algorithms with the Spark ML library. You’ll also gain a working understanding of Spark Streaming, machine learning for streaming data, and Spark GraphX.

Armed with a firm understanding of data analysis, you will be ready to explore the most cutting-edge aspect of data science: machine learning. The final module teaches you the A to Z of machine learning with Scala. You’ll explore Scala features such as dependency injection and implicits, which are used to write machine learning algorithms. You’ll also explore machine learning topics such as clustering, dimensionality reduction, Naïve Bayes, regression models, SVMs, neural networks, and more.

This learning path combines some of the best that Packt has to offer into one complete, curated package. It includes content from the following Packt products:

  • Scala for Data Science, Pascal Bugnion
  • Scala Data Analysis Cookbook, Arun Manivannan
  • Scala for Machine Learning, Patrick R. Nicolas

Table of Contents

Chapter 1: Scala and Data Science
Data science
Programming in data science
Why Scala?
When not to use Scala
Summary
References
Chapter 2: Manipulating Data with Breeze
Code examples
Installing Breeze
Getting help on Breeze
Basic Breeze data types
An example – logistic regression
Towards re-usable code
Alternatives to Breeze
Summary
References
Chapter 3: Plotting with breeze-viz
Diving into Breeze
Customizing plots
Customizing the line type
More advanced scatter plots
Multi-plot example – scatterplot matrix plots
Managing without documentation
Breeze-viz reference
Data visualization beyond breeze-viz
Summary
Chapter 4: Parallel Collections and Futures
Parallel collections
Futures
Summary
References
Chapter 5: Scala and SQL through JDBC
Interacting with JDBC
First steps with JDBC
JDBC summary
Functional wrappers for JDBC
Safer JDBC connections with the loan pattern
Enriching JDBC statements with the "pimp my library" pattern
Wrapping result sets in a stream
Looser coupling with type classes
Creating a data access layer
Summary
References
Chapter 6: Slick – A Functional Interface for SQL
FEC data
Invokers
Operations on columns
Aggregations with "Group by"
Accessing database metadata
Slick versus JDBC
Summary
References
Chapter 7: Web APIs
A whirlwind tour of JSON
Querying web APIs
JSON in Scala – an exercise in pattern matching
Extraction using case classes
Concurrency and exception handling with futures
Authentication – adding HTTP headers
Summary
References
Chapter 8: Scala and MongoDB
MongoDB
Connecting to MongoDB with Casbah
Inserting documents
Extracting objects from the database
Complex queries
Casbah query DSL
Custom type serialization
Beyond Casbah
Summary
References
Chapter 9: Concurrency with Akka
GitHub follower graph
Actors as people
Hello world with Akka
Case classes as messages
Actor construction
Anatomy of an actor
Follower network crawler
Fetcher actors
Routing
Message passing between actors
Queue control and the pull pattern
Accessing the sender of a message
Stateful actors
Follower network crawler
Fault tolerance
Custom supervisor strategies
Life-cycle hooks
What we have not talked about
Summary
References
Chapter 10: Distributed Batch Processing with Spark
Installing Spark
Acquiring the example data
Resilient distributed datasets
Building and running standalone programs
Spam filtering
Lifting the hood
Data shuffling and partitions
Summary
Reference
Chapter 11: Spark SQL and DataFrames
DataFrames – a whirlwind introduction
Aggregation operations
Joining DataFrames together
Custom functions on DataFrames
DataFrame immutability and persistence
SQL statements on DataFrames
Complex data types – arrays, maps, and structs
Interacting with data sources
Standalone programs
Summary
References
Chapter 12: Distributed Machine Learning with MLlib
Introducing MLlib – Spam classification
Pipeline components
Evaluation
Regularization in logistic regression
Cross-validation and model selection
Beyond logistic regression
Summary
References
Chapter 13: Web APIs with Play
Client-server applications
Introduction to web frameworks
Model-View-Controller architecture
Single page applications
Building an application
The Play framework
Dynamic routing
Actions
Interacting with JSON
Querying external APIs and consuming JSON
Creating APIs with Play: a summary
Rest APIs: best practice
Summary
References
Chapter 14: Visualization with D3 and the Play Framework
GitHub user data
Do I need a backend?
JavaScript dependencies through web-jars
Towards a web application: HTML templates
Modular JavaScript through RequireJS
Bootstrapping the applications
Client-side program architecture
Drawing plots with NVD3
Summary
References
Chapter 15: Getting Started with Breeze
Introduction
Getting Breeze – the linear algebra library
Working with vectors
Working with matrices
Vectors and matrices with randomly distributed values
Reading and writing CSV files
Chapter 16: Getting Started with Apache Spark DataFrames
Introduction
Getting Apache Spark
Creating a DataFrame from CSV
Manipulating DataFrames
Creating a DataFrame from Scala case classes
Chapter 17: Loading and Preparing Data – DataFrame
Introduction
Loading more than 22 features into classes
Loading JSON into DataFrames
Storing data as Parquet files
Using the Avro data model in Parquet
Loading from RDBMS
Preparing data in Dataframes
Chapter 18: Data Visualization
Introduction
Visualizing using Zeppelin
Creating scatter plots with Bokeh-Scala
Creating a time series MultiPlot with Bokeh-Scala
Chapter 19: Learning from Data
Introduction
Supervised and unsupervised learning
Gradient descent
Predicting continuous values using linear regression
Binary classification using LogisticRegression and SVM
Binary classification using LogisticRegression with Pipeline API
Clustering using K-means
Feature reduction using principal component analysis
Chapter 20: Scaling Up
Introduction
Building the Uber JAR
Submitting jobs to the Spark cluster (local)
Running the Spark Standalone cluster on EC2
Running the Spark Job on Mesos (local)
Running the Spark Job on YARN (local)
Chapter 21: Going Further
Introduction
Using Spark Streaming to subscribe to a Twitter stream
Using Spark as an ETL tool
Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
Using GraphX to analyze Twitter data
Chapter 22: Getting Started
Mathematical notation for the curious
Why machine learning?
Why Scala?
Model categorization
Taxonomy of machine learning algorithms
Tools and frameworks
Source code
Let's kick the tires
Summary
Chapter 23: Hello World!
Modeling
Designing a workflow
Assessing a model
Summary
Chapter 24: Data Preprocessing
Time series
Moving averages
Fourier analysis
The Kalman filter
Alternative preprocessing techniques
Summary
Chapter 25: Unsupervised Learning
Clustering
Dimension reduction
Performance considerations
Summary
Chapter 26: Naïve Bayes Classifiers
Probabilistic graphical models
Naïve Bayes classifiers
Multivariate Bernoulli classification
Naïve Bayes and text mining
Pros and cons
Summary
Chapter 27: Regression and Regularization
Linear regression
Regularization
Numerical optimization
The logistic regression
Summary
Chapter 28: Sequential Data Models
Markov decision processes
The hidden Markov model (HMM)
Conditional random fields
CRF and text analytics
Comparing CRF and HMM
Performance consideration
Summary
Chapter 29: Kernel Models and Support Vector Machines
Kernel functions
The support vector machine (SVM)
Support vector classifier (SVC)
Anomaly detection with one-class SVC
Support vector regression (SVR)
Performance considerations
Summary
Chapter 30: Artificial Neural Networks
Feed-forward neural networks (FFNN)
The multilayer perceptron (MLP)
Evaluation
Benefits and limitations
Summary
Chapter 31: Genetic Algorithms
Evolution
Genetic algorithms and machine learning
Genetic algorithm components
Implementation
GA for trading strategies
Advantages and risks of genetic algorithms
Summary
Chapter 32: Reinforcement Learning
Introduction
Learning classifier systems
Summary
Chapter 33: Scalable Frameworks
Overview
Scala
Scalability with Actors
Akka
Apache Spark
Summary

What You Will Learn

  • Transfer and filter tabular data to extract features for machine learning
  • Read, clean, transform, and write data to both SQL and NoSQL databases
  • Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
  • Load data from HDFS and Hive with ease
  • Run streaming and graph analytics in Spark for exploratory analysis
  • Bundle and scale up Spark jobs by deploying them into a variety of cluster managers
  • Build dynamic workflows for scientific computing
  • Leverage open source libraries to extract patterns from time series
  • Master probabilistic models for sequential data
