Clojure for Data Science

Statistics, big data, and machine learning for Clojure programmers

Henry Garner

Book Details

ISBN 13: 9781784397180
Paperback: 608 pages

Book Description

The term “data science” has come to describe a new profession, one expected to interpret vast datasets and translate them into improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. With its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs.

Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility!

You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualizations. You’ll understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models.

Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work, so that you can continue to be productive as the field evolves.

Table of Contents

Chapter 1: Statistics
Downloading the sample code
Running the examples
Downloading the data
Inspecting the data
Data scrubbing
Descriptive statistics
Variance
Quantiles
Binning data
Histograms
The normal distribution
Poincaré's baker
Skewness
Comparative visualizations
The importance of visualizations
Adding columns
Comparative visualizations of electorate data
Visualizing the Russian election data
Comparative visualizations
Summary
Chapter 2: Inference
Introducing AcmeContent
Download the sample code
Load and inspect the data
Visualizing the dwell times
The exponential distribution
The central limit theorem
Standard error
Samples and populations
Confidence intervals
Visualizing different populations
Hypothesis testing
Testing a new site design
The t-statistic
Performing the t-test
One-sample t-test
Resampling
Testing multiple designs
Multiple comparisons
The browser simulation
jStat
B1
Plotting probability densities
State and Reagent
Simulating multiple tests
The Bonferroni correction
Analysis of variance
The F-distribution
The F-statistic
The F-test
Effect size
Summary
Chapter 3: Correlation
About the data
Inspecting the data
Visualizing the data
The log-normal distribution
Covariance
Pearson's correlation
Hypothesis testing
Confidence intervals
Regression
Ordinary least squares
Goodness-of-fit and R-square
Multiple linear regression
Matrices
The normal equation
Multiple R-squared
Adjusted R-squared
Collinearity
Prediction
Summary
Chapter 4: Classification
About the data
Inspecting the data
Comparisons with relative risk and odds
The standard error of a proportion
The binomial distribution
Significance testing proportions
Chi-squared multiple significance testing
Classification with logistic regression
Implementing logistic regression with Incanter
Probability
Naive Bayes classification
Decision trees
Classification with clj-ml
Bias and variance
Ensemble learning and random forests
Saving the classifier to a file
Summary
Chapter 5: Big Data
Downloading the code and data
The reducers library
Mathematical folds with Tesser
Multiple regression with gradient descent
Scaling gradient descent with Hadoop
Stochastic gradient descent
Summary
Chapter 6: Clustering
Downloading the data
Extracting the data
Inspecting the data
Clustering text
Creating term frequency vectors
Clustering with k-means and Incanter
Better clustering with TF-IDF
Large-scale clustering with Mahout
Running k-means clustering with Mahout
Cluster evaluation measures
The drawbacks of k-means
The curse of dimensionality
Summary
Chapter 7: Recommender Systems
Download the code and data
Inspect the data
Parse the data
Types of recommender systems
Item-based and user-based recommenders
Slope One recommenders
Building a user-based recommender with Mahout
k-nearest neighbors
Recommender evaluation with Mahout
Probabilistic methods for large sets
Jaccard similarity for large sets with MinHash
Dimensionality reduction
Large-scale machine learning with Apache Spark and MLlib
Machine learning on Spark with MLlib
Summary
Chapter 8: Network Analysis
Download the data
Graph traversal with Loom
Breadth-first and depth-first search
Finding the shortest path
Whole-graph analysis
Scale-free networks
Distributed graph computation with GraphX
Summary
Chapter 9: Time Series
About the data
Fitting curves with a linear model
Time series decomposition
Discrete time models
Maximum likelihood estimation
Time series forecasting
Summary
Chapter 10: Visualization
Download the code and data
Exploratory data visualization
Using Quil for visualization
Visualization for communication
Summary

What You Will Learn

  • Perform hypothesis testing and understand feature selection and statistical significance to interpret your results with confidence
  • Implement the core machine learning techniques of regression, classification, clustering and recommendation
  • Understand the value of simple statistics and distributions in exploratory data analysis
  • Scale algorithms to web-sized datasets efficiently using distributed programming models on Hadoop and Spark
  • Apply suitable analytic approaches for text, graph, and time series data
  • Interpret the terminology that you will encounter in technical papers
  • Import libraries from other JVM languages such as Java and Scala
  • Communicate your findings clearly and convincingly to nontechnical colleagues
