Python Data Analysis Cookbook

Over 140 practical recipes to help you make sense of your data with ease and build production-ready data apps

Ivan Idris


Book Details

ISBN 13: 9781785282287
Paperback: 462 pages

Book Description

Data analysis is a rapidly evolving field, and Python is a multi-paradigm programming language suitable for both object-oriented application development and functional design patterns. Because Python offers tools and libraries for a wide range of purposes, it has gradually become the primary language for data science, including data analysis, visualization, and machine learning.

Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes, then dive into statistical data analysis using distribution algorithms and correlations. The recipes then help you find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining.

In this book, you will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. Finally, you will use parallelism, including multiple threads, to speed up your code and improve system performance.
By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.
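
To give a flavor of the reproducibility and bootstrapping themes mentioned above, here is a minimal sketch that is not taken from the book. It uses only the Python standard library; the function name bootstrap_mean_ci, its parameters, and the sample values are hypothetical and chosen purely for illustration. It estimates a percentile bootstrap confidence interval for a mean, with a fixed seed so that repeated runs give identical results.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean (illustrative)."""
    # A dedicated, seeded generator keeps the analysis reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original data.
        sample = rng.choices(data, k=len(data))
        means.append(statistics.fmean(sample))
    means.sort()
    # Pick the empirical alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up measurements, used only to exercise the function.
data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2]
low, high = bootstrap_mean_ci(data)
print(f"95% bootstrap CI for the mean: [{low:.3f}, {high:.3f}]")
```

Calling bootstrap_mean_ci(data) twice returns the same interval because the seed is fixed; passing a different seed gives a slightly different interval. The block bootstrapping recipes listed in Chapter 6 resample contiguous blocks instead of individual points, which this sketch does not attempt.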

Table of Contents

Chapter 1: Laying the Foundation for Reproducible Data Analysis
Introduction
Setting up Anaconda
Installing the Data Science Toolbox
Creating a virtual environment with virtualenv and virtualenvwrapper
Sandboxing Python applications with Docker images
Keeping track of package versions and history in IPython Notebook
Configuring IPython
Learning to log for robust error checking
Unit testing your code
Configuring pandas
Configuring matplotlib
Seeding random number generators and NumPy print options
Standardizing reports, code style, and data access
Chapter 2: Creating Attractive Data Visualizations
Introduction
Graphing Anscombe's quartet
Choosing seaborn color palettes
Choosing matplotlib color maps
Interacting with IPython Notebook widgets
Viewing a matrix of scatterplots
Visualizing with d3.js via mpld3
Creating heatmaps
Combining box plots and kernel density plots with violin plots
Visualizing network graphs with hive plots
Displaying geographical maps
Using ggplot2-like plots
Highlighting data points with influence plots
Chapter 3: Statistical Data Analysis and Probability
Introduction
Fitting data to the exponential distribution
Fitting aggregated data to the gamma distribution
Fitting aggregated counts to the Poisson distribution
Determining bias
Estimating kernel density
Determining confidence intervals for mean, variance, and standard deviation
Sampling with probability weights
Exploring extreme values
Correlating variables with Pearson's correlation
Correlating variables with the Spearman rank correlation
Correlating a binary and a continuous variable with the point biserial correlation
Evaluating relations between variables with ANOVA
Chapter 4: Dealing with Data and Numerical Issues
Introduction
Clipping and filtering outliers
Winsorizing data
Measuring central tendency of noisy data
Normalizing with the Box-Cox transformation
Transforming data with the power ladder
Transforming data with logarithms
Rebinning data
Applying logit() to transform proportions
Fitting a robust linear model
Taking variance into account with weighted least squares
Using arbitrary precision for optimization
Using arbitrary precision for linear algebra
Chapter 5: Web Mining, Databases, and Big Data
Introduction
Simulating web browsing
Scraping the Web
Dealing with non-ASCII text and HTML entities
Implementing association tables
Setting up database migration scripts
Adding a table column to an existing table
Adding indices after table creation
Setting up a test web server
Implementing a star schema with fact and dimension tables
Using HDFS
Setting up Spark
Clustering data with Spark
Chapter 6: Signal Processing and Timeseries
Introduction
Spectral analysis with periodograms
Estimating power spectral density with the Welch method
Analyzing peaks
Measuring phase synchronization
Exponential smoothing
Evaluating smoothing
Using the Lomb-Scargle periodogram
Analyzing the frequency spectrum of audio
Analyzing signals with the discrete cosine transform
Block bootstrapping time series data
Moving block bootstrapping time series data
Applying the discrete wavelet transform
Chapter 7: Selecting Stocks with Financial Data Analysis
Introduction
Computing simple and log returns
Ranking stocks with the Sharpe ratio and liquidity
Ranking stocks with the Calmar and Sortino ratios
Analyzing returns statistics
Correlating individual stocks with the broader market
Exploring risk and return
Examining the market with the non-parametric runs test
Testing for random walks
Determining market efficiency with autoregressive models
Creating tables for a stock prices database
Populating the stock prices database
Optimizing an equal weights two-asset portfolio
Chapter 8: Text Mining and Social Network Analysis
Introduction
Creating a categorized corpus
Tokenizing news articles in sentences and words
Stemming, lemmatizing, filtering, and TF-IDF scores
Recognizing named entities
Extracting topics with non-negative matrix factorization
Implementing a basic terms database
Computing social network density
Calculating social network closeness centrality
Determining the betweenness centrality
Estimating the average clustering coefficient
Calculating the assortativity coefficient of a graph
Getting the clique number of a graph
Creating a document graph with cosine similarity
Chapter 9: Ensemble Learning and Dimensionality Reduction
Introduction
Recursively eliminating features
Applying principal component analysis for dimension reduction
Applying linear discriminant analysis for dimension reduction
Stacking and majority voting for multiple models
Learning with random forests
Fitting noisy data with the RANSAC algorithm
Bagging to improve results
Boosting for better learning
Nesting cross-validation
Reusing models with joblib
Hierarchically clustering data
Taking a Theano tour
Chapter 10: Evaluating Classifiers, Regressors, and Clusters
Introduction
Getting classification straight with the confusion matrix
Computing precision, recall, and F1-score
Examining a receiver operating characteristic and the area under a curve
Visualizing the goodness of fit
Computing MSE and median absolute error
Evaluating clusters with the mean silhouette coefficient
Comparing results with a dummy classifier
Determining MAPE and MPE
Comparing with a dummy regressor
Calculating the mean absolute error and the residual sum of squares
Examining the kappa of classification
Taking a look at the Matthews correlation coefficient
Chapter 11: Analyzing Images
Introduction
Setting up OpenCV
Applying Scale-Invariant Feature Transform (SIFT)
Detecting features with SURF
Quantizing colors
Denoising images
Extracting patches from an image
Detecting faces with Haar cascades
Searching for bright stars
Extracting metadata from images
Extracting texture features from images
Applying hierarchical clustering on images
Segmenting images with spectral clustering
Chapter 12: Parallelism and Performance
Introduction
Just-in-time compiling with Numba
Speeding up numerical expressions with Numexpr
Running multiple threads with the threading module
Launching multiple tasks with the concurrent.futures module
Accessing resources asynchronously with the asyncio module
Distributed processing with execnet
Profiling memory usage
Calculating the mean, variance, skewness, and kurtosis on the fly
Caching with a least recently used cache
Caching HTTP requests
Streaming counting with the Count-min sketch
Harnessing the power of the GPU with OpenCL

What You Will Learn

  • Set up reproducible data analysis
  • Clean and transform data
  • Apply advanced statistical analysis
  • Create attractive data visualizations
  • Web scrape and work with databases, Hadoop, and Spark
  • Analyze images and time series data
  • Mine text and analyze social networks
  • Use machine learning and evaluate the results
  • Take advantage of parallelism and concurrency
