Python: End-to-end Data Analysis

Leverage the power of Python to clean, scrape, analyze, and visualize your data

Phuong Vothihong et al.


Book Details

ISBN 13: 9781788394697
Paperback: 931 pages

Book Description

Data analysis is the process of applying logical and analytical reasoning to study each component of the data present in a system. Python is a multi-domain, high-level programming language that offers a wide range of tools and libraries, and it has gradually evolved into one of the primary languages for data science. Have you ever imagined becoming an expert at approaching data analysis problems, solving them, and extracting all of the available information from your data? If so, look no further: this is the course you need!

In this course, we will get you started with Python data analysis by introducing the basics of data analysis and the supporting Python libraries such as matplotlib, NumPy, and pandas. You'll create visualizations by choosing color maps, shapes, sizes, and palettes, then delve into statistical data analysis using distributions and correlations. You'll then find your way around different data and numerical problems, get to grips with Spark and HDFS, and work through web mining and database migration scripts. You'll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as performing regression, quantifying cause and effect using Bayesian methods, and using Python's tools for supervised machine learning.

The course provides you with highly practical content explaining data analysis with Python, from the following Packt books:

  1. Getting Started with Python Data Analysis.
  2. Python Data Analysis Cookbook.
  3. Mastering Python Data Analysis.

By the end of this course, you will have the knowledge you need to analyze data of varying complexity and turn it into actionable insights.

Table of Contents

Chapter 1: Introducing Data Analysis and Libraries
Data analysis and processing
An overview of the libraries in data analysis
Python libraries in data analysis
Summary
Chapter 2: NumPy Arrays and Vectorized Computation
NumPy arrays
Array functions
Data processing using arrays
Linear algebra with NumPy
NumPy random numbers
Summary
Chapter 3: Data Analysis with Pandas
An overview of the Pandas package
The Pandas data structure
The essential basic functionality
Indexing and selecting data
Computational tools
Working with missing data
Advanced uses of Pandas for data analysis
Summary
Chapter 4: Data Visualization
The matplotlib API primer
Exploring plot types
Legends and annotations
Plotting functions with Pandas
Additional Python data visualization tools
Summary
Chapter 5: Time Series
Time series primer
Working with date and time objects
Resampling time series
Downsampling time series data
Upsampling time series data
Time zone handling
Timedeltas
Time series plotting
Summary
Chapter 6: Interacting with Databases
Interacting with data in text format
Interacting with data in binary format
Interacting with data in MongoDB
Interacting with data in Redis
Summary
Chapter 7: Data Analysis Application Examples
Data munging
Data aggregation
Grouping data
Summary
Chapter 8: Machine Learning Models with scikit-learn
An overview of machine learning models
The scikit-learn modules for different models
Data representation in scikit-learn
Supervised learning – classification and regression
Unsupervised learning – clustering and dimensionality reduction
Measuring prediction performance
Summary
Chapter 9: Laying the Foundation for Reproducible Data Analysis
Introduction
Setting up Anaconda
Installing the Data Science Toolbox
Creating a virtual environment with virtualenv and virtualenvwrapper
Sandboxing Python applications with Docker images
Keeping track of package versions and history in IPython Notebook
Configuring IPython
Learning to log for robust error checking
Unit testing your code
Configuring pandas
Configuring matplotlib
Seeding random number generators and NumPy print options
Standardizing reports, code style, and data access
Chapter 10: Creating Attractive Data Visualizations
Introduction
Graphing Anscombe's quartet
Choosing seaborn color palettes
Choosing matplotlib color maps
Interacting with IPython Notebook widgets
Viewing a matrix of scatterplots
Visualizing with d3.js via mpld3
Creating heatmaps
Combining box plots and kernel density plots with violin plots
Visualizing network graphs with hive plots
Displaying geographical maps
Using ggplot2-like plots
Highlighting data points with influence plots
Chapter 11: Statistical Data Analysis and Probability
Introduction
Fitting data to the exponential distribution
Fitting aggregated data to the gamma distribution
Fitting aggregated counts to the Poisson distribution
Determining bias
Estimating kernel density
Determining confidence intervals for mean, variance, and standard deviation
Sampling with probability weights
Exploring extreme values
Correlating variables with Pearson's correlation
Correlating variables with the Spearman rank correlation
Correlating a binary and a continuous variable with the point biserial correlation
Evaluating relations between variables with ANOVA
Chapter 12: Dealing with Data and Numerical Issues
Introduction
Clipping and filtering outliers
Winsorizing data
Measuring central tendency of noisy data
Normalizing with the Box-Cox transformation
Transforming data with the power ladder
Transforming data with logarithms
Rebinning data
Applying logit() to transform proportions
Fitting a robust linear model
Taking variance into account with weighted least squares
Using arbitrary precision for optimization
Using arbitrary precision for linear algebra
Chapter 13: Web Mining, Databases, and Big Data
Introduction
Simulating web browsing
Scraping the Web
Dealing with non-ASCII text and HTML entities
Implementing association tables
Setting up database migration scripts
Adding a table column to an existing table
Adding indices after table creation
Setting up a test web server
Implementing a star schema with fact and dimension tables
Using HDFS
Setting up Spark
Clustering data with Spark
Chapter 14: Signal Processing and Timeseries
Introduction
Spectral analysis with periodograms
Estimating power spectral density with the Welch method
Analyzing peaks
Measuring phase synchronization
Exponential smoothing
Evaluating smoothing
Using the Lomb-Scargle periodogram
Analyzing the frequency spectrum of audio
Analyzing signals with the discrete cosine transform
Block bootstrapping time series data
Moving block bootstrapping time series data
Applying the discrete wavelet transform
Chapter 15: Selecting Stocks with Financial Data Analysis
Introduction
Computing simple and log returns
Ranking stocks with the Sharpe ratio and liquidity
Ranking stocks with the Calmar and Sortino ratios
Analyzing returns statistics
Correlating individual stocks with the broader market
Exploring risk and return
Examining the market with the non-parametric runs test
Testing for random walks
Determining market efficiency with autoregressive models
Creating tables for a stock prices database
Populating the stock prices database
Optimizing an equal weights two-asset portfolio
Chapter 16: Text Mining and Social Network Analysis
Introduction
Creating a categorized corpus
Tokenizing news articles in sentences and words
Stemming, lemmatizing, filtering, and TF-IDF scores
Recognizing named entities
Extracting topics with non-negative matrix factorization
Implementing a basic terms database
Computing social network density
Calculating social network closeness centrality
Determining the betweenness centrality
Estimating the average clustering coefficient
Calculating the assortativity coefficient of a graph
Getting the clique number of a graph
Creating a document graph with cosine similarity
Chapter 17: Ensemble Learning and Dimensionality Reduction
Introduction
Recursively eliminating features
Applying principal component analysis for dimension reduction
Applying linear discriminant analysis for dimension reduction
Stacking and majority voting for multiple models
Learning with random forests
Fitting noisy data with the RANSAC algorithm
Bagging to improve results
Boosting for better learning
Nesting cross-validation
Reusing models with joblib
Hierarchically clustering data
Taking a Theano tour
Chapter 18: Evaluating Classifiers, Regressors, and Clusters
Introduction
Getting classification straight with the confusion matrix
Computing precision, recall, and F1-score
Examining a receiver operating characteristic and the area under a curve
Visualizing the goodness of fit
Computing MSE and median absolute error
Evaluating clusters with the mean silhouette coefficient
Comparing results with a dummy classifier
Determining MAPE and MPE
Comparing with a dummy regressor
Calculating the mean absolute error and the residual sum of squares
Examining the kappa of classification
Taking a look at the Matthews correlation coefficient
Chapter 19: Analyzing Images
Introduction
Setting up OpenCV
Applying Scale-Invariant Feature Transform (SIFT)
Detecting features with SURF
Quantizing colors
Denoising images
Extracting patches from an image
Detecting faces with Haar cascades
Searching for bright stars
Extracting metadata from images
Extracting texture features from images
Applying hierarchical clustering on images
Segmenting images with spectral clustering
Chapter 20: Parallelism and Performance
Introduction
Just-in-time compiling with Numba
Speeding up numerical expressions with Numexpr
Running multiple threads with the threading module
Launching multiple tasks with the concurrent.futures module
Accessing resources asynchronously with the asyncio module
Distributed processing with execnet
Profiling memory usage
Calculating the mean, variance, skewness, and kurtosis on the fly
Caching with a least recently used cache
Caching HTTP requests
Streaming counting with the Count-min sketch
Harnessing the power of the GPU with OpenCL
Chapter 21: Tools of the Trade
Before you start
Using the notebook interface
Imports
An example using the Pandas library
Summary
Chapter 22: Exploring Data
The General Social Survey
Univariate data
Relationships between variables – scatterplots
Summary
Chapter 23: Learning About Models
Models and experiments
The cumulative distribution function
Working with distributions
The probability density function
Where do models come from?
Multivariate distributions
Summary
Chapter 24: Regression
Introducing linear regression
Multivariate regression
Logistic regression
Summary
Chapter 25: Clustering
Introduction to cluster finding
K-means clustering
Hierarchical clustering analysis
Summary
Chapter 26: Bayesian Methods
The Bayesian method
U.S. air travel safety record
Climate change - CO2 in the atmosphere
Summary
Chapter 27: Supervised and Unsupervised Learning
Introduction to machine learning
Scikit-learn
Linear regression
Clustering
Seeds classification
Summary
Chapter 28: Time Series Analysis
Introduction
Pandas and time series data
Indexing and slicing
Resampling, smoothing, and other estimates
Stationarity
Patterns and components
Time series models
Summary

What You Will Learn

  • Understand the importance of data analysis and master its processing steps
  • Get comfortable using Python and its associated data analysis libraries such as Pandas, NumPy, and SciPy
  • Clean and transform your data and apply advanced statistical analysis to create attractive visualizations
  • Analyze images and time series data
  • Mine text and analyze social networks
  • Perform web scraping and work with different databases, Hadoop, and Spark
  • Use statistical models to discover patterns in data
  • Detect similarities and differences in data with clustering
  • Work with Jupyter Notebook to produce publication-ready figures to be included in reports
