
Practical Data Science Cookbook

Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta

Practical Data Science Cookbook offers 89 hands-on recipes to help data scientists complete real-world data science and numerical projects in R and Python.

Book Details

ISBN 13: 9781783980246
Paperback: 396 pages

About This Book

  • Learn about the data science pipeline and use it to acquire, clean, analyze, and visualize data
  • Understand critical concepts in data science in the context of multiple projects
  • Expand your numerical programming skills through step-by-step code examples and learn more about the robust features of R and Python

Who This Book Is For

If you are an aspiring data scientist who wants to learn data science and numerical programming concepts through hands-on, real-world project examples, this is the book for you. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. Since the book is formatted to walk you through the projects with examples and explanations along the way, no prior programming experience is required.

Table of Contents

Chapter 1: Preparing Your Data Science Environment
Understanding the data science pipeline
Installing R on Windows, Mac OS X, and Linux
Installing libraries in R and RStudio
Installing Python on Linux and Mac OS X
Installing Python on Windows
Installing the Python data stack on Mac OS X and Linux
Installing extra Python packages
Installing and using virtualenv
Chapter 2: Driving Visual Analysis with Automobile Data (R)
Acquiring automobile fuel efficiency data
Preparing R for your first project
Importing automobile fuel efficiency data into R
Exploring and describing fuel efficiency data
Analyzing automobile fuel efficiency over time
Investigating the makes and models of automobiles
Chapter 3: Simulating American Football Data (R)
Acquiring and cleaning football data
Analyzing and understanding football data
Constructing indexes to measure offensive and defensive strength
Simulating a single game with outcomes decided by calculations
Simulating multiple games with outcomes decided by calculations
Chapter 4: Modeling Stock Market Data (R)
Acquiring stock market data
Summarizing the data
Cleaning and exploring the data
Generating relative valuations
Screening stocks and analyzing historical prices
Chapter 5: Visually Exploring Employment Data (R)
Preparing for analysis
Importing employment data into R
Exploring the employment data
Obtaining and merging additional data
Adding geographical information
Extracting state- and county-level wage and employment information
Visualizing geographical distributions of pay
Exploring where the jobs are, by industry
Animating maps for a geospatial time series
Benchmarking performance for some common tasks
Chapter 6: Creating Application-oriented Analyses Using Tax Data (Python)
Preparing for the analysis of top incomes
Importing and exploring the world's top incomes dataset
Analyzing and visualizing the top income data of the US
Furthering the analysis of the top income groups of the US
Reporting with Jinja2
Chapter 7: Driving Visual Analyses with Automobile Data (Python)
Getting started with IPython
Exploring IPython Notebook
Preparing to analyze automobile fuel efficiencies
Exploring and describing fuel efficiency data with Python
Analyzing automobile fuel efficiency over time with Python
Investigating the makes and models of automobiles with Python
Chapter 8: Working with Social Graphs (Python)
Preparing to work with social networks in Python
Importing networks
Exploring subgraphs within a heroic network
Finding strong ties
Finding key players
Exploring the characteristics of entire networks
Clustering and community detection in social networks
Visualizing graphs
Chapter 9: Recommending Movies at Scale (Python)
Modeling preference expressions
Understanding the data
Ingesting the movie review data
Finding the highest-scoring movies
Improving the movie-rating system
Measuring the distance between users in the preference space
Computing the correlation between users
Finding the best critic for a user
Predicting movie ratings for users
Collaboratively filtering item by item
Building a nonnegative matrix factorization model
Loading the entire dataset into the memory
Dumping the SVD-based model to the disk
Training the SVD-based model
Testing the SVD-based model
Chapter 10: Harvesting and Geolocating Twitter Data (Python)
Creating a Twitter application
Understanding the Twitter API v1.1
Determining your Twitter followers and friends
Pulling Twitter user profiles
Making requests without running afoul of Twitter's rate limits
Storing JSON data to the disk
Setting up MongoDB for storing Twitter data
Storing user profiles in MongoDB using PyMongo
Exploring the geographic information available in profiles
Plotting geospatial data in Python
Chapter 11: Optimizing Numerical Code with NumPy and SciPy (Python)
Understanding the optimization process
Identifying common performance bottlenecks in code
Reading through the code
Profiling Python code with the Unix time function
Profiling Python code using built-in Python functions
Profiling Python code using IPython's %timeit function
Profiling Python code using line_profiler
Plucking the low-hanging (optimization) fruit
Testing the performance benefits of NumPy
Rewriting simple functions with NumPy
Optimizing the innermost loop with NumPy

What You Will Learn

  • Structure a data science project by using the data science pipeline
  • Acquire and ingest data from files, data stores, and directly from the Web
  • Clean, munge, and manipulate data into shape so that it is ready for analysis
  • Draw insights from the data and conduct analyses that will deliver those insights
  • Determine and apply the most appropriate model to your data
  • Interpret the results of your analysis and modeling
  • Communicate your results through a visualization, report, or application

In Detail

As increasing amounts of data are generated each year, the need to analyze and operationalize that data is more important than ever. Companies that know what to do with their data will have a competitive advantage over companies that don't, and this will drive higher demand for knowledgeable and competent data professionals.

Starting with the basics, this book will cover how to set up your numerical programming environment, introduce you to the data science pipeline (an iterative process by which data science projects are completed), and guide you through several data projects in a step-by-step format. By sequentially working through the steps in each chapter, you will quickly familiarize yourself with the process and learn how to apply it to a variety of situations with examples in the two most popular programming languages for data analysis—R and Python.

