Mastering pandas - Second Edition

By Ashish Kumar

About this book

pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful data manipulation techniques in pandas to perform complex data analysis in various domains.

An update to our highly successful previous edition with new features, examples, updated code, and more, this book is an in-depth guide to get the most out of pandas for data analysis. Designed for both intermediate users as well as seasoned practitioners, you will learn advanced data manipulation techniques, such as multi-indexing, modifying data structures, and sampling your data, which allow for powerful analysis and help you gain accurate insights from it. With the help of this book, you will apply pandas to different domains, such as Bayesian statistics, predictive analytics, and time series analysis using an example-based approach. And not just that; you will also learn how to prepare powerful, interactive business reports in pandas using the Jupyter notebook.

By the end of this book, you will learn how to perform efficient data analysis using pandas on complex data, and become an expert data analyst or data scientist in the process.

Publication date:
October 2019


Introduction to pandas and Data Analysis

We start the book and this chapter by discussing the contemporary data analytics landscape and how pandas fits into that landscape. pandas is the go-to tool for data scientists for data pre-processing tasks. We will learn about the technicalities of pandas in the later chapters. This chapter covers the context, origin, history, market share, and current standing of pandas.

The chapter has been divided into the following headers:

  • Motivation for data analysis
  • How Python and pandas can be used for data analysis
  • Description of the pandas library
  • Benefits of using pandas

Motivation for data analysis

In this section, we discuss the trends that are making data analysis an increasingly important field in today's fast-moving technological landscape.

We live in a big data world

The term big data has become one of the hottest technology buzzwords in the past two years. We now increasingly hear about big data in various media outlets, and big data start-ups have increasingly been attracting venture capital. A good example in the area of retail is Target Corporation, which has invested substantially in big data and is now able to identify potential customers by using big data to analyze people's shopping habits online; refer to a related article at

Loosely speaking, big data refers to the phenomenon wherein the amount of data exceeds the capability of the recipients of the data to process it. Here is an article on big data that sums it up nicely:

The four V's of big data

A good way to start thinking about the complexities of big data is in terms of what are called the four dimensions, or Four V's, of big data. This model was first introduced as the three V's by Gartner analyst Doug Laney in 2001. The three V's stood for Volume, Velocity, and Variety, and the fourth V, Veracity, was added later by IBM. Gartner's official definition states the following:

"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
                        Laney, Douglas. "The Importance of 'Big Data': A Definition", Gartner

Volume of big data

The volume of data in the big data age is simply mind-boggling. According to IBM, by 2020, the total amount of data on the planet will have ballooned to 40 zettabytes. You heard that right! 40 zettabytes is 43 trillion gigabytes. For more information on this, refer to the Wikipedia page on the zettabyte:
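The conversion above can be checked with a few lines of arithmetic. The following is an illustrative sketch and not part of the IBM estimate: it converts 40 zettabytes to gigabytes under both decimal (SI) and binary prefixes. The variable names are made up for this example.

```python
# Convert 40 zettabytes to gigabytes under two common conventions.
ZETTABYTES = 40

# Decimal (SI) prefixes: 1 ZB = 10**21 bytes, 1 GB = 10**9 bytes.
gb_decimal = ZETTABYTES * 10**21 // 10**9

# Binary prefixes: 1 ZiB = 2**70 bytes, 1 GiB = 2**30 bytes.
gb_binary = ZETTABYTES * 2**70 // 2**30

print(f"decimal: {gb_decimal / 1e12:.2f} trillion GB")
print(f"binary:  {gb_binary / 1e12:.2f} trillion GiB")
```

With decimal prefixes the result is 40 trillion gigabytes. With binary prefixes it is roughly 43.98 trillion, which is close to the 43 trillion figure quoted above.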

To get a handle on how much data this is, let me refer to an EMC press release published in 2010, which stated what 1 zettabyte was approximately equal to:

"The digital information created by every man, woman and child on Earth 'Tweeting' continuously for 100 years " or "75 billion fully-loaded 16 GB Apple iPads, which would fill the entire area of Wembley Stadium to the brim 41 times, the Mont Blanc Tunnel 84 times, CERN's Large Hadron Collider tunnel 151 times, Beijing National Stadium 15.5 times or the Taipei 101 Tower 23 times..."
                        EMC study projects 45× data growth by 2020

The growth rate of data has been fuelled largely by a few factors, such as the following:

  • The rapid growth of the internet.
  • The conversion from analog to digital media, coupled with an increased ability to capture and store data, which in turn has been made possible with cheaper and better storage technology. There has been a proliferation of digital data input devices, such as cameras and wearables, and the cost of huge data storage has fallen rapidly. Amazon Web Services is a prime example of the trend toward much cheaper storage.

The internetification of devices, or rather the Internet of Things, is the phenomenon wherein common household devices, such as our refrigerators and cars, will be connected to the internet. This phenomenon will only accelerate the above trend.

Velocity of big data

From a purely technological point of view, velocity refers to the throughput of big data, or how fast the data is coming in and is being processed. This has ramifications on how fast the recipient of the data needs to process it to keep up. Real-time analytics is one attempt to handle this characteristic. Tools that can enable this include Amazon Web Services Elastic MapReduce.

At a more macro level, the velocity of data can also be regarded as the increased speed at which data and information can now be transferred and processed, over greater distances than ever before.

The proliferation of high-speed data and communication networks coupled with the advent of cell phones, tablets, and other connected devices are primary factors driving information velocity. Some measures of velocity include the number of tweets per second and the number of emails per minute.

Variety of big data

The variety of big data comes from having a multiplicity of data sources that generate data and the different formats of data that are produced.

This results in a technological challenge for the recipients of the data who have to process it. Digital cameras, sensors, the web, cell phones, and so on are some of the data generators that produce data in differing formats, and the challenge is being able to handle all these formats and extract meaningful information from the data. The ever-changing nature of data formats with the dawn of the big data era has led to a revolution in the database technology industry with the rise of NoSQL databases, which handle what is known as unstructured data or rather data whose format is fungible or constantly changing. 

Veracity of big data

The fourth characteristic of big data, veracity, which was added later, refers to the need to validate or confirm the correctness of the data, or the fact that the data represents the truth. The sources of data must be verified and errors kept to a minimum. According to an estimate by IBM, poor data quality costs the US economy about $3.1 trillion a year. For example, medical errors cost the United States $19.5 billion in 2008; you can refer to a related article for more information.

The following link provides an infographic by IBM that summarizes the four V's of big data:

So much data, so little time for analysis

Data analytics has been described by Eric Schmidt, the former CEO of Google, as the Future of Everything. For more information, check out a YouTube video called Why Data Analytics is the Future of Everything at

The volume and velocity of data will continue to increase in the big data age. Companies that can efficiently collect, filter, and analyze data, and turn it into information that lets them meet their customers' needs more quickly, will gain a significant advantage over their competitors. For example, data analytics (the Culture of Metrics) plays a key role in the business strategy of Amazon. For more information, refer to the case study by Smart Insights at

The move towards real-time analytics

As technologies and tools have evolved to meet the ever-increasing demands of business, there has been a move towards what is known as real-time analytics. More information on this is available from Intel in their Insight Everywhere whitepaper at

In the big data internet era, here are some examples of real-time analytics on big data:

  • Online businesses demand instantaneous insights into how the new products/features they have introduced online are doing and can adjust their online product mix accordingly. Amazon is a prime example of this with their Customers Who Viewed This Item Also Viewed feature.
  • In finance, risk management and trading systems demand almost instantaneous analysis in order to make effective decisions based on data-driven insights.

Data analytics pipeline

Data modeling is the process of using data to build predictive models. Data can also be used for descriptive and prescriptive analysis. But before we make use of data, it has to be fetched from several sources, stored, assimilated, cleaned, and engineered to suit our goal. The sequential operations that need to be performed on data are akin to a manufacturing pipeline, where each subsequent step adds value to the potential end product and each progression requires a new person or skill set.

The various steps in a data analytics pipeline are shown in the following diagram: 

Steps in data analytics pipeline
  1. Extract Data
  2. Transform Data
  3. Load Data
  4. Read & Process Data
  5. Exploratory Data Analysis
  6. Create Features
  7. Build Predictive Models
  8. Validate Models
  9. Build Products

These steps can be combined into three high-level categories: data engineering, data science, and product development.

  • Data Engineering: Step 1 to Step 3 in the preceding diagram fall into this category. It deals with sourcing data from a variety of sources, creating a suitable database and table schema, and loading the data in a suitable database. There can be many approaches to this step depending on the following:
    • Type of data: Structured (tabular data) versus unstructured (such as images and text) versus semi-structured (such as JSON and XML)
    • Velocity of data updates: Batch processing versus real-time data streaming
    • Volume of data: Distributed (or cluster-based) storage versus single instance databases
    • Variety of data: Document storage, blob storage, or data lake
  • Data Science: Step 4 to Step 8 in the preceding diagram fall into the category of data science. This is the phase where the data is made usable and used to predict the future, learn patterns, and extrapolate these patterns. Data science can further be sub-divided into two phases.

Step 4 to Step 6 comprise the first phase, wherein the goal is to understand the data better and make it usable. Making the data usable requires considerable effort to clean it by removing invalid characters and missing values. It also involves understanding the nitty-gritty of the data at hand: what is the distribution of the data, what is the relationship between different data variables, is there a causal relationship between the input and outcome variables, and so on. It also involves exploring numerical transformations (features) that might explain this causation (between input and outcome variables) better. This phase entails the real forensic effort that goes into the ultimate use of data. To use an analogy, bamboo seeds remain buried in the soil for years with no signs of a sapling growing, and suddenly a sapling grows, and within months a full bamboo tree is ready. This phase of data science is akin to the underground preparation the bamboo seeds undergo before the rapid growth. This is like the stealth mode of a start-up, wherein a lot of time and effort is committed. And this is where the pandas library, protagonist of this book, finds its raison d'être and sweet spot.

Step 7 to Step 8 constitute the part where patterns (the parameters of a mathematical expression) are learned from historic data and extrapolated to future data. It involves a lot of experimentation and iterations to get to the optimal results. But if Step 4 to Step 6 have been done with the utmost care, this phase can be implemented pretty quickly thanks to the number of packages in Python, R, and many other data science tools. Of course, it requires a sound understanding of the math and algorithms behind the applied model in order to tweak its parameters to perfection.

  • Product Development: This is the phase where all the hard work bears fruit and all the insights, results, and patterns are served to the users in a way that they can consume, understand, and act upon. It might range from building a dashboard on data with additional derived fields to an API that calls a trained model and returns an output on incoming data. A product can also be built to encompass all the stages of the data pipeline, from extracting the data to building a predictive model or creating an interactive dashboard.

Apart from these steps in the pipeline, there are some additional steps that might come into the picture. This is due to the highly evolving nature of the data landscape. For example, deep learning, which is used extensively to build intelligent products around image, text, and audio data, often requires the training data to be labeled into a category or augmented if the quantity is too small to create an accurate model.

For example, an object detection task on video data might require the creation of training data for object boundaries and object classes using some tools, or even manually. Data augmentation helps with image data by creating slightly perturbed data (rotated or grained images, for example) and adding it to training data. For a supervised learning task, labels are mandatory. This label is generally generated together with the data. For example, to train a churn model, a dataset with customer descriptions and when they churned out is required. This information is generally available in the company's CRM tool.


How Python and pandas fit into the data analytics pipeline

The Python programming language is one of the fastest-growing languages today in the emerging field of data science and analytics. Python was created by Guido van Rossum in 1991, and its key features include the following:

  • Interpreted rather than compiled
  • Dynamic type system
  • Pass by value with object references
  • Modular capability
  • Comprehensive libraries
  • Extensibility with respect to other languages
  • Object orientation
  • Most of the major programming paradigms: procedural, object-oriented, and, to a lesser extent, functional

For more information, refer to the following article on Python at

Among the characteristics that make Python popular for data science are its very user-friendly (human-readable) syntax, the fact that it is interpreted rather than compiled (leading to faster development time), its very comprehensive libraries for parsing and analyzing data, and its capacity for numerical and statistical computations. Python has libraries that provide a complete toolkit for data science and analysis. The major ones are as follows:

  • NumPy: The general-purpose array functionality with an emphasis on numeric computation
  • SciPy: Numerical computing
  • Matplotlib: Graphics
  • pandas: Series and data frames (1D and 2D array-like types)
  • Scikit-learn: Machine learning
  • NLTK: Natural language processing
  • statsmodels: Statistical analysis

For this book, we will be focusing on the fourth library in the preceding list, pandas.


What is pandas?

The pandas we are going to obsess over in this book are not the cute and lazy animals that also do kung fu when needed.

pandas is a high-performance open source library for data analysis in Python developed by Wes McKinney in 2008. pandas stands for panel data, a reference to the tabular format in which it processes the data. It is available for free and is distributed with a 3-Clause BSD License under the open source initiative.

Over the years, it has become the de facto standard library for data analysis using Python. The tool has seen great adoption and has a large community behind it (1,200+ contributors, 17,000+ commits, 23 versions, and 15,000+ stars), so iteration is rapid and features and enhancements are continuously added.

Some key features of pandas include the following:

  • It can process a variety of datasets in different formats: time series, tabular heterogeneous, and matrix data.
  • It facilitates loading/importing data from varied sources, such as CSV files and SQL databases.
  • It can handle myriad operations on datasets: subsetting, slicing, filtering, merging, groupBy, re-ordering, and re-shaping.
  • It can deal with missing data according to rules defined by the user/developer, such as ignore, convert to 0, and so on.
  • It can be used for parsing and munging (conversion) of data as well as modeling and statistical analysis.
  • It integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.
  • It delivers fast performance and can be sped up even more by making use of Cython (C extensions to Python).
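The missing-data rules mentioned in the list above (ignore, convert to 0, and so on) can be sketched as follows. This is a minimal illustration on a made-up DataFrame. The column names and values are hypothetical, and the three rules shown are only some of the options pandas offers.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps (NaN marks a missing value).
readings = pd.DataFrame(
    {"sensor": ["a", "b", "c", "d"],
     "temperature": [21.5, np.nan, 19.0, np.nan]}
)

# Rule 1: ignore missing values (aggregations skip NaN by default).
mean_ignoring_nan = readings["temperature"].mean()

# Rule 2: convert missing values to 0.
filled_with_zero = readings["temperature"].fillna(0)

# Rule 3: drop rows that contain a missing value.
complete_rows = readings.dropna()

print(mean_ignoring_nan)
print(filled_with_zero.tolist())
print(len(complete_rows))
```

Which rule is appropriate depends on the analysis: replacing gaps with 0 changes the mean, whereas ignoring them does not.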

For more information, go through the official pandas documentation at


Where does pandas fit in the pipeline?

As discussed in the previous section, pandas can be used to perform Step 4 to Step 6 in the pipeline. And Step 4 to Step 6 are the backbone of any data science process, application, or product:

Where does pandas fit in the data analytics pipeline?

Steps 1 to 6 can all be performed in pandas in some way. Steps 4 to 6 are its primary tasks, while Steps 1 to 3 can also be handled with pandas to some extent.
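As a rough sketch of what Step 4 to Step 6 can look like in pandas, the following example reads a small dataset, cleans it, explores it, and derives a feature. The data, column names, and the choice of dropping incomplete rows are assumptions made for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical raw data; an in-memory buffer stands in for a real file.
raw = io.StringIO(
    "customer,visits,spend\n"
    "c1,4,120.0\n"
    "c2,,80.0\n"
    "c3,10,300.0\n"
    "c4,2,50.0\n"
)

# Step 4: read and process (here, drop rows with missing values).
df = pd.read_csv(raw).dropna()

# Step 5: exploratory analysis (summary statistics and correlation).
summary = df[["visits", "spend"]].describe()
correlation = df["visits"].corr(df["spend"])

# Step 6: create a feature derived from existing columns.
df = df.assign(spend_per_visit=df["spend"] / df["visits"])

print(summary)
print(df)
```

Real projects usually iterate over these steps several times, but the same handful of calls covers much of the routine work.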

pandas is an indispensable library if you're working with data, and it would be near impossible to find code for data modeling that doesn't import pandas into the working environment. Easy-to-use syntax in Python and the availability of a spreadsheet-like data structure called a DataFrame make it amenable even to users who are too comfortable with Excel and unwilling to move away from it. At the same time, it is loved by scientists and researchers for handling file formats such as Parquet and Feather, among many others. It can read data in batch mode without clogging all the machine's memory. No wonder the famous news aggregator Quartz called it the most important tool in data science.

pandas is well suited to the following types of dataset:

  • Tabular with heterogeneous type columns
  • Ordered and unordered time series
  • Matrix/array data with labeled or unlabeled rows and columns
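The three kinds of dataset listed above can each be built in a few lines. The sketch below uses made-up values and names purely for illustration.

```python
import numpy as np
import pandas as pd

# Tabular data with heterogeneous column types (strings, integers, floats).
table = pd.DataFrame(
    {"country": ["A", "B"], "population": [10, 20], "growth": [0.5, 1.2]}
)

# A time series: values indexed by dates.
dates = pd.date_range("2019-01-01", periods=3, freq="D")
series = pd.Series([1.0, 2.0, 3.0], index=dates)

# Matrix data with labeled rows and columns.
matrix = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=["r1", "r2"], columns=["x", "y", "z"]
)

print(table.dtypes)
print(series)
print(matrix)
```

In each case the labels (column names, dates, row names) travel with the data, which is what distinguishes these structures from plain arrays.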

pandas can perform the following operations on data with finesse:

  • Easy handling of missing and NaN data
  • Addition and deletion of columns
  • Automatic and explicit data alignment with labels
  • GroupBy for aggregating and transforming data using split-apply-combine
  • Converting differently indexed Python or NumPy data to DataFrame
  • Slicing, indexing, hierarchical indexing, and subsetting of data
  • Merging, joining, and concatenating data
  • I/O methods for flat files, HDF5, feather, and parquet formats
  • Time series functionality
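A few of the operations in the list above are sketched below: label alignment, GroupBy with split-apply-combine, and merging. The data and names are hypothetical and chosen only to keep the example short.

```python
import pandas as pd

# Automatic alignment: values are matched on index labels, not position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
aligned_sum = s1 + s2  # labels present in only one Series yield NaN

# GroupBy: split rows by key, apply an aggregation, combine the results.
sales = pd.DataFrame(
    {"region": ["east", "west", "east", "west"], "amount": [100, 200, 150, 50]}
)
totals = sales.groupby("region")["amount"].sum()

# Merge: join two tables on a shared key column.
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region", how="left")

print(aligned_sum)
print(totals)
print(merged)
```

The left merge keeps every row of the sales table and attaches the matching manager, much as a left join would in SQL.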

Benefits of using pandas

pandas forms a core component of the Python data analysis corpus. The distinguishing feature of pandas is that the suite of data structures it provides is naturally suited to data analysis: primarily the DataFrame and, to a lesser extent, the Series (1D vectors). The Panel (3D tables) was deprecated in pandas 0.20 and removed in pandas 0.25.

Simply put, pandas and statsmodels can be described as Python's answer to R, the data analysis and statistical programming language that provides both data structures, such as R data frames, and a rich statistical library for data analysis.

The benefits of pandas compared to using a language such as Java, C, or C++ for data analysis are manifold:

  • Data representation: It can easily represent data in a form that's naturally suited for data analysis via its DataFrame and series data structures in a concise manner. Doing the equivalent in Java/C/C++ requires many lines of custom code as these languages were not built for data analysis but rather networking and kernel development.
  • Data subsetting and filtering: It permits easy subsetting and filtering of data, procedures that are a staple of doing data analysis.
  • Concise and clear code: Its concise and clear API allows the user to focus more on their core goal, rather than having to write a lot of scaffolding code in order to perform routine tasks. For example, reading a CSV file into a DataFrame data structure in memory takes two lines of code, while doing the same task in Java/C/C++ requires many more lines of code or calls to non-standard libraries, as illustrated below. Let's suppose that we had the following data to read:

In a CSV file, this data that we wish to read would look like the following:

InternetUsagePer1000, LifeExpectancy, Population

The data here is taken from World Bank Economic data, available at

In Java, we would have to write the following code:

import java.io.*;
import java.util.*;

public class CSVReader {
    public static void main(String[] args) {
        String csvFile = args[0]; // first command-line argument: path to the CSV file
        CSVReader csvReader = new CSVReader();
        List<Map<String, String>> data = csvReader.readCSV(csvFile);
        System.out.println(data);
    }

    public List<Map<String, String>> readCSV(String csvFile) {
        String line = "";
        String delim = ",";
        // Initialize list of maps, each representing a line of the CSV file
        List<Map<String, String>> data = new ArrayList<>();
        try (BufferedReader bReader = new BufferedReader(new FileReader(csvFile))) {
            // Read the CSV file, line by line
            while ((line = bReader.readLine()) != null) {
                String[] row = line.split(delim);
                Map<String, String> csvRow = new HashMap<>();
                csvRow.put("Country", row[0]);
                csvRow.put("CO2Emissions", row[2]);
                csvRow.put("PowerConsumption", row[3]);
                data.add(csvRow);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return data;
    }
}

But, using pandas, it would take just two lines of code:

import pandas as pd
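A minimal sketch of such a read follows, using `pd.read_csv`. The column names are borrowed from the Java example, while the values and the in-memory buffer are assumptions so that the snippet runs without a file on disk. With a real file, its path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real call would pass a file path instead.
csv_text = io.StringIO(
    "Country,CO2Emissions,PowerConsumption\n"
    "A,1.5,100\n"
    "B,2.5,200\n"
)
df = pd.read_csv(csv_text)

# Subsetting and filtering take one more line (the threshold is arbitrary).
high_emitters = df[df["CO2Emissions"] > 2.0]

print(df)
print(high_emitters)
```

The header row becomes the column labels and the column types are inferred, so no per-row parsing code is needed.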

In addition, pandas is built upon the NumPy library and hence inherits many of the performance benefits of this package, especially when it comes to numerical and scientific computing. One oft-touted drawback of using Python is that as a scripting language, its performance relative to languages such as Java/C/C++ has been rather slow. However, this is not really the case for pandas.
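The relationship with NumPy can be seen directly. In the sketch below, which uses made-up data, a numeric column is exposed as a NumPy array, and a vectorized expression gives the same result as an explicit Python loop. The vectorized form is typically much faster on large columns, although the actual speed-up depends on the data and the machine.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with 1,000 values.
df = pd.DataFrame({"x": np.arange(1_000, dtype="float64")})

# The column's values are exposed as a NumPy array.
values = df["x"].to_numpy()

# Vectorized: one expression operates on the whole column at once.
vectorized = df["x"] * 2 + 1

# Equivalent pure-Python loop, shown only for comparison.
looped = [x * 2 + 1 for x in df["x"]]

print(type(values))
print(vectorized.tolist() == looped)
```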


History of pandas

The basic version of pandas was built in 2008 by Wes McKinney, an MIT grad with heavy quantitative finance experience. Now a celebrity in his own right, thanks to his open source contributions and the wildly popular book called Python for Data Analysis, he was reportedly frustrated with the time he had to waste doing simple data manipulation tasks at his job, such as reading a CSV file, with the popular tools at that time. He said he quickly fell in love with Python for its intuitive and accessible nature after not finding Excel and R suitable for his needs. But he found that it was missing key features that would make it the go-to tool for data analysis, for example, an intuitive format to deal with spreadsheet data or to create new calculated columns from existing columns.

According to an interview he gave to Quartz, the design considerations and vision that he had in mind while creating the tool were the following:

  • Quality of data is far more important than any fancy analysis
  • Treating in-memory data like a SQL table or an Excel spreadsheet
  • Intuitive analysis and exploration with minimal and elegant code
  • Easier compatibility with other libraries used for the same or different steps in the data pipeline

After building the basic version, he went on to pursue a PhD at Duke University but dropped out in a quest to make the tool he had created a cornerstone for data science and Python. With his dedicated contribution, together with the release of popular Python visualization libraries such as Matplotlib, followed by machine learning libraries such as Scikit-Learn and interactive user interfaces such as Jupyter and Spyder, pandas and eventually Python became the hottest tool in the armory of any data scientist.

Wes is heavily invested in the constant improvement of the tool he created from scratch. He coordinates the development of new features and the improvement of existing ones. The data science community owes him big time.


Usage pattern and adoption of pandas

The popularity of Python has skyrocketed over the years, especially after 2012; a lot of this can be attributed to the popularity of pandas. Python-related questions make up around 12% of the total questions asked from high-income countries on Stack Overflow, a popular platform for developers to ask questions and get answers from other people in the community about how to get things done and fix bugs in different programming languages. Given that there are hundreds of programming languages, one language occupying 12% of market share is an extraordinary achievement:

The most popular data analytics tools based on a survey of Kaggle users conducted in 2017-18

According to this survey conducted by Kaggle, 60% of the respondents said that they were aware of or have used Python for their data science jobs.

According to the data recorded by Stack Overflow about the types of question asked on their platform, Python and pandas have registered steady growth year on year, while some of the other programming languages, such as Java and C, have declined in popularity and are playing catch-up. Python has almost caught up with the number of questions asked about Java on the platform, while the number for Java has shown a negative trend. pandas has been showing constant growth in numbers.

The following chart is based on data gathered from the SQL API exposed by Stack Overflow. The axis represents the number of questions asked about that topic on Stack Overflow in a particular year:

Popularity of tools across years based on the # questions asked on Stack Overflow

Google Trends also shows a surge in popularity for pandas, as demonstrated in the following chart. The numbers represent interest in pandas relative to the highest point (historically) on the chart for the given region and time.

Popularity of pandas based on data from Google Trends

The geographical split of the popularity of pandas is even more interesting. The highest interest has come from China, which might be an indicator of the high adoption of open source tools and/or a very high inclination towards building powerful tech for data science:

Popularity of pandas across geographies based on Google Trends data

Apart from the popularity with its users, pandas (owing to its open source origins) also has a thriving community that is committed to constantly improving it and making it easier for the users to get answers about the issues. The following chart shows the weekly modifications (additions/deletions) to the pandas source code by the contributors:

Number of additions/deletions done to the pandas source code by contributors

pandas on the technology adoption curve

According to a popular framework called Gartner Hype Cycle, there are five phases in the process of the proliferation and adoption of technologies:

  • Technology trigger
  • Peak of inflated expectations
  • Trough of disillusionment
  • Slope of enlightenment
  • Plateau of productivity

The following link contains a chart that shows different technologies and the stage they are at on the technology adoption curve

As can be seen, Predictive Analytics has already reached the steady plateau of productivity, which is where the optimum and stable return on investment can be extracted from a technology. Since pandas is an essential component of most predictive analytics initiatives, it is safe to say that pandas has reached the plateau of productivity.


Popular applications of pandas

pandas is built on top of NumPy. Some of the noteworthy uses of pandas, apart from every other data science project of course, are the following:

  • pandas is a dependency of statsmodels, making it a significant part of Python's numerical computing ecosystem.
  • pandas has been used extensively in the production of many financial applications.


Summary

We live in a big data era characterized by the four V's: volume, velocity, variety, and veracity. The volume and velocity of data are set to increase for the foreseeable future. Companies that can harness and analyze big data to extract information and take actionable decisions based on this information will be the winners in the marketplace. Python is a fast-growing, user-friendly, extensible language that is very popular for data analysis.

pandas is a core library of the Python toolkit for data analysis. It provides features and capabilities that make data analysis much easier and faster than in many other popular languages, such as Java, C, C++, and Ruby.

Thus, given the strengths of Python outlined in this chapter as a choice for the analysis of data and the popularity it has gained from users, contributors, and industry leaders, data analysis practitioners utilizing Python should become adept at pandas in order to become more effective. This book aims to help you achieve this goal.

In the next chapter, we proceed towards this goal by first setting up the infrastructure required to run pandas on your computer. We will also see different ways and scenarios in which pandas can be used and run.


About the Author

  • Ashish Kumar

    Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. Natural Language Processing, IoT Analytics, R Shiny product development, and Ensemble ML methods are his core areas of expertise. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

