Mastering pandas


About this book

Python is a groundbreaking language known for its simplicity and succinctness, allowing the user to achieve a great deal with a few lines of code, especially compared to other programming languages. pandas brings these qualities of Python into the data analysis realm by providing expressiveness, simplicity, and powerful capabilities for the task of data analysis. By mastering pandas, users will be able to do complex data analysis in a short period of time, as well as illustrate their findings using the rich visualization capabilities of related tools such as IPython and matplotlib.

This book is an in-depth guide to the use of pandas for data analysis, for either the seasoned data analysis practitioner or the novice user. It provides a basic introduction to the pandas framework, and takes users through the installation of the library and the IPython interactive environment. Thereafter, you will learn basic as well as advanced features, such as MultiIndexing, modifying data structures, and sampling data, which provide powerful capabilities for data analysis.

Publication date:
June 2015
Publisher
Packt
Pages
364
ISBN
9781783981960

 

Chapter 1. Introduction to pandas and Data Analysis

In this chapter, we address the following:

  • Motivation for data analysis

  • How Python and pandas can be used for data analysis

  • Description of the pandas library

  • Benefits of using pandas

 

Motivation for data analysis


In this section, we will discuss the trends that are making data analysis an increasingly important field of endeavor in today's fast-moving technological landscape.

We live in a big data world

The term big data has become one of the hottest technology buzzwords in the past two years. We now increasingly hear about big data in various media outlets, and big data startup companies have increasingly been attracting venture capital. A good example in the area of retail would be Target Corporation, which has invested substantially in big data and is now able to identify potential customers by using big data to analyze people's shopping habits online; refer to a related article at http://nyti.ms/19LT8ic.

Loosely speaking, big data refers to the phenomenon wherein the amount of data exceeds the capability of the recipients of the data to process it. Here is a Wikipedia entry on big data that sums it up nicely: http://en.wikipedia.org/wiki/Big_data.

4 V's of big data

A good way to start thinking about the complexities of big data is along what are called the 4 dimensions, or 4 V's of big data. This model was first introduced as the 3V's by Gartner analyst Doug Laney in 2001. The 3V's stood for Volume, Velocity, and Variety, and the 4th V, Veracity, was added later by IBM. Gartner's official definition is as follows:

 

"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

 
 --Laney, Douglas. "The Importance of 'Big Data': A Definition", Gartner

Volume of big data

The volume of data in the big data age is simply mind-boggling. According to IBM, by 2020, the total amount of data on the planet will have ballooned to 40 zettabytes. You heard that right: 40 zettabytes is 43 trillion gigabytes, which is about 4 × 10^22 bytes. For more information, refer to the Wikipedia page on the zettabyte at http://en.wikipedia.org/wiki/Zettabyte.

To get a handle on how much data this is, consider an EMC press release published in 2010, which described what 1 zettabyte is approximately equal to:

 

"The digital information created by every man, woman and child on Earth 'Tweeting' continuously for 100 years " or "75 billion fully-loaded 16 GB Apple iPads, which would fill the entire area of Wembley Stadium to the brim 41 times, the Mont Blanc Tunnel 84 times, CERN's Large Hadron Collider tunnel 151 times, Beijing National Stadium 15.5 times or the Taipei 101 Tower 23 times..."

 
 --EMC study projects 45× data growth by 2020

The growth rate of data has been fuelled largely by a few factors, such as the following:

  • The rapid growth of the Internet.

  • The conversion from analog to digital media coupled with an increased capability to capture and store data, which in turn has been made possible with cheaper and more capable storage technology. There has been a proliferation of digital data input devices such as cameras and wearables, and the cost of huge data storage has fallen rapidly. Amazon Web Services is a prime example of the trend toward much cheaper storage.

The Internetification of devices, or rather Internet of Things, is the phenomenon wherein common household devices, such as our refrigerators and cars, will be connected to the Internet. This phenomenon will only accelerate the above trend.

Velocity of big data

From a purely technological point of view, velocity refers to the throughput of big data, or how fast the data is coming in and being processed. This has ramifications for how fast the recipient of the data needs to process it to keep up. Real-time analytics is one attempt to handle this characteristic. Tools that help enable this include Amazon Elastic MapReduce.

At a more macro level, velocity also refers to the fact that data and information can now be transferred and processed faster, and over greater distances, than ever before.

The proliferation of high-speed data and communication networks coupled with the advent of cell phones, tablets, and other connected devices, are primary factors driving information velocity. Some measures of velocity include the number of tweets per second and the number of emails per minute.

Variety of big data

The variety of big data comes from having a multiplicity of data sources that generate the data, and the different formats of the data that are produced.

This results in a technological challenge for the recipients of the data, who have to process it. Digital cameras, sensors, the web, cell phones, and so on are some of the data generators that produce data in differing formats, and the challenge lies in being able to handle all these formats and extract meaningful information from the data. The ever-changing nature of data formats at the dawn of the big data era has led to a revolution in the database technology industry, with the rise of NoSQL databases to handle what is known as unstructured data, that is, data whose format is fungible or constantly changing. For more information, refer to Couchbase's "Why NoSQL?" at http://bit.ly/1c3iVEc.

Veracity of big data

The fourth characteristic of big data, veracity, which was added later, refers to the need to validate or confirm the correctness of the data, or the fact that the data represents the truth. The sources of data must be verified and errors kept to a minimum. According to an estimate by IBM, poor data quality costs the US economy about $3.1 trillion a year. For example, medical errors cost the United States $19.5 billion in 2008; for more information, refer to a related article at http://bit.ly/1CTah5r. Here is an infographic by IBM that summarizes the 4 V's of big data:

IBM on the 4 V's of big data

So much data, so little time for analysis

Data analytics has been described by Eric Schmidt, the former CEO of Google, as the Future of Everything. For reference, you can check out a YouTube video called Why Data Analytics is the Future of Everything at http://bit.ly/1KmqGCP.

The volume and velocity of data will continue to increase in the big data age. Companies that can efficiently collect, filter, and analyze data to produce information that lets them meet the needs of their customers in a much quicker timeframe will gain a significant competitive advantage over their competitors. For example, data analytics (the Culture of Metrics) plays a very key role in the business strategy of http://www.amazon.com/. For more information, refer to the Amazon.com Case Study, Smart Insights, at http://bit.ly/1glnA1u.

The move towards real-time analytics

As technologies and tools have evolved to meet the ever-increasing demands of business, there has been a move towards what is known as real-time analytics. For more information, refer to Insight Everywhere by Intel, available at http://intel.ly/1899xqo.

In the big data Internet era, here are some examples:

  • Online businesses demand instantaneous insights into how the new products/features they have introduced in their online market are doing and how they can adjust their online product mix accordingly. Amazon is a prime example of this with their Customers Who Viewed This Item Also Viewed feature.

  • In finance, risk management and trading systems demand almost instantaneous analysis in order to make effective decisions based on data-driven insights.

 

How Python and pandas fit into the data analytics mix


The Python programming language is one of the fastest-growing languages today in the emerging field of data science and analytics. Python was created by Guido van Rossum in 1991, and its key features include the following:

  • Interpreted rather than compiled

  • Dynamic type system

  • Call-by-sharing semantics (arguments are passed as object references)

  • Modular capability

  • Comprehensive libraries

  • Extensibility with respect to other languages

  • Object orientation

  • Support for most of the major programming paradigms: procedural, object-oriented, and, to a lesser extent, functional

Note

For more information, refer to the Wikipedia page on Python at http://en.wikipedia.org/wiki/Python_%28programming_language%29.

Among the characteristics that make Python popular for data science are its very user-friendly (human-readable) syntax, the fact that it is interpreted rather than compiled (leading to faster development time), and its very comprehensive library for parsing and analyzing data, as well as its capacity for doing numerical and statistical computations. Python has libraries that provide a complete toolkit for data science and analysis. The major ones are as follows:

  • NumPy: The general-purpose array functionality with emphasis on numeric computation

  • SciPy: Numerical computing

  • Matplotlib: Graphics

  • pandas: Series and data frames (1D and 2D array-like types)

  • Scikit-Learn: Machine learning

  • NLTK: Natural language processing

  • statsmodels: Statistical analysis

For this book, we will be focusing on the fourth library in the preceding list, pandas.
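To give a feel for how these libraries fit together, here is a minimal sketch (the data is made up for illustration): NumPy supplies the raw numeric array, pandas wraps it in a labeled structure, and the result can be handed directly to Matplotlib for plotting.

```python
import numpy as np
import pandas as pd

# NumPy provides fast numeric arrays and vectorized operations
values = np.arange(5) ** 2          # array([0, 1, 4, 9, 16])

# pandas wraps the array in a labeled Series, ready for analysis
squares = pd.Series(values, index=list("abcde"), name="squares")

# Matplotlib could now plot this directly, e.g. squares.plot()
print(squares.sum())                # -> 30
```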

 

What is pandas?


pandas is a high-performance open source library for data analysis in Python, developed by Wes McKinney in 2008. Over the years, it has become the de facto standard library for data analysis using Python. The tool has seen great adoption and has a large community behind it (220+ contributors and 9,000+ commits as of 03/2014), with rapid iteration and features and enhancements made continuously.

Some key features of pandas include the following:

  • It can process a variety of data sets in different formats: time series, tabular heterogeneous, and matrix data.

  • It facilitates loading/importing data from varied sources such as CSV and DB/SQL.

  • It can handle a myriad of operations on data sets: subsetting, slicing, filtering, merging, groupby, re-ordering, and re-shaping.

  • It can deal with missing data according to rules defined by the user/developer: ignore, convert to 0, and so on.

  • It can be used for parsing and munging (conversion) of data as well as modeling and statistical analysis.

  • It integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.

  • It delivers fast performance and can be speeded up even more by making use of Cython (C extensions to Python).

For more information go through the official pandas documentation available at http://pandas.pydata.org/pandas-docs/stable/.
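A few of the features listed above can be sketched in a handful of lines. The data below is loosely modeled on the chapter's World Bank example; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["Belarus", "Belarus", "Morocco", "Morocco"],
    "year": [2000, 2001, 2000, 2001],
    "co2": [5.91, 5.87, 1.2, np.nan],   # one missing value
})

# Subsetting/filtering: select the rows for a single country
belarus = df[df["country"] == "Belarus"]

# groupby: mean CO2 emissions per country (NaNs are skipped by default)
means = df.groupby("country")["co2"].mean()

# Missing data handled per user-defined rules, e.g. convert to 0
filled = df["co2"].fillna(0)

print(means["Belarus"])   # -> 5.89
```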

 

Benefits of using pandas


pandas forms a core component of the Python data analysis corpus. Its distinguishing feature is the suite of data structures it provides, which are naturally suited to data analysis: primarily the DataFrame, and to a lesser extent the Series (1D vectors) and the Panel (3D tables).
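A minimal sketch of the first two of these structures (the values here are arbitrary): a Series is a labeled 1D vector, and a DataFrame is a 2D labeled table whose columns are themselves Series.

```python
import pandas as pd

# A Series: a 1D vector with an index of labels
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# A DataFrame: a 2D labeled table; each column is a Series
df = pd.DataFrame({"x": s, "y": s * 2})

print(df.shape)           # -> (3, 2)
print(df.loc["b", "y"])   # -> 5.0  (label-based lookup)
```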

Simply put, pandas and statsmodels together can be described as Python's answer to R, the data analysis and statistical programming language that provides both data structures, such as R data frames, and a rich statistical library for data analysis.

The benefits of pandas over using a language such as Java, C, or C++ for data analysis are manifold:

  • Data representation: It can easily represent data in a form naturally suited for data analysis via its DataFrame and Series data structures in a concise manner. Doing the equivalent in Java/C/C++ would require many lines of custom code, as these languages were not built for data analysis but rather networking and kernel development.

  • Data subsetting and filtering: It provides for easy subsetting and filtering of data, procedures that are a staple of doing data analysis.

  • Concise and clear code: Its concise and clear API allows the user to focus more on the core goal at hand, rather than have to write a lot of scaffolding code in order to perform routine tasks. For example, reading a CSV file into a DataFrame data structure in memory takes two lines of code, while doing the same task in Java/C/C++ would require many more lines of code or calls to non-standard libraries, as illustrated in the following table. Here, let's suppose that we had the following data:

    Country  Year  CO2 Emissions  Power Consumption  Fertility Rate  Internet Usage Per 1000 People  Life Expectancy  Population
    Belarus  2000  5.91           2988.71            1.29            18.69                           68.01            1.00E+07
    Belarus  2001  5.87           2996.81                            43.15                                            9970260
    Belarus  2002  6.03           2982.77            1.25            89.8                            68.21            9925000
    Belarus  2003  6.33           3039.1             1.25            162.76                                           9873968
    Belarus  2004                 3143.58            1.24            250.51                          68.39            9824469
    Belarus  2005                                    1.24            347.23                          68.48            9775591

In a CSV file, this data that we wish to read would look like the following:

Country,Year,CO2Emissions,PowerConsumption,FertilityRate,InternetUsagePer1000,LifeExpectancy,Population
Belarus,2000,5.91,2988.71,1.29,18.69,68.01,1.00E+07
Belarus,2001,5.87,2996.81,,43.15,,9970260
Belarus,2002,6.03,2982.77,1.25,89.8,68.21,9925000
...
Philippines,2000,1.03,514.02,,20.33,69.53,7.58E+07
Philippines,2001,0.99,535.18,,25.89,,7.72E+07
Philippines,2002,0.99,539.74,3.5,44.47,70.19,7.87E+07
...
Morocco,2000,1.2,489.04,2.62,7.03,68.81,2.85E+07
Morocco,2001,1.32,508.1,2.5,13.87,,2.88E+07
Morocco,2002,1.32,526.4,2.5,23.99,69.48,2.92E+07
...

Note

The data here is taken from World Bank Economic data available at: http://data.worldbank.org.

In Java, we would have to write the following code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CSVReader {
    public static void main(String[] args) {
        String csvFile = args[0];
        CSVReader csvReader = new CSVReader();
        List<Map<String, String>> dataTable = csvReader.readCSV(csvFile);
    }

    public List<Map<String, String>> readCSV(String csvFile) {
        String line;
        String delim = ",";
        // Initialize the list of maps, each map representing a line of the CSV file
        List<Map<String, String>> data = new ArrayList<Map<String, String>>();
        try {
            BufferedReader bReader = new BufferedReader(new FileReader(csvFile));
            // Read the CSV file, line by line
            while ((line = bReader.readLine()) != null) {
                String[] row = line.split(delim);
                Map<String, String> csvRow = new HashMap<String, String>();
                csvRow.put("Country", row[0]);
                csvRow.put("Year", row[1]);
                csvRow.put("CO2Emissions", row[2]);
                csvRow.put("PowerConsumption", row[3]);
                csvRow.put("FertilityRate", row[4]);
                csvRow.put("InternetUsage", row[5]);
                csvRow.put("LifeExpectancy", row[6]);
                csvRow.put("Population", row[7]);
                data.add(csvRow);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return data;
    }
}

But, using pandas, it would take just two lines of code:

import pandas as pd
worldBankDF = pd.read_csv('worldbank.csv')

In addition, pandas is built upon the NumPy library and hence inherits many of the performance benefits of that package, especially when it comes to numerical and scientific computing. One oft-touted drawback of Python is that, as a scripting language, its performance relative to languages such as Java/C/C++ has been rather slow. However, this is not really the case for pandas.
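The NumPy underpinnings are easy to see: each pandas column is backed by a NumPy array, so vectorized operations run in compiled code rather than a Python loop. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Each column is backed by a NumPy array
arr = df["b"].to_numpy()            # df["b"].values in older pandas
print(arr.dtype)                    # -> float64

# Column arithmetic is vectorized: no explicit Python loop needed
total = (df["a"] * df["b"]).sum()   # 1*4 + 2*5 + 3*6
print(total)                        # -> 32.0
```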

 

Summary


We live in a big data era characterized by the 4 V's: volume, velocity, variety, and veracity. The volume and velocity of data will keep increasing for the foreseeable future. Companies that can harness and analyze big data to extract information, and take actionable decisions based on this information, will be the winners in the marketplace. Python is a fast-growing, user-friendly, extensible language that is very popular for data analysis.

pandas is a core library of the Python data analysis toolkit. It provides features and capabilities that make data analysis much easier and faster than in many other popular languages, such as Java, C, C++, and Ruby.

Thus, given the strengths of Python as a choice for data analysis, as outlined in the preceding section, data analysis practitioners using Python should become quite adept at pandas in order to be more effective. This book aims to help the user achieve that goal.

About the Author

  • Femi Anthony

    Femi Anthony is a seasoned and knowledgeable software programmer with over 15 years of experience in a vast array of languages, including Perl, C, C++, Java, and Python. He has worked in both the Internet space and the financial services space for many years and is now working for a well-known financial data company. He holds a bachelor's degree in mathematics with computer science from MIT and a master's degree from the University of Pennsylvania. His pet interests include data science, machine learning, and Python. Femi is working on a few side projects in these areas. His hobbies include reading, soccer, and road cycling. You can follow him at @dataphanatik, and for any queries, contact him at [email protected].

