
You're reading from  Mastering pandas. - Second Edition

Product type: Book
Published in: Oct 2019
Reading Level: Intermediate
ISBN-13: 9781789343236
Edition: 2nd Edition
Author (1)
Ashish Kumar

Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. Natural Language Processing, IoT analytics, R Shiny product development, and ensemble ML methods are his core areas of expertise. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

Data Case Studies Using pandas

So far, we have covered the extensive functionality of pandas. In this chapter, we apply that functionality in a set of case studies. These case studies give an overview of how each feature is used and help us identify the pivotal points in handling a DataFrame. Their step-by-step approach also deepens our understanding of the pandas functions. The chapter provides practical examples along with code snippets so that, by the end, you understand the pandas approach to solving DataFrame problems.

We will cover the following case studies:

  • End-to-end exploratory data analysis
  • Web scraping with Python
  • Data validation

End-to-end exploratory data analysis

Exploratory data analysis is the process of understanding the quirks of a dataset: finding the outliers, identifying the columns that contain the most relevant information, and determining the relationships between the variables using statistics and graphical representations.
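As a minimal sketch of these three ideas (summary statistics, outlier detection, and relationships between variables), consider the snippet below. The DataFrame df_demo, its column names, and its values are hypothetical and are not the case-study data. The 1.5 * IQR rule is only one common convention for flagging outliers, chosen here as an assumption.

```python
import pandas as pd

# Hypothetical data: the column names and values are illustrative only.
df_demo = pd.DataFrame(
    {
        "visits": [2, 3, 2, 4, 3, 2, 30],
        "spend": [200.0, 310.0, 190.0, 420.0, 300.0, 210.0, 3100.0],
        "region": ["N", "S", "N", "E", "S", "N", "E"],
    }
)

# Summary statistics for the numeric columns.
summary = df_demo.describe()

# Flag outliers in one column with the 1.5 * IQR rule.
q1 = df_demo["visits"].quantile(0.25)
q3 = df_demo["visits"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df_demo["visits"] < q1 - 1.5 * iqr) | (df_demo["visits"] > q3 + 1.5 * iqr)
outliers = df_demo[is_outlier]

# Pairwise Pearson correlation between the numeric columns.
correlations = df_demo[["visits", "spend"]].corr()

print(summary)
print(outliers)
print(correlations)
```

On this toy data, only the row with 30 visits falls outside the IQR fences, and the two numeric columns are strongly correlated. A real analysis would usually complement these numbers with plots.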

Let's consider the following DataFrame to perform exploratory data analysis:

import pandas as pd

df = pd.read_csv("data.csv")  # load the CSV file into a DataFrame
df

The following screenshot shows the DataFrame loaded in Jupyter Notebook:

[Screenshot: DataFrame loaded in Jupyter Notebook]

Data overview

The preceding DataFrame holds the customer data of an automobile servicing firm, which provides services to its clients on a periodic basis. Each row in the DataFrame corresponds...

Web scraping with Python

Web scraping deals with extracting large amounts of data from websites in either structured or unstructured form. For example, a website might already present some data in an HTML table element or as a CSV file. This is an example of structured data on a website. In most cases, however, the required information is scattered across the content of the web page. Web scraping helps collect this data and store it in a structured form. There are different ways to scrape websites, such as using online services, calling APIs, or writing your own code.
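To illustrate the structured case, here is a minimal sketch that turns an HTML table into a DataFrame. The HTML string, the TableParser class, and the column names are hypothetical, and a real scraper would first download the page. pandas also provides pd.read_html for this task, but it depends on an additional parser library, so this sketch uses the standard library's html.parser instead.

```python
from html.parser import HTMLParser

import pandas as pd

# Hypothetical page content; a real scraper would download this HTML first.
HTML = """
<table>
  <tr><th>model</th><th>price</th></tr>
  <tr><td>A1</td><td>100</td></tr>
  <tr><td>B2</td><td>250</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableParser()
parser.feed(HTML)

# Treat the first row as the header and the rest as data.
header, *body = parser.rows
scraped = pd.DataFrame(body, columns=header)
scraped["price"] = scraped["price"].astype(int)  # cell text arrives as strings
print(scraped)
```

This sketch assumes a single, well-formed table whose cells all sit inside rows. Nested tables or malformed markup would need a more robust parser.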

Here are some important notes about web scraping:

  • Read through the website's terms and conditions to understand how you can legally use the data. Many sites prohibit you from using the data for commercial purposes.
  • Make sure you are not downloading data at a rapid rate, because this may overload the website. You may...
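The point about download rate can be sketched as a loop that pauses between requests. The function names fetch_all and fake_fetch, the example URLs, and the delay value are assumptions for illustration. A real scraper would pass in a function that actually downloads the page and would choose a delay suited to the site.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        pages.append(fetch(url))
    return pages


# Stand-in for a real downloader so the sketch runs without network access.
def fake_fetch(url):
    return f"<html>{url}</html>"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

Passing the downloader in as an argument keeps the throttling logic separate from the networking code, which also makes it easy to test offline.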

Data validation

Data validation is the process of examining the quality of data to ensure that it is both correct and useful for analysis. It uses routines, often called validation rules, that check the correctness of the data that is fed to the models. In the age of big data, where computers and other forms of technology generate vast amounts of information, using such data without checking its quality would be unwise, which is why data validation matters.
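A validation rule can be expressed as a boolean check applied to every row. The following is a minimal sketch under assumed data: the records DataFrame, its columns, and the thresholds are hypothetical and do not come from the case study.

```python
import pandas as pd

# Hypothetical records; column names and thresholds are illustrative only.
records = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 103],
        "age": [34, -5, 51, 51],
        "email": ["a@x.com", None, "c@x.com", "c@x.com"],
    }
)

# Each rule maps a name to a boolean Series: True where the row passes.
rules = {
    "age_in_range": records["age"].between(0, 120),
    "email_present": records["email"].notna(),
    # Only the first occurrence of each customer_id passes this rule.
    "id_not_duplicated": ~records["customer_id"].duplicated(keep="first"),
}

report = pd.DataFrame(rules)
failures_per_rule = (~report).sum()  # number of failing rows per rule
valid_rows = records[report.all(axis=1)]  # rows that pass every rule

print(failures_per_rule)
print(valid_rows)
```

Keeping the rules in a dictionary makes it straightforward to add checks and to report which rule each row failed.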

In this case study, we are going to consider two DataFrames:

  • Test DataFrame (from a flat file)
  • Validation DataFrame (from MongoDB)

Validation routines are performed on the test DataFrame, keeping its counterpart as the reference.
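The snippet below sketches what such routines might look like, comparing row counts, column names, and data types against the reference. Both DataFrames are built in memory with hypothetical columns; in the case study they would come from a flat file and from MongoDB, and the actual checks used there may differ.

```python
import pandas as pd

# Hypothetical stand-ins: in the case study the test data comes from a flat
# file and the reference data from MongoDB; both are built in memory here.
test_df = pd.DataFrame({"id": [1, 2, 3], "amount": ["10", "20", "30"]})
validation_df = pd.DataFrame({"id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]})

# Check 1: row counts.
count_matches = len(test_df) == len(validation_df)

# Check 2: columns present in the reference but missing from the test data.
missing_columns = set(validation_df.columns) - set(test_df.columns)

# Check 3: data types of the shared columns.
shared = [c for c in validation_df.columns if c in test_df.columns]
dtype_mismatches = {
    col: (str(test_df[col].dtype), str(validation_df[col].dtype))
    for col in shared
    if test_df[col].dtype != validation_df[col].dtype
}

print(count_matches)
print(missing_columns)
print(dtype_mismatches)
```

Here the row counts differ and the amount column was read as text in the test data, so both problems are reported, while the column names agree.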

Data overview...

Summary

pandas is useful for many ancillary data activities, such as exploratory data analysis, validating the integrity of data (such as its data types or counts) between two data sources, and structuring and shaping data obtained from another source, such as a scraped website or a database. In this chapter, we dealt with some case studies on these topics. A data scientist performs these activities on a day-to-day basis, and this chapter should give a flavor of what it is like to perform them on a real dataset.

In the next chapter, we will discuss the architecture and code structure of the pandas library. This will help us develop a thorough understanding of the library's functionality and enable us to troubleshoot more effectively.
