Introduction

It is important to consider the steps that you, as an analyst, take when you first encounter a dataset after importing it into your workspace as a DataFrame. Is there a set of tasks that you usually undertake to examine the data? Are you aware of all the possible data types? This chapter begins by covering the tasks you might want to undertake when first encountering a new dataset. It then answers common questions about operations that are not straightforward to accomplish in pandas.

Developing a data analysis routine

Although there is no standard approach when beginning a data analysis, it is typically a good idea to develop a routine for yourself when first examining a dataset. Similar to everyday routines that we have for waking up, showering, going to work, eating, and so on, a data analysis routine helps you to quickly get acquainted with a new dataset. This routine can manifest itself as a dynamic checklist of tasks that evolves as your familiarity with pandas and data analysis expands.

Exploratory Data Analysis (EDA) is a term used to describe the process of analyzing datasets. Typically, it does not involve model creation; rather, it focuses on summarizing the characteristics of the data and visualizing them. The approach is not new: John Tukey promoted it in his 1977 book Exploratory Data Analysis.

Many of these same processes are still applicable and useful for understanding a dataset. Indeed, they can also help with creating machine learning models later on.
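
A first pass over a new dataset might look something like the following sketch. It loads the college dataset used later in this chapter; exactly which checks belong on your personal checklist is a matter of taste:

>>> import pandas as pd
>>> college = pd.read_csv("data/college.csv")
>>> college.shape         # how many rows and columns?
>>> college.dtypes        # which type did pandas infer for each column?
>>> college.isna().sum()  # how many values are missing, per column?
>>> college.describe()    # summary statistics for the numeric columns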

Data dictionaries

A crucial part of data analysis involves creating and maintaining a data dictionary. A data dictionary is a table of metadata and notes on each column of data. One of the primary purposes of a data dictionary is to explain the meaning of the column names. The college dataset uses a lot of abbreviations that are likely to be unfamiliar to an analyst who is inspecting it for the first time.

A data dictionary for the college dataset is provided in the following college_data_dictionary.csv file:

>>> pd.read_csv("data/college_data_dictionary.csv")
    column_name  description
0        INSTNM  Institut...
1          CITY  City Loc...
2        STABBR  State Ab...
3          HBCU  Historic...
4       MENONLY  0/1 Men ...
..          ...          ...
22      PCTPELL  Percent ...
23     PCTFLOAN  Percent ...
24      UG25ABV  Percent ...
25  MD_EARN_...  Median E...
26  GRAD_DEB...  Median d...

As you can see, it is immensely helpful in deciphering the abbreviated column names.

Reducing memory by changing data types

pandas has precise technical definitions for many data types. However, when you load data from type-less formats such as CSV, pandas has to infer the type.

This recipe changes the data type of one of the object columns from the college dataset to the special pandas categorical data type to drastically reduce its memory usage.

How to do it…

  1. After reading in our college dataset, we select a few columns of different data types that will clearly show how much memory may be saved:
    >>> college = pd.read_csv("data/college.csv")
    >>> different_cols = [
    ...     "RELAFFIL",
    ...     "SATMTMID",
    ...     "CURROPER",
    ...     "INSTNM",
    ...     "STABBR",
    ... ]
    >>> col2 = college.loc[:, different_cols]
    >>> col2.head()
       RELAFFIL  SATMTMID  ...       INSTNM STABBR
    0         0     420.0  ...  Alabama ...     AL
    1         0     565.0...
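
The rest of the recipe converts object columns to the categorical type. As a rough sketch of the payoff (exact byte counts depend on your pandas version and data), compare per-column memory before and after converting STABBR, a low-cardinality string column:

>>> col2.memory_usage(deep=True)  # bytes per column, counting string contents
>>> stabbr_cat = col2["STABBR"].astype("category")
>>> stabbr_cat.memory_usage(deep=True)  # integer codes plus one copy of each unique state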

Selecting the smallest of the largest

This recipe can be used to create catchy news headlines such as Out of the Top 100 Universities, These 5 Have the Lowest Tuition, or From the Top 50 Cities to Live, These 10 Are the Most Affordable.

During analysis, it is possible that you will first need to find a grouping of data that contains the top n values in a single column and, from this subset, find the bottom m values based on a different column.

In this recipe, we find the five lowest-budget movies from the top 100 scoring movies by taking advantage of the convenience methods .nlargest and .nsmallest (a sketch of the full chain follows the first step below).

How to do it…

  1. Read in the movie dataset, and select the columns: movie_title, imdb_score, and budget:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> movie2 = movie[["movie_title", "imdb_score", "budget"]]
    >>> movie2.head()
       movie_title  imdb_score       budget
    0       Avatar    ...
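
The remaining steps chain the two convenience methods together; the full chain also appears in the Replicating nlargest with sort_values recipe that follows:

>>> (
...     movie2
...     .nlargest(100, "imdb_score")  # top 100 movies by score
...     .nsmallest(5, "budget")       # of those, the five with the smallest budgets
... )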

Selecting the largest of each group by sorting

One of the most basic and common operations to perform during data analysis is to select rows containing the largest value of some column within a group. For instance, this would be like finding the highest-rated film of each year or the highest-grossing film by content rating. To accomplish this task, we need to sort the groups as well as the column used to rank each member of the group, and then extract the highest member of each group.

In this recipe, we will find the highest-rated film of each year.

How to do it…

  1. Read in the movie dataset and slim it down to just the three columns we care about: movie_title, title_year, and imdb_score:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> movie[["movie_title", "title_year", "imdb_score"]]
                                         movie_title  ...
    0                                         Avatar  ...
    1 ...
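
The approach outlined above, sorting and then keeping the first row per group, can be sketched with .sort_values and .drop_duplicates. This is one common way to do it, not necessarily the book's exact steps:

>>> (
...     movie[["movie_title", "title_year", "imdb_score"]]
...     .sort_values(["title_year", "imdb_score"], ascending=False)  # best-rated film first within each year
...     .drop_duplicates(subset="title_year")  # keep only that first (highest-rated) row per year
... )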

Replicating nlargest with sort_values

The previous two recipes achieve similar results by sorting values in slightly different ways. Finding the top n values of a column of data is equivalent to sorting the entire column in descending order and taking the first n values. pandas has many operations that are capable of doing this in a variety of ways.

In this recipe, we will replicate the Selecting the smallest of the largest recipe with the .sort_values method and explore the differences between the two.

How to do it…

  1. Let's recreate the result from the final step of the Selecting the smallest of the largest recipe:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> (
    ...     movie[["movie_title", "imdb_score", "budget"]]
    ...     .nlargest(100, "imdb_score")
    ...     .nsmallest(5, "budget")
    ... )
                   movie_title  imdb_score    budget
    4804        Butterfly Girl   ...
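
A sketch of the equivalent operation built from .sort_values and .head; note that ties in imdb_score may be broken differently here than by .nlargest, which is one of the differences this recipe goes on to explore:

>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .sort_values("imdb_score", ascending=False)
...     .head(100)              # top 100 by score
...     .sort_values("budget")
...     .head(5)                # cheapest five of those
... )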

Calculating a trailing stop order price

There are many strategies for trading stocks. One basic type of trade that many investors employ is the stop order. A stop order is an order placed by an investor to buy or sell a stock that executes whenever the market price reaches a certain point. Stop orders are useful both to prevent huge losses and to protect gains.

For this recipe, we will only be examining stop orders used to sell currently owned stocks. In a typical stop order, the price does not change throughout the lifetime of the order. For instance, if you purchased a stock for $100 per share, you might want to set a stop order at $90 per share to limit your downside to 10%.

A more advanced strategy is to continually modify the sale price of the stop order to track the value of the stock as it increases. This is called a trailing stop order. Concretely, if the same $100 stock increases to $120, then a trailing stop order 10% below the current market value would move the sale price up to $108.
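
A trailing stop price is just a running maximum of the price scaled by the trailing percentage, which pandas expresses neatly with .cummax. A minimal sketch on a made-up price series (the numbers are illustrative only; the recipe itself works with real stock quotes):

>>> import pandas as pd
>>> close = pd.Series([100, 105, 120, 115, 110])  # hypothetical closing prices
>>> stop = close.cummax() * 0.9  # trailing stop 10% below the running high
>>> stop
0     90.0
1     94.5
2    108.0
3    108.0
4    108.0
dtype: float64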
