You're reading from Pandas 1.x Cookbook - Second Edition

Published: Feb 2020 | Publisher: Packt | ISBN-13: 9781839213106 | Edition: 2nd | Reading level: Beginner

Authors (2):

Matt Harrison
Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and data science, and he also offers consulting services. He has worked on search engines, configuration management, storage, BI, and predictive modeling across a variety of domains.

Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to teaching the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.

Introduction

There are many ways to create a DataFrame. This chapter will cover some of the most common ones. It will also show how to persist them.

Creating DataFrames from scratch

Usually, we create a DataFrame from an existing file or a database, but we can also create one from scratch. We can create a DataFrame from parallel lists of data.

How to do it...

  1. Create parallel lists with your data in them. Each list will become a column in the DataFrame, so the lists must be the same length, and the values within each list should share a single type:
    >>> import pandas as pd
    >>> import numpy as np
    >>> fname = ["Paul", "John", "Richard", "George"]
    >>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
    >>> birth = [1942, 1940, 1940, 1943]
    
  2. Create a dictionary from the lists, mapping the column name to the list:
    >>> people = {"first": fname, "last": lname, "birth": birth}
    
  3. Create a DataFrame from the dictionary:
    >>> beatles = pd.DataFrame(people)
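Put together, the three steps above can be sketched end to end as a small script:

```python
import pandas as pd

# Parallel lists: each one becomes a column in the DataFrame
fname = ["Paul", "John", "Richard", "George"]
lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
birth = [1942, 1940, 1940, 1943]

# Map column names to the lists, then build the DataFrame
people = {"first": fname, "last": lname, "birth": birth}
beatles = pd.DataFrame(people)

# Column order follows the dictionary's insertion order
print(list(beatles.columns))  # ['first', 'last', 'birth']
print(beatles.shape)          # (4, 3)
```

Because a plain dict preserves insertion order, the columns come out in the order you wrote them.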

Writing CSV

For better or worse, there are a lot of CSV files in the world. Like most technologies, there are good and bad parts to CSV files. On the plus side, they are human-readable, can be opened in any text editor, and most spreadsheet software can load them. On the downside, there is no standard for CSV files, so encoding may be weird, there is no way to enforce types, and they can be large because they are text-based (though they can be compressed).

In this recipe, we will show how to create a CSV file from a pandas DataFrame.

There are a few methods on the DataFrame that start with to_. These are methods that export DataFrames. We are going to use the .to_csv method. We will write out to a string buffer in the examples, but you will usually use a filename instead.

How to do it...

  1. Write the DataFrame to a CSV file:
    >>> beatles
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
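As the recipe text mentions, we can write to an in-memory string buffer instead of a file on disk. A minimal sketch, rebuilding the DataFrame from the earlier steps:

```python
import io

import pandas as pd

beatles = pd.DataFrame(
    {
        "first": ["Paul", "John", "Richard", "George"],
        "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
        "birth": [1942, 1940, 1940, 1943],
    }
)

# Write to an in-memory string buffer; a filename would work the same way
buf = io.StringIO()
beatles.to_csv(buf)
print(buf.getvalue().splitlines()[0])  # ,first,last,birth

# By default the index is written as an unnamed first column;
# index=False leaves it out
buf2 = io.StringIO()
beatles.to_csv(buf2, index=False)
print(buf2.getvalue().splitlines()[0])  # first,last,birth
```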

Reading large CSV files

The pandas library is an in-memory tool. You need to be able to fit your data in memory to use pandas with it. If you come across a large CSV file that you want to process, you have a few options. If you can process portions of it at a time, you can read it into chunks and process each chunk. Alternatively, if you know that you should have enough memory to load the file, there are a few hints to help pare down the file size.

Note that in general, you should have three to ten times the amount of memory as the size of the DataFrame that you want to manipulate. Extra memory should give you enough extra space to perform many of the common operations.
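Chunked reading, the first option above, can be sketched like this; the tiny in-memory CSV stands in for a file too big to load at once:

```python
import io

import pandas as pd

# Stand-in for a large file on disk: a small CSV held in memory
csv_data = io.StringIO(
    "carat,price\n"
    "0.23,326\n"
    "0.21,326\n"
    "0.29,334\n"
    "0.31,335\n"
)

# Process a few rows at a time instead of loading the whole file;
# each chunk is itself a DataFrame
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["price"].sum()
print(total)  # 1321
```

This pattern works for any aggregation you can compute incrementally; operations that need the whole dataset at once (sorting, for example) require a different approach.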

How to do it...

In this section, we will look at the diamonds dataset. This dataset easily fits into the memory of my 2015 MacBook, but let's pretend that the file is much bigger than it is, or that my machine's memory is limited enough that pandas fails with an out-of-memory error when it tries to load the file with the read_csv function.
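The size-paring hints usually amount to telling read_csv to load fewer columns and smaller dtypes. A sketch, using a small inline stand-in for the diamonds file (the column names mirror that dataset):

```python
import io

import pandas as pd

csv_data = io.StringIO(
    "carat,cut,price\n"
    "0.23,Ideal,326\n"
    "0.21,Premium,326\n"
)

# Load only the columns you need, with compact dtypes:
# category for repetitive strings, int32 instead of the default int64
df = pd.read_csv(
    csv_data,
    usecols=["cut", "price"],
    dtype={"cut": "category", "price": "int32"},
)
print(df.dtypes)
```

On a file with many repeated string values, the category dtype alone can cut memory use dramatically.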

Using Excel files

While CSV files are common, it seems that the world is ruled by Excel. I've been surprised in my consulting work to see how many companies use Excel as a critical, if not the critical, tool for making decisions.

In this recipe, we will show how to create and read Excel files. You may need to install xlwt or openpyxl to write XLS or XLSX files, respectively.

How to do it...

  1. Create an Excel file using the .to_excel method. You can write either xls files or xlsx files:
    >>> beatles.to_excel("beat.xls")
    >>> beatles.to_excel("beat.xlsx")
    
    Excel file

  2. Read the Excel file with the read_excel function:
    >>> beat2 = pd.read_excel("beat.xls")
    >>> beat2
       Unnamed: 0    first       last  birth
    0           0     Paul  McCartney   1942
    1           1     John     Lennon   1940
    2           2  Richard    Starkey   1940
    3           3   George   Harrison   1943
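The Unnamed: 0 column above is the DataFrame's index, which .to_excel wrote out as a regular column; writing with index=False or reading with index_col=0 avoids it. The same round-trip artifact appears with CSV, which makes it easy to demonstrate without an Excel engine installed; the fix is the same for .to_excel/read_excel:

```python
import io

import pandas as pd

beatles = pd.DataFrame({"first": ["Paul", "John"], "birth": [1942, 1940]})

# Writing with the default index=True produces an unnamed first column...
buf = io.StringIO()
beatles.to_csv(buf)
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.columns[0])  # Unnamed: 0

# ...which reading with index_col=0 (or writing with index=False) avoids
buf.seek(0)
clean = pd.read_csv(buf, index_col=0)
print(list(clean.columns))  # ['first', 'birth']
```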

Working with ZIP files

As was mentioned previously, CSV files are very common for sharing data. Because they are plain text files, they can get big. One solution for managing the size of CSV files is to compress them. In this recipe, we will look at loading files from ZIP files.

We will load a CSV file that is compressed as the only item in the ZIP file. This is the behavior you get if you right-click a file in the Finder on a Mac and choose Compress beatles.csv. We will also look at reading a CSV file from a ZIP file that contains multiple files.

The first file is from the fueleconomy.gov website. It is a list of all car makes that have been available in the US market from 1984 to 2018.

The second file is a survey of users of the Kaggle website. It was intended to get information about the users, their background, and the tools that they prefer.

How to do it...

  1. If the CSV file is the only file in the ZIP file, you can just call the read_csv function with the path to the ZIP file; pandas infers the compression from the .zip extension and decompresses the file for you.
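The multi-file case needs the standard library's zipfile module to pick out the member you want. A sketch that builds a two-file ZIP in memory so it runs standalone (the file names are illustrative stand-ins for the two datasets described above):

```python
import io
import zipfile

import pandas as pd

# Build a ZIP in memory containing two CSV files
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("vehicles.csv", "make,year\nFord,1984\nHonda,2018\n")
    zf.writestr("kaggle.csv", "user,tool\n1,pandas\n")

# For a multi-file ZIP, open the member you want explicitly and
# hand the file object to read_csv
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    with zf.open("vehicles.csv") as fp:
        vehicles = pd.read_csv(fp)
print(len(vehicles))  # 2
```

For the single-file case, `pd.read_csv("beatles.csv.zip")` alone would suffice.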

Working with databases

We mentioned that pandas is useful for tabular or structured data. Many organizations use databases to store tabular data. In this recipe, we will work with databases to insert and read data.

Note that this example uses the SQLite database, which is included with Python. However, Python has the ability to connect with most SQL databases and pandas, in turn, can leverage that.

How to do it...

  1. Create a SQLite database to store the Beatles information:
    >>> import sqlite3
    >>> con = sqlite3.connect("data/beat.db")
    >>> with con:
    ...     cur = con.cursor()
    ...     cur.execute("""DROP TABLE IF EXISTS Band""")
    ...     cur.execute(
    ...         """CREATE TABLE Band(id INTEGER PRIMARY KEY,
    ...         fname TEXT, lname TEXT, birthyear INT)"""
    ...     )
    ...     cur.execute(
    ...         """INSERT INTO Band VALUES(
    ...         0, 'Paul', 'McCartney', 1942)"""
    ...     )
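Once the table is populated, pandas can pull it straight into a DataFrame with read_sql. A sketch using an in-memory SQLite database so it runs standalone; the table and column names match the recipe:

```python
import sqlite3

import pandas as pd

# An in-memory database stands in for data/beat.db
con = sqlite3.connect(":memory:")
with con:
    cur = con.cursor()
    cur.execute(
        """CREATE TABLE Band(id INTEGER PRIMARY KEY,
        fname TEXT, lname TEXT, birthyear INT)"""
    )
    cur.execute("INSERT INTO Band VALUES(0, 'Paul', 'McCartney', 1942)")
    cur.execute("INSERT INTO Band VALUES(1, 'John', 'Lennon', 1940)")

# read_sql runs the query and returns the result as a DataFrame
band = pd.read_sql("SELECT * FROM Band", con)
print(list(band.columns))  # ['id', 'fname', 'lname', 'birthyear']
```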

Reading JSON

JavaScript Object Notation (JSON) is a common format for transferring data over the internet. Contrary to the name, reading or creating it does not require JavaScript. The Python standard library ships with the json module, which encodes and decodes JSON:

>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}

How to do it...

  1. Read the data using the read_json function.
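Given the encoded string from the introduction, read_json can load it directly. A sketch (wrapping the string in a file-like object keeps newer pandas versions from warning about literal JSON strings):

```python
import io
import json

import pandas as pd

people = {
    "first": ["Paul", "John", "Richard", "George"],
    "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
    "birth": [1942, 1940, 1940, 1943],
}
encoded = json.dumps(people)

# A JSON object of equal-length arrays maps naturally onto columns
beatles = pd.read_json(io.StringIO(encoded))
print(beatles.shape)  # (4, 3)
```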

Reading HTML tables

You can use pandas to read HTML tables from websites. This makes it easy to ingest tables such as those found on Wikipedia or other websites.

In this recipe, we will scrape tables from the Wikipedia entry for The Beatles discography. In particular, we want the studio albums table shown in the following image, as it appeared on Wikipedia during 2019:

Wikipedia table for studio albums

How to do it...

  1. Use the read_html function to load all of the tables from https://en.wikipedia.org/wiki/The_Beatles_discography:
    >>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
    >>> dfs = pd.read_html(url)
    >>> len(dfs)
    51
    
  2. Inspect the first DataFrame:
    >>> dfs[0]
      The Beatles discography The Beatles discography.1
    0  The Beat...             The Beat...
    1  Studio a...                      23
    2  Live albums                       5
    3  Compilat...                      53
    4  Video al...                 ...
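Rather than scanning all 51 tables by position, read_html can filter them by text with the match parameter. A sketch using inline HTML so it runs without network access (it assumes an HTML parser such as lxml is installed, as read_html requires one):

```python
import io

import pandas as pd

html = io.StringIO(
    """
    <table>
      <tr><th>Album</th><th>Year</th></tr>
      <tr><td>Abbey Road</td><td>1969</td></tr>
    </table>
    <table>
      <tr><th>Single</th></tr>
      <tr><td>Hey Jude</td></tr>
    </table>
    """
)

# match keeps only the tables whose text matches the pattern
dfs = pd.read_html(html, match="Album")
print(len(dfs))  # 1
print(dfs[0].loc[0, "Album"])  # Abbey Road
```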
