You're reading from Pandas 1.x Cookbook - Second Edition

Published: Feb 2020 | Publisher: Packt | ISBN-13: 9781839213106 | Edition: 2nd | Reading level: Beginner

Authors (2):

Matt Harrison
Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and data science, and he also offers consulting services. He has worked on search engines, configuration management, storage, BI, and predictive modeling across a variety of domains.

Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to teaching the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.

Introduction

There are many ways to create a DataFrame. This chapter will cover some of the most common ones. It will also show how to persist them.

Creating DataFrames from scratch

Usually, we create a DataFrame from an existing file or a database, but we can also create one from scratch. We can create a DataFrame from parallel lists of data.

How to do it...

  1. Create parallel lists with your data in them. Each list will become a column in the DataFrame, so the lists must be the same length, and the values within each list should share a single type:
    >>> import pandas as pd
    >>> import numpy as np
    >>> fname = ["Paul", "John", "Richard", "George"]
    >>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
    >>> birth = [1942, 1940, 1940, 1943]
    
  2. Create a dictionary from the lists, mapping the column name to the list:
    >>> people = {"first": fname, "last": lname, "birth": birth}
    
  3. Create a DataFrame from the dictionary:
    >>> beatles = pd.DataFrame(people)
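Put together, the three steps above can be sketched end to end as a small script:

```python
import pandas as pd

# Parallel lists: each one becomes a column in the DataFrame
fname = ["Paul", "John", "Richard", "George"]
lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
birth = [1942, 1940, 1940, 1943]

# Map column names to the lists, then build the DataFrame
people = {"first": fname, "last": lname, "birth": birth}
beatles = pd.DataFrame(people)

# Column order follows the dictionary's insertion order
print(list(beatles.columns))  # ['first', 'last', 'birth']
print(beatles.shape)          # (4, 3)
```

Because a plain dict preserves insertion order, the columns come out in the order you wrote them.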

Writing CSV

For better or worse, there are a lot of CSV files in the world. Like most technologies, there are good and bad parts to CSV files. On the plus side, they are human-readable, can be opened in any text editor, and most spreadsheet software can load them. On the downside, there is no standard for CSV files, so encoding may be weird, there is no way to enforce types, and they can be large because they are text-based (though they can be compressed).

In this recipe, we will show how to create a CSV file from a pandas DataFrame.

There are a few methods on the DataFrame that start with to_. These are methods that export DataFrames. We are going to use the .to_csv method. We will write out to a string buffer in the examples, but you will usually use a filename instead.

How to do it...

  1. Write the DataFrame to a CSV file:
    >>> beatles
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
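As the recipe text mentions, we can write to an in-memory string buffer instead of a file on disk. A minimal sketch, rebuilding the DataFrame from the earlier steps:

```python
import io

import pandas as pd

beatles = pd.DataFrame(
    {
        "first": ["Paul", "John", "Richard", "George"],
        "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
        "birth": [1942, 1940, 1940, 1943],
    }
)

# Write to an in-memory string buffer; a filename would work the same way
buf = io.StringIO()
beatles.to_csv(buf)
print(buf.getvalue().splitlines()[0])  # ,first,last,birth

# By default the index is written as an unnamed first column;
# index=False leaves it out
buf2 = io.StringIO()
beatles.to_csv(buf2, index=False)
print(buf2.getvalue().splitlines()[0])  # first,last,birth
```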

Reading large CSV files

The pandas library is an in-memory tool. You need to be able to fit your data in memory to use pandas with it. If you come across a large CSV file that you want to process, you have a few options. If you can process portions of it at a time, you can read it into chunks and process each chunk. Alternatively, if you know that you should have enough memory to load the file, there are a few hints to help pare down the file size.

Note that in general, you should have three to ten times the amount of memory as the size of the DataFrame that you want to manipulate. Extra memory should give you enough extra space to perform many of the common operations.
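Chunked reading, the first option above, can be sketched like this; the tiny in-memory CSV stands in for a file too big to load at once:

```python
import io

import pandas as pd

# Stand-in for a large file on disk: a small CSV held in memory
csv_data = io.StringIO(
    "carat,price\n"
    "0.23,326\n"
    "0.21,326\n"
    "0.29,334\n"
    "0.31,335\n"
)

# Process a few rows at a time instead of loading the whole file;
# each chunk is itself a DataFrame
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["price"].sum()
print(total)  # 1321
```

This pattern works for any aggregation you can compute incrementally; operations that need the whole dataset at once (sorting, for example) require a different approach.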

How to do it...

In this section, we will look at the diamonds dataset. This dataset easily fits into the memory of my 2015 MacBook, but let's pretend that the file is much bigger than it is, or that my machine's memory is limited enough that pandas fails with an out-of-memory error when it tries to load the file with the read_csv function.
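The size-paring hints usually amount to telling read_csv to load fewer columns and smaller dtypes. A sketch, using a small inline stand-in for the diamonds file (the column names mirror that dataset):

```python
import io

import pandas as pd

csv_data = io.StringIO(
    "carat,cut,price\n"
    "0.23,Ideal,326\n"
    "0.21,Premium,326\n"
)

# Load only the columns you need, with compact dtypes:
# category for repetitive strings, int32 instead of the default int64
df = pd.read_csv(
    csv_data,
    usecols=["cut", "price"],
    dtype={"cut": "category", "price": "int32"},
)
print(df.dtypes)
```

On a file with many repeated string values, the category dtype alone can cut memory use dramatically.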

Using Excel files

While CSV files are common, it seems that the world is ruled by Excel. I've been surprised in my consulting work to see how many companies use Excel as a critical, if not the critical, tool for making decisions.

In this recipe, we will show how to create and read Excel files. You may need to install xlwt or openpyxl to write XLS or XLSX files, respectively.

How to do it...

  1. Create an Excel file using the .to_excel method. You can write either xls files or xlsx files:
    >>> beatles.to_excel("beat.xls")
    >>> beatles.to_excel("beat.xlsx")
    
    Excel file

  2. Read the Excel file with the read_excel function:
    >>> beat2 = pd.read_excel("beat.xls")
    >>> beat2
       Unnamed: 0    first       last  birth
    0           0     Paul  McCartney   1942
    1           1     John     Lennon   1940
    2           2  Richard    Starkey   1940
    3           3   George   Harrison   1943
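The Unnamed: 0 column above is the DataFrame's index, which .to_excel wrote out as a regular column; writing with index=False or reading with index_col=0 avoids it. The same round-trip artifact appears with CSV, which makes it easy to demonstrate without an Excel engine installed; the fix is the same for .to_excel/read_excel:

```python
import io

import pandas as pd

beatles = pd.DataFrame({"first": ["Paul", "John"], "birth": [1942, 1940]})

# Writing with the default index=True produces an unnamed first column...
buf = io.StringIO()
beatles.to_csv(buf)
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.columns[0])  # Unnamed: 0

# ...which reading with index_col=0 (or writing with index=False) avoids
buf.seek(0)
clean = pd.read_csv(buf, index_col=0)
print(list(clean.columns))  # ['first', 'birth']
```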

Working with ZIP files

As was mentioned previously, CSV files are very common for sharing data. Because they are plain text files, they can get big. One solution for managing the size of CSV files is to compress them. In this recipe, we will look at loading files from ZIP files.

We will load a CSV file that is compressed as the only item in the ZIP file. This is the behavior you get if you right-click a file in the Finder on a Mac and choose Compress beatles.csv. We will also look at reading a CSV file from a ZIP file that contains multiple files.

The first file is from the fueleconomy.gov website. It is a list of all car makes that have been available in the US market from 1984 to 2018.

The second file is a survey of users of the Kaggle website. It was intended to get information about the users, their background, and the tools that they prefer.

How to do it...

  1. If the CSV file is the only file in the ZIP file, you can just call the read_csv function with the path to the ZIP file; pandas infers the compression from the .zip extension and decompresses the file for you.
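The multi-file case needs the standard library's zipfile module to pick out the member you want. A sketch that builds a two-file ZIP in memory so it runs standalone (the file names are illustrative stand-ins for the two datasets described above):

```python
import io
import zipfile

import pandas as pd

# Build a ZIP in memory containing two CSV files
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("vehicles.csv", "make,year\nFord,1984\nHonda,2018\n")
    zf.writestr("kaggle.csv", "user,tool\n1,pandas\n")

# For a multi-file ZIP, open the member you want explicitly and
# hand the file object to read_csv
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    with zf.open("vehicles.csv") as fp:
        vehicles = pd.read_csv(fp)
print(len(vehicles))  # 2
```

For the single-file case, `pd.read_csv("beatles.csv.zip")` alone would suffice.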

Working with databases

We mentioned that pandas is useful for tabular or structured data. Many organizations use databases to store tabular data. In this recipe, we will work with databases to insert and read data.

Note that this example uses the SQLite database, which is included with Python. However, Python has the ability to connect with most SQL databases and pandas, in turn, can leverage that.

How to do it...

  1. Create a SQLite database to store the Beatles information:
    >>> import sqlite3
    >>> con = sqlite3.connect("data/beat.db")
    >>> with con:
    ...     cur = con.cursor()
    ...     cur.execute("""DROP TABLE IF EXISTS Band""")
    ...     cur.execute(
    ...         """CREATE TABLE Band(id INTEGER PRIMARY KEY,
    ...         fname TEXT, lname TEXT, birthyear INT)"""
    ...     )
    ...     cur.execute(
    ...         """INSERT INTO Band VALUES(
    ...         0, 'Paul', 'McCartney', 1942)"""
    ...     )
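Once the table is populated, pandas can pull it straight into a DataFrame with read_sql. A sketch using an in-memory SQLite database so it runs standalone; the table and column names match the recipe:

```python
import sqlite3

import pandas as pd

# An in-memory database stands in for data/beat.db
con = sqlite3.connect(":memory:")
with con:
    cur = con.cursor()
    cur.execute(
        """CREATE TABLE Band(id INTEGER PRIMARY KEY,
        fname TEXT, lname TEXT, birthyear INT)"""
    )
    cur.execute("INSERT INTO Band VALUES(0, 'Paul', 'McCartney', 1942)")
    cur.execute("INSERT INTO Band VALUES(1, 'John', 'Lennon', 1940)")

# read_sql runs the query and returns the result as a DataFrame
band = pd.read_sql("SELECT * FROM Band", con)
print(list(band.columns))  # ['id', 'fname', 'lname', 'birthyear']
```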

Reading JSON

JavaScript Object Notation (JSON) is a common format for transferring data over the internet. Contrary to the name, reading or creating it does not require JavaScript. The Python standard library ships with the json module, which encodes and decodes JSON:

>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}

How to do it...

  1. Read the data using the read_json function.
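Given the encoded string from the introduction, read_json can load it directly. A sketch (wrapping the string in a file-like object keeps newer pandas versions from warning about literal JSON strings):

```python
import io
import json

import pandas as pd

people = {
    "first": ["Paul", "John", "Richard", "George"],
    "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
    "birth": [1942, 1940, 1940, 1943],
}
encoded = json.dumps(people)

# A JSON object of equal-length arrays maps naturally onto columns
beatles = pd.read_json(io.StringIO(encoded))
print(beatles.shape)  # (4, 3)
```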

Reading HTML tables

You can use pandas to read HTML tables from websites. This makes it easy to ingest tables such as those found on Wikipedia or other websites.

In this recipe, we will scrape tables from the Wikipedia entry for The Beatles discography. In particular, we want the studio albums table shown in the following image, as it appeared on Wikipedia during 2019:

Wikipedia table for studio albums

How to do it...

  1. Use the read_html function to load all of the tables from https://en.wikipedia.org/wiki/The_Beatles_discography:
    >>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
    >>> dfs = pd.read_html(url)
    >>> len(dfs)
    51
    
  2. Inspect the first DataFrame:
    >>> dfs[0]
      The Beatles discography The Beatles discography.1
    0  The Beat...             The Beat...
    1  Studio a...                      23
    2  Live albums                       5
    3  Compilat...                      53
    4  Video al...                 ...
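Rather than scanning all 51 tables by position, read_html can filter them by text with the match parameter. A sketch using inline HTML so it runs without network access (it assumes an HTML parser such as lxml is installed, as read_html requires one):

```python
import io

import pandas as pd

html = io.StringIO(
    """
    <table>
      <tr><th>Album</th><th>Year</th></tr>
      <tr><td>Abbey Road</td><td>1969</td></tr>
    </table>
    <table>
      <tr><th>Single</th></tr>
      <tr><td>Hey Jude</td></tr>
    </table>
    """
)

# match keeps only the tables whose text matches the pattern
dfs = pd.read_html(html, match="Album")
print(len(dfs))  # 1
print(dfs[0].loc[0, "Album"])  # Abbey Road
```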
