Introduction
There are many ways to create a DataFrame. This chapter will cover some of the most common ones. It will also show how to persist them.
Usually, we create a DataFrame from an existing file or a database, but we can also create one from scratch. We can create a DataFrame from parallel lists of data.
>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]
>>> people = {"first": fname, "last": lname, "birth": birth}
>>> beatles = pd.DataFrame(people)
>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
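Parallel lists are not the only shape that the DataFrame constructor accepts. Another common pattern, not shown in the original, is a list of row dictionaries, where each dictionary is one row and the keys become the column names. A minimal sketch that builds the same DataFrame:

```python
import pandas as pd

# Each dict is one row; keys become column names.
records = [
    {"first": "Paul", "last": "McCartney", "birth": 1942},
    {"first": "John", "last": "Lennon", "birth": 1940},
    {"first": "Richard", "last": "Starkey", "birth": 1940},
    {"first": "George", "last": "Harrison", "birth": 1943},
]
beatles = pd.DataFrame(records)
```

The row-oriented shape is handy when the data arrives one record at a time, such as from an API.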
For better or worse, there are a lot of CSV files in the world. Like most technologies, there are good and bad parts to CSV files. On the plus side, they are human-readable, can be opened in any text editor, and most spreadsheet software can load them. On the downside, there is no standard for CSV files, so encoding may be weird, there is no way to enforce types, and they can be large because they are text-based (though they can be compressed).
In this recipe, we will show how to create a CSV file from a pandas DataFrame.
There are a few methods on the DataFrame that start with to_. These methods export DataFrames. We are going to use the .to_csv method. We will write out to a string buffer in the examples, but you will usually use a filename instead.
>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
The pandas library is an in-memory tool. You need to be able to fit your data in memory to use pandas with it. If you come across a large CSV file that you want to process, you have a few options. If you can process portions of it at a time, you can read it in chunks and process each chunk. Alternatively, if you know that you should have enough memory to load the file, there are a few hints to help pare down the file size.
Note that in general, you should have three to ten times the amount of memory as the size of the DataFrame that you want to manipulate. Extra memory should give you enough extra space to perform many of the common operations.
In this section, we will look at the diamonds dataset. This dataset easily fits into the memory of my 2015 MacBook, but let's pretend that the file is a lot bigger than it is, or that the memory of my machine is limited such that when pandas tries to load it with the read_csv function, it runs out of memory.
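Both hints from above, chunking and paring down types, map onto parameters of read_csv: chunksize returns an iterator of smaller DataFrames, and dtype overrides the 64-bit defaults that pandas would otherwise infer. A sketch using an in-memory stand-in for a large file (a real workload would pass a path, and the price column here is made up):

```python
import io

import pandas as pd

# Stand-in for a large file; a real workload would pass a path instead.
csv_data = "price\n" + "\n".join(str(i) for i in range(10_000))

total = 0
# chunksize yields an iterator of DataFrames instead of one big frame;
# dtype avoids pandas defaulting to 64-bit types during inference.
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=1_000,
                         dtype={"price": "int32"}):
    total += chunk["price"].sum()

print(total)
```

Each chunk is an ordinary DataFrame, so any per-chunk aggregation can be combined at the end, as the running total does here.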
While CSV files are common, it seems that the world is ruled by Excel. I've been surprised in my consulting work to see how many companies are using Excel as a critical if not the critical tool for making decisions.
In this recipe, we will show how to create and read Excel files. You may need to install xlwt or openpyxl to write XLS or XLSX files, respectively.
We will use the .to_excel method. You can write either xls files or xlsx files:
>>> beatles.to_excel("beat.xls")
>>> beatles.to_excel("beat.xlsx")
Excel file

We can read the file back in with the read_excel function:
>>> beat2 = pd.read_excel("/tmp/beat.xls")
>>> beat2
   Unnamed: 0    first       last  birth
0           0     Paul  McCartney   1942
1           1     John     Lennon   1940
2           2  Richard    Starkey   1940
3           3   George   Harrison   1943
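The Unnamed: 0 column above is the row index that to_excel wrote out. Passing index=False when writing suppresses it; the same parameter works for to_csv, which this sketch uses so the round trip needs no Excel dependency:

```python
import io

import pandas as pd

beatles = pd.DataFrame({
    "first": ["Paul", "John", "Richard", "George"],
    "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
    "birth": [1942, 1940, 1940, 1943],
})

# index=False stops pandas from writing the row index,
# so no "Unnamed: 0" column appears on the way back in.
buffer = io.StringIO()
beatles.to_csv(buffer, index=False)
buffer.seek(0)
beat2 = pd.read_csv(buffer)
print(list(beat2.columns))
```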
As was mentioned previously, CSV files are very common for sharing data. Because they are plain text files, they can get big. One solution for managing the size of CSV files is to compress them. In this recipe, we will look at loading files from ZIP files.
We will load a CSV file that is the only item in a ZIP archive. This is the behavior you get if you right-click a file in the Finder on a Mac and choose Compress "beatles.csv". We will also look at reading a CSV file from a ZIP file that contains multiple files.
The first file is from the fueleconomy.gov website. It is a list of all car makes that have been available in the US market from 1984-2018.
The second file is a survey of users of the Kaggle website. It was intended to get information about the users, their background, and the tools that they prefer.
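For a single-member archive, read_csv can usually open the ZIP directly, inferring the compression from the file extension. When the archive holds several members, the standard-library zipfile module can pull out the one we want. A sketch that builds a small archive in memory (the member names here are hypothetical stand-ins for the downloaded files):

```python
import io
import zipfile

import pandas as pd

# Build a small ZIP in memory as a stand-in for a downloaded archive.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("makes.csv", "make,year\nFord,1984\nTesla,2018\n")
    zf.writestr("notes.txt", "not a csv")

# With multiple members, open the one we want explicitly;
# a single-member archive could be passed straight to read_csv.
archive.seek(0)
with zipfile.ZipFile(archive) as zf:
    with zf.open("makes.csv") as f:
        makes = pd.read_csv(f)

print(makes)
```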
We will use the read_csv function to load these files.

We mentioned that pandas is useful for tabular or structured data. Many organizations use databases to store tabular data. In this recipe, we will work with databases to insert and read data.
Note that this example uses the SQLite database, which is included with Python. However, Python has the ability to connect with most SQL databases and pandas, in turn, can leverage that.
>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
...     cur = con.cursor()
...     cur.execute("""DROP TABLE IF EXISTS Band""")
...     cur.execute(
...         """CREATE TABLE Band(id INTEGER PRIMARY KEY,
...         fname TEXT, lname TEXT, birthyear INT)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         0, 'Paul', 'McCartney', 1942)"""
...     )
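pandas can work on top of the same connection: to_sql writes a DataFrame out as a table, and read_sql runs a query and returns the result as a DataFrame. A sketch using an in-memory SQLite database (the table name beatles_tbl is made up for this example):

```python
import sqlite3

import pandas as pd

beatles = pd.DataFrame({
    "first": ["Paul", "John", "Richard", "George"],
    "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
    "birth": [1942, 1940, 1940, 1943],
})

con = sqlite3.connect(":memory:")

# Write the DataFrame out as a table, then query it back.
beatles.to_sql("beatles_tbl", con, index=False)
oldest = pd.read_sql("SELECT * FROM beatles_tbl WHERE birth < 1942", con)
print(len(oldest))
```

This avoids writing the INSERT statements by hand when the data already lives in a DataFrame.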
JavaScript Object Notation (JSON) is a common format used for transferring data over the internet. Contrary to the name, it does not require JavaScript to read or create. The Python standard library ships with the json library, which can encode and decode JSON:
>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}
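The same string can also go straight into a DataFrame, since read_json understands this column-oriented layout. A minimal sketch (the literal is wrapped in StringIO, which newer pandas versions expect for string input):

```python
import io
import json

import pandas as pd

people = {
    "first": ["Paul", "John", "Richard", "George"],
    "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
    "birth": [1942, 1940, 1940, 1943],
}
encoded = json.dumps(people)

# read_json understands the column-oriented layout produced above.
beatles = pd.read_json(io.StringIO(encoded))
print(beatles.shape)
```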
pandas can load JSON directly with the read_json function.

You can use pandas to read HTML tables from websites. This makes it easy to ingest tables such as those found on Wikipedia or other websites.
In this recipe, we will scrape tables from the Wikipedia entry for The Beatles discography. In particular, we want to scrape the table shown in the following image, as it appeared on Wikipedia during 2019:
Wikipedia table for studio albums
read_html
function to load all of the tables from https://en.wikipedia.org/wiki/The_Beatles_discography:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(url)
>>> len(dfs)
51
>>> dfs[0]
The Beatles discography The Beatles discography.1
0 The Beat... The Beat...
1 Studio a... 23
2 Live albums 5
3 Compilat... 53
4 Video al... ...