Python Data Cleaning Cookbook

5 (3 reviews total)
By Michael Walker
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas

About this book

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data.

By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.

Publication date:
December 2020
Publisher
Packt
Pages
436
ISBN
9781800565661

 

Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas

This chapter continues our work on importing data from a variety of sources, and the initial checks we should do on the data after importing it. Gradually, over the last 25 years, data analysts have found that they increasingly need to work with data in non-tabular, semi-structured forms. Sometimes they even create and persist data in those forms themselves. We work with a common alternative to traditional tabular datasets in this chapter, JSON, but the general concepts can be extended to XML and NoSQL data stores such as MongoDB. We also go over common issues that occur when scraping data from websites.

In this chapter, we will work through the following recipes:

  • Importing simple JSON data
  • Importing more complicated JSON data from an API
  • Importing data from web pages
  • Persisting JSON data
 

Technical requirements

The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook

 

Importing simple JSON data

JavaScript Object Notation (JSON) has turned out to be an incredibly useful standard for transferring data from one machine, process, or node to another. Often a client sends a data request to a server, upon which that server queries the data in the local storage and then converts it from something like a SQL Server table or tables into JSON, which the client can consume. This is sometimes complicated further by the first server (say, a web server) forwarding the request to a database server. JSON facilitates this, as does XML, by doing the following:

  • Being readable by humans
  • Being consumable by most client devices
  • Not being limited in structure

JSON is quite flexible, which means that it can accommodate just about anything. The structure can even change within a JSON file, so different keys might be present at different points. For example, the file might begin with some explanatory keys that have a very different structure than the remaining data keys. Or some keys might be present in some cases, but not others. We go over some approaches for dealing with that messiness (uh, I mean flexibility).

Getting ready

We are going to work with data on news stories about political candidates in this recipe. This data is made available for public use at dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0ZLHOK. I have combined the JSON files there into one file and randomly selected 60,000 news stories from the combined data. This sample (allcandidatenewssample.json) is available in the GitHub repository of this book.

We will do a little work with list and dictionary comprehensions in this recipe. DataCamp has good guides to list comprehensions (https://www.datacamp.com/community/tutorials/python-list-comprehension) and dictionary comprehensions (https://www.datacamp.com/community/tutorials/python-dictionary-comprehension) if you are feeling a little rusty.

How to do it…

We will import a JSON file into pandas after doing some data checking and cleaning:

  1. Import the json and pprint libraries.

    pprint improves the display of the lists and dictionaries that are returned when we load JSON data:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> from collections import Counter
  2. Load the JSON data and look for potential issues.

    Use the json load method to return data on news stories about political candidates. load returns a list of dictionaries. Use len to get the size of the list, which is the total number of news stories in this case. (Each list item is a dictionary with keys for the title, source, and so on, and their respective values.) Use pprint to display the first two dictionaries. Get the value from the source key for the first list item:

    >>> with open('data/allcandidatenewssample.json') as f:
    ...   candidatenews = json.load(f)
    ... 
    >>> len(candidatenews)
    60000
    >>> pprint.pprint(candidatenews[0:2])
    [{'date': '2019-12-25 10:00:00',
      'domain': 'www.nbcnews.com',
      'panel_position': 1,
      'query': 'Michael Bloomberg',
      'source': 'NBC News',
      'story_position': 6,
      'time': '18 hours ago',
      'title': 'Bloomberg cuts ties with company using prison inmates to make '
               'campaign calls',
      'url': 'https://www.nbcnews.com/politics/2020-election/bloomberg-cuts-ties-company-using-prison-inmates-make-campaign-calls-n1106971'},
     {'date': '2019-11-09 08:00:00',
      'domain': 'www.townandcountrymag.com',
      'panel_position': 1,
      'query': 'Amy Klobuchar',
      'source': 'Town & Country Magazine',
      'story_position': 3,
      'time': '18 hours ago',
      'title': "Democratic Candidates React to Michael Bloomberg's Potential Run",
      'url': 'https://www.townandcountrymag.com/society/politics/a29739854/michael-bloomberg-democratic-candidates-campaign-reactions/'}]
    >>> pprint.pprint(candidatenews[0]['source'])
    'NBC News'
  3. Check for differences in the structure of the dictionaries.

    Use Counter to check for any dictionaries in the list with fewer than, or more than, the nine keys that is normal. Look at a few of the dictionaries with almost no data (those with just two keys) before removing them. Confirm that the remaining list of dictionaries has the expected length – 60000-2382=57618:

    >>> Counter([len(item) for item in candidatenews])
    Counter({9: 57202, 2: 2382, 10: 416})
    >>> pprint.pprint(next(item for item in candidatenews if len(item)<9))
    {'date': '2019-09-11 18:00:00', 'reason': 'Not collected'}
    >>> pprint.pprint(next(item for item in candidatenews if len(item)>9))
    {'category': 'Satire',
     'date': '2019-08-21 04:00:00',
     'domain': 'politics.theonion.com',
     'panel_position': 1,
     'query': 'John Hickenlooper',
     'source': 'Politics | The Onion',
     'story_position': 8,
     'time': '4 days ago',
     'title': ''And Then There Were 23,' Says Wayne Messam Crossing Out '
              'Hickenlooper Photo \n'
              'In Elaborate Grid Of Rivals',
     'url': 'https://politics.theonion.com/and-then-there-were-23-says-wayne-messam-crossing-ou-1837311060'}
    >>> pprint.pprint([item for item in candidatenews if len(item)==2][0:10])
    [{'date': '2019-09-11 18:00:00', 'reason': 'Not collected'},
     {'date': '2019-07-24 00:00:00', 'reason': 'No Top stories'},
    ... 
     {'date': '2019-01-03 00:00:00', 'reason': 'No Top stories'}]
    >>> candidatenews = [item for item in candidatenews if len(item)>2]
    >>> len(candidatenews)
    57618
  4. Generate counts from the JSON data.

    Get the dictionaries just for Politico (a website that covers political news) and display a couple of dictionaries:

    >>> politico = [item for item in candidatenews if item["source"] == "Politico"]
    >>> len(politico)
    2732
    >>> pprint.pprint(politico[0:2])
    [{'date': '2019-05-18 18:00:00',
      'domain': 'www.politico.com',
      'panel_position': 1,
      'query': 'Marianne Williamson',
      'source': 'Politico',
      'story_position': 7,
      'time': '1 week ago',
      'title': 'Marianne Williamson reaches donor threshold for Dem debates',
      'url': 'https://www.politico.com/story/2019/05/09/marianne-williamson-2020-election-1315133'},
     {'date': '2018-12-27 06:00:00',
      'domain': 'www.politico.com',
      'panel_position': 1,
      'query': 'Julian Castro',
      'source': 'Politico',
      'story_position': 1,
      'time': '1 hour ago',
      'title': "O'Rourke and Castro on collision course in Texas",
      'url': 'https://www.politico.com/story/2018/12/27/orourke-julian-castro-collision-texas-election-1073720'}]
  5. Get the source data and confirm that it has the anticipated length.

    Show the first few items in the new sources list. Generate a count of news stories by source and display the 10 most popular sources. Notice that stories from The Hill can have TheHill (without a space) or The Hill as the value for source:

    >>> sources = [item.get('source') for item in candidatenews]
    >>> type(sources)
    <class 'list'>
    >>> len(sources)
    57618
    >>> sources[0:5]
    ['NBC News', 'Town & Country Magazine', 'TheHill', 'CNBC.com', 'Fox News']
    >>> pprint.pprint(Counter(sources).most_common(10))
    [('Fox News', 3530),
     ('CNN.com', 2750),
     ('Politico', 2732),
     ('TheHill', 2383),
     ('The New York Times', 1804),
     ('Washington Post', 1770),
     ('Washington Examiner', 1655),
     ('The Hill', 1342),
     ('New York Post', 1275),
     ('Vox', 941)]
  6. Fix any errors in the values in the dictionary.

    Fix the source values for The Hill. Notice that The Hill is now the most frequent source for news stories:

    >>> for newsdict in candidatenews:
    ...     newsdict.update((k, "The Hill") for k, v in newsdict.items()
    ...      if k == "source" and v == "TheHill")
    ... 
    >>> sources = [item.get('source') for item in candidatenews]
    >>> pprint.pprint(Counter(sources).most_common(10))
    [('The Hill', 3725),
     ('Fox News', 3530),
     ('CNN.com', 2750),
     ('Politico', 2732),
     ('The New York Times', 1804),
     ('Washington Post', 1770),
     ('Washington Examiner', 1655),
     ('New York Post', 1275),
     ('Vox', 941),
     ('Breitbart', 799)]
  7. Create a pandas DataFrame.

    Pass the JSON data to the pandas DataFrame method. Convert the date column to a datetime data type:

    >>> candidatenewsdf = pd.DataFrame(candidatenews)
    >>> candidatenewsdf.dtypes
    title             object
    url               object
    source            object
    time              object
    date              object
    query             object
    story_position     int64
    panel_position    object
    domain            object
    category          object
    dtype: object
  8. Confirm that we are getting the expected values for source.

    Also, rename the date column:

    >>> candidatenewsdf.rename(columns={'date':'storydate'}, inplace=True)
    >>> candidatenewsdf.storydate = candidatenewsdf.storydate.astype('datetime64[ns]')
    >>> candidatenewsdf.shape
    (57618, 10)
    >>> candidatenewsdf.source.value_counts(sort=True).head(10)
    The Hill               3725
    Fox News               3530
    CNN.com                2750
    Politico               2732
    The New York Times     1804
    Washington Post        1770
    Washington Examiner    1655
    New York Post          1275
    Vox                     941
    Breitbart               799
    Name: source, dtype: int64

We now have a pandas DataFrame with only the news stories where there is meaningful data, and with the values for source fixed.

How it works…

The json.load method returns a list of dictionaries. This makes it possible to use a number of familiar tools when working with this data: list methods, slicing, list comprehensions, dictionary updates, and so on. There are times, maybe when you just have to populate a list or count the number of individuals in a given category, when there is no need to use pandas.

In steps 2 to 6, we use list methods to do many of the same checks we have done with pandas in previous recipes. In step 3 we use Counter with a list comprehension (Counter([len(item) for item in candidatenews])) to get the number of keys in each dictionary. This tells us that there are 2,382 dictionaries with just 2 keys and 416 with 10. We use next to look for an example of dictionaries with fewer than 9 keys or more than 9 keys to get a sense of the structure of those items. We use slicing to show 10 dictionaries with 2 keys to see if there is any data in those dictionaries. We then select only those dictionaries with more than 2 keys.

In step 4 we create a subset of the list of dictionaries, one that just has source equal to Politico, and take a look at a couple of items. We then create a list with just the source data and use Counter to list the 10 most common sources in step 5.

Step 6 demonstrates how to replace key values conditionally in a list of dictionaries. In this case, we update the key value to The Hill whenever key (k) is source and value (v) is TheHill. The for k, v in newsdict.items() section is the unsung hero of this line. It loops through all key/value pairs for all dictionaries in candidatenews.

It is easy to create a pandas DataFrame by passing the list of dictionaries to the pandas DataFrame method. We do this in step 7. The main complication is that we need to convert the date column from a string to a date, since dates are just strings in JSON.

There's more…

In steps 5 and 6 we use item.get('source') instead of item['source']. This is handy when there might be missing keys in a dictionary. get returns None when the key is missing, but we can use an optional second argument to specify a value to return.

I renamed the date column to storydate in step 8. This is not necessary, but is a good idea. Not only does date not tell you anything about what the dates actually represent, it is also so generic a column name that it is bound to cause problems at some point.

The news stories data fits nicely into a tabular structure. It makes sense to represent each list item as one row, and the key/value pairs as columns and column values for that row. There are no significant complications, such as key values that are themselves lists of dictionaries. Imagine an authors key for each story with a list item for each author as the key value, and that list item is a dictionary of information about the author. This is not at all unusual when working with JSON data in Python. The next recipe shows how to work with data structured in this way.

 

Importing more complicated JSON data from an API

In the previous recipe, we discussed one significant advantage (and challenge) of working with JSON data – its flexibility. A JSON file can have just about any structure its authors can imagine. This often means that this data does not have the tabular structure of the data sources we have discussed so far, and that pandas DataFrames have. Often, analysts and application developers use JSON precisely because it does not insist on a tabular structure. I know I do!

Retrieving data from multiple tables often requires us to do a one-to-many merge. Saving that data to one table or file means duplicating data on the "one" side of the one-to-many relationship. For example, student demographic data is merged with data on the courses studied, and the demographic data is repeated for each course. With JSON, duplication is not required to capture these items of data in one file. We can have data on the courses studied nested within the data for each student.

But doing analysis with JSON structured in this way will eventually require us to either: 1) manipulate the data in a very different way than we are used to doing; or 2) convert the JSON to a tabular form. We examine the first approach in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning. This recipe takes the second approach. It uses a very handy tool for converting selected nodes of JSON to a tabular structure – json_normalize.

We first use an API to get JSON data because that is how JSON is frequently consumed. One advantage of retrieving the data with an API, rather than working from a file we have saved locally, is that it is easier to rerun our code when the source data is refreshed.

Getting ready

This recipe assumes you have the requests and pprint libraries already installed. If they are not installed, you can install them with pip. From the terminal (or PowerShell in Windows), enter pip install requests and pip install pprint.

The following is the structure of the JSON file that is created when using the collections API of the Cleveland Museum of Art. There is a helpful info section at the beginning, but we are interested in the data section. This data does not fit nicely into a tabular data structure. There may be several citations objects and several creators objects for each collection object. I have abbreviated the JSON file to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
{
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
  {
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   }, 
  {
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
  }]
"creators": [
      {
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"
     }
  ]
 }

Note

The API used in this recipe is provided by the Cleveland Museum of Art. It is available for public use at https://openaccess-api.clevelandart.org/.

How to do it...

Create a DataFrame from the museum's collections data with one row for each citation, and the title and creation_date duplicated:

  1. Import the json, requests, and pprint libraries.

    We need the requests library to use an API to retrieve JSON data. pprint improves the display of lists and dictionaries:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> import requests
  2. Use an API to load the JSON data.

    Make a get request to the collections API of the Cleveland Museum of Art. Use the query string to indicate that you just want collections from African-American artists. Display the first collection item. I have truncated the output for the first item to save space:

    >>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
    >>> camcollections = json.loads(response.text)
    >>> print(len(camcollections['data']))
    778
    >>> pprint.pprint(camcollections['data'][0])
    {'accession_number': '2007.158',
     'catalogue_raisonne': None,
     'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                    'page_number': 'Unpaginated, [8],[12]',
                    'url': None},
                   {'citation': '"Moscow to See Modern U.S....',
                    'page_number': 'P. 60',
                    'url': None}]
     'collection': 'American - Painting',
     'creation_date': '1958',
     'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
                   'birth_year': '1917',
                   'description': 'Jacob Lawrence (American...)',
                   'role': 'artist'}],
     'type': 'Painting'}
  3. Flatten the JSON data.

    Create a DataFrame from the JSON data using the json_normalize method. Indicate that the number of citations will determine the number of rows, and that accession_number, title, creation_date, collection, creators, and type will be repeated. Observe that the data has been flattened by displaying the first two observations, transposing them with the .T option to make it easier to view:

    >>> camcollectionsdf=pd.json_normalize(camcollections['data'],/
     'citations',['accession_number','title','creation_date',/
     'collection','creators','type'])
    >>> camcollectionsdf.head(2).T
                              0                       1
    citation        Annual Exhibiti...  "Moscow to See Modern...
    page_number           Unpaginated,                     P. 60
    url                          None                       None
    accession_number         2007.158                   2007.158
    title            Fulton and No...           Fulton and No...
    creation_date                1958                       1958
    collection       American - Pa...           American - Pa...
    creators   [{'description': 'J...     [{'description': 'J...
    type                     Painting                   Painting
  4. Pull the birth_year value from creators:
    >>> creator = camcollectionsdf[:1].creators[0]
    >>> type(creator[0])
    <class 'dict'>
    >>> pprint.pprint(creator)
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent art...',
      'birth_year': '1917',
      'death_year': '2000',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'extent': None,
      'name_in_original_language': None,
      'qualifier': None,
      'role': 'artist'}]
    >>> camcollectionsdf['birthyear'] = camcollectionsdf.\
    ...   creators.apply(lambda x: x[0]['birth_year'])
    >>> camcollectionsdf.birthyear.value_counts().\
    ...   sort_index().head()
    1821    18
    1886     2
    1888     1
    1892    13
    1899    17
    Name: birthyear, dtype: int64

This gives us a pandas DataFrame with one row for each citation for each collection item, with the collection information (title, creation_date, and so on) duplicated.

How it works…

We work with a much more interesting JSON file in this recipe than in the previous one. Each object in the JSON file is an item in the collection of the Cleveland Museum of Art. Nested within each collection item are one or more citations. The only way to capture this information in a tabular DataFrame is to flatten it. There are also one or more dictionaries for creators of the collection item (the artist or artists). That dictionary (or dictionaries) contains the birth_year value that we want.

We want one row for every citation for all collection items. To understand this, imagine that we are working with relational data and have a collections table and a citations table, and that we are doing a one-to-many merge from collections to citations. We do something similar with json_normalize by using citations as the second parameter. That tells json_normalize to create one row for each citation and use the key values in each citation dictionary – for citation, page_number, and url – as data values.

The third parameter in the call to json_normalize has the list of column names for the data that will be repeated with each citation. Notice that access_number, title, creation_date, collection, creators, and type are repeated in observations one and two. Citation and page_number change. (url is the same value for the first and second citations. Otherwise, it would also change.)

This still leaves us with the problem of the creators dictionaries (there can be more than one creator). When we ran json_normalize it grabbed the value for each key we indicated (in the third parameter) and stored it in the data for that column and row, whether that value was simple text or a list of dictionaries, as is the case for creators. We take a look at the first (and in this case, only) creators item for the first collections row in step 10, naming it creator. (Note that the creators list is duplicated across all citations for a collection item, just as the values for title, creation_date, and so on are.)

We want the birth year for the first creator for each collection item, which can be found at creator[0]['birth_year']. To create a birthyear series using this, we use apply and a lambda function:

>>> camcollectionsdf['birthyear'] = camcollectionsdf.\
...   creators.apply(lambda x: x[0]['birth_year'])

We take a closer look at lambda functions in Chapter 6, Cleaning and Exploring Data with Series Operations. Here, it is helpful to think of the x as representing the creators series, so x[0] gives us the list item we want, creators[0]. We grab the value from the birth_year key.

There's more…

You may have noticed that we left out some of the JSON returned by the API in our call to json_normalize. The first parameter that we passed to json_normalize was camcollections['data']. Effectively, we ignore the info object at the beginning of the JSON data. The information we want does not start until the data object. This is not very different conceptually from the skiprows parameter in the second recipe of the previous chapter. There is sometimes metadata like this at the beginning of JSON files.

See also

The preceding recipe demonstrates some useful techniques for doing data integrity checks without pandas, including list operations and comprehensions. Those are all relevant for the data in this recipe as well.

 

Importing data from web pages

We use Beautiful Soup in this recipe to scrape data from a web page and load that data into pandas. Web scraping is very useful when there is data at a website that is updated regularly, but there is no API. We can rerun our code to generate new data whenever the page is updated.

Unfortunately, the web scrapers we build can be broken when the structure of the targeted page changes. That is less likely to happen with APIs because they are designed for data exchange, and carefully curated with that end in mind. The priority for most web designers is the quality of the display of information, not the reliability and ease of data exchange. This causes data cleaning challenges unique to web scraping, including HTML elements that house the data being in surprising and changing locations, formatting tags that obfuscate the underlying data, and explanatory text that aid data interpretation being difficult to retrieve. In addition to these challenges, scraping presents data cleaning issues that are familiar, such as changing data types in columns, less than ideal headings, and missing values. We deal with data issues that occur most frequently in this recipe.

Getting ready

You will need Beautiful Soup installed to run the code in this recipe. You can install it with pip by entering pip install beautifulsoup4 in a terminal window or Windows PowerShell.

We will scrape data from a web page, find the following table in that page, and load it into a pandas DataFrame:

Figure 2.1 – COVID-19 data from six countries

Figure 2.1 – COVID-19 data from six countries

Note

I created this web page, http://www.alrb.org/datacleaning/covidcaseoutliers.html, based on COVID-19 data for public use from Our World in Data, available at https://ourworldindata.org/coronavirus-source-data.

How to do it…

We scrape the COVID data from the website and do some routine data checks:

  1. Import the pprint, requests, and BeautifulSoup libraries:
    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> import requests
    >>> from bs4 import BeautifulSoup
  2. Parse the web page and get the header row of the table.

    Use Beautiful Soup's find method to get the table we want and then use find_all to retrieve the elements nested within the th elements for that table. Create a list of column labels based on the text of the th rows:

    >>> webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
    >>> bs = BeautifulSoup(webpage.text, 'html.parser')
    >>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
    >>> type(theadrows)
    <class 'bs4.element.ResultSet'>
    >>> labelcols = [j.get_text() for j in theadrows]
    >>> labelcols[0] = "rowheadings"
    >>> labelcols
    ['rowheadings', 'Cases', 'Deaths', 'Cases per Million', 'Deaths per Million', 'population', 'population_density', 'median_age', 'gdp_per_capita', 'hospital_beds_per_100k']
  3. Get the data from the table cells.

    Find all of the table rows for the table we want. For each table row, find the th element and retrieve the text. We will use that text for our row labels. Also, for each row, find all the td elements (the table cells with the data) and save text from all of them in a list. This gives us datarows, which has all the numeric data in the table. (You can confirm that it matches the table from the web page.) We then insert the labelrows list (which has the row headings) at the beginning of each list in datarows:

    >>> rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')
    >>> datarows = []
    >>> labelrows = []
    >>> for row in rows:
    ...   rowlabels = row.find('th').get_text()
    ...   cells = row.find_all('td', {'class':'data'})
    ...   if (len(rowlabels)>3):
    ...     labelrows.append(rowlabels)
    ...   if (len(cells)>0):
    ...     cellvalues = [j.get_text() for j in cells]
    ...     datarows.append(cellvalues)
    ... 
    >>> pprint.pprint(datarows[0:2])
    [['9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9'],
     ['16,642', '668', '1848', '74', '9,006,400', '107', '44', '45,437', '7.4']]
    >>> pprint.pprint(labelrows[0:2])
    ['Algeria', 'Austria']
    >>> 
    >>> for i in range(len(datarows)):
    ...   datarows[i].insert(0, labelrows[i])
    ... 
    >>> pprint.pprint(datarows[0:1])
    [['Algeria','9,394','653','214','15','43,851,043','17','29','13,914','1.9']]
  4. Load the data into pandas.

    Pass the datarows list to the DataFrame method of pandas. Notice that all data is read into pandas with the object data type, and that some data has values that cannot be converted into numeric values in their current form (due to the commas):

    >>> totaldeaths = pd.DataFrame(datarows, columns=labelcols)
    >>> totaldeaths.head()
      rowheadings    Cases Deaths  ... median_age gdp_per_capita  \
    0     Algeria    9,394    653  ...         29         13,914   
    1     Austria   16,642    668  ...         44         45,437   
    2  Bangladesh   47,153    650  ...         28          3,524   
    3     Belgium   58,381   9467  ...         42         42,659   
    4      Brazil  514,849  29314  ...         34         14,103   
    >>> totaldeaths.dtypes
    rowheadings               object
    Cases                     object
    Deaths                    object
    Cases per Million         object
    Deaths per Million        object
    population                object
    population_density        object
    median_age                object
    gdp_per_capita            object
    hospital_beds_per_100k    object
    dtype: object
  5. Fix the column names and convert the data to numeric values.

    Remove spaces from column names. Remove all non-numeric data from the first columns with data, including the commas (str.replace("[^0-9]",""). Convert to numeric values, except for the rowheadings column:

    >>> totaldeaths.columns = totaldeaths.columns.str.replace(" ", "_").str.lower()
    >>> for col in totaldeaths.columns[1:-1]:
    ...   totaldeaths[col] = totaldeaths[col].\
    ...     str.replace("[^0-9]","").astype('int64')
    ... 
    >>> totaldeaths['hospital_beds_per_100k'] = totaldeaths['hospital_beds_per_100k'].astype('float')
    >>> totaldeaths.head()
      rowheadings   cases  deaths  ...  median_age  gdp_per_capita  \
    0     Algeria    9394     653  ...          29           13914   
    1     Austria   16642     668  ...          44           45437   
    2  Bangladesh   47153     650  ...          28            3524   
    3     Belgium   58381    9467  ...          42           42659   
    4      Brazil  514849   29314  ...          34           14103   
    >>> totaldeaths.dtypes
    rowheadings                object
    cases                       int64
    deaths                      int64
    cases_per_million           int64
    deaths_per_million          int64
    population                  int64
    population_density          int64
    median_age                  int64
    gdp_per_capita              int64
    hospital_beds_per_100k    float64
    dtype: object

We have now created a pandas DataFrame from an html table.

How it works…

Beautiful Soup is a very useful tool for finding specific HTML elements in a web page and retrieving text from them. You can get one HTML element with find and get one or more with find_all. The first argument for both find and find_all is the HTML element to get. The second argument takes a Python dictionary of attributes. You can retrieve text from all of the HTML elements you find with get_text.

Some amount of looping is usually necessary to process the elements and text, as with step 2 and step 3. These two statements in step 2 are fairly typical:

>>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
>>> labelcols = [j.get_text() for j in theadrows]

The first statement finds all the th elements we want and creates a Beautiful Soup result set called theadrows from the elements it found. The second statement iterates over the theadrows Beautiful Soup result set using the get_text method to get the text from each element, and stores it in the labelcols list.

Step 3 is a little more involved, but makes use of the same Beautiful Soup methods. We find all of the table rows (tr) in the target table (rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')). We then iterate over each of those rows, finding the th element and getting the text in that element (rowlabels = row.find('th').get_text()). We also find all of the table cells (td) for each row (cells = row.find_all('td', {'class':'data'}) and get the text from all table cells (cellvalues = [j.get_text() for j in cells]). Note that this code is dependent on the class of the td elements being data. Finally, we insert the row labels we get from the th elements at the beginning of each list in datarows:

>>> for i in range(len(datarows)):
...   datarows[i].insert(0, labelrows[i])

In step 4, we use the DataFrame method to load the list we created in steps 2 and 3 into pandas. We then do some cleaning similar to what we have done in previous recipes in this chapter. We use string replace to remove spaces from column names and to remove all non-numeric data, including commas, from what are otherwise valid numeric values. We convert all columns, except for the rowheadings column, to numeric.

There's more…

Our scraping code is dependent on several aspects of the web page's structure not changing: the ID of the main table, the presence of th tags with column and row labels, and the td elements continuing to have their class equal to data. The good news is that if the structure of the web page does change, this will likely only affect the find and find_all calls. The rest of the code would not need to change.

 

Persisting JSON data

There are several reasons why we might want to serialize a JSON file:

  • We may have retrieved the data with an API, but need to keep a snapshot of the data.
  • The data in the JSON file is relatively static and informs our data cleaning and analysis over multiple phases of a project.
  • We might decide that the flexibility of a schema-less format such as JSON helps us solve many data cleaning and analysis problems.

It is worth highlighting this last reason to use JSON – that it can solve many data problems. Although tabular data structures clearly have many benefits, particularly for operational data, they are often not the best way to store data for analysis purposes. In preparing data for analysis, a substantial amount of time is spent either merging data from different tables or dealing with data redundancy when working with flat files. Not only are these processes time consuming, but every merge or reshaping leaves the door open to a data error of broad scope. This can also mean that we end up paying too much attention to the mechanics of manipulating data and too little to the conceptual issues at the core of our work.

We return to the Cleveland Museum of Art collections data in this recipe. There are at least three possible units of analysis for this data file – the collection item level, the creator level, and the citation level. JSON allows us to nest citations and creators within collections. (You can examine the structure of the JSON file in the Getting ready section of this recipe.) This data cannot be persisted in a tabular structure without flattening the file, which we did in an earlier recipe in this chapter. In this recipe, we will use two different methods to persist JSON data, each with its own advantages and disadvantages.

Getting ready

We will be working with data on the Cleveland Museum of Art's collection of works by African-American artists. The following is the structure of the JSON data returned by the API. It has been abbreviated to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
{
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
  {
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   }, 
  {
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
  }]
"creators": [
      {
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"
     }
  ]
 }

How to do it...

We will serialize the JSON data using two different methods:

  1. Load the pandas, json, pprint, requests, and msgpack libraries:
    >>> import pandas as pd
    >>> import json
    >>> import pprint
    >>> import requests
    >>> import msgpack
  2. Load the JSON data from an API. I have abbreviated the JSON output:
    >>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
    >>> camcollections = json.loads(response.text)
    >>> print(len(camcollections['data']))
    778
    >>> pprint.pprint(camcollections['data'][0])
    {'accession_number': '2007.158',
     'catalogue_raisonne': None,
     'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                    'page_number': 'Unpaginated, [8],[12]',
                    'url': None},
                   {'citation': '"Moscow to See Modern U.S....',
                    'page_number': 'P. 60',
                    'url': None}]
     'collection': 'American - Painting',
     'creation_date': '1958',
     'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
                   'birth_year': '1917',
                   'description': 'Jacob Lawrence (American...',
                   'role': 'artist'}],
     'type': 'Painting'}
  3. Save and reload the JSON file using Python's json library.

    Persist the JSON data in human-readable form. Reload it from the saved file and confirm that it worked by retrieving the creators data from the first collections item:

    >>> with open("data/camcollections.json","w") as f:
    ...   json.dump(camcollections, f)
    ... 
    >>> with open("data/camcollections.json","r") as f:
    ...   camcollections = json.load(f)
    ... 
    >>> pprint.pprint(camcollections['data'][0]['creators'])
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent artist since...'
      'birth_year': '1917',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'role': 'artist'}]
  4. Save and reload the JSON file using msgpack:
    >>> with open("data/camcollections.msgpack", "wb") as outfile:
    ...     packed = msgpack.packb(camcollections)
    ...     outfile.write(packed)
    ... 
    1586507
    >>> with open("data/camcollections.msgpack", "rb") as data_file:
    ...     msgbytes = data_file.read()
    ... 
    >>> camcollections = msgpack.unpackb(msgbytes)
    >>> pprint.pprint(camcollections['data'][0]['creators'])
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent...',
      'birth_year': '1917',
      'death_year': '2000',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'role': 'artist'}]

How it works…

We use the Cleveland Museum of Art's collections API to retrieve collections items. The african_american_artists flag in the query string indicates that we just want collections for those creators. json.loads returns a dictionary called info and a list of dictionaries called data. We check the length of the data list. This tells us that there are 778 items in collections. We then display the first item of collections to get a better look at the structure of the data. (I have abbreviated the JSON output.)

We save and then reload the data using Python's JSON library in step 3. The advantage of persisting the data in this way is that it keeps the data in human-readable form. Unfortunately, it has two disadvantages: saving takes longer than alternative serialization methods, and it uses more storage space.

In step 4, we use msgpack to persist our data. This is faster than Python's json library, and the saved file uses less space. Of course, the disadvantage is that the resulting JSON is binary rather than text-based.

There's more…

I use both methods for persisting JSON data in my work. When I am working with small amounts of data, and that data is relatively static, I prefer human-readable JSON. A great use case for this is the recipes in the previous chapter where we needed to create value labels.

I use msgpack when I am working with large amounts of data, where that data changes regularly. msgpack files are also great when you want to take regular snapshots of key tables in enterprise databases.

The Cleveland Museum of Art's collections data is similar in at least one important way to the data we work with every day. The unit of analysis frequently changes. Here we are looking at collections, citations, and creators. In our work, we might have to simultaneously look at students and courses, or households and deposits. An enterprise database system for the museum data would likely have separate collections, citations, and creators tables that we would eventually need to merge. The resulting merged file would have data redundancy issues that we would need to account for whenever we changed the unit of analysis.

When we alter our data cleaning process to work directly from JSON or parts of it, we end up eliminating a major source of errors. We do more data cleaning with JSON in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning.

About the Author

  • Michael Walker

    Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He generates public sector and foundation reports and conducts analyses for publication in academic journals.

    Browse publications by this author

Latest Reviews

(3 reviews total)
Weel organized, easy to follow.
Excellent reference library
Gr8!!!!!!!!!!!!!!!!!!!!!!!

Recommended For You

Python Data Cleaning Cookbook
Unlock this book and the full library for FREE
Start free trial