Data Cleaning and Manipulation

Before we dive into data analysis, data needs to be properly prepared and structured. Some datasets, for example structured computer logs, are ready to go from the start but, more often than not, most of the effort is spent preparing the data properly. This process inevitably requires certain decisions that depend on the specifics of the task.

In this chapter, we will learn how to prepare the data with pandas, using the dataset we collected from Wikipedia in Chapter 7, Scraping Data from the Web with Beautiful Soup 4, as an example.

We will cover the following topics in the chapter:

  • Quick start with pandas
  • Working with real data
  • Regular expressions
  • Using custom functions with pandas dataframes
  • Writing the file

Technical requirements

The code for this chapter makes use of two packages: pandas, which is included in the default Anaconda distribution, and missingno, which we included in the environment.yml file. If you skipped the conda environment creation step, just install missingno using the pip or conda package managers. As always, all the notebooks are stored in the repository, in the Chapter11 folder (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications).

Getting started with pandas

pandas is the go-to tool for data manipulation in Python—it combines speed and convenience, allowing rapid processing and manipulation of data. Let's first review a number of basic operations: pandas is simple and intuitive to use, but there is still a learning curve.

pandas has two main data structures:

  1. Series is a one-dimensional array of one data type that also has an index. The index can be numeric, categorical, string, or datetime.
  2. DataFrame is a two-dimensional table consisting of a set of columns—each of a single data type. A DataFrame has two indexes—index and columns. Columns of a DataFrame can be thought of as Series. Rows can also be retrieved as Series but, in this case, the data in the cells will likely be converted to one shared data type, object (more on that later). Both structures are illustrated in the short sketch after this list.
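The following is a minimal sketch with toy values (not our dataset), showing both structures and the row-versus-column behavior just described:

import pandas as pd

# a Series: one-dimensional, a single data type, with an index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# a DataFrame: a table of columns, each with its own data type
df = pd.DataFrame({'name': ['Moscow', 'Warsaw'],
                   'lat': [55.75, 52.23]})

print(df['name'])  # a column is a Series with a single dtype
print(df.loc[0])   # a row is also a Series, but its values are upcast to object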

Most of the time, we get our data from external...

Working with real data

Let's now try using pandas on real data. In Chapter 7, Scraping Data from the Web with Beautiful Soup 4, we collected a huge dataset of WWII battles and operations—including casualties, armies, dates, and locations. We never explored what is inside that dataset, though, and this kind of data usually requires intensive processing. Now, let's see what we can do with it.

As you may recall, we stored the dataset as a nested .json file. pandas can read from JSON files of different structures, but it won't understand nested data points. At this point, the task for us is straightforward (you may think of writing a recursive function, for example), so we won't discuss this much. If you want, you can check the 0_json_to_table.ipynb notebook in this chapter's folder on GitHub at the following link: https://github...
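One possible shortcut (a sketch, not the notebook's actual code; the file name and structure here are assumptions) is pandas' own json_normalize, which flattens nested dictionaries into dot-separated columns:

import json
import pandas as pd

# hypothetical file name; the real dataset is produced in Chapter 7
with open('battles.json') as f:
    records = json.load(f)

# flatten nested dictionaries, e.g. {'Belligerents': {'allies': ...}}
# becomes a 'Belligerents.allies' column
battles = pd.json_normalize(records)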

Getting to know regular expressions

Strings that store data usually follow certain patterns, which can be leveraged to retrieve the actual data values in a unified fashion. For example, some location cells contain distinctive coordinates, with numbers and symbols for degrees, minutes, and seconds. To extract those values, we could write custom Python code, but that would be verbose and time-consuming.

This problem – extracting values from text by defining a pattern – sounds like something quite general and useful in many situations. When a problem can be stated as something universal, it usually means that it is, and someone has a solution! This is, by the way, a good approach for programming in general.

Indeed, there is a universal solution, called regular expressions, or regex. Regex is a special mini-language that defines patterns in a text to look for. It is language-agnostic...
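As a tiny, self-contained illustration (not code from the book), here is how a pattern is defined and applied in Python with the built-in re module:

import re

# one or more digits or commas, captured before the word 'dead'
pattern = r'([\d,]+)\s*dead'

match = re.search(pattern, 'Casualties: 1,200 dead, 3,400 wounded')
if match:
    print(match.group(1))  # '1,200'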

Parsing locations

Let's start with the location column. As you remember, data in this column is supposed to represent the location where the battle took place. In many cases, the value was stored as a Wikipedia GeoMarker, which includes latitude/longitude coordinates. Here is what the raw value of this marker looks like:

>>> battles['Location'].iloc[10]
'Warsaw, Poland52°13′48″N 21°00′39″E\ufeff / \ufeff52.23000°N 21.01083°E\ufeff / 52.23000; 21.01083Coordinates: 52°13′48″N 21°00′39″E\ufeff / \ufeff52.23000°N 21.01083°E\ufeff / 52.23000; 21.01083'

Note that this geotag has both a nice latitude/longitude pair (with minutes and seconds), as well as its float representation, which is easier to use. In fact, the very same coordinates are repeated at the very...
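One way to pull the decimal latitude/longitude pair out of such a marker (a sketch, not necessarily the exact pattern used later in the book) is to target the 'latitude; longitude' part of the string:

import re

raw = battles['Location'].iloc[10]

# the float representation appears as 'latitude; longitude'
match = re.search(r'(-?\d+\.\d+);\s*(-?\d+\.\d+)', raw)
if match:
    lat, lon = map(float, match.groups())
    print(lat, lon)  # 52.23 21.01083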

Time

Another column is time. Now, pandas has a built-in DateTime parser, and a very good one! Just use pd.to_datetime() on your scalar value or a collection. In this case, however, it won't work, and neither will any of the external packages that usually help (dateparser is our favorite). That is because the cells describe a time range, not just one specific date.

Again, let's (at least for now) see whether we can make our life simpler. Indeed, we probably don't care about specific dates—all we need is the month and year. Luckily, all months are properly stated and uniform—and pd.to_datetime can parse them. So, all we need is to correctly extract two month-year pairs from each cell.

Now, it seems hard to define one regular expression that will work here. Instead, we can try to get all years (we know all of them are four-digit numbers, starting with 19) and...
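A sketch of that approach (the example string and the exact patterns are assumptions) could look like this:

import re
import pandas as pd

period = '1 September 1939 – 6 October 1939'  # a hypothetical cell value

# all years are four-digit numbers starting with 19
years = re.findall(r'19\d{2}', period)

# month names are uniform, so 'Month Year' strings parse cleanly
months = re.findall(r'January|February|March|April|May|June|July|'
                    r'August|September|October|November|December', period)

start = pd.to_datetime(f'{months[0]} {years[0]}')  # 1939-09-01
end = pd.to_datetime(f'{months[-1]} {years[-1]}')  # 1939-10-01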

Belligerents

Lastly, as we noticed, in some rows the axis and allies parties are swapped. This dual model is slightly confusing for this specific dataset. For example, we'll have to mark the Soviets as axis when they attacked Poland during the initial stages of the war. Let's take a look at all the possible combinations:

battles['Belligerents.allies'].value_counts()

Here, value_counts() calculates the number of occurrences of each value, so the index of the resulting Series represents the unique values. There is a more intuitive alternative – the unique() function (which is also faster). However, it returns a NumPy array, which Jupyter prints poorly—that's the only reason we prefer value_counts here.

From the examination, we can observe that all the incorrect values contain either one of 'Germany', '...
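One way to implement the swap for those rows (a sketch; the keyword list and the exact approach are assumptions) is to build a boolean mask and exchange the two columns where it is True:

# keywords that indicate the 'allies' cell actually lists axis parties (assumed)
axis_keywords = ['Germany', 'Italy', 'Japan']

mask = battles['Belligerents.allies'].str.contains(
    '|'.join(axis_keywords), na=False)

# swap the two belligerent columns for the matching rows
battles.loc[mask, ['Belligerents.allies', 'Belligerents.axis']] = \
    battles.loc[mask, ['Belligerents.axis', 'Belligerents.allies']].values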

Understanding casualties

Casualties are probably the most verbose and least structured columns of the dataset. It will be extremely hard to make use of all the nuances of the information here, so again, perhaps we can simplify the task and extract only the things we really want to use. We can use code words to pull out any digits preceding them; for example, ([\d|,]+)\s*dead should extract any consecutive digits or commas before the word 'dead'. We can define similar patterns for all types of casualties and loop over all of them, testing the patterns. Unfortunately, there are many keywords that mean the same thing ('captured', 'prisoners', and many more), so we have to make them optional, similar to the preceding month expression:

digit_pattern = '([\d|\,]+)(?:\[\d+\])?\s*(?:{words})'

keywords = { 'killed': ['dead&apos...
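The keyword dictionary above is truncated; assuming a few (hypothetical) synonyms per category, the pattern can be formatted for each group and applied with str.extract, roughly like this:

# assumed, partial keyword dictionary—the book defines more synonyms
keywords = {
    'killed': ['dead', 'killed'],
    'captured': ['captured', 'prisoners'],
}

column = battles['Casualties.axis']  # assumed column name

for name, words in keywords.items():
    pattern = digit_pattern.format(words='|'.join(words))
    # extract the first match, drop commas, and convert to a number
    battles['axis_' + name] = (column.str.extract(pattern, expand=False)
                                     .str.replace(',', '', regex=False)
                                     .astype(float))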

Quality assurance

I know we have spent a lot of time cleaning the data, but there is still one last task we need to perform – quality assurance. Proper quality assurance is a very important practice. In a nutshell, you need to define certain assumptions about the dataset (for example, minimum and maximum values, the acceptable number of missing values, standard deviation, medians, the number of unique values, and many more). The key is to start with something that is somewhat reasonable, and then run tests to check whether the data fits your assumptions. If not, investigate specific data points to check whether your assumptions were incorrect (and update them), or whether there are still some issues with the data. It just gets a little more tricky for the multilevel columns. Consider the following code:

assumptions = {
'killed': [0, 1_500_000],
'wounded&apos...
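The assumptions dictionary above is truncated; a minimal sketch of the check itself (ignoring the multilevel-column complication and using assumed column names and bounds) could iterate over it and report values outside the expected range:

assumptions = {
    'killed': [0, 1_500_000],
    'wounded': [0, 1_000_000],  # assumed bound
}

for column, (low, high) in assumptions.items():
    values = battles[column].dropna()
    bad = values[(values < low) | (values > high)]
    if len(bad) > 0:
        print(f'{column}: {len(bad)} values outside [{low}, {high}]')
        print(bad)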

Writing the file

Finally, we have all the data we wanted, in more-or-less good condition. Let's store it in CSV format. We could always use other formats instead. For example, the pickle format, by design, preserves all the data types and properties of the dataframe (we won't need to convert dates from strings again), but it can't be read or edited manually (it also carries a number of security risks). CSV, on the other hand, can be opened manually or with something like Excel, edited, and then stored again if you observe factual errors in the data or something that is easier to correct by hand.

In the following code block, we export our dataframe to a CSV file; all we have to specify is a relative path to the file we want to create. The index=None argument is optional—it ensures that the index (a generic range of numbers in our case) won't be written:

new_dataset...
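For reference, a minimal version of that call (the file path here is an assumption) looks like the following; the file can later be read back with pd.read_csv:

# index=None keeps the generic range index out of the file
new_dataset.to_csv('./data/battles_cleaned.csv', index=None)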

Summary

In this chapter, we spent time cleaning the data we acquired in Chapter 7, Scraping Data from the Web with Beautiful Soup 4. Unless data was carefully prepared for the exact purpose of analysis, the chances are that cleaning will take a lot of time and effort. Here, we learned the basics of pandas and how to filter and mask data. We discussed how to investigate missing values, saw how to use regular expressions to extract specific values from non-structured text (creating data of a proper structure and type), learned how to apply custom functions to each cell in an entire Series or DataFrame, and then used that information to geocode the locations where we lacked coordinates.

Finally, we stored all the data we processed, along with the original values, in another CSV file, ready to be explored in our next chapter.

Questions

  1. Why, if there is an empty cell in a pandas column, are the integer values in this column converted into floats?
  2. What is the benefit of plotting missing values?
  3. What is regex? Is it a separate language?
  4. How can we use regex in Python?
  5. How is a regex pattern defined? How can we combine and modify patterns dynamically within the code?
  6. Is it a good idea to run ordinary Python functions on dataframe cells? What are the pros and cons of that approach? Should we use loops for that?