Data Cleaning and Manipulation

Before we dive into data analysis, data needs to be properly prepared and structured. Some datasets, for example structured computer logs, are ready to go from the start but, more often than not, most of the effort is spent preparing the data properly. This process inevitably requires certain decisions that depend on the specifics of the task.

In this chapter, we will learn how to prepare the data with pandas, using the dataset we collected from Wikipedia in Chapter 7, Scraping Data from the Web with Beautiful Soup 4, as an example.

We will cover the following topics in the chapter:

  • Quick start with pandas
  • Working with real data
  • Regular expressions
  • Using custom functions with pandas dataframes
  • Writing the file

Technical requirements

The code for this chapter makes use of two packages: pandas, which is included in the default Anaconda distribution, and missingno, which we included in the environment.yml file. If you skipped the conda environment creation step, just install missingno using the pip or conda package managers. As always, all the notebooks are stored in the repository, in the Chapter11 folder (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications).

Getting started with pandas

pandas is the go-to tool for data manipulation in Python—it combines speed and convenience, allowing rapid processing and manipulation of data. Let's first review a number of basic operations: pandas is simple and intuitive to use, but there is still a learning curve.

pandas has two main data structures:

  1. Series is a one-dimensional array of one data type that also has an index. The index can be numeric, categorical, string, or datetime.
  2. DataFrame is a two-dimensional table consisting of a set of columns—each of a single data type. A DataFrame has two indexes—index and columns. Columns of a DataFrame can be thought of as Series. Rows can also be retrieved as Series but, in this case, the data in the cells will likely be converted to one shared data type, object (more on that later). Both structures are illustrated in the short sketch after this list.
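The following is a minimal sketch with toy values (not our dataset), showing both structures and the row-versus-column behavior just described:

import pandas as pd

# a Series: one-dimensional, a single data type, with an index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# a DataFrame: a table of columns, each with its own data type
df = pd.DataFrame({'name': ['Moscow', 'Warsaw'],
                   'lat': [55.75, 52.23]})

print(df['name'])  # a column is a Series with a single dtype
print(df.loc[0])   # a row is also a Series, but its values are upcast to object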

Most of the time, we get our data from external...

Working with real data

Let's now try using pandas on real data. In Chapter 7, Scraping Data from the Web with Beautiful Soup 4, we collected a huge dataset of WWII battles and operations—including casualties, armies, dates, and locations. We never explored what is inside that dataset, though, and this kind of data usually requires intensive processing. Now, let's see what we can do with it.

As you may recall, we stored the dataset as a nested .json file. pandas can read from JSON files of different structures, but it won't understand nested data points. At this point, the task for us is straightforward (you may think of writing a recursive function, for example), so we won't discuss this much. If you want, you can check the 0_json_to_table.ipynb notebook in this chapter's folder on GitHub at the following link: https://github...
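One possible shortcut (a sketch, not the notebook's actual code; the file name and structure here are assumptions) is pandas' own json_normalize, which flattens nested dictionaries into dot-separated columns:

import json
import pandas as pd

# hypothetical file name; the real dataset is produced in Chapter 7
with open('battles.json') as f:
    records = json.load(f)

# flatten nested dictionaries, e.g. {'Belligerents': {'allies': ...}}
# becomes a 'Belligerents.allies' column
battles = pd.json_normalize(records)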

Getting to know regular expressions

Strings that store data usually follow certain patterns, which can be leveraged to retrieve the actual data values in a unified fashion. For example, some location cells contain distinctive coordinates, with numbers and symbols for degrees, minutes, and seconds. To extract those values, we could write custom Python code, but that would be verbose and time-consuming.

This problem – extracting values from text by defining a pattern – sounds like something quite general and useful in many situations. When a problem can be stated as something universal, it usually means that it is, and someone has a solution! This is, by the way, a good approach for programming in general.

Indeed, there is a universal solution, called regular expressions, or regex. Regex is a special mini-language that defines patterns in a text to look for. It is language-agnostic...
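As a tiny, self-contained illustration (not code from the book), here is how a pattern is defined and applied in Python with the built-in re module:

import re

# one or more digits or commas, captured before the word 'dead'
pattern = r'([\d,]+)\s*dead'

match = re.search(pattern, 'Casualties: 1,200 dead, 3,400 wounded')
if match:
    print(match.group(1))  # '1,200'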

Parsing locations

Let's start with the location column. As you remember, data in this column is supposed to represent the location where the battle took place. In many cases, the value was stored as a Wikipedia GeoMarker, which includes latitude/longitude coordinates. Here is what the raw value of this marker looks like:

>>> battles['Location'].iloc[10]
'Warsaw, Poland52°13′48″N 21°00′39″E\ufeff / \ufeff52.23000°N 21.01083°E\ufeff / 52.23000; 21.01083Coordinates: 52°13′48″N 21°00′39″E\ufeff / \ufeff52.23000°N 21.01083°E\ufeff / 52.23000; 21.01083'

Note that this geotag has both a nice latitude/longitude pair (with minutes and seconds), as well as its float representation, which is easier to use. In fact, the very same coordinates are repeated at the very...
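One way to pull the decimal latitude/longitude pair out of such a marker (a sketch, not necessarily the exact pattern used later in the book) is to target the 'latitude; longitude' part of the string:

import re

raw = battles['Location'].iloc[10]

# the float representation appears as 'latitude; longitude'
match = re.search(r'(-?\d+\.\d+);\s*(-?\d+\.\d+)', raw)
if match:
    lat, lon = map(float, match.groups())
    print(lat, lon)  # 52.23 21.01083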

Time

Another column is time. Now, pandas has a built-in DateTime parser, and a very good one! Just use pd.to_datetime() on your scalar value or a collection. In this case, however, it won't work, and neither will any of the external packages that usually help (dateparser is our favorite). That is because the cells describe a time range, not just one specific date.

Again, let's (at least for now) see whether we can make our life simpler. Indeed, we probably don't care about specific dates—all we need is the month and year. Luckily, all months are properly stated and uniform—and pd.to_datetime can parse them. So, all we need is to correctly extract two month-year pairs from each cell.

Now, it seems hard to define one regular expression that will work here. Instead, we can try to get all years (we know all of them are four-digit numbers, starting with 19) and...
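A sketch of that approach (the example string and the exact patterns are assumptions) could look like this:

import re
import pandas as pd

period = '1 September 1939 – 6 October 1939'  # a hypothetical cell value

# all years are four-digit numbers starting with 19
years = re.findall(r'19\d{2}', period)

# month names are uniform, so 'Month Year' strings parse cleanly
months = re.findall(r'January|February|March|April|May|June|July|'
                    r'August|September|October|November|December', period)

start = pd.to_datetime(f'{months[0]} {years[0]}')  # 1939-09-01
end = pd.to_datetime(f'{months[-1]} {years[-1]}')  # 1939-10-01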

Belligerents

Lastly, as we noticed, in some rows the axis and allies parties are swapped. This dual model is slightly confusing for this specific dataset. For example, we'll have to mark the Soviets as axis when they attacked Poland during the initial stages of the war. Let's take a look at all the possible combinations:

battles['Belligerents.allies'].value_counts()

Here, value_counts() calculates the number of occurrences of each value, so the index of the resulting Series represents the unique values. There is a more intuitive alternative – the unique() function (which is also faster). However, it returns a NumPy array, which Jupyter prints poorly—that's the only reason we prefer value_counts here.

From the examination, we can observe that all the incorrect values contain either one of 'Germany', '...
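One way to implement the swap for those rows (a sketch; the keyword list and the exact approach are assumptions) is to build a boolean mask and exchange the two columns where it is True:

# keywords that indicate the 'allies' cell actually lists axis parties (assumed)
axis_keywords = ['Germany', 'Italy', 'Japan']

mask = battles['Belligerents.allies'].str.contains(
    '|'.join(axis_keywords), na=False)

# swap the two belligerent columns for the matching rows
battles.loc[mask, ['Belligerents.allies', 'Belligerents.axis']] = \
    battles.loc[mask, ['Belligerents.axis', 'Belligerents.allies']].values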

Understanding casualties

Casualties are probably the most verbose and least structured columns of the dataset. It will be extremely hard to make use of all the nuances of the information here, so again, perhaps we can simplify the task and extract only the things we really want to use. We can use code words to pull out any digits preceding them; for example, ([\d|,]+)\s*dead should extract any consecutive digits or commas before the word 'dead'. We can define similar patterns for all types of casualties and loop over all of them, testing the patterns. Unfortunately, there are many keywords that mean the same thing ('captured', 'prisoners', and many more), so we have to make them optional, similar to the preceding month expression:

digit_pattern = '([\d|\,]+)(?:\[\d+\])?\s*(?:{words})'

keywords = { 'killed': ['dead&apos...
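The keyword dictionary above is truncated; assuming a few (hypothetical) synonyms per category, the pattern can be formatted for each group and applied with str.extract, roughly like this:

# assumed, partial keyword dictionary—the book defines more synonyms
keywords = {
    'killed': ['dead', 'killed'],
    'captured': ['captured', 'prisoners'],
}

column = battles['Casualties.axis']  # assumed column name

for name, words in keywords.items():
    pattern = digit_pattern.format(words='|'.join(words))
    # extract the first match, drop commas, and convert to a number
    battles['axis_' + name] = (column.str.extract(pattern, expand=False)
                                     .str.replace(',', '', regex=False)
                                     .astype(float))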

Quality assurance

I know we have spent a lot of time cleaning the data, but there is still one last task we need to perform – quality assurance. Proper quality assurance is a very important practice. In a nutshell, you need to define certain assumptions about the dataset (for example, minimum and maximum values, the acceptable number of missing values, standard deviation, medians, the number of unique values, and many more). The key is to start with something that is somewhat reasonable, and then run tests to check whether the data fits your assumptions. If not, investigate specific data points to check whether your assumptions were incorrect (and update them), or whether there are still some issues with the data. It just gets a little more tricky for the multilevel columns. Consider the following code:

assumptions = {
'killed': [0, 1_500_000],
'wounded&apos...
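The assumptions dictionary above is truncated; a minimal sketch of the check itself (ignoring the multilevel-column complication and using assumed column names and bounds) could iterate over it and report values outside the expected range:

assumptions = {
    'killed': [0, 1_500_000],
    'wounded': [0, 1_000_000],  # assumed bound
}

for column, (low, high) in assumptions.items():
    values = battles[column].dropna()
    bad = values[(values < low) | (values > high)]
    if len(bad) > 0:
        print(f'{column}: {len(bad)} values outside [{low}, {high}]')
        print(bad)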

Writing the file

Finally, we have all the data we wanted, in more-or-less good condition. Let's store it in CSV format. We could always use other formats instead. For example, the pickle format, by design, preserves all the data types and properties of the dataframe (we won't need to convert dates from strings again), but it can't be read or edited manually (it also carries a number of security risks). CSV, on the other hand, can be opened manually or with something like Excel, edited, and then stored again if you observe factual errors in the data or something that is easier to correct by hand.

In the following code block, we export our dataframe to a CSV file; all we have to specify is a relative path to the file we want to create. The index=None argument is optional—it ensures that the index (a generic range of numbers in our case) won't be written:

new_dataset...
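For reference, a minimal version of that call (the file path here is an assumption) looks like the following; the file can later be read back with pd.read_csv:

# index=None keeps the generic range index out of the file
new_dataset.to_csv('./data/battles_cleaned.csv', index=None)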

Summary

In this chapter, we spent time cleaning the data we acquired in Chapter 7, Scraping Data from the Web with Beautiful Soup 4. Unless data was carefully prepared for the exact purpose of analysis, the chances are that cleaning will take a lot of time and effort. Here, we learned the basics of pandas and how to filter and mask data. We discussed how to investigate missing values, saw how to use regular expressions to extract specific values from non-structured text (creating data of a proper structure and type), learned how to apply custom functions to each cell in an entire Series or DataFrame, and then used that information to geocode the locations where we lacked coordinates.

Finally, we stored all the data we processed, along with the original values, in another CSV file, ready to be explored in our next chapter.

Questions

  1. Why, if there is an empty cell in a pandas column, are the integer values in this column converted into floats?
  2. What is the benefit of plotting missing values?
  3. What is regex? Is it a separate language?
  4. How can we use regex in Python?
  5. How is a regex pattern defined? How can we combine and modify patterns dynamically within the code?
  6. Is it a good idea to run ordinary Python functions on dataframe cells? What are the pros and cons of that approach? Should we use loops for that?