Reader small image

You're reading from  The Data Wrangling Workshop - Second Edition

Product typeBook
Published inJul 2020
Reading LevelIntermediate
PublisherPackt
ISBN-139781839215001
Edition2nd Edition
Languages
Tools
Right arrow
Authors (3):
Brian Lipp
Brian Lipp
author image
Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.
Read more about Brian Lipp

Shubhadeep Roychowdhury
Shubhadeep Roychowdhury
author image
Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.
Read more about Shubhadeep Roychowdhury

Dr. Tirthajyoti Sarkar
Dr. Tirthajyoti Sarkar
author image
Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.
Read more about Dr. Tirthajyoti Sarkar

View More author details
Right arrow

About the Book

While a huge amount of data is readily available to us, it is not useful in its raw form. For data to be meaningful, it must be curated and refined.

If you're a beginner, then The Data Wrangling Workshop, Second Edition will help to break down the process for you. You'll start with the basics and build your knowledge, progressing from the core aspects behind data wrangling, to using the most popular tools and techniques.

This book starts by showing you how to work with data structures using Python. Through examples and activities, you'll understand why you should stay away from traditional methods of data cleaning used in other languages and take advantage of the specialized pre-built routines in Python. Later, you'll learn how to use the same Python backend to extract and transform data from an array of sources, including the internet, large database vaults, and Excel financial tables. To help you prepare for more challenging scenarios, the book teaches you how to handle missing or incorrect data, and reformat it based on the requirements from your downstream analytics tool.

By the end of this book, you will have developed a solid understanding of how to perform data wrangling with Python, and learned several techniques and best practices to extract, clean, transform, and format your data efficiently, from a diverse array of sources.

Audience

The Data Wrangling Workshop, Second Edition is designed for developers, data analysts, and business analysts who are looking to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners who want to start data wrangling, prior working knowledge of the Python programming language is necessary to easily grasp the concepts covered here. It will also help to have rudimentary knowledge of relational databases and SQL.

About the Chapters

Chapter 1, Introduction to Data Wrangling with Python, describes the importance of data wrangling in data science and introduces the basic building blocks that are used in data wrangling.

Chapter 2, Advanced Operations on Built-in Data Structures, discusses advanced built-in data structures that can be used for complex data wrangling problems faced by data scientists. The chapter will also talk about working with files using standard Python libraries.

Chapter 3, Introduction to NumPy, Pandas, and Matplotlib, will introduce you to the fundamentals of the NumPy, Pandas, and Matplotlib libraries. These are fundamental libraries for when you are performing data wrangling. This chapter will teach you how to calculate descriptive statistics of a one-dimensional/multi-dimensional DataFrame.

Chapter 4, A Deep Dive into Data Wrangling with Python, will introduce working with pandas DataFrames, including coverage of advanced concepts such as subsetting, filtering, grouping, and much more.

Chapter 5, Getting Comfortable with Different Kinds of Data Sources, introduces you to the several diverse data sources you might encounter as a data wrangler. This chapter will provide you with the knowledge to read CSV, Excel, and JSON files into pandas DataFrames.

Chapter 6, Learning the Hidden Secrets of Data Wrangling, discusses data problems that arise in business use cases and how to resolve them. This chapter will give you the knowledge needed to be able to clean and handle real-life messy data.

Chapter 7, Advanced Web Scraping and Data Gathering, introduces you to the concepts of advanced web scraping and data gathering. It will enable you to use Python libraries such as requests and BeautifulSoup to read various web pages and gather data from them.

Chapter 8, RDBMS and SQL, will introduce you to the basics of using RDBMSes to query databases using Python and convert data from SQL and store it into a pandas DataFrame. A large part of the world's data is stored in RDBMSes, so it is necessary to master this topic if you want to become a successful data-wrangling expert.

Chapter 9, Applications in Business Use Cases and Conclusion of the Course, will enable you to utilize the skills you have learned through the course of the previous chapters. By the end of this chapter, you will be able to easily handle data wrangling tasks for business use cases.

Conventions

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "This will return the value associated with it – ["list_element1", 34]".

A block of code is set as follows:

list_1 = []
    for x in range(0, 10):
    list_1.append(x)
list_1

Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click New and choose Python 3."

Code Presentation

Lines of code that span multiple lines are split using a backslash ( \ ). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.

For example:

history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
                    validation_split=0.2, shuffle=False)

Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:

# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])

Multi-line comments are enclosed by triple quotes, as shown below:

"""
Define a seed for the random number generator to ensure the 
result will be reproducible
"""
seed = 1
np.random.seed(seed)
random.set_seed(seed)

Setting up Your Environment

Before we explore the book in detail, we need to set up specific software and tools. In the following section, we shall see how to do that.

Installing Python

Installing Python on Windows

To install Python on Windows, do the following:

  1. Find your desired version of Python on the official installation page at https://www.anaconda.com/distribution/#windows.
  2. Ensure that you select Python 3.7 on the download page.
  3. Ensure that you install the correct architecture for your computer system, that is, either 32-bit or 64-bit. You can find out this information in the System Properties window of your OS.
  4. After you have downloaded the installer, simply double-click the file and follow the user-friendly prompts onscreen.

Installing Python on Linux

To install Python on Linux, you have a couple of options. Here is one option:

  1. Open the command line and verify that Python 3 is not already installed by running python3 --version.
  2. To install Python 3, run this:
    sudo apt-get update
    sudo apt-get install python3.7
  3. If you encounter problems, there are numerous sources online that can help you troubleshoot the issue.

Alternatively, install Anaconda Linux by downloading the installer from https://www.anaconda.com/distribution/#linux and following the instructions.

Installing Python on MacOS

Similar to the case with Linux, you have a couple of methods for installing Python on a Mac. To install Python on macOS X, do the following:

  1. Open the Terminal for Mac by pressing CMD + Spacebar, type terminal in the open search box, and hit Enter.
  2. Install Xcode through the command line by running xcode-select –install.
  3. The easiest way to install Python 3 is using Homebrew, which is installed through the command line by running ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)".
  4. Add Homebrew to your $PATH environment variable. Open your profile in the command line by running sudo nano ~/.profile and inserting export PATH="/usr/local/opt/python/libexec/bin:$PATH" at the bottom.
  5. The final step is to install Python. In the command line, run brew install python.
  6. You can also install Python via the Anaconda installer available from https://www.anaconda.com/distribution/#macos.

Installing Libraries

pip comes pre-installed with Anaconda. Once Anaconda is installed on your machine, all the required libraries can be installed using pip, for example, pip install numpy. Alternatively, you can install all the required libraries using pip install –r requirements.txt. You can find the requirements.txt file at https://packt.live/30UUshh.

The exercises and activities will be executed in Jupyter Notebooks. Jupyter is a Python library and can be installed in the same way as the other Python libraries – that is, with pip install jupyter, but fortunately, it comes pre-installed with Anaconda. To open a notebook, simply run the command jupyter notebook in the Terminal or Command Prompt.

Project Jupyter

Project Jupyter is open source, free software that gives you the ability to run code written in Python and some other languages interactively from a special notebook, similar to a browser interface. It was born in 2014 from the IPython project and has since become the default choice for the entire data science workforce.

Once you are running the Jupyter server, click New and choose Python 3. A new browser tab will open with a new and empty notebook. Rename the Jupyter file:

Figure 0.1: Jupyter server interface

Figure 0.1: Jupyter server interface

The main building blocks of Jupyter notebooks are cells. There are two types of cells: In (short for input) and Out (short for output). You can write code, normal text, and Markdown in In cells, press Shift + Enter (or Shift + Return), and the code written in that particular In cell will be executed. The result will be shown in an Out cell, and you will land in a new In cell, ready for the next block of code. Once you get used to this interface, you will slowly discover the power and flexibility it offers.

When you start a new cell, by default, it is assumed that you will write code in it. However, if you want to write text, then you have to change the type. You can do that using the following sequence of keys: Escape -> m -> Enter:

Figure 0.2: Jupyter notebook

Figure 0.2: Jupyter notebook

When you are done with writing the text, execute it using Shift + Enter. Unlike the code cells, the result of the compiled Markdown will be shown in the same place as the In cell.

To have a cheat sheet of all the handy key shortcuts in Jupyter Notebook, you can bookmark this Gist: https://gist.github.com/kidpixo/f4318f8c8143adee5b40. With this basic introduction and the image ready to be used, we are ready to embark on the exciting and enlightening journey.

Accessing the Code Files

You can find the complete code files of this book at https://packt.live/2YenXcb. You can also run many activities and exercises directly in your web browser by using the interactive lab environment at https://packt.live/2YKlrJQ.

We've tried to support interactive versions of all activities and exercises, but we recommend a local installation as well for instances where this support isn't available.

If you have any issues or questions about installation, please email us at workshops@packt.com.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
The Data Wrangling Workshop - Second Edition
Published in: Jul 2020Publisher: PacktISBN-13: 9781839215001
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (3)

author image
Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.
Read more about Brian Lipp

author image
Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.
Read more about Shubhadeep Roychowdhury

author image
Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.
Read more about Dr. Tirthajyoti Sarkar