You're reading from The Data Wrangling Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839215001
Edition: 2nd Edition
Authors (3):
Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.

Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.

Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.

3. Introduction to NumPy, Pandas, and Matplotlib

Overview

In this chapter, you will learn about the fundamentals of the NumPy, pandas, and matplotlib libraries. You will learn to create one-dimensional and multi-dimensional arrays and manipulate pandas DataFrames and series objects. By the end of this chapter, you will be able to visualize and plot numerical data using the matplotlib library, and apply matplotlib, NumPy, and pandas to calculate descriptive statistics from a DataFrame or matrix.

Introduction

In the preceding chapters, we covered some advanced data structures, such as stacks, queues, and iterators, as well as file operations in Python. In this chapter, we will cover three essential libraries: NumPy, pandas, and matplotlib. NumPy is an advanced math library for Python with an extensive range of functionality. pandas is a library built on NumPy that allows developers to model data in a table structure similar to a database table. matplotlib is a charting library influenced by MATLAB. With these libraries, you will be able to handle most data wrangling tasks.

NumPy Arrays

A NumPy array is similar to a Python list but differs in important ways; most notably, it is optimized for vectorized element-wise operations. In the life of a data scientist, reading and manipulating arrays is of prime importance, and it is also one of the most frequently encountered tasks. An array can be one-dimensional like a list, a multi-dimensional table, or a matrix full of numbers, and it can be used for a variety of mathematical calculations.

An array could be filled with integers, floating-point numbers, Booleans, strings, or even mixed types. However, in the majority of cases, numeric data types are predominant. Some example scenarios where you will need to handle numeric arrays are as follows:

  • To read a list of phone numbers and postal codes and extract a certain pattern
  • To create a matrix with random numbers to run a Monte Carlo simulation on a statistical process
  • To scale and normalize a sales figure table, with lots of financial and transactional data
  • To create a smaller table of key descriptive statistics (for example...
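
To make the difference between a list and an array concrete, here is a minimal sketch. The `prices` values and all variable names are hypothetical, chosen only for illustration; the scaling step is a simple min-max normalization, which is one possible way to normalize a column of figures and not necessarily the one used later in the chapter.

```python
import numpy as np

# Hypothetical sample data, used only for illustration
prices = [10.0, 12.5, 8.0, 15.0]

# Convert the Python list into a NumPy array
price_array = np.array(prices)

# Multiplying a list repeats it; multiplying an array scales each element
doubled_list = prices * 2
doubled_array = price_array * 2

# Min-max normalization: rescale the values to the range [0, 1]
value_range = price_array.max() - price_array.min()
normalized = (price_array - price_array.min()) / value_range

print(type(price_array), price_array.dtype)
print(len(doubled_list))   # the list now has twice as many items
print(doubled_array)       # the array has the same length, doubled values
print(normalized)
```

Note that the same `* 2` expression means repetition for a list and arithmetic for an array, which is why numeric work is usually done on arrays.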

Advanced Mathematical Operations

Generating numerical arrays is a fairly common task. So far, we have been doing this by creating a Python list object and then converting that into a NumPy array. However, we can bypass that and work directly with native NumPy functions. The arange function creates a series of numbers from a lower bound up to, but not including, an upper bound, using the step size you specify. Another function, linspace, creates a fixed number of evenly spaced points between two extremes.
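
As a quick sketch of these two functions (the bounds, step, and point count below are arbitrary values picked for illustration):

```python
import numpy as np

# arange: start at 5, stop before 16, step by 2
stepped = np.arange(5, 16, 2)

# linspace: 5 evenly spaced points from 0 to 1, both endpoints included
spaced = np.linspace(0, 1, 5)

print(stepped)
print(spaced)
```

With arange you control the spacing and the number of points follows from it; with linspace you control the number of points and the spacing follows from it.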

In the next exercise, we are going to create a list and then convert that into a NumPy array. We will then show you how to perform some advanced mathematical operations on that array.

Exercise 3.04: Advanced Mathematical Operations on NumPy Arrays

In this exercise, we'll practice using all the built-in mathematical functions of the NumPy library. Here, we are going to be creating a list and converting it into a NumPy array. Then, we will perform some advanced mathematical...
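
The following is only a rough sketch of the kind of operations involved, not the exercise itself. The input list and variable names are hypothetical; `np.sqrt`, `np.log10`, `np.exp`, and `np.power` are standard NumPy functions that are applied element by element.

```python
import numpy as np

# Hypothetical input list, converted into a NumPy array
values = np.array([1, 4, 9, 16])

roots = np.sqrt(values)        # element-wise square root
logs = np.log10(values)        # element-wise base-10 logarithm
exps = np.exp(values / 16)     # element-wise exponential of scaled values
cubes = np.power(values, 3)    # element-wise cube

print(roots)
print(logs)
print(exps)
print(cubes)
```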

Statistics and Visualization with NumPy and Pandas

One of the great advantages of using libraries such as NumPy and pandas is that a plethora of built-in statistical and visualization methods are available, so we don't have to search for or write new code. Furthermore, most of these subroutines are written in C or Fortran (and pre-compiled), making them extremely fast to execute.

Refresher on Basic Descriptive Statistics

For any data wrangling task, it is quite useful to extract basic descriptive statistics, such as the mean, median, and mode, that summarize the data, and to create some simple visualizations or plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process from which the data might have been generated. You...

The Definition of Statistical Measures – Central Tendency and Spread

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. Such measures are also categorized as summary statistics:

  • Mean: The mean is the sum of all values divided by the total number of values.
  • Median: The median is the middle value; it splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above and below it.
  • Mode: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.

Generally, the mean is a better measure to use for symmetric data while the median is a better measure for data with a skewed (left- or right-heavy) distribution. For categorical data, you have to use the mode:

Figure 3.22: A screenshot of a...
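
The three measures defined above can be computed in a few lines. This is a small sketch on made-up data: the `data` and `colors` samples are hypothetical, and the single large value is included deliberately to show how a skewed sample pulls the mean away from the median.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: one large value pulls the mean upward
data = np.array([2, 3, 3, 4, 5, 6, 40])

mean_value = data.mean()                 # sum of values / number of values
median_value = np.median(data)           # middle value of the sorted data
mode_value = pd.Series(data).mode()[0]   # most frequent value

# For categorical data, only the mode is meaningful
colors = pd.Series(["red", "blue", "red", "green"])
color_mode = colors.mode()[0]

print(mean_value, median_value, mode_value, color_mode)
```

Here the mean is 9 while the median is 4, which illustrates why the median is usually the better summary for skewed data.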

Data Wrangling in Statistics and Visualization

A good data wrangling professional is expected to encounter a dizzying array of diverse data sources each day. As we explained previously, because a multitude of complex sub-processes and mutual interactions give rise to such data, it all falls into the category of discrete or continuous random variables.

It would be extremely difficult and confusing for a data wrangler or a data science team if all of this data continued to be treated as completely random without any shape or pattern. A formal statistical basis must be given to such random data streams, and one of the simplest ways to start that process is to measure their descriptive statistics.
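
As a minimal sketch of that first step, the snippet below simulates one continuous and one discrete random variable and summarizes them with the pandas `describe` method. The column names, distribution parameters, sample size, and seed are all assumptions made for illustration, standing in for real data streams.

```python
import numpy as np
import pandas as pd

# Seeded generator so the simulated data is reproducible
rng = np.random.default_rng(seed=42)

# Hypothetical data streams: one continuous, one discrete
df = pd.DataFrame({
    "continuous": rng.normal(loc=50, scale=10, size=1000),
    "discrete": rng.integers(low=1, high=7, size=1000),  # values 1 to 6
})

# count, mean, std, min, quartiles, and max for each column
summary = df.describe()
print(summary)
```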

Assigning a stream of data to a particular distribution function (or a combination of many distributions) is actually part of inferential statistics. However, inferential statistics starts only when descriptive statistics is done alongside measuring all the important parameters of...

Summary

In this chapter, we started with the basics of NumPy arrays, including how to create them and their essential properties. We discussed and showed how a NumPy array is optimized for vectorized element-wise operations and differs from a regular Python list. Then, we moved on to practicing various operations on NumPy arrays such as indexing, slicing, filtering, and reshaping. We also covered special one-dimensional and two-dimensional arrays, such as zeros, ones, identity matrices, and random arrays.
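
As a compact reminder of those array operations, here is a short sketch; the array contents and variable names are arbitrary and not taken from the chapter's exercises.

```python
import numpy as np

arr = np.arange(1, 13)            # the integers 1 through 12

last = arr[-1]                    # indexing: the final element
first_three = arr[:3]             # slicing: the first three elements
evens = arr[arr % 2 == 0]         # filtering with a Boolean mask
matrix = arr.reshape(3, 4)        # reshaping into 3 rows and 4 columns

zeros = np.zeros((2, 3))          # 2x3 array of zeros
ones = np.ones(4)                 # one-dimensional array of ones
identity = np.eye(3)              # 3x3 identity matrix
random_values = np.random.default_rng(0).random((2, 2))  # uniform in [0, 1)

print(matrix)
print(evens)
```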

In the second major topic of this chapter, we started with pandas series objects and quickly moved on to a critically important object – pandas DataFrames. They are analogous to an Excel sheet, a MATLAB table, or a database table, but with many useful properties for data wrangling. We demonstrated some basic operations on DataFrames, such as indexing, subsetting, and adding and deleting rows and columns.
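
The basic DataFrame operations can be recapped in a short sketch. The table contents and column names here are hypothetical, and `pd.concat` and `drop` are just one common way to add and delete rows and columns.

```python
import pandas as pd

# Hypothetical table, used only for illustration
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]})

ages = df["age"]                          # indexing: select one column
over_30 = df[df["age"] > 30]              # subsetting: keep matching rows

df["age_next_year"] = df["age"] + 1       # add a column

# Add a row by concatenating a one-row DataFrame
new_row = pd.DataFrame({"name": ["Di"], "age": [52], "age_next_year": [53]})
df = pd.concat([df, new_row], ignore_index=True)

df = df.drop(columns=["age_next_year"])        # delete a column
df = df.drop(index=0).reset_index(drop=True)   # delete the first row

print(df)
```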

Next, we covered the basics of plotting with matplotlib, the most widely used and...
