You're reading from The Data Wrangling Workshop - Second Edition

Product type: Book

Published in: Jul 2020

Reading Level: Intermediate

Publisher: Packt

ISBN-13: 9781839215001

Edition: 2nd Edition

Languages: Python

Tools: Jupyter

Concepts: Data Processing

Authors (3):

Brian Lipp

Shubhadeep Roychowdhury

Dr. Tirthajyoti Sarkar


3. Introduction to NumPy, Pandas, and Matplotlib

Overview

In this chapter, you will learn about the fundamentals of the NumPy, pandas, and matplotlib libraries. You will learn to create one-dimensional and multi-dimensional arrays and to manipulate pandas DataFrames and series objects. By the end of this chapter, you will be able to visualize and plot numerical data using the matplotlib library, and to apply matplotlib, NumPy, and pandas to calculate descriptive statistics from a DataFrame or matrix.

Introduction

In the preceding chapters, we covered some advanced data structures, such as stacks, queues, and iterators, as well as file operations in Python. In this chapter, we will cover three essential libraries, namely NumPy, pandas, and matplotlib. NumPy is an advanced math library in Python with an extensive range of functionality. pandas is a library built on NumPy that allows developers to model data in a table structure similar to a database. matplotlib, on the other hand, is a charting library that is influenced by MATLAB. With these libraries, you will be able to handle most data wrangling tasks.

NumPy Arrays

A NumPy array is similar to a list but differs in some ways. In the life of a data scientist, reading and manipulating an array is of prime importance, and it is also the most frequently encountered task. These arrays could be a one-dimensional list, a multi-dimensional table, or a matrix full of numbers and can be used for a variety of mathematical calculations.
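As a minimal sketch of how an array differs from a list, the snippet below multiplies both by 2 and builds a small two-dimensional array. The variable names and sample values are illustrative assumptions and do not come from the book's exercises.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]       # a plain Python list (illustrative values)
prices_arr = np.array(prices)     # the same values as a NumPy array

# Multiplying a list repeats it; multiplying an array scales every element.
repeated = prices * 2
doubled = prices_arr * 2

# A two-dimensional array, that is, a small table or matrix of numbers.
table = np.array([[1, 2, 3], [4, 5, 6]])

print(repeated)       # [10.0, 20.0, 30.0, 10.0, 20.0, 30.0]
print(doubled)        # [20. 40. 60.]
print(table.shape)    # (2, 3)
print(table.sum())    # 21
```

The element-wise behavior of `prices_arr * 2` is what makes arrays convenient for mathematical calculations.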

An array could be filled with integers, floating-point numbers, Booleans, strings, or even mixed types. However, in the majority of cases, numeric data types are predominant. Some example scenarios where you will need to handle numeric arrays are as follows:

To read a list of phone numbers and postal codes and extract a certain pattern
To create a matrix with random numbers to run a Monte Carlo simulation on a statistical process
To scale and normalize a sales figure table, with lots of financial and transactional data
To create a smaller table of key descriptive statistics (for example...

Advanced Mathematical Operations

Generating numerical arrays is a fairly common task. So far, we have been doing this by creating a Python list object and then converting that into a NumPy array. However, we can bypass that and work directly with native NumPy methods. The arange function creates a series of numbers that starts at the lower bound, stops before the upper bound, and advances by the step size you specify. Another function, linspace, creates a fixed number of evenly spaced points between two extremes.
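The two functions can be compared side by side in a short sketch. The bounds, step size, and point count below are arbitrary choices for illustration.

```python
import numpy as np

# arange: start at 0, stop before 10, advance in steps of 2.
evens = np.arange(0, 10, 2)

# linspace: exactly 5 evenly spaced points from 0.0 to 1.0, both ends included.
points = np.linspace(0.0, 1.0, 5)

print(evens)     # [0 2 4 6 8]
print(points)    # [0.   0.25 0.5  0.75 1.  ]
```

Note that arange takes a step size and excludes the upper bound, whereas linspace takes a number of points and, by default, includes both bounds.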

In the next exercise, we are going to create a list and then convert that into a NumPy array. We will then show you how to perform some advanced mathematical operations on that array.

Exercise 3.04: Advanced Mathematical Operations on NumPy Arrays

In this exercise, we'll practice using all the built-in mathematical functions of the NumPy library. Here, we are going to be creating a list and converting it into a NumPy array. Then, we will perform some advanced mathematical...

Statistics and Visualization with NumPy and Pandas

One of the great advantages of using libraries such as NumPy and pandas is that a plethora of built-in statistical and visualization methods are available, so we don't have to search for or write new code. Furthermore, most of these subroutines are written in C or Fortran (and pre-compiled), making them extremely fast to execute.

Refresher on Basic Descriptive Statistics

For any data wrangling task, it is quite useful to extract basic descriptive statistics, such as the mean, median, and mode, and to create some simple visualizations or plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process that the data might have been generated from. You...

The Definition of Statistical Measures – Central Tendency and Spread

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. Such measures are also categorized as summary statistics:

Mean: The mean is the sum of all values divided by the total number of values.
Median: The median is the middle value; it splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above and below it. If the dataset has an even number of values, the median is the mean of the two middle values.
Mode: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.
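The three definitions above can be sketched on a small sample. The data is made up for illustration, and using `collections.Counter` for the mode is one possible approach, not necessarily the one the book uses.

```python
import numpy as np
from collections import Counter

values = np.array([2, 3, 3, 5, 7, 10])    # illustrative sample, already sorted

# Mean: sum of all values divided by the number of values (30 / 6).
mean = values.mean()

# Median: with six values, the average of the two middle ones (3 and 5).
median = np.median(values)

# Mode: the most frequent value; 3 is the only value that appears twice.
mode = Counter(values.tolist()).most_common(1)[0][0]

print(mean)      # 5.0
print(median)    # 4.0
print(mode)      # 3
```

If several values tie for the highest count, `most_common(1)` returns only one of them, so a dataset with multiple modes needs extra handling.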

Generally, the mean is a better measure to use for symmetric data while the median is a better measure for data with a skewed (left- or right-heavy) distribution. For categorical data, you have to use the mode:

Figure 3.22: A screenshot of a...

Data Wrangling in Statistics and Visualization

A good data wrangling professional is expected to encounter a dizzying array of diverse data sources each day. As we explained previously, due to a multitude of complex sub-processes and mutual interactions that give rise to such data, they all fall into the category of discrete or continuous random variables.

It would be extremely difficult and confusing for a data wrangler or a data science team if all of this data continued to be treated as completely random without any shape or pattern. A formal statistical basis must be given to such random data streams, and one of the simplest ways to start that process is to measure their descriptive statistics.
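As a rough sketch of that first step, pandas can summarize every numeric column of a DataFrame in one call. The two columns below are synthetic stand-ins for a continuous and a discrete random variable; their names, distributions, and the seed are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)    # fixed seed so the run is repeatable

# Synthetic data: one continuous and one discrete random variable.
df = pd.DataFrame({
    "continuous": rng.normal(loc=50.0, scale=5.0, size=1000),
    "discrete": rng.integers(low=1, high=7, size=1000),   # integers 1 to 6
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = df.describe()
print(summary)
```

The resulting table gives a quick view of the center and spread of each column before any distribution is fitted.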

Assigning a stream of data to a particular distribution function (or a combination of many distributions) is actually part of inferential statistics. However, inferential statistics starts only when descriptive statistics is done alongside measuring all the important parameters of...

Summary

In this chapter, we started with the basics of NumPy arrays, including how to create them and their essential properties. We discussed and showed how a NumPy array is optimized for vectorized element-wise operations and differs from a regular Python list. Then, we moved on to practicing various operations on NumPy arrays such as indexing, slicing, filtering, and reshaping. We also covered special one-dimensional and two-dimensional arrays, such as zeros, ones, identity matrices, and random arrays.
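As a compact recap, the operations named above might look like the following sketch. The array contents, shapes, and seed are illustrative choices rather than the chapter's own examples.

```python
import numpy as np

arr = np.arange(1, 13)          # the integers 1 through 12

third = arr[2]                  # indexing: the third element
first_four = arr[:4]            # slicing: the first four elements
evens = arr[arr % 2 == 0]       # filtering with a Boolean mask
grid = arr.reshape(3, 4)        # reshaping into 3 rows and 4 columns

zeros = np.zeros((2, 3))        # 2x3 array of zeros
ones = np.ones(4)               # one-dimensional array of four ones
identity = np.eye(3)            # 3x3 identity matrix

rng = np.random.default_rng(seed=42)
noise = rng.random((2, 2))      # 2x2 array of uniform values in [0, 1)

print(third)         # 3
print(first_four)    # [1 2 3 4]
print(evens)         # [ 2  4  6  8 10 12]
print(grid.shape)    # (3, 4)
```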

In the second major topic of this chapter, we started with pandas series objects and quickly moved on to a critically important object – pandas DataFrames. They are analogous to an Excel sheet, a MATLAB table, or a database table, but with many useful properties for data wrangling. We demonstrated some basic operations on DataFrames, such as indexing, sub-setting, row and column addition, and deletion.
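The basic DataFrame operations listed above can be sketched as follows. The column names, index labels, and values are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# A small DataFrame with hypothetical data and string row labels.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]},
    index=["a", "b", "c"],
)

ages = df["age"]                          # indexing a column gives a Series
row_b = df.loc["b"]                       # label-based row lookup
over_30 = df[df["age"] > 30]              # sub-setting with a Boolean condition

df["age_next_year"] = df["age"] + 1       # column addition
df.loc["d"] = ["Dee", 52, 53]             # row addition under a new label

df = df.drop(columns=["age_next_year"])   # column deletion
df = df.drop(index="a")                   # row deletion

print(over_30)
print(df)
```

Here `drop` returns a new DataFrame rather than modifying the original, which is why the result is assigned back to `df`.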

Next, we covered the basics of plotting with matplotlib, the most widely used and...


Authors (3)

Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.

Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.

Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.
