Mastering pandas. - Second Edition


Product type: Book
Published in: Oct 2019
ISBN-13: 9781789343236
Pages: 674
Edition: 2nd Edition
Author: Ashish Kumar

Table of Contents (21 chapters)

Preface
Section 1: Overview of Data Analysis and pandas
  • Introduction to pandas and Data Analysis
  • Installation of pandas and Supporting Software
Section 2: Data Structures and I/O in pandas
  • Using NumPy and Data Structures with pandas
  • I/Os of Different Data Formats with pandas
Section 3: Mastering Different Data Operations in pandas
  • Indexing and Selecting in pandas
  • Grouping, Merging, and Reshaping Data in pandas
  • Special Data Operations in pandas
  • Time Series and Plotting Using Matplotlib
Section 4: Going a Step Beyond with pandas
  • Making Powerful Reports In Jupyter Using pandas
  • A Tour of Statistics with pandas and NumPy
  • A Brief Tour of Bayesian Statistics and Maximum Likelihood Estimates
  • Data Case Studies Using pandas
  • The pandas Library Architecture
  • pandas Compared with Other Tools
  • A Brief Tour of Machine Learning
Other Books You May Enjoy

Using NumPy and Data Structures with pandas

This chapter is one of the most important in this book. We now begin to dive into the nitty-gritty of pandas. We start with a tour of NumPy ndarrays, a data structure that belongs to NumPy rather than pandas. Knowledge of NumPy ndarrays is useful because they are the building blocks on which pandas DataFrames are built. One key benefit of NumPy arrays is that they support vectorized operations: operations applied to a whole array at once, which replace explicit Python loops over the elements and are much faster.
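To make the idea of a vectorized operation concrete, here is a minimal sketch that squares every element of an array twice: once with an explicit Python loop and once with a single NumPy expression. The array size and the variable names are chosen purely for illustration.

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

# Explicit Python loop: one interpreted iteration per element.
squared_loop = [v * v for v in values]

# Vectorized form: one expression, with the loop running in compiled code.
squared_vec = values * values

# Both approaches produce the same numbers.
print(np.allclose(squared_loop, squared_vec))
```

Timing the two versions (for example with %timeit in Jupyter) is a simple way to see the speed difference on your own machine.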

In this chapter, I will present the material via numerous examples using Jupyter.

The topics we will cover in this chapter include a tour of the numpy.ndarray data structure, the pandas.Series one-dimensional (1D) pandas data structure, the pandas.DataFrame two-dimensional (2D) pandas tabular data structure, and the pandas.Panel three-dimensional...

NumPy ndarrays

Arrays are vital objects in data analysis. They allow structured handling of elements arranged in rows and columns. All elements of an array must be of the same data type. For example, the medical records of five patients can be presented as an array as follows:

               Blood glucose level   Heart rate   Cholesterol level
Peter Parker   100                   65           160
Bruce Wayne    150                   82           200
Tony Stark     90                    55           80
Barry Allen    130                   73           220
Steve Rogers   190                   80           150

All 15 elements are of data type int. Arrays can also be composed of strings, floats, or complex numbers. Arrays can be constructed from lists, a widely used and versatile data structure in Python:

array_list = [[100, 65, 160],
[150, 82, 200],
[90, 55, 80],
[130, 73, 220],
[190, 80, 150]...
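Since the listing above is cut off, the following is a minimal sketch of one way to finish the idea, assuming the nested list is passed to np.array. The values come from the patient table; the name records is hypothetical.

```python
import numpy as np

# Rows are patients; columns are blood glucose, heart rate, cholesterol.
array_list = [[100, 65, 160],
              [150, 82, 200],
              [90, 55, 80],
              [130, 73, 220],
              [190, 80, 150]]

records = np.array(array_list)  # hypothetical name for the resulting ndarray

print(records.shape)            # (5, 3): five patients, three measurements
print(records.dtype)            # an integer dtype (exact width is platform-dependent)
print(records[:, 1].mean())     # mean of the heart-rate column: 71.0
```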

Implementing neural networks with NumPy

While NumPy is definitely not the go-to package for training a neural network in real-time scenarios, implementing one in NumPy brings out the flexibility and power of NumPy for complex matrix computations and also provides a better understanding of neural networks.

First, let's synthetically generate a dataset for a binary classification problem that will be used for training the neural network. The data will be from two different Gaussian distributions, and the model will be trained to classify this data into either of the two categories. Let's generate the data with 1,000 samples in each category:

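import numpy as np

N = 1000  # number of samples per category
# Two 2D Gaussian clouds, shifted to (0.9, 0.9) and (-0.9, -0.9)
X1 = np.random.randn(N, 2) + np.array([0.9, 0.9])
X2 = np.random.randn(N, 2) + np.array([-0.9, -0.9])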

Now we have two 1000 x 2 arrays. For the target variable, we can use the zeros and ones functions to create...
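The text is cut off here, so the following is only a sketch of how the zeros and ones functions could be used to label the two categories. Which category receives 0 and which receives 1 is an assumption, as are the names y1, y2, X, and y.

```python
import numpy as np

N = 1000
X1 = np.random.randn(N, 2) + np.array([0.9, 0.9])
X2 = np.random.randn(N, 2) + np.array([-0.9, -0.9])

# Assumed labelling: 0 for samples in X1, 1 for samples in X2.
y1 = np.zeros(N)
y2 = np.ones(N)

# Stack features and labels so that row i of X matches element i of y.
X = np.vstack((X1, X2))        # shape (2000, 2)
y = np.concatenate((y1, y2))   # shape (2000,)

print(X.shape, y.shape)
```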

Practical applications of multidimensional arrays

Panel data (spreadsheet-like data with several distinguishable rows and columns; the kind of data we generally encounter) is best handled by the DataFrame data structure available in pandas and R. Arrays can be used too, but doing so would be tedious.

So what is a good example of data in real life that can be best represented by an array? Images, which are generally represented as multidimensional arrays of pixels, are a good example. In this section, we will see examples of multidimensional representation of an image and why it makes sense.
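As a small illustration of an image as a multidimensional array, the sketch below builds a tiny synthetic RGB image rather than loading a real file. The 4 x 6 size and the colours are arbitrary choices made for this example.

```python
import numpy as np

# A synthetic image: 4 pixels high, 6 pixels wide, 3 colour channels (R, G, B).
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :3, 0] = 255   # left half: pure red
image[:, 3:, 2] = 255   # right half: pure blue

print(image.shape)      # (4, 6, 3)
print(image[0, 0])      # the top-left pixel: [255   0   0]

# Averaging over the channel axis gives a crude greyscale version.
gray = image.mean(axis=2)
print(gray.shape)       # (4, 6)
```

Each pixel is addressed by row, column, and channel, which is why a 3D array is a natural fit for colour images.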

Any object detection or image-processing algorithm applied to an image requires the image to be represented as a numerical array. For text data, a term-document matrix or term frequency-inverse document frequency (TF-IDF) weighting is used to vectorize the data (that is, to convert it into numerical arrays). In the case of an...

Data structures in pandas

The pandas package was created by Wes McKinney in 2008 as a result of frustrations he encountered while working on time series data in R. It is built on top of NumPy and provides features not available in it. It provides fast, easy-to-understand data structures and helps fill the gap between Python and a language like R. NumPy deals with homogeneous blocks of data. Using pandas helps to deal with data in a tabular structure composed of different data types.
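The contrast between homogeneous NumPy blocks and mixed-type tabular data can be sketched as follows. The column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# NumPy stores a single dtype: mixing ints and floats upcasts everything to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({
    "name": ["Peter Parker", "Bruce Wayne"],
    "heart_rate": [65, 82],
    "glucose": [100.0, 150.0],
})
print(df.dtypes)
```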

The official documentation for pandas can be found at http://pandas.pydata.org/pandas-docs/stable/dsintro.html.

There are three main data structures in pandas:

  • Series: 1D
  • DataFrame: 2D
  • Panel: 3D (deprecated in pandas 0.20 and removed in pandas 0.25)
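A minimal sketch of the first two structures is shown here, using a few values from the patient table earlier in the chapter. The variable names are hypothetical. Panel is left out because it was removed from pandas in version 0.25.

```python
import pandas as pd

patients = ["Peter Parker", "Bruce Wayne", "Tony Stark"]

# Series: a 1D labelled array.
heart_rate = pd.Series([65, 82, 55], index=patients, name="heart_rate")
print(heart_rate.ndim)             # 1
print(heart_rate["Bruce Wayne"])   # 82

# DataFrame: a 2D labelled table whose columns share one index.
records = pd.DataFrame(
    {"glucose": [100, 150, 90], "heart_rate": [65, 82, 55]},
    index=patients,
)
print(records.ndim)                        # 2
print(records.loc["Tony Stark", "glucose"])  # 90
```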

Series

A Series is really a 1D...

Summary

This chapter was a quick tour of the power of NumPy and showed a glimpse of how it makes life easier while working with pandas. Some of the highlights from the chapter were as follows:

  • A NumPy array is a versatile data structure used for containing multidimensional homogeneous data.

  • There are a variety of methods available for slicing/dicing, creating, and manipulating an array in the NumPy package.

  • NumPy arrays have practical applications such as being the building blocks of linear algebra operations and a tool to manipulate multidimensional array data such as images and audio.

  • Arrays (or matrices) are the computational blocks used in advanced mathematical models such as neural networks.

  • NumPy arrays are the precursors of some of the essential data structures in pandas, namely Series.

  • Series are very similar to arrays. Series are one-dimensional. A...
