Mastering Numerical Computing with NumPy

By Umit Mert Cakmak , Mert Cuhadaroglu

About this book

NumPy is one of the most important scientific computing libraries available for Python. Mastering Numerical Computing with NumPy teaches you how to achieve expert-level competency in performing complex operations, with in-depth coverage of advanced concepts.

Beginning with NumPy's arrays and functions, you will familiarize yourself with linear algebra concepts to perform vector and matrix math operations. You will thoroughly understand and practice data processing, exploratory data analysis (EDA), and predictive modeling. You will then move on to working on practical examples which will teach you how to use NumPy statistics in order to explore US housing data and develop a predictive model using simple and multiple linear regression techniques. Once you have got to grips with the basics, you will explore unsupervised learning and clustering algorithms, followed by understanding how to write better NumPy code while keeping advanced considerations in mind. The book also demonstrates the use of different high-performance numerical computing libraries and their relationship with NumPy. You will study how to benchmark the performance of different configurations and choose the best for your system.

By the end of this book, you will have become an expert in handling and performing complex data manipulations.

Publication date:
June 2018
Publisher
Packt
Pages
248
ISBN
9781788993357

 

Chapter 1. Working with NumPy Arrays

Scientific computing is a multidisciplinary field, with its applications spanning across disciplines such as numerical analysis, computational finance, and bioinformatics.

Let's consider the case of financial markets: a huge, interconnected web of interactions. Governments, banks, investment funds, insurance companies, pensions, individual investors, and others are all involved in this exchange of financial instruments. You can't simply model all the interactions between market participants, because everyone involved in financial transactions has different motives and different risk/return objectives, and many other factors also affect the prices of financial assets. Even modeling a single asset price requires a tremendous amount of work, and success is not guaranteed. In mathematical terms, the problem has no closed-form solution, which makes it a great case for scientific computing, where you can use advanced computational techniques to attack such problems.

By writing computer programs, you gain the power to better understand the system you are working on. Usually, the program you write will be some sort of simulation, such as a Monte Carlo simulation, with which you can model the price of option contracts. Pricing financial assets is good material for simulation, simply because of the complexity of financial markets. All of these mathematical computations need a powerful, scalable, and convenient structure for your data (which is mostly in matrix form). In other words, you need a more compact structure than a list in order to simplify your task. NumPy is a perfect candidate for performant vector and matrix operations, and its extensive library of mathematical functions makes numerical computing easy and efficient.
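To make this concrete, here is a minimal, hedged sketch of a Monte Carlo pricer for a European call option under geometric Brownian motion. All parameter values (spot, strike, rate, volatility, maturity) are hypothetical illustration values, not taken from the text:

```python
import numpy as np

# A minimal Monte Carlo sketch for pricing a European call option under
# geometric Brownian motion. All parameters are hypothetical values
# chosen for illustration only.
np.random.seed(0)
S0, K, r, sigma, T = 100.0, 105.0, 0.05, 0.2, 1.0  # spot, strike, rate, vol, maturity
n_paths = 100_000

# Simulate the terminal prices of all paths in one vectorized step.
Z = np.random.standard_normal(n_paths)
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * Z)

# The discounted average payoff approximates the option price
# (the Black-Scholes value for these inputs is about 8.02).
price = np.exp(-r * T) * np.mean(np.maximum(ST - K, 0.0))
print(round(price, 2))
```

Note how the entire simulation is expressed as array operations, without a single Python loop over the paths.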

In this chapter, we will cover the following topics: 

  • The importance of NumPy
  • Theoretical and practical information about vectors and matrices
  • NumPy array operations and their usage in multidimensional arrays

The question is, where should we start practicing coding skills? In this book, you will be using Python because of its huge adoption in the scientific community, and you will mainly work with a specific library called NumPy, which stands for Numerical Python.

 

Technical requirements


In this book, we will use Jupyter Notebooks, editing and running Python code via a web browser. It's an open source platform that you can install by following the instructions at this link: http://jupyter.org/install.

This book will be using Python 3.x, so when you open a new notebook, you should pick the Python 3 kernel. Alternatively, you can install Jupyter Notebook using Anaconda (Python version 3.6), which is highly recommended. You can install it by following the instructions at this link: https://www.anaconda.com/download/.

 

Why do we need NumPy?


Python has become a rockstar programming language recently, not only because of its friendly syntax and readability, but also because it can be used for a wide variety of purposes. Python's ecosystem of libraries makes many computations relatively easy for programmers. Stack Overflow is one of the most popular websites for programmers, where users tag their questions with the programming language they relate to. The following figure, produced by counting these tags, plots the popularity of major programming languages over the years. The research conducted by Stack Overflow can be explored further on their official blog: https://stackoverflow.blog/2017/09/06/incredible-growth-python/:

Growth of major programming languages

NumPy is the most fundamental package for scientific computing in Python and is the base for many other packages. Since Python was not initially designed for numerical computing, this need arose in the late 1990s, when Python started to become popular among engineers and programmers who needed faster vector operations. As you can see from the following figure, many popular machine learning and computational packages use some of NumPy's features, and, most importantly, they use NumPy arrays heavily in their methods, which makes NumPy an essential library for scientific projects.

The figure shows some well-known libraries which use NumPy features: 

NumPy stack

For numerical computing, you mainly work with vectors and matrices, and you can manipulate them in different ways using a range of mathematical functions. NumPy is a perfect fit for these situations, since it allows its users to complete their computations efficiently. Even though Python lists are very easy to create and manipulate, they don't support vectorized operations: a list doesn't have fixed-type elements, so, for example, a for loop over a list is not very efficient, because the data type has to be checked at every iteration. In NumPy arrays, the data type is fixed, and vectorized operations are supported. NumPy is not only more efficient than Python lists for multidimensional array operations; it also provides many mathematical methods that you can apply as soon as it's imported. NumPy is a core library of the scientific Python data science stack.
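The difference is easy to measure. The sketch below (timings will vary by machine) squares a million numbers with a Python list comprehension and with a vectorized NumPy expression:

```python
import timeit

import numpy as np

# A Python loop must check each element's type on every iteration;
# NumPy applies the operation to fixed-type data in compiled code.
py_list = list(range(1_000_000))
np_arr = np.arange(1_000_000)

loop_time = timeit.timeit(lambda: [v * v for v in py_list], number=10)
vec_time = timeit.timeit(lambda: np_arr * np_arr, number=10)

print(f"list: {loop_time:.3f}s  numpy: {vec_time:.3f}s")
```

On a typical machine, the vectorized version is one to two orders of magnitude faster.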

SciPy has a strong relationship with NumPy, as it uses NumPy's multidimensional arrays as the base data structure for its scientific functions for linear algebra, optimization, interpolation, integration, FFT, signal and image processing, and more. SciPy was built on top of the NumPy array framework and uplifted scientific programming with its advanced mathematical functions; therefore, some parts of the NumPy API have been moved to SciPy. This relationship makes SciPy more convenient than NumPy alone for advanced scientific computing in many cases.

To sum this up, we can summarize NumPy's advantages as follows:

  • It's open source and free of charge
  • It's accessed through Python, a high-level language with user-friendly syntax
  • It's more efficient than Python lists
  • It has more advanced built-in functions and is well integrated with other libraries
 

Who uses NumPy?


In both academic and business circles, you will hear people talking about the tools and technologies they use in their work. Depending on the environment and conditions, you might need to work with specific technologies. For example, if your company has already invested in SAS, you will need to carry out your project in the SAS development environment suited to your problem.

However, one of the advantages of NumPy is that it's open source, and it costs nothing for you to utilize it in your project. If you have already coded in Python, it's super easy to learn. If performance is your concern, you can easily embed C or Fortran code. Moreover, it will introduce you to a whole other set of libraries such as SciPy and Scikit-learn, which you can use to solve almost any problem.

Since data mining and predictive analytics have become so important recently, roles like Data Scientist and Data Analyst are cited as the hottest jobs of the 21st century in business journals such as Forbes and Bloomberg. People who need to work with data and do analysis, modeling, or forecasting should become familiar with NumPy and its capabilities, as it will help them quickly prototype and test their ideas. If you are a working professional, your firm most probably wants to use data analysis methods to stay one step ahead of its competitors. The better a firm understands its data, the better it understands its business, and that leads to better decisions. NumPy plays a critical role here, as it is capable of performing a wide range of operations and making your projects time-efficient.

 

Introduction to vectors and matrices


A matrix is a group of numbers or elements arranged as a rectangular array. For an n x m matrix, n represents the number of rows and m represents the number of columns, and the entries are usually indexed by letters with subscripts. If we have a hypothetical n x m matrix, it will be structured as follows:

If n = m, then it is called a square matrix: 

A vector is simply a matrix with a single row or a single column, that is, a 1-by-m or n-by-1 matrix. You can interpret a vector as an arrow, or a direction, in m-dimensional space. Generally, a capital letter denotes a matrix, such as X, and lowercase letters with subscripts, such as x11, denote elements of the matrix X.

In addition, there are some important special matrices: the zero matrix (null matrix) and the identity matrix. 0 denotes the zero matrix, a matrix of all 0s (MacDuffee 1943, p. 27). For a zero matrix, subscripts indicating its size are optional:

The identity matrix, denoted by I, has diagonal elements equal to 1 while all the others are 0:

When you multiply a matrix X with the identity matrix, the result will be equal to X:

An identity matrix is very useful for calculating the inverse of a matrix. When you multiply any given matrix with its inverse, the result will be an identity matrix:
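As a quick sketch of both facts in NumPy (using np.identity and np.linalg.inv on a small invertible matrix chosen for illustration):

```python
import numpy as np

X = np.array([[4., 7.],
              [2., 6.]])
I = np.identity(2)

# Multiplying by the identity leaves X unchanged.
print(np.array_equal(X.dot(I), X))       # True

# Multiplying X by its inverse yields the identity (up to float rounding).
X_inv = np.linalg.inv(X)
print(np.allclose(X.dot(X_inv), I))      # True
```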

Let's briefly see the matrix algebra on NumPy arrays. Addition and subtraction operations for matrices are similar to math equations with ordinary single numbers. As an example:

Scalar multiplication is also pretty straightforward. As an example, if you multiply your matrix X by 4, the only thing that you should do is multiply each element with the value 4 as follows:
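In NumPy, these rules translate directly into element-wise operators; a small sketch with illustrative values:

```python
import numpy as np

# Addition, subtraction, and scalar multiplication are all element-wise.
X = np.array([[1, 2], [3, 4]])
Y = np.array([[5, 6], [7, 8]])

print(X + Y)   # [[ 6  8] [10 12]]
print(X - Y)   # [[-4 -4] [-4 -4]]
print(4 * X)   # [[ 4  8] [12 16]]
```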

The part of matrix manipulation that seems complicated at first is matrix multiplication.

Imagine you have two matrices, X and Y, where X is an n x m matrix and Y is an m x p matrix (the number of columns of X must equal the number of rows of Y):

 

The product of these two matrices will be as follows:

 So each element of the product matrix is calculated as follows:
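In symbols (reconstructing the formula the original figure showed, with X an n x m matrix and Y an m x p matrix), each element of Z = XY is:

```latex
Z_{ij} = \sum_{k=1}^{m} X_{ik}\, Y_{kj},
\qquad i = 1, \dots, n, \quad j = 1, \dots, p
```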

Don't worry if you didn't understand the notation. The following example will make things clearer. You have matrices X and Y and the goal is to get the matrix product of these matrices:

The basic idea is that the product of the ith row of X and the jth column of Y becomes the (i, j)th element of the resulting matrix. Multiplication starts with the first row of X and the first column of Y, and their product becomes Z[1,1]:

You can cross-check the results easily with the following four lines of code:

In [1]: import numpy as np
        x = np.array([[1,0,4],[3,3,1]])
        y = np.array([[2,5],[1,1],[3,2]])
        x.dot(y)
Out[1]: array([[14, 13],[12, 20]])

The previous code block is just a demonstration of how easy it is to calculate the dot product of two matrices using NumPy. In later chapters, we will go deeper into matrix operations and linear algebra.

 

Basics of NumPy array objects


As mentioned in the preceding section, what makes NumPy special is its use of multidimensional arrays, called ndarrays. All items of an ndarray are homogeneous and have the same size in memory. Let's start by importing NumPy and analyzing the structure of a NumPy array object. You can import the library by typing the following statement into your console; you can use any alias instead of np, but np will be used in this book, as it's the standard convention. Let's create a simple array and examine the metadata that Python holds for it behind the scenes, its so-called attributes:

In [2]: import numpy as np
        x = np.array([[1,2,3],[4,5,6]])
        x
Out[2]: array([[1, 2, 3],[4, 5, 6]])
In [3]: print("We just created a", type(x))
Out[3]: We just created a <class 'numpy.ndarray'>
In [4]: print("Our array has shape", x.shape)
Out[4]: Our array has shape (2, 3)
In [5]: print("Total size is",x.size)
Out[5]: Total size is 6
In [6]: print("The dimension of our array is " ,x.ndim)
Out[6]: The dimension of our array is 2
In [7]: print("Data type of the elements is", x.dtype)
Out[7]: Data type of the elements is int32
In [8]: print("It consumes",x.nbytes,"bytes")
Out[8]: It consumes 24 bytes

As you can see, the type of our object is a NumPy ndarray. x.shape returns a tuple giving the dimensions of the array, such as (n, m). You can get the total number of elements with x.size; in our example, we have six elements in total. Knowing attributes such as shape and dimension is very important: the more you know about your array, the more comfortable you will be with computations, and it wouldn't be wise to start computing with an array whose size and dimensions you don't know. In NumPy, you can use x.ndim to check the number of dimensions of your array. Other attributes, such as dtype and nbytes, are very useful for checking memory consumption and deciding which data type to use. In our example, each element has the data type int32, so the array consumes 24 bytes in total (6 elements x 4 bytes). You can also set some of these attributes, such as dtype, while creating the array. Previously, the data type was an integer; let's switch it to float, complex, or uint (unsigned integer) and analyze how the byte consumption changes:

In [9]: x = np.array([[1,2,3],[4,5,6]], dtype = np.float)
        print(x)
        print(x.nbytes)
Out[9]: [[ 1. 2. 3.]
        [ 4. 5. 6.]]
        48
In [10]: x = np.array([[1,2,3],[4,5,6]], dtype = np.complex)
         print(x)
         print(x.nbytes)
Out[10]: [[ 1.+0.j 2.+0.j 3.+0.j]
         [ 4.+0.j 5.+0.j 6.+0.j]]
         96
In [11]: x = np.array([[1,2,3],[4,-5,6]], dtype = np.uint32)
         print(x)
         print(x.nbytes)
Out[11]: [[ 1 2 3]
         [ 4 4294967291 6]]
         24

As you can see, each type consumes a different number of bytes. Now imagine you have the same matrix and compare using int64 with int32 as the data type:

In [12]: x = np.array([[1,2,3],[4,5,6]], dtype = np.int64)
         print("int64 consumes",x.nbytes, "bytes")
         x = np.array([[1,2,3],[4,5,6]], dtype = np.int32)
         print("int32 consumes",x.nbytes, "bytes")
Out[12]: int64 consumes 48 bytes
         int32 consumes 24 bytes

The memory need is doubled if you use int64. Ask yourself which data type would suffice: as long as your numbers stay between -2,147,483,648 and 2,147,483,647, int32 is enough. Imagine you have a huge array with a size over 100 MB; in such cases, this conversion plays a crucial role in performance.
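You don't have to memorize these limits; np.iinfo reports the exact representable range of any integer dtype:

```python
import numpy as np

# Query the representable range of the integer dtypes.
info32 = np.iinfo(np.int32)
info64 = np.iinfo(np.int64)
print(info32.min, info32.max)  # -2147483648 2147483647
print(info64.min, info64.max)  # -9223372036854775808 9223372036854775807
```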

As you may have noticed in the previous example, when you were changing the data types, you were creating a new array each time. Technically, you cannot change the dtype after you create the array. What you can do is either create the array again or copy the existing one with a new dtype, using the astype method. Let's first create a copy of the array with a new dtype, and then see how astype converts the dtype as well:

In [13]: x_copy = np.array(x, dtype = np.float)
         x_copy
Out[13]: array([[ 1., 2., 3.],
         [ 4., 5., 6.]])
In [14]: x_copy_int = x_copy.astype(np.int)
         x_copy_int
Out[14]: array([[1, 2, 3],
         [4, 5, 6]])

Please keep in mind that the astype method doesn't change the dtype of x_copy in place, even though you called it on x_copy. It leaves x_copy unchanged and returns a new array, x_copy_int:

In [15]: x_copy
Out[15]: array([[ 1., 2., 3.],
         [ 4., 5., 6.]])

Let's imagine a case where you are working in a research group that is trying to identify and calculate the risks of individual cancer patients. You have 100,000 records (rows), where each row represents a single patient, and each patient has 100 features (the results of certain tests). As a result, you have an array with shape (100000, 100):

In [16]: Data_Cancer= np.random.rand(100000,100)
         print(type(Data_Cancer))
         print(Data_Cancer.dtype)
         print(Data_Cancer.nbytes)
         Data_Cancer_New = np.array(Data_Cancer, dtype = np.float32)
         print(Data_Cancer_New.nbytes)
Out[16]: <class 'numpy.ndarray'>
         float64
         80000000
         40000000

As you can see from the preceding code, the array's size decreases from 80 MB to 40 MB just by changing the dtype. What we give up in return is precision after the decimal point: instead of roughly 15 significant decimal digits, you keep only about 7. In some machine learning algorithms, this loss of precision is negligible; in such cases, you should feel free to adjust your dtype so that it minimizes your memory usage.
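np.finfo makes this precision trade-off explicit; a brief sketch:

```python
import numpy as np

# float64 keeps about 15 significant decimal digits, float32 about 6-7.
print(np.finfo(np.float64).precision)   # 15
print(np.finfo(np.float32).precision)   # 6

# The extra digits are simply rounded away in float32.
print(np.float64(0.123456789123456789))
print(np.float32(0.123456789123456789))
```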

 

NumPy array operations


This section will guide you through the creation and manipulation of numerical data with NumPy. Let's start by creating a NumPy array from the list:

In [17]: my_list = [2, 14, 6, 8]
         my_array = np.asarray(my_list)
         type(my_array)
Out[17]: numpy.ndarray

Let's do some addition, subtraction, multiplication, and division with scalar values:

In [18]: my_array + 2
Out[18]: array([ 4, 16, 8, 10])
In [19]: my_array - 1
Out[19]: array([ 1, 13, 5, 7])
In [20]: my_array * 2
Out[20]: array([ 4, 28, 12, 16])
In [21]: my_array / 2
Out[21]: array([ 1. , 7. , 3. , 4. ])

It's much harder to do the same operations on a list, because lists do not support vectorized operations and you would need to iterate over the elements yourself. There are many ways to create NumPy arrays; here you will use one of them to create an array full of zeros, and then perform some arithmetic to see how NumPy behaves in element-wise operations between two arrays:

In [22]: second_array = np.zeros(4) + 3
         second_array
Out[22]: array([ 3., 3., 3., 3.])
In [23]: my_array - second_array
Out[23]: array([ -1., 11., 3., 5.])
In [24]: second_array / my_array
Out[24]: array([ 1.5 , 0.21428571, 0.5 , 0.375 ])

As we did in the previous code, you can create an array which is full of ones with np.ones or an identity array with np.identity and do the same algebraic operations that you did previously:

In [25]: second_array = np.ones(4) + 3
         second_array
Out[25]: array([ 4., 4., 4., 4.])
In [26]: my_array - second_array
Out[26]: array([ -2., 10., 2., 4.])
In [27]: second_array / my_array
Out[27]: array([ 2. , 0.28571429, 0.66666667, 0.5 ])

It works as expected with the np.ones method, but when you use the identity matrix, the calculation returns a (4,4) array as follows:

In [28]: second_array = np.identity(4)
         second_array
Out[28]: array([[ 1., 0., 0., 0.],
                [ 0., 1., 0., 0.],
                [ 0., 0., 1., 0.],
                [ 0., 0., 0., 1.]])
In [29]: second_array = np.identity(4) + 3
         second_array
Out[29]: array([[ 4., 3., 3., 3.],
                [ 3., 4., 3., 3.],
                [ 3., 3., 4., 3.],
                [ 3., 3., 3., 4.]])
In [30]: my_array - second_array
Out[30]: array([[ -2., 11., 3., 5.],
                [ -1., 10., 3., 5.],
                [ -1., 11., 2., 5.],
                [ -1., 11., 3., 4.]])

What this does is subtract each element of the first column of second_array from the first element of my_array, each element of the second column from the second element, and so on; the same rule applies to division as well. Please keep in mind that you can successfully perform array operations even when the arrays are not exactly the same shape. Later in this chapter, you will learn about broadcasting and the errors raised when computation cannot be done between two arrays because their shapes are incompatible:

In [31]: second_array / my_array
Out[31]: array([[ 2.  , 0.21428571, 0.5       , 0.375      ],
                [ 1.5 , 0.28571429, 0.5       , 0.375      ],
                [ 1.5 , 0.21428571, 0.66666667, 0.375      ],
                [ 1.5 , 0.21428571, 0.5       , 0.5        ]])
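The behavior above is NumPy's broadcasting at work. A quick sketch of both a successful broadcast and the error raised when shapes cannot be reconciled:

```python
import numpy as np

a = np.arange(4)       # shape (4,)
b = np.ones((4, 4))    # shape (4, 4): compatible, a is stretched across rows
print((b + a).shape)   # (4, 4)

c = np.ones((3, 3))    # shape (3, 3): incompatible with (4,)
try:
    a + c
except ValueError as err:
    print("broadcast failed:", err)
```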

One of the most useful methods for creating NumPy arrays is arange, which returns an array of evenly spaced values within a given interval. The first argument is the start value, the second is the end value (where it stops creating values; this endpoint is excluded), and the third is the step. Optionally, you can set the dtype as the fourth argument. The default step is 1:

In [32]: x = np.arange(3,7,0.5)
         x
Out[32]: array([ 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5])

There is another way to create an array with evenly spaced values between start and stop points, for when you can't decide what the step should be but you know how many elements the array should have:

In [33]: x = np.linspace(1.2, 40.5, num=20)
         x
Out[33]: array([ 1.2        , 3.26842105,  5.33684211,  7.40526316,   9.47368421,
                 11.54210526, 13.61052632, 15.67894737, 17.74736842, 19.81578947,
                 21.88421053, 23.95263158, 26.02105263, 28.08947368, 30.15789474,
                 32.22631579, 34.29473684, 36.36315789, 38.43157895, 40.5       ])

There are two further methods which are similar in usage but return different sequences of numbers, because they work on a logarithmic scale rather than a linear one, so the distribution of the numbers is different as well. The first is geomspace, which returns numbers forming a geometric progression:

In [34]: np.geomspace(1, 625, num=5)
Out[34]: array([ 1., 5., 25., 125., 625.])

The other important method is logspace, which returns numbers evenly spaced on a logarithmic scale; its start and stop arguments are interpreted as exponents of the base (10 by default):

In [35]: np.logspace(3, 4, num=5)
Out[35]: array([ 1000. , 1778.27941004, 3162.27766017, 5623.4132519 ,
                10000. ])

What happened with these arguments? If the starting point is 3 and the ending point is 4, why does the function return numbers far above that range? Because the start is actually interpreted as 10**start and the end as 10**stop; so technically, in this example, the starting point is 10**3 and the ending point is 10**4. If you want the output to run between your start and end values themselves, the trick is to pass the base-10 logarithms of those values as the arguments:

In [36]: np.logspace(np.log10(3) , np.log10(4) , num=5)
Out[36]: array([ 3. , 3.2237098 , 3.46410162, 3.72241944, 4. ])
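You can verify this relationship directly: np.logspace is just 10 raised to the corresponding np.linspace, element-wise:

```python
import numpy as np

lin = np.linspace(3, 4, num=5)
log = np.logspace(3, 4, num=5)

# logspace(start, stop) == 10 ** linspace(start, stop), element-wise.
print(np.allclose(log, 10 ** lin))   # True
```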

By now, you should be familiar with different ways of creating arrays with different distributions, and you have also learned how to do some basic operations on them. Let's continue with other useful functions that you will definitely use in your day-to-day work. Most of the time, you will work with multiple arrays and need to compare them quickly. NumPy has a great solution for this: you can compare arrays just as you would compare two integers:

In [37]: x = np.array([1,2,3,4])
         y = np.array([1,3,4,4])
         x == y
Out[37]: array([ True, False, False, True], dtype=bool)

The comparison is done element-wise and returns a Boolean array indicating whether the elements in the two arrays match. This approach works well for small arrays and gives you the most detail: wherever the output shows False, the values at that index do not match. For a large array, you may instead want a single answer to the question of whether two arrays are equal:

In [38]: x = np.array([1,2,3,4])
         y = np.array([1,3,4,4])
         np.array_equal(x,y)
Out[38]: False

Here, you get a single Boolean output: you only know that the arrays are not equal, not which elements differ. Comparison is not limited to checking equality; you can also do element-wise greater-than and less-than comparisons between two arrays:

In [39]: x = np.array([1,2,3,4])
         y = np.array([1,3,4,4])
         x < y
Out[39]: array([False, True, True, False], dtype=bool)
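One caveat worth adding here: for floating-point arrays, exact comparisons are fragile, so NumPy also offers np.isclose and np.allclose, which compare within a tolerance:

```python
import numpy as np

x = np.array([0.1 + 0.2, 1.0])
y = np.array([0.3, 1.0])

print(np.array_equal(x, y))   # False: 0.1 + 0.2 != 0.3 exactly in binary
print(np.isclose(x, y))       # [ True  True]
print(np.allclose(x, y))      # True: equal within the default tolerance
```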

When you need to do logical comparisons (AND, OR, XOR), you can apply them to your arrays as follows:

In [40]: x = np.array([0, 1, 0, 0], dtype=bool)
         y = np.array([1, 1, 0, 1], dtype=bool)
         np.logical_or(x,y)
Out[40]: array([ True, True, False, True], dtype=bool)
In [41]: np.logical_and(x,y)
Out[41]: array([False, True, False, False], dtype=bool)
In [42]: x = np.array([12,16,57,11])
         np.logical_or(x < 13, x > 50)
Out[42]: array([ True, False, True, True], dtype=bool)

So far, algebraic operations such as addition and multiplication have been covered. How can we use these operations with transcendental functions such as the exponential function, logarithms, or trigonometric functions?

In [43]: x = np.array([1, 2, 3,4 ])
         np.exp(x)
Out[43]: array([ 2.71828183, 7.3890561 , 20.08553692, 54.59815003])
In [44]: np.log(x)
Out[44]: array([ 0. , 0.69314718, 1.09861229, 1.38629436])
In [45]: np.sin(x)
Out[45]: array([ 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])

What about the transpose of a matrix? First, you will use the reshape function with arange to set the desired shape of the matrix:

In [46]: x = np.arange(9)
         x
Out[46]: array([0, 1, 2, 3, 4, 5, 6, 7, 8])
In [47]: x = np.arange(9).reshape((3, 3))
         x
Out[47]: array([[0, 1, 2],
                [3, 4, 5],
                [6, 7, 8]])
In [48]: x.T
Out[48]: array([[0, 3, 6],
                [1, 4, 7],
                [2, 5, 8]])

You transposed the 3 x 3 array, and the shape didn't change, because both dimensions are 3. Let's see what happens when the array is not square:

In [49]: x = np.arange(6).reshape(2,3)
         x
Out[49]: array([[0, 1, 2],
                [3, 4, 5]])
In [50]: x.T
Out[50]: array([[0, 3],
                [1, 4],
                [2, 5]])

The transpose works as expected and the dimensions are switched as well. You can also get summary statistics from arrays such as mean, median, and standard deviation. Let's start with methods that NumPy offers for calculating basic statistics:

  • np.sum: Returns the sum of the whole array, or along the specified axis
  • np.amin: Returns the minimum value of the whole array, or along the specified axis
  • np.amax: Returns the maximum value of the whole array, or along the specified axis
  • np.percentile: Returns the given qth percentile of the whole array, or along the specified axis
  • np.nanmin: The same as np.amin, but ignores NaN values in the array
  • np.nanmax: The same as np.amax, but ignores NaN values in the array
  • np.nanpercentile: The same as np.percentile, but ignores NaN values in the array

 

The following code block gives examples of the preceding statistical methods of NumPy. These methods are very useful, as you can apply them to a whole array or axis-wise, according to your needs. Note that you can find more fully-featured implementations of some of these statistics in SciPy, which uses NumPy's multidimensional arrays as its data structure:

In [51]: x = np.arange(9).reshape((3,3))
         x
Out[51]: array([[0, 1, 2],
                [3, 4, 5],
                [6, 7, 8]])
In [52]: np.sum(x)
Out[52]: 36
In [53]: np.amin(x)
Out[53]: 0
In [54]: np.amax(x)
Out[54]: 8
In [55]: np.amin(x, axis=0)
Out[55]: array([0, 1, 2])
In [56]: np.amin(x, axis=1)
Out[56]: array([0, 3, 6])
In [57]: np.percentile(x, 80)
Out[57]: 6.4000000000000004

The axis argument determines the dimension the function operates along: axis=0 refers to the first axis (the function is applied down the columns) and axis=1 to the second axis (across the rows). A plain amin(x) returns a single value because it computes the minimum over the whole array, but when you specify an axis, it evaluates the function axis-wise and returns an array with a result for each column or row. Imagine you have a large array and you find the maximum value using amax; what if you need to pass the index of this value to another function? In such cases, argmin and argmax come to the rescue, as shown in the following snippet:

In [58]: x = np.array([1,-21,3,-3])
         np.argmax(x)
Out[58]: 2
In [59]: np.argmin(x)
Out[59]: 1
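The NaN-aware variants listed earlier deserve a quick illustration: the plain reductions propagate a NaN, while the nan-prefixed ones skip it:

```python
import numpy as np

x = np.array([1.0, 4.0, np.nan, 8.0])

print(np.amin(x))               # nan: the plain reduction propagates NaN
print(np.nanmin(x))             # 1.0
print(np.nanmax(x))             # 8.0
print(np.nanpercentile(x, 50))  # 4.0: the median of [1, 4, 8]
```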

Let's continue with more statistical functions:

  • np.mean: Returns the mean of the whole array, or along the specified axis
  • np.median: Returns the median of the whole array, or along the specified axis
  • np.std: Returns the standard deviation of the whole array, or along the specified axis
  • np.nanmean: The same as np.mean, but ignores NaN values in the array
  • np.nanmedian: The same as np.median, but ignores NaN values in the array
  • np.nanstd: The same as np.std, but ignores NaN values in the array

 

The following code gives more examples of the preceding statistical methods of NumPy. These methods are heavily used in the data discovery phase, where you analyze your data's features and distribution:

In [60]: x = np.array([[2, 3, 5], [20, 12, 4]])
         x
Out[60]: array([[ 2, 3, 5],
                [20, 12, 4]])
In [61]: np.mean(x)
Out[61]: 7.666666666666667
In [62]: np.mean(x, axis=0)
Out[62]: array([ 11. , 7.5, 4.5])
In [63]: np.mean(x, axis=1)
Out[63]: array([ 3.33333333, 12. ])
In [64]: np.median(x)
Out[64]: 4.5
In [65]: np.std(x)
Out[65]: 6.3944420310836261
 

Working with multidimensional arrays


This section will give you a brief understanding of multidimensional arrays by going through different matrix operations.

In order to do matrix multiplication in NumPy, you have to use dot() instead of *, which performs element-wise multiplication. Let's see some examples:

In [66]: c = np.ones((4, 4))
         c*c
Out[66]: array([[ 1., 1., 1., 1.],
                [ 1., 1., 1., 1.],
                [ 1., 1., 1., 1.],
                [ 1., 1., 1., 1.]])
In [67]: c.dot(c)
Out[67]: array([[ 4., 4., 4., 4.],
                [ 4., 4., 4., 4.],
                [ 4., 4., 4., 4.],
                [ 4., 4., 4., 4.]])
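A side note, assuming Python 3.5 or later: the @ operator performs the same matrix multiplication as dot(), and often reads more naturally in longer expressions:

```python
import numpy as np

c = np.ones((4, 4))

# c @ c is equivalent to c.dot(c) for 2-D arrays.
print(np.array_equal(c @ c, c.dot(c)))   # True
```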

The most important topic in working with multidimensional arrays is stacking: in other words, how to merge two arrays. hstack is used for stacking arrays horizontally (column-wise) and vstack for stacking them vertically (row-wise). You can also split arrays with the hsplit and vsplit methods, in the same way that you stacked them:

In [68]: y = np.arange(15).reshape(3,5)
         x = np.arange(10).reshape(2,5)
         new_array = np.vstack((y,x))
         new_array
Out[68]: array([[ 0, 1, 2, 3, 4],
                [ 5, 6, 7, 8, 9],
                [10, 11, 12, 13, 14],
                [ 0, 1, 2, 3, 4],
                [ 5, 6, 7, 8, 9]])
In [69]: y = np.arange(15).reshape(5,3)
         x = np.arange(10).reshape(5,2)
         new_array = np.hstack((y,x))
         new_array
Out[69]: array([[ 0, 1, 2, 0, 1],
                [ 3, 4, 5, 2, 3],
                [ 6, 7, 8, 4, 5],
                [ 9, 10, 11, 6, 7],
                [12, 13, 14, 8, 9]])
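The hsplit and vsplit methods mentioned above undo this stacking; a minimal sketch:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

# vsplit cuts row-wise, hsplit cuts column-wise,
# each into the requested number of equal pieces:
top, bottom = np.vsplit(x, 2)  # two (2, 4) arrays
left, right = np.hsplit(x, 2)  # two (4, 2) arrays
```

Stacking the pieces back together with vstack or hstack reproduces the original array.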

These methods are very useful in machine learning applications, especially when creating datasets. After you stack your arrays, you can check their descriptive statistics using scipy.stats. Imagine a case where you have 100 records, each with 10 features; this means you have a 2D matrix with 100 rows and 10 columns. The following example shows how easily you can get descriptive statistics for each feature:

In [70]: from scipy import stats
         x= np.random.rand(100,10)
         n, min_max, mean, var, skew, kurt = stats.describe(x)
         new_array = np.vstack((mean,var,skew,kurt,min_max[0],min_max[1]))
         new_array.T
Out[70]: array([[ 5.46011575e-01, 8.30007104e-02, -9.72899085e-02,
                 -1.17492785e+00, 4.07031246e-04, 9.85652100e-01],
                [ 4.79292653e-01, 8.13883169e-02, 1.00411352e-01,
                 -1.15988275e+00, 1.27241020e-02, 9.85985488e-01],
                [ 4.81319367e-01, 8.34107619e-02, 5.55926602e-02,
                 -1.20006450e+00, 7.49534810e-03, 9.86671083e-01],
                [ 5.26977277e-01, 9.33829059e-02, -1.12640661e-01,
                 -1.19955646e+00, 5.74237697e-03, 9.94980830e-01],
                [ 5.42622228e-01, 8.92615897e-02, -1.79102183e-01,
                 -1.13744108e+00, 2.27821933e-03, 9.93861532e-01],
                [ 4.84397369e-01, 9.18274523e-02, 2.33663872e-01,
                 -1.36827574e+00, 1.18986562e-02, 9.96563489e-01],
                [ 4.41436165e-01, 9.54357485e-02, 3.48194314e-01,
                 -1.15588500e+00, 1.77608372e-03, 9.93865324e-01],
                [ 5.34834409e-01, 7.61735119e-02, -2.10467450e-01,
                 -1.01442389e+00, 2.44706226e-02, 9.97784091e-01],
                [ 4.90262346e-01, 9.28757119e-02, 1.02682367e-01,
                 -1.28987137e+00, 2.97705706e-03, 9.98205307e-01],
                [ 4.42767478e-01, 7.32159267e-02, 1.74375646e-01,
                 -9.58660574e-01, 5.52410464e-04, 9.95383732e-01]])

NumPy has a great module named numpy.ma, which is used for masking array elements. It's very useful when you want to mask (ignore) some elements during calculations. A masked element is treated as invalid and is excluded from computations:

In [71]: import numpy.ma as ma
         x = np.arange(6)
         print(x.mean())
         masked_array = ma.masked_array(x, mask=[1,0,0,0,0,0])
         masked_array.mean()
         2.5 
Out[71]: 3.0

In the preceding code, you have an array x = [0,1,2,3,4,5]. What you do is mask the first element of the array and then calculate the mean. When a mask value is 1 (True), the element at the associated index is masked. This method is also very useful when replacing NaN values:

In [72]: x = np.arange(25, dtype = float).reshape(5,5)
         x[x<5] = np.nan
         x
Out[72]: array([[ nan, nan, nan, nan, nan],
                [ 5., 6., 7., 8., 9.],
                [ 10., 11., 12., 13., 14.],
                [ 15., 16., 17., 18., 19.],
                [ 20., 21., 22., 23., 24.]])
In [73]: np.where(np.isnan(x), ma.array(x, mask=np.isnan(x)).mean(axis=0), x)
Out[73]: array([[ 12.5, 13.5, 14.5, 15.5, 16.5],
                [ 5. , 6. , 7. , 8. , 9. ],
                [ 10. , 11. , 12. , 13. , 14. ],
                [ 15. , 16. , 17. , 18. , 19. ],
                [ 20. , 21. , 22. , 23. , 24. ]])

In the preceding code, we changed the first five elements to nan with a conditional index: x[x<5] selects the elements whose values are less than 5, which here are 0, 1, 2, 3, and 4. We then overwrote these values with the mean of each column (excluding NaN values). There are many other useful array-operation methods that help keep your code concise:

Method

Description

np.concatenate

Joins a sequence of arrays along an existing axis

np.repeat

Repeats the elements of an array along a specified axis

np.delete

Returns a new array with the specified subarrays deleted

np.insert

Inserts values along a given axis before the given indices

np.unique

Finds the unique values in an array

np.tile

Creates an array by repeating a given input a given number of times
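
The following short examples (not from the original text) illustrate each method in the table on a small array:

```python
import numpy as np

x = np.array([1, 2, 2, 3])

np.concatenate((x, [4, 5]))  # array([1, 2, 2, 3, 4, 5])
np.repeat(x, 2)              # array([1, 1, 2, 2, 2, 2, 3, 3])
np.delete(x, 0)              # array([2, 2, 3])
np.insert(x, 1, 9)           # array([1, 9, 2, 2, 3])
np.unique(x)                 # array([1, 2, 3])
np.tile(x, 2)                # array([1, 2, 2, 3, 1, 2, 2, 3])
```

Note that all of these return new arrays; the original x is left unchanged.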

 

Indexing, slicing, reshaping, resizing, and broadcasting


When you are working with huge arrays in machine learning projects, you often need to index, slice, reshape, and resize.

Indexing is a fundamental term used in mathematics and computer science. As a general term, indexing helps you to specify how to return desired elements of various data structures. The following example shows indexing for a list and a tuple:

In [74]: x = ["USA","France", "Germany","England"]
         x[2]
Out[74]: 'Germany'
In [75]: x = ('USA',3,"France",4)
         x[2]
Out[75]: 'France'

In NumPy, the main use of indexing is controlling and manipulating the elements of arrays. It's a way of creating generic lookup values. Indexing covers three kinds of operations: field access, basic slicing, and advanced indexing. In field access, you simply specify the index of an element in an array to return the value at that index.

NumPy is very powerful when it comes to indexing and slicing. In many cases, you need to refer to a desired element in an array and operate on that sliced area. You can index your array with square-bracket notation, just as you do with tuples or lists. Let's start with field access and simple slicing of one-dimensional arrays, then move on to more advanced techniques:

In [76]: x = np.arange(10)
         x
Out[76]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [77]: x[5]
Out[77]: 5
In [78]: x[-2]
Out[78]: 8
In [79]: x[2:8]
Out[79]: array([2, 3, 4, 5, 6, 7])
In [80]: x[:]
Out[80]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [81]: x[2:8:2]
Out[81]: array([2, 4, 6])

Indexing starts from 0, so when you create an array, your first element is x[0] and your last element is x[n-1]. As you can see in the preceding example, x[5] refers to the sixth element. You can also use negative values in indexing; NumPy interprets these as counting backwards from the end of the array, so x[-2] refers to the second-to-last element. You can also select multiple elements by stating the starting and ending indices, and create a strided selection by giving the step as a third argument, as in x[2:8:2] above.
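
The step argument can also be negative, which walks the array backwards; a small addition to the examples above:

```python
import numpy as np

x = np.arange(10)

x[::-1]    # array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0]) -- reversed
x[8:2:-2]  # array([8, 6, 4]) -- from index 8 down to (but excluding) index 2
```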

So far, we have seen indexing and slicing in 1D arrays. The logic does not change, but for the sake of demonstration, let's practice on multidimensional arrays as well. The only thing that changes with multidimensional arrays is that you have more axes. You can slice a 2D array as [slicing of rows, slicing of columns], as in the following code:

In [82]: x = np.reshape(np.arange(16),(4,4))
         x
Out[82]: array([[ 0, 1, 2, 3],
                [ 4, 5, 6, 7],
                [ 8, 9, 10, 11],
                [12, 13, 14, 15]])
In [83]: x[1:3]
Out[83]: array([[ 4, 5, 6, 7],
                [ 8, 9, 10, 11]])
In [84]: x[:,1:3]
Out[84]: array([[ 1, 2],
                [ 5, 6],
                [ 9, 10],
                [13, 14]])
In [85]: x[1:3,1:3]
Out[85]: array([[ 5, 6],
                [ 9, 10]])

You sliced the arrays row- and column-wise, but so far always in a rectangular or square block. You can also select elements in a more irregular, more dynamic fashion. Imagine a 4x4 array from which we want to pick only the elements at positions (0, 0), (1, 1), and (2, 3).

To obtain this selection, we execute the following code:

In [86]: x = np.reshape(np.arange(16),(4,4))
         x
Out[86]: array([[ 0, 1, 2, 3],
                [ 4, 5, 6, 7],
                [ 8, 9, 10, 11],
                [12, 13, 14, 15]])
In [87]: x[[0,1,2],[0,1,3]]
Out[87]: array([ 0, 5, 11])

In advanced indexing, the first list gives the row indices and the second list gives the corresponding column indices; the two lists are paired element-wise. In the preceding example, the index pairs (0, 0), (1, 1), and (2, 3) select the elements 0, 5, and 11.
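
Boolean (mask) indexing is another form of advanced indexing worth knowing; a minimal sketch:

```python
import numpy as np

x = np.reshape(np.arange(16), (4, 4))

# A boolean condition produces a mask array; indexing with it
# returns the matching elements as a flat 1-D array:
x[x % 2 == 0]         # array([ 0,  2,  4,  6,  8, 10, 12, 14])
x[(x > 3) & (x < 8)]  # array([4, 5, 6, 7])
```

This is the same mechanism used earlier in x[x<5] = np.nan to assign to a conditional selection.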

The reshape and resize methods may seem similar, but their outputs differ. When you reshape an array, you get a new array object with the new shape, while the original array is left untouched; the total number of elements must stay the same. When you resize an array, its size actually changes: if the new array is larger than the original, np.resize fills the extra elements with repeated copies of the original data (the in-place ndarray.resize method fills with zeros instead). Conversely, if the new array is smaller, it takes as many elements from the original, in index order, as are needed to fill the new shape. Please note that the same data can be shared by different ndarrays, which means that an ndarray can be a view onto another ndarray. In such cases, changes made through one array are visible through the other views.

The following code gives an example of how the new array elements are filled when the size is bigger or smaller than the original array:

In [88]: x = np.arange(16).reshape(4,4)
         x
Out[88]: array([[ 0, 1, 2, 3],
                [ 4, 5, 6, 7],
                [ 8, 9, 10, 11],
                [12, 13, 14, 15]])
In [89]: np.resize(x,(2,2))
Out[89]: array([[0, 1],
                 [2, 3]])
In [90]: np.resize(x,(6,6))
Out[90]: array([[ 0, 1, 2, 3, 4, 5],
                [ 6, 7, 8, 9, 10, 11],
                [12, 13, 14, 15, 0, 1],
                [ 2, 3, 4, 5, 6, 7],
                [ 8, 9, 10, 11, 12, 13],
                [14, 15, 0, 1, 2, 3]])
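
The view behaviour mentioned above can be demonstrated directly; a minimal sketch:

```python
import numpy as np

x = np.arange(16)
v = x.reshape(4, 4)  # reshape returns a view sharing x's data

v[0, 0] = 99         # writing through the view...
x[0]                 # ...is visible in the original array: 99
v.base is x          # True -- v is a view onto x
```

When you need an independent array instead of a view, call .copy() explicitly.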

The last important term of this subsection is broadcasting, which describes how NumPy handles arithmetic operations on arrays of different shapes. NumPy compares the shapes dimension by dimension, starting from the trailing dimension: two dimensions are compatible when they are equal or when one of them is 1. If these conditions are not met, you will get an error such as operands could not be broadcast together:

In [91]: x = np.arange(16).reshape(4,4)
         y = np.arange(6).reshape(2,3)
         x+y
        ---------------------------------------------------------------
        ValueError                        Traceback (most recent call last)
        <ipython-input-102-083fc792f8d9> in <module>()
              1 x = np.arange(16).reshape(4,4)
              2 y = np.arange(6).reshape(2,3)
        ----> 3 x+y

        ValueError: operands could not be broadcast together with shapes (4,4) (2,3)

You might have seen that you can multiply two arrays with shapes (4, 4) and (4,), or with (2, 2) and (2, 1). The first case meets the compatibility condition: the one-dimensional array is treated as a single row and stretched down the rows of the matrix, so the multiplication proceeds without any broadcasting problems:

In [92]: x = np.ones(16).reshape(4,4)
          y = np.arange(4)
          x*y
Out[92]: array([[ 0., 1., 2., 3.],
                 [ 0., 1., 2., 3.],
                 [ 0., 1., 2., 3.],
                 [ 0., 1., 2., 3.]])
In [93]: x = np.arange(4).reshape(2,2)
         x
Out[93]: array([[0, 1],
                [2, 3]])
In [94]: y = np.arange(2).reshape(1,2)
         y
Out[94]: array([[0, 1]])
In [95]: x*y
Out[95]: array([[0, 1],
                [0, 3]])

The preceding code block gives an example of the second case: during computation, the smaller array is stretched (broadcast) across the larger one, and the output takes the larger shape. That's why the outputs have shapes (4, 4) and (2, 2): during the multiplication, the smaller operand is broadcast up to the larger dimensions.
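
To make the broadcasting rule concrete, here is a small sketch with explicit shapes:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)  # shape (3, 1)
row = np.arange(4).reshape(1, 4)  # shape (1, 4)

# Size-1 dimensions are stretched, so (3, 1) * (1, 4) -> (3, 4):
(col * row).shape                       # (3, 4)
(np.ones((4, 4)) + np.arange(4)).shape  # (4,) is padded to (1, 4), then stretched to (4, 4)
```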

 

Summary


In this chapter, you got familiar with NumPy basics for array operations and refreshed your knowledge about basic matrix operations. NumPy is an extremely important library for Python scientific stacks, with its extensive methods for array operations. You have learned how to work with multidimensional arrays and covered important topics such as indexing, slicing, reshaping, resizing, and broadcasting. The main goal of this chapter was to give you a brief idea of how NumPy works when it comes to numerical datasets, which will be helpful in your daily data analysis work.

In the next chapter, you will learn the basics of linear algebra and complete practical examples with NumPy.

About the Authors

  • Umit Mert Cakmak

    Umit Mert Cakmak is a data scientist at IBM, where he excels at helping clients solve complex data science problems, from inception to delivery of deployable assets. His research spans multiple disciplines beyond his industry and he likes sharing his insights at conferences, universities, and meet-ups.

    Browse publications by this author
  • Mert Cuhadaroglu

    Mert Cuhadaroglu is a BI Developer in EPAM, developing E2E analytics solutions for complex business problems in various industries, mostly investment banking, FMCG, media, communication, and pharma. He consistently uses advanced statistical models and ML algorithms to provide actionable insights. Throughout his career, he has worked in several other industries, such as banking and asset management. He continues his academic research in AI for trading algorithms.

    Browse publications by this author

Latest Reviews

(3 reviews total)
Not yet delivered. Same problem.
Besides Python, you should also bring along some mathematical knowledge; then you can get the most out of this book.
Deep dive into the topic - expert know how necessary
