Let's get started. We can find a mind map describing software that can be used for data analysis at http://www.xmind.net/m/WvfC/. Obviously, we can't install all of this software in this chapter. We will install NumPy, SciPy, matplotlib, and IPython on different operating systems and have a look at some simple code that uses NumPy.
NumPy is a fundamental Python library that provides numerical arrays and functions.
SciPy is a scientific Python library, which supplements and slightly overlaps NumPy. NumPy and SciPy historically shared their code base but were later separated.
matplotlib is a plotting library based on NumPy. You can read more about matplotlib in Chapter 6, Data Visualization.
IPython provides an architecture for interactive computing. The most notable part of this project is the IPython shell. We will cover the IPython shell later in this chapter.
Installation instructions for the other software we need will be given throughout the book at the appropriate time. At the end of this chapter, you will find pointers on how to find additional information online if you get stuck or are uncertain about the best way to solve problems.
In this chapter, we will cover:
Installing Python, SciPy, matplotlib, IPython, and NumPy on Windows, Linux, and Macintosh
Writing a simple application using NumPy arrays
Getting to know IPython
Online resources and help
The software used in this book is based on Python, so you are required to have Python installed. On some operating systems, Python is already installed. However, you need to check whether the installed Python version is compatible with the version of the software you want to install. There are many implementations of Python, including commercial implementations and distributions. In this book, we will focus on the standard CPython implementation, which is guaranteed to be compatible with NumPy.
Note
You can download Python from https://www.python.org/download/. On this website, we can find installers for Windows and Mac OS X as well as source archives for Linux, Unix, and Mac OS X.
The software we will install in this chapter has binary installers for Windows, various Linux distributions, and Mac OS X. There are also source distributions if you prefer that. You need to have Python 2.4.x or above installed on your system. Python 2.7.x is currently the best Python version to have because most Scientific Python libraries support it. Python 2.7 will be supported and maintained until 2020. After that, we will have to switch to Python 3.
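Before installing anything, it can help to confirm which interpreter and version are actually being used. The following is a minimal sketch using only the standard library; the helper name describe_python is our own, chosen for illustration, and is not part of any package discussed here.

```python
import platform
import sys


def describe_python():
    """Return (implementation, major, minor) for the running interpreter."""
    return (platform.python_implementation(),
            sys.version_info[0],
            sys.version_info[1])


implementation, major, minor = describe_python()
# sys.executable shows which interpreter binary is being run.
print("%s %d.%d at %s" % (implementation, major, minor, sys.executable))
```

If the implementation is reported as CPython and the version matches what the installers expect, we are ready to continue.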
We will learn how to install and set up NumPy, SciPy, matplotlib, and IPython on Windows, Linux and Mac OS X. Let's look at the process in detail.
Installing on Windows is, fortunately, a straightforward task that we will cover in detail. You only need to download an installer and a wizard will guide you through the installation steps. We will give you steps to install NumPy here. The steps to install the other libraries are similar. The actions we will take are as follows:
Download installers for Windows from the SourceForge website (refer to the following table). The latest release versions may change, so just choose the one that fits your setup best.
Library | URL | Latest version |
---|---|---|
NumPy | | 1.8.1 |
SciPy | | 0.14.0 |
matplotlib | | 1.3.1 |
IPython | | 2.0.0 |
Choose the appropriate version. In this example, we chose numpy-1.8.1-win32-superpack-python2.7.exe.
Now, we can see a description of NumPy and its features. Click on the Next button.
If you have Python installed, it should automatically be detected. If it is not detected, maybe your path settings are wrong.
Click on the Next button if Python is found; otherwise, click on the Cancel button and install Python (NumPy cannot be installed without Python). Click on the Next button. This is the point of no return. Well, kind of, but it is best to make sure that you are installing to the proper directory, and so on and so forth. Now the real installation starts. This may take a while.
Note
The situation around installers is rapidly evolving. Other alternatives exist in various stages of maturity (see http://www.scipy.org/install.html). It might be necessary to put the msvcp71.dll file in your system32 directory located at C:\Windows\. You can get it from http://www.dll-files.com/dllindex/dll-files.shtml?msvcp71.
Installing the recommended software on Linux depends on the distribution you have. We will discuss how you would install NumPy from the command line; you could probably use graphical installers as well, depending on your distribution (distro). The commands to install matplotlib, SciPy, and IPython are the same; only the package names are different. Installing matplotlib, SciPy, and IPython is recommended but optional.
Most Linux distributions have NumPy packages. We will go through the necessary commands for some of the popular Linux distributions as follows:
Run the following instructions from the command line to install NumPy on Red Hat:
$ yum install python-numpy
To install NumPy on Mandriva, run the following command-line instruction:
$ urpmi python-numpy
To install NumPy on Gentoo, run the following command-line instruction:
$ sudo emerge numpy
To install NumPy on Debian or Ubuntu, we need to type the following:
$ sudo apt-get install python-numpy
The following table gives an overview of the Linux distributions and corresponding package names for NumPy, SciPy, matplotlib, and IPython:
Linux distribution | NumPy | SciPy | matplotlib | IPython |
---|---|---|---|---|
Arch Linux | | | | |
Debian | | | | |
Fedora | | | | |
Gentoo | | | | |
openSUSE | | | | |
Slackware | | | | |
You can install NumPy, matplotlib, and SciPy on Mac OS X with a graphical installer or from the command line with a port manager, such as MacPorts or Fink, depending on your preference. The prerequisite is to install XCode, as it is not part of OS X releases. We will install NumPy with a GUI installer using the following steps:
We can get a NumPy installer from the SourceForge website at http://sourceforge.net/projects/numpy/files/. Similar files exist for matplotlib and SciPy.
Just change numpy in the previous URL to scipy or matplotlib to get installers of the respective libraries. IPython didn't have a GUI installer at the time of writing this.
Download the appropriate DMG file; usually the latest one is the best.
Another alternative is SciPy Superpack (https://github.com/fonnesbeck/ScipySuperpack).
Whichever option you choose, it is important to make sure that updates to the system Python library don't break already-installed software. The way to achieve this is to avoid building against the Python library provided by Apple. Install NumPy, matplotlib, and SciPy using the following steps:
Open the DMG file (in this example, numpy-1.8.1-py2.7-python.org-macosx10.6.dmg).
Double-click on the icon of the opened box, the one with a subscript that ends with .mpkg. We will be presented with the welcome screen of the installer.
Click on the Continue button to go to the Read Me screen, where we will be presented with a short description of NumPy.
Click on the Continue button to go to the License screen.
Read the license, click on the Continue button, and then click on the Accept button when prompted to accept the license. Continue through the screens that follow from there, and click on the Finish button at the end.
Alternatively, we can install the libraries through the MacPorts route, with Fink or Homebrew. The following installation commands install all these packages. We only need NumPy for all the tutorials in this book, so please omit the packages you are not interested in.
To install with MacPorts, type in the following command:
$ sudo port install py-numpy py-scipy py-matplotlib py-ipython
Fink also has packages for NumPy, such as scipy-core-py24, scipy-core-py25, and scipy-core-py26. The SciPy packages are scipy-py24, scipy-py25, and scipy-py26. We can install NumPy and other recommended packages that we will be using in this book for Python 2.6 with the following command:
$ fink install scipy-core-py26 scipy-py26 matplotlib-py26
As a last resort or if we want to have the latest code, we can build from source. In practice, it shouldn't be that hard, although depending on your operating system, you might run into problems. As operating systems and related software are rapidly evolving, in such cases, the best you can do is search online or ask for help. In this chapter, we give pointers on good places to look for help.
The source code can be retrieved with git or as an archive from GitHub. The steps to install NumPy from source are straightforward and given here. We can retrieve the source code for NumPy with git as follows:
$ git clone git://github.com/numpy/numpy.git numpy
Note
There are similar commands for SciPy, matplotlib, and IPython (refer to the table that follows after this piece of information). The IPython source code can be downloaded from https://github.com/ipython/ipython/releases as a source archive or ZIP file. You can then unpack it with your favorite tool or with the following command:
$ tar -xzf ipython.tar.gz
Please refer to the following table for the git commands and source archive/zip links:
Library | Git command | Tarball/zip URL |
---|---|---|
NumPy | git clone git://github.com/numpy/numpy.git numpy | |
SciPy | git clone http://github.com/scipy/scipy.git scipy | |
matplotlib | git clone git://github.com/matplotlib/matplotlib.git | |
IPython | git clone --recursive https://github.com/ipython/ipython.git | |
Install on /usr/local with the following commands from the source code directory:
$ python setup.py build
$ sudo python setup.py install --prefix=/usr/local
To build, we need a C compiler such as GCC and the Python header files in the python-dev or python-devel package.
If you have setuptools or pip, you can install NumPy, SciPy, matplotlib, and IPython with the following commands. For each library, we give two commands, one for setuptools and one for pip. You only need to choose one command per pair:
$ easy_install numpy
$ pip install numpy
$ easy_install scipy
$ pip install scipy
$ easy_install matplotlib
$ pip install matplotlib
$ easy_install ipython
$ pip install ipython
It may be necessary to prepend sudo to these commands if your current user doesn't have sufficient rights on your system.
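Whichever installation route we took, a quick way to check the result is to try importing each library and report its version. The following is a minimal sketch; the helper name installed_version is our own, and it assumes that each package exposes a __version__ attribute, which may not hold for every module.

```python
import importlib


def installed_version(name):
    """Return the version string of module `name`, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every module exposes __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


for name in ("numpy", "scipy", "matplotlib", "IPython"):
    version = installed_version(name)
    print("%-12s %s" % (name, version if version else "not installed"))
```

A library reported as not installed either failed to install or was installed for a different Python interpreter than the one running the script.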
After going through the installation of NumPy, it's time to have a look at NumPy arrays. NumPy arrays are more efficient than Python lists when it comes to numerical operations. NumPy arrays are, in fact, specialized objects with extensive optimizations. NumPy code requires fewer explicit loops than equivalent Python code. This is based on vectorization.
If we go back to high school mathematics, then we should remember the concepts of scalars and vectors. The number 2, for instance, is a scalar. When we add 2 to 2, we are performing scalar addition. We can form a vector out of a group of scalars. In Python programming terms, we will then have a one-dimensional array. This concept can, of course, be extended to higher dimensions. Performing an operation on two arrays, such as addition, can be reduced to a group of scalar operations. In plain Python, we will do that with loops going through each element in the first array and adding it to the corresponding element in the second array. However, this is more verbose than the way it is done in mathematics. In mathematics, we treat the addition of two vectors as a single operation. That's the way NumPy arrays do it too, and there are certain optimizations using low-level C routines, which make these basic operations more efficient. We will cover NumPy arrays in more detail in the following chapter, Chapter 2, NumPy Arrays.
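To make the contrast concrete, here is a small sketch with made-up values (x, y, and the two-dimensional grids are our own illustration). It adds two vectors first with an explicit loop and then as a single NumPy operation, and shows that the same notation carries over to two dimensions.

```python
import numpy as np

x = [1, 2, 3]
y = [10, 20, 30]

# Plain Python: one scalar addition per pair of elements.
loop_sum = [x[i] + y[i] for i in range(len(x))]

# NumPy: the whole addition is written as a single operation.
array_sum = np.array(x) + np.array(y)

# The same idea carries over to two dimensions.
grid_sum = np.array([[1, 2], [3, 4]]) + np.array([[10, 20], [30, 40]])

print(loop_sum)
print(array_sum)
print(grid_sum)
```

Note that x + y on the plain Python lists would concatenate them rather than add them element by element, which is one reason the loop is needed without NumPy.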
Imagine that we want to add two vectors called a and b. The word vector is used here in the mathematical sense, which means a one-dimensional array. We will learn in Chapter 3, Statistics and Linear Algebra, about specialized NumPy arrays that represent matrices. The vector a holds the squares of the integers from 0 up to, but not including, n; for instance, if n is equal to 3, a contains 0, 1, and 4. The vector b holds the cubes of the same integers, so if n is equal to 3, then the vector b contains 0, 1, and 8. How would you do that using plain Python? After we come up with a solution, we will compare it with the NumPy equivalent.
The following function solves the vector addition problem using pure Python without NumPy:
def pythonsum(n):
    # list() keeps this working on Python 3, where range() is not a list
    a = list(range(n))
    b = list(range(n))
    c = []
    for i in range(len(a)):
        a[i] = i ** 2
        b[i] = i ** 3
        c.append(a[i] + b[i])
    return c
The following is a function that solves the vector addition problem with NumPy:
import numpy

def numpysum(n):
    a = numpy.arange(n) ** 2
    b = numpy.arange(n) ** 3
    c = a + b
    return c
Notice that numpysum() does not need a for loop. Also, we used the arange() function from NumPy, which creates a NumPy array for us with the integers from 0 to n - 1. The arange() function lives in the numpy module, which we imported; that is why it is prefixed with numpy.
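As a quick check of what arange() produces, the following sketch builds the two vectors for n equal to 3, the same value used in the description above. The variable names mirror those in numpysum(); the np alias is a common convention rather than a requirement.

```python
import numpy as np

n = 3
a = np.arange(n) ** 2   # squares of 0, 1, 2
b = np.arange(n) ** 3   # cubes of 0, 1, 2

print(np.arange(n))     # [0 1 2]
print(a)                # [0 1 4]
print(b)                # [0 1 8]
print(a + b)            # [0 2 12]
```

The power operator and the addition are both applied element by element, so no loop appears in the code.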
Now comes the fun part. Remember that it was mentioned in the Preface that NumPy is faster when it comes to array operations. How much faster is NumPy, though? The following program will show us by measuring the elapsed time in microseconds for the numpysum() and pythonsum() functions. It also prints the last two elements of the vector sum. Let's check that we get the same answers using Python and NumPy:
#!/usr/bin/env python
"""
This program demonstrates vector addition the Python way.
Run from the command line as follows

  python vectorsum.py n

where n is an integer that specifies the size of the vectors.

The first vector to be added contains the squares of 0 up to n.
The second vector contains the cubes of 0 up to n.
The program prints the last 2 elements of the sum and the elapsed time.
"""
from __future__ import print_function

import sys
from datetime import datetime

import numpy as np


def numpysum(n):
    a = np.arange(n) ** 2
    b = np.arange(n) ** 3
    c = a + b
    return c


def pythonsum(n):
    a = list(range(n))
    b = list(range(n))
    c = []
    for i in range(len(a)):
        a[i] = i ** 2
        b[i] = i ** 3
        c.append(a[i] + b[i])
    return c


size = int(sys.argv[1])

# Note: delta.microseconds holds only the sub-second part of the elapsed time.
start = datetime.now()
c = pythonsum(size)
delta = datetime.now() - start
print("The last 2 elements of the sum", c[-2:])
print("PythonSum elapsed time in microseconds", delta.microseconds)

start = datetime.now()
c = numpysum(size)
delta = datetime.now() - start
print("The last 2 elements of the sum", c[-2:])
print("NumPySum elapsed time in microseconds", delta.microseconds)
The output of the program for 1000, 2000, and 4000 vector elements is as follows:
$ python vectorsum.py 1000
The last 2 elements of the sum [995007996, 998001000]
PythonSum elapsed time in microseconds 707
The last 2 elements of the sum [995007996 998001000]
NumPySum elapsed time in microseconds 171
$ python vectorsum.py 2000
The last 2 elements of the sum [7980015996, 7992002000]
PythonSum elapsed time in microseconds 1420
The last 2 elements of the sum [7980015996 7992002000]
NumPySum elapsed time in microseconds 168
$ python vectorsum.py 4000
The last 2 elements of the sum [63920031996, 63968004000]
PythonSum elapsed time in microseconds 2829
The last 2 elements of the sum [63920031996 63968004000]
NumPySum elapsed time in microseconds 274
In these runs, NumPy is roughly 4 to 10 times faster than the equivalent plain Python code, and the gap widens as the vectors grow. One thing is certain: we get the same results whether we are using NumPy or not. However, the result that is printed differs in representation. Notice that the result from the numpysum() function does not have any commas. How come? We are not dealing with a Python list but with a NumPy array. We will learn more about NumPy arrays in the next chapter, Chapter 2, NumPy Arrays.
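A single run timed with datetime can be noisy, so it may be worth cross-checking with the standard library's timeit module. The sketch below is one possible way to do that, not the measurement used for the output above: it uses compact variants of the two functions, and the helper name best_time as well as the repeat and number values are our own choices. It also shows why the microseconds attribute of a timedelta should be read with care for runs longer than a second.

```python
import timeit
from datetime import timedelta

import numpy as np


def numpysum(n):
    return np.arange(n) ** 2 + np.arange(n) ** 3


def pythonsum(n):
    return [i ** 2 + i ** 3 for i in range(n)]


def best_time(func, n, repeat=3, number=20):
    """Best average time per call, in seconds, over `repeat` rounds."""
    timer = timeit.Timer(lambda: func(n))
    return min(timer.repeat(repeat=repeat, number=number)) / number


for n in (1000, 2000, 4000):
    print("n=%d python=%.1f us numpy=%.1f us" % (
        n, best_time(pythonsum, n) * 1e6, best_time(numpysum, n) * 1e6))

# timedelta.microseconds is only the sub-second part of the interval;
# total_seconds() covers the whole interval.
delta = timedelta(seconds=2, microseconds=5)
print(delta.microseconds)
print(delta.total_seconds())
```

The absolute numbers depend on the machine and the library versions, so only the ratio between the two columns is meaningful.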
Scientists, data analysts, and engineers are used to experimenting. IPython was created by scientists with experimentation in mind. The interactive environment that IPython provides is viewed by many as a direct answer to MATLAB, Mathematica, and Maple.
The following is a list of features of the IPython shell:
Tab completion, which helps you find a command
History mechanism
Inline editing
Ability to call external Python scripts with %run
Access to system commands
The pylab switch
Access to the Python debugger and profiler
The following list describes how to use the IPython shell:
The pylab switch: The pylab switch automatically imports all the SciPy, NumPy, and matplotlib packages. Without this switch, we would have to import these packages ourselves.
All we need to do is enter the following instruction on the command line:
$ ipython --pylab
Type "copyright", "credits" or "license" for more information.

IPython 2.0.0-dev -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

Welcome to pylab, a matplotlib-based Python environment [backend: MacOSX].
For more information, type 'help(pylab)'.

In [1]: quit()
Saving a session: We might want to be able to go back to our experiments. In IPython, it is easy to save a session for later use, with the following command:
In [1]: %logstart
Activating auto-logging. Current session state plus future input saved.
Filename       : ipython_log.py
Mode           : rotate
Output logging : False
Raw input log  : False
Timestamping   : False
State          : active
Logging can be switched off as follows:
In [9]: %logoff
Switching logging OFF
Executing system shell command: Execute a system shell command in the default IPython profile by prefixing the command with the ! symbol. For instance, the following input will get the current date:
In [1]: !date
In fact, any line prefixed with ! is sent to the system shell. Also, we can store the command output as shown here:
In [2]: thedate = !date
In [3]: thedate
Displaying history: We can show the history of commands with the %hist command, for example:
In [1]: a = 2 + 2
In [2]: a
Out[2]: 4
In [3]: %hist
a = 2 + 2
a
%hist
This is a common feature in Command Line Interface (CLI) environments. We can also search through the history with the -g switch as follows:
In [5]: %hist -g a = 2
1: a = 2 + 2
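The ! syntax for running shell commands only works inside IPython. In a regular Python script, a rough equivalent of capturing command output is the standard library's subprocess module. The following is a minimal sketch under that assumption; the helper name run_command is our own, and the running interpreter is used as a portable stand-in for a command such as date.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return its standard output as a list of lines."""
    output = subprocess.check_output(args)
    return output.decode().splitlines()


# Use the running interpreter as a portable stand-in for a command such as date.
lines = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(lines)
```

Like the value stored by thedate = !date, the result is a sequence of output lines, so it can be indexed or looped over.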
We saw a number of so-called magic functions in action. These functions start with the % character. If the magic function is used on a line by itself, the % prefix is optional.
When we are in IPython's pylab mode ($ ipython --pylab), we can open manual pages for NumPy functions with the help command. It is not necessary to know the name of a function. We can type a few characters and then let tab completion do its work. Let's, for instance, browse the available information for the arange() function.
We can browse the available information in either of the following two ways:
Calling the help function: Call the help command. Type in a few characters of the function and press the Tab key.
Querying with a question mark: Another option is to append a question mark to the function name. You will then, of course, need to know the function name, but you don't have to type help, for example:
In [3]: arange?
Tab completion is dependent on readline, so you need to make sure that it is installed. It can be installed with setuptools or pip with one of the following commands:
$ easy_install readline
$ pip install readline
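The same documentation is available outside IPython, because it comes from the docstring of the function. As a minimal sketch, the standard library's inspect module can fetch it in a plain Python script; printing only the first five lines is our own choice to keep the output short.

```python
import inspect

import numpy as np

# inspect.getdoc returns the docstring with its indentation cleaned up.
doc = inspect.getdoc(np.arange)

# Show only the first few lines to keep the output short.
print("\n".join(doc.splitlines()[:5]))
```

Calling help(np.arange) in the standard Python shell displays the full text.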
If you have browsed the Internet looking for information on Python, it is very likely that you have seen IPython notebooks. These are web pages with text, charts, and Python code in a special format. Have a look at these notebook collections at the following links:
Often, the notebooks are used as an educational tool or to demonstrate Python software. We can import or export notebooks either from plain Python code or using the special notebook format. The notebooks can be run locally, or we can make them available online by running a dedicated notebook server. Certain cloud computing solutions, such as Wakari and PiCloud, allow you to run notebooks in the Cloud. Cloud computing is one of the topics of Chapter 11, Environments Outside the Python Ecosystem and Cloud Computing.
The main documentation website for NumPy and SciPy is at http://docs.scipy.org/doc/. Through this web page, we can browse the NumPy reference guide at http://docs.scipy.org/doc/numpy/reference/ and the user guide as well as several tutorials.
The popular Stack Overflow software development forum has hundreds of questions tagged numpy. To view them, go to http://stackoverflow.com/questions/tagged/numpy.
This might be stating the obvious, but numpy can also be substituted with scipy, ipython, or almost anything of interest. If you are really stuck with a problem or you want to be kept informed of NumPy development, you can subscribe to the NumPy discussion mailing list. The e-mail address is <numpy-discussion@scipy.org>. The number of e-mails per day is not too high, and there is almost no spam to speak of. Most importantly, developers actively involved with NumPy also answer questions asked on the discussion group. The complete list can be found at http://www.scipy.org/Mailing_Lists.
For IRC users, there is an IRC channel on irc://irc.freenode.net. The channel is called #scipy, but you can also ask NumPy questions since SciPy users also have knowledge of NumPy, as SciPy is based on NumPy. There are at least 50 members on the SciPy channel at all times.
In this chapter, we installed NumPy, SciPy, matplotlib, and IPython, which we will be using in the tutorials. We got a vector addition program working and saw that NumPy outperformed plain Python in our timings. In addition, we explored the available documentation and online resources.
In the next chapter, Chapter 2, NumPy Arrays, we will take a look under the hood of NumPy and explore some fundamental concepts including arrays and data types.