# Python Data Science Essentials - Second Edition

4.3 (6 reviews total)
By Alberto Boschetti , Luca Massaron
What do you get with a Packt Subscription?

• Constantly updated with 100+ new titles each month
• Breadth and depth in over 1,000+ technologies

Fully expanded and upgraded, the second edition of Python Data Science Essentials takes you through all you need to know to suceed in data science using Python. Get modern insight into the core of Python data, including the latest versions of Jupyter notebooks, NumPy, pandas and scikit-learn. Look beyond the fundamentals with beautiful data visualizations with Seaborn and ggplot, web development with Bottle, and even the new frontiers of deep learning with Theano and TensorFlow.

Dive into building your essential Python 3.5 data science toolbox, using a single-source approach that will allow to to work with Python 2.7 as well. Get to grips fast with data munging and preprocessing, and all the techniques you need to load, analyse, and process your data. Finally, get a complete overview of principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.

Publication date:
October 2016
Publisher
Packt
Pages
378
ISBN
9781786462138

## Chapter 1. First Steps

Whether you are an eager learner of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, in writing general-purpose computer programs in Python, or in some other data analysis-specific language such as MATLAB or R.

This book will delve directly into Python for data science, providing you with a straight and fast route to solve various data science problems using Python and its powerful data analysis and machine learning packages. The code examples that are provided in this book don't require you to be a master of Python. However, they will assume that you at least know the basics of Python scripting, including data structures such as lists and dictionaries, and the workings of class objects. If you don't feel confident about these subjects or have minimal knowledge of the Python language, before reading this book, we suggest that you take an online tutorial. There are many possible choices, but we suggest starting with the suggestions from the official beginner's guide to Python from the Python Foundation or directly going to the free Code Academy course at https://www.codecademy.com/learn/python. Using Code Academy's tutorial, or any other alternative you may find useful, in a matter of a few hours of study, you should acquire all the building blocks that will ensure you enjoy this book to the fullest. We have also prepared a tutorial of our own, which can be found in the last part of this book, in order to provide an integration of the two aforementioned free courses.

In any case, don't be intimidated by our starting requirements; mastering Python enough for data science applications isn't as arduous as you may think. It's just that we have to assume some basic knowledge on the reader's part because our intention is to go straight to the point of doing data science without having to explain too much about the general aspects of the language that we will be using.

Are you ready, then? Let's start!

In this introductory chapter, we will work out the basics to set off in full swing and go through the following topics:

• How to set up a Python data science toolbox

• Using your browser as an interactive notebook, to code with Python using Jupyter

• An overview of the data that we are going to study in this book

## Introducing data science and Python

Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. Its components include linear algebra, statistical modeling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.

Data science is a new domain and you have to take into consideration that currently its frontiers are still somewhat blurred and dynamic. Since data science is made of various constituent sets of disciplines, please also keep in mind that there are different profiles of data scientists depending on their competencies and areas of expertise.

In such a situation, what can be the best tool of the trade that you can learn and effectively use in your career as a data scientist? We believe that the best tool is Python, and we intend to provide you with all the essential information that you will need for a quick start.

In addition, other tools such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science. However, Python really completes your data scientist skill set. This multipurpose language is suitable for both development and production alike; it can handle small- to large-scale data problems and it is easy to learn and grasp no matter what your background or experience is.

Created in 1991 as a general-purpose, interpreted, and object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to have uncountable and fast experimentations, easy theory development, and prompt deployment of scientific applications.

At present, the core Python characteristics that render it an indispensable data science tool are as follows:

• It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you may need in the course of a data analysis, and sometimes even more.

• Python can easily integrate different tools and offers a truly unifying ground for different languages, data strategies, and learning algorithms that can be fitted together easily and which can concretely help data scientists forge powerful solutions. There are packages that allow you to call code in other languages (in Java, C, Fortran, R, or Julia), outsourcing some of the computations to them and improving your script performance.

• It is very versatile. No matter what your programming background or style is (object-oriented, procedural, or even functional), you will enjoy programming with Python.

• It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux (even on small-sized distributions, suitable for IoT on tiny-PCs like Raspberry Pi, Arduino and so on), and Mac OS systems. You won't have to worry all that much about portability.

• Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language). Moreover, there are also static compilers such as Cython or just-in-time compilers such as PyPy that can transform Python code into C for higher performance.

• It can work with large in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using various iterations and reiterations of data wrangling.

• It is very simple to learn and use. After you grasp the basics, there's no better way to learn more than by immediately starting with the coding.

• Moreover, the number of data scientists using Python is continuously growing: new packages and improvements have been released by the community every day, making the Python ecosystem an increasingly prolific and rich language for data science.

## Installing Python

First, let's proceed to introduce all the settings you need in order to create a fully working data science environment to test the examples and experiment with the code that we are going to provide you with.

Python is an open source, object-oriented, and cross-platform programming language. Compared to some of its direct competitors (for instance, C++ or Java), Python is very concise. It allows you to build a working software prototype in a very short time. Yet it has become the most used language in the data scientist's toolbox not just because of that. It is also a general-purpose language, and it is very flexible due to a variety of available packages that solve a wide spectrum of problems and necessities.

### Python 2 or Python 3?

There are two main branches of Python: 2.7.x and 3.x. At the time of writing this second edition of the book, the Python Foundation (https://www.python.org/) is offering downloads for Python version 2.7.11 and 3.5.1. Although the third version is the newest, the older one is still the most used version in the scientific area, since a few packages (check the website at http://py3readiness.org/ for a compatibility overview) won't run otherwise yet.

In addition, there is no immediate backward compatibility between Python 3 and 2. In fact, if you try to run some code developed for Python 2 with a Python 3 interpreter, it may not work. Major changes have been made to the newest version, and that has affected past compatibility. Some data scientists, having built most of their work on Python 2 and its packages, are reluctant to switch to the new version.

In this second edition of the book, we intend to address a growing audience of data scientists, data analysts, and developers, who may not have such a strong legacy with Python 2. Thus, we agreed that it would be better to work with Python 3 rather than the older version. We suggest using a version such as Python 3.4 or above. After all, Python 3 is the present and the future of Python. It is the only version that will be further developed and improved by the Python Foundation and it will be the default version of the future on many operating systems.

Anyway, if you are currently working with version 2 and you prefer to keep on working with it, you can still use this book and all its examples. In fact, for the most part, our code will simply work on Python 2 after having the code itself preceded by these imports:

from __future__ import (absolute_import, division,
print_function, unicode_literals)
from builtins import *
from future import standard_library
standard_library.install_aliases()


### Tip

The from __future__ import commands should always occur at the beginning of your scripts or else you may experience Python reporting an error.

As described in the Python-future website (http://python-future.org/), these imports will help convert several Python 3-only constructs to a form compatible with both Python 3 and Python 2 (and in any case, most Python 3 code should just simply work on Python 2 even without the aforementioned imports).

In order to run the upward commands successfully, if the future package is not already available on your system, you should install it (version >= 0.15.2) using the following command to be executed from a shell:

### The installation of packages

Python won't come bundled with all you need, unless you take a specific premade distribution. Therefore, to install the packages you need, you can use either pip or easy_install. Both these two tools run in the command line and make the process of installation, upgrade, and removal of Python packages a breeze. To check which tools have been installed on your local machine, run the following command:

$> pip  ### Note To install pip, follow the instructions given at https://pip.pypa.io/en/latest/installing/. Alternatively, you can also run this command: $> easy_install


If both of these commands end up with an error, you need to install any one of them. We recommend that you use pip because it is thought of as an improvement over easy_install. Moreover, easy_install is going to be dropped in future and pip has important advantages over it. It is preferable to install everything using pip because:

• It is the preferred package manager for Python 3. Starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers.

• It provides an uninstall functionality.

• It rolls back and leaves your system clear if, for whatever reason, the package installation fails.

### Tip

Using easy_install in spite of the advantages of pip makes sense if you are working on Windows because pip won't always install pre-compiled binary packages. Sometimes it will try to build the package's extensions directly from C source, thus requiring a properly configured compiler (and that's not an easy task on Windows). This depends if the package is running on eggs, Python metadata files for distributing code as bundles, (and pip cannot directly use their binaries, but it needs to build from their source code) or wheels, the new standard for Python distribution of code bundles .(In this last case, pip can install binaries if available, as explained here: http://pythonwheels.com/). Instead, easy_install will always install available binaries from eggs and wheels. Therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day (at some price anyway, as we just mentioned in the list).

The most recent versions of Python should already have pip installed by default. Therefore, you may have it already installed on your system. If not, the safest way is to download the get-pi.py script from https://bootstrap.pypa.io/get-pip.py and then run it using the following:

$> python get-pip.py  The script will also install the setup tool from https://pypi.python.org/pypi/setuptools, which also contains easy_install. You're now ready to install the packages you need in order to run the examples provided in this book. To install the < package-name > generic package, you just need to run this command: $> pip install < package-name >


Alternatively, you can run the following command:

$> easy_install < package-name >  Note that in some systems, pip might be named as pip3 and easy_install as easy_install-3 to stress the fact that both operate on packages for Python 3. If you're unsure, check the version of Python pip is operating on with: $> pip -V


For easy_install, the command is slightly different:

$> easy_install --version  After this, the <pk> package and all its dependencies will be downloaded and installed. If you're not certain whether a library has been installed or not, just try to import a module inside it. If the Python interpreter raises an ImportError error, it can be concluded that the package has not been installed. This is what happens when the NumPy library has been installed: >>> import numpy  This is what happens if it's not installed: >>> import numpy Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named numpy  In the latter case, you'll need to first install it through pip or easy_install. ### Note Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn. Finally, to search and browse the Python packages available for Python, look at https://pypi.python.org/pypi. ### Package upgrades More often than not, you will find yourself in a situation where you have to upgrade a package because either the new version is required by a dependency or it has additional features that you would like to use. First, check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example, numpy: >>> import numpy >>> numpy.__version__ # 2 underscores before and after '1.9.2'  Now, if you want to update it to a newer release, say the 1.11.0 version, you can run the following command from the command line: $> pip install -U numpy==1.11.0


Alternatively, you can use the following command:

$> easy_install --upgrade numpy==1.11.0  Finally, if you're interested in upgrading it to the latest available version, simply run this command: $> pip install -U numpy


You can alternatively run the following command:

$> easy_install --upgrade numpy  ### Scientific distributions As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need. Sometimes, the installation procedures may not go as smoothly as you'd hoped for earlier, requiring the user to do extra steps, to install additional executables (like, in Linux boxes, gFortran for Scipy) or libraries (like libfreetype for Matplotlib). Usually, the backtrace of the error produced during the failed installation is clear enough to understand what went wrong and to take the correct resolving action, but at other times, the error is tricky or subtle, holding up the user for hours without advancing in the process. If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use the scientific Python distribution. Apart from Python, they also include a variety of preinstalled packages, and sometimes, they even have additional tools and an IDE. A few of them are very well known among data scientists, and in the sections that follow, you will find some of the key features of each of these packages. We suggest that you first promptly download and install a scientific distribution, such as Anaconda (which is the most complete one), and after practicing the examples in the book, decide to fully uninstall the distribution and set up Python alone, which can be accompanied by just the packages you need for your projects. #### Anaconda Anaconda (http://continuum.io/downloads) is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, which comprises NumPy, SciPy, pandas, Jupyter, Matplotlib, Scikit-learn, and NLTK. It's a cross-platform distribution (Windows, Linux, and Mac OS X) that can be installed on machines with other existing Python distributions and versions. Its base version is free; instead, add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on the website, Anaconda's goal is to provide enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing. #### Leveraging conda to install packages If you've decided to install an Anaconda distribution, you can take advantage of the conda binary installer we mentioned previously. Anyway, conda is an open source package management system, and consequently it can be installed separately from an Anaconda distribution. You can test immediately whether conda is available on your system. Open a shell and digit: $> conda -V


If conda is available, there will appear the version of your conda; otherwise an error will be reported. If conda is not available, you can quickly install it on your system by going to http://conda.pydata.org/miniconda.html and installing the Miniconda software suitable for your computer. Miniconda is a minimal installation that only includes conda and its dependencies.

conda can help you manage two tasks: installing packages and creating virtual environments. In this section, we will explore how conda can help you easily install most of the packages you may need in your data science projects.

$> conda update conda  Now you can install any package you need. To install the <package-name> generic package, you just need to run the following command: $> conda install <package-name>


You can also install a particular version of the package just by pointing it out:

$> conda install <package-name>=1.11.0  Similarly, you can install multiple packages at once by listing all their names: $> conda install <package-name-1> <package-name-2>


If you just need to update a package that you previously installed, you can keep on using conda:

$> conda update <package-name>  You can update all the available packages simply by using the --all argument: $> conda update --all


Finally, conda can also uninstall packages for you:

$> conda remove <package-name>  If you would like to know more about conda, you can read its documentation at http://conda.pydata.org/docs/index.html. In summary, as a main advantage, it handles binaries even better than easy_install (by always providing a successful installation on Windows without any need to compile the packages from source) but without its problems and limitations. With the use of conda, packages are easy to install (and installation is always successful), update, and even uninstall. On the other hand, conda cannot install directly from a git server (so it cannot access the latest version of many packages under development) and it doesn't cover all the packages available on PyPI as pip itself. #### Enthought Canopy Enthought Canopy (https://www.enthought.com/products/canopy/) is a Python distribution by Enthought Inc. It includes more than 200 preinstalled packages, such as NumPy, SciPy, Matplotlib, Jupyter, and pandas (more on these packages later). This distribution is targeted at engineers, data scientists, quantitative and data analysts, and enterprises. Its base version is free (which is named Canopy Express), but if you need advanced features, you have to buy a front version. It's a multiplatform distribution and its command-line install tool is canopy_cli. #### PythonXY PythonXY (http://python-xy.github.io/) is a free, open source Python distribution maintained by the community. It includes a number of packages, which include NumPy, SciPy, NetworkX, Jupyter, and Scikit-learn. It also includes Spyder, an interactive development environment inspired by the MATLAB IDE. The distribution is free. It works only on Microsoft Windows, and its command-line installation tool is pip. #### WinPython WinPython (http://winpython.sourceforge.net/) is also a free, open-source Python distribution maintained by the community. It is designed for scientists, and includes many packages such as NumPy, SciPy, Matplotlib, and Jupyter. It also includes Spyder as an IDE. It is free and portable. You can put WinPython into any directory, or even into a USB flash drive, and at the same time maintain multiple copies and versions of it on your system. It works only on Microsoft Windows, and its command-line tool is the WinPython Package Manager (WPPM). ### Explaining virtual environments No matter whether you have chosen installing a standalone Python or instead you used a scientific distribution, you may have noticed that you are actually bound on your system to the Python's version you have installed. The only exception, for Windows users, is to use a WinPython distribution, since it is a portable installation and you can have as many different installations as you need. A simple solution to break free of such a limitation is to use virtualenv, which is a tool to create isolated Python environments. That means that, by using different Python environments, you can easily achieve these things: • Testing any new package installation or doing experimentation on your Python environment without any fear of breaking anything in an irreparable way. In this case, you need a version of Python that acts as a sandbox. • Having at hand multiple Python versions (both Python 2 and Python 3), geared with different versions of installed packages. This can help you in dealing with different versions of Python for different purposes (for instance, some of the packages we are going to present on Windows OS only work using Python 3.4, which is not the latest release). • Taking a replicable snapshot of your Python environment easily and having your data science prototypes work smoothly on any other computer or in production. In this case, your main concern is the immutability and replicability of your working environment. You can find documentation about virtualenv at http://virtualenv.readthedocs.io/en/stable/, though we are going to provide you with all the directions you need to start using it immediately. In order to take advantage of virtualenv, you have first to install it on your system: $> pip install virtualenv


After the installation completes, you can start building your virtual environments. Before proceeding, you have to take a few decisions:

• If you have more versions of Python installed on your system, you have to decide which version to pick up. Otherwise, virtualenv will take the Python version virtualenv was installed by on your system. In order to set a different Python version, you have to digit the argument -p followed by the version of Python you want or insert the path of the Python executable to be used (for instance, -p python2.7) or just point to a Python executable such as -p c:\Anaconda2\python.exe.

• With virtualenv, when required to install a certain package, it will install it from scratch, even if it is already available at a system level (on the Python directory you created the virtual environment from). This default behavior makes sense because it allows you to create a completely separated empty environment. In order to save disk space and limit the time of installation of all the packages, you may instead decide to take advantage of already available packages on your system by using the argument --system-site-packages.

• You may want to be able to later move around your virtual environment across Python installations, even among different machines. Therefore, you may want to make the functioning of all of the environment's scripts relative to the path it is placed in by using the argument --relocatable.

After deciding on the Python version, the linking to existing global packages, and the relocability of the virtual environment, in order to start, you just launch the command from a shell. Declare the name you would like to assign to your new environment:

$> virtualenv clone  virtualenv will just create a new directory using the name you provided, in the path from which you actually launched the command. To start using it, you just enter the directory and digit activate: $> cd clone
$> activate  At this point, you can start working on your separated Python environment, installing packages and working with code. If you need to install multiple packages at once, you may need some special function from pip—pip freeze—which will enlist all the packages (and their versions) you have installed on your system. You can record the entire list in a text file by this command: $> pip freeze > requirements.txt


After saving the list in a text file, just take it into your virtual environment and install all the packages in a breeze with a single command:

$> pip install -r requirements.txt  Each package will be installed according to the order in the list (packages are listed in a case-insensitive sorted order). If a package requires other packages that are later in the list, that's not a big deal because pip automatically manages such situations. So if your package requires Numpy and Numpy is not yet installed, pip will install it first. When you're finished installing packages and using your environment for scripting and experimenting, in order to return to your system defaults, just issue this command: $> deactivate


If you want to remove the virtual environment completely, after deactivating and getting out of the environment's directory, you just have to get rid of the environment's directory itself by a recursive deletion. For instance, on Windows you just do this:

$> rd /s /q clone  On Linux and Mac, the command will be: $> rm -r -f clone


### Tip

If you are working extensively with virtual environments, you should consider using virtualenvwrapper, which is a set of wrappers for virtualenv, in order to help you manage multiple virtual environments easily. It can be found at http://bitbucket.org/dhellmann/virtualenvwrapper. If you are operating on a Unix system (Linux or OS X), another solution we have to quote is pyenv (which can be found at https://github.com/yyuu/pyenv), which lets you set your main Python version, allows installation of multiple versions and creates virtual environments. Its peculiarity is that it does not depend on Python to be installed and it works perfectly at the user-level (no need for sudo commands).

#### conda for managing environments

If you have installed the Anaconda distribution, or you have tried conda using a Miniconda installation, you can also take advantage of the conda command to run virtual environments as an alternative to virtualenv. Let's see in practice how to use conda for that. We can check what environments we have available like this:

>$conda info -e  This command will report to you what environments you can use on your system based on conda. Most likely, your only environment will be just root, pointing to your Anaconda distribution's folder. As an example, we can create an environment based on Python version 3.4, having all the necessary Anaconda-packaged libraries installed. That makes sense, for instance, for using the package Theano together with Python 3 on Windows (because of an issue we will explain in shortly). In order to create such an environment, just do this: $> conda create -n python34 python=3.4 anaconda


The command asks for a particular Python Version 3.4 and requires the installation of all packages available on the Anaconda distribution (the argument anaconda). It names the environment as python34 using the argument -n. The complete installation will take a while, given the large number of packages in the Anaconda installation. After having completed all of the installation, you can activate the environment:

$> activate python34  If you need to install additional packages to your environment, when activated, you just do the following: $> conda install -n python34 <package-name1> <package-name2>


That is, you make the list of the required packages follow the name of your environment. Naturally, you can also use pip install, as you would do in a virtualenv environment.

You can also use a file instead of listing all the packages by name yourself. You can create a list in an environment using the list argument and piping the output to a file:

$> conda list -e > requirements.txt  Then, in your target environment, you can install the entire list using: $> conda install --file requirements.txt


You can even create an environment, based on a requirements list:

$> conda create -n python34 python=3.4 --file requirements.txt  Finally, after having used the environment, to close the session, you simply do this: $> deactivate


Contrary to virtualenv, there is a specialized argument in order to completely remove an environment from your system:

$> conda remove -n python34 --all  ### A glance at the essential packages We mentioned that the two most relevant characteristics of Python are its ability to integrate with other languages and its mature package system, which is well embodied by PyPI (see the Python Package Index at: https://pypi.python.org/pypi), a common repository for the majority of Python open source packages that is constantly maintained and updated. The packages that we are now going to introduce are strongly analytical and they will constitute a complete data science toolbox. All the packages are made up of extensively tested and highly optimized functions for both memory usage and performance, ready to achieve any scripting operation with successful execution. A walkthrough on how to install them is provided in the following section. Partially inspired by similar tools present in R and MATLAB environments, we will together explore how a few selected Python commands can allow you to efficiently handle data and then explore, transform, experiment, and learn from the same without having to write too much code or reinvent the wheel. #### NumPy NumPy, which is Travis Oliphant's creation, is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to operate a multiplicity of mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Characterized by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems: • Website:http://www.numpy.org/ • Version at the time of print: 1.11.0 • Suggested install command: pip install numpy As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np: import numpy as np  We will be doing this throughout the course of this book. #### SciPy An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more: • Website: http://www.scipy.org/ • Version at time of print: 0.17.1 • Suggested install command: pip install scipy #### pandas The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its specific data structures, namely DataFrames and Series, pandas allows you to handle complex tables of data of different types (which is something that NumPy's arrays cannot do) and time series. Thanks to Wes McKinney's creation, you will be able easily and smoothly to load data from a variety of sources. You can then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize your data at will: • Website: http://pandas.pydata.org/ • Version at the time of print: 0.18.1 • Suggested install command: pip install pandas Conventionally, pandas is imported as pd: import pandas as pd  #### Scikit-learn Started as part of the SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations on Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Expect us to talk at length about this package throughout this book. Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at INRA (French Institute for Research in Computer Science and Automation): ### Note Note that the imported module is named sklearn. #### Jupyter A scientific approach requires the fast experimentation of different hypotheses in a reproducible fashion. Initially named IPython and limited to working only with the Python language, Jupyter was created by Fernando Perez in order to address the need for an interactive command shell for several languages (based on shell, web browser, and application interface), featuring graphical integration, customizable commands, rich history (in the JSON format), and computational parallelism for an enhanced performance. Jupyter is our favored choice throughout this book, and it is used to clearly and effectively illustrate operations with scripts and data and the consequent results. We will devote a section of this chapter to explain in detail the characteristics of its interface and describing how it can turn into a precious tool for any data scientist: • Website: http://jupyter.org/ • Version at the time of print: 1.0.0 (ipykernel = 4.3.1) • Suggested install command: pip install jupyter #### Matplotlib Originally developed by John Hunter, matplotlib is a library that contains all the building blocks that are required to create quality plots from arrays and to visualize them interactively. You can find all the MATLAB-like plotting frameworks inside the pylab module: • Website: http://matplotlib.org/ • Version at the time of print: 1.5.1 • Suggested install command: pip install matplotlib You can simply import what you need for your visualization purposes with the following command: import matplotlib.pyplot as plt  ### Tip Downloading the example code You can download the example code files from your account at www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files e-mailed directly to you. #### Statsmodels Previously part of SciKits, statsmodels was thought to be a complement to SciPy's statistical functions. It features generalized linear models, discrete choice models, time series analysis, and a series of descriptive statistics as well as parametric and nonparametric tests: • Version at the time of print: 0.6.1 • Suggested install command: pip install statsmodels #### Beautiful Soup Beautiful Soup, a creation of Leonard Richardson, is a great tool to scrap out data from HTML and XML files retrieved from the Internet. It works incredibly well, even in the case of tag soups (hence the name), which are collections of malformed, contradictory, and incorrect tags. After choosing your parser (the HTML parser included in Python's standard library works fine), thanks to Beautiful Soup, you can navigate through the objects in the page and extract text, tables, and any other information that you may find useful: • Version at the time of print: 4.4.1 • Suggested install command: pip install beautifulsoup4 ### Note Note that the imported module is named bs4. #### NetworkX Developed by the Los Alamos National Laboratory, NetworkX is a package specialized in the creation, manipulation, analysis, and graphical representation of real-life network data (it can easily operate with graphs made up of a million nodes and edges). Besides specialized data structures for graphs and fine visualization methods (2D and 3D), it provides the user with many standard graph measures and algorithms, such as the shortest path, centrality, components, communities, clustering, and PageRank. We will mainly use this package in Chapter 5, Social Network Analysis: • Website: http://networkx.github.io/ • Version at the time of print: 1.11 • Suggested install command: pip install networkx Conventionally, NetworkX is imported as nx: import networkx as nx  #### NLTK The Natural Language Toolkit (NLTK) provides access to corpora and lexical resources and to a complete suite of functions for statistical Natural Language Processing (NLP), ranging from tokenizers to part-of-speech taggers and from tree models to named-entity recognition. Initially, Steven Bird and Edward Loper created the package as an NLP teaching infrastructure for their course at the University of Pennsylvania. Now, it is a fantastic tool that you can use to prototype and build NLP systems: • Website: http://www.nltk.org/ • Version at the time of print: 3.2.1 • Suggested install command: pip install nltk #### Gensim Gensim, programmed by Radim Řehůřek, is an open source package that is suitable for the analysis of large textual collections with the help of parallel distributable online algorithms. Among advanced functionalities, it implements Latent Semantic Analysis (LSA), topic modeling by Latent Dirichlet Allocation (LDA), and Google's word2vec, a powerful algorithm that transforms text into vector features that can be used in supervised and unsupervised machine learning. #### PyPy PyPy is not a package; it is an alternative implementation of Python 2.7.8 that supports most of the commonly used Python standard packages (unfortunately, NumPy is currently not fully supported). As an advantage, it offers enhanced speed and memory handling. Thus, it is very useful for heavy duty operations on large chunks of data and it should be part of your big data handling strategies: #### XGBoost XGBoost is a scalable, portable, and distributed gradient boosting library (a tree ensemble algorithm). Initially created by Tianqi Chen from Washington University, it has been enriched by a Python wrapper by Bing Xu and an R interface by Tong He (you can read the story behind XGBoost directly from its principal creator at http://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html). XGBoost is available for Python, R, Java, Scala, Julia, and C++, and it can work on a single machine (leveraging multithreading) in both Hadoop and Spark clusters: Detailed instructions for installing XGBoost on your system can be found at this page: https://github.com/dmlc/xgboost/blob/master/doc/build.md. The installation of XGBoost on both Linux and macOS is quite straightforward, whereas it is a little bit trickier for Windows users. On a Posix system, you just have to build the executable with make, but on Windows things are a little bit more tricky. For this reason, we provide specific installation steps to get XGBoost working on Windows: 1. First download and install Git for Windows (https://git-for-windows.github.io/). 2. Then you need a MINGW compiler present on your system. You can download it from http://www.mingw.org/ accordingly to the characteristics of your system. 3. From the command line, execute: $> git clone --recursive https://github.com/dmlc/xgboost
$> cd xgboost$> git submodule init
$> git submodule update  4. Then, always from the command line, copy the configuration for 64-byte systems to be the default one: $> copy make\mingw64.mk config.mk

5. Alternatively, you just copy the plain 32-byte version:

$> copy make\mingw.mk config.mk  6. After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling procedure: $> mingw32-make -j4

7. In MinGW, the make command comes with the name mingw32-make. If you are using a different compiler, the previous command may not work; then you can simply try:

$> make -j4  8. Finally, if the compiler completes its work without errors, you can install the package in your Python with this: $> cd python-package
$> python setup.py install  ### Tip After following all the preceding instructions, if you try to import XGBoost in Python and yet it doesn't load and results in an error, it may well be that Python cannot find the MinGW's g++ runtime libraries. You just need to find the location on your computer of MinGW's binaries (in our case, it was in C:\mingw-w64\mingw64\bin; just modify the next code to insert yours) and place the following code snippet before importing XGBoost: import os mingw_path = 'C:\\mingw-w64\\mingw64\\bin' os.environ['PATH']=mingw_path + ';' + os.environ['PATH'] import xgboost as xgb Depending on the state of the XGBoost project, similarly to many other projects under continuous development, the preceding installation commands may or may not temporarily work at the time you try them. Usually waiting for an update of the project or opening an issue with the authors of the package may solve the problem. In any case, we provide for our readers to download the version we have compiled on Windows and for those we used for the examples in this book. You can download it from Packt Publishing website (as pointed out in the preface). #### Theano Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Basically, it provides you with all the building blocks you need to create deep neural networks. Created by academics (an entire development team; you can read their names on their most recent paper at http://arxiv.org/pdf/1605.02688.pdf), Theano has been used for large scale and intensive computations since 2007: • Release at the time of print: 0.8.2 In spite of many installation problems experienced by users in the past (especially Windows users), the installation of Theano should be straightforward, the package being now available on PyPI: $> pip install Theano


If you want the most updated version of the package, you can get it by GitHub cloning:

$> git clone git://github.com/Theano/Theano.git  Then you can proceed with direct Python installation: $> cd Theano
$> python setup.py install  To test your installation, you can run the following commands from the shell/CMD and verify the reports: $> pip install nose
$> pip install nose-parameterized$> nosetests theano


If you are working on a Windows OS and the previous instructions don't work, you can try these steps using the conda command provided by the Anaconda distribution:

1. Install TDM GCC x64 (this can be found at http://tdm-gcc.tdragon.net/)

2. Open an Anaconda prompt interface and execute:

$> conda update conda$> conda update --all
$> conda install mingw libpython$> pip install git+git://github.com/Theano/Theano.git


### Tip

Theano needs libpython, which isn't compatible yet with the version 3.5. So if your Windows installation is not working, this could be the likely cause. Anyway, Theano installs perfectly on Python version 3.4. Our suggestion in this case is to create a virtual Python environment based on version 3.4, install, and use Theano only on that specific version. Directions on how to create virtual environments are provided in the paragraph about virtualenv and conda create.

In addition, Theano's website provides some information to Windows users; it could support you when everything else fails: http://deeplearning.net/software/theano/install_windows.html.

An important requirement for Theano to scale out on GPUs is to install Nvidia CUDA drivers and SDK for code generation and execution on GPUs. If you do not know too much about the CUDA Toolkit, you can actually start from this web page in order to understand more about the technology being used: https://developer.nvidia.com/cuda-toolkit.

Therefore, if your computer has an NVidia GPU, you can find all the necessary instructions in order to install CUDA using this tutorial page from NVidia itself: http://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html#axzz4Msw9qwJZ.

#### Keras

Keras is a minimalist and highly modular neural networks library, written in Python and capable of running on top of either Theano or TensorFlow (the source software library for numerical computation released by Google). Keras was created by François Chollet, a machine learning researcher working at Google:

• Website:  https://keras.io/

• Version at the time of print: 1.0.3

• Suggested installation from PyPI: $> pip install keras As an alternative, you can install the latest available version (which is advisable since the package is in continuous development) using the following command: $> pip install git+git://github.com/fchollet/keras.git


## Introducing Jupyter

As previously mentioned, Jupyter deserves more than a brief presentation. We are going to delve fully in detail about its history, installation, and usage for data science. Initially known as IPython, the project was initiated in 2001 as a free project by Fernando Perez. With this work, the author intended to address a deficiency in the Python stack and provide to the public a user-programming interface for data investigations that could easily incorporate the scientific approach (mainly meaning experimenting and interactively discovering) in the process of data discovery and development of data science solutions.

A scientific approach implies fast experimentation of different hypotheses in a reproducible fashion (as does data exploration and analysis in data science). When using this interface, you will be able more naturally to implement an explorative, iterative, trial and error research strategy during your code writing.

Recently (during Spring 2015), a large part of the IPython project was moved to a new one called Jupyter. This new project extends the potential usability of the original IPython interface to a wide range of programming languages, such as:

For a more complete list of available kernels for Jupyter, please visit the page at: https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages.

For instance, once having installed Jupyter and its IPython kernel, you can easily add another useful kernel, such as the R kernel, in order to access through the same interface the R language. All you have to do is have an R installation, run your R interface, and enter the following commands:

install.packages(c('pbdZMQ', 'devtools'))
devtools::install_github('IRkernel/repr')
devtools::install_github('IRkernel/IRdisplay')
devtools::install_github('IRkernel/IRkernel')
IRkernel::installspec()


The commands will install the devtools library on your R, then pull and install all the necessary libraries from GitHub (you need to be connected to the Internet while running the other commands), and finally register the R kernel both in your R installation and on Jupyter. After that, every time you call the Jupyter notebook, you will have the choice of running either a Python or an R kernel, allowing you to use the same format and approach for all your data science projects.

### Tip

You cannot mix the same notebook commands for different kernels; each notebook only refers to a single kernel, that is, the one it was initially created with. Consequently on the same notebook you cannot mix languages or even versions of the same language like Python2 and Python3.

Thanks to the powerful idea of kernels, programs that run the user's code communicated by the frontend interface and provide feedback on the results of the executed code to the interface itself, you can use the same interface and interactive programming style no matter what language you are using for development.

In such a context, IPython is the zero kernel, the original starting one, still existing but not intended to be used anymore to refer to the entire project (without the IPython kernel, Jupyter will not even function, even if you have installed another kernel and linked it).

Therefore, Jupyter can simply be described as a tool for interactive tasks operable by a console or by a web-based notebook, which offers special commands that help developers to better understand and build the code that is being currently written.

Contrary to an IDE—which is built around the idea of writing a script, running it afterwards, and finally evaluating its results—Jupyter lets you write your code in chunks, named cells, run each of them sequentially, and evaluate the results of each one separately, examining both textual and graphic outputs. Besides graphical integration, it provides you with further help, thanks to customizable commands, a rich history (in the JSON format), and computational parallelism for an enhanced performance when dealing with heavy numeric computations.

Such an approach is also particularly fruitful for tasks involving developing code based on data, since it automatically accomplishes the often neglected duty of documenting and illustrating how data analysis has been done, its premises and assumptions, and its intermediate and final results. If a part of your job is to also present your work and advocate it to an internal or external stakeholder in the project, Jupyter can really do the magic of storytelling for you with little additional effort.

You can easily combine code, comments, formulas, charts, interactive plots, and rich media such as images and videos, making each Jupyter Notebook a complete scientific sketchpad to find all your experimentations and their results together.

Jupyter works on your favorite browser (which could be Explorer, Firefox, or Chrome, for instance) and, when started, presents a cell waiting for code to be written in. Each block of code enclosed in a cell can be run and its results are reported in the space just after the cell. Plots can be represented in the notebook (inline plot) or in a separate window. In our example, we decided to plot our chart inline.

Moreover, written notes can be written easily using the Markdown language, a very easy and fast-to-grasp markup language (http://daringfireball.net/projects/markdown/). Math formulas can be handled using MathJax (https://www.mathjax.org/) to render any LaTeX script inside HTML/Markdown.

There are several ways to insert LaTeX code in a cell. The easiest way is to simply use the Markdown syntax, wrapping the equations with single $ (dollar sign) for an inline LaTeX formula, or with double dollar sign $$ for a one-line central equation. Remember that to have a correct output, the cell should be set as Markdown. Here's an example. In Markdown: This is a \LaTeX inline equation: x = Ax+b And this is a one-liner:$$x = Ax + b$ This produces the following output: If you're looking for something more elaborate, that is, a formula that spans for more than one line, a table, a series of equations that should be aligned, or simply a use of special LaTeX functions, then it's better to use the %%latex magic command offered by the Jupyter notebook. In this case, the cell must be in code mode and contain the magic command as the first line. The following lines must define a complete LaTeX environment that can be compiled by the LaTeX interpreter. Here are a couple of examples showing what you can do: In: %%latex $|u(t)| = \begin{cases} u(t) & \text{if } t \geq 0 \\ -u(t) & \text{otherwise } \end{cases}$ Out:  In: %%latex \begin{align} f(x) &= (a+b)^2 \\ &= a^2 + (a+b) + (a+b) + b^2 \\ &= a^2 + 2\cdot (a+b) + b^2 \end{align} Out:  Remember that by using the %%latex magic command, the whole cell must comply with the LaTeX syntax. Therefore, if you just need to write a few simple equations in text, we strongly advice you to use the Markdown method. Being able to integrate technical formulas in Markdown is particularly fruitful for tasks involving development of code based on data, since it automatically accomplishes the often neglected duty of documenting and illustrating how data analysis has been managed as well as its premises, assumptions, and intermediate and final results. If a part of your job is to also present your work and persuade internal or external stakeholders in the project, Jupyter can really do the magic of storytelling for you with little additional effort. On the web page https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks, there are many examples, some of which you may find inspiring for your work, as it did for ours. Actually, we have to confess that keeping a cleaned, up-to-date Jupyter Notebook has saved us uncountable times when meetings with managers/stakeholders have suddenly popped up, requiring us to present the state of our work hastily. In short, Jupyter allows you to: • See intermediate (debugging) results for each step of the analysis • Run only some sections (or cells) of the code • Store intermediate results in JSON format and have the ability to do version control on them • Present your work (this will be a combination of text, code, and images), share it via the Jupyter Notebook Viewer service (http://nbviewer.jupyter.org/), and easily export it into a Python script, HTML, LaTeX, Markdown, PDF, or even slideshows (an HTML slideshow to be served by an HTTP server). In the next section, we will discuss Jupyter's installation in more detail and show an example of its usage in a data science task. ### Fast installation and first test usage Jupyter is our favored choice throughout this book. It is used to clearly and effectively illustrate and storytell operations using scripts and data, and their consequent results. Though we strongly recommend using Jupyter, if you are using a REPL or an IDE, you can use the same instructions and expect identical results (but for print formats and extensions of the returned results). If you do not have Jupyter installed on your system, you can promptly set it up using this command: > pip install jupyter


You can find complete instructions about Jupyter installation (covering different operating systems) on this web page: http://jupyter.readthedocs.io/en/latest/install.html

After installation, you can immediately start using Jupyter by calling it from the command line:

$> jupyter notebook  Once the Jupyter instance has opened in the browser, click on the New button; in the Notebooks section, choose Python 3 (other kernels may be present in the section depending on what you installed). At this point, your new empty notebook will look like the next screenshot and you can start entering the commands in the cells. For instance, you may start by typing in the cell: In: print ("This is a test")  After writing in cells, you just press the play button (below the Cell tab) to run it and obtain an output. Then, another cell will appear for your input. As you are writing in a cell, if you press the plus button on the menu bar, you will get a new cell and you can move from one cell to another using the arrows on the menu. Most of the other functions are quite intuitive and we invite you to try them. In order to know better how Jupyter works, you may use a quick start guide such as http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/ or get a book which specializes on Jupyter functionalities. ### Note For a complete treatise of the full range of Jupyter functionalities when running the IPython kernel, refer to these two Packt Publishing books: • IPython Interactive Computing and Visualization Cookbook by Cyrille Rossant, Packt Publishing, September 25, 2014 • Learning IPython for Interactive Computing and Data Visualization by Cyrille Rossant, Packt Publishing, April 25, 2013 For our illustrative purposes, just consider that every Jupyter block of instructions has a numbered input statement and an output one. So you will find the code presented in this book structured in two blocks, at least when the output is not trivial at all. Otherwise, expect only the input part: In: <the code you have to enter> Out: <the output you should get>  As a rule, you just have to type the code after In: in your cells and run it. You can then compare your output with the output that we may provide using Out: followed by the output that we actually obtained on our computers when we tested the code. ### Jupyter magic commands As a special tool for interactive tasks, Jupyter offers special commands that help to better understand the code that you are currently writing. For instance, some of the commands are: • <object>? and <object>??: This prints a detailed description (with ?? being even more verbose) of <object> • %<function>: This uses the special <magic function> Let's demonstrate the usage of these commands with an example. We first start the interactive console with the jupyter command, which is used to run Jupyter from the command line, as shown here: $> jupyter console
Jupyter Console 4.1.1
In [1]: obj1 = range(10)


Then, in the first line of code, which is marked by Jupyter as [1], we create a list of 10 numbers (from 0 to 9), assigning the output to an object named obj1:

In [2]: obj1?
Type:        range
String form: range(0, 10)
Length:      10
Docstring:
range(stop) -> range object
range(start, stop[, step]) -> range object
Return an object that produces a sequence of integers from start
(inclusive)
to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2,
..., j-1.
start defaults to 0, and stop is omitted!  range(4) produces
0, 1, 2, 3.
These are exactly the valid indices for a list of 4 elements.
When step is given, it specifies the increment (or decrement).
In [3]: %timeit x=100
The slowest run took 184.61 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 24.6 ns per loop
In [4]: %quickref


In the next line of code, which is numbered [2], we inspect the obj1 object using the Jupyter command ?. Jupyter introspects the object, prints its details (obj is a range object that can generate the values [1, 2, 3..., 9] and elements), and finally prints some general documentation on the range objects. For complex objects, the usage of ?? instead of ? provides even more verbose output.

In line [3], we use the timeit magic function with a Python assignment (x=100). The timeit function runs this instruction many times and stores the computational time needed to execute it. Finally, it prints the average time that was taken to run the Python function.

We complete the overview with a list of all the possible special Jupyter functions by running the quickref helper function, as shown in line [4].

As you must have noticed, each time we use Jupyter, we have an input cell and, optionally, an output cell if there is something that has to be printed on stdout. Each input is numbered, so it can be referenced inside the Jupyter environment itself. For our purposes, we don't need to provide such references in the code of the book. Therefore, we will just report inputs and outputs without their numbers. However, we'll use the generic In: and Out: notations to point out the input and output cells. Just copy the commands after In: to your own Jupyter cell and expect an output that will be reported on the following Out:.

Therefore, the basic notations will be:

• The In: command

• The Out: output (wherever it is present and useful to be reported in the book)

Otherwise, if we expect you to operate directly on the Python console, we will use the following form:

 >>> command


Wherever necessary, the command-line input and output will be written as follows:

$> command  Moreover, to run the bash command in the Jupyter console, prefix it with a ! (exclamation mark): In: !ls Applications Google Drive Public Desktop Develop Pictures env temp ... In: !pwd /Users/mycomputer  ### How Jupyter Notebooks can help data scientists The main goal of the Jupyter Notebook is easy storytelling. Storytelling is essential in data science because you must have the power to do the following: • See intermediate (debugging) results for each step of the algorithm you're developing • Run only some sections (or cells) of the code • Store intermediate results and have the ability to version them • Present your work (this will be a combination of text, code, and images) Here comes Jupyter; it actually implements all the preceding actions: 1. To launch the Jupyter Notebook, run the following command: $> jupyter notebook

2. A web browser window will pop up on your desktop, backed by a Jupyter server instance. This is the how the main window looks:

3. Then, click on New Notebook. A new window will open, as shown in the following screenshot. You can start using the Notebook as soon as the kernel is ready. The small circle on the top right of the place, below the Python icon, indicates the state of the kernel: if it's filled, it means that the kernel is busy working; if it's empty (like the one in the screenshot) it means that the kernel is in idle, that is, ready to run any code

This is the web app that you'll use to compose your story. It's very similar to a Python IDE, with the bottom section (where you can write the code) composed of cells.

A cell can be either a piece of text (eventually formatted with a markup language) or a piece of code. In the second case, you have the ability to run the code, and any eventual output (the standard output) will be placed under the cell. The following is a very simple example of the same:

In: import random
a = random.randint(0, 100)
a
Out: 16
In: a*2
Out: 32


In the first cell, which is denoted by In:, we import the random module, assign a random value between 0 and 100 to the variable a, and print the value. When this cell is run, the output, which is denoted as Out:, is the random number. Then, in the next cell, we will just print the double of the value of the variable a.

As you can see, it's a great tool to debug and decide which parameter is best for a given operation. Now, what happens if we run the code in the first cell? Will the output of the second cell be modified since a is different? Actually, no, it won't. Each cell is independent and autonomous. In fact, after we run the code in the first cell, we end up with this inconsistent status:

In: import random
a = random.randint(0, 100)
a
Out: 56
In: a*2
Out: 32


### Note

Also note that the number in the squared parentheses has changed (from 1 to 3) since it's the third executed command (and its output) from the time the notebook started. Since each cell is autonomous, by looking at these numbers, you can understand their order of execution.

Jupyter is a simple, flexible, and powerful tool. However, as seen in the preceding example, you must note that when you update a variable that is going to be used later on in your Notebook, remember to run all the cells following the updated code so that you have a consistent state.

When you save a Jupyter Notebook, the resulting .ipynb file is JSON formatted, and it contains all the cells and their content plus the output. This makes things easier because you don't need to run the code to see the notebook (actually, you also don't need to have Python and its set of toolkits installed). This is very handy, especially when you have pictures featured in the output and some very time-consuming routines in the code. A downside of using the Jupyter Notebook is that its file format, which is JSON structured, cannot be easily read by humans. In fact, it contains images, code, text, and so on.

Now, let's discuss a data science-related example (don't worry about understanding it completely):

In:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor


In the following cell, some Python modules are imported:

In:
X_full = boston_dataset.data
Y = boston_dataset.target
print (X_full.shape)
print (Y.shape)
Out:
(506, 13)
(506,)


Then, in cell [2], the dataset is loaded and an indication of its shape is shown. The dataset contains 506 house values that were sold in the suburbs of Boston, along with their respective data arranged in columns. Each column of the data represents a feature. A feature is a characteristic property of the observation. Machine learning uses features to establish models that can turn them into predictions. If you are from a statistical background, you can add features that can be intended as variables (values that vary with respect to the observations).

To see a complete description of the dataset, use print (boston_dataset.DESCR).

After loading the observations and their features, in order to provide a demonstration of how Jupyter can effectively support the development of data science solutions, we will perform some transformations and analysis on the dataset. We will use classes, such as SelectKBest, and methods, such as .getsupport() or .fit(). Don't worry if these are not clear to you now; they will all be covered extensively later in this book. Try to run the following code:

In:
selector = SelectKBest(f_regression, k=1)
selector.fit(X_full, Y)
X = X_full[:, selector.get_support()]
print (X.shape)
Out:
(506, 1)


Here, we select a feature (the most discriminative one) of the SelectKBest class that is fitted to the data by using the .fit() method. Thus, we reduce the dataset to a vector with the help of a selection operated by indexing on all the rows and on the selected feature, which can be retrieved by the .get_support() method.

Since the target value is a vector, we can, therefore, try to see whether there is a linear relation between the input (the feature) and the output (the house value). When there is a linear relationship between two variables, the output will constantly react to changes in the input by the same proportional amount and direction:

In:
def plot_scatter(X,Y,R=None):
plt.scatter(X, Y, s=32, marker='o', facecolors='white')
if R is not None:
plt.scatter(X, R, color='red', linewidth=0.5)
plt.show()
In:
plot_scatter(X,Y)


In our example, as X increases, Y decreases. However, this does not happen at a constant rate, because the rate of change is intense up to a certain X value and then it decreases and becomes constant. This is a condition of nonlinearity, and we can furthermore visualize it using a regression model. This model hypothesizes that the relationship between X and Y is linear in the form of y=a+bX. Its a and b parameters are estimated according to a certain criteria.

In the fourth cell, we scatter the input and output values for this problem:

In:
regressor = LinearRegression(normalize=True).fit(X, Y)
plot_scatter(X, Y, regressor.predict(X))


In the next cell, we create a regressor (a simple linear regression with feature normalization), train the regressor, and finally plot the best linear relation (that's the linear model of the regressor) between the input and output. Clearly, the linear model is an approximation that is not working well. We have two possible paths that we can follow at this point. We can transform the variables in order to make their relationship linear, or we can use a nonlinear model. Support Vector Machine (SVM) is a class of models that can easily solve nonlinearities. Also, Random Forests is another model for automatic solving of similar problems. Let's see them in action in Jupyter:

In:
regressor = SVR().fit(X, Y)
plot_scatter(X, Y, regressor.predict(X))


In:
regressor = RandomForestRegressor().fit(X, Y)
plot_scatter(X, Y, regressor.predict(X))


Finally, in the last two cells, we will repeat the same procedure. This time we will use two nonlinear approaches: an SVM and a Random Forest-based regressor.

This demonstrative code solves the nonlinearity problem. At this point, it is very easy to change the selected feature, regressor, and the number of features we use to train the model, and so on by simply modifying the cells where the script is. Everything can be done interactively, and according to the results we see, we can decide both what should be kept or changed and what is to be done next.

### Alternatives to Jupyter

In case you may not prefer using Jupyter, there are actually a few alternatives that can help you in testing the code you will find in the book. If you have experience with R, the RStudio (http://www.rstudio.com/) layout may appeal more to you. In this case, Yhat, a company providing data science solutions for decision APIs, offers their data science IDE for Python free of charge. Named Rodeo (http://www.yhat.com/products/rodeo), it works using the IPython kernel of Jupyter under the hood. It is an interesting alternative given its different user interface. The main advantages of using Rodeo are as follows:

• A video layout is arranged in four windows: editor, console, plots, environment.

• Autocomple is provided for editor and console

• Plots are always visible inside the application in a specific window

• You can easily inspect the working variables in the environment window

Rodeo can be simply installed using the installer. You can download it from its website, or you can simply do this in the command line:

$> pip install rodeo  After the installation, you can immediately run the Rodeo IDE with this command: $> rodeo .


Instead, if you have experience with MATLAB from Mathworks, you will find it easier to work with Spyder (http://pythonhosted.org/spyder/), a scientific IDE that can be found in major Scientific Python distributions (it is present in Anaconda, WinPython, and Python(x,y)—all distributions that we have suggested in the book). If you don't use a distribution, in order to install Spyder, you have to follow the instructions to be found at the web page: http://pythonhosted.org/spyder/installation.html. Spyder allows advanced editing, interactive editing, debugging, and introspection features and your scripts can be run in a Jupyter console or in a shell-like environment.

## Datasets and code used in the book

As we progress through the concepts presented in the book, in order to facilitate the reader's understanding, learning, and memorizing processes, we will illustrate practical and effective data science Python applications on various explicative datasets. The reader will always be able to immediately replicate, modify, and experiment with the proposed instructions and scripts on the data that we will use in this book.

As for the code that you are going to find in this book, we will limit our discussions to the most essential commands in order to inspire you from the beginning of your data science journey with Python to do more with less by leveraging key functions from the packages we presented beforehand.

Given our previous introduction, we will present the code to be run interactively as it appears on a Jupyter console or Notebook.

All the presented code will be offered in the Notebooks, and is available on the Packt website (as pointed out in the Preface). As for the data, we will provide different examples of datasets.

### Scikit-learn toy datasets

The Scikit-learn toy dataset module is embedded in the Scikit-learn package. Such datasets can easily be directly loaded into Python by the import command, and they don't require any download from any external Internet repository. Some examples of this type of dataset are the Iris, Boston, and Digits datasets, to name the principal ones mentioned in uncountable publications and books, and a few other classic ones for classification and regression.

Structured in a dictionary-like object, besides the features and target variables, they offer complete descriptions and contextualization of the data itself.

For instance, to load the Iris dataset, enter the following commands:

In:
from sklearn import datasets


After loading, we can explore the data description and understand how the features and targets are stored. All Scikit-learn datasets present the following methods:

• .DESCR: This provides a general description of the dataset

• .data: This contains all the features

• .feature_names: This reports the names of the features

• .target: This contains the target values expressed as values or numbered classes

• .target_names: This reports the names of the classes in the target

• .shape: This is a method that you can apply to both .data and .target; it reports the number of observations (the first value) and features (the second value if present) that are present

Now, let's just try to implement them (no output is reported, but the print commands will provide you with plenty of information):

In:
print (iris.DESCR)
print (iris.data)
print (iris.data.shape)
print (iris.feature_names)
print (iris.target)
print (iris.target.shape)
print (iris.target_names)


Now, you should know something more about the dataset—how many examples and variables are present and what their names are.

Notice that the main data structures that are enclosed in the iris object are the two arrays, data and target:

In:
print (type(iris.data))
Out:
<class 'numpy.ndarray'>


Iris.data offers the numeric values of the variables named sepal length, sepal width, petal length, and petal width arranged in a matrix form (150,4), where 150 is the number of observations and 4 is the number of features. The order of the variables is the order presented in iris.feature_names.

Iris.target is a vector of integer values, where each number represents a distinct class (refer to the content of target_names; each class name is related to its index number and setosa, which is the zero element of the list, is represented as 0 in the target vector).

The Iris flower dataset was first used in 1936 by Ronald Fisher, who was one of the fathers of modern statistical analysis, in order to demonstrate the functionality of linear discriminant analysis on a small set of empirically verifiable examples (each of the 150 data points represented iris flowers). These examples were arranged into tree-balanced species classes (each class consisted of one-third of the examples) and were provided with four metric descriptive variables that, when combined, were able to separate the classes.

The advantage of using such a dataset is that it is very easy to load, handle, and explore for different purposes, from supervised learning to graphical representation due to the dataset's low dimensionality. Modeling activities take almost no time on any computer, no matter what its specifications are. Moreover, the relationship between the classes and the role of the explicative variables are well known. Therefore, the task is challenging, but it is not very arduous.

For example, let's just observe how classes can be easily separated when you wish to combine at least two of the four available variables by using a scatterplot matrix.

Scatterplot matrices are arranged in a matrix format, whose columns and rows are the dataset variables. The elements of the matrix contain single scatterplots whose x values are determined by the row variable of the matrix and y values by the column variable. The diagonal elements of the matrix may contain a distribution histogram or some other univariate representation of the variable at the same time in its row and column.

The pandas library offers an off-the-shelf function to quickly make up scatterplot matrices and start exploring relationship and distributions between the quantitative variables in a dataset:

In:
import pandas as pd
import numpy as np
colors = list()
palette = {0: "red", 1: "green", 2: "blue"}
In:
for c in np.nditer(iris.target): colors.append(palette[int(c)])
# using the palette dictionary, we convert
# each numeric class into a color string
dataframe = pd.DataFrame(iris.data,  columns=iris.feature_names)
In:
sc = pd.scatter_matrix(dataframe, alpha=0.3, figsize=(10, 10),
diagonal='hist', color=colors, marker='o', grid=True)


We encourage you to experiment a lot with this dataset and with similar ones before you work on other complex real data, because the advantage of focusing on an accessible, non-trivial data problem is that it can help you to quickly build your foundations on data science.

After a while anyway, though they are useful and interesting for your learning activities, toy datasets will start limiting the variety of different experimentations that you can achieve. In spite of the insights provided, in order to progress, you'll need to gain access to complex and realistic data science topics. Consequently, we will have to resort to some external data.

#### The MLdata.org public repository

The second type of example dataset that we will present can be downloaded directly from the machine learning dataset repository, or from the LIBSVM data website. Contrary to the previous dataset, in this case, you will need access to the Internet.

First, mldata.org is a public repository for machine learning datasets that is hosted by the TU Berlin University and supported by Pattern Analysis, Statistical Modelling, and Computational Learning (PASCAL), a network funded by the European Union.

For example, if you need to download all the data related to earthquakes since 1972 as reported by the United States Geological Survey, in order to analyze the data to search for predictive patterns you will find the data repository at http://mldata.org/repository/data/viewslug/global-earthquakes/ (here, you will find a detailed description of the data).

Note that the directory that contains the dataset is global-earthquakes; you can directly obtain the data using the following commands:

In:
from sklearn.datasets import fetch_mldata
earthquakes = fetch_mldata('global-earthquakes')
print (earthquakes.data)
print (earthquakes.data.shape)
Out:
(59209L, 4L)


As in the case of the Scikit-learn package toy dataset, the obtained object is a complex dictionary-like structure, where your predictive variables are earthquakes.data and your target to be predicted is earthquakes.target. This being the real data, in this case, you will have quite a lot of examples and just a few variables available.

#### LIBSVM data examples

LIBSVM Data (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) is a page gathering data from many other collections. It offers different regression, binary, and multilabel classification datasets stored in the LIBSVM format. This repository is quite interesting if you wish to experiment with the support vector machines or any other machine learning algorithm.

If you want to load a dataset, first go to the web page where you can visualize the data on your browser. In the case of our example, visit http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a and note down the address. Then, you can proceed by performing a direct download using that address:

In:
import urllib2
target_page =
'http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a'
a2a = urllib2.urlopen(target_page)
In:
print (X_train.shape, y_train.shape)
Out:
(1605, 119) (1605,)


In return, you will get two single objects: a set of training examples in a sparse matrix format and an array of responses.

Sometimes, you may have to download the datasets directly from their repository using a web browser or a wget command (on Linux systems).

If you have already downloaded and unpacked the data (if necessary) into your working directory, the simplest way to load your data and start working is offered by the NumPy and the pandas library with their respective loadtxt and read_csv functions.

For instance, if you intend to analyze the Boston housing data and use the version present at http://mldata.org/repository/data/viewslug/regression-datasets-housing, you first have to download the regression-datasets-housing.csv file in your local directory.

Since the variables in the dataset are all numeric (13 continuous and one binary), the fastest way to load and start using it is by trying out the loadtxt NumPy function and directly loading all the data into an array.

Even in real-life datasets, you will often find mixed types of variables, which can be addressed by pandas.read_table or pandas.read_csv. Data can then be extracted by the values method; loadtxt can save a lot of memory if your data is already numeric. In fact, the loadtxt command doesn't require any in-memory duplication, something that is essential for large datasets, as other methods for loading a CSV file may use up all the available memory:

In:
delimiter=',')
print (type(housing))
Out:
<class 'numpy.ndarray'>
In:
print (housing.shape)
Out:
(506, 14)


The loadtxt function expects, by default, a tabulation as a separator between the values on a file. If the separator is a comma (,) or a semicolon(;), you have to make it explicit using the parameter delimiter:

>>>  import numpy as np
<type 'function'>


Help on function loadtxt in module numpy.lib.npyio.

Another important default parameter is dtype, which is set to float.

### Note

This means that loadtxt will force all of the loaded data to be converted into a floating-point number.

If you need to determinate a different type (for example, int), you have to declare it beforehand.

For instance, if you want to convert numeric data to int, use the following code:

In: housing_int =housing.astype(int)


Printing the first three elements of the row of the housing and housing_int arrays can help you understand the difference:

In:
print (housing[0,:3], '\n', housing_int[0,:3])
Out:
[  6.32000000e-03   1.80000000e+01   2.31000000e+00]
[ 0 18  2]


Frequently, though not always the case in our example, the data on files feature in the first line a textual header that contains the name of the variables. In this situation, the parameter that is skipped will point out the row in the loadtxt file from where it will start reading the data. Being the header on row 0 (in Python, counting always starts from 0), the parameter skip=1 will save the day and allow you to avoid an error and fail to load your data.

The situation would be slightly different if you were to download the Iris dataset, which is present at http://mldata.org/repository/data/viewslug/datasets-uci-iris/. In fact, this dataset presents a qualitative target variable, class, which is a string that expresses the iris species. Specifically, it's a categorical variable with four levels.

Therefore, if you were to use the loadtxt function, you will get a value error because an array must have all its elements of the same type. The variable class is a string, whereas the other variables are constituted by floating-point values.

The pandas library offers the solution to this and many similar cases, thanks to its DataFrame data structure that can easily handle datasets in a matrix form (row per columns) that is made up of different types of variables.

First, just download the datasets-uci-iris.csv file and have it saved in your local directory.

At this point, using read_csv from pandas is quite straightforward:

In:
iris_filename = 'datasets-uci-iris.csv'
'petal_length', \ 'petal_width', 'target'])
print (type(iris))
Out:
< class 'pandas.core.frame.DataFrame'>


In order not to make the snippets of code printed in the book too cumbersome, we often wrap them and make them nicely formatted. In order to safely interrupt the code and wrap it to a new line, we use the backslash symbol (\) as in the preceding code. When rendering the code of the book by yourself, you can ignore backslash symbols and go on writing all of the instruction on the same line, or you can digit the backslash and start a new line with the remainder of the instruction. Please be warned that typing the backslash and then continuing the instruction on the same line will cause an execution error.

Apart from the filename, you can specify the separator (sep), the way the decimal points are expressed (decimal), whether there is a header (in this case, header=None; normally, if you have a header, then header=0), and the name of the variable where there is one (you can use a list; otherwise, pandas will provide some automatic naming).

### Note

Also, we have defined names that use single words (instead of spaces, we used underscores). Thus, we can later directly extract single variables by calling them as we do for methods; for instance, iris.sepal_length will extract the sepal length data.

If, at this point, you need to convert the pandas DataFrame into a couple of NumPy arrays that contain the data and target values, this can be done easily in a couple of commands:

In:
iris_data = iris.values[:,:4]
iris_target, iris_target_labels = pd.factorize(iris.target)
print (iris_data.shape, iris_target.shape)
Out:
(150, 4) (150,)


#### Scikit-learn sample generators

As a last learning resource, the Scikit-learn package also offers the possibility to quickly create synthetic datasets for regression, binary and multilabel classification, cluster analysis, and dimensionality reduction.

The main advantage of recurring to synthetic data lies in its instantaneous creation in the working memory of your Python console. It is, therefore, possible to create bigger data examples without having to engage in long downloading sessions from the Internet (and saving a lot of stuff on your disk).

For example, you may need to work on a classification problem involving a million data points:

In:
from sklearn import datasets
X,y = datasets.make_classification(n_samples=10**6,
\ n_features=10, random_state=101)
print (X.shape,  y.shape)
Out: (1000000, 10) (1000000,)


After importing just the datasets module, we ask, using the make_classification command, for 1 million examples (the n_samples parameter) and 10 useful features (n_features). The random_state should be 101, so we are assured that we can replicate the same datasets at a different time and in a different machine.

For instance, you can type the following command:

In: datasets.make_classification(1, n_features=4, random_state=101)


This will always give you the following output:

Out:(array([[-3.31994186, -2.39469384, -2.35882002,

1.40145585]]),  array([0]))


No matter what the computer and the specific situation are, random_state assures deterministic results that make your experimentations perfectly replicable, due to the fact that all the random numbers involved in this synthetic dataset are actually produced in a deterministic way, based on this number (sometime it's called seed).

Defining the random_state parameter using a specific integer number (in this case, it's 101, but it may be any number that you prefer or find useful) allows easy replication of the same dataset on your machine, the way it is set up, on different operating systems, and on different machines.

By the way, did it take too long?

On an Intel i7 CPU @ 2.3GHz machine, it takes:

In:
%timeit X,y = datasets.make_classification(n_samples=10**6,
\ n_features=10, random_state=101)
Out: 1 loops, best of 3: 815 ms per loop


If it doesn't seem so on your machine and if you are ready, having set up and tested everything up to this point, we can start our data science journey.

## Summary

In this introductory chapter, we installed everything that we will be using throughout this book, from Python packages to examples. They were installed either directly or by using a scientific distribution. We also introduced Jupyter notebooks and demonstrated how you can have access to the data run in the tutorials.

In the next chapter, Data Munging, we will have an overview of the data science pipeline and explore all the key tools to handle and prepare data before you apply any learning algorithm and set up your hypothesis experimentation schedule.

• ##### Alberto Boschetti

Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP) and behavioral analysis to machine learning and distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.

Browse publications by this author
• ##### Luca Massaron

Luca Massaron is a data scientist and marketing research director specialized in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience of solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top-10 Kaggler, he has always been very passionate about every aspect of data and its analysis, and also about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, Luca believes that a lot can be achieved in data science just by doing the essentials.

Browse publications by this author

### Latest Reviews

(6 reviews total)