Large Scale Machine Learning with Python

4.5 (2 reviews total)
By Bastiaan Sjardin , Luca Massaron , Alberto Boschetti
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. First Steps to Scalability

About this book

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. But finding algorithms and designing and building platforms that deal with large sets of data is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with a high predictive accuracy.

Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a map reduce framework in Hadoop and Spark in Python.

Publication date:
August 2016


Chapter 1. First Steps to Scalability

Welcome to this book on scalable machine learning with Python.

In this chapter, we will discuss how to learn effectively from big data with Python and how it can be possible using your single machine or a cluster of other machines, which you can get, for instance, from Amazon Web Services (AWS) or the Google Cloud Platform.

In the book, we will be using Python's implementation of machine learning algorithms that are scalable. This means that they can work with a large amount of data and do not crash because of memory constraints. They also take a reasonable amount of time, which is something manageable for a data science prototype and also deployment in production. Chapters are organized around solutions (such as streaming data), algorithms (such as neural networks or ensemble of trees), and frameworks (such as Hadoop or Spark). We will also provide you with some basic reminders about the machine learning algorithms and explain how to make them scalable and suitable to problems with massive datasets.

Given such premises as a start, you'll need to learn the basics (so as to figure out the perspective under which this book has been written) and set up all your basic tools to start reading the chapters immediately.

In this chapter, we will introduce you to the following topics:

  • What scalability actually means

  • What bottlenecks you should pay attention to when dealing with data

  • What kind of problems this book will help you solve

  • How to use Python to analyze datasets at scale effectively

  • How to set up your machine quickly to execute the examples presented in this book

Let's start this journey together around scalable solutions with Python!


Explaining scalability in detail

Even if the hype now is about big data, large datasets existed long before the term itself had been coined. Large collections of texts, DNA sequences, and vast amounts of data from radio telescopes have always represented a challenge for scientists and data analysts. As most machine learning algorithms have a computational complexity of O(n2) or even O(n3), where n is the number of training instances, the challenge from massive datasets has been previously faced by data scientists and analysts by resorting to data algorithms that could be more efficient. A machine learning algorithm is deemed scalable when it can work after an appropriate setup, in case of large datasets. A dataset can be large because of a large number of cases or variables, or because of both, but a scalable algorithm can deal with it in an efficient way as its running time increases almost linearly accordingly to the size of the problem. Therefore, it is just a matter of exchanging 1:1 more time (or more computational power) with more data. Instead, a machine learning algorithm doesn't scale if it's faced with large amounts of data; it simply stops working or operates with a running time that increases in a nonlinear way, for instance, exponentially, thus making learning unfeasible.

The introduction of cheap data storage, a large RAM, and multiprocessor CPU dramatically changed everything, increasing the ability of single laptops to analyze large amounts of data. Another big game changer arrived on the scene in the past years, shifting the attention from single powerful machines to clusters of commodity computers (cheaper, easily available machines). This big change has been the introduction of MapReduce and the open source framework Apache Hadoop with its Hadoop Distributed File System (HDFS) and, in general, of parallel computation on networks of computers.

In order to figure out how both of these changes deeply and positively affected your capabilities of solving your large scale problems, we should first start from what actually prevented you (and still prevents, depending on how massive is your problem) from analyzing large datasets.

No matter what your problem is, you will eventually find out that you cannot analyze your data because of any of these limits:

  • Computing affecting the time taken to execute the analysis

  • I/O affecting how much of your data you can take from storage to memory in a time unit

  • Memory affecting how much large data you can process at a time

Your computer has limitations that will determine if you can learn from your data and how long it will take before you hit a wall. Computing limitations occur in many intensive calculations, I/O problems will bottleneck your prompt access to data, and finally memory limitations can constraint you to take on only a part of your data, thus limiting the kind of matrix computations that you may have access to or the precision or even exactness of your estimations.

Each of these hardware limitations will also affect you differently in severity with regard to the data you are analyzing:

  • Tall data, which is characterized by a large number of cases

  • Wide data, which is characterized by a large number of features

  • Tall and wide data, which has a large number of both cases and features

  • Sparse data, which is characterized by a large number of zero entries or entries that could be transformed into zeros (that is, the data matrix may be tall and/or wide but informative, but not all the matrix entries have informative value)

Finally, it comes down to the algorithm that you are going to use in order to learn from the data. Each algorithm has its own characteristics, being able to map data using a solution differently affected by bias or variance. Therefore, with respect to your problem that, so far, you solved by machine learning, you considered, based on experience or empirical tests, that certain algorithms may work better than others did. With large scale problems, you have to add other and different considerations when deciding on the algorithm:

  • How complex your algorithm is; that is, if the number of rows and columns in your data affects the number of computations in a linear or nonlinear way. Most machine learning solutions are based on algorithms of quadratic or cubic complexity, thus strongly limiting their applicability to big data.

  • How many parameters your model has; here, it's not just a problem of variance of the estimates (overfitting), but of the time it may take to compute them all.

  • If the optimization processes are parallelizable; that is, can you easily split the computations across multiple nodes or CPU cores, or do you have to rely on a single, sequential, optimization process?

  • Should the algorithm learn from all the data at once or can you use single examples or small batches of data instead?

If you cross-evaluate hardware limitations with data characteristics and these kind of algorithms, you'll get a host of possible problematic combinations that can prevent you from getting results from large scale analysis. From a practical point of view, all the problematic combinations can be solved by three approaches:

  • Scaling up, that is, improving performances on a single machine by software or hardware modifications (more memory, faster CPU, faster storage disk, and using GPUs)

  • Scaling out, that is, distributing the computation (and the performances) across multiple machines leveraging outside resources, namely other storage disks and other CPUs (or GPUs)

  • Scaling up and out, that is, taking the best of the scaling up and out solutions together

Making large scale examples

Some motivating examples may make things clearer and more memorable for you. Let's take two simple examples:

  • Being able to predict the click-through rate (CTR) can help you earn quite a lot these days when Internet advertising is so widespread, diffused, and eating large shares of traditional media communication

  • Being able to propose the right information to your customers, when they are searching the products and services offered by your site, could really enhance your chances to sell if you can guess what to put at the top of their results

In both cases, we have quite large datasets as they are produced by users' interactions on the Internet.

Depending on the business that we have in mind (we can imagine some big players here), we are clearly talking of millions of data points per day in both our examples. In the advertising case, data is certainly tall, being a continuous stream of information as the most recent data, more representative of markets and consumers, replaces the older one. In the search engine case, data is wide, being enriched by the feature provided by the results you offered to your customers: for instance, if you are in the travels business, you will have quite a lot of features about hotels, locations, and services offered.

Clearly, scalability is an issue for both these problems:

  • You have to learn from data that is growing every day and you have to learn fast because as you are learning, new data keeps arriving. Yet, you have to deal with data that clearly cannot fit in memory because the matrix is too tall or too large.

  • You frequently need to update your machine learning model in order to accommodate new data. You need an algorithm that can process the information in a timely manner. O(n2) or O(n3) complexities could be impossible for you to handle because of the data quantity; you need some algorithm that can work with lower complexity (such as O(n)) or by dividing the data so that n will be much, much smaller.

  • You have to be able to predict fast because the predictions have to be delivered only to new customers. Again, the complexity of your algorithm does matter.

The scalability problem can be solved in one or multiple ways:

  • Scaling up by reducing the dimensionality of the problem; for instance, in the case of the search engine, by effectively selecting the relevant features to be used

  • Scaling up using the right algorithm; for instance, in the case of advertising data, there are appropriate algorithms to learn effectively from streams

  • Scaling out the learning process by leveraging multiple machines

  • Scaling up the deployment process using multiprocessing and vectorization on a single server effectively

In this book, we will point out for you what kind of practical problems can be solved by each one of the solutions or algorithms proposed. It will become automatic for you to connect a particular constraint in time and execution (CPU, memory, or I/O) to the most suitable solution among the ones that we propose.

Introducing Python

As our treatise will depend on Python—our open source language of choice for this book—we have to stop for a brief moment and present the language before clarifying how Python can easily help you scale up and out with your massive data problem.

Created in 1991 as a general-purpose, interpreted, object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to have uncountable and fast experimentations, easy theory developments, and prompt deployments of scientific applications.

As a machine learning practitioner, you will find using Python interesting for various reasons:

  • It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you may need in the course of a data analysis, and sometimes even more.

  • It is very versatile. No matter what your programming background or style is (object-oriented or procedural), you will enjoy programming with Python.

  • If you don't know it yet but you know other languages such as C/C++ or Java well, then it is very simple to learn and use. After you grasp the basics, there's no other better way to learn more than by immediately starting with the coding.

  • It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux, and macOS systems. You won't have to worry about portability.

  • Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language).

  • It can work with in-memory big data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using the various iterations and reiterations of data wrangling.


If you are not already an expert (and actually we require some basic knowledge of Python in order to be able to make the most out of this book), you can read everything about the language and find the basic installations files directly from the Python foundations at

Scale up with Python

Python is an interpreted language; it runs the reading of your script from memory and executes it during runtime, thus accessing the necessary resources (files, objects in memory, and so on). Apart from being interpreted, another important aspect to take into consideration when using Python for data analysis and machine learning is that Python is single-threaded. Being single-threaded means that any Python program is executed sequentially from the start to the end of the script and that Python cannot take advantage of the extra processing power offered by the multiple threads and processors likely present in your computer (most computers nowadays are multicore).

Given such a situation, scaling up using Python can be achieved by different strategies:

  • Compiling Python scripts in order to achieve more speed of execution. Though easily possible using, for instance, PyPy—a Just-in-Time (JIT) compiler that can be found at, we actually didn't resort to such a solution in our book because it requires writing algorithms in Python from scratch.

  • Using Python as a wrapping language; thus putting together the operations executed by Python with the execution of external libraries and programs, some capable of multicore processing. In our book, you will find many examples of this when we call specialized libraries such as the Library for Support Vector Machines (LIBSVM) or programs such as Vowpal Wabbit (VW), XGBoost, or H2O in order to execute machine learning activities.

  • Effectively using vectorization techniques, that is, special libraries for matrix computations. This can be achieved using NumPy or pandas, both using computations from GPUs. GPUs are just like multicore CPUs, each one with their own memory and ability to process calculations in parallel (you can figure out that they have multiple tiny cores). Especially when working with neural networks, vectorization techniques based on GPUs can speed up computations incredibly. However, GPUs have their own limitations; first of all, their available memory has a certain I/O in passing your data to their memory and getting the results back to your CPU, and they require parallel programming via a special API, such as CUDA for NVIDIA-manufactured GPUs (so you have to install the appropriate drivers and programs).

  • Reducing a large problem into chunks and solving each chunk one at a time in-memory (divide and conquer algorithms). This leads to the partitioning or subsampling of data from memory or disk and managing approximate solutions of your machine learning problem, which is quite effective. It is important to notice that both partitioning and subsampling can operate for cases and features (and both). If the original data is kept on a disk storage, I/O constraints will become quite determinant of the resulting performances.

  • Effectively leveraging both multiprocessing and multithreading, depending on the learning algorithm that you will be using. Some algorithms will naturally be able to split their operations into parallel ones. In such cases, the only constraint will be your CPU's and your memory (as your data will have to be replicated for every parallel worker that you will be using). Some other algorithms will instead take advantage of multithreading, thus managing more operations at the same time on the same memory blocks.

Scale out with Python

Scaling out solutions simply involve connecting together multiple machines into a cluster. As you connect the machines (scaling out), you can also scale up each one of them using configurations that are more powerful (thus augmenting CPU, memory, and I/O), applying the techniques we mentioned in the previous paragraph and enhancing their performances.

By connecting multiple machines, you can leverage their computational power in a parallel fashion. Your data will be distributed across multiple storage disks/memory, limiting I/O transfers by having each machine work only on its available data (that is, its own storage disk or RAM memory).

In our book, this translates into using outside resources effectively by means of the following:

  • The H2O framework

  • The Hadoop framework and its components, such as HDFS, MapReduce, and Yet Another Resource Negotiator (YARN)

  • The Spark framework on top of Hadoop

Each of these frameworks will be controlled by Python (for instance, Spark by its Python interface named pySpark).


Python for large scale machine learning

Given the availability of many useful packages for machine learning and the fact that it is a programming language quite popular among data scientists, Python is our language of choice for all the code presented in this book.

In this book, when necessary, we will provide further instructions in order to install any further necessary library or tool. Here, we will instead start installing the basics, that is, the Python language and the most frequently used packages for computations and machine learning.

Choosing between Python 2 and Python 3

Before starting, it is important to know that there are two main branches of Python: versions 2 and 3. As many core functionalities have changed, scripts built for one version are sometimes incompatible with the other one (they won't work without raising errors and warnings). Although the third version is the newest, the older one is still the most used version in the scientific area and the default version for many operative systems (mainly for compatibility in upgrades). When version 3 was released (in 2008), most scientific packages weren't ready so the scientific community stuck with the previous version. Fortunately, since then, almost all packages have been updated leaving just a few (see for a compatibility overview) as orphans of Python 3 compatibility.

In spite of the recent growth in popularity of Python 3 (which, we shouldn't forget, is the future of Python), Python 2 is still widely used among data scientists and data analysts. Moreover, for a long time Python 2 has been the default Python installation (for instance, on Ubuntu), so it is the most likely version that most of the readers should have ready at hand. For all these reasons, we will adopt Python 2 for this book. It is not merely love for the old technologies, it is just a practical choice in order to make Large Scale Machine Learning with Python accessible to the largest audience:

  • The Python 2 code will immediately address the existing audience of data experts.

  • Python 3 users will find it very easy to convert our scripts in order to work under their favored Python version because the code we wrote is easily convertible and we will provide a Python 3 version of all our scripts and notebooks, freely downloadable from the Packt website.


In case you need to understand the differences between Python 2 and Python 3 in depth, we suggest reading this web page about writing Python 2-3 compatible code:

From Python-Future, you may also find reading about how to convert Python 2 code to Python 3 useful:

Installing Python

As the first step, we are going to create a working environment for data science that you can use to replicate and test the examples in the book and prototype your own large solutions.

No matter in what language you are going to develop your application, Python will gift you with an easy time getting your data, building your model from it, and extracting the right parameters you need to make your predictions in a production environment.

Python is an open source, object-oriented, cross-platform programming language that, compared with its direct competitors (for instance, C/C++ and Java), produces very concise and readable code. It allows you to build a working software prototype in a very short time and tests, maintains, and scales it in the future. It has become the most used language in the data scientist's toolbox because, in the end, it is a general-purpose language turned very flexible thanks to a large variety of available packages that can easily and rapidly help you solve a wide spectrum of both common and niche problems.

Step-by-step installation

If you have never used Python (but this doesn't mean that you may not already have it installed on your machine), you need to first download the installer from the main website of the project, (remember, we're using version 3), and then install it on your local machine.

This section provides you with full control over what can be installed on your machine. This is very useful when you are going to use Python as both your prototyping and production language. Furthermore, it could help you keep track of the packages' versions that you are using. Anyway, be warned that a step-by-step installation really takes time and effort. Instead, installing a ready-made scientific distribution will lessen the burden of installation procedures and it may be well-suited to first start and learn because it can save you quite a lot of time, though it will install a large number of packages (that for the most part you won't maybe ever use) on your computer all at once. Therefore, if you want to start immediately and don't want to bother much about controlling your installation, just skip this part and proceed to the next section, Scientific distributions.

Being a multiplatform programming language, you'll find installers for computers that either run on Windows or Linux-/Unix-like operating systems. Remember that some Linux distributions (such as Ubuntu) already have Python 2 packed in the repository, which makes the installation process even easier.

  1. Open a Python shell, type python in the terminal, or click on the Python icon.

  2. Then, to test the installation, run the following code in the Python interactive shell or its Read-Eval-Print Loop (REPL) interface provided by Python's standard IDE or other solutions such as Spyder or PyCharm:

    >>> import sys
    >>> print sys.version

If a syntax error has been raised, it means that you are running Python 2 instead of Python 3. If you don't experience an error and you can read that your Python version is 3.4.x or 3.5.x (at the time of writing, the latest version is 3.5.2), then congratulations for running the version of Python that we elected for this book.

To clarify, when a command is given in the terminal command line, we prefix the command with $. Otherwise, if it's for the Python REPL, it's preceded by >>>.

The installation of packages

Depending on your system and past installations, Python may not come bundled with all that you need unless you have installed a distribution (which, on the other hand, usually is stuffed with much more than you may need).

To install any packages that you need, you can use either the pip or easy_install commands; however, easy_install is going to be dropped in the future and pip has important advantages over it.

pip is a tool to install Python packages directly accessing the Internet and picking them from the Python Package Index ( PyPI is a repository containing third-party open source packages, which are constantly maintained and stored in the repository by their authors.

It is preferable to install everything using pip because of the following reasons:

  • It is the preferred package manager for Python and starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers

  • It provides an uninstall functionality

  • It rolls back and leaves your system clear if, for whatever reason, the package installation fails

The pip command runs in the command line and makes the process of installation, upgrade, and removal of Python packages a breeze.

As we mentioned, if you're running at least Python 2.7.9 or Python 3.4, the pip command should already be there. To assure which tools have been installed on your local machine, directly test with the following command if any error is raised:

$ pip –V

In some Linux and Mac installations, Python 3 and not Python 2 being installed, the command may be present as pip3, so if you receive an error when looking for pip, try running the following command:

$ pip3 –V

If this is the case, remember that pip3 is suitable only to install packages on Python 3. As we are working with Python 2 in the book (unless you decide to use the most recent Python 3.4), pip should always be your choice to install packages.

Alternatively, you can also test whether the old easy_install command is available:

$ easy_install --version


Using easy_install in spite of pip and its advantages makes sense if you are working on Windows because pip will not install binary packages; therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day.

If your test ends with an error, you really need to install pip from scratch (and in doing so, also easy_install at the same time).

To install pip, simply follow the instructions given at The safest way is to download the script from and then run it using the following:

$ python

By the way, the script will also install the setup tool from, which contains easy_install.

As an alternative, if you are running a Debian/Ubuntu Unix-like system, then a fast shortcut would be to install everything using apt-get:

$ sudo apt-get install python3-pip

After checking this basic requirement, you're now ready to install all the packages that you need in order to run the examples provided in this book. To install a generic <pk> package, you just need to run the following command:

$ pip install <pk>

Alternatively, if you prefer to use easy_install, you can also run the following command:

$ easy_install <pk>

After this, the <pk> package and all its dependencies will be downloaded and installed.

If you're not sure whether a library has been installed or not, just try to import a module in it. If the Python interpreter raises an ImportError error, it can be concluded that the package has not been installed.

Let's take an example. This is what happens when the NumPy library has been installed:

>>> import numpy

This is what happens if it's not installed:

>>> import numpy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named numpy

In the latter case, before importing it, you'll need to install it through pip or easy_install.

Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn.

Package upgrades

More often than not, you will find yourself in a situation where you have to upgrade a package because the new version is either required by a dependency or has additional features that you would like to use. To do so, first check the version of the library that you have installed by glancing at the __version__ attribute, as shown in the following example using the NumPy package:

>>> import numpy
>>> numpy.__version__ # 2 underscores before and after

Now, if you want to update it to a newer release, say precisely the 1.9.2 version, you can run the following command from the command line:

$ pip install -U numpy==1.9.2

Alternatively (but we do not recommend it unless it proves necessary), you can also use the following command:

$ easy_install --upgrade numpy==1.9.2

Finally, if you're just interested in upgrading it to the latest available version, simply run the following command:

$ pip install -U numpy

You can also run the easy_install alternative:

$ easy_install --upgrade numpy

Scientific distributions

As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need. (Sometimes, the installation procedures may not go as smoothly as you'd hoped for earlier.)

If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use the scientific Python distribution. Apart from Python, they also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE setup for your usage. A few of them are very well-known among data scientists, and in the sections that follow, you will find some of the key features for two of these packages that we found most useful and practical.

To immediately focus on the contents of the book, we suggest that you first promptly download and install a scientific distribution, such as Anaconda (which is the most complete one around, in our opinion), and decide to fully uninstall the distribution and set up Python alone after practicing the examples in the book, which can be accompanied by just the packages you need for your projects.

Again, if possible, download and install the version containing Python 3.

The first package that we would recommend you to try is Anaconda (, which is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, including NumPy, SciPy, pandas, IPython, matplotlib, Scikit-learn, and StatsModels. It's a cross-platform distribution that can be installed on machines with other existing Python distributions and versions, and its base version is free. Additional add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on its website, Anaconda's goal is to provide enterprise-ready Python distribution for large-scale processing, predictive analytics and scientific computing. As for Python version 2.7, we recommend the Anaconda distribution 4.0.0. (In order to have a look at the packages installed with Anaconda, you can have a look at the list at

As a second suggestion, if you are working on Windows and you desire a portable distribution, WinPython ( could be a quite interesting alternative (sorry, no Linux or MacOS versions). WinPython is also a free, open source Python distribution maintained by the community. It is also designed with scientists in mind, and it includes many essential packages such as NumPy, SciPy, matplotlib, and IPython (basically the same as Anaconda's). It also includes Spyder as an IDE, which can be helpful if you have experience using the MATLAB language and interface. Its crucial advantage is that it is portable (you can put it in any directory or even in a USB flash drive), so you can have different versions present on your computer, move a version from a Windows computer to another, and you can easily replace an older version with a newer one just by replacing its directory. When you run WinPython or its shell, it will automatically set all the environment variables necessary to run Python as if it were regularly installed and registered on your system.


At the time of writing, Python 2.7 was the most recent distribution prepared on October 2015 with the release 2.7.10; since then, WinPython has published only updates of the Python 3 version of the distribution. After installing the distribution on your system, you may need to update some of the key packages necessary for the examples present in this book.

Introducing Jupyter/IPython

IPython was initiated in 2001 as a free project by Fernando Perez, addressing a lack in the Python stack for scientific investigations using a user-programming interface that could incorporate the scientific approach (mainly experimenting and interactively discovering) in the process of software development.

A scientific approach implies the fast experimentation of different hypotheses in a reproducible fashion (as does the data exploration and analysis task in data science), and when using IPython, you will be able to implement an explorative, iterative, and trial-and-error research strategy more naturally during your code writing.

Recently, a large part of the IPython project has moved to a new one called Jupyter. This new project extends the potential usability of the original IPython interface to a wide range of programming languages. (For a complete list, visit

Thanks to the powerful idea of kernels, programs that run the user's code are communicated by the frontend interface and provide feedback on the results of the executed code to the interface itself; you can use the same interface and interactive programming style, no matter what language you are developing in.

Jupyter (IPython is the zero kernel, the original starting one) can be simply described as a tool for interactive tasks operable by a console or web-based notebook, which offers special commands that help developers better understand and build the code that is being currently written.

Contrary to an IDE, which is built around the idea of writing a script, running it afterward and evaluating its results, Jupyter lets you write your code in chunks named cells, run each of them sequentially, and evaluate the results of each one separately, examining both textual and graphic outputs. Besides graphical integration, it provides you with further help, thanks to customizable commands, a rich history (in the JSON format), and computational parallelism for an enhanced performance when dealing with heavy numeric computations.

Such an approach is also particularly fruitful for the tasks involving developing code based on data as it automatically accomplishes the often neglected duty of documenting and illustrating how data analysis has been done, its premises and assumptions, and its intermediate and final results. If a part of your job is to also present your work and persuade internal or external stakeholders to the project, Jupyter can really do the magic of storytelling for you with few additional efforts. There are many examples on, some of which you may find inspiring for your work as we did.

Actually, we have to confess that keeping a clean, up-to-date Jupyter Notebook has saved us uncountable times when meetings with managers/stakeholders have suddenly popped up, requiring us to hastily present the state of our work.

In short, Jupyter offers you the following features:

  • Seeing intermediate (debugging) results for each step of the analysis

  • Running only some sections (or cells) of the code

  • Storing intermediate results in the JSON format and having the ability to do version control on them

  • Presenting your work (this will be a combination of text, code, and images), sharing it via the Jupyter Notebook Viewer service (, and easily exporting it to HTML, PDF, or even slideshows

Jupyter is our favored choice throughout this book, and it is used to clearly and effectively illustrate storytelling operations with scripts and data and their consequent results.

Though we strongly recommend using Jupyter, if you are using an REPL or IDE, you can use the same instructions and expect identical results (except for print formats and extensions of the returned results).

If you do not have Jupyter installed on your system, you can promptly set it up using the following command:

$ pip install jupyter


You can find complete instructions about the Jupyter installation (covering different operating systems) at

If you already have Jupyter installed, it should be upgraded to at least version 4.1.

After installation, you can immediately start using Jupyter, calling it from the command line:

$ jupyter notebook

Once the Jupyter instance has opened in the browser, click on the New button, and in the Notebooks section, choose Python 2 (other kernels may be present in the section, depending on what you installed):

At this point, your new empty notebook will look like the following screenshot and you can start entering the commands in the cells:

For instance, you may start typing the following in the cell:

In: print ("This is a test")

After writing in cells, you just press the play button (below the Cell tab) to run it and obtain an output. Then, another cell will appear for your input. As you are writing in a cell, if you press the plus button on the above menu bar, you will get a new cell, and you can move from a cell to another using the arrows on the menu.

Most of the other functions are quite intuitive and we invite you to try them. In order to know better how Jupyter works, you may use a quick-start guide such as or you can get a book specialized in Jupyter functionalities.


For a complete treatise of the full range of Jupyter functionalities when running the IPython kernel, refer to the following two Packt Publishing books:

  • IPython Interactive Computing and Visualization Cookbook by Cyrille Rossant, Packt Publishing, September 25, 2014

  • Learning IPython for Interactive Computing and Data Visualization by Cyrille Rossant, Packt Publishing, April 25, 2013

For our illustrative purposes, just consider that every Jupyter block of instructions has a numbered input statement and an output one, so you will find the code presented in this book structured in to two blocks—at least when the output is not trivial at all—otherwise, just expect only the input part:

In:  <the code you have to enter>
Out: <the output you should get>

As a rule, you just have to type the code after In: in your cells and run it. You can then compare your output with the output that we provide using Out: followed by the output that we actually obtained on our computers when we tested the code.


Python packages

The packages that we are going to introduce in the present paragraph will be frequently used in the book. If you are not using a scientific distribution, we offer you a walkthrough on what versions you should decide on and how to install them quickly and successfully.


NumPy, which is Travis Oliphant's creation, is at the core of every analytical solution in the Python language. It provides the user with multidimensional arrays along with a large set of functions to operate multiple mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Arrays are useful not just to store data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems.

  • Website:

  • Version at the time of writing: 1.11.1

  • Suggested install command:

    $ pip install numpy


As a convention that is largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:

import numpy as np


An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more.

  • Website:

  • Version at the time of writing: 0.17.1

  • Suggested install command:

    $ pip install scipy


Pandas deals with everything that NumPy and SciPy cannot do. In particular, thanks to its specific object data structures, DataFrames, and Series, it allows the handling of complex tables of data of different types (something that NumPy's arrays cannot) and time series. Thanks to Wes McKinney's creation, you will be able to easily and smoothly load data from a variety of sources, and then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize it at your will.


Conventionally, pandas is imported as pd:

import pandas as pd


Started as part of SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Expect us to talk at length about this package throughout the book.

Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at Inria (French Institute for Research in Computer Science and Automation).

Scikit-learn offers modules for data processing (sklearn.preprocessing and sklearn.feature_extraction), model selection and validation (sklearn.cross_validation, sklearn.grid_search, and sklearn.metrics), and a complete set of methods (sklearn.linear_model) in which the target value, being a number or probability, is expected to be a linear combination of the input variables.


Note that the imported module is named sklearn.

The matplotlib package

Originally developed by John Hunter, matplotlib is the library containing all the building blocks to create quality plots from arrays and visualize them interactively.

You can find all the MATLAB-like plotting frameworks inside the PyLab module.

  • Website:

  • Version at the time of writing: 1.5.1

  • Suggested install command:

    $ pip install matplotlib

You can simply import just what you need for your visualization purposes:

import matplotlib as mpl
from matplotlib import pyplot as plt


Gensim, programmed by Radim Řehůřek, is an open source package suitable to analyze large textual collections by the usage of parallel distributable online algorithms. Among advanced functionalities, it implements Latent Semantic Analysis (LSA), topic modeling by Latent Dirichlet Allocation (LDA), and Google's word2vec, a powerful algorithm to transform texts into vector features to be used in supervised and unsupervised machine learning.


H2O is an open source framework for big data analysis created by the start-up (previously named as 0xdata). It is usable by R, Python, Scala, and Java programming languages. H2O easily allows using a standalone machine (leveraging multiprocessing) or Hadoop cluster (for example, a cluster in an AWS environment), thus helping you scale up and out.

In order to install the package, you first have to download and install Java on your system, (You need to have Java Development Kit (JDK) 1.8 installed as H2O is Java-based.) then you can refer to the online instructions provided at

We can overview all the installation steps together in the following lines.

You can install both H2O and its Python API, as we have been using in our book, by the following instructions:

$ pip install -U requests
$ pip install -U tabulate
$ pip install -U future
$ pip install -U six

These steps will install the required packages, and then we can install the framework, taking care to remove any previous installation:

$ pip uninstall h2o
$ pip install h2o

In order to have installed the same version as we have in our book, you can change the last pip install command with the following:

$ pip install

If you run into problems, please visit the H2O Google groups page, where you can get help with your problems:!forum/h2ostream


XGBoost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). It is available for Python, R, Java, Scala, Julia, and C++ and it can work on a single machine (leveraging multithreading), both in Hadoop and Spark clusters.

Detailed instructions to install XGBoost on your system can be found at

The installation of XGBoost on both Linux and Mac OS is quite straightforward, whereas it is a little bit trickier for Windows users. For this reason, we provide specific installations steps to have XGBoost working on Windows:

  1. First of all, download and install Git for Windows (

  2. Then you need a Minimalist GNU for Windows (MinGW) compiler present on your system. You can download it from according to the characteristics of your system.

  3. From the command line, execute the following:

    $ git clone --recursive
    $ cd xgboost
    $ git submodule init
    $ git submodule update
  4. Then, from the command line, copy the configuration for 64-bit systems to be the default one:

    $ copy make\

    Alternatively, you can copy the plain 32-bit version:

    $ copy make\
  5. After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling procedure:

    $ make -j4
  6. Finally, if the compiler completed its work without errors, you can install the package in your Python by executing the following commands:

    $ cd python-package
    $ python install


Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently. Basically, it provides you with all the building blocks that you need to create deep neural networks.

The installation of Theano should be straightforward as it is now a package on PyPI:

$ pip install Theano

If you want the most updated version of the package, you can get them with GitHub cloning:

$ git clone git://

Then you can proceed with the direct Python installation:

$ cd Theano
$ python install

To test your installation, you can run the following from the shell/CMD and verify the reports:

$ pip install nose
$ pip install nose-parameterized
$ nosetests theano

If you are working on a Windows OS and the previous instructions don't work, you can try these steps:

  1. Install TDM-GCC x64 (

  2. Open the Anaconda command prompt and execute the following:

    $ conda update conda
    $ conda update –all
    $ conda install mingw libpython
    $ pip install git+git://


Theano needs libpython, which isn't compatible yet with version 3.5, so if your Windows installation is not working, that could be the likely cause.

In addition, Theano's website provides some information to Windows users that could support you when everything else fails:

An important requirement for Theano to scale out on GPUs is to install NVIDIA CUDA drivers and SDK for code generation and execution on GPU. If you do not know too much about the CUDA Toolkit, you can actually start from this web page in order to understand more about the technology being used:

Therefore, if your computer owns an NVIDIA GPU, you can find all the necessary instructions in order to install CUDA using this tutorial page from NVIDIA itself:


Just like Theano, TensorFlow is another open source software library for numerical computation using data flow graphs instead of just arrays. Nodes in such a graph represent mathematical operations, whereas the graph edges represent the multidimensional data arrays (the so-called tensors) moved between the nodes. Originally, Google researchers, being part of the Google Brain Team, developed TensorFlow and recently they made it open source for the public.

For the installation of TensorFlow on your computer, follow the instructions found at the following link:

Windows support is not present at the moment but it is in the current roadmap:

For Windows users, a good compromise could be to run the package on a Linux-based virtual machine or Docker machine. (The preceding OS set-up page offers directions to do so.)

The sknn library

The sknn library (for extensions, scikit-neuralnetwork) is a wrapper for Pylearn2, helping you to implement deep neural networks without requiring you to become an expert on Theano. As a bonus, the library is compatible with the Scikit-learn API.

Optionally, if you want to take advantage of the most advanced features such as convolution, pooling, or upscaling, you have to complete the installation as follows:

$ pip install -r

After installation, you also have to execute the following:

$ git clone
$ cd scikit-neuralnetwork
$ python develop

As seen for XGBoost, this will make the sknn package available in your Python installation.


The theanets package is a deep learning and neural network toolkit written in Python and uses Theano to accelerate computations. Just as with sknn, it tries to make it easier to interface with Theano functionalities in order to create deep learning models.

You can also download the current version from GitHub and install the package directly in Python:

$ git clone
$ cd theanets
$ python develop


Keras is a minimalist, highly modular neural networks library written in Python and capable of running on top of either TensorFlow or Theano.

  • Website:

  • Version at the time of writing: 1.0.5

  • Suggested installation from PyPI:

    $ pip install keras

You can also install the latest available version (advisable as the package is in continuous development) using the following command:

$ pip install git+git://

Other useful packages to install on your system

Concluding this long tour of the many packages that you will see in action among the pages of this book, we close with three simple, yet quite useful, packages, that need little presentation but need to be installed on your system: memory profiler, climate, and NeuroLab.

Memory profiler is a package monitoring memory usage by a process. It also helps dissecting memory consumption by a specific Python script, line by line. It can be installed as follows:

$ pip install -U memory_profiler

Climate just consists of some basic command-line utilities for Python. It can be promptly installed as follows:

$ pip install climate

Finally, NeuroLab is a very basic neural network package loosely based on the Neural Network Toolbox (NNT) in MATLAB. It is based on NumPy and SciPy, not Theano; consequently, do not expect astonishing performances but know that it is a good learning toolbox. It can be easily installed as follows:

$ pip install neurolab


In this introductory chapter, we have illustrated the different ways in which we can make machine learning algorithms scalable using Python (scale up and scale out techniques). We also proposed some motivating examples and set the stage for the book by illustrating how to install Python on your machine. In particular, we introduced you to Jupyter and covered all the most important packages that will be used in this book.

In the next chapter, we will dive into discussing how stochastic gradient descent can help you deal with massive datasets by leveraging I/O on a single machine. Basically, we will cover different ways of streaming data from large files or data repositories and feed it into a basic learning algorithm. You will be amazed at how simple solutions can be effective, and you will discover that even your desktop computer can easily crunch big data.

About the Authors

  • Bastiaan Sjardin

    Bastiaan Sjardin is a data scientist and founder with a background in artificial intelligence and mathematics. He has a MSc degree in cognitive science obtained at the University of Leiden together with on campus courses at Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA at Coursera in the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (, a company providing machine learning and artificial intelligence applications at scale.

    Browse publications by this author
  • Luca Massaron

    Luca Massaron is a data scientist and marketing research director specialized in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience of solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top-10 Kaggler, he has always been very passionate about every aspect of data and its analysis, and also about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, Luca believes that a lot can be achieved in data science just by doing the essentials.

    Browse publications by this author
  • Alberto Boschetti

    Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP) and behavioral analysis to machine learning and distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.

    Browse publications by this author

Latest Reviews

(2 reviews total)
Book Title
Unlock this book and the full library for only $5/m
Access now