Regression Analysis with Python

4.7 (3 reviews total)
By Luca Massaron , Alberto Boschetti
  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Regression – The Workhorse of Data Science

About this book

Regression is the process of learning relationships between inputs and continuous outputs from example data, which enables predictions for novel inputs. There are many kinds of regression algorithms, and the aim of this book is to explain which is the right one to use for each set of problems and how to prepare real-world data for it. With this book you will learn to define a simple regression problem and evaluate its performance. The book will help you understand how to properly parse a dataset, clean it, and create an output matrix optimally built for regression. You will begin with a simple regression algorithm to solve some data science problems and then progress to more complex algorithms. The book will enable you to use regression models to predict outcomes and take critical business decisions. Through the book, you will gain knowledge to use Python for building fast better linear models and to apply the results in Python or in any computer language you prefer.

Publication date:
February 2016
Publisher
Packt
Pages
312
ISBN
9781785286315

 

Chapter 1. Regression – The Workhorse of Data Science

Welcome to this presentation on the workhorse of data science, linear regression, and its related family of linear models.

Nowadays, interconnectivity and data explosion are realities that open a world of new opportunities for every business that can read and interpret data in real time. Everything is facilitating the production and diffusion of data: the omnipresent Internet diffused both at home and at work, an army of electronic devices in the pockets of large portions of the population, and the pervasive presence of software producing data about every process and event. So much data is generated daily that humans cannot deal with it because of its volume, velocity, and variety. Thus, machine learning and AI are on the rise.

Coming from a long and glorious past in the field of statistics and econometrics, linear regression, and its derived methods, can provide you with a simple, reliable, and effective tool to learn from data and act on it. If carefully trained with the right data, linear methods can compete well against the most complex and fresh AI technologies, offering you unbeatable ease of implementation and scalability for increasingly large problems.

In this chapter, we will explain:

  • Why linear models can be helpful as models to be evaluated in a data science pipeline or as a shortcut for the immediate development of a scalable minimum viable product

  • Some quick indications for installing Python and setting it up for data science tasks

  • The necessary modules for implementing linear models in Python

 

Regression analysis and data science


Imagine you are a developer hastily working on a very cool application that is going to serve thousands of customers using your company's website everyday. Using the available information about customers in your data warehouse, your application is expected to promptly provide a pretty smart and not-so-obvious answer. The answer unfortunately cannot easily be programmatically predefined, and thus will require you to adopt a learning-from-data approach, typical of data science or predictive analytics.

In this day and age, such applications are quite frequently found assisting numerous successful ventures on the Web, for instance:

  • In the advertising business, an application delivering targeted advertisements

  • In e-commerce, a batch application filtering customers to make more relevant commercial offers or an online app recommending products to buy on the basis of ephemeral data such as navigation records

  • In the credit or insurance business, an application selecting whether to proceed with online inquiries from users, basing its judgment on their credit rating and past relationship with the company

There are numerous other possible examples, given the constantly growing number of use cases about machine learning applied to business problems. The core idea of all these applications is that you don't need to program how your application should behave, but you just set some desired behaviors by providing useful examples. The application will learn by itself what to do in any circumstance.

After you are clear about the purpose of your application and decide to use the learning-from-data approach, you are confident that you don't have to reinvent the wheel. Therefore, you jump into reading tutorials and documentation about data science and machine learning solutions applied to problems similar to yours (they could be papers, online blogs, or books talking about data science, machine learning, statistical learning, and predictive analytics).

After reading a few pages, you will surely be exposed to the wonders of many complex machine learning algorithms you likely have never heard of before.

However, you start being puzzled. It isn't simply because of the underlying complex mathematics; it is mostly because of the large amount of possible solutions based on very different techniques. You also often notice the complete lack of any discussion about how to deploy such algorithms in a production environment and whether they would scale up to real-time server requests.

At this point, you are completely unsure where should you start. This is when this book will come to your rescue.

Let's start from the beginning.

Exploring the promise of data science

Given a more interconnected world and the growing availability of data, data science has become quite a hot topic in recent years.

In the past, analytical solutions had strong constrains: the availability of data. Useful data was generally scarce and always costly to obtain and store. Given the current data explosion, now abundant and cheaper information at hand makes learning from data a reality, thus opening the doors to a wide range of predictive applications that were simply impractical before.

In addition, being in an interconnected world, most of your customers are now reachable (and susceptible of being influenced) through the Internet or through mobile devices. This simply means that being smart in developing automated solutions based on data and its predictive powers can directly and almost instantaneously affect how your business works and performs. Being able to reach your customers instantly everywhere, 24 hours a day, 365 days a year, enables your company to turn data into profits, if you know the right things to be done. In the 21st century, data is the new oil of the digital economy, as a memorable and still undisputed article on Wired stated not too long ago (http://www.wired.com/insights/2014/07/data-new-oil-digital-economy/).However, as with oil, data has to be extracted, refined, and distributed.

Being at the intersection of substantive expertise (knowing how to do business and make profits), machine learning (learning from data), and hacking skills (integrating various systems and data sources), data science promises to find the mix of tools to leverage your available data and turn it into profits.

However, there's another side to the coin.

The challenge

Unfortunately, there are quite a few challenging issues in applying data science to a business problem:

  • Being able to process unstructured data or data that has been modeled for completely different purposes

  • Figuring out how to extract such data from heterogeneous sources and integrate it in a timely manner

  • Learning (from data) some effective general rules allowing you to correctly predict your problem

  • Understanding what has been learned and being able to effectively communicate your solution to a non-technical managerial audience

  • Scaling to real-time predictions given big data inputs

The first two points are mainly problems that require data manipulation skills, but from the third point onwards, we really need a data science approach to solve the problem.

The data science approach, based on machine learning, requires careful testing of different algorithms, estimating their predictive capabilities with respect to the problem, and finally selecting the best one to implement. This is exactly what the science in data science means: coming up with various different hypotheses and experimenting with them to find the one that best fits the problem and allows generalization of the results.

Unfortunately, there is no white unicorn in data science; there is no single hypothesis that can successfully fit all the available problems. In other words, we say that there is no free lunch (the name of a famous theorem from the optimization domain), meaning that there are no algorithms or procedures in data science that can always assure you the best results; each algorithm can be less or more successful, depending on the problem.

Data comes in all shapes and forms and reflects the complexity of the world we live in. The existing algorithms should have certain sophistication in order to deal with the complexity of the world, but don't forget that they are just models. Models are nothing but simplifications and approximations of the system of rules and laws we want to successfully represent and replicate for predictive reasons since you can control only what you can measure, as Lord Kelvin said. An approximation should be evaluated based on its effectiveness, and the efficacy of learning algorithms applied to real problems is dictated by so many factors (type of problem, data quality, data quantity, and so on) that you really cannot tell in advance what will work and what won't. Under such premises, you always want to test the simpler solutions first, and follow the principle of Occam's razor as much as possible, favoring simpler models against more complex ones when their performances are comparable.

Sometimes, even when the situation allows the introduction of more complex and more performant models, other factors may still favor the adoption of simpler yet less performant solutions. In fact, the best model is not always necessarily the most performant one. Depending on the problem and the context of application, issues such as ease of implementation in production systems, scalability to growing volumes of data, and performance in live settings, may deeply redefine how important the role of predictive performance is in the choice of the best solution.

In such situations, it is still advisable to use simpler, well-tuned models or easily explainable ones, if they provide an acceptable solution to the problem.

The linear models

In your initial overview of the problem of what machine learning algorithm to use, you may have also stumbled upon linear models, namely linear regression and logistic regression. They both have been presented as basic tools, building blocks of a more sophisticated knowledge that you should achieve before hoping to obtain the best results.

Linear models have been known and studied by scholars and practitioners for a long time. Before being promptly adopted into data science, linear models were always among the basic statistical models to start with in predictive analytics and data mining. They also have been a prominent and relevant tool part of the body of knowledge of statistics, economics, and many other quantitative subjects.

By a simple check (via a query from an online bookstore, from a library, or just from Google Books—https://books.google.com/), you will discover there is quite a vast availability of publications about linear regression. There is also quite an abundance of publications about logistic regression, and about other different variants of the regression algorithm, the so-called generalized linear models, adapted in their formulation to face and solve more complex problems.

As practitioners ourselves, we are well aware of the limits of linear models. However, we cannot ignore their strong positive key points: simplicity and efficacy. We also cannot ignore that linear models are indeed among the most used learning algorithms in applied data science, making them real workhorses in data analysis (in business as well as in many scientific domains).

Far from being the best tool at hand, they are always a good starting point in a data science path of discovery because they don't require hacking with too many parameters and they are very fast to train. Thus, linear models can point out the predictive power of your data at hand, identify the most important variables, and allow you to quickly test useful transformations of your data before applying more complex algorithms.

In the course of this book, you will learn how to build prototypes based on linear regression models, keeping your data treatment and handling pipeline prompt for possible development reiterations of the initial linear model into more powerful and complex ones, such as neural networks or support vector machines.

Moreover, you will learn that you maybe don't even need more complex models, sometimes. If you are really working with lots of data, after having certain volumes of input data feed into a model, using simple or complex algorithms won't matter all that much anymore. They will all perform to the best of their capabilities.

The capability of big data to make even simpler models as effective as a complex one has been pointed out by a famous paper co-authored by Alon Halevy, Peter Norvig, and Fernando Pereira from Google about The Unreasonable Effectiveness of Data (http://static.googleusercontent.com/media/research.google.com/it//pubs/archive/35179.pdf). Before that, the idea was already been known because of a less popular scientific paper by Microsoft researchers, Michele Banko and Eric Brill, Scaling to Very Very Large Corpora for Natural Language Disambiguation (http://ucrel.lancs.ac.uk/acl/P/P01/P01-1005.pdf).

In simple and short words, the algorithm with more data wins most of the time over other algorithms (no matter their complexity); in such a case, it could well be a linear model.

However, linear models can be also helpful downstream in the data science process and not just upstream. As they are fast to train, they are also fast to be deployed and you do not need coding complex algorithms to do so, allowing you to write the solution in any script or programming language you like, from SQL to JavaScript, from Python to C/C++.

Given their ease of implementation, it is not even unusual that, after building complex solutions using neural networks or ensembles, such solutions are reverse-engineered to find a way to make them available in production as a linear model and achieve a simpler and scalable implementation.

What you are going to find in the book

In the following pages, the book will explain algorithms as well as their implementation in Python to solve practical real-world problems.

Linear models can be counted among supervised algorithms, which are those algorithms that can formulate predictions on numbers and classes if previously given some correct examples to learn from. Thanks to a series of examples, you will immediately distinguish if a problem could be tractable using this algorithm or not.

Given the statistical origins of the linear models family, we cannot neglect starting from a statistical perspective. After contextualizing the usage of linear models, we will provide all the essential elements for understanding on what statistical basis and for what purpose the algorithm has been created. We will use Python to evaluate the statistical outputs of a linear model, providing information about the different statistical tests used.

The data science approach is quite practical (to solve a problem for its business impact), and many limitations of the statistical versions of linear models actually do not apply. However, knowing how the R-squared coefficient works or being able to evaluate the residuals of a regression or highlighting the collinearity of its predictors, can provide you with more means to obtain good results from your work in regression modeling.

Starting from regression models involving a single predictive variable, we will move on to consider multiple variables, and from predicting just numbers we will progress to estimating the probability of there being a certain class among two or many.

We will particularly emphasize how to prepare data, both the target variable (a number or a class) to be predicted and the predictors; variables contributing to a correct prediction. No matter what your data is made of, numbers, nouns, text, images, or sounds, we will provide you with the method to correctly prepare your data and transform it in such a way that your models will perform the best.

You will also be introduced to the scientific methodology at the very foundations of data science, which will help you understand why the data science approach is not just simply theoretical but also quite practical, since it allows obtaining models that can really work when applied to real-world problems.

The last pages of the book will cover some of the more advanced techniques for handling big data and complexity in models. We will also provide you with a few examples from relevant business domains and offer plenty of details about how to proceed to build a linear model, validate it, and later on implement it into a production environment.

 

Python for data science


Given the availability of many useful packages for creating linear models and given the fact that it is a programming language quite popular among developers, Python is our language of choice for all the code presented in this book.

Created in 1991 as a general-purpose, interpreted, object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to perform uncountable and fast experiments, easy theory development, and prompt deployments of scientific applications.

As a developer, you will find using Python interesting for various reasons:

  • It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you need in the course of a data analysis, and sometimes even more.

  • It is very versatile. No matter what your programming background or style is (object-oriented or procedural), you will enjoy programming with Python.

  • If you don't know it yet, but you know other languages well such as C/C++ or Java, it is very simple to learn and use. After you grasp the basics, there's no better way to learn more than by immediately starting to code.

  • It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux, and Mac OS systems. You won't have to worry about portability.

  • Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language).

  • There are packages that allow you to call other platforms, such as R and Julia, outsourcing some of the computations to them and improving your script performance. Moreover, there are also static compilers such as Cython or just-in-time compilers such as PyPy that can transform Python code into C for higher performance.

  • It can work better than other platforms with in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using the various iterations and reiterations of data wrangling.

Installing Python

As a first step, we are going to create a fully working data science environment you can use to replicate and test the examples in the book and prototype your own models.

No matter in what language you are going to develop your application, Python will provide an easy way to access your data, build your model from it, and extract the right parameters you need to make predictions in a production environment.

Python is an open source, object-oriented, cross-platform programming language that, compared with its direct competitors (for instance, C/C++ and Java), produces very concise and very readable code. It allows you to build a working software prototype in a very short time, to maintain it easily, and to scale it to larger quantities of data. It has become the most used language in the data scientist's toolbox because it is a general-purpose language made very flexible thanks to a large variety of available packages that can easily and rapidly help you solve a wide spectrum of both common and niche problems.

Choosing between Python 2 and Python 3

Before starting, it is important to know that there are two main branches of Python: version 2 and 3. Since many core functionalities have changed, scripts built for one versions are often incompatible (they won't work without raising errors and warnings) with the other one. Although the third version is the newest, the older one is still the most used version in the scientific area, and the default version for many operating systems (mainly for compatibility in upgrades). When version 3 was released in 2008, most scientific packages weren't ready, so the scientific community was stuck with the previous version. Fortunately, since then, almost all packages have been updated, leaving just a few orphans of Python 3 compatibility (see http://py3readiness.org/ for a compatibility overview).

In this book, which should address a large audience of developers, we agreed that it would have been better to work with Python 3 rather than the older version. Python 3 is the future of Python; in fact, it is the only version that will be further developed and improved by the Python foundation. It will be the default version of the future. If you are currently working with version 2 and you prefer to keep on working with it, we suggest you to run these following few lines of code at the beginning every time you start the interpreter. By doing so, you'll render Python 2 capable of executing most version 3 code with minimal or no problems at all (the code will patch just a few basic incompatibilities, after installing the future package using the command pip install future, and let you safely run all the code in this book):

from __future__ import unicode_literals 
# to make all string literals into unicode strings
from __future__ import print_function # To print multiple strings
from six import reraise as raise_ # Raising exceptions with a traceback
from __future__ import division # True division
from __future__ import absolute_import # Flexible Imports

Tip

The from __future__ import commands should always occur at the beginning of your script or you may experience Python reporting an error.

Step-by-step installation

If you have never used Python (but that doesn't mean that you may not already have it installed on your machine), you need to first download the installer from the main website of the project, https://www.python.org/downloads/ (remember, we are using version 3), and then install it on your local machine.

This section provides you with full control over what can be installed on your machine. This is very useful when you are going to use Python as both your prototyping and production language. Furthermore, it could help you keep track of the versions of packages you are using. Anyway, please be warned that a step-by-step installation really takes time and effort. Instead, installing a ready-made scientific distribution will lessen the burden of installation procedures and may well facilitate initial learning because it can save you quite a lot of time, though it will install a large number of packages (that for the most part you may never use) on your computer all at once. Therefore, if you want to start immediately and don't need to control your installation, just skip this part and proceed to the next section about scientific distributions.

As Python is a multiplatform programming language, you'll find installers for computers that either run on Windows or Linux/Unix-like operating systems. Please remember that some Linux distributions (such as Ubuntu) already have Python packed in the repository, which makes the installation process even easier:

  1. Open a Python shell, type python in the terminal or click on the Python IDLE icon. Then, to test the installation, run the following code in the Python interactive shell or REPL:

    >>> import sys
    >>> print (sys.version)
    

    Tip

    Downloading the example code

    You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

If a syntax error is raised, it means that you are running Python 2 instead of Python 3. Otherwise, if you don't experience an error and you read that your Python version is 3.x (at the time of writing this book, the latest version was 3.5.0), then congratulations on running the version of Python we elected for this book.

To clarify, when a command is given in the terminal command line, we prefix the command with $>. Otherwise, if it's for the Python REPL, it's preceded by >>>.

Installing packages

Depending on your system and past installations, Python may not come bundled with all you need, unless you have installed a distribution (which, on the other hand, is usually stuffed with much more than you may need).

To install any packages you need, you can use the commands pip or easy_install; however, easy_install is going to be dropped in the future and pip has important advantages over it. It is preferable to install everything using pip because:

  • It is the preferred package manager for Python 3 and, starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers

  • It provides an uninstall functionality

  • It rolls back and leaves your system clear if, for whatever reason, the package installation fails

The command pip runs on the command line and makes the process of installing, upgrading, and removing Python packages simply a breeze.

As we mentioned, if you're running at least Python 2.7.9 or Python 3.4 the pip command should already be there. To verify which tools have been installed on your local machine, directly test with the following command if any error is raised:

$> pip -V

In some Linux and Mac installations, the command is present as pip3 (more likely if you have both Python 2 and 3 on your machine), so, if you received an error when looking for pip, also try running the following:

$> pip3 -V

Alternatively, you can also test if the old command easy_install is available:

$> easy_install --version

Tip

Using easy_install in spite of pip's advantages makes sense if you are working on Windows because pip will not install binary packages (it will try to build them); therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day.

If your test ends with an error, you really need to install pip from scratch (and in doing so, also easy_install at the same time).

To install pip, simply follow the instructions given at https://pip.pypa.io/en/stable/installing/. The safest way is to download the get-pi.py script from https://bootstrap.pypa.io/get-pip.py and then run it using the following:

$> python get-pip.py

By the way, the script will also install the setup tool from https://pypi.python.org/pypi/setuptools, which contains easy_install.

As an alternative, if you are running a Debian/Ubuntu Unix-like system, then a fast shortcut would be to install everything using apt-get:

$> sudo apt-get install python3-pip

After checking this basic requirement, you're now ready to install all the packages you need to run the examples provided in this book. To install a generic package, <pk>, you just need to run the following command:

$> pip install <pk>

Alternatively, if you prefer to use easy_install, you can also run the following command:

$> easy_install <pk>

After that, the <pk>package and all its dependencies will be downloaded and installed.

If you are not sure whether a library has been installed or not, just try to import a module inside it. If the Python interpreter raises an Import Error error, it can be concluded that the package has not been installed.

Let's take an example. This is what happens when the NumPy library has been installed:

>>> import numpy

This is what happens if it is not installed:

>>> import numpy 
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named numpy

In the latter case, before importing it, you'll need to install it through pip or easy_install.

Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases they don't match. For example, the sklearn module is included in the package named Scikit-learn.

Package upgrades

More often than not, you will find yourself in a situation where you have to upgrade a package because the new version either is required by a dependency or has additional features that you would like to use. To do so, first check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example using the NumPy package:

>>> import numpy
>>> numpy.__version__ # 2 underscores before and after
'1.9.2'

Now, if you want to update it to a newer release, say the 1.10.1 version, you can run the following command from the command line:

$> pip install -U numpy==1.10.1

Alternatively, but we do not recommend it unless it proves necessary, you can also use the following command:

$> easy_install --upgrade numpy==1.10.1

Finally, if you are just interested in upgrading it to the latest available version, simply run the following command:

$> pip install -U numpy

You can alternatively also run the easy_install alternative:

$> easy_install --upgrade numpy

Scientific distributions

As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need (sometimes, the installation procedures may not go as smoothly as you'd hoped for earlier).

If you want to save time and effort and want to ensure that you have a working Python environment that is ready to use, you can just download, install, and use a scientific Python distribution. Apart from Python itself, distributions also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE set up for your usage. A few of them are very well known among data scientists and, in the sections that follow, you will find some of the key features for two of these packages that we found most useful and practical.

To immediately focus on the contents of the book, we suggest that you first download and install a scientific distribution, such as Anaconda (which is the most complete one around, in our opinion). Then, after practicing the examples in the book, we suggest you to decide to fully uninstall the distribution and set up Python alone, which can be accompanied by just the packages you need for your projects.

Again, if possible, download and install the version containing Python 3.

The first package that we would recommend you try is Anaconda (https://www.continuum.io/downloads), which is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, including NumPy, SciPy, Pandas, IPython, Matplotlib, Scikit-learn, and Statsmodels. It's a cross-platform distribution that can be installed on machines with other existing Python distributions and versions, and its base version is free. Additional add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on its website, Anaconda's goal is to provide enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing.

As a second suggestion, if you are working on Windows, WinPython (http://winpython.sourceforge.net) could be a quite interesting alternative (sorry, no Linux or MacOS versions). WinPython is also a free, open source Python distribution maintained by the community. It is designed with scientists in mind, and it includes many essential packages such as NumPy, SciPy, Matplotlib, and IPython (the same as Anaconda's). It also includes Spyder as an IDE, which can be helpful if you have experience using the MATLAB language and interface. A crucial advantage is that it is portable (you can put it into any directory, or even on a USB flash drive, without the need for any administrative elevation). Using WinPython, you can have different versions present on your computer, move a version from a Windows computer to another, and you can easily replace an older version with a newer one just by replacing its directory. When you run WinPython or its shell, it will automatically set all the environment variables necessary for running Python as it were regularly installed and registered on your system.

Finally, another good choice for a distribution that works on Windows could be Python(x,y). Python(x,y) (http://python-xy.github.io) is a free, open source Python distribution maintained by the scientific community. It includes a number of packages, such as NumPy, SciPy, NetworkX, IPython, and Scikit-learn. It also features Spyder, the interactive development environment inspired by the MATLAB IDE.

Introducing Jupyter or IPython

IPython was initiated in 2001 as a free project by Fernando Perez. It addressed a lack in the Python stack for scientific investigations. The author felt that Python lacked a user programming interface that could incorporate the scientific approach (mainly meaning experimenting and interactively discovering) in the process of software development.

A scientific approach implies fast experimentation with different hypotheses in a reproducible fashion (as do data exploration and analysis tasks in data science), and when using IPython you will be able to more naturally implement an explorative, iterative, trial-and-error research strategy in your code writing.

Recently, a large part of the IPython project has been moved to a new one called Jupyter (http://jupyter.org):

This new project extends the potential usability of the original IPython interface to a wide range of programming languages such as the following:

For a complete list of available kernels, please visit: https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages.

You can use the same IPython-like interface and interactive programming style no matter what language you are developing in, thanks to the powerful idea of kernels, which are programs that run the user's code, as communicated by the frontend interface; they then provide feedback on the results of the executed code to the interface itself.

IPython (Python is the zero kernel, the original starting point) can be simply described as a tool for interactive tasks operable by a console or by a web-based notebook, which offers special commands that help developers to better understand and build the code currently being written.

Contrary to an IDE interface, which is built around the idea of writing a script, running it afterwards, and finally evaluating its results, IPython lets you write your code in chunks, run each of them sequentially, and evaluate the results of each one separately, examining both textual and graphic outputs. Besides graphical integration, it provides further help, thanks to customizable commands, a rich history (in the JSON format), and computational parallelism for enhanced performance when dealing with heavy numeric computations.

In IPython, you can easily combine code, comments, formulas, charts and interactive plots, and rich media such as images and videos, making it a complete scientific sketchpad for all your experimentations and their results together. Moreover, IPython allows reproducible research, allowing any data analysis and model building to be recreated easily under different circumstances:

IPython works on your favorite browser (which could be Explorer, Firefox, or Chrome, for instance) and when started presents a cell waiting for code to written in. Each block of code enclosed in a cell can be run and its results are reported in the space just after the cell. Plots can be represented in the notebook (inline plot) or in a separate window. In our example, we decided to plot our chart inline.

Notes can be written easily using the Markdown language, a very easy and accessible markup language (http://daringfireball.net/projects/markdown).

Such an approach is also particularly fruitful for tasks involving developing code based on data, since it automatically accomplishes the often-neglected duty of documenting and illustrating how data analysis has been done, as well as its premises, assumptions, and intermediate/final results. If part of your job is also to present your work and attract internal or external stakeholders to the project, IPython can really perform the magic of storytelling for you with little additional effort. On the web page https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks, there are many examples, some of which you may find inspiring for your work as we did.

Actually, we have to confess that keeping a clean, up-to-date IPython Notebook has saved us uncountable times when meetings with managers/stakeholders have suddenly popped up, requiring us to hastily present the state of our work.

As an additional resource, IPython offers you a complete library of many magic commands that allow you to execute some useful actions such as measuring the time it takes for a command to execute, or creating a text file with the output of a cell. We distinguish between line magic and cell magic, depending on whether they operate a single line of code or the code contained in an entire cell. For instance, the magic command %timeit measures the time it takes to execute the command on the same line of the line magic, whereas %%time is a cell magic that measures the execution time of an entire cell.

If you want to explore more about magic commands, just type %quickref into an IPython cell and run it: a complete guide will appear to illustrate all available commands.

In short, IPython lets you:

  • See intermediate (debugging) results for each step of the analysis

  • Run only some sections (or cells) of the code

  • Store intermediate results in JSON format and have the ability to version-control them

  • Present your work (this will be a combination of text, code, and images), share it via the IPython Notebook Viewer service (http://nbviewer.ipython.org/), and easily export it into HTML, PDF, or even slide shows

IPython is our favored choice throughout this book, and it is used to clearly and effectively illustrate operations with scripts and data and their consequent results.

Note

For a complete treatise on the full range of IPython functionalities, please refer to the two Packt Publishing books IPython Interactive Computing and Visualization Cookbook, Cyrille Rossant, Packt Publishing, September 25 2014, and Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing, April 25 2013.

For our illustrative purposes, just consider that every IPython block of instructions has a numbered input statement and an output one, so you will find the code presented in this book structured in two blocks, at least when the output is not at all trivial; otherwise just expect only the input part:

In:  <the code you have to enter>
Out: <the output you should get>

Please notice that we do not number the inputs or the outputs.

Though we strongly recommend using IPython, if you are using a REPL approach or an IDE interface, you can use the same instructions and expect identical results (but for print formats and extensions of the returned results).

 

Python packages and functions for linear models


Linear models diffuse in many different scientific and business applications and can be found, under different functions, in quite a number of different Python packages. We have selected a few for use in this book. Among them, Statsmodels is our choice for illustrating the statistical properties of models, and Scikit-learn is instead the package we recommend for easily and seamlessly preparing data, building models, and deploying them. We will present models built with Statsmodels exclusively to illustrate the statistical properties of the linear models, resorting to Scikit-learn to demonstrate how to approach modeling from a data science point of view.

NumPy

NumPy, which is Travis Oliphant's creation, is at the core of every analytical solution in the Python language. It provides the user with multidimensional arrays, along with a large set of functions to operate multiple mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions and that implement mathematical vectors and matrices. Arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems.

In the book, we are primarily going to use the module linalg from NumPy; being a collection of linear algebra functions, it will provide help in explaining the nuts and bolts of the algorithm:

  • Website: http://www.numpy.org/

  • Import conventions: import numpy as np

  • Version at the time of print: 1.9.2

  • Suggested install command: pip install numpy

Tip

As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:

import numpy as np

There are importing conventions also for other Python features that we will be using in the code presented in this book.

SciPy

An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more.

The scipy.optimize package provides several commonly used optimization algorithms, used to detail how a linear model can be estimated using different optimization approaches:

  • Website: http://www.scipy.org/

  • Import conventions: import scipy as sp

  • Version at time of print: 0.16.0

  • Suggested install command: pip install scipy

Statsmodels

Previously part of Scikit, Statsmodels has been thought to be a complement to SciPy statistical functions. It features generalized linear models, discrete choice models, time series analysis, and a series of descriptive statistics as well as parametric and nonparametric tests.

In Statsmodels, we will use the statsmodels.api and statsmodels.formula.api modules, which provide functions for fitting linear models by providing both input matrices and formula's specifications:

  • Website: http:/statsmodels.sourceforge.net/

  • Import conventions: import statsmodels.api as sm and import statsmodels.formula.api as smf

  • Version at the time of print: 0.6.1

  • Suggested install command: pip install statsmodels

Scikit-learn

Started as part of the SciPy Toolkits (SciKits), Scikit-learn is the core of data science operations on Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Expect us to talk at length about this package throughout the book.

Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at INRA (French Institute for Research in Computer Science and Automation).

Scikit-learn offers modules for data processing (sklearn.preprocessing, sklearn.feature_extraction), model selection, and validation (sklearn.cross_validation, sklearn.grid_search, and sklearn.metrics) and a complete set of methods (sklearn.linear_model) in which the target value, being both a number or a probability, is expected to be a linear combination of the input variables:

  • Website: http://scikit-learn.org/stable/

  • Import conventions: None; modules are usually imported separately

  • Version at the time of print: 0.16.1

  • Suggested install command: pip install scikit-learn

Tip

Note that the imported module is named sklearn.

 

Summary


In this chapter, we glanced at the usefulness of linear models under the data science perspective and we introduced some basic concepts of the data science approach that will be explained in more detail later and will be applied to linear models. We have also provided detailed instructions on how to set up the Python environment; these will be used throughout the book to present examples and provide useful code snippets for the fast development of machine learning hypotheses.

In the next chapter, we will begin presenting linear regression from its statistical foundations. Starting from the idea of correlation, we will build up the simple linear regression (using just one predictor) and provide the algorithm's formulations.

About the Authors

  • Luca Massaron

    Luca Massaron is a data scientist and marketing research director specialized in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience of solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top-10 Kaggler, he has always been very passionate about every aspect of data and its analysis, and also about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, Luca believes that a lot can be achieved in data science just by doing the essentials.

    Browse publications by this author
  • Alberto Boschetti

    Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP) and behavioral analysis to machine learning and distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.

    Browse publications by this author

Latest Reviews

(3 reviews total)
Excellent
haven't used it properly yet. Have only skimmed through it. Haven't seen any problems so far
The book is well written, covering both the theoretical basis of regression and their implementation using the statsmodels module. If you have a basic knowledge of statistics this is the book I recommend to gain proficiency in doing regression analysis, especially if you are someone who learns best through a hands-on approach.

Recommended For You

Book Title
Access this book, plus 7,500 other titles for FREE
Access now