In this chapter, you'll get fully set up to perform business intelligence tasks with Python. We'll start by installing a distribution of Python called Anaconda. Next, we'll get MongoDB up and running for storing data. After that, we'll install additional Python libraries, install a GUI tool for MongoDB, and finally take a look at the dataset that we'll be using throughout this book.
Without further ado, let's get started!
Throughout this book, we'll be using Python as the main tool for performing business intelligence tasks. This recipe shows you how to install a specific Python distribution, Anaconda.
Regardless of which operating system you use, open a web browser and browse to the Anaconda download page at http://continuum.io/downloads.
The download page will automatically detect your operating system.
In this section, we have listed the steps to install Anaconda for all the major operating systems: Mac OS X, Windows, and Linux.
Click on the I WANT PYTHON 3.4 link. We'll be using Python 3.4 throughout this book.
Next, click on the Mac OS X — 64-Bit Python 3.4 Graphical Installer button to download Anaconda.
Once the download completes, find the downloaded Anaconda installer on your computer, and double-click on the installer file (a .pkg file) to begin the installation.
Walk through the installer steps to complete the installation. I recommend keeping the default settings.
To verify that Anaconda is installed correctly, open a terminal and type the following command:
python
If the installer was successful, you should see something like this:
Click on the I WANT PYTHON 3.4 link. We'll be using Python 3.4 throughout this book.
Next, click on the Windows 64-Bit Python 3.4 Graphical Installer button to download Anaconda.
Once the download completes, find the downloaded Anaconda installer on your computer, and double-click on the Anaconda3-2.3.0-Windows-x86_64.exe file to begin the installation.
Walk through the installer steps to complete the installation. I recommend keeping the default settings.
To verify that Anaconda has installed correctly, open a command prompt and type the following command:
python
If the installation was successful, you should see something like this:
Linux servers typically have no graphical user interface (GUI), so you'll first need to log into your server and get a command prompt. With that complete, do the following:
On the Anaconda downloads page, select Linux.
Choose the Python 3.4 link.
Right-click on the Linux X 64-Bit button, and copy the link.
At the command prompt on your server, use curl to download the file, pasting in the download link that you copied:
curl -O <LINK TO DOWNLOAD>
I've created a special shortcut on my website that is a bit easier to type at the command line: http://robertwdempsey.com/anaconda3-linux.
Once Anaconda downloads, use the following command to start the installer:
bash Anaconda3-2.3.0-Linux-x86_64.sh
Accept the license agreement to begin installation.
When asked if you would like Anaconda to prepend the Anaconda3 install location to the PATH variable, type yes.
To have the PATH update take effect immediately after the installation completes, type the following command in the command line:
source ~/.bashrc
Once the installation is complete, verify the installation by typing python in the command line. If everything worked correctly, you should see something like this:
Anaconda holds many advantages over downloading Python from http://www.python.org or using the Python distribution included with your computer, some of which are as follows:
Almost 90 percent of what you'll use on a day-to-day basis is already included. In fact, it contains over 330 of the most popular Python packages.
Using Anaconda on both the computer you use for development and the server where your solutions will be deployed helps ensure that you are using the same version of the Python packages that your applications require.
It's updated frequently, so you can stay on recent versions of Python and the Python packages.
It works on all the major operating systems—Linux, Mac, and Windows.
It comes with tools to connect and integrate with Microsoft Excel.
At the time of writing this, the current version of Anaconda for Python 3 is 2.3.0.
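Beyond reading the banner that the python command prints, you can check the interpreter from inside Python. The following is a minimal sketch; the helper name describe_interpreter is made up for this illustration, and the test for "Anaconda" or "Continuum" in the version string is a heuristic assumption rather than an official check.

```python
import sys


def describe_interpreter():
    """Summarize the running interpreter for a quick installation check."""
    version = sys.version
    return {
        "major": sys.version_info[0],
        "minor": sys.version_info[1],
        # Heuristic (an assumption): Anaconda builds usually mention
        # "Anaconda" or "Continuum" in the version banner.
        "looks_like_anaconda": "Anaconda" in version or "Continuum" in version,
        "executable": sys.executable,
    }


info = describe_interpreter()
print("Python {major}.{minor}".format(**info))
print("Anaconda build (heuristic):", info["looks_like_anaconda"])
print("Interpreter path:", info["executable"])
```

If the interpreter path does not point into your Anaconda install directory, the PATH variable is probably still resolving to another Python installation.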
Seven Python libraries make up our Python business intelligence toolkit:
Pandas: A set of high-performance, easy-to-use data structures and data analysis tools. Pandas is the backbone of all our business intelligence tasks.
Scikit-learn: Gives us simple and efficient tools for data mining and data analysis including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. This will be the workhorse library for our analysis.
NumPy: An efficient multi-dimensional container of generic data that allows for arbitrary datatypes to be defined. We won't use NumPy directly; however, Pandas relies on it.
Matplotlib: A 2D plotting library. We'll use this to generate all our charts.
PyMongo: Allows us to connect to and use MongoDB. We'll use this to insert and retrieve data from MongoDB.
XlsxWriter: This allows us to create Microsoft Excel files. This library will be used to generate reports in the Excel format.
IPython Notebook (Jupyter): An interactive computational environment. We'll use this to write our code so that we can get feedback faster than running a script over and over again.
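To see which of these libraries are already available in your environment, a short check along the following lines can help. This is only a sketch: the helper name missing_modules is made up for this illustration, and the list uses the import names of the libraries, which differ from the display names in a couple of cases.

```python
import importlib.util

# Import names for the seven toolkit libraries. Note that scikit-learn is
# imported as "sklearn", and the names are case-sensitive.
TOOLKIT_MODULES = [
    "pandas", "sklearn", "numpy", "matplotlib",
    "pymongo", "xlsxwriter", "IPython",
]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(TOOLKIT_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All toolkit libraries were found.")
```

The check only locates each package without importing it, so it runs quickly and does not fail if a package is present but broken.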
In this section, you'll see how to install, configure, and run MongoDB on all the major operating systems—Mac OS X, Windows, and Linux.
Open a web browser and visit: https://www.mongodb.org/downloads.
The following steps explain how to install, configure, and run MongoDB on Mac OS X:
On the download page, click on the Mac OS X tab, and select the version you want.
Unpack the downloaded file and copy it to any directory that you like. I typically create an Applications folder in my home directory where I install apps like this.
For our purpose, we're going to set up a single instance of MongoDB. This means there is literally nothing to configure. To run MongoDB, open a command prompt and do the following:
At the root of your computer, make a data directory. By default, mongod stores its database files in /data/db:
sudo mkdir -p /data/db
Make your user the owner of the directory using the chown command:
sudo chown -R your_user_name:proper_group /data
Go to the bin directory inside the directory where you unpacked MongoDB.
Type the following command:
./mongod
You should see the following output from Mongo:
The following steps explain how to install, configure, and run MongoDB on Windows:
Once downloaded, browse to the folder where Mongo was downloaded, and double-click on the installer file.
When asked which setup type you want, select Complete.
Follow the instructions to complete the installation.
Create a data folder at C:\data\db. MongoDB needs this directory in order to run. This is where, by default, Mongo is going to store all its database files.
Next, at the command prompt, navigate to the directory where Mongo was installed and run Mongo:
cd "C:\Program Files\MongoDB\Server\3.0\bin"
mongod.exe
You should see an output like the following screenshot from Mongo, letting you know it's working:
The easiest way to install MongoDB in Linux is by using apt. At the time of writing, there are apt packages for 64-bit long-term support Ubuntu releases, specifically 12.04 LTS and 14.04 LTS. Since the URL for the public key can change, please visit the Mongo Installation Tutorial to ensure that you have the most recent one: https://docs.mongodb.org/manual/tutorial/install-mongodb-on-ubuntu/.
Install Mongo as follows:
Log in to your Linux box.
Import the public key:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
Create a list file for MongoDB:
echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list
Update apt:
sudo apt-get update
Install the latest version of Mongo:
sudo apt-get install -y mongodb-org
Run Mongo with the following command:
sudo service mongod start
Verify that MongoDB is running by checking the contents of the log file at /var/log/mongodb/mongod.log for a line that looks like this:
[initandlisten] waiting for connections on port 27017
You can stop the mongod service by using the following command:
sudo service mongod stop
Restart MongoDB with this command:
sudo service mongod restart
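The log check in the steps above can also be done from Python. The following is a minimal sketch; the helper name mongod_is_ready is made up for this illustration, and it assumes your user has permission to read the log file. Keep in mind that a line left over from an earlier run would also match, so this confirms that mongod has started at some point, not that it is running right now.

```python
READY_MARKER = "waiting for connections on port"


def mongod_is_ready(log_path):
    """Return True if the log contains the line mongod writes once it is listening."""
    # errors="replace" keeps the scan going if the log holds undecodable bytes.
    with open(log_path, "r", errors="replace") as log_file:
        return any(READY_MARKER in line for line in log_file)
```

On the Ubuntu setup described here, you would call it as mongod_is_ready("/var/log/mongodb/mongod.log").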
MongoDB's document data model makes it easy for you to store data of any structure and to dynamically modify the schema. In layman's terms, MongoDB provides a vast amount of flexibility when it comes to storing your data. This comes in very handy when we import our data. Unlike with an SQL database, we won't have to create a table or set up a schema beforehand; MongoDB creates the collection, along with its default _id index, automatically when we import the data.
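To make this flexibility concrete, here is a small sketch of storing differently shaped documents in one collection. The function name and all field names are hypothetical and chosen only for illustration. The function expects an object that behaves like a PyMongo collection, using the insert_many() method available in PyMongo 3.0 and later.

```python
def store_mixed_documents(collection):
    """Insert documents with different fields into one collection.

    `collection` is expected to behave like a PyMongo collection.
    No table or schema is declared before inserting.
    """
    # Hypothetical documents: each one has a different set of fields.
    documents = [
        {"kind": "accident", "severity": 2},
        {"kind": "accident", "severity": 3, "weather": "rain"},
        {"kind": "vehicle", "details": {"make": "unknown", "age_years": 4}},
    ]
    result = collection.insert_many(documents)
    return len(result.inserted_ids)
```

With MongoDB running locally and PyMongo installed, you could pass in a collection such as MongoClient("localhost", 27017)["scratch"]["flexible"], where the database and collection names are placeholders. MongoDB creates both on the first insert.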
IPython Notebook, an interactive, browser-based tool for developing in Python, has become the de facto standard for creating and sharing code. We'll be using it throughout this book. The Python library that we're about to install, Rodeo, is an alternative you can use. The difference between the two is that Rodeo has built-in functionality to view data in a Pandas data frame, which comes in handy when you want to see, in real time, the changes that you are making to your data. Having said that, IPython Notebook is the current standard.
Regardless of the operating system, you install Rodeo with the following command:
pip install rodeo
That's all there is to it!
Using this recipe, you will get to learn how to start Rodeo.
Robomongo is a GUI tool for managing MongoDB that runs on Mac OS X, Windows, and Linux. It allows you to create new databases and collections and to run queries. It gives you the full power of the MongoDB shell in a GUI application, and has features including multiple shells, multiple results, and autocompletion. And to top it all, it's free.
Open a web browser, and browse to http://robomongo.org/.
The following steps explain how to install Robomongo on Mac OS X:
Click on the Download for Mac OS X button.
Click on the Mac OS X Installer (.dmg) link to download the file.
Once downloaded, double-click on the installer file.
Drag the Robomongo application to the Applications folder.
Open the Applications folder, and double-click on Robomongo to start it up.
In the MongoDB Connections window, create a new connection:
Click on Save.
Assuming that you have MongoDB running, you should see the default system database.
The following steps explain how to install Robomongo on Windows:
Click on the Download for Windows button.
Click on the Windows Installer (.exe) link to download the file.
Once downloaded, double-click on the installer file, and follow the install instructions, accepting all the defaults.
Finally, run Robomongo.
In the MongoDB Connections window, create a new connection:
Click on Save.
Highlight your new connection, and click on Connect.
In the View menu, select Explorer to start browsing the existing MongoDB databases. As this is a brand new instance, you will only have the system collection.
Robomongo allows you to run any query against MongoDB that you could run with the MongoDB command-line shell. This is a great way to test the queries that you'll write and to view the results.
To use this recipe, you need to have a working installation of MongoDB and have Robomongo installed.
You can use Robomongo to run any query against MongoDB that you would run at the command line. Use the following command to retrieve a single record:
db.getCollection('accidents').findOne()
You can view the results in multiple formats:
Tree mode
Table mode
Text mode
By default, Robomongo will show you the results in tree mode as shown in the following screenshot:

In this section, we're going to download and take a bird's eye view of the dataset we'll be using throughout this book—the UK Road Safety Data. In total, this dataset provides more than 15 million rows across three CSV files.
Visit the following URL: http://data.gov.uk/dataset/road-accidents-safety-data/resource/80b76aec-a0a1-4e14-8235-09cc6b92574a.
Click on the red Download button on the right side of the page. I suggest creating a data directory to hold the data files.
Unpack the provided zip files in the directory you created.
You should see the following four files included in the expanded directory:
Accidents7904.csv
Casualty7904.csv
Road-Accident-Safety-Data-Guide-1979-2004.xls
Vehicles7904.csv
The CSV files contain the data that we are going to use in the recipes throughout this book. The Excel file is pure magic, though. It contains a reference for all the data, including a list of the fields in each dataset as well as the coding used.
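Because the CSV files are large, it helps to look at just the header and a few rows before loading anything in full. The following is a minimal sketch using only the standard library's csv module; the helper name peek_csv is made up for this illustration, and the file is assumed to be readable with your platform's default text encoding.

```python
import csv


def peek_csv(path, n_rows=5):
    """Return the header and the first n_rows data rows of a CSV file."""
    with open(path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        header = next(reader)  # the first line holds the field names
        # zip() stops after n_rows, so the rest of the file is never read.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Example call, assuming the files were unpacked into a "data" directory:
# header, rows = peek_csv("data/Accidents7904.csv")
```

The field names returned in the header can then be looked up in the Excel data guide.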
Coding data is a very important preprocessing step. Most analysis tools that you will use expect to see numbers rather than labels such as city or road type. The reason for this is that computers don't understand context like we humans do. Is Paris a city or a person? It depends. Computers can't make that judgment call. To get around this, we assign numbers to each text value. That's been done with this dataset.
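As a rough illustration of what coding means, the following sketch assigns a number to each distinct text label. The helper name encode_labels is made up for this illustration, and the labels and resulting codes are examples only; they are not the codes used in the dataset, which are documented in the Excel guide.

```python
def encode_labels(values):
    """Assign an integer code to each distinct label, in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            # Codes start at 1 here; that is an arbitrary choice for this sketch.
            mapping[value] = len(mapping) + 1
        codes.append(mapping[value])
    return codes, mapping


road_types = ["Roundabout", "Slip road", "Roundabout", "Dual carriageway"]
codes, mapping = encode_labels(road_types)
print(codes)    # [1, 2, 1, 3]
print(mapping)  # {'Roundabout': 1, 'Slip road': 2, 'Dual carriageway': 3}
```

Keeping the mapping alongside the codes matters, because it is the only way to translate the numbers back into readable labels when you report results.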
It is said that up to 90 percent of the time spent on most data projects is for preparing the data for analysis. Anecdotal evidence from this author and those I speak with holds this to be true. While you will learn a number of techniques for cleaning and standardizing data, also known as preprocessing in the data world, the UK Road Safety Data dataset is an analysis-ready dataset. In addition, it provides a large amount of data—millions of rows—for us to work with.
This dataset contains detailed road safety data about the circumstances of personal injury road accidents in GB from 1979, the types (including Make and Model) of vehicles involved and the consequential casualties.