Learning Predictive Analytics with Python: Gain practical insights into predictive modelling by implementing Predictive Analytics algorithms on public datasets with Python

Chapter 2. Data Cleaning

Without further ado, let's kick-start the engine and begin our foray into the world of predictive analytics. Remember, however, that our fuel is data: to do any predictive analysis, one needs to access and import data for the engine to rev up.

I assume that you have already installed Python and the required packages, along with an IDE of your choice. Predictive analytics, like any other art, is best learnt hands-on and practiced as frequently as possible. The book will be of most use if you open a Python IDE and practice the explained concepts on your own. So, if you haven't installed Python and its packages yet, now is the time. If not all the packages, at least pandas should be installed, as it is the mainstay of what we will learn in this chapter.

After reading this chapter, you should be familiar with the following topics:

  • Handling various kinds of data-importing scenarios, that is, importing...

Reading the data – variations and examples

Before we delve deeper into the realm of data, let us familiarize ourselves with a few terms that will appear frequently from now on.

Data frames

A data frame is one of the most common data structures used for data analysis in Python. Data frames are very similar to the tables in a spreadsheet or a SQL table. In Python vocabulary, a data frame can also be thought of as a dictionary of series objects (in terms of structure). A data frame, like a spreadsheet, has index labels (analogous to rows) and column labels (analogous to columns). It is the most commonly used pandas object and is a 2D structure whose columns can be of the same or different types. Most of the standard operations, such as aggregation, filtering, pivoting, and so on, that can be applied to a spreadsheet or a SQL table can be applied to data frames using methods in pandas.

The following screenshot is an illustrative picture of a data frame. We will learn more about working with them as we progress in the...
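As a minimal sketch of the "dictionary of series objects" idea, the snippet below builds a small data frame from two Series and applies a filter and an aggregation. The names and values are hypothetical toy data, not taken from the book's datasets.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
names = pd.Series(['Ann', 'Bob', 'Cid'])
ages = pd.Series([29, 41, 35])

# A data frame built as a dictionary of Series: the keys become column labels.
df = pd.DataFrame({'name': names, 'age': ages})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(list(df.index))    # index (row) labels

# Spreadsheet-style operations: filtering and aggregation.
over_30 = df[df['age'] > 30]
mean_age = df['age'].mean()
```

Each column of the resulting data frame is itself a Series, which is why a column can be pulled out with dictionary-style access such as df['age'].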

Various methods of importing data in Python

pandas is the Python library/package of choice to import, wrangle, and manipulate datasets. Datasets come in various forms, the most frequent being the .csv format. The delimiter (a special character that separates the values in a dataset) in a CSV file is a comma. Now we will look at the various ways in which you can read a dataset in Python.

Case 1 – reading a dataset using the read_csv method

Open an IPython Notebook by typing ipython notebook in the command line (in current installations, the equivalent command is jupyter notebook).

Download the Titanic dataset from the shared Google Drive folder (either the .xls or the .xlsx file will do). Save this file in CSV format and we are good to go. This is a very popular dataset that contains information about the passengers travelling on the famous ship Titanic on the fateful voyage on which it sank. If you wish to know more about this dataset, you can go to the Google Drive folder and look for it.

A common practice is to share a variable description file with...

The read_csv method

The name of the method doesn't reveal its full might. It is something of a misnomer, in the sense that it makes us think it can be used to read only CSV files, which is not the case. Various kinds of files, including .txt files with delimiters of various kinds, can be read using this method.

Let's learn a little bit more about the various arguments of this method in order to assess its true potential. Although the read_csv method has close to 30 arguments, the ones listed in the next section are the ones that are most commonly used.

The general form of a read_csv statement is something similar to:

pd.read_csv(filepath, sep=',', dtype=None, header=None, skiprows=None, index_col=None, skip_blank_lines=True, na_filter=True)

Now, let us understand the significance and usage of each of these arguments one by one:

  • filepath: filepath is the complete address of the dataset or file that you are trying to read. The complete address includes the address of the directory...
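To make the arguments in the general form more concrete, here is a minimal sketch that reads a semicolon-delimited text with one junk line at the top. The data, the column names, and the use of io.StringIO in place of a file on disk are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text with one junk line at the top.
raw = io.StringIO(
    "exported by some tool\n"
    "id;name;score\n"
    "1;Ann;3.5\n"
    "\n"
    "2;Bob;\n"
)

df = pd.read_csv(
    raw,
    sep=';',                # delimiter other than a comma
    skiprows=1,             # skip the junk line at the top
    header=0,               # first remaining line holds the column names
    index_col='id',         # use the id column as the row index
    dtype={'name': str},    # force a type for a specific column
    skip_blank_lines=True,  # default: blank lines are ignored
    na_filter=True,         # default: empty fields are read as NaN
)

print(df)
```

Passing a file path instead of the StringIO object works the same way; only the first argument changes.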

Use cases of the read_csv method

The read_csv method can be put to a variety of uses. Let us look at some such use cases.

Passing the directory address and filename as variables

Sometimes it is easier to pass the directory address and filename as variables rather than hard-coding them, especially when the full address of the file is used many times. Let us see how to do so while importing a dataset.

import pandas as pd
path = 'E:/Personal/Learning/Datasets/Book'
filename = 'titanic3.csv'
fullpath = path+'/'+filename
data = pd.read_csv(fullpath)

For such cases, alternatively, one can use the following snippet, which uses the join function from the os.path module:

import pandas as pd
import os
path = 'E:/Personal/Learning/Datasets/Book'
filename = 'titanic3.csv'
fullpath = os.path.join(path,filename)
data = pd.read_csv(fullpath)

One advantage of using the latter method is...

Case 2 – reading a dataset using the open method of Python

pandas is a robust and comprehensive library for reading, exploring, and manipulating a dataset. However, it might not give optimal performance with very big datasets, as by default it reads the entire dataset at once and occupies a large share of the computer's memory. Instead, you can try one of Python's file-handling methods: open. One can read the dataset line by line or in chunks by running a for loop over the rows, and delete the chunks from memory once they have been processed. Let us look at some use cases of the open method.

Reading a dataset line by line

As you might be aware, while reading a file using the open method we can specify a particular mode, that is, read, write, and so on. By default, the method opens a file in read mode. This can be useful while reading a big dataset, as the data is read line by line (not all at once, unlike what pandas does). You can read datasets...
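A minimal sketch of line-by-line reading with open might look as follows. The file is a tiny hypothetical CSV written to a temporary directory so that the snippet is self-contained, and the naive split on commas is an assumption that holds only when no field contains a quoted comma.

```python
import os
import tempfile

# Write a tiny hypothetical CSV so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'sample.csv')
with open(csv_path, 'w') as f:
    f.write("name,age\nAnn,29\nBob,41\n")

rows = []
with open(csv_path, 'r') as f:               # 'r' (read) is the default mode
    header = next(f).strip().split(',')      # the first line holds the column names
    for line in f:                           # only one line is held in memory at a time
        values = line.strip().split(',')     # naive split; no quoted-field handling
        rows.append(dict(zip(header, values)))

print(header)
print(len(rows))
```

For real files with quoted fields, the standard library's csv module is a safer way to split each line. pandas itself also accepts a chunksize argument in read_csv for reading a file in pieces.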


Basics – summary, dimensions, and structure

After reading in the data, there are certain tasks that need to be performed to get a feel for the data:

  • To check whether the data has been read in correctly
  • To determine how the data looks, that is, its shape and size
  • To summarize and visualize the data
  • To get the column names and summary statistics of numerical variables

Let us go back to the example of the Titanic dataset and import it again. The head() method is used to look at the first few rows of the data, as shown:

import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Datasets/Book/titanic3.csv')
data.head()

The result will look similar to the following screenshot:

Fig. 2.6: Thumbnail view of the Titanic dataset obtained using the head() method

In the head() method, one can also specify the number of rows they want to see. For example, head(10) will show the first 10 rows.

The next attribute of the dataset that concerns us is its dimension, that is the number of rows...
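The checks listed at the start of this section can be sketched as below. The small data frame is a hypothetical stand-in for the Titanic data (only the column names pclass, age, and sex are borrowed from it), so the numbers printed will differ from those in the book.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
data = pd.DataFrame({
    'pclass': [1, 3, 2, 3],
    'age': [29.0, None, 35.0, 4.0],
    'sex': ['female', 'male', 'male', 'female'],
})

print(data.head(2))                 # first two rows, to check the data was read correctly

n_rows, n_cols = data.shape         # dimensions: (rows, columns)
column_names = data.columns.values  # array of column names
summary = data.describe()           # summary statistics of the numerical columns

print(n_rows, n_cols)
print(column_names)
print(summary)
```

Note that describe() reports only the numerical columns by default, and its count row excludes missing values.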

Handling missing values


Checking for missing values and handling them properly is an important step in the data preparation process. If they are left untreated, they can:

  • Lead to the relationships between the variables being analyzed incorrectly

  • Lead to incorrect interpretation and inference from the data

To see how, go back to the explanation of the describe method and look at the output table: why are the counts for many of the variables different from each other? There are 1310 rows in the dataset, as we saw earlier in the section. Why is it, then, that the count is 1046 for age, 1309 for pclass, and 121 for body? This is because the dataset doesn't have a value for 264 (1310-1046) entries in the age column, 1 (1310-1309) entry in the pclass column, and 1189 (1310-121) entries in the body column. In other words, that many entries have missing values in their respective columns. If a column has a count less than the number of rows in the dataset, it is most certainly because the...
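The arithmetic above (number of rows minus the count of non-missing values) can be reproduced with a short sketch. The data frame is hypothetical, and dropna and fillna are shown only as two common treatments, not necessarily the ones the chapter goes on to recommend.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; only the column names echo the Titanic dataset.
data = pd.DataFrame({
    'age': [29.0, np.nan, 35.0, np.nan],
    'body': [np.nan, np.nan, 135.0, np.nan],
})

# Missing entries per column: total rows minus the non-missing count.
n_missing = len(data) - data.count()

# The same numbers, obtained directly by counting null entries.
same = data.isnull().sum()

# Two common treatments (assumptions, shown for illustration):
dropped = data.dropna(subset=['age'])            # drop rows where age is missing
filled = data['age'].fillna(data['age'].mean())  # replace missing ages with the mean

print(n_missing)
```

Which treatment is appropriate depends on how many values are missing and on why they are missing.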

Creating dummy variables


Creating dummy variables is a method of creating a separate variable for each category of a categorical variable. Although a categorical variable contains plenty of information and might show a causal relationship with the output variable, it can't be used in predictive models such as linear and logistic regression without some processing.

In our dataset, sex is a categorical variable with two categories, male and female. We can create two dummy variables out of this, as follows:

dummy_sex=pd.get_dummies(data['sex'],prefix='sex')

The result of this statement is as follows:

Fig. 2.17: Dummy variable for the sex variable in the Titanic dataset

This process is called dummifying. Dummifying the sex variable creates two new variables that take the value 1 or 0 depending on the sex of the passenger. If the sex was female, sex_female would be 1 and sex_male would be 0. If the sex was male, sex_male would be 1 and sex_female would be 0. In general, all but one dummy variable...
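A minimal sketch of dummifying on a hypothetical three-row data frame is shown below. Recent pandas versions return boolean dummy columns by default, so dtype=int is passed here to obtain the 1/0 values described above. The drop_first option is shown as one common way to keep all but one dummy column, and is an assumption rather than the book's stated approach.

```python
import pandas as pd

# Hypothetical data; only the column name sex echoes the Titanic dataset.
data = pd.DataFrame({'sex': ['female', 'male', 'male'], 'age': [29, 41, 35]})

# One 0/1 column per category: sex_female and sex_male.
dummy_sex = pd.get_dummies(data['sex'], prefix='sex', dtype=int)

# Attach the dummy columns to the original data frame.
data_with_dummies = pd.concat([data, dummy_sex], axis=1)

# Keep all but one dummy column (the first category is dropped).
reduced = pd.get_dummies(data['sex'], prefix='sex', drop_first=True, dtype=int)

print(data_with_dummies)
```

With two categories, a single dummy column carries the same information as both, since one is always the complement of the other.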

Visualizing a dataset by basic plotting


Plots are a great way to visualize a dataset and gauge possible relationships between the columns of a dataset. Various kinds of plots can be drawn, for example, a scatter plot, histogram, box plot, and so on.

Let's import the Customer Churn Model dataset and try some basic plots:

import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt')

While plotting any kind of plot, it helps to keep these things in mind:

  • If you are using IPython Notebook, write %matplotlib inline in the input cell and run it before plotting to see the output plot inline (in the output cell).

  • To save a plot in your local directory as a file, you can use the savefig method. Let's go back to the example where we plotted four scatter plots in a 2x2 panel. The name of this image is specified at the beginning of the snippet, as a figure parameter of the plot. To save this image one can write the following code...
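As a rough sketch of saving a 2x2 panel of scatter plots with savefig, consider the following. The data, the column names a, b, and c, the output filename, and the use of the non-interactive Agg backend are all assumptions made so that the snippet runs on its own; they are not taken from the Customer Churn Model dataset.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with made-up column names.
data = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 5, 8],
    'c': [9, 7, 4, 1],
})

# The figure object is named here and reused later for saving.
figure, axes = plt.subplots(2, 2, figsize=(8, 8))
pairs = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'a')]
for ax, (x_col, y_col) in zip(axes.flat, pairs):
    ax.scatter(data[x_col], data[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)

# Save the whole panel as one image file.
out_path = os.path.join(tempfile.mkdtemp(), 'scatter_plots.png')
figure.savefig(out_path)
plt.close(figure)
```

The file format is inferred from the extension of the filename, so changing .png to .pdf would produce a PDF instead.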

Key benefits

  • A step-by-step guide to predictive modeling including lots of tips, tricks, and best practices
  • Get to grips with the basics of Predictive Analytics with Python
  • Learn how to use the popular predictive modeling algorithms such as Linear Regression, Decision Trees, Logistic Regression, and Clustering

Description

Social Media and the Internet of Things have resulted in an avalanche of data. Data is powerful but not in its raw form: it needs to be processed and modeled, and Python is one of the most robust tools out there to do so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Learning to predict who would win, lose, buy, lie, or die with Python is an indispensable skill set to have in this data age. This book is your guide to getting started with Predictive Analytics using Python. You will see how to process data and make predictive models from it. We balance both statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and numpy. You’ll start by getting an understanding of the basics of predictive modeling, then you will see how to cleanse your data of impurities and get it ready for predictive modeling. You will also learn more about the best predictive modeling algorithms such as Linear Regression, Decision Trees, and Logistic Regression. Finally, you will see the best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world.

Who is this book for?

If you wish to learn how to implement Predictive Analytics algorithms using Python libraries, then this is the book for you. If you are familiar with coding in Python (or some other programming/statistical/scripting language) but have never used or read about Predictive Analytics algorithms, this book will also help you. The book will be beneficial to and can be read by any Data Science enthusiast. Some familiarity with Python will be useful to get the most out of this book, but it is certainly not a prerequisite.

What you will learn

  • Understand the statistical and mathematical concepts behind Predictive Analytics algorithms and implement Predictive Analytics algorithms using Python libraries
  • Analyze the result parameters arising from the implementation of Predictive Analytics algorithms
  • Write Python modules/functions from scratch to execute segments or the whole of these algorithms
  • Recognize and mitigate various contingencies and issues related to the implementation of Predictive Analytics algorithms
  • Get to know various methods of importing, cleaning, sub-setting, merging, joining, concatenating, exploring, grouping, and plotting data with pandas and numpy
  • Create dummy datasets and simple mathematical simulations using the Python numpy and pandas libraries
  • Understand the best practices while handling datasets in Python and creating predictive models out of them

Product Details

Publication date: Feb 15, 2016
Length: 354 pages
Edition: 1st
Language: English
ISBN-13: 9781783983278

Table of Contents

11 Chapters
1. Getting Started with Predictive Modelling
2. Data Cleaning
3. Data Wrangling
4. Statistical Concepts for Predictive Modelling
5. Linear Regression with Python
6. Logistic Regression with Python
7. Clustering with Python
8. Trees and Random Forests with Python
9. Best Practices for Predictive Modelling
A. A List of Links
Index

Customer reviews

Rating distribution
3.4 out of 5 (11 Ratings)
5 star 36.4%
4 star 9.1%
3 star 27.3%
2 star 9.1%
1 star 18.2%

adnan baloch Mar 28, 2016
5 out of 5 stars
You don't have to be married to a physicist to appreciate the role of the team at CERN that confirmed the existence of the Higgs Boson. Who better to be a reviewer of this book than a member of that team? That fact itself should inspire confidence in the utility of this book. The author uses interesting analogies to explain the different aspects of predictive analytics and even goes so far as to present comparison tables, serving to drive home his points. The ease and power of the Python programming language is put to good use in explaining the process of data cleaning and wrangling. The better part of the first half of the book is dedicated to exploring the various aspects of these two critical processes with easy to follow examples and code. A whole chapter is devoted to laying out the statistical concepts that are integral to getting the most out of the remainder of the book. The latter part of the book details supervised and unsupervised predictive modelling algorithms, shows how to implement them in Python and furthermore, delves deep into the mathematics of these widely used algorithms so that readers become well equipped to tackle real world challenges of predictive analytics in ANY programming language of their choice. In my opinion, the author really succeeded in making the serious subject matter of this book sound cool and exciting.
Amazon Verified review
A. Zubarev Apr 18, 2016
5 out of 5 stars
In my view Learning Predictive Analytics with Python is one of the most successful publications on such a difficult to initially grasp subject as Machine Learning. Yes, despite the name of the book does not imply so, it is in fact a gentle submersion into the Machine Learning, a so highly praised Data Science topic. Luckily, learning it would be much easier with Learning Predictive Analytics with Python from such a talented author. It is the most exciting yet easy to follow, logical and at the same time entertaining material I ever read so far. Tasteful, relevant examples, based on free software and datasets anyone can obtain. And the book also has several gems, these are the coverage of the ID3 algorithm (based on my observation looks like totally omitted in the most modern literature, but undeservedly), building various regressions and testing your model. One small advice to the reader: get familiarized yourself with iPython, and perhaps read some theory on statistics, not really necessary, but if you are going to apply the newly acquired knowledge at work or study then it could be a great deal of steering you into the right direction.
Amazon Verified review
Julian Cook Mar 13, 2016
5 out of 5 stars
If you are familiar with Packt (the publisher), you will know that they tend to carpet bomb particular areas, with multiple overlapping titles. This makes it difficult to recommend just one title if anyone asks you, since different books have different strengths. The strength of this book is that the author really does explain how to use PANDAS (python data analysis library) and statistical analysis from the ground up. Most pandas users will be familiar with pd.read_csv, but he covered a lot of options that I had never really understood properly, because I chiefly learnt from examples that don't really give you the 'why' of things. You might say, why not read the original book by Wes McKinney? I would have to say that this is a much more interesting read and has better flow. The Wes McKinney book sometimes reads like documentation and you are not sure what to really focus on. The coverage of statistical learning is also good, for instance he does a nice explanation of logistic regression and the underlying methodology with just enough math to properly explain the distinction between linear regression and logistic regression. I think the book is thorough enough that you could actually use it as a coursebook for statistical learning w/python, which is high praise for a book with a fairly generic title.
Amazon Verified review
a reader Sep 26, 2020
5 out of 5 stars
This is a good book. I do not understand why there are bad reviews for it. I would like to thank the author for the good job! Well done! Unfortunately, the author deleted the datasets the book uses from the Google drive.
Amazon Verified review
Jeremie Oct 04, 2017
4 out of 5 stars
Book deserves three to four stars max. It is ok and interesting. It introduces a lot of concepts, but it is a shame it doesn't go a little more into detail, especially at the end of the book when talking about clustering and regression. It is one thing to talk about clustering, but there is nothing about what to do with it once it is done. There isn't much discussion about regression tree and random forest algorithms, which deserve more, for example what one can do to improve the algos if they don't work well or what other algos are available. Perhaps the book simply needs to advise on further reading.
Amazon Verified review