Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Python Data Mining Quick Start Guide
Python Data Mining Quick Start Guide

Python Data Mining Quick Start Guide: A beginner's guide to extracting valuable insights from your data

By Nathan Greeneltch
₱1,346.99 ₱941.99
Book Apr 2019 188 pages 1st Edition
eBook
₱1,346.99 ₱941.99
Print
₱1,683.99
Subscription
Free Trial
eBook
₱1,346.99 ₱941.99
Print
₱1,683.99
Subscription
Free Trial

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : Apr 25, 2019
Length 188 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781789800265
Category :
Concepts :
Table of content icon View table of contents Preview book icon Preview Book

Python Data Mining Quick Start Guide

Data Mining and Getting Started with Python Tools

In a sense, data mining is a necessary and predictable response to the dawn of the information age. Indeed, every piece of the modern global economy relies more each year on information and an immense in-stream of data. The path from information pool to actionable insights has many steps. Data mining is typically defined as the pattern and/or trend discovery phase in the pipeline.

This book is a quick-start guide for data mining and will include utilitarian descriptions of the most important and widely used methods, including the mainstays among data professionals such as k-means clustering, random forest prediction, and principal component dimensionality reduction. Along the way, I will give you tips I've learned and introduce helpful scripting tools to make your life easier. Not only will I introduce the tools, but I will clearly describe what makes them so helpful and why you should take the time to learn them.

The first half of the book will cover the nuts and bolts of data collection and preparation. The second half will be more conceptual and will introduce the topics of transformation, clustering, and prediction. The conceptual discussions start in the middle of Chapter 4, Cleaning and Readying Data for Analysis, and are written solely as a conversation between myself and the reader. These conversations are ported mostly from the many adhoc training sessions I've done over the years on Intel office marker boards. The last chapter of the book will be on the deployment of these models. This topic is the natural next step for new practitioners and I will provide an introduction and references for when you think you are ready to take the next steps.

The following topics will be covered in this chapter:

  • Descriptive, predictive, and prescriptive analytics
  • What will and will not be covered in this book
  • Setting up Python environments for data mining
  • Installing the Anaconda distribution and Conda package manager
  • Launching the Spyder IDE
  • Launching a Jupyter Notebook
  • Installing a high performance Python distribution
  • Recommended libraries and how to install
Practitioners should be familiar with the previous data selection, preprocessing, and transformation steps as well as the subsequent pattern and trend evaluation. Knowledge of the full process and an understanding of the goals will orient your data mining efforts in space and keep you aligned with the overall goal.

Descriptive, predictive, and prescriptive analytics

Practitioners in the field of data analysis usually break down their work into three genres of analytics, given as follows:

  • Descriptive: Descriptive is the oldest field of analytics study and involves digging deep into the data to hunt down and extract previously unidentified trends, groupings, or other patterns. This was the predominant type of analytics done by the pioneering groups in the field of data mining, and for a number of years the two terms were considered more or less to mean the same thing. However, predictive analytics blossomed in the early 2000s along with the burgeoning field of machine learning, and the many of the techniques that came out of the data mining community proved useful for prediction.
  • Predictive: Predictive analytics, as the name suggests, focuses on predicting future outcomes and relies on the assumption that past descriptions necessarily lead to future behavior. This concept demonstrates the strong and unavoidable connection between descriptive and predictive analytics. In recent years, industry has naturally taken the next logical step of using prediction to feed into prescriptive solutions.
  • Prescriptive: Prescriptive analytics relies heavily on customer goals, seeks personalized scoring systems for predictions, and is still a relatively immature field of study and practice. This is accomplished by modeling various response strategies and scoring against the personalized score system.

Please see the following table for a summary:

Type of analytics

Problem statement addressed
Descriptive What happened?
Predictive What will happen next?
Prescriptive How should we respond?

What will and will not be covered in this book

A quick and dirty description of data mining I hear in the field can be paraphrased as: "Descriptive and predictive analytics with a focus on previously hidden relationships or trends". As such, this book will cover these topics and skip the predictive analytics that focus on automation of obvious prediction, along with the entire field of prescriptive analytics entirely. This text is meant to be a quick start guide, so even the relevant fields of study will only be skimmed over and summarized. Please see the Recommended reading for further explanation section for inquiring minds that want to delve deeper into some of the subjects covered in this book.

Preprocessing and data transformation are typically considered to be outside of the data mining category. One of the goals of this book is to provide full working data mining examples, and basic preprocessing is required to do this right. So, this book will cover those topics, before delving in to the more traditional mining strategies.

Throughout this book, I will throw in tips I've learned along my career journey around how to apply data mining to solve real-world problems. I will denote them in a special tip box like this one.

Recommended readings for further explanation

These books are good for more in-depth discussions and as an introduction to important and relevant topics. I recommend that you start with these if you want to become an expert:

  • Data mining in practice:

Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition by Ian H. Witten (author), Eibe Frank (author), Mark A. Hall (author), Christopher J. Pal

  • Data mining advanced discussion and mathematical foundation:

Data Mining and Analysis: Fundamental Concepts and Algorithms, 1st Edition by Mohammed J. Zaki (author), Wagner Meira Jr (author)

  • Computer science taught with Python:

Python Programming: An Introduction to Computer Science, 3rd Edition by John Zelle (author)

  • Python machine learning and analytics:

Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition Paperback—September 20, 2017 by Sebastian Raschka (author), Vahid Mirjalili (author)

Advanced Machine Learning with Python Paperback—July 28, 2016 by John Hearty

Setting up Python environments for data mining

A computing setup conducive to advanced data mining requires a comfortable development environment and working libraries for data management, analytics, plotting, and deployment. The popular bundled Python distribution from Anaconda is a perfect fit for the job. It is targeted at scientists and engineers, and includes all the required packages to get started. Conda itself is a package manager for maintaining working Python environments and, of course, is included in the bundle. The package manager will allow you to install/remove combinations of libraries into segregated Python environments, all the while reconciling any version dependencies between the distinct libraries.

It includes an integrated development environment called The Scientific Python Development Environment (Spyder) and a ready-to-use implementation of Jupyter Notebook interface. Both of these development environments use the interactive Python console called IPython. IPython gives you a live console for scripting. You can run a single line of code, check results, then run another line of code in same console in an interactive fashion. A few trial-and-error sessions with IPython will demonstrate very clearly why these Python tools are so beloved by practitioners working in a rapid prototyping environment.

Installing the Anaconda distribution and Conda package manager

These tools from Anaconda are available on both Windows and Linux systems. See the following install instructions.

Installing on Linux

To install the distribution, follow these steps given as follows:

  1. First, download the latest installer build from https://www.anaconda.com/download/#linux.
  2. Then, in the Linux Terminal, pass this bash command:
$ bash Anaconda-latest-Linux-x86_64.sh
  1. Follow the prompts in the terminal and it will begin installing. Once done, you will be asked if you want to allow Conda to be auto-initialized with a .bashrc entry. I recommend choosing N and activating it manually when needed, just in case you decide to have multiple versions of Conda on your system. In this case, you can launch the Conda prompt by using the following command:
$ source /{anaconda3_dir}/bin/activate

This is will source the Conda activate shell script and call it to activate the base environment, which is the default Anaconda Python bundle. Adding new environments will be discussed in the following section on how to install specific libraries. At this point, passing the Python command will open an interactive shell where you can execute Python code line-by-line, as shown in the following code snippet:

(base) $ Python
Python 3.7.0 (default, Jun 28 2018, 13:15:42)
[GCC 7.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.random.random(10)
array([0.48489815, 0.80944492, 0.89740441, 0.93031125, 0.71774534,
0.63817451, 0.93231809, 0.75820457, 0.17550135, 0.62126858])

Alternatively, you can execute the code in a stored Python script by using the following command:

(base) $ Python script.py

Installing on Windows

To install on Windows, follow the steps given as follows:

  1. First, download the executable from https://conda.io/docs/user-guide/install/windows.html
  2. Then, launch the Anaconda prompt that can be found in a program search from the Windows Start menu

Anaconda prompt is a Windows command prompt with all the environment variables set to point to Anaconda. That's it; you are ready to use your base Python environment.

Installing on macOS

To install on macOS, follow the steps given as follows:

  1. First, download the graphical installer from the Anaconda distribution site https://www.anaconda.com/distribution/
  2. Launch the package and follow the on-screen prompts, which should set up everything you need automatically

Launching the Spyder IDE

Spyder can be started by passing spyder into the Anaconda prompt as follows:

(base) $ spyder

As mentioned earlier, Spyder uses the IPython interactive console. So, you can pass code line-by-line directly into the console. See the following screenshot for two lines of Python code passed one at a time:

Alternatively, you can edit a script in the editor and execute by pressing the green play button at the top of the IDE. This causes the script to be dumped into the IPython console, then run line-by-line:

The interactive IPython console also can display images and plots inline in the same console window. This is yet another feature that is very convenient for interactive data mining and rapid prototyping of analytics models:

Launching a Jupyter Notebook

The Jupyter project spun out of the popular IPython Notebook work of the early 2000s. These notebooks provide a visual interface with sequential text and code cells. This allows you to add some text to describe a solution, then follow it with code examples. The Jupyter Notebook also use the IPython console (similar to Spyder), so you have an interactive code interpretor that can plot images inline. Launching the notebook from the Anaconda prompt is simple:

(base) $ jupyter notebook

The Jupyter project maintains a few basic notebooks. Let's look at a screenshot from one of them, as follows (it can be found at http://nbviewer.jupyter.org/github/temporaer/tutorial_ml_gkbionics):

The concept is self-explanatory if we look at a few examples. The following are recommendations for some relevant and helpful Jupyter Notebooks on data mining and analytics from around the web:

https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch01/ch01.ipynb

http://nbviewer.jupyter.org/github/amplab/datascience-sp14/blob/master/hw2/HW2.ipynb

https://github.com/TomAugspurger/PyDataSeattle/blob/master/notebooks/1.%20Basics.ipynb

Installing high-performance Python distribution

Intel Corp has built a bundle of Python libraries with accelerations for High-Performance Computing (HPC) on CPUs. The vast majority of the accelerations come with no code changes, because they are snuck in under the hood. All the concepts and libraries introduced in the rest of the book will run faster in the HPC Intel Python environment. Luckily, Intel has a Conda version of their distribution, so you can add it as a new Conda environment via the following few command lines in the Anaconda prompt:

(base) $ Conda create -n idp -c channel intelpython3_full Python=3
(base) $ Conda activate idp

Full disclosure: I work for Intel, so I won't focus too much on this HPC distribution. I will merely let the performance numbers speak for themselves. See the following graph for raw speedup numbers (optimized versus stock) when using unchanged Scikit-learn code on CPU:

Recommended libraries and how to install

Its easy to add or remove libraries from the Anaconda prompt. Once you have an the preferred environment activated, the simple Conda install command will search the Anaconda cloud repo for a matching package, and will begin download if it exists. Conda will warn if there are version dependencies with your other libraries. Always pay attention to these warnings, so that you know if any other library versions are affected. If, at any time, you need a reminder of what is in your environment, use the Conda list command to check package names and versions.

Let's look at some example commands:

  1. Create a new environment called my_env with Python version 3 using the following command:
(base) $ Conda create -n my_env Python=3
  1. Check whether my_env was created successfully by using the following command:
(base) $ Conda info --envs

You will see the following screen on the execution of the preceding command:

  1. Activate a new environment by using the following command:
(base) $ Conda activate my_env
  1. Install the numpy math library by using the following command:
(my_env) $ Conda install numpy
  1. Use Conda list as follows, to check whether a new library was installed or not and all other libraries and versions in my_env:
(my_env) $ Conda list 

You will see the following screen on the execution of the preceding command:

Recommended libraries

If you choose to manage a smaller environment than the full bundle from Anaconda, I recommend the following essential libraries for data mining. They will be used throughout this book:

  • numpy: The fundamental math library for Python. Brings with it the numpy array data structure.
  • scipy: Provides science and engineering routines built on the base of the numpy array. This library also has some good statistical functions.
  • pandas: Offers relational data tables for storing, labeling, viewing, and manipulating data. You will never look at an array of numbers in the same way for the rest of your career after you've gotten comfortable with pandas and its popular data structure, called a dataframe.
  • matplotlib: Python's core visualization library with line and scatter plots, bar and pie charts, histograms and spectrograms, and so on.
  • seaborn: As statistical visualization library. Built on top of matplotlib and much easier to use. You can build complicated visual representations with, in many cases, a single line of code. This library takes pandas dataframes as input.
  • statsmodels: Library focused on statistics functions and statistical testing. For example, it has a .summary() function that returns helpful summary stats and information about a model you've applied.
  • scikit-learn: Python's workhorse machine learning library. It is easy to use and is maintained by an army of developers. The best part is the documentation on http:\\scikit-learn.org . It is so extensive that one could learn the field of machine learning just by reading though the entirety of it.
Editorial: Python has become ubiquitous in the fields of advanced data analysis in the last decade. This is partially due to the scripting nature of the language and approachability to non-programmers, but that is not the whole story. The pandas, scikit-learn, and seaborn libraries are essential to Python's growth in this domain. The power, ease-of-use, well-defined targeted scope, and open source nature of these three libraries are unmatched among free or paid packages. I recommend you learn them inside and out as you embark on a career in data mining.

Summary

This chapter introduced the contents of the book and covered getting started with your software environment. It also covered how to get high-speed Python and popular libraries such as pandas, scikit-learn, and seaborn. After reading this chapter and setting your environment, you should be ready to follow along with the demonstrations throughout the rest of the book.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Grasp the basics of data loading, cleaning, analysis, and visualization
  • Use the popular Python libraries such as NumPy, pandas, matplotlib, and scikit-learn for data mining
  • Your one-stop guide to build efficient data mining pipelines without going into too much theory

Description

Data mining is a necessary and predictable response to the dawn of the information age. It is typically defined as the pattern and/ or trend discovery phase in the data mining pipeline, and Python is a popular tool for performing these tasks as it offers a wide variety of tools for data mining. This book will serve as a quick introduction to the concept of data mining and putting it to practical use with the help of popular Python packages and libraries. You will get a hands-on demonstration of working with different real-world datasets and extracting useful insights from them using popular Python libraries such as NumPy, pandas, scikit-learn, and matplotlib. You will then learn the different stages of data mining such as data loading, cleaning, analysis, and visualization. You will also get a full conceptual description of popular data transformation, clustering, and classification techniques. By the end of this book, you will be able to build an efficient data mining pipeline using Python without any hassle.

What you will learn

Explore the methods for summarizing datasets and visualizing/plotting data Collect and format data for analytical work Assign data points into groups and visualize clustering patterns Learn how to predict continuous and categorical outputs for data Clean, filter noise from, and reduce the dimensions of data Serialize a data processing model using scikit-learn’s pipeline feature Deploy the data processing model using Python’s pickle module

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : Apr 25, 2019
Length 188 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781789800265
Category :
Concepts :

Table of Contents

9 Chapters
Preface Chevron down icon Chevron up icon
Data Mining and Getting Started with Python Tools Chevron down icon Chevron up icon
Basic Terminology and Our End-to-End Example Chevron down icon Chevron up icon
Collecting, Exploring, and Visualizing Data Chevron down icon Chevron up icon
Cleaning and Readying Data for Analysis Chevron down icon Chevron up icon
Grouping and Clustering Data Chevron down icon Chevron up icon
Prediction with Regression and Classification Chevron down icon Chevron up icon
Advanced Topics - Building a Data Processing Pipeline and Deploying It Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Filter icon Filter
Top Reviews
Rating distribution
Empty star icon Empty star icon Empty star icon Empty star icon Empty star icon 0
(0 Ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 0%

Filter reviews by


No reviews found
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.