You're reading from Python Data Mining Quick Start Guide

Product type: Book
Published in: Apr 2019
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789800265
Edition: 1st Edition
Author: Nathan Greeneltch

Nathan Greeneltch, PhD, is an ML engineer at Intel Corp and the resident data mining and analytics expert in the AI consulting group. He has worked with Python analytics in both the start-up realm and the large-scale manufacturing sector over the course of the last decade. Nathan regularly mentors new hires and engineers fresh to the field of analytics, with impromptu chalk talks and division-wide knowledge-sharing sessions at Intel. In his past life, he was a physical chemist studying surface enhancement of the vibration signals of small molecules, a topic on which he wrote a doctoral thesis while at Northwestern University in Evanston, IL. Nathan hails from the southeastern United States, with family in equal parts from Arkansas and Florida.

Preface

This book introduces data mining with popular free Python libraries. It is written in a conversational style, aiming to be approachable while imparting intuition to the reader. Data mining is a broad field of analytical methods designed to uncover insights from your data that are not obvious or discoverable by conventional analysis techniques. The field is vast, so the topics in this quick start guide were chosen for their relevance not only to their field of origin, but also to the adjacent applications of machine learning and artificial intelligence. After a procedural first half, focused on getting the reader comfortable with data collection, loading, and munging, the book moves to a conceptual discussion. The concepts are introduced from first-principles intuition and broadly grouped as transformation, clustering, and prediction. Popular methods such as principal component analysis, k-means clustering, support vector machines, and random forest are all covered in the conceptual second half of the book. The book ends with a discussion on pipelining and deploying your analytical models.

Who this book is for

This book is targeted at individuals who are new to the field of data mining and analytics with Python. Very little background is assumed in Python programming or math above the high-school level. All of the Python libraries used in the book are freely available at no cost on a variety of platforms, so anyone with access to the internet should be able to learn and practice the concepts introduced.

What this book covers

The first three and a half chapters of the book are focused on the procedural nuts and bolts of a data mining project. This includes creating a data mining Python environment, loading data from a variety of sources, and munging the data for downstream analysis. The remaining content in the book is mostly conceptual, and delivered in a conversational style very close to how I would train a new hire at my company.

Chapter 1, Data Mining and Getting Started with Python Tools, covers the topic of getting started with your software environment. It also covers how to download and install high-speed Python and popular libraries such as pandas, scikit-learn, and seaborn. After reading this chapter and setting up your environment, you should be ready to follow along with the demonstrations throughout the rest of the book.

Chapter 2, Basic Terminology and our End-to-End Example, covers the basic statistics and data terminology that are required for working in data mining. The final portion of the chapter is dedicated to a full working example, which combines the types of techniques that will be introduced later on in this book. You will also have a better understanding of the thought processes behind analysis and the common steps taken to address a problem statement that you may encounter in the field.

Chapter 3, Collecting, Exploring, and Visualizing Data, covers the basics of loading data from databases, disks, and web sources. It also covers basic SQL queries and the access and search functions in pandas. The last sections of the chapter introduce the common types of plots using seaborn.

Chapter 4, Cleaning and Readying Data for Analysis, covers the basics of data cleanup and dimensionality reduction. After reading it, you will understand how to work with missing values, rescale input data, and handle categorical variables. You will also understand the troubles of high-dimensional data, and how to combat this with feature reduction techniques including filter, wrapper, and transformation methods.

Chapter 5, Grouping and Clustering Data, introduces the background and thought processes that go into designing a clustering algorithm for data mining work. It then introduces common clustering methods in the field and compares them on toy datasets. After reading this chapter, you will know the difference between algorithms that cluster based on means separation, density, and connectivity. You will also be able to look at a plot of incoming data and have some intuition on whether clustering will fit your mining project.

Chapter 6, Prediction with Regression and Classification, covers the basics behind using a computer to learn prediction models by introducing the loss function and gradient descent. It then introduces the concepts of overfitting, underfitting, and the penalty approach to regularize your model during fits. It also covers common regression and classification techniques, and the regularized versions of each of these where appropriate. The chapter finishes with a discussion of best practices for model tuning, including cross-validation and grid search.

Chapter 7, Advanced Topics – Building a Data Processing Pipeline and Deploying, covers a strategy for pipelining and deploying using built-in scikit-learn methods. It also introduces the pickle module for model persistence and storage, and discusses Python-specific concerns at deployment time.

To get the most out of this book

You should have a basic understanding of the mathematical principles taught in American primary and high schools. The most complex math required is an understanding of the contents of a matrix and the relation implied by the sigma (sum) symbol. You should have some rudimentary knowledge of Python, including lists, dictionaries, and functions. If you feel deficient in any of these prerequisites, a quick internet search to brush up on the concepts prior to reading should get you ready quickly.
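
As a rough illustration of the level of math and Python assumed (this sketch is not taken from the book, and all values are made up), the sigma symbol simply means "add up these terms", and a matrix is a grid of numbers addressed by row and column:

```python
# Illustrative values only; none of these numbers come from the book.
x = [2, 4, 6, 8]

# Sigma notation: the sum of x_i for i = 1..n is just a running total.
total = 0
for x_i in x:
    total += x_i

# A matrix stored as a list of rows: 2 rows by 3 columns.
A = [
    [1, 2, 3],
    [4, 5, 6],
]

# A[i][j] is the entry in row i, column j (Python counts from 0).
entry = A[1][2]

# Summing each row applies the same sigma idea across a matrix.
row_sums = [sum(row) for row in A]

print(total, entry, row_sums)
```

If the loop, the nested list, and the indexing all read comfortably, you likely have the background the later chapters rely on.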

This book is meant as a beginner's text, so the most important prerequisite is an open mind and the drive to learn.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packt.com.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Mining-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

Conventions used

There are a number of text conventions used throughout this book:

A block of code is set as follows, with # used for comment lines:

# "Method" is a placeholder for a scikit-learn clustering class
from sklearn.cluster import Method
clus = Method(*args)
# fit to input data
clus.fit(X_input)
# get cluster assignments of X_input
X_assigned = clus.labels_
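
For a concrete instance of this template, here is a minimal sketch that swaps in scikit-learn's KMeans for the placeholder class. The choice of KMeans, the toy data, and the parameter values are assumptions made for illustration, not the book's own example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy input: two well-separated groups of 2-D points (made-up values)
X_input = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# KMeans stands in for the placeholder class; n_clusters=2 matches the toy data
clus = KMeans(n_clusters=2, n_init=10, random_state=0)
# fit to input data
clus.fit(X_input)
# get cluster assignments of X_input
X_assigned = clus.labels_

print(X_assigned)
```

The label numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.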

Any command-line input or output is written as follows:

(base) $ spyder 

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.
