You're reading from Python Data Mining Quick Start Guide

Product typeBook

Published inApr 2019

Reading LevelBeginner

PublisherPackt

ISBN-139781789800265

Edition1st Edition

Languages

Python

Concepts

Data Mining

Author (1)

Nathan Greeneltch

Preface

This book introduces data mining with popular free Python libraries. It is written in a conversational style, aiming to be approachable while imparting intuition on the reader. Data mining is a broad field of analytical methods designed to uncover insights from your data that are not obvious or discoverable by conventional analysis techniques. The field of data mining is vast, so the topics in this quick start guide were chosen by their relevance to not only their field of origin, but also the adjacent applications of machine learning and artificial intelligence. After a procedural first half, focused on getting the reader comfortable with data collection, loading, and munging, the book will move to a completely conceptual discussion. The concepts are introduced from first principles intuition and broadly grouped as transformation, clustering, and prediction. Popular methods such as principal component analysis, k-means clustering, support vector machines, and random forest are all covered in the conceptual second half of the book. The book ends with a discussion on pipe-lining and deploying your analytical models.

Who this book is for

This book is targeted at individuals who are new to the field or data mining and analytics with Python. Very little background is assumed in Python programming or math above the high-school level. All of the Python libraries used in the book are freely available at no cost on a variety of platforms, so anyone with access to the internet should be able to learn and practice the concepts introduced.

What this book covers

The first three and a half chapters of the book are focused on the procedural nuts and bolts of a data mining project. This includes creating a data mining Python environment, loading data from a variety of sources, and munging the data for downstream analysis. The remaining content in the book is mostly conceptual, and delivered in a conversational style very close to how I would train a new hire at my company.

Chapter 1, Data Mining and Getting Started with Python Tools, covers the topic of getting started with your software environment. It also covers how to download and install high-speed Python and popular libraries such as pandas, scikit-learn, and seaborn. After reading this chapter and setting up your environment, you should be ready to follow along with the demonstrations throughout the rest of the book.

Chapter 2, Basic Terminology and our End-to-End Example, covers the basic statistics and data terminology that are required for working in data mining. The final portion of the chapter is dedicated to a full working example, which combined the types of techniques that will be introduced later on in this book. You will also have a better understanding of the thought processes behind analysis and the common steps taken to address a problem statement that you may encounter in the field.

Chapter 3, Collecting, Exploring, and Visualizing Data, covers the basics of loading data from databases, disks, and web sources. It also covers the basic SQL queries, and pandas' access and search functions. The last sections of the chapter introduce the common types of plots using Seaborn.

Chapter 4, Cleaning and Readying Data for Analysis, covers the basics of data cleanup and dimensionality reduction. After reading it, you will understand how to work with missing values, rescale input data, and handle categorical variables. You will also understand the troubles of high-dimensional data, and how to combat this with feature reduction techniques including filter, wrapper, and transformation methods.

Chapter 5, Grouping and Clustering Data, introduces the background and thought processes that goes into designing a clustering algorithm for data mining work. It then introduces common clustering methods in the field and carries out a comparison between all of them with toy datasets. After reading this chapter, you will know the difference between algorithms that cluster based on means separation, density, and connectivity. You will also be able to look at a plot of incoming data and have some intuition on whether clustering will fit your mining project.

Chapter 6, Prediction with Regression and Classification, covers the basics behind using a computer to learn prediction models by introducing the loss function and gradient descent. It then introduces the concepts of overfitting, underfitting, and the penalty approach to regularize your model during fits. It also covers common regression and classification techniques, and the regularized versions of each of these where appropriate. The chapter finishes with a discussion of best practices for model tuning, including cross-validation and grid search.

Chapter 7, Advanced Topics – Building a Data Processing Pipeline and Deploying, This chapter covers a strategy for pipe-lining and deploying using built-in Scikit-learn methods. It also introduces the pickle module for model persistence and storage, as well as discussing Python-specific concerns at deployment time.

To get the most out of this book

You should have basic understanding of the mathematical principles taught in American primary and high schools. The most complex math required is the understanding of the contents of a matrix and the relation implied by the sigma (sum) symbol. You should have some rudimentary knowledge of Python, including lists, dictionaries, and functions. If you feel deficient in any of these prerequisites, a quick internet search to brush up on the concepts prior to reading should get you ready quickly.

This book is meant as a beginner's text, so the most important prerequisite is an open mind and the drive to learn.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Mining-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this
book. You can download it here: https://www.packtpub.com/sites/default/files/
downloads/9781789800265_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book:

A block of code is set as follows, with # used for comment lines:

from sklearn.cluster import Method
clus = Method(args*)
# fit to input data
clus.fit(X_input)
# get cluster assignments of X_input
X_assigned = clus.labels_

Any command-line input or output is written as follows:

(base) $ spyder

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

The rest of the chapter is locked

You have been reading a chapter from

Python Data Mining Quick Start Guide

Published in: Apr 2019Publisher: PacktISBN-13: 9781789800265

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Nathan Greeneltch

Nathan Greeneltch, PhD is a ML engineer at Intel Corp and resident data mining and analytics expert in the AI consulting group. Hes worked with Python analytics in both the start-up realm and the large-scale manufacturing sector over the course of the last decade. Nathan regularly mentors new hires and engineers fresh to the field of analytics, with impromptu chalk talks and division-wide knowledge-sharing sessions at Intel. In his past life, he was a physical chemist studying surface enhancement of the vibration signals of small molecules; a topic on which he wrote a doctoral thesis while at Northwestern University in Evanston, IL. Nathan hails from the southeastern United States, with family in equal parts from Arkansas and Florida
Read more about Nathan Greeneltch

Other recommended products

Related to this chapter

Applied Unsupervised Learning with R

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and all features of R that enable you to understand your data better and get answers to all your business questions.

BookMar 2019320 pages

Feature Engineering Made Easy

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective.

BookJan 2018316 pages

scikit-learn Cookbook

scikit-learn has evolved as a robust library for machine learning applications in python with support for a wide range of supervised and unsupervised learning algorithms. This edition brings to you the various enhancements to its model implementations, API and bug fixes in the latest major release of scikit-learn to support Python. This book covers easy to follow recipes right from mathematical operations to implementing various supervised, unsupervised and deep learning algorithms with scikit-learn. Get practical hands-on knowledge to implement various models and algorithms like Multi-Layer Perceptrons, time-series split, MAE criterion for regression, criteria for gradient boosting, Classifier, Regressor, and much more.

BookNov 2017374 pages

Machine Learning with scikit-learn Quick Start Guide

Scikit-learn is a robust machine learning library for the Python programming language. It provides a set of supervised and unsupervised learning algorithms. This book is the easiest way to learn how to deploy, optimize and evaluate all the important machine learning algorithms that scikit-learn provides.

BookOct 2018172 pages

Data Analysis with IBM SPSS Statistics

SPSS Statistics is a software package used for logical batched and non-batched statistical analysis. Analytical tools such as SPSS can readily provide even a novice user with an overwhelming amount of information and a broad range of options for analyzing patterns in the data. This book will have a comprehensive coverage of IBM’s premier statistics and data analysis tool – IBM SPSS Statistics. It is designed for business professionals who wish to analyze their data. By the end of this book, you will have a firm understanding of the various statistical analysis techniques offered by SPSS Statistics, and be able to master its use for data analysis with ease.

BookSep 2017446 pages

Ensemble Machine Learning Cookbook

This book uses a recipe-based approach to showcase the power of machine learning algorithms to build ensemble models using Python libraries. Through this book, you will be able to pick up the code, understand in depth how it works, execute and implement it efficiently. This will be a desk reference to implement a wide range of tasks and solve the common and uncommon problems in ensemble machine learning domain.

BookJan 2019336 pages

Mastering Machine Learning with scikit-learn

This book examines machine learning models including k-nearest neighbors, logistic regression, naive Bayes, random forests, and support vector machines. You will work through document classification, image recognition, and other example problems.

BookJul 2017254 pages

Training Systems using Python Statistical Modeling

This book will acquaint you with various aspects of statistical analysis in Python. You will work with different types of prediction models, such as decision trees, random forests and neural networks. By the end of this book, you will be confident in using various Python packages to train your own models for effective machine learning.

BookMay 2019290 pages

Python Data Analysis

This book takes a practical approach to Python data analysis, showing you how to use Python libraries such as pandas, NumPy, SciPy, and scikit-learn to analyze a variety of data. You’ll also get up to speed with everything from data manipulation to visualization systematically.

BookFeb 2021478 pages5

Python Machine Learning

This second edition of Python Machine Learning by Sebastian Raschka is for developers and data scientists looking for a practical approach to machine learning and deep learning. In this updated edition, you’ll explore the machine learning process using Python and the latest open source technologies, including scikit-learn and TensorFlow 1.x.

BookSep 2017622 pages

Data Science for Marketing Analytics

This book on marketing analytics with Python will quickly get you up and running using practical data science and machine learning to improve your approach to marketing. You'll learn how to analyze sales, understand customer data, predict outcomes, and present conclusions with clear visualizations.

BookSep 2021636 pages

Python Machine Learning

This third edition is updated with TensorFlow 2 and the latest additions to scikit-learn. Packed with clear explanations, visualizations, and working examples, the book covers essential machine learning techniques in depth, along with two cutting-edge machine learning techniques: reinforcement learning and generative adversarial networks.

BookDec 2019772 pages5

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages