Chapter 2: Data Cleaning and Advanced Machine Learning


Activity 2: Preparing to Train a Predictive Model for the Employee-Retention Problem

  1. Scroll to the Activity A section of the lesson-2-workbook.ipynb notebook file.

  2. Check the head of the table by running the following code:

    %%bash
    head ../data/hr-analytics/hr_data.csv

    Judging by the output, convince yourself that it looks to be in standard CSV format. For CSV files, we should be able to simply load the data with pd.read_csv.

  3. Load the data with Pandas by running df = pd.read_csv('../data/hr-analytics/hr_data.csv'). Write it out yourself and use tab completion to help type the file path.

  4. Inspect the columns by printing df.columns and make sure the data has loaded as expected by printing the DataFrame head and tail with df.head() and df.tail():

    Figure 2.45: Output for inspecting head and tail of columns

    We can see that it appears to have loaded correctly. Based on the tail index values, there are nearly 15,000 rows; let's make sure we didn't miss any.

  5. Check the number of rows (including the header) in the CSV file with the following code:

    with open('../data/hr-analytics/hr_data.csv') as f:
        print(len(f.read().splitlines()))

    Figure 2.46: Output after checking for number of rows

  6. Compare this result to len(df) to make sure you've loaded all the data:

    Figure 2.47: Output after checking the number of samples loaded
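    As a quick check, this comparison can also be written in a single cell, for example (using the same file path as above):

    # Count the lines in the CSV, then subtract one for the header row
    with open('../data/hr-analytics/hr_data.csv') as f:
        n_lines = len(f.read().splitlines())
    print(len(df) == n_lines - 1)  # True when every data row has been loaded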

    Now that our client's data has been properly loaded, let's think about how we can use predictive analytics to find insights into why their employees are leaving.

    Let's run through the first steps for creating a predictive analytics plan:

    Look at the available data: You've already done this by looking at the columns, datatypes, and the number of samples.

    Determine the business needs: The client has clearly expressed their needs: reduce the number of employees who leave.

    Assess the data for suitability: Let's try to determine a plan that can help satisfy the client's needs, given the provided data.

    Recall, as mentioned earlier, that effective analytics techniques lead to impactful business decisions. With that in mind, if we were able to predict how likely an employee is to quit, the business could selectively target those employees for special treatment. For example, their salary could be raised or their number of projects reduced. Furthermore, the impact of these changes could be estimated using the model!

    To assess the validity of this plan, let's think about our data. Each row represents an employee who either works for the company or has left, as labeled by the column named left. We can therefore train a model to predict this target, given a set of features.
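    As a minimal sketch of that framing (illustrative only; the actual model training happens in the next activity, after preprocessing), the target and feature matrix could be separated like so:

    # Illustration of the supervised-learning setup, assuming preprocessing is complete
    y = df['left']                   # binary target: did the employee leave?
    X = df.drop(columns=['left'])    # everything else becomes the feature matrix
    # A classifier fit on (X, y) could then estimate how likely each employee is to quit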

  7. Assess the target variable. Check the distribution and number of missing entries by running the following code:

    df.left.value_counts().plot('barh')
    print(df.left.isnull().sum())

    Figure 2.48: Distribution of the target variables

    Here's the output of the second code line:

    Figure 2.49: Output to check missing data points

    About three-quarters of the samples are employees who have not left; the group that has left makes up the remaining quarter. This tells us we are dealing with an imbalanced classification problem, which means we'll have to take special measures to account for each class when calculating accuracies. We also see that none of the target variables are missing (no NaN values).
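    As an aside (not part of this activity's code), one common way to account for class imbalance later on is to weight each class inversely to its frequency when fitting a model; for example, many scikit-learn classifiers accept a class_weight parameter:

    # Illustrative only: re-weight samples by inverse class frequency during training
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(class_weight='balanced')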

    Now, we'll assess the features:

  8. Print the datatype of each by executing df.dtypes. Observe how we have a mix of continuous and discrete features:

    Figure 2.50: Printing data types for verification

  9. Display the feature distributions by running the following code:

    for f in df.columns:
        try:
            fig = plt.figure()
            …
        …
        print('-'*30)

    Note

    For the complete code, refer to the following: https://bit.ly/2D3iKL2.

    This code snippet is a little complicated, but it's very useful for showing an overview of both the continuous and discrete features in our dataset. Essentially, it assumes each feature is continuous and attempts to plot its distribution, and reverts to simply plotting the value counts if the feature turns out to be discrete.
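    Since the full snippet is only linked above, here is a rough sketch of the pattern it follows (an approximation, not the book's exact code):

    import matplotlib.pyplot as plt

    for f in df.columns:
        try:
            fig = plt.figure()
            df[f].hist()                          # assumes the feature is continuous
        except Exception:
            df[f].value_counts().plot('barh')     # fall back to value counts for discrete features
        plt.title(f)
        plt.show()
        print('-' * 30)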

    The result is as follows:

    Figure 2.51: Distribution of all features: satisfaction_level and last_evaluation

    Figure 2.52: Distribution of all remaining features

    Figure 2.53: Distribution for the variable promotion_last_5years

    For many features, we see a wide distribution over the possible values, indicating a good variety in the feature spaces. This is encouraging; features that are strongly grouped around a small range of values may not be very informative for the model. This is the case for promotion_last_5years, where we see that the vast majority of samples are 0.
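    To quantify this, one quick optional check is to look at the normalized value counts of such a feature:

    # Share of samples in each category; a feature dominated by a single value carries little signal
    print(df.promotion_last_5years.value_counts(normalize=True))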

    The next thing we need to do is remove any NaN values from the dataset.

  10. Check how many NaN values are in each column by running the following code:

    df.isnull().sum() / len(df) * 100

    Figure 2.54: Verification for the number of NaN values

    We can see there are about 2.5% missing for average_montly_hours, 1% missing for time_spend_company, and 98% missing for is_smoker! Let's use a couple of different strategies that you've learned to handle these.

  11. Drop the is_smoker column as there is barely any information in this metric. Do this by running: del df['is_smoker'].

  12. Fill the NaN values in the time_spend_company column. This can be done with the following code:

    fill_value = df.time_spend_company.median()
    df.time_spend_company = df.time_spend_company.fillna(fill_value)

    The final column to deal with is average_montly_hours. We could do something similar and use the median or rounded mean as the integer fill value. Instead though, let's try to take advantage of its relationship with another variable. This may allow us to fill the missing data more accurately.
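    For reference, that simpler alternative would look something like the following (shown only for comparison; we won't apply it in this activity):

    # Simple alternative (not used here): one fill value for the whole column
    simple_fill = round(df.average_montly_hours.mean())
    # df.average_montly_hours = df.average_montly_hours.fillna(simple_fill)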

  13. Make a boxplot of average_montly_hours segmented by number_project. This can be done by running the following code:

    sns.boxplot(x='number_project', y='average_montly_hours', data=df)

    Figure 2.55: Boxplot for average_monthly_hours and number_project

    We can see how the number of projects is correlated with average_montly_hours, a result that is hardly surprising. We'll exploit this relationship by filling in the NaN values of average_montly_hours differently, depending on the number of projects for that sample.

    Specifically, we'll use the mean of each group.

  14. Calculate the mean of each group by running the following code:

    mean_per_project = df.groupby('number_project')\
        .average_montly_hours.mean()
    mean_per_project = dict(mean_per_project)
    print(mean_per_project)

    Figure 2.56: Calculation of mean values for average_monthly_hours

    We can then map this onto the number_project column and pass the resulting series object as the argument to fillna.

  15. Fill the NaN values in average_montly_hours by executing the following code:

    fill_values = df.number_project.map(mean_per_project)
    df.average_montly_hours = df.average_montly_hours.fillna(fill_values)
  16. Confirm that df has no more NaN values by running the following assertion test. If it does not raise an error, then you have successfully removed the NaNs from the table:

    assert df.isnull().sum().sum() == 0

    Note

    When we save the processed data at the end of this activity, we pass index=False so that the index is not written to file. In this case, the index is just a set of integers spanning from 0 to the DataFrame length, and it therefore tells us nothing important.

  17. Transform the string and Boolean fields into integer representations. In particular, we'll manually convert the target variable left from yes and no to 1 and 0 and build the one-hot encoded features. Do this by running the following code:

    df.left = df.left.map({'no': 0, 'yes': 1})
    df = pd.get_dummies(df)
  18. Print df.columns to show the fields:

    Figure 2.57: A screenshot of the different fields in the dataframe

    We can see that department and salary have been split into various binary features.

    The final step to prepare our data for machine learning is scaling the features, but for various reasons (for example, some models do not require scaling), we'll do it as part of the model-training workflow in the next activity.
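    For context, a typical scaling step (deferred here, and only an illustration of one option) might use scikit-learn's StandardScaler on the feature matrix:

    # Illustrative only; X is a hypothetical feature matrix (all columns except 'left')
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    # X_scaled = scaler.fit_transform(X)   # standardizes each feature to zero mean and unit variance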

  19. We have completed the data preprocessing and are ready to move on to training models! Let's save our preprocessed data by running the following code:

    df.to_csv('../data/hr-analytics/hr_data_processed.csv', index=False)