Chapter 2: Data Cleaning and Advanced Machine Learning


Activity 2: Preparing to Train a Predictive Model for the Employee-Retention Problem

  1. Scroll to the Activity A section of the lesson-2-workbook.ipynb notebook file.

  2. Check the head of the table by running the following code:

    %%bash
    head ../data/hr-analytics/hr_data.csv

    Judging by the output, convince yourself that it looks to be in standard CSV format. For CSV files, we should be able to simply load the data with pd.read_csv.

  3. Load the data with Pandas by running df = pd.read_csv('../data/hr-analytics/hr_data.csv'). Write it out yourself and use tab completion to help type the file path.

  4. Inspect the columns by printing df.columns and make sure the data has loaded as expected by printing the DataFrame head and tail with df.head() and df.tail():

    Figure 2.45: Output for inspecting head and tail of columns

    We can see that it appears to have loaded correctly. Based on the tail index values, there are nearly 15,000 rows; let's make sure we didn't miss any.

  5. Check the number of rows (including the header) in the CSV file with the following code:

    with open('../data/hr-analytics/hr_data.csv') as f:
        print(len(f.read().splitlines()))

    Figure 2.46: Output after checking for number of rows

  6. Compare this result to len(df) to make sure you've loaded all the data:

    Figure 2.47: Output after checking the number of samples loaded
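    As a quick check, this comparison can also be written in a single cell, for example (using the same file path as above):

    # Count the lines in the CSV, then subtract one for the header row
    with open('../data/hr-analytics/hr_data.csv') as f:
        n_lines = len(f.read().splitlines())
    print(len(df) == n_lines - 1)  # True when every data row has been loaded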

    Now that our client's data has been properly loaded, let's think about how we can use predictive analytics to find insights into why their employees are leaving.

    Let's run through the first steps for creating a predictive analytics plan:

    Look at the available data: You've already done this by looking at the columns, datatypes, and the number of samples.

    Determine the business needs: The client has clearly expressed their needs: reduce the number of employees who leave.

    Assess the data for suitability: Let's try to determine a plan that can help satisfy the client's needs, given the provided data.

    Recall, as mentioned earlier, that effective analytics techniques lead to impactful business decisions. With that in mind, if we were able to predict how likely an employee is to quit, the business could selectively target those employees for special treatment. For example, their salary could be raised or their number of projects reduced. Furthermore, the impact of these changes could be estimated using the model!

    To assess the validity of this plan, let's think about our data. Each row represents an employee who either works for the company or has left, as labeled by the column named left. We can therefore train a model to predict this target, given a set of features.
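    As a minimal sketch of that framing (illustrative only; the actual model training happens in the next activity, after preprocessing), the target and feature matrix could be separated like so:

    # Illustration of the supervised-learning setup, assuming preprocessing is complete
    y = df['left']                   # binary target: did the employee leave?
    X = df.drop(columns=['left'])    # everything else becomes the feature matrix
    # A classifier fit on (X, y) could then estimate how likely each employee is to quit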

  7. Assess the target variable. Check the distribution and number of missing entries by running the following code:

    df.left.value_counts().plot('barh')
    print(df.left.isnull().sum())

    Figure 2.48: Distribution of the target variables

    Here's the output of the second code line:

    Figure 2.49: Output to check missing data points

    About three-quarters of the samples are employees who have not left; the group that has left makes up the remaining quarter. This tells us we are dealing with an imbalanced classification problem, which means we'll have to take special measures to account for each class when calculating accuracies. We also see that none of the target variables are missing (no NaN values).
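    As an aside (not part of this activity's code), one common way to account for class imbalance later on is to weight each class inversely to its frequency when fitting a model; for example, many scikit-learn classifiers accept a class_weight parameter:

    # Illustrative only: re-weight samples by inverse class frequency during training
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(class_weight='balanced')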

    Now, we'll assess the features:

  8. Print the datatype of each by executing df.dtypes. Observe how we have a mix of continuous and discrete features:

    Figure 2.50: Printing data types for verification

  9. Display the feature distributions by running the following code:

    for f in df.columns:
        try:
            fig = plt.figure()
            …
        …
        print('-'*30)

    Note

    For the complete code, refer to the following: https://bit.ly/2D3iKL2.

    This code snippet is a little complicated, but it's very useful for showing an overview of both the continuous and discrete features in our dataset. Essentially, it assumes each feature is continuous and attempts to plot its distribution, and reverts to simply plotting the value counts if the feature turns out to be discrete.
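    Since the full snippet is only linked above, here is a rough sketch of the pattern it follows (an approximation, not the book's exact code):

    import matplotlib.pyplot as plt

    for f in df.columns:
        try:
            fig = plt.figure()
            df[f].hist()                          # assumes the feature is continuous
        except Exception:
            df[f].value_counts().plot('barh')     # fall back to value counts for discrete features
        plt.title(f)
        plt.show()
        print('-' * 30)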

    The result is as follows:

    Figure 2.51: Distribution of all features: satisfaction_level and last_evaluation

    Figure 2.52: Distribution of all remaining features

    Figure 2.53: Distribution for the variable promotion_last_5years

    For many features, we see a wide distribution over the possible values, indicating a good variety in the feature spaces. This is encouraging; features that are strongly grouped around a small range of values may not be very informative for the model. This is the case for promotion_last_5years, where we see that the vast majority of samples are 0.
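    To quantify this, one quick optional check is to look at the normalized value counts of such a feature:

    # Share of samples in each category; a feature dominated by a single value carries little signal
    print(df.promotion_last_5years.value_counts(normalize=True))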

    The next thing we need to do is remove any NaN values from the dataset.

  10. Check how many NaN values are in each column by running the following code:

    df.isnull().sum() / len(df) * 100

    Figure 2.54: Verification for the number of NaN values

    We can see there are about 2.5% missing for average_montly_hours, 1% missing for time_spend_company, and 98% missing for is_smoker! Let's use a couple of different strategies that you've learned to handle these.

  11. Drop the is_smoker column as there is barely any information in this metric. Do this by running: del df['is_smoker'].

  12. Fill the NaN values in the time_spend_company column. This can be done with the following code:

    fill_value = df.time_spend_company.median()
    df.time_spend_company = df.time_spend_company.fillna(fill_value)

    The final column to deal with is average_montly_hours. We could do something similar and use the median or rounded mean as the integer fill value. Instead though, let's try to take advantage of its relationship with another variable. This may allow us to fill the missing data more accurately.
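    For reference, that simpler alternative would look something like the following (shown only for comparison; we won't apply it in this activity):

    # Simple alternative (not used here): one fill value for the whole column
    simple_fill = round(df.average_montly_hours.mean())
    # df.average_montly_hours = df.average_montly_hours.fillna(simple_fill)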

  13. Make a boxplot of average_montly_hours segmented by number_project. This can be done by running the following code:

    sns.boxplot(x='number_project', y='average_montly_hours', data=df)

    Figure 2.55: Boxplot for average_monthly_hours and number_project

    We can see how the number of projects is correlated with average_montly_hours, a result that is hardly surprising. We'll exploit this relationship by filling in the NaN values of average_montly_hours differently, depending on the number of projects for that sample.

    Specifically, we'll use the mean of each group.

  14. Calculate the mean of each group by running the following code:

    mean_per_project = df.groupby('number_project')\
        .average_montly_hours.mean()
    mean_per_project = dict(mean_per_project)
    print(mean_per_project)

    Figure 2.56: Calculation of mean values for average_monthly_hours

    We can then map this onto the number_project column and pass the resulting series object as the argument to fillna.

  15. Fill the NaN values in average_montly_hours by executing the following code:

    fill_values = df.number_project.map(mean_per_project)
    df.average_montly_hours = df.average_montly_hours.fillna(fill_values)
  16. Confirm that df has no more NaN values by running the following assertion test. If it does not raise an error, then you have successfully removed the NaNs from the table:

    assert df.isnull().sum().sum() == 0

    Note

    When we save the processed data at the end of this activity, we pass index=False so that the index is not written to file. In this case, the index is just a set of integers spanning from 0 to the DataFrame length, and it therefore tells us nothing important.

  17. Transform the string and Boolean fields into integer representations. In particular, we'll manually convert the target variable left from yes and no to 1 and 0 and build the one-hot encoded features. Do this by running the following code:

    df.left = df.left.map({'no': 0, 'yes': 1})
    df = pd.get_dummies(df)
  18. Print df.columns to show the fields:

    Figure 2.57: A screenshot of the different fields in the dataframe

    We can see that department and salary have been split into various binary features.

    The final step to prepare our data for machine learning is scaling the features, but for various reasons (for example, some models do not require scaling), we'll do it as part of the model-training workflow in the next activity.
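    For context, a typical scaling step (deferred here, and only an illustration of one option) might use scikit-learn's StandardScaler on the feature matrix:

    # Illustrative only; X is a hypothetical feature matrix (all columns except 'left')
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    # X_scaled = scaler.fit_transform(X)   # standardizes each feature to zero mean and unit variance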

  19. We have completed the data preprocessing and are ready to move on to training models! Let's save our preprocessed data by running the following code:

    df.to_csv('../data/hr-analytics/hr_data_processed.csv', index=False)