# 2. Absenteeism at Work

Overview

In this chapter, you will apply standard data analysis techniques, such as estimating conditional probabilities, applying Bayes' theorem, and comparing distributions with the Kolmogorov-Smirnov test. You will also implement data transformation techniques, such as the Box-Cox and Yeo-Johnson transformations, and apply them to a given dataset.

# Introduction

In the previous chapter, we looked at some of the main techniques that are used in data analysis. We saw how hypothesis testing can be used when analyzing data, we got a brief introduction to visualizations, and finally, we explored some concepts related to time series analysis. In this chapter, we will elaborate on some of the topics we've already looked at (such as plotting and hypothesis testing) while introducing new ones coming from probability theory and data transformations.

Nowadays, work relationships are becoming more and more trust-oriented, and conservative contracts (in which working time is strictly monitored) are being replaced with more agile ones in which employees themselves are responsible for accounting for their working time. This liberty may lead to unregulated absenteeism and may reflect poorly on an employee's candidature, even if absent hours can be accounted for with genuine reasons. This can significantly undermine healthy working relationships...

# Initial Data Analysis

As a rule of thumb, when starting the analysis of a new dataset, it is good practice to check the dimensionality of the data, the column types, possible missing values, and some generic statistics on the numerical columns. We can also print the first 5 to 10 entries to get a feeling for the data itself. We'll perform these steps in the following code snippets:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# import data from the GitHub page of the book
data = pd.read_csv('https://raw.githubusercontent.com'\
                   '/PacktWorkshops/The-Data-Analysis-Workshop'\
                   '/master/Chapter02/data/'\
                   ...
```
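The checks listed above (dimensionality, column types, missing values, summary statistics, and a first look at the rows) can be sketched on a small stand-in table. The `toy` DataFrame and its values are hypothetical and are not taken from the book's dataset; only the pandas calls themselves are standard.

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
toy = pd.DataFrame({
    "Reason for absence": [26, 0, 23, 7, 23],
    "Age": [33, 50, 38, 39, 33],
    "Absenteeism time in hours": [4, 0, 2, 4, 2],
})

# dimensionality of the data as (rows, columns)
print(toy.shape)

# type of each column
print(toy.dtypes)

# number of missing values per column
missing = toy.isnull().sum()
print(missing)

# generic statistics on the numerical columns, one row per column
summary = toy.describe().T
print(summary)

# first entries, to get a feeling for the data
print(toy.head())
```

Transposing the output of `describe()` is a readability choice: with many columns, one row per feature is usually easier to scan.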

# Initial Analysis of the Reason for Absence

Let's start with a simple analysis of the `Reason for absence` column. We will try to address questions such as, what is the most common reason for absence? Does being a drinker or smoker have some effect on the causes? Does the distance to work have some effect on the reasons? And so on. Starting with these types of questions is often important when performing data analysis, as this is a good way to build confidence in, and an understanding of, the data.

The first thing we are interested in is the overall distribution of the absence reasons in the data, that is, how many entries we have for each reason for absence in our dataset. We can easily address this question by using the `countplot()` function from the `seaborn` package:

```python
# get the number of entries for each reason for absence
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=preprocessed_data, x="Reason for absence")
ax.set_ylabel("Number of entries per...
```

# Analysis of Social Drinkers and Smokers

Let's begin with an analysis of the impact of being a drinker or smoker on employee absenteeism. As smoking and frequent drinking have a negative impact on health, we would expect certain diseases to be more frequent among smokers and drinkers than among the other employees. Note that in the absenteeism dataset, 56% of the registered employees are drinkers, while only 7% are smokers. We can produce a figure, similar to *Figure 2.6*, for the social drinkers and smokers with the following code:

```python
# plot reasons for absence against being a social drinker/smoker
plt.figure(figsize=(8, 6))
sns.countplot(data=preprocessed_data, x="Reason for absence", \
              hue="Social drinker", hue_order=["Yes", "No"])
plt.savefig('figs/absence_reasons_drinkers.png', \
            format...
```
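One way to make the comparison between the two groups quantitative is to estimate the conditional probability of each absence reason given the drinker status, and then invert it with Bayes' theorem. The following is a minimal sketch on hypothetical toy data (the values are invented, and the column names are assumed to match the dataset); it is not necessarily how the chapter computes these quantities.

```python
import pandas as pd

# Hypothetical toy data: 4 drinkers and 4 non-drinkers
toy = pd.DataFrame({
    "Social drinker": ["Yes"] * 4 + ["No"] * 4,
    "Reason for absence": [23, 23, 19, 28, 23, 28, 28, 28],
})

# P(reason | drinker status): normalizing by row makes each row sum to 1
cond = pd.crosstab(toy["Social drinker"],
                   toy["Reason for absence"],
                   normalize="index")
print(cond)

# Bayes' theorem:
# P(drinker | reason 23) = P(reason 23 | drinker) * P(drinker) / P(reason 23)
p_drinker = (toy["Social drinker"] == "Yes").mean()
p_reason = (toy["Reason for absence"] == 23).mean()
posterior = cond.loc["Yes", 23] * p_drinker / p_reason
print(posterior)
```

In this toy table, two of the three entries with reason 23 belong to drinkers, so the posterior computed through Bayes' theorem agrees with a direct count.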

# Body Mass Index

The **Body Mass Index** (**BMI**) is defined as a person's weight in kilograms, divided by the square of their height in meters:

$$\text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^{2}}$$
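As a small worked example of this definition, the sketch below computes the BMI from a weight and a height and maps it to one of the four categories that the text introduces next. The function names are hypothetical, and apart from the 18.5 threshold, the cut-offs of 25 and 30 are the commonly cited WHO values, assumed here rather than taken from the book's listing.

```python
def compute_bmi(weight_kg, height_m):
    """Return weight in kilograms divided by squared height in meters."""
    return weight_kg / height_m ** 2


def sketch_bmi_category(bmi):
    """Map a BMI value to a category (assumed cut-offs: 18.5, 25, 30)."""
    if bmi < 18.5:
        return "underweight"
    elif bmi < 25:
        return "healthy weight"
    elif bmi < 30:
        return "overweight"
    return "obese"


# Example: a person weighing 70 kg with a height of 1.75 m
bmi = compute_bmi(70, 1.75)
print(round(bmi, 2), sketch_bmi_category(bmi))
```

If the dataset stores height in centimeters rather than meters, the value has to be divided by 100 before applying the formula.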

BMI is a universal way to classify people as `underweight`, `healthy weight`, `overweight`, and `obese`, based on tissue mass (muscle, fat, and bone) and height. The following plot indicates the relationship between weight and height for the various categories:

According to the preceding plot, we can build the four categories (`underweight`, `healthy weight`, `overweight`, and `obese`) based on the BMI values:

```python
"""
define function for computing the BMI category, based on BMI value
"""
def get_bmi_category(bmi):
    if bmi < 18.5:
        category = "underweight"
    ...
```

# Age and Education Factors

Age and education may also influence employees' absenteeism. For instance, older employees might need more frequent medical treatment, while employees with higher education degrees, covering positions of higher responsibility, might be less prone to being absent.

First, let's investigate the correlation between age and absence hours. We will create a regression plot, in which we'll plot the `Age` column on the *x* axis and `Absenteeism time in hours` on the *y* axis. We'll also include Pearson's correlation coefficient and its p-value, where the null hypothesis is that the correlation coefficient between the two features is equal to zero:

```python
from scipy.stats import pearsonr

# compute Pearson's correlation coefficient and p-value
pearson_test = pearsonr(preprocessed_data["Age"], \
                        preprocessed_data["Absenteeism time in hours...
```
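To illustrate how the output of `pearsonr` is read, here is a self-contained sketch on synthetic data. The arrays `age` and `hours` are randomly generated stand-ins (an assumption for illustration only) and say nothing about the actual dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in data, generated independently of each other
rng = np.random.default_rng(0)
age = rng.integers(25, 60, size=200)
hours = rng.exponential(scale=5, size=200)

# r is the correlation coefficient; p_value tests the null hypothesis r = 0
r, p_value = pearsonr(age, hours)
print(f"independent data: r={r:.3f}, p-value={p_value:.3f}")

# Sanity check: an exact linear relationship gives a coefficient of 1
r_linear, p_linear = pearsonr(age, 2 * age + 1)
print(f"linear data: r={r_linear:.3f}")
```

A small p-value would lead us to reject the null hypothesis of zero correlation, while a large one means the data give no evidence of a linear relationship. Pearson's coefficient only captures linear dependence, so a value near zero does not rule out other kinds of association.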

# Transportation Costs and Distance to Work Factors

Two possible indicators for absenteeism may also be the distance between home and work (the `Distance from Residence to Work` column) and transportation costs (the `Transportation expense` column). Employees who have to travel longer, or whose costs for commuting to work are high, might be more prone to absenteeism.

In this section, we will investigate the relationship between these variables and the absence time in hours. Since we do not expect these factors to be indicative of health problems, we will not consider a possible relationship with the `Reason for absence` column.

First, let's start our analysis by plotting the previously mentioned columns (`Distance from Residence to Work` and `Transportation expense`) against the `Absenteeism time in hours` column:

```python
# plot transportation costs and distance to work against hours
plt.figure(figsize=(10, 6))
sns.jointplot(x="Distance from Residence to Work",...
```

# Temporal Factors

Factors such as the day of the week and the month may also be indicators for absenteeism. For instance, employees might prefer to have their medical examinations on Friday, when the workload is lower and the weekend is closer. In this section, we will analyze the impact of the `Day of the week` and `Month of absence` columns on the employees' absenteeism.

Let's begin with an analysis of the number of entries for each day of the week and each month:

```python
# count entries per day of the week and month
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Day of the week', \
                   order=["Monday", "Tuesday", \
                          ...
```
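The numbers behind such a count plot can also be obtained directly with pandas, together with the average absence time per day. This is a minimal sketch on hypothetical toy data; the values are invented, and the five-day `order` list is an assumption, since the original listing is truncated.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "Day of the week": ["Monday", "Friday", "Monday",
                        "Tuesday", "Friday", "Monday"],
    "Absenteeism time in hours": [8, 2, 4, 3, 4, 6],
})

# assumed weekday order, used to sort the results chronologically
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

# number of entries per day; days with no entries get a count of 0
counts = toy["Day of the week"].value_counts().reindex(order, fill_value=0)
print(counts)

# average absence time per day; days with no entries stay NaN
mean_hours = (toy.groupby("Day of the week")["Absenteeism time in hours"]
                 .mean()
                 .reindex(order))
print(mean_hours)
```

Looking at the mean alongside the count helps separate days with many short absences from days with few long ones.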

# Summary

In this chapter, we analyzed a dataset containing employees' absences and their relationship to additional health and socially related factors. We introduced various data analysis techniques, such as distribution plots, conditional probabilities, Bayes' theorem, data transformation techniques (such as Box-Cox and Yeo-Johnson), and the Kolmogorov-Smirnov test, and applied these to the dataset.

In the next chapter, we will be analyzing the marketing campaign dataset of a Portuguese bank and the impact it had on acquiring new customers.