Retaining employees using data analytics

Classification and regression trees will be applied to a simulated HR dataset.

The business case is formed around the question: Can we predict those employees who are likely to leave the organization?

To answer this question, we will:

  • Perform any initial data preparations.
  • Split the data into training and validation.
  • Develop an initial model.
  • Which are the most important variables.
  • Report on the accuracy of the model.
  • Interpret two (2) complete paths.
This is a simulated dataset that uses several measures, such as customer satisfaction ratings, employee evaluation, average number of projects and so forth, to predict which employees are at a risk to leave the company.

Imports

In [1]:
# install and load libraries

# for data import
library(readr)
library(readr)

# for data wrangling
library(tidyverse)
library(tidyr)

# for visualization 
library(gmodels)
library(ggplot2)
library(reshape2)
library(scales)
library(ggthemes)
library(ggthemes)
library(dplyr)

# for building classification and regression trees
library(rpart)
library(rpart.plot)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths


Attaching package: ‘scales’

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor

Explore

In [2]:
# load data into dataframe
HR_comma_sep <- read_csv("/home/jovyan/HR Analytics+CART/HR_comma_sep.csv")
Parsed with column specification:
cols(
  satisfaction_level = col_double(),
  last_evaluation = col_double(),
  number_project = col_integer(),
  average_montly_hours = col_integer(),
  time_spend_company = col_integer(),
  Work_accident = col_integer(),
  resigned = col_integer(),
  promotion_last_5years = col_integer(),
  department = col_character(),
  salary_grade = col_character()
)
In [3]:
# first rows of dataset
head(HR_comma_sep, 10)
satisfaction_levellast_evaluationnumber_projectaverage_montly_hourstime_spend_companyWork_accidentresignedpromotion_last_5yearsdepartmentsalary_grade
0.38 0.53 2 157 3 0 1 0 sales low
0.80 0.86 5 262 6 0 1 0 sales medium
0.11 0.88 7 272 4 0 1 0 sales medium
0.72 0.87 5 223 5 0 1 0 sales low
0.37 0.52 2 159 3 0 1 0 sales low
0.41 0.50 2 153 3 0 1 0 sales low
0.10 0.77 6 247 4 0 1 0 sales low
0.92 0.85 5 259 5 0 1 0 sales low
0.89 1.00 5 224 5 0 1 0 sales low
0.42 0.53 2 142 3 0 1 0 sales low
In [4]:
# visualizing continous variables

theme_set(theme_gray(base_size = 10))

d <- melt(HR_comma_sep)
ggplot(d,aes(x = value)) +  
    facet_wrap(~variable,scales = "free_x") + 
    geom_histogram(bins = 10, fill = "lightblue", colour = "darkblue", size = .1) +
  scale_y_continuous( name = "Number of employees" ) +
  scale_x_continuous( name = "Frequency" )
Using department, salary_grade as id variables
  • Satisfaction level: Most employees are highly satisfied.
  • Last Evaluation: Most employees are good performers with 75% of the data set being evaluated between 56 percent - 87 percent.
  • Number of Projects: Most employees do a reasonable number of projects.
  • Average Monthly Hours: Most employees spend, fairly, a higher number of hours at work.
  • Time Spent in Company: Fewer employees stay beyond 4 years.
In [5]:
# second glance at the binary continous variable: Work_accident 

wa <- ggplot(data=HR_comma_sep, aes(x=Work_accident, fill=as.factor(Work_accident))) + 
                                scale_x_continuous(breaks=0:1,
                                labels = c("No accident" , "Yes accident")) +
      geom_bar(width=0.5,
               (aes(y = (..count..)/sum(..count..))))+
      scale_y_continuous(labels = scales::percent) + 
      geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                   y= (..count..)/sum(..count..)), stat= "count", vjust =.5, hjust= 1.2, size=3, color='white') +
      labs(title = "Frequency of accidents at work",
               y = "Percentage of employees") +
      theme(aspect.ratio = .3) +
      coord_flip()  

wa
  • Most employees did not have accidents at work.
In [6]:
# second glance at the binary continous variable: resigned

re <- ggplot(data=HR_comma_sep, aes(x = resigned , fill=as.factor(resigned))) + 
                                scale_x_continuous(breaks=0:1,
                                labels = c("Stayed" , "Resigned")) +
      geom_bar(width=0.5,
               (aes(y = (..count..)/sum(..count..))))+
      scale_y_continuous(labels = scales::percent) + 
      geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                   y= (..count..)/sum(..count..)), stat= "count", vjust =.5, hjust= 1.2, size=3, color='white') + 
      labs(title = "Frequency of resignations",
               y = "Percentage of employees") +
      theme(aspect.ratio = .3) +
      coord_flip()  

re
  • Most employees stayed with the organization and did not leave.
In [7]:
# second glance at the binary continous variable: promotion in last 5 years

pro <- ggplot(data=HR_comma_sep, aes(x = promotion_last_5years , fill=as.factor(promotion_last_5years))) + 
                                scale_x_continuous(breaks=0:1,
                                labels = c("No promotion" , "Yes promotion")) +
      geom_bar(width=0.5,
               (aes(y = (..count..)/sum(..count..))))+
      scale_y_continuous(labels = scales::percent) + 
      geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                   y= (..count..)/sum(..count..)), stat= "count", vjust =.5, hjust= .7, size=2, color='black') +
      labs(title = "Frequency of promotions in last 5 years",
               y = "Percentage of employees") +
      theme(aspect.ratio = .3) +
      coord_flip()  

pro
  • Most people have not recieved a promotion in the last five years.
In [8]:
# visualizing categorical variables

# salary_grade
sg <- ggplot(data=HR_comma_sep, aes(x = salary_grade , fill=as.factor(salary_grade))) + 
      geom_bar(width=0.9,
               (aes(y = (..count..)/sum(..count..))))+
      scale_y_continuous(labels = scales::percent) + 
      geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                   y= (..count..)/sum(..count..)), stat= "count", vjust =.5, hjust= 1.2, size=3, color='white') +
      labs(title = "Salary grade of employees",
               y = "Percentage of employees") +
      theme(aspect.ratio = .4) +
      coord_flip()

sg
In [9]:
# department
dp <- ggplot(data=HR_comma_sep, aes(x = department , fill=as.factor(department))) + 
      geom_bar(width=0.5,
               (aes(y = (..count..)/sum(..count..))))+
      scale_y_continuous(labels = scales::percent) + 
      geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                   y= (..count..)/sum(..count..)), stat= "count", vjust = 0.5, hjust= 1.5, size=3, color='white') +
      labs(title = "Departments of the organization",
               y = "Percentage of employees") +
      theme(aspect.ratio = 1.2) +
      coord_flip()

dp
In [10]:
# Creating factors of categorical variables and binary, continous variable resigned

hr <- HR_comma_sep %>% 
  mutate(salary_grade = as.factor(salary_grade) ,
         department = as.factor(department),
         resigned = as.factor(resigned),
         random = runif(14999))
In [11]:
# rename factor levels of variable 'resigned'
levels(hr$resigned) <- c("stayed", "resigned")
In [12]:
print((summary(hr$resigned)))
  stayed resigned 
   11428     3571 
In [13]:
print((summary(hr$salary_grade)))
  high    low medium 
  1237   7316   6446 
In [14]:
print((summary(hr$department)))
 accounting          hr          IT  management   marketing product_mng 
        767         739        1227         630         858         902 
      RandD       sales     support   technical 
        787        4140        2229        2720 

Splitting the data into training and validation sets:

In [15]:
set.seed(123)
train <- hr %>% 
  filter(random < .7) %>% 
  select(-random)

val <- hr %>% 
  filter(random >= .7) %>% 
  select(-random)

Creating Regression Tree

Initial model

In [16]:
ct1 <- rpart(resigned ~ . , data = train, method = 'class')
In [17]:
# plotting the model

rpart.plot(ct1)
In [18]:
# complexity parameter table

ct1$cptable
CPnsplitrel errorxerrorxstd
10.25861386 0 1.0000000 1.0000000 0.017357673
20.18257426 1 0.7413861 0.7413861 0.015541498
30.07207921 3 0.3762376 0.3762376 0.011644422
40.05900990 5 0.2320792 0.2356436 0.009384182
50.03128713 6 0.1730693 0.1758416 0.008167650
60.01544554 7 0.1417822 0.1445545 0.007434338
70.01108911 8 0.1263366 0.1283168 0.007018437
80.01000000 9 0.1152475 0.1207921 0.006815861

Interpreting Two Complete Paths:

  • Path 1: Will Not Leave (Stayed/loyal)
    • First condition: satisfaction_level >= 47 percent.
    • Second condition: time_spend_company < 5 years.
    • Third condition: last_evaluation < 81 percent.
Hence, those who stayed are highly satisfied, have spent at least 4 years in the organization, and are good performers with an evaluation of at least 80 percent.
  • Path 2: Will Leave (resign)

    • First condition: satisfaction_level < 47 percent.
    • Second condition: number_project >= 3 projects.
    • Third condition: last_evaluation >= 58 percent.
Hence, those who leave are lowly or moderately satisfied, have a workload of 3 or more projects with their performance being evaluated at least 58 percent.
In [19]:
print(var_importance <- data.frame(ct1$variable.importance))
                      ct1.variable.importance
satisfaction_level                2203.411960
number_project                    1123.734104
average_montly_hours              1089.977545
last_evaluation                   1045.334696
time_spend_company                 826.356734
Work_accident                       30.114909
promotion_last_5years                8.196427
department                           2.017766
In [20]:
# how good is our initial model(ct1)?

val$resign_predicted <- predict(ct1, val, type = 'class')
In [21]:
print(summary(val))
 satisfaction_level last_evaluation  number_project  average_montly_hours
 Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
 1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:157.0       
 Median :0.6400     Median :0.7200   Median :4.000   Median :201.0       
 Mean   :0.6143     Mean   :0.7151   Mean   :3.785   Mean   :201.8       
 3rd Qu.:0.8100     3rd Qu.:0.8600   3rd Qu.:5.000   3rd Qu.:246.0       
 Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
                                                                         
 time_spend_company Work_accident        resigned    promotion_last_5years
 Min.   : 2.00      Min.   :0.0000   stayed  :3399   Min.   :0.00000      
 1st Qu.: 3.00      1st Qu.:0.0000   resigned:1046   1st Qu.:0.00000      
 Median : 3.00      Median :0.0000                   Median :0.00000      
 Mean   : 3.49      Mean   :0.1471                   Mean   :0.02092      
 3rd Qu.: 4.00      3rd Qu.:0.0000                   3rd Qu.:0.00000      
 Max.   :10.00      Max.   :1.0000                   Max.   :1.00000      
                                                                          
       department   salary_grade  resign_predicted
 sales      :1245   high  : 345   stayed  :3453   
 technical  : 795   low   :2173   resigned: 992   
 support    : 670   medium:1927                   
 IT         : 354                                 
 marketing  : 258                                 
 product_mng: 238                                 
 (Other)    : 885                                 
In [22]:
# rename factor levels of variable 'resigned_predicted'
levels(val$resign_predicted) <- c("predicted_stay", "predicted_resign") 
print(summary(val$resign_predicted))
  predicted_stay predicted_resign 
            3453              992 
In [23]:
# Crosstable of actual resignations versus predicted resignation values by initial model

CrossTable(val$resigned , val$resign_predicted)
 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  4445 

 
             | val$resign_predicted 
val$resigned |   predicted_stay | predicted_resign |        Row Total | 
-------------|------------------|------------------|------------------|
      stayed |             3353 |               46 |             3399 | 
             |          192.296 |          669.351 |                  | 
             |            0.986 |            0.014 |            0.765 | 
             |            0.971 |            0.046 |                  | 
             |            0.754 |            0.010 |                  | 
-------------|------------------|------------------|------------------|
    resigned |              100 |              946 |             1046 | 
             |          624.869 |         2175.072 |                  | 
             |            0.096 |            0.904 |            0.235 | 
             |            0.029 |            0.954 |                  | 
             |            0.022 |            0.213 |                  | 
-------------|------------------|------------------|------------------|
Column Total |             3453 |              992 |             4445 | 
             |            0.777 |            0.223 |                  | 
-------------|------------------|------------------|------------------|

 
In [24]:
# creating a second model for comparision

ct2 <- rpart(resigned ~ . , 
             data = train , 
             method = 'class' ,
             cp = .052)

rpart.plot(ct2)
In [25]:
### how good is our second model(ct2)?

val$resign_predicted2 <- predict(ct2, val, type = 'class')
print(summary(val))
 satisfaction_level last_evaluation  number_project  average_montly_hours
 Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
 1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:157.0       
 Median :0.6400     Median :0.7200   Median :4.000   Median :201.0       
 Mean   :0.6143     Mean   :0.7151   Mean   :3.785   Mean   :201.8       
 3rd Qu.:0.8100     3rd Qu.:0.8600   3rd Qu.:5.000   3rd Qu.:246.0       
 Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
                                                                         
 time_spend_company Work_accident        resigned    promotion_last_5years
 Min.   : 2.00      Min.   :0.0000   stayed  :3399   Min.   :0.00000      
 1st Qu.: 3.00      1st Qu.:0.0000   resigned:1046   1st Qu.:0.00000      
 Median : 3.00      Median :0.0000                   Median :0.00000      
 Mean   : 3.49      Mean   :0.1471                   Mean   :0.02092      
 3rd Qu.: 4.00      3rd Qu.:0.0000                   3rd Qu.:0.00000      
 Max.   :10.00      Max.   :1.0000                   Max.   :1.00000      
                                                                          
       department   salary_grade          resign_predicted resign_predicted2
 sales      :1245   high  : 345   predicted_stay  :3453    stayed  :3385    
 technical  : 795   low   :2173   predicted_resign: 992    resigned:1060    
 support    : 670   medium:1927                                             
 IT         : 354                                                           
 marketing  : 258                                                           
 product_mng: 238                                                           
 (Other)    : 885                                                           
In [26]:
# Crosstable of actual resignations versus predicted resignation values by second model

CrossTable(val$resigned , val$resign_predicted2)
 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  4445 

 
             | val$resign_predicted2 
val$resigned |    stayed |  resigned | Row Total | 
-------------|-----------|-----------|-----------|
      stayed |      3289 |       110 |      3399 | 
             |   189.606 |   605.488 |           | 
             |     0.968 |     0.032 |     0.765 | 
             |     0.972 |     0.104 |           | 
             |     0.740 |     0.025 |           | 
-------------|-----------|-----------|-----------|
    resigned |        96 |       950 |      1046 | 
             |   616.130 |  1967.547 |           | 
             |     0.092 |     0.908 |     0.235 | 
             |     0.028 |     0.896 |           | 
             |     0.022 |     0.214 |           | 
-------------|-----------|-----------|-----------|
Column Total |      3385 |      1060 |      4445 | 
             |     0.762 |     0.238 |           | 
-------------|-----------|-----------|-----------|

 
Initial model is better than the second model in terms of accuracy.