Commit 54f7270f by Srishti

Add new file

---
title: "Classification and Regression Tree"
author: "Srishti Puri"
date: "4/22/2021"
output: slidy_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r , echo=FALSE , warning=FALSE, message=FALSE}
library(rpart)
library(tidyverse)
library(readr)
library(rpart.plot)
library(gmodels)
library(funModeling)
set.seed(123)
HR_comma_sep <- read_csv("~/Desktop/HR_comma_sep.csv")
```
# Question: Create a model predicting "left" using a classification tree, and interpret the results. Use the HR_comma_sep.csv data.
## Requirements:
- Perform any initial data preparations.
- Split the data into training and validation.
- Develop an initial model.
- Identify the most important variables.
- Report on the accuracy of the model.
- Interpret two (2) complete paths.
# Results of the Model
- The model correctly identified 98.9% of the loyal employees.
- The model correctly identified 92.4% of the employees who left.
- The model is correct 97.4% of the time.
- The model incorrectly predicted that 1.1% of the loyal employees would leave.
- The model incorrectly predicted 7.6% of the employees who left as loyal.
- The model is incorrect 2.6% of the time.
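
These percentages are read off a confusion matrix of actual versus predicted classes. As a minimal sketch of that arithmetic, assuming made-up labels (`toy_actual` and `toy_predicted` are hypothetical and are not the HR data):

```r
# Hypothetical toy labels, not the HR data: 0 = stayed, 1 = resigned.
toy_actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
toy_predicted <- factor(c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes.
toy_cm <- table(actual = toy_actual, predicted = toy_predicted)

# Share of stayers predicted as stayers, and of leavers predicted as leavers.
loyal_correct <- toy_cm["0", "0"] / sum(toy_cm["0", ])
left_correct  <- toy_cm["1", "1"] / sum(toy_cm["1", ])

# Overall accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(toy_cm)) / sum(toy_cm)
```

The two error rates in the list are the complements of the first two shares, and the overall error rate is one minus the accuracy.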
## Characterizing loyalty:
- 11,428 employees (76% of the data set) are loyal.
- Three conditions which affect loyalty are:
- a high level of satisfaction (satisfaction_level >= 47 percent),
- fewer than 5 years spent in the organization (time_spend_company < 5 years) and
- an evaluation below 81 percent (last_evaluation < 81 percent).
## Characterizing resigned behavior:
- 3,571 employees (24% of the data set) resigned.
- Three conditions which affect 'resignation' are:
- low or moderate satisfaction (satisfaction_level < 47 percent),
- a workload of 3 or more projects (number_project >= 3 projects) and
- a performance evaluation of at least 58 percent (last_evaluation >= 58 percent).
### Confusion Matrix
```{r , echo=FALSE , warning=FALSE, message=FALSE}
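set.seed(123)
# Convert the categorical columns to factors and add one uniform random
# number per row, which is used to split the data.
hr <- HR_comma_sep %>%
  mutate(salary_grade = as.factor(salary_grade),
         department = as.factor(department),
         resigned = as.factor(resigned),
         random = runif(n()))
# Roughly 70% of the rows go to training and the rest to validation.
train <- hr %>%
  filter(random < .7) %>%
  select(-random)
val <- hr %>%
  filter(random >= .7) %>%
  select(-random)
# Fit the classification tree on the training set.
ct1 <- rpart(resigned ~ ., data = train, method = 'class')
var_importance <- data.frame(
  variable_importance = ct1$variable.importance)
# Predict on the validation set and cross-tabulate actual vs. predicted.
val$resign_predicted <- predict(ct1, val, type = 'class')
CrossTable(val$resigned, val$resign_predicted)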
```
# Initial Preparations of the Data set
## Exploring Data
There are 8 numeric variables and 2 categorical variables.
- Satisfaction level: Most employees are highly satisfied.
- Last Evaluation: Most employees are good performers, with 75% of the data set evaluated between 56 and 87 percent.
- Number of Projects: Most employees do a reasonable number of projects.
- Average Monthly Hours: Most employees spend a fairly high number of hours at work.
- Time Spent in Company: Fewer employees stay beyond 4 years.
```{r , echo=FALSE, message=FALSE, warning=FALSE}
hr_plot <- HR_comma_sep %>%
  select(-Work_accident, -salary_grade, -promotion_last_5years, -department, -resigned)
```
```{r , echo=FALSE, message=FALSE, warning=FALSE}
plot_num(hr_plot)
```
### Summary of the variable 'Work_accident'
- Work Accident: Most employees have not had an accident at work.
```{r echo=FALSE , warning=FALSE, message=FALSE}
freq(HR_comma_sep$Work_accident)
```
### Summary of the variable 'promotion_last_5years'
- Promotion in Last 5 Years: Most employees have not received a promotion in the last five years.
```{r echo=FALSE , warning=FALSE, message=FALSE}
freq(HR_comma_sep$promotion_last_5years)
```
### Summary of the variable 'resigned'
- Resigned: Most employees stay with the organization and do not leave.
```{r , echo=FALSE, message=FALSE, warning=FALSE}
freq(HR_comma_sep$resigned)
```
- 0 denotes those who stayed.
- 1 denotes those who resigned.
### Summary of the variable 'salary_grade'
- Salary: 8.25 percent of the employees are at the top level with the highest pay, 42.9 percent are paid a medium salary and 48.7 percent are paid a low salary.
```{r, echo=FALSE, message=FALSE, warning=FALSE}
freq(HR_comma_sep$salary_grade)
```
### Summary of the variable 'department'
- Department: Shows the number of employees in each department. Sales has the most employees (27 percent) and management the fewest (only 4.2 percent).
```{r, echo=FALSE, message=FALSE, warning=FALSE}
freq(HR_comma_sep$department)
```
# Splitting the Data into Training and Validation:
- Converted the categorical variables salary_grade, department and resigned to factors.
- Created a new variable 'random' using the runif() function, which draws random deviates from the uniform distribution.
- Used 'random' to split the data into a training set (train, random < 0.7) and a validation set (val, random >= 0.7):
- train (10540 observations with 10 variables) and
- val (4459 observations with 10 variables).
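
The split can be illustrated on a small made-up data frame. This is only a sketch of the idea, assuming a hypothetical `toy_df` rather than the HR data, and written in base R:

```r
set.seed(123)

# Hypothetical data frame standing in for the HR data.
toy_df <- data.frame(id = 1:1000)

# One uniform random number per row.
toy_df$random <- runif(nrow(toy_df))

# Rows below 0.7 go to training, the rest to validation.
toy_train <- toy_df[toy_df$random < 0.7, , drop = FALSE]
toy_val   <- toy_df[toy_df$random >= 0.7, , drop = FALSE]

# Every row lands in exactly one of the two sets.
nrow(toy_train) + nrow(toy_val) == nrow(toy_df)
```

Because the cut-off is applied to random numbers, the training share is only approximately 70%, which is why train holds 10540 of the 14999 rows rather than exactly 70%.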
# Creating and Interpreting the Classification Tree:
```{r echo=FALSE , warning=FALSE, message=FALSE}
rpart.plot(ct1)
ct1$cptable
```
## Interpreting Two Complete Paths:
- At the root, before any condition is applied to the training data set (train), the best guess is 0 (did NOT leave).
- This is because 76% of all observations did not leave and 24% left.
### Path 1 : Will Not Leave (loyal)
- First condition: satisfaction_level >= 47 percent.
- Second condition: time_spend_company < 5 years.
- Third condition: last_evaluation < 81 percent.
- Hence, those who did NOT leave are highly satisfied, have spent fewer than 5 years in the organization and have an evaluation below 81 percent.
### Path 2 : Will Leave (or are likely to resign)
- First condition: satisfaction_level < 47 percent.
- Second condition: number_project >= 3 projects.
- Third condition: last_evaluation >= 58 percent.
- Hence, those who leave have low or moderate satisfaction, a workload of 3 or more projects and a performance evaluation of at least 58 percent.
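
The two paths can be restated as plain conditions. The helper below is a hypothetical illustration, not part of the fitted model, and it assumes that satisfaction_level and last_evaluation are stored as proportions between 0 and 1 (so 47 percent is 0.47):

```r
# Hypothetical helper: restates the two paths above as plain conditions.
# Thresholds assume satisfaction and evaluation are on a 0-1 scale.
describe_path <- function(satisfaction_level, time_spend_company,
                          last_evaluation, number_project) {
  if (satisfaction_level >= 0.47 && time_spend_company < 5 &&
      last_evaluation < 0.81) {
    "Path 1: will not leave"
  } else if (satisfaction_level < 0.47 && number_project >= 3 &&
             last_evaluation >= 0.58) {
    "Path 2: will leave"
  } else {
    "Another path of the tree"
  }
}

# Made-up employees, one per path.
describe_path(0.80, 3, 0.70, 4)
describe_path(0.30, 4, 0.90, 5)
```

Employees who match neither set of conditions follow one of the other branches shown in the plot.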
# Variable importance:
- The variables are listed below in order of importance:
```{r , echo=FALSE , warning=FALSE, message=FALSE}
print(var_importance)
```
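
Raw importance scores are easier to compare once they are rescaled to sum to 100. A minimal sketch of that rescaling, assuming a stand-in tree fitted on the kyphosis data that ships with rpart (`toy_fit` is hypothetical and is not the HR model):

```r
library(rpart)

# Stand-in model on the kyphosis data shipped with rpart (not the HR data).
toy_fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class")

# Raw importance scores, one per variable.
toy_importance <- toy_fit$variable.importance

# Rescale so the scores sum to 100.
toy_importance_pct <- 100 * toy_importance / sum(toy_importance)
round(toy_importance_pct, 1)
```

The same two lines could be applied to ct1$variable.importance to express each variable's share of the total importance.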
# Reporting accuracy of the model:
## Using the Validation data set:
- Total number of employees who left: 1060
- Total number of employees predicted to leave: 1016
- Total number of loyal employees: 3399
- Total number of employees predicted as loyal: 3443
```{r, echo=FALSE , warning=FALSE, message=FALSE}
summary(val)
```