Add new file

Commit 54f7270f authored Mar 16, 2022 by Srishti

Showing with 230 additions and 0 deletions

CART.rmd CART.rmd +230 -0

--- a/CART.rmd
+++ b/CART.rmd
+---
+title: "Classification and Regression Tree"
+author: "Srishti Puri"
+date: "4/22/2021"
+output: slidy_presentation
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+```{r , echo=FALSE , warning=FALSE, message=FALSE}
+library(rpart)
+library(tidyverse)
+library(readr)
+library(rpart.plot)
+library(gmodels)
+library(funModeling)
+
+set.seed(123)
+HR_comma_sep <- read_csv("~/Desktop/HR_comma_sep.csv")
+```
+
+# Question: Create a model predicting "left" using a classification tree, and interpret the results. Use the HR_comma_sep.csv data. 
+
+## Requirements:
+
+- Perform any initial data preparations.
+- Split the data into training and validation sets.
+- Develop an initial model.
+- Identify the most important variables.
+- Report on the accuracy of the model.
+- Interpret two (2) complete paths.
+
+# Results of the Model
+
+- The model correctly classified 98.9% of the employees who stayed (loyal).
+
+- The model correctly classified 92.4% of the employees who left.
+
+- Overall, the model is correct 97.4% of the time.
+
+- The model incorrectly predicted that 1.1% of the loyal employees would leave.
+
+- The model incorrectly predicted that 7.6% of the employees who left would stay.
+
+- Overall, the model is incorrect 2.6% of the time.
+
+## Characterizing loyalty: 
+
+- 11,428 employees (76% of the data set) are loyal.
+- Three conditions characterize loyal employees:
+    - a high level of satisfaction (satisfaction_level >= 47 percent),
+    - fewer than 5 years spent in the organization (time_spend_company < 5 years), and
+    - an evaluation below 81 percent (last_evaluation < 81 percent).
+    
+## Characterizing resigned behavior:
+
+- 3,571 employees (24% of the data set) resigned.
+- Three conditions characterize employees who resigned:
+    - low or moderate satisfaction (satisfaction_level < 47 percent),
+    - a workload of 3 or more projects (number_project >= 3 projects), and
+    - an evaluation of at least 58 percent (last_evaluation >= 58 percent).
+
+### Confusion Matrix
+
+```{r , echo=FALSE , warning=FALSE, message=FALSE}
+set.seed(123)
+
+# Convert categorical columns to factors and attach one uniform draw per row
+hr <- HR_comma_sep %>%
+  mutate(salary_grade = as.factor(salary_grade),
+         department = as.factor(department),
+         resigned = as.factor(resigned),
+         random = runif(n()))
+
+# Roughly 70% of rows go to training, the rest to validation
+train <- hr %>%
+  filter(random < 0.7) %>%
+  select(-random)
+
+val <- hr %>%
+  filter(random >= 0.7) %>%
+  select(-random)
+
+# Fit the classification tree on all remaining predictors
+ct1 <- rpart(resigned ~ ., data = train, method = "class")
+
+var_importance <- data.frame(variable_importance = ct1$variable.importance)
+
+# Predict on the validation set and cross-tabulate actual vs. predicted
+val$resign_predicted <- predict(ct1, val, type = "class")
+
+CrossTable(val$resigned, val$resign_predicted)
+```
+
+# Initial Preparations of the Data set
+
+## Exploring Data
+
+There are 8 continuous variables and 2 categorical variables.
+
+- Satisfaction level: Most employees are highly satisfied.
+
+- Last Evaluation: Most employees are good performers, with 75% of the data set evaluated between 56 and 87 percent.
+
+- Number of Projects: Most employees do a reasonable number of projects.
+
+- Average Monthly Hours: Most employees spend a fairly high number of hours at work.
+
+- Time Spent in Company: Fewer employees stay beyond 4 years.
+
+
+```{r , echo=FALSE, message=FALSE, warning=FALSE}
+hr_plot <- HR_comma_sep %>% 
+  select(-Work_accident, -salary_grade, -promotion_last_5years, -department, -resigned)
+```
+
+
+```{r , echo=FALSE, message=FALSE, warning=FALSE}
+plot_num(hr_plot)
+```
+
+
+### Summary of the variable 'Work_accident'
+
+- Work Accident: Most employees have not had an accident at work.
+
+```{r echo=FALSE , warning=FALSE, message=FALSE}
+freq(HR_comma_sep$Work_accident)
+```
+
+### Summary of the variable 'promotion_last_5years'
+
+- Promotion in Last 5 Years: Most employees have not received a promotion in the last five years.
+
+```{r echo=FALSE , warning=FALSE, message=FALSE}
+freq(HR_comma_sep$promotion_last_5years)
+```
+
+### Summary of the variable 'resigned'
+
+- Resigned: Most employees stay with the organization and do not leave.
+
+```{r , echo=FALSE, message=FALSE, warning=FALSE}
+freq(HR_comma_sep$resigned)
+```
+
+- 0 denotes those who stayed.
+- 1 denotes those who resigned.
+
+### Summary of the variable 'salary_grade'
+
+- Salary: 8.25 percent of the employees are top level with the highest pay, 42.9 percent are paid a medium salary and 48.7 percent are paid a low salary.
+
+```{r, echo=FALSE, message=FALSE, warning=FALSE}
+freq(HR_comma_sep$salary_grade)
+```
+
+### Summary of the variable 'department'
+
+- Department: Shows the number of employees in each department. Sales has the most employees at 27 percent and management the fewest at only 4.2 percent.
+
+```{r, echo=FALSE, message=FALSE, warning=FALSE}
+freq(HR_comma_sep$department)
+```
+
+# Splitting the Data into Training and Validation:
+
+- Converted the categorical variables salary_grade, department and resigned to factors.
+
+- Created a new variable 'random' using the runif() function, which draws one random value per row from the uniform distribution on [0, 1].
+
+- Split the data on that variable: rows with random < 0.7 go to the training set (train) and the remaining rows go to the validation set (val), an approximate 70/30 split:
+    - train (10540 observations with 10 variables) and
+    - val (4459 observations with 10 variables).
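+
+The split described above can be sketched on a small stand-in data set. This is a minimal illustration, not the code used for the reported results: the data frame `toy` and the names `toy_train` and `toy_val` are hypothetical, and only the runif() threshold logic mirrors the split of the HR data.
+
+```{r}
+# Toy data frame standing in for the HR data (hypothetical, for illustration only)
+set.seed(123)
+toy <- data.frame(id = 1:1000)
+
+# One uniform draw per row; nrow() avoids hard-coding the row count
+toy$random <- runif(nrow(toy))
+
+# Rows below the 0.7 threshold go to training, the rest to validation
+toy_train <- toy[toy$random < 0.7, "id", drop = FALSE]
+toy_val   <- toy[toy$random >= 0.7, "id", drop = FALSE]
+
+# Every row lands in exactly one of the two sets
+stopifnot(nrow(toy_train) + nrow(toy_val) == nrow(toy))
+
+# Realized share of rows in each set (close to, but not exactly, 0.7 and 0.3)
+c(train = nrow(toy_train), val = nrow(toy_val)) / nrow(toy)
+```
+
+Because each row is assigned independently, the realized shares vary around 70/30, which is consistent with the 10540 / 4459 row counts reported above.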
+
+# Creating and Interpreting the Classification Tree:
+
+```{r echo=FALSE , warning=FALSE, message=FALSE}
+rpart.plot(ct1)
+
+ct1$cptable
+```
+
+## Interpreting Two Complete Paths:
+
+- At the root, before any condition is applied to the training data set (train), the best guess is 0 (did NOT leave).
+
+- That is, 76% of the observations did not leave and 24% left.
+
+### Path 1 : Will Not Leave (loyal)
+
+- First condition: satisfaction_level >= 47 percent. 
+- Second condition: time_spend_company < 5 years.
+- Third condition: last_evaluation < 81 percent. 
+
+- Hence, those who did NOT leave are highly satisfied, have spent fewer than 5 years in the organization and have an evaluation below 81 percent.
+
+### Path 2 : Will Leave (or are likely to resign)
+
+- First condition: satisfaction_level < 47 percent. 
+- Second condition: number_project >= 3 projects.
+- Third condition: last_evaluation >= 58 percent. 
+
+- Hence, those who leave have low or moderate satisfaction, carry a workload of 3 or more projects and have an evaluation of at least 58 percent.
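+
+Paths like the two above can also be listed programmatically instead of being read off the plot. The sketch below uses rpart.rules() from the rpart.plot package on the kyphosis data set that ships with rpart; that data set and the object name `fit_demo` are stand-ins chosen so the chunk runs on its own, and the same call should work on `ct1`.
+
+```{r}
+library(rpart)
+library(rpart.plot)
+
+# Small classification tree on a built-in data set (stand-in for the HR model)
+fit_demo <- rpart(Kyphosis ~ Age + Number + Start,
+                  data = kyphosis, method = "class")
+
+# One row per leaf: the fitted value followed by the conditions on the path to it;
+# cover = TRUE adds the percentage of observations that fall in each leaf
+rpart.rules(fit_demo, cover = TRUE)
+```
+
+Each printed row corresponds to one complete root-to-leaf path, so the conditions in Path 1 and Path 2 can be checked directly against this kind of output.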
+
+
+# Variable importance: 
+
+- The variables are listed below in decreasing order of importance:
+
+```{r , echo=FALSE , warning=FALSE, message=FALSE}
+print(var_importance)
+```
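+
+Raw importance scores are easier to compare when rescaled to shares of the total. The sketch below shows one way to do that; it again uses the built-in kyphosis data and a hypothetical `fit_imp` object so that it runs on its own, and the rescaling is an illustrative choice rather than part of the original analysis.
+
+```{r}
+library(rpart)
+
+# Stand-in tree; with the HR model, ct1$variable.importance would be used instead
+fit_imp <- rpart(Kyphosis ~ Age + Number + Start,
+                 data = kyphosis, method = "class")
+
+# Named numeric vector of importance scores
+imp <- fit_imp$variable.importance
+
+# Express each score as a percentage of the total, largest first
+imp_pct <- sort(round(100 * imp / sum(imp), 1), decreasing = TRUE)
+data.frame(importance_pct = imp_pct)
+```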
+
+# Reporting accuracy of the model:
+
+## Using the Validation data set:
+
+- Total number of employees who left: 1060
+
+- Total number of employees predicted to leave: 1016
+
+- Total number of loyal employees: 3399
+
+- Total number of employees predicted as loyal: 3443
+
+```{r, echo=FALSE , warning=FALSE, message=FALSE}
+summary(val)
+```
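+
+As a cross-check, the rates in "Results of the Model" can be recomputed from a 2 x 2 confusion matrix. The four cell counts below do not appear in the text; they are inferred from the reported totals and percentages, so treat them as an assumption and replace them with the CrossTable output if it differs.
+
+```{r}
+# Cell counts inferred from the reported margins and rates (assumption):
+# rows = actual class, columns = predicted class, 0 = stayed, 1 = resigned
+cm <- matrix(c(3362,  37,
+                 81, 979),
+             nrow = 2, byrow = TRUE,
+             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
+
+# The margins reproduce the totals listed above
+stopifnot(all(rowSums(cm) == c(3399, 1060)),
+          all(colSums(cm) == c(3443, 1016)))
+
+accuracy     <- sum(diag(cm)) / sum(cm)       # share of all predictions that are correct
+loyal_rate   <- cm["0", "0"] / sum(cm["0", ]) # loyal employees correctly identified
+resign_rate  <- cm["1", "1"] / sum(cm["1", ]) # resigned employees correctly identified
+
+round(100 * c(accuracy = accuracy,
+              loyal_correct = loyal_rate,
+              resigned_correct = resign_rate), 1)
+# accuracy 97.4, loyal_correct 98.9, resigned_correct 92.4
+```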
+
+