DDS Project 2
Project investigating employee dataset for DDS Analytics
DDSAnalytics Case Study
Aaron Abromowitz
2023-08-04
Introduction
This case study analyzes data provided by DDSAnalytics, a talent management firm for Fortune 500 companies. The dataset contains information about individual employees, including salary, satisfaction ratings, gender, etc. The task was to analyze this data, provide any useful insights, and derive one model to predict Attrition (categorical) and one to predict Monthly Income (numeric).
Github Repository
This link goes to the Github repository where my predictions (plus other files) are stored.
Exploratory Data Analysis (EDA)
RShiny App
This link goes to an RShiny app that lets you explore the data in real time.
Load Packages
Several packages will be needed for later analysis
suppressMessages({ # Don't need to see the messages from loading the packages
library(tidyverse)
library(RCurl)
library(e1071)
library(caret)
library(class) # provides knn(), used for the KNN models below
library(mltools)
library(data.table)
library(EMT)
library(xgboost)
})
Load Dataset
Load the dataset from AWS
data <- read.table(textConnection(getURL(
"https://s3.us-east-2.amazonaws.com/ddsproject1/CaseStudy2-data.csv"
)), sep=",", header=TRUE)
Missing values
Are there any missing values?
any(is.na(data))
## [1] FALSE
It looks like there are no NAs, so there is no need to impute any data.
Summaries
Does looking at the summary of each column show anything useful about the data?
summary(data)
## ID Age Attrition BusinessTravel
## Min. : 1.0 Min. :18.00 Length:870 Length:870
## 1st Qu.:218.2 1st Qu.:30.00 Class :character Class :character
## Median :435.5 Median :35.00 Mode :character Mode :character
## Mean :435.5 Mean :36.83
## 3rd Qu.:652.8 3rd Qu.:43.00
## Max. :870.0 Max. :60.00
## DailyRate Department DistanceFromHome Education
## Min. : 103.0 Length:870 Min. : 1.000 Min. :1.000
## 1st Qu.: 472.5 Class :character 1st Qu.: 2.000 1st Qu.:2.000
## Median : 817.5 Mode :character Median : 7.000 Median :3.000
## Mean : 815.2 Mean : 9.339 Mean :2.901
## 3rd Qu.:1165.8 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :1499.0 Max. :29.000 Max. :5.000
## EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
## Length:870 Min. :1 Min. : 1.0 Min. :1.000
## Class :character 1st Qu.:1 1st Qu.: 477.2 1st Qu.:2.000
## Mode :character Median :1 Median :1039.0 Median :3.000
## Mean :1 Mean :1029.8 Mean :2.701
## 3rd Qu.:1 3rd Qu.:1561.5 3rd Qu.:4.000
## Max. :1 Max. :2064.0 Max. :4.000
## Gender HourlyRate JobInvolvement JobLevel
## Length:870 Min. : 30.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Median : 66.00 Median :3.000 Median :2.000
## Mean : 65.61 Mean :2.723 Mean :2.039
## 3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :100.00 Max. :4.000 Max. :5.000
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## Length:870 Min. :1.000 Length:870 Min. : 1081
## Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2840
## Mode :character Median :3.000 Mode :character Median : 4946
## Mean :2.709 Mean : 6390
## 3rd Qu.:4.000 3rd Qu.: 8182
## Max. :4.000 Max. :19999
## MonthlyRate NumCompaniesWorked Over18 OverTime
## Min. : 2094 Min. :0.000 Length:870 Length:870
## 1st Qu.: 8092 1st Qu.:1.000 Class :character Class :character
## Median :14074 Median :2.000 Mode :character Mode :character
## Mean :14326 Mean :2.728
## 3rd Qu.:20456 3rd Qu.:4.000
## Max. :26997 Max. :9.000
## PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
## Min. :11.0 Min. :3.000 Min. :1.000 Min. :80
## 1st Qu.:12.0 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80
## Median :14.0 Median :3.000 Median :3.000 Median :80
## Mean :15.2 Mean :3.152 Mean :2.707 Mean :80
## 3rd Qu.:18.0 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80
## Max. :25.0 Max. :4.000 Max. :4.000 Max. :80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## Min. :0.0000 Min. : 0.00 Min. :0.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000
## Median :1.0000 Median :10.00 Median :3.000 Median :3.000
## Mean :0.7839 Mean :11.05 Mean :2.832 Mean :2.782
## 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.0000 Max. :40.00 Max. :6.000 Max. :4.000
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 5.000 Median : 3.000 Median : 1.000
## Mean : 6.962 Mean : 4.205 Mean : 2.169
## 3rd Qu.:10.000 3rd Qu.: 7.000 3rd Qu.: 3.000
## Max. :40.000 Max. :18.000 Max. :15.000
## YearsWithCurrManager
## Min. : 0.00
## 1st Qu.: 2.00
## Median : 3.00
## Mean : 4.14
## 3rd Qu.: 7.00
## Max. :17.00
A few observations:
- ID: identification variable
- EmployeeCount: all 1's
- EmployeeNumber: identification variable
- StandardHours: all 80's
- YearsAtCompany, YearsSinceLastPromotion: max values significantly larger than the 3rd quartile
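The constant columns can also be confirmed programmatically. A quick sketch (column names taken from the summary above):

# Columns with a single unique value carry no information
constant_cols <- names(data)[sapply(data, function(col) length(unique(col)) == 1)]
constant_cols # expect EmployeeCount and StandardHours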
Relabeling Numeric Categories
- Remove ID, EmployeeCount, EmployeeNumber, StandardHours
- Replace Categorical Variables that are numeric with word equivalents (Source: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)
- Set Categorical Variables to be factors
# 1. Remove ID, EmployeeCount, EmployeeNumber, StandardHours
data$ID <- c()
data$EmployeeCount <- c()
data$EmployeeNumber <- c()
data$StandardHours <- c()
# 2. Replace Categorical Variables that are numeric with word equivalents
data$Education[data$Education == 1] <- 'Below College'
data$Education[data$Education == 2] <- 'College'
data$Education[data$Education == 3] <- 'Bachelor'
data$Education[data$Education == 4] <- 'Master'
data$Education[data$Education == 5] <- 'Doctor'
data$EnvironmentSatisfaction[data$EnvironmentSatisfaction == 1] <- 'Low'
data$EnvironmentSatisfaction[data$EnvironmentSatisfaction == 2] <- 'Medium'
data$EnvironmentSatisfaction[data$EnvironmentSatisfaction == 3] <- 'High'
data$EnvironmentSatisfaction[data$EnvironmentSatisfaction == 4] <- 'Very High'
data$JobInvolvement[data$JobInvolvement == 1] <- 'Low'
data$JobInvolvement[data$JobInvolvement == 2] <- 'Medium'
data$JobInvolvement[data$JobInvolvement == 3] <- 'High'
data$JobInvolvement[data$JobInvolvement == 4] <- 'Very High'
data$JobSatisfaction[data$JobSatisfaction == 1] <- 'Low'
data$JobSatisfaction[data$JobSatisfaction == 2] <- 'Medium'
data$JobSatisfaction[data$JobSatisfaction == 3] <- 'High'
data$JobSatisfaction[data$JobSatisfaction == 4] <- 'Very High'
data$PerformanceRating[data$PerformanceRating == 1] <- 'Low'
data$PerformanceRating[data$PerformanceRating == 2] <- 'Good'
data$PerformanceRating[data$PerformanceRating == 3] <- 'Excellent'
data$PerformanceRating[data$PerformanceRating == 4] <- 'Outstanding'
data$RelationshipSatisfaction[data$RelationshipSatisfaction == 1] <- 'Low'
data$RelationshipSatisfaction[data$RelationshipSatisfaction == 2] <- 'Medium'
data$RelationshipSatisfaction[data$RelationshipSatisfaction == 3] <- 'High'
data$RelationshipSatisfaction[data$RelationshipSatisfaction == 4] <- 'Very High'
data$WorkLifeBalance[data$WorkLifeBalance == 1] <- 'Bad'
data$WorkLifeBalance[data$WorkLifeBalance == 2] <- 'Good'
data$WorkLifeBalance[data$WorkLifeBalance == 3] <- 'Better'
data$WorkLifeBalance[data$WorkLifeBalance == 4] <- 'Best'
# 3. Set Categorical Variables to be factors
data$Attrition <- factor(data$Attrition,levels=c("Yes","No")) # Making "Yes" the true value for bayes
data$BusinessTravel <- as.factor(data$BusinessTravel)
data$Department <- as.factor(data$Department)
data$EducationField <- as.factor(data$EducationField)
data$Gender <- as.factor(data$Gender)
data$JobRole <- as.factor(data$JobRole)
data$MaritalStatus <- as.factor(data$MaritalStatus)
data$OverTime <- as.factor(data$OverTime)
data$Education <- factor(data$Education,levels=c("Below College","College","Bachelor","Master","Doctor"))
data$EnvironmentSatisfaction <- factor(data$EnvironmentSatisfaction,levels=c("Low","Medium","High","Very High"))
data$JobInvolvement <- factor(data$JobInvolvement,levels=c("Low","Medium","High","Very High"))
data$JobLevel <- as.factor(data$JobLevel)
data$JobSatisfaction <- factor(data$JobSatisfaction,levels=c("Low","Medium","High","Very High"))
data$PerformanceRating <- factor(data$PerformanceRating,levels=c("Low","Good","Excellent","Outstanding"))
data$RelationshipSatisfaction <- factor(data$RelationshipSatisfaction,levels=c("Low","Medium","High","Very High"))
data$StockOptionLevel <- as.factor(data$StockOptionLevel)
data$WorkLifeBalance <- factor(data$WorkLifeBalance,levels=c("Bad","Good","Better","Best"))
# Look at the summary again
summary(data)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 Yes:140 Non-Travel : 94 Min. : 103.0
## 1st Qu.:30.00 No :730 Travel_Frequently:158 1st Qu.: 472.5
## Median :35.00 Travel_Rarely :618 Median : 817.5
## Mean :36.83 Mean : 815.2
## 3rd Qu.:43.00 3rd Qu.:1165.8
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 35 Min. : 1.000 Below College: 98
## Research & Development:562 1st Qu.: 2.000 College :182
## Sales :273 Median : 7.000 Bachelor :324
## Mean : 9.339 Master :240
## 3rd Qu.:14.000 Doctor : 26
## Max. :29.000
##
## EducationField EnvironmentSatisfaction Gender HourlyRate
## Human Resources : 15 Low :172 Female:354 Min. : 30.00
## Life Sciences :358 Medium :178 Male :516 1st Qu.: 48.00
## Marketing :100 High :258 Median : 66.00
## Medical :270 Very High:262 Mean : 65.61
## Other : 52 3rd Qu.: 83.00
## Technical Degree: 75 Max. :100.00
##
## JobInvolvement JobLevel JobRole JobSatisfaction
## Low : 47 1:329 Sales Executive :200 Low :179
## Medium :228 2:312 Research Scientist :172 Medium :166
## High :514 3:132 Laboratory Technician :153 High :254
## Very High: 81 4: 60 Manufacturing Director : 87 Very High:271
## 5: 37 Healthcare Representative: 76
## Sales Representative : 53
## (Other) :129
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked
## Divorced:191 Min. : 1081 Min. : 2094 Min. :0.000
## Married :410 1st Qu.: 2840 1st Qu.: 8092 1st Qu.:1.000
## Single :269 Median : 4946 Median :14074 Median :2.000
## Mean : 6390 Mean :14326 Mean :2.728
## 3rd Qu.: 8182 3rd Qu.:20456 3rd Qu.:4.000
## Max. :19999 Max. :26997 Max. :9.000
##
## Over18 OverTime PercentSalaryHike PerformanceRating
## Length:870 No :618 Min. :11.0 Low : 0
## Class :character Yes:252 1st Qu.:12.0 Good : 0
## Mode :character Median :14.0 Excellent :738
## Mean :15.2 Outstanding:132
## 3rd Qu.:18.0
## Max. :25.0
##
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## Low :174 0:379 Min. : 0.00
## Medium :171 1:355 1st Qu.: 6.00
## High :261 2: 81 Median :10.00
## Very High:264 3: 55 Mean :11.05
## 3rd Qu.:15.00
## Max. :40.00
##
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Bad : 48 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 Good :192 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Better:532 Median : 5.000 Median : 3.000
## Mean :2.832 Best : 98 Mean : 6.962 Mean : 4.205
## 3rd Qu.:3.000 3rd Qu.:10.000 3rd Qu.: 7.000
## Max. :6.000 Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 2.00
## Median : 1.000 Median : 3.00
## Mean : 2.169 Mean : 4.14
## 3rd Qu.: 3.000 3rd Qu.: 7.00
## Max. :15.000 Max. :17.00
##
This looks better. It also looks like we can remove the Over18 field.
data$Over18 <- c()
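As an aside, the repeated relabeling and factor conversion above could be collapsed into a single factor() call per column, since factor() accepts labels directly. A sketch using Education as the example (raw_education stands in for the original 1-5 codes, before the relabeling above):

# One-step relabel + ordering for a 1-5 coded column
raw_education <- c(3, 1, 5) # hypothetical example codes
factor(raw_education, levels = 1:5,
       labels = c("Below College","College","Bachelor","Master","Doctor"))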
Histograms for YearsSinceLastPromotion and YearsAtCompany
YearsSinceLastPromotion and YearsAtCompany both seem to have a right tail. Check whether they need to be handled separately.
# Try creating a histogram for YearsSinceLastPromotion
data %>% ggplot(aes(x=YearsSinceLastPromotion)) +
geom_histogram(fill = "#13294b", binwidth =1) + ylab('Count') +
xlab('Years Since Last Promotion') + ggtitle('Histogram of Years Since Last Promotion') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
# Try creating a histogram for YearsAtCompany
data %>% ggplot(aes(x=YearsAtCompany)) +
geom_histogram(fill = "#13294b", binwidth =1) + ylab('Count') +
xlab('Years at Company') + ggtitle('Histogram of Years at Company') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
For YearsSinceLastPromotion, there is a right tail, but I wouldn't call the maximum value an outlier. For YearsAtCompany, the 40 at the end sits by itself, and there are a few values around 30. It could be worthwhile to create a new variable that caps out at 30, to see if it performs any differently.
data$YearsAtCompany_Capped <- pmin(data$YearsAtCompany,30)
# Plot it
data %>% ggplot(aes(x=YearsAtCompany_Capped)) +
geom_histogram(fill = "#13294b", binwidth =1) + ylab('Count') +
xlab('Years at Company') + ggtitle('Histogram of Years at Company - Capped') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
Plot the interesting relationships
We should check whether the relationships with relatively high correlations show up visually.
# 1. Martial Status and Stock Option Level
data %>% ggplot(aes(x=StockOptionLevel, fill=MaritalStatus)) +
geom_bar() + xlab('Stock Option Level') + ylab('Count') +
ggtitle('Stock Option Level by Marital Status') +
scale_fill_manual("Marital Status",
values = c("Single" = "#13294b", "Divorced" = "lightblue", "Married" = "blue")) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
All single employees have Stock Option Level 0. I wonder how Stock Option Level compares to Attrition?
# How does this compare to Attrition?
data %>% ggplot(aes(x=StockOptionLevel, fill=Attrition)) +
geom_bar() + xlab('Stock Option Level') + ylab('Count') +
ggtitle('Stock Option Level by Attrition') +
scale_fill_manual("Attrition",
values = c("Yes" = "blue", "No" = "#13294b")) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
It does look like Stock Option Level 0 is associated with higher Attrition.
# 2. Attrition and OverTime
data %>% ggplot(aes(x=OverTime, fill=Attrition)) +
geom_bar() + xlab('Overtime') + ylab('Count') +
ggtitle('Overtime by Attrition') +
scale_fill_manual("Attrition",
values = c("Yes" = "blue", "No" = "#13294b")) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
The count of Attrition looks about the same for overtime and no overtime, but how does that compare as a percentage?
# Try plotting percentage
data %>% ggplot(aes(x=OverTime, fill=Attrition)) +
geom_bar(position = "fill") + xlab('Overtime') + ylab('Percent') +
ggtitle('Overtime by Attrition Percentage') +
scale_fill_manual("Attrition",
values = c("Yes" = "blue", "No" = "#13294b")) +
scale_y_continuous(labels = scales::percent_format(scale = 100)) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
The relationship is much clearer when you look at the difference as a percentage.
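The same comparison can be checked numerically. A quick sketch using base R:

# Attrition rate within each OverTime group (rows sum to 1)
prop.table(table(data$OverTime, data$Attrition), margin = 1)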
# 3. Age and Education
data %>% ggplot(aes(x=Age)) +
geom_histogram(fill="#13294b") + xlab('Age') + ylab('Count') +
ggtitle('Age vs Education Level') +
facet_wrap(~ Education, ncol=1) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 7),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Age seems to go up slightly with Education level. This makes some sense, but the majority of ages overlap. For example, the mode for Below College is around 25, but there are Master's graduates at that age as well.
Attrition Model
I have already found the highly correlated variables. First I will look at a few variables that are highly correlated with Attrition, then try to create a model to predict Attrition using those variables.
Looking at Categorical Variables associated with Attrition
We already have correlations between Attrition and the other variables. Let's plot some of the more highly correlated variables.
Let’s start off with Attrition vs Job Involvement
# Plot Attrition vs JobInvolvement
data %>% ggplot(aes(x=JobInvolvement,fill=Attrition)) +
geom_bar(position="fill") + xlab('Job Involvement') + ylab('Percent') +
ggtitle('Attrition based on Job Involvement') +
scale_fill_manual("Attrition",
values = c("Yes" = "blue", "No" = "#13294b")) +
scale_y_continuous(labels = scales::percent_format(scale = 100)) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
Then Attrition vs Job Level
# Plot Attrition vs JobLevel
data %>% ggplot(aes(x=JobLevel,fill=Attrition)) +
geom_bar(position="fill") + xlab('Job Level') + ylab('Percent') +
ggtitle('Attrition based on Job Level') +
scale_fill_manual("Attrition",
values = c("Yes" = "blue", "No" = "#13294b")) +
scale_y_continuous(labels = scales::percent_format(scale = 100)) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
Let’s also look at Attrition vs Marital Status, since it has a high correlation.
# Plot Attrition vs MaritalStatus
data %>% ggplot(aes(x=MaritalStatus,fill=Attrition)) +
geom_bar(position="fill") + xlab('Marital Status') + ylab('Percent') +
ggtitle('Attrition based on Marital Status') +
scale_fill_manual("Attrition",
values = c("Yes" = "blue", "No" = "#13294b")) +
scale_y_continuous(labels = scales::percent_format(scale = 100)) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
Looking at Numerical Variables associated with Attrition
# Plot Attrition vs TotalWorkingYears
data %>% ggplot(aes(x=TotalWorkingYears)) +
geom_histogram(fill="#13294b") + xlab('Total Working Years') + ylab('Count') +
ggtitle('Attrition vs Total Working Years') +
facet_wrap(~ Attrition, ncol=1, scales='free') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 7),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
It looks like Attrition = Yes has more of its values at the lower end.
# Plot Attrition vs MonthlyIncome
data %>% ggplot(aes(x=MonthlyIncome)) +
geom_histogram(fill="#13294b") + xlab('Monthly Income') + ylab('Count') +
ggtitle('Attrition vs Monthly Income') +
facet_wrap(~ Attrition, ncol=1, scales='free') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 7),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
It looks like there are fewer high monthly incomes for Attrition = Yes.
Top 3 Factors
Based on the cor.test values, the top 3 factors are OverTime, MaritalStatus, and JobInvolvement. Looking at those plots, along with the plots of some of the numeric variables, the categorical variables seem to contribute more toward turnover. So those three variables are good candidates for the top three contributors to turnover.
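For reference, the cor.test values mentioned above can be reproduced with something like the sketch below. The unclass() trick matches what the model-building functions later in this report do; the variable list here is just an illustrative subset:

# p-values for the correlation between Attrition and a few candidate variables
vars_to_check <- c("OverTime", "MaritalStatus", "JobInvolvement", "JobLevel")
p_vals <- sapply(vars_to_check, function(v)
  cor.test(unclass(data$Attrition), unclass(data[[v]]))$p.value)
sort(p_vals)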
Creating Extra variables
I created extra variables using 3 approaches:
1. Squaring the numeric variables
2. One-hot encoding the categorical variables
3. Crossing all the variables
# Create var lists
cat_vars <- c('BusinessTravel','Department','Education','EducationField',
'EnvironmentSatisfaction','Gender','JobInvolvement','JobLevel','JobRole',
'JobSatisfaction','MaritalStatus','OverTime','PerformanceRating',
'RelationshipSatisfaction','StockOptionLevel','WorkLifeBalance')
num_vars <- c('Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome',
'MonthlyRate','NumCompaniesWorked','PercentSalaryHike','TotalWorkingYears',
'TrainingTimesLastYear','YearsAtCompany','YearsAtCompany_Capped',
'YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager')
# Get onehot encoded variables
before_onehot <- data
data <- one_hot(as.data.table(data))
data <- as.data.frame(data)
data$Attrition <- before_onehot$Attrition
data$Attrition_Yes <- c()
data$Attrition_No <- c()
nvar <- length(cat_vars)
for (ii in 1:nvar){
var <- cat_vars[ii]
data[,var] <- before_onehot[,var]
}
names(data) <- gsub(' ','_',names(data))
names(data) <- gsub('&','AND',names(data))
names(data) <- gsub('-','_',names(data))
col_names <- names(data)
onehot_vars <- col_names[grepl('_', names(data))]
onehot_vars <- onehot_vars[!grepl('squared', onehot_vars)]
# Make cross products of variables
data_before_cross <- data
to_cross_vars <- c(num_vars,onehot_vars)
nvars <- length(to_cross_vars)
cross_vars <- c()
for (ii in 1:nvars){
for(jj in 1:nvars){
if (ii<jj){
var1 <- to_cross_vars[ii]
var2 <- to_cross_vars[jj]
new_col <- data[,var1]*data[,var2]
if(sum(new_col)!=0){
new_var <- paste(var1,var2,sep="X")
data[,new_var]<-new_col
cross_vars <- c(cross_vars,new_var)
}
}
}
}
Looking at Created Variables
# Plot Attrition vs JobLevel_1XOverTime_Yes
data %>% ggplot(aes(x=JobLevel_1XOverTime_Yes,fill=Attrition)) +
geom_bar(position="fill") + xlab('Job Level = 1 X Overtime = Yes') + ylab('Percent') +
ggtitle('Attrition by Created Variable JobLevel_1XOverTime_Yes') +
scale_fill_manual("Attrition",
values = c("Yes" = "blue", "No" = "#13294b")) +
scale_y_continuous(labels = scales::percent_format(scale = 100)) +
scale_x_discrete(breaks=c(0,1),limits=c(0,1)) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=0.5, color = "#13294b",size = 14),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 14))
# Plot Attrition vs OverTime_YesXStockOptionLevel_0
data %>% ggplot(aes(x=OverTime_YesXStockOptionLevel_0,fill=Attrition)) +
geom_bar(position="fill") + xlab('Overtime = Yes X Stock Option Level = 0') + ylab('Percent') +
ggtitle('Attrition by Created Variable OverTime_YesXStockOptionLevel_0') +
scale_fill_manual("Attrition",
values = c("Yes" = "blue", "No" = "#13294b")) +
scale_y_continuous(labels = scales::percent_format(scale = 100)) +
scale_x_discrete(breaks=c(0,1),limits=c(0,1)) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=0.5, color = "#13294b",size = 14),
axis.text.y = element_text(color = "#13294b",size = 14),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 14))
Both of these crossed variables show roughly 50% Attrition whenever they're true. They could definitely be very predictive.
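That claim can be checked numerically. A quick sketch:

# Attrition rate when each crossed indicator is 0 vs 1 (rows sum to 1)
prop.table(table(data$JobLevel_1XOverTime_Yes, data$Attrition), margin = 1)
prop.table(table(data$OverTime_YesXStockOptionLevel_0, data$Attrition), margin = 1)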
Creating and testing out simple models
First, I will try using 3 variables with a Naive Bayes model. Then I will see how well it predicts by measuring its Sensitivity and Specificity. NOTE: I’m not running the code because it takes a while
# Setting up parameters
seed <- 10
prop_test <- .1
num_tests <- 200
formula_str <- 'Attrition ~ JobLevel_1XOverTime_Yes+OverTime_YesXStockOptionLevel_0+MaritalStatus_SingleXOverTime_Yes'
thresh <- .6
metric <- 0
num_passed <- 0
set.seed(seed) # re-setting the seed for some sort of predictability
for (jj in 1:num_tests){
# Creating training and test data
indices = sample(seq(1:nrow(data)),round(prop_test*nrow(data)))
validation = data[indices,]
train = data[-indices,]
# Creating the model using training data
model <- naiveBayes(as.formula(formula_str), data = train)
# Testing the model using test data
pred <- predict(model, validation)
cmBayes <- confusionMatrix(table(pred,validation$Attrition))
# Determining how well the model did
sens <- as.numeric(cmBayes$byClass["Sensitivity"])
if(is.na(sens)) {
sens<-0
}
spec <- as.numeric(cmBayes$byClass["Specificity"])
if(is.na(spec)){
spec<-0
}
if(sens>thresh && spec>thresh) {
num_passed <- num_passed + 1
}
metric <- metric + sens + spec
}
# Outputting metrics
metric / num_tests
num_passed
For 3 of the original variables, the metric was 1.082 and 0 passed. For 3 created variables, the metric was 1.409 and 56 passed.
Then I will try the same thing with KNN models.
seed <- 10
prop_test <- .1
num_tests <- 200
vars <- c('JobLevel_1XOverTime_Yes','OverTime_YesXStockOptionLevel_0','MaritalStatus_SingleXOverTime_Yes')
thresh <- .6
k_to_use <- 13
metric <- 0
num_passed <- 0
set.seed(seed) # re-setting the seed for some sort of predictability
for (jj in 1:num_tests){
# Creating training and test data
indices = sample(seq(1:nrow(data)),round(prop_test*nrow(data)))
validation = data[indices,]
train = data[-indices,]
# Creating the model using training data
classifications <- knn(train[,vars], # you might need at least 2 variables
validation[,vars],
train$Attrition, prob = TRUE, k = k_to_use)
# Testing the model using test data
CM = confusionMatrix(table(validation$Attrition,classifications))
# Determining how well the model did
sens <- as.numeric(CM$byClass["Sensitivity"])
if(is.na(sens)) {
sens<-0
}
spec <- as.numeric(CM$byClass["Specificity"])
if(is.na(spec)){
spec<-0
}
if(sens>thresh && spec>thresh) {
num_passed <- num_passed + 1
}
metric <- metric + sens + spec
}
# Outputting metrics
metric / num_tests
num_passed
For 3 of the original variables, the metric was 1.091 and 14 passed. For 3 created variables, the metric was 1.572 and 135 passed. At this point, KNN is performing better than Naive Bayes.
Creating a more complex model
I wanted to see what happened if I tried using more variables. This can be dangerous because of overfitting, but it can also do much better. The approach I took was:
1. Sort the variables by p-value from R's cor.test function
2. Go through those variables in order:
   a. Add each one to the model
   b. See if the metric increased
   c. If it did, keep the variable in the model
3. Output the (potentially lengthy) formula

I made sure that the metric increased by a decent amount, and that the p-values weren't too high, in order to keep the list of variables to a reasonable size.
# This function goes through all of the variables in the p-value order, adds them to a Naive Bayes model, and sees if they increase the predictiveness of the model enough to add the variables permanently.
add_vars_nb <- function(data, all_vars, seed, prop_test, thresh, num_tests, num_no_change, txt_file){
# First get a list of the p-values
var_df <- data.frame(Variables=all_vars)
var_df$p <- 0
var_df$metric <- 0
var_df$num_passed <- 0
var_df$to_use <- 0
var_df$curr_formula <- ""
num_since_last_change <- 0
nvar <- length(all_vars)
for(ii in 1:nvar){
var<-all_vars[ii]
test <- cor.test(unclass(data[,'Attrition']),unclass(data[,var]))
var_df$p[ii] <- test$p.value
}
# Sort the p-values
sort_order <- order(var_df$p,decreasing = FALSE)
# Loop through all of the columns, add the next variable to the list
vars_to_use <- c()
best_passed <- 0
best_metric <- 0
for (ii in 1:nvar){
var<-all_vars[sort_order[ii]]
curr_p <- var_df$p[sort_order[ii]]
if(curr_p > 0.001){ # Stop once the remaining variables are no longer highly predictive
break
}
if(sum(data[,var]!=0)<length(data[,var])*0.05){
next # Don't want too few numbers in the variable, since that could make it overfit on the variable
}
vars_to_use_loop <- c(vars_to_use,var)
formula_str <- paste("Attrition ~ ", paste(vars_to_use_loop,collapse="+"))
formula_obj <- as.formula(formula_str)
# Running several tests, to make sure that the model creation formula creates good models over different training and test sets
metric <- 0
num_passed <- 0
set.seed(seed) # re-setting the seed for some sort of predictability
for (jj in 1:num_tests){
# Create training and test sets
indices = sample(seq(1:nrow(data)),round(prop_test*nrow(data)))
validation = data[indices,]
train = data[-indices,]
# Create the model
model <- naiveBayes(formula_obj, data = train)
# Test the model
pred <- predict(model, validation)
cmBayes <- confusionMatrix(table(pred,validation$Attrition))
sens <- as.numeric(cmBayes$byClass["Sensitivity"])
if(is.na(sens)) {
sens<-0
}
spec <- as.numeric(cmBayes$byClass["Specificity"])
if(is.na(spec)){
spec<-0
}
if(sens>thresh && spec>thresh) {
num_passed <- num_passed + 1
}
metric <- metric + sens + spec
}
metric <- metric / num_tests
var_df$metric[var_df$Variable==var] <- metric
var_df$num_passed[var_df$Variable==var] <- num_passed
# Check if that variable should be added
if(best_metric+0.01<metric || num_passed<=num_tests*.01){
best_passed = num_passed
best_metric <- metric
vars_to_use <- c(vars_to_use,var)
var_df$to_use[var_df$Variable==var] <- 1
var_df$curr_formula[var_df$Variable==var] <- formula_str
num_since_last_change <- 0
# Save off the formula in a file so you can view / copy and paste it later.
writeLines(paste('num: ',ii,' var: ',var,' metric: ', metric,' num_passed: ', num_passed,
' formula_str: ',formula_str,sep=""),
txt_file)
}
# Print something to the screen to see general progress
print(paste('num: ',ii,' var: ',var,' metric: ', metric,' num_passed: ', num_passed,sep=""))
# If the loop has been going on for too long, just break out of it
num_since_last_change <- num_since_last_change + 1
if(num_since_last_change == num_no_change){
break
}
}
# Output the var_df
var_df
}
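A hypothetical call (not run here, since it is slow) might look like the following. The candidate variable list, test counts, and file name are assumptions, not values I actually used:

# Greedy forward selection over the created variables (sketch, not run)
candidate_vars <- c(num_vars, onehot_vars, cross_vars)
nb_results <- add_vars_nb(data, all_vars = candidate_vars,
                          seed = 10, prop_test = 0.1, thresh = 0.6,
                          num_tests = 50, num_no_change = 30,
                          txt_file = "nb_formula.txt")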
Create the same sort of function, but for the KNN model
# Similar to the above function, but for KNN
add_vars_knn <- function(data, all_vars, seed, prop_test, thresh, num_tests, num_no_change, start_formula, start_num, txt_file, start_best){
# First get a list of the p-values
var_df <- data.frame(Variables=all_vars)
var_df$num <- 0
var_df$p <- 0
var_df$metric <- 0
var_df$num_passed <- 0
var_df$k <- 0
var_df$to_use <- 0
var_df$curr_formula <- ""
num_since_last_change <- 0
nvar <- length(all_vars)
for(ii in 1:nvar){
var<-all_vars[ii]
test <- cor.test(unclass(data[,'Attrition']),unclass(data[,var]))
var_df$p[ii] <- test$p.value
}
# Sort the p-values
sort_order <- order(var_df$p,decreasing = FALSE)
# Running several tests, to make sure that the model creation formula creates good models over different training and test sets
if (start_formula==''){
vars_to_use <- c()
} else {
vars_to_use <- unlist(strsplit(start_formula, "\\+"))
}
best_passed <- 0
best_metric <- start_best
for (ii in start_num:nvar){
var<-all_vars[sort_order[ii]]
vars_to_use_loop <- c(vars_to_use,var)
curr_p <- var_df$p[sort_order[ii]]
if(curr_p > 0.001){ # Stop once the remaining variables are no longer highly predictive
break
}
if(sum(data[,var]!=0)<length(data[,var])*0.05){
next # Don't want too few numbers in the variable, since that could make it overfit on the variable
}
if(class(data[,var]) == 'factor'){
next # KNN only really handles numbers
}
# KNN can be used with different k values, so try out a few for each new variable.
k_met <- 0
k_num_passed <- 0
k_best <- 3
set.seed(seed) # re-setting the seed for some sort of predictability
for (kk in c(3,5,7,9,11,13,15)){
print(paste('k is ',kk,sep=""))
metric <- 0
num_passed <- 0
for (jj in 1:num_tests){
indices = sample(seq(1:nrow(data)),round(prop_test*nrow(data)))
validation = data[indices,]
train = data[-indices,]
classifications <- knn(train[,vars_to_use_loop], # you might need at least 2 variables
validation[,vars_to_use_loop],
train$Attrition, prob = TRUE, k = kk)
CM = confusionMatrix(table(validation$Attrition,classifications))
sens <- as.numeric(CM$byClass["Sensitivity"])
if(is.na(sens)) {
sens<-0
}
spec <- as.numeric(CM$byClass["Specificity"])
if(is.na(spec)){
spec<-0
}
if(sens>thresh && spec>thresh) {
num_passed <- num_passed + 1
}
metric <- metric + sens + spec
}
metric <- metric / num_tests
if(k_met<metric){
k_best <- kk
k_met <- metric
k_num_passed <- num_passed
}
}
var_df$metric[var_df$Variable==var] <- k_met
var_df$num_passed[var_df$Variable==var] <- k_num_passed
var_df$num[var_df$Variable==var] <- ii
# Check if that variable should be added
if(best_metric+0.01<k_met || k_num_passed<=num_tests*.01){
best_passed = k_num_passed
best_metric <- k_met
vars_to_use <- c(vars_to_use,var)
var_df$to_use[var_df$Variable==var] <- 1
var_df$curr_formula[var_df$Variable==var] <- paste(vars_to_use,collapse="+")
var_df$k[var_df$Variable==var] <- k_best
num_since_last_change <- 0
file_str <- paste(readLines(txt_file), collapse="\n")
writeLines(paste(file_str,"\n",'num: ',ii,' var: ',var, ' k: ', k_best,
' metric: ', k_met, ' num_passed: ', k_num_passed, # log the best k's metric, not the last one tried
' formula_str: ',paste(vars_to_use,collapse="+"),sep=""),
txt_file)
}
# Print something to the screen to see general progress
print(paste('num: ',ii,' var: ',var,' metric: ', k_met,
' num_since_last_change: ',num_since_last_change,
' num_passed: ', k_num_passed, " k: ", k_best, sep=""))
# If the loop has been going on for too long, just break out of it
num_since_last_change <- num_since_last_change + 1
if(num_since_last_change == num_no_change){
break
}
}
# Output the var_df
var_df
}
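As above, a hypothetical call (not run). The start_formula, start_num, and start_best arguments let a long run be resumed from a partially built formula; pass '', 1, and 0 to start fresh. Note the text file must already exist, since this version reads it back before appending:

# Sketch of invoking the KNN forward selection (assumed parameter values;
# candidate_vars as defined in the Naive Bayes sketch above)
file.create("knn_formula.txt")
knn_results <- add_vars_knn(data, all_vars = candidate_vars,
                            seed = 10, prop_test = 0.1, thresh = 0.6,
                            num_tests = 50, num_no_change = 30,
                            start_formula = '', start_num = 1,
                            txt_file = "knn_formula.txt", start_best = 0)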
After these models were created, I did the same approach as above for testing them. For Naive Bayes, the metric was 1.661 and 195 out of the 200 tests passed. For KNN, the metric was 1.671 and 164 out of the 200 passed. So KNN was slightly better at the overall metric, but Naive Bayes passed more often.
Monthly Income Model
I’ll take a similar approach to the model for attrition. First I will look at a few variables that are highly correlated with Monthly Income. Then I’ll try to create a linear model to predict Monthly Income using these variables.
First, re-create the data table with all of the additional variables.
data <- read.table(textConnection(getURL(
"https://s3.us-east-2.amazonaws.com/ddsproject1/CaseStudy2-data.csv"
)), sep=",", header=TRUE)
# Reset some of the numeric looking variables to words.
data$Education[data$Education == 1] <- 'Below College'
data$Education[data$Education == 2] <- 'College'
data$Education[data$Education == 3] <- 'Bachelor'
data$Education[data$Education == 4] <- 'Master'
data$Education[data$Education == 5] <- 'Doctor'
data$EnvironmentSatisfaction[data$EnvironmentSatisfaction == 1] <- 'Low'
data$EnvironmentSatisfaction[data$EnvironmentSatisfaction == 2] <- 'Medium'
data$EnvironmentSatisfaction[data$EnvironmentSatisfaction == 3] <- 'High'
data$EnvironmentSatisfaction[data$EnvironmentSatisfaction == 4] <- 'Very High'
data$JobInvolvement[data$JobInvolvement == 1] <- 'Low'
data$JobInvolvement[data$JobInvolvement == 2] <- 'Medium'
data$JobInvolvement[data$JobInvolvement == 3] <- 'High'
data$JobInvolvement[data$JobInvolvement == 4] <- 'Very High'
data$JobSatisfaction[data$JobSatisfaction == 1] <- 'Low'
data$JobSatisfaction[data$JobSatisfaction == 2] <- 'Medium'
data$JobSatisfaction[data$JobSatisfaction == 3] <- 'High'
data$JobSatisfaction[data$JobSatisfaction == 4] <- 'Very High'
data$PerformanceRating[data$PerformanceRating == 1] <- 'Low'
data$PerformanceRating[data$PerformanceRating == 2] <- 'Good'
data$PerformanceRating[data$PerformanceRating == 3] <- 'Excellent'
data$PerformanceRating[data$PerformanceRating == 4] <- 'Outstanding'
data$RelationshipSatisfaction[data$RelationshipSatisfaction == 1] <- 'Low'
data$RelationshipSatisfaction[data$RelationshipSatisfaction == 2] <- 'Medium'
data$RelationshipSatisfaction[data$RelationshipSatisfaction == 3] <- 'High'
data$RelationshipSatisfaction[data$RelationshipSatisfaction == 4] <- 'Very High'
data$WorkLifeBalance[data$WorkLifeBalance == 1] <- 'Bad'
data$WorkLifeBalance[data$WorkLifeBalance == 2] <- 'Good'
data$WorkLifeBalance[data$WorkLifeBalance == 3] <- 'Better'
data$WorkLifeBalance[data$WorkLifeBalance == 4] <- 'Best'
# Set variables to factors
data_original <- data
data$Attrition <- factor(data$Attrition,levels=c("Yes","No")) # Making "Yes" the true value for bayes
data$BusinessTravel <- as.factor(data$BusinessTravel)
data$Department <- as.factor(data$Department)
data$EducationField <- as.factor(data$EducationField)
data$Gender <- as.factor(data$Gender)
data$JobRole <- as.factor(data$JobRole)
data$MaritalStatus <- as.factor(data$MaritalStatus)
data$OverTime <- as.factor(data$OverTime)
data$Education <- factor(data$Education,levels=c("Below College","College","Bachelor","Master","Doctor"))
data$EnvironmentSatisfaction <- factor(data$EnvironmentSatisfaction,levels=c("Low","Medium","High","Very High"))
data$JobInvolvement <- factor(data$JobInvolvement,levels=c("Low","Medium","High","Very High"))
data$JobLevel <- as.factor(data$JobLevel)
data$JobSatisfaction <- factor(data$JobSatisfaction,levels=c("Low","Medium","High","Very High"))
data$PerformanceRating <- factor(data$PerformanceRating,levels=c("Low","Good","Excellent","Outstanding"))
data$RelationshipSatisfaction <- factor(data$RelationshipSatisfaction,levels=c("Low","Medium","High","Very High"))
data$StockOptionLevel <- as.factor(data$StockOptionLevel)
data$WorkLifeBalance <- factor(data$WorkLifeBalance,levels=c("Bad","Good","Better","Best"))
# Make a YearsAtCompany_Capped variable and cap it at 30
data$YearsAtCompany_Capped <- pmin(data$YearsAtCompany,30)
# Create lists of the variables
# List all the categorical and numeric variables
cat_vars <- c('Attrition','BusinessTravel','Department','Education','EducationField',
'EnvironmentSatisfaction','Gender','JobInvolvement','JobLevel','JobRole',
'JobSatisfaction','MaritalStatus','OverTime','PerformanceRating',
'RelationshipSatisfaction','StockOptionLevel','WorkLifeBalance')
num_vars <- c('Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyRate',
'NumCompaniesWorked','PercentSalaryHike','TotalWorkingYears',
'TrainingTimesLastYear', 'YearsAtCompany','YearsInCurrentRole',
'YearsSinceLastPromotion','YearsWithCurrManager', 'YearsAtCompany_Capped')
# Create square vars
nvars <- length(num_vars)
squared_vars <- c()
data_no_squared <- data
for (ii in 1:nvars){
var <- num_vars[ii]
var_squared <- paste(var,"_squared",sep="")
data[,var_squared]<-data[,var]*data[,var]
squared_vars <- c(squared_vars,var_squared)
}
# Get onehot encoded variables
before_onehot <- data
data <- one_hot(as.data.table(data))
data <- as.data.frame(data)
nvar <- length(cat_vars)
for (ii in 1:nvar){
var <- cat_vars[ii]
data[,var] <- before_onehot[,var]
}
names(data) <- gsub(' ','_',names(data))
names(data) <- gsub('&','AND',names(data))
names(data) <- gsub('-','_',names(data))
col_names <- names(data)
onehot_vars <- col_names[grepl('_', names(data))]
onehot_vars <- onehot_vars[!grepl('squared', onehot_vars)]
# Make cross products of variables
data_before_cross <- data
to_cross_vars <- c(num_vars,onehot_vars)
nvars <- length(to_cross_vars)
cross_vars <- c()
for (ii in 1:nvars){
for(jj in 1:nvars){
if (ii<jj){
var1 <- to_cross_vars[ii]
var2 <- to_cross_vars[jj]
new_col <- data[,var1]*data[,var2]
if(sum(new_col)!=0){
new_var <- paste(var1,var2,sep="X")
data[,new_var]<-new_col
cross_vars <- c(cross_vars,new_var)
}
}
}
}
# Combine all the variables
all_vars <- c(cat_vars, num_vars, squared_vars, onehot_vars, cross_vars)
Plot Monthly Income vs Categorical Variables
# Plot Monthly Income vs Job Level
data %>% ggplot(aes(x=MonthlyIncome,fill=JobLevel)) +
geom_histogram() + xlab('Monthly Income') + ylab('Count') +
ggtitle('Monthly Income vs Job Level') +
scale_fill_manual("Job Level",
values = c("1" = "#13294b", "2" = "darkblue", "3" = "blue",
"4" = "cornflowerblue", "5" = "lightblue")) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 7),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
Monthly Income should be extremely correlated with Job Level, since Job Level is how companies decide on salary bands.
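That expectation can be confirmed numerically. A quick sketch:

# Median Monthly Income within each Job Level
data %>% group_by(JobLevel) %>% summarise(median_income = median(MonthlyIncome))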
# Plot Monthly Income vs Education
data %>% ggplot(aes(x=MonthlyIncome)) +
geom_histogram(fill="#13294b") + xlab('Monthly Income') + ylab('Count') +
ggtitle('Monthly Income vs Education Level') +
facet_wrap(~ Education, ncol=1, scales='free_y') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 7),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
These look surprisingly uncorrelated.
Plot Monthly Income vs Numeric Variables
# Plot Monthly Income vs Total Working Years
data %>% ggplot(aes(x=TotalWorkingYears, y=MonthlyIncome)) +
geom_point(color="#13294b") + xlab('Total Working Years') + ylab('Monthly Income') +
ggtitle('Monthly Income vs Total Working Years') +
geom_smooth(method = "lm", formula = y ~ x) +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 10),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
Monthly Income appears to be correlated with Total Working Years, which makes sense. Generally you have a lower salary at the beginning of your career.
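The strength of that relationship can be quantified directly. A quick sketch:

# Pearson correlation between Total Working Years and Monthly Income
cor(data$TotalWorkingYears, data$MonthlyIncome)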
# Plot Monthly Income vs Years Since Last Promotion
data %>% ggplot(aes(x=YearsSinceLastPromotion, y=MonthlyIncome)) +
geom_point(color="#13294b") + xlab('Years Since Last Promotion') + ylab('Monthly Income') +
ggtitle('Monthly Income vs Years Since Last Promotion') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 10),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
There appears to be more of a concentration of points in the bottom left, but they are fairly evenly distributed throughout.
Plot Monthly Income vs Created Variables
# Plot MonthlyIncome vs AgeXTotalWorkingYears
data %>% ggplot(aes(x=AgeXTotalWorkingYears, y=MonthlyIncome)) +
geom_point(color="#13294b") + xlab('Age X Total Working Years') + ylab('Monthly Income') +
ggtitle('Monthly Income by AgeXTotalWorkingYears') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 10),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
# Plot MonthlyIncome vs TotalWorkingYears_squared
data %>% ggplot(aes(x=TotalWorkingYears_squared, y=MonthlyIncome)) +
geom_point(color="#13294b") + xlab('Total Working Years Squared') + ylab('Monthly Income') +
ggtitle('Monthly Income by Total Working Years Squared') +
theme(plot.background = element_rect("#dbdfdf"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1, color = "#13294b",size = 10),
axis.text.y = element_text(color = "#13294b",size = 10),
axis.title.x = element_text(color = "#13294b",size = 16),
axis.title.y =element_text(color = "#13294b",size = 16),
plot.title = element_text(color = "#13294b",size = 18))
These look similar to the relationship between Monthly Income and Total Working Years, but with slightly less correlation. For Monthly Income, the created variables don't show as strong a relationship.
Creating and testing out simple models
Similar to the Attrition models, I started with some simple models based on variables with high correlation to Monthly Income. I then tested them using RMSE as the metric.
# Setting up parameters
seed <- 20
prop_test <- .1
num_tests <- 200
formula_str <- 'MonthlyIncome ~ JobLevel+TotalWorkingYears'
thresh <- 3000 # RMSE threshold (the write-up below references 3,000)
metric <- 0
num_passed <- 0
set.seed(seed) # re-setting the seed for some sort of predictability
for (jj in 1:num_tests){
# Creating training and test data
indices = sample(seq(1:nrow(data)),round(prop_test*nrow(data)))
validation = data[indices,]
train = data[-indices,]
# Creating the model using training data
model <- lm(as.formula(formula_str), data = train)
# Testing the model using test data
pred <- predict(model, validation)
# Determining how well the model did
rmse <- sqrt(sum((pred-as.numeric(validation[,"MonthlyIncome"]))^2)/nrow(validation))
if(rmse<thresh){
num_passed <- num_passed+1
}
metric <- metric + rmse
}
# Outputting metrics
metric / num_tests
num_passed
Every test came in below the 3,000 RMSE threshold, so num_passed was always 200. Average RMSE was:
1. JobLevel = 1262.0
2. JobLevel + TotalWorkingYears = 1255.4
3. JobLevel + TotalWorkingYears + Attrition = 1255.3
4. JobLevel + TotalWorkingYears + AgeXTotalWorkingYears = 1254.8
Creating a more complex model
I'm using a similar approach to the one used to create the more complex Naive Bayes and KNN models.
add_vars_reg <- function(data, all_vars, seed, prop_test, thresh, num_tests, num_no_change, num_models, start_formula, start_num, txt_file){
# First get a list of the p-values
var_df <- data.frame(Variables=all_vars)
var_df$num <- 0
var_df$p <- 0
var_df$metric <- 0
var_df$num_passed <- 0
var_df$to_use <- 0
var_df$curr_formula <- ""
num_since_last_change <- 0 # initialize here; it is incremented later in the loop
nvar <- length(all_vars)
for(ii in 1:nvar){
var<-all_vars[ii]
test <- cor.test(unclass(data[,'MonthlyIncome']),unclass(data[,var]))
var_df$p[ii] <- test$p.value
}
var_df$p[is.na(var_df$p)] <- 1
# Sort the p-values
sort_order <- order(var_df$p,decreasing = FALSE)
# Loop through all of the columns, add the next variable to the list
if (start_formula=='') {
vars_to_use <- c()
} else {
vars_to_use <- strsplit(start_formula, "~")[[1]][2]
}
best_passed <- 0
best_metric <- 1000000000
for (ii in start_num:nvar){
var<-all_vars[sort_order[ii]]
curr_p <- var_df$p[sort_order[ii]]
if(curr_p > 0.001){ # Stop once the remaining variables are no longer highly predictive
break
}
if(sum(data[,var]!=0)<length(data[,var])*0.05){
next # Don't want too few values, since using that variable could then lead to overfitting
}
vars_to_use_loop <- c(vars_to_use,var)
formula_str <- paste("MonthlyIncome ~ ", paste(vars_to_use_loop,collapse="+"))
formula_obj <- as.formula(formula_str)
# Perform several tests and get the average RMSE
metric <- 0
num_passed <- 0
set.seed(seed) # re-setting the seed for some sort of predictability
for (jj in 1:num_tests){
# Create training and test data
indices = sample(seq(1:nrow(data)),round(prop_test*nrow(data)))
validation = data[indices,]
train = data[-indices,]
# Create model
model <- lm(formula_obj, data = train)
# Test model
pred <- predict(model, validation)
rmse <- sqrt(sum((pred-as.numeric(validation[,"MonthlyIncome"]))^2)/nrow(validation))
if(rmse<thresh){
num_passed <- num_passed+1
}
metric <- metric + rmse
}
metric <- metric / num_tests
var_df$metric[var_df$Variable==var] <- metric
var_df$num_passed[var_df$Variable==var] <- num_passed
# Check if that variable should be added
if(best_metric>(metric+1)){
best_passed = num_passed
best_metric <- metric
vars_to_use <- c(vars_to_use,var)
var_df$to_use[var_df$Variable==var] <- 1
var_df$curr_formula[var_df$Variable==var] <- formula_str
var_df$num[var_df$Variable==var] <- ii
num_since_last_change <- 0
file_str <- paste(readLines(txt_file), collapse="\n")
writeLines(paste(file_str,"\n",'num: ',ii,' var: ',var,' metric: ', metric,' num_passed: ', num_passed,
' formula_str: ',formula_str,sep=""),
txt_file)
}
# Print something to the screen to see general progress
num_since_last_change <- num_since_last_change + 1
print(paste('num: ',ii,' var: ',var,' metric: ', metric,' num_passed: ',
num_passed," num_since_last_change: ",num_since_last_change,sep="" ))
# If the loop has been going on for too long, or if p-values aren't low anymore, just break out of it
if(num_since_last_change == num_no_change){
break
}
}
# Output the var_df
var_df
}
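A hypothetical call (not run; the parameter values are assumptions). The file must exist beforehand because the function reads it back before appending, and num_models is accepted but not used in the body above:

# Sketch of invoking the regression forward selection
file.create("reg_formula.txt")
reg_results <- add_vars_reg(data, all_vars = all_vars,
                            seed = 20, prop_test = 0.1, thresh = 3000,
                            num_tests = 50, num_no_change = 30, num_models = 1,
                            start_formula = '', start_num = 1,
                            txt_file = "reg_formula.txt")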
The model that was created had an average RMSE of 966.7 over the 200 tests, so it is significantly better than the simpler models.
XGBoost
I've heard good things about XGBoost from people at work, so I thought it would be fun to try it out here. I'll try it using the same methodology as the other approaches.
# Create training and test data
seed <- 1
prop_test <- .1
set.seed(seed)
indices = sample(seq(1:nrow(data)),round(prop_test*nrow(data)))
validation = data[indices,]
train = data[-indices,]
# XGBoost needs a DMatrix
vals <- c('OverTime','MaritalStatus')
target <- as.numeric(train$Attrition)*-1 +2 # factor levels are Yes=1, No=2, so this maps Yes -> 1, No -> 0
feature_data <- train[,vals]
feature_matrix <- as.matrix(as.data.frame(lapply(feature_data,as.numeric)))
dtrain <- xgb.DMatrix(data = feature_matrix, label = target)
# Create the model
bstDMatrix <- xgboost(data = dtrain, nthread = 2, nrounds = 10, objective = "binary:logistic")
## [1] train-logloss:0.562331
## [2] train-logloss:0.492316
## [3] train-logloss:0.451833
## [4] train-logloss:0.427794
## [5] train-logloss:0.413350
## [6] train-logloss:0.404682
## [7] train-logloss:0.399527
## [8] train-logloss:0.396496
## [9] train-logloss:0.394737
## [10] train-logloss:0.393728
# Predict using the model
validation_matrix <- as.matrix(as.data.frame(lapply(validation[,vals],as.numeric)))
validation_dmatrix <- xgb.DMatrix(data = validation_matrix)
pred <- predict(bstDMatrix, validation_dmatrix)
prediction <- as.numeric(pred > 0.5)
# Evaluate the predictions
validation_label <- as.numeric(validation[, 'Attrition'])*-1+2
# Note: these divide by the predicted counts, so they are really precision and
# negative predictive value rather than textbook sensitivity/specificity; the
# same metric is applied consistently across the XGBoost models.
sens <- sum(prediction == 1 & validation_label == 1) / sum(prediction == 1)
spec <- sum(prediction == 0 & validation_label == 0) / sum(prediction == 0)
sens + spec
## [1] 1.589744
A score of about 1.59 is a pretty good Sensitivity + Specificity score. Good enough to try generating a larger model.
add_vars_xgboost_attrition <- function(data, all_vars, seed, prop_test, thresh, num_tests,
num_no_change, start_formula, start_num, txt_file,
start_best, amount_to_improve){
# First get a list of the p-values
var_df <- data.frame(Variables=all_vars)
var_df$num <- 0
var_df$p <- 0
var_df$metric <- 0
var_df$num_passed <- 0
var_df$to_use <- 0
var_df$curr_formula <- ""
num_since_last_change <- 0
nvar <- length(all_vars)
for(ii in 1:nvar){
var<-all_vars[ii]
test <- cor.test(unclass(data[,'Attrition']),unclass(data[,var]))
var_df$p[ii] <- test$p.value
}
# Sort the p-values
sort_order <- order(var_df$p,decreasing = FALSE)
# Loop through all of the columns, add the next variable to the list
if (start_formula==''){
vars_to_use <- c()
} else {
vars_to_use <- unlist(strsplit(start_formula, "\\+"))
}
best_passed <- 0
best_metric <- start_best
for (ii in start_num:nvar){
var<-all_vars[sort_order[ii]]
vars_to_use_loop <- c(vars_to_use,var)
curr_p <- var_df$p[sort_order[ii]]
if(curr_p > 0.001){ # Stop once the remaining variables are no longer highly predictive
break
}
if(sum(data[,var]!=0)<length(data[,var])*0.05){
next # Don't want too few numbers
}
if(class(data[,var]) == 'factor'){
next # XGBoost only seems to handle numeric data well
}
metric <- 0
num_passed <- 0
set.seed(seed) # re-setting the seed for some sort of predictability
for (jj in 1:num_tests){
# Split data into training and test
indices = sample(seq(1:nrow(data)),round(prop_test*nrow(data)))
validation = data[indices,]
train = data[-indices,]
# Call the xgboost algorithm
target <- as.numeric(train$Attrition)*-1 +2
feature_data <- train[,vars_to_use_loop]
feature_matrix <- as.matrix(feature_data)
dtrain <- xgb.DMatrix(data = feature_matrix, label = target)
msg_output <- capture.output({
bstDMatrix <- xgboost(data = dtrain, nthread = 2, nrounds = 10, objective = "binary:logistic")
})
# Use the resulting model for predictions
validation_matrix <- as.matrix(validation[,vars_to_use_loop])
validation_dmatrix <- xgb.DMatrix(data = validation_matrix)
pred <- predict(bstDMatrix, validation_dmatrix)
prediction <- as.numeric(pred > 0.5)
# Determine how good the predictions are
validation_label <- as.numeric(validation[, 'Attrition'])*-1+2
sens <- sum(prediction == 1 & validation_label == 1) / sum(prediction == 1)
if(is.na(sens)){
sens <- 0
}
spec <- sum(prediction == 0 & validation_label == 0) / sum(prediction == 0)
if(is.na(spec)){
spec <- 0
}
if(sens>thresh && spec>thresh) {
num_passed <- num_passed + 1
}
metric <- metric + sens + spec
}
metric <- metric / num_tests
var_df$metric[var_df$Variable==var] <- metric
var_df$num_passed[var_df$Variable==var] <- num_passed
var_df$num[var_df$Variable==var] <- ii
# Check if that variable should be added
if(best_metric+amount_to_improve<metric || num_passed<=num_tests*.01){
best_passed = num_passed
best_metric <- metric
vars_to_use <- c(vars_to_use,var)
var_df$to_use[var_df$Variable==var] <- 1
var_df$curr_formula[var_df$Variable==var] <- paste(vars_to_use,collapse="+")
num_since_last_change <- 0
file_str <- paste(readLines(txt_file), collapse="\n")
writeLines(paste(file_str,"\n",'num: ',ii,' var: ',var,
' metric: ', metric, ' num_passed: ', num_passed,
' formula_str: ',paste(vars_to_use,collapse="+"),sep=""),
txt_file)
}
# Print something to the screen to see general progress
print(paste('num: ',ii,' var: ',var,' metric: ', metric,
' num_since_last_change: ',num_since_last_change,
' num_passed: ', num_passed, sep=""))
# If the loop has been going on for too long, just break out of it
num_since_last_change <- num_since_last_change + 1
if(num_since_last_change == num_no_change){
break
}
}
# Output the var_df
var_df
}
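A hypothetical call (not run; parameter values are assumptions). As with the KNN version, the text file must exist beforehand:

# Sketch of invoking the XGBoost forward selection
file.create("xgb_formula.txt")
xgb_results <- add_vars_xgboost_attrition(data, all_vars = c(num_vars, onehot_vars, cross_vars),
                                          seed = 1, prop_test = 0.1, thresh = 0.6,
                                          num_tests = 20, num_no_change = 30,
                                          start_formula = '', start_num = 1,
                                          txt_file = "xgb_formula.txt",
                                          start_best = 0, amount_to_improve = 0.01)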
The approach only selected a model with 2 variables, and when testing that 2-variable model over 200 runs, only 53 got both sensitivity and specificity above 60%, with an average metric of 1.362. That isn't very good. Maybe I'm using XGBoost wrong?