Introduction

Awhile ago, during my time at Centura Health, I began researching all the different ways we could analyze the data at hand. And at some point I became very interested in HR analytics. It was a fascinating way to explore an organization through data. Of course, while digesting the org structure through network graphs is interesting there are also more immediately clear benefits from employee data; predicting turnover.

I came across this article on using machine learning, specifically h2o and Lime, for turnover prediction and really liked the simplicity of the features included, as well as, the author’s explanation. After setting this example aside for some time, I finally found the time to work through this example and peeking under the h2o hood attempt to replicate the results using GLMNET.

The Data

Unfortunately, the link to the data in the original article is broken. After a little bit of Googling I was able to find some conscientious Github repos with a copy. I did some checks and the data seemed to match the same as the article. Since I no longer remember where I found the data, it’s hosted in my own repo now.

Data

EDA

Here we’ll look at the employee data and convert characters to factors.

employee_attrition_raw %>% 
  head(10) %>% 
  knitr::kable(format = 'html') %>% 
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = "100%", height = "750px")

Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	2	Female	94	3	2	Sales Executive	4	Single	5993	19479	8	Y	Yes	11	3	1	80	0	8	0	1	6	4	0	5
49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	3	Male	61	2	2	Research Scientist	2	Married	5130	24907	1	Y	No	23	4	4	80	1	10	3	3	10	7	1	7
37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	4	Male	92	2	1	Laboratory Technician	3	Single	2090	2396	6	Y	Yes	15	3	2	80	0	7	3	3	0	0	0	0
33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	4	Female	56	3	1	Research Scientist	3	Married	2909	23159	1	Y	Yes	11	3	3	80	0	8	3	3	8	7	3	0
27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	1	Male	40	3	1	Laboratory Technician	2	Married	3468	16632	9	Y	No	12	3	4	80	1	6	3	3	2	2	2	2
32	No	Travel_Frequently	1005	Research & Development	2	2	Life Sciences	1	8	4	Male	79	3	1	Laboratory Technician	4	Single	3068	11864	0	Y	No	13	3	3	80	0	8	2	2	7	7	3	6
59	No	Travel_Rarely	1324	Research & Development	3	3	Medical	1	10	3	Female	81	4	1	Laboratory Technician	1	Married	2670	9964	4	Y	Yes	20	4	1	80	3	12	3	2	1	0	0	0
30	No	Travel_Rarely	1358	Research & Development	24	1	Life Sciences	1	11	4	Male	67	3	1	Laboratory Technician	3	Divorced	2693	13335	1	Y	No	22	4	2	80	1	1	2	3	1	0	0	0
38	No	Travel_Frequently	216	Research & Development	23	3	Life Sciences	1	12	4	Male	44	2	3	Manufacturing Director	3	Single	9526	8787	0	Y	No	21	4	2	80	0	10	2	3	9	7	1	8
36	No	Travel_Rarely	1299	Research & Development	27	3	Medical	1	13	3	Male	94	3	2	Healthcare Representative	3	Married	5237	16577	6	Y	No	13	3	2	80	2	17	3	2	7	7	7	7

# Change to factors

employee_attrition <- employee_attrition_raw %>% 
  dplyr::mutate_if(is.character, as.factor) %>% 
  dplyr::select(Attrition, dplyr::everything())

In total we have 1470 observations and 35 features.

tibble::glimpse(employee_attrition, width = 100)

## Observations: 1,470
## Variables: 35
## $ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No, No, No, No, No, No, Yes, N…
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 29, 31, 34, 28, 29, …
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_Rarely, Travel_Frequent…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358, 216, 1299, 809, 15…
## $ Department               <fct> Sales, Research & Development, Research & Development, Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, 19, 24, 21, 5, 16, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, 4, 2, 2, 4, 3, 2, 4…
## $ EducationField           <fct> Life Sciences, Life Sciences, Other, Life Sciences, Medical, Lif…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, 2, 1, 4, 1, 4, 1, 3…
## $ Gender                   <fct> Female, Male, Male, Female, Male, Male, Female, Male, Male, Male…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 49, 31, 93, 50, 51, …
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 4, 4, 2, 3, 4, 2…
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, 3, 1, 1, 4, 1, 2, 1…
## $ JobRole                  <fct> Sales Executive, Research Scientist, Laboratory Technician, Rese…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, 1, 2, 4, 4, 4, 3, 1…
## $ MaritalStatus            <fct> Single, Married, Single, Married, Married, Single, Married, Divo…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2693, 9526, 5237, 2426…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964, 13335, 8787, 1657…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, 1, 0, 1, 2, 5, 0, 7…
## $ Over18                   <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, No, Yes, No, No, Yes…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 12, 17, 11, 14, 11, …
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4…
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, 3, 4, 2, 3, 3, 4, 2…
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, 1, 2, 2, 0, 0, 1, 0…
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3, 6, 10, 7, 1, 31, …
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, 1, 5, 2, 3, 3, 5, 4…
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3…
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4, 10, 6, 1, 25, 3, 4…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, 9, 2, 0, 8, 2, 2, 3…
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, 8, 0, 0, 3, 1, 1, 0…
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, 8, 5, 0, 7, 2, 3, 3…

Imbalance and Feature to Observation Ratio

Let’s take a quick look at the number of positive and negative cases to ensure we don’t need to investigate sampling methods such as down/up sampling or synthetic sampling.

case_table <- employee_attrition %>% 
  dplyr::count(Attrition, name = "Count") %>% 
  dplyr::mutate(prop = Count / sum(Count))

case_table

## # A tibble: 2 x 3
##   Attrition Count  prop
##   <fct>     <int> <dbl>
## 1 No         1233 0.839
## 2 Yes         237 0.161

We do see we have some moderate observation imbalance. To keep consistent with the analysis performed in the article we’ll leave as is and note this for future analysis.

Although, we’re letting H20 decide the optimal model for us let’s still check we have enough observations per variable under a logistic regression model.

Per Peduzzi et al. N = 10 * k / p, where N is the number of observations, k is the number of features, and p is the number of cases.

k <- length(colnames(employee_attrition[, -1]))
p <- case_table %>% 
  {`[[`(., 3)[2]}

N <- 10 * k / p

N

## [1] 2108.861

We see no issues with the number of features to observations.

Model Setup

Here we’ll setup our model and establish training, test, and validation sets.

h2o::h2o.init()

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/RtmpUmXXBs/h2o_ben_started_from_r.out
##     /tmp/RtmpUmXXBs/h2o_ben_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 213 milliseconds 
##     H2O cluster timezone:       Etc/UTC 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.28.0.1 
##     H2O cluster version age:    1 month and 20 days  
##     H2O cluster name:           H2O_started_from_R_ben_ugd504 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   6.96 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.2 (2019-12-12)

h2o::h2o.no_progress()

employee_attrition_h2o <- h2o::as.h2o(employee_attrition)

split_h2o <- h2o::h2o.splitFrame(employee_attrition_h2o, c(0.7, 0.15), seed = 1234)

train_h2o <- h2o::h2o.assign(split_h2o[[1]], "train")
valid_h2o <- h2o::h2o.assign(split_h2o[[2]], "valid")
test_h2o <- h2o::h2o.assign(split_h2o[[3]], "test")

y <- "Attrition"
x <- setdiff(names(train_h2o), y)

Time for the modeling step!

automl_models_h2o <- h2o::h2o.automl(
  x = x,
  y = y,
  training_frame = train_h2o,
  leaderboard_frame = valid_h2o,
  max_runtime_secs = 30,
  seed = 1234
)

automl_leader <- automl_models_h2o@leader

Evaluation

Let’s evaluate the model’s performance.

One interesting thing to note is at this step we’re not even aware of the model H2O has selected. That’s an interesting feature of the AutoML; we’re focused on the results instead of selecting features and optimizing parameters at this stage.

pred_h2o <- h2o::h2o.predict(object = automl_leader, newdata = test_h2o)

test_performance <- test_h2o %>% 
  tibble::as_tibble() %>% 
  dplyr::select(Attrition) %>% 
  dplyr::mutate(pred = as.vector(pred_h2o$predict)) %>% 
  dplyr::mutate_if(is.character, as.factor)

test_performance

## # A tibble: 211 x 2
##    Attrition pred 
##    <fct>     <fct>
##  1 No        No   
##  2 No        No   
##  3 Yes       Yes  
##  4 No        No   
##  5 No        No   
##  6 No        No   
##  7 Yes       Yes  
##  8 No        No   
##  9 No        No   
## 10 Yes       Yes  
## # … with 201 more rows

confusion_matrix <- test_performance %>% 
  table()

confusion_matrix

##          pred
## Attrition  No Yes
##       No  170  12
##       Yes  13  16

Here I liked the authors pause to point out the high null error rate. You could pick no and have an accuracy of ~77%. Having a ~10% between a naive no model and your actual model isn’t great. However, the author goes on to point out recall’s value to HR. The organization would prefer to missclassify employees as high risk when they’re not versus missclassify as not high risk when they are. Too often the focus on modeling is on accuracy and misses meaningfulness to the organization. While our numbers differ slightly from those on the article, the organization could possible keep 72% of employees predicted as high risk.

tn <- confusion_matrix[1]
tp <- confusion_matrix[4]
fp <- confusion_matrix[3]
fn <- confusion_matrix[2]

accuracy <- (tp + tn) / (tp + tn + fp + fn)
misclassification_rate <- 1 - accuracy
recall <- tp / (tp + fn)
precision <- tp / (tp + fp)
null_error_rate <- tn / (tp + tn + fp + fn)

tibble::tibble(
  accuracy,
  misclassification_rate,
  recall,
  precision,
  null_error_rate
) %>% 
  t()

##                             [,1]
## accuracy               0.8815166
## misclassification_rate 0.1184834
## recall                 0.5517241
## precision              0.5714286
## null_error_rate        0.8056872

A Peek Under The Hood

automl_leader

## Model Details:
## ==============
## 
## H2OBinomialModel: stackedensemble
## Model ID:  StackedEnsemble_BestOfFamily_AutoML_20200206_044731 
## NULL
## 
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
## 
## MSE:  0.02772933
## RMSE:  0.1665212
## LogLoss:  0.1215058
## Mean Per-Class Error:  0.04749831
## AUC:  0.9964706
## AUCPR:  0.958869
## Gini:  0.9929412
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error      Rate
## No     859  11 0.012644   =11/870
## Yes     14 156 0.082353   =14/170
## Totals 873 167 0.024038  =25/1040
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.302228   0.925816 137
## 2                       max f2  0.199425   0.957207 172
## 3                 max f0point5  0.417172   0.944730 123
## 4                 max accuracy  0.302228   0.975962 137
## 5                max precision  0.992197   1.000000   0
## 6                   max recall  0.199425   1.000000 172
## 7              max specificity  0.992197   1.000000   0
## 8             max absolute_mcc  0.302228   0.911526 137
## 9   max min_per_class_accuracy  0.234473   0.970588 156
## 10 max mean_per_class_accuracy  0.199425   0.978161 172
## 11                     max tns  0.992197 870.000000   0
## 12                     max fns  0.992197 166.000000   0
## 13                     max fps  0.026594 870.000000 399
## 14                     max tps  0.199425 170.000000 172
## 15                     max tnr  0.992197   1.000000   0
## 16                     max fnr  0.992197   0.976471   0
## 17                     max fpr  0.026594   1.000000 399
## 18                     max tpr  0.199425   1.000000 172
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.08896314
## RMSE:  0.2982669
## LogLoss:  0.3077805
## Mean Per-Class Error:  0.2524003
## AUC:  0.8320047
## AUCPR:  0.6496573
## Gini:  0.6640095
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error       Rate
## No     830  40 0.045977    =40/870
## Yes     78  92 0.458824    =78/170
## Totals 908 132 0.113462  =118/1040
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.394694   0.609272 119
## 2                       max f2  0.078752   0.639894 294
## 3                 max f0point5  0.608480   0.685654  68
## 4                 max accuracy  0.608480   0.888462  68
## 5                max precision  0.989688   1.000000   0
## 6                   max recall  0.029366   1.000000 393
## 7              max specificity  0.989688   1.000000   0
## 8             max absolute_mcc  0.394694   0.550091 119
## 9   max min_per_class_accuracy  0.115575   0.756322 250
## 10 max mean_per_class_accuracy  0.209415   0.776099 182
## 11                     max tns  0.989688 870.000000   0
## 12                     max fns  0.989688 169.000000   0
## 13                     max fps  0.025840 870.000000 399
## 14                     max tps  0.029366 170.000000 393
## 15                     max tnr  0.989688   1.000000   0
## 16                     max fnr  0.989688   0.994118   0
## 17                     max fpr  0.025840   1.000000 399
## 18                     max tpr  0.029366   1.000000 393
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Of note, for the next section and unsurprisingly, we’ll use the family, link, and regularization info.

GLMNET Comparison

Since we noticed above the model H2O has selected is a glm model with ridge regression, we’ll attempt to replicate the results using the GLMNET R package. And since we’re already doing a comparison, why not a few different types of regression, lasso, ridge, and elasticnet.

Here we’ll split our sample according to the same test and training sets we used in our H2O model.

I’ll drop the feature Over18 since it’s a constant factor and doesn’t add any additional information to the outcome.

x_train <- rbind(as.data.frame(train_h2o[x][, -c(21)]), as.data.frame(valid_h2o[x][, -c(21)]))
y_train <- rbind(as.data.frame(train_h2o[y]), as.data.frame(valid_h2o[y]))$Attrition

x_test <- as.data.frame(test_h2o[x][, -c(21)])
y_test <- as.data.frame(test_h2o[y])$Attrition

x_train <- model.matrix(~.-1, x_train)

x_test <- model.matrix(~.-1, x_test)

Since it’s not apparent the value of alpha, although, we do have a hint (Ridge Regression), we’ll perform a grid search over alpha.

alpha_grid <- seq(0, 1, .1)

model_fits <- lapply(alpha_grid, function(alpha){
  
  set.seed(1234)
  
  glmnet::cv.glmnet(x_train, as.factor(y_train), 
                    family = "binomial", 
                    alpha = alpha,
                    type.measure = "mse",
                    nfolds = 20                      
                    )
  
})

mse_lambda_min <- lapply(model_fits, function(model){ 
  
  lambda_min_mse <- model$cvm[which(model$lambda == model$lambda.min) ]
  
  lambda_1se_mse <- model$cvm[which(model$lambda == model$lambda.1se) ]
  
  lambda_min <- model$lambda.min
  
  lambda_1se <- model$lambda.1se
  
  data.frame(lambda_min_mse = lambda_min_mse, lambda_min = lambda_min, lambda_1se_mse = lambda_1se_mse, lambda_1se = lambda_1se) 
  })

mse_lambda_min <- do.call(rbind, mse_lambda_min)

mse_lambda_min$alpha <- alpha_grid

mse_lambda_min

##    lambda_min_mse  lambda_min lambda_1se_mse  lambda_1se alpha
## 1       0.1910575 0.008881137      0.1994429 0.052017005   0.0
## 2       0.1910322 0.007724360      0.1992090 0.037560485   0.1
## 3       0.1910282 0.006749269      0.2000218 0.029903457   0.2
## 4       0.1910617 0.005948091      0.2002854 0.024012547   0.3
## 5       0.1910651 0.005897270      0.2002692 0.019765305   0.4
## 6       0.1910490 0.005177797      0.1991611 0.015812244   0.5
## 7       0.1910771 0.004314831      0.2000434 0.014461599   0.6
## 8       0.1911163 0.004059019      0.1994610 0.012395656   0.7
## 9       0.1911311 0.003551641      0.1990230 0.010846199   0.8
## 10      0.1911558 0.003157014      0.2004191 0.010581058   0.9
## 11      0.1911867 0.002841313      0.2001205 0.009522952   1.0

Interestingly, enough we do not see our value of lambda. There’s a couple of reasons for this:

GLMNET picks a validation set at random whereas for the H2O model we specified this beforehand.
The solver for our lambda search seems to be different from GLMNET.

Regardless, we’ll pick the lambda with the lowest mse for both lambda_min and lambda_1se and see how they compare.

glmnet_min <- glmnet::glmnet(x_train, as.factor(y_train), 
                             family = "binomial", 
                             alpha = 0.2,
                             lambda = 0.005412513
                             )

glmnet_1se <- glmnet::glmnet(x_train, as.factor(y_train), 
                             family = "binomial", 
                             alpha = 0.4,
                             lambda = 0.02299651
                             )

glmnet_min_pred <- predict(glmnet_min, 
                           s = "lambda.min",
                           newx = x_test,
                           type = "class"
                           )

glmnet_1se_pred <- predict(glmnet_1se, 
                           s = "lambda.1se",
                           newx = x_test,
                           type = "class"
                           )

To make measure performance for our 2 models, we’ll make it easy and create a performance function.

perfomance <- function(confusion_matrix) {
  tn <- confusion_matrix[1]
  tp <- confusion_matrix[4]
  fp <- confusion_matrix[3]
  fn <- confusion_matrix[2]
  
  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  misclassification_rate <- 1 - accuracy
  recall <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  null_error_rate <- tn / (tp + tn + fp + fn)
  
  tibble::tibble(
    accuracy,
    misclassification_rate,
    recall,
    precision,
    null_error_rate
  ) %>% 
    t()
}

A quick look at our confusion matrix highlights a mixed recall.

glmnet_min_confusion <- table(y_test, glmnet_min_pred, dnn = c("Attrition", "Prediction"))
glmnet_min_confusion

##          Prediction
## Attrition  No Yes
##       No  173   9
##       Yes  14  15

Compared to our H2O model, the accuracy is slightly higher. However, our recall is lower. The gap between our null error and accuracy has closed slightly.

perfomance(glmnet_min_confusion)

##                             [,1]
## accuracy               0.8909953
## misclassification_rate 0.1090047
## recall                 0.5172414
## precision              0.6250000
## null_error_rate        0.8199052

Our 1se model has likewise traded in recall for precision. As you can see from the confusion matrix the model is very certain when it predicts yes it’s correct. However, we’ve closed the gap even more between our null error rate and accuracy.

glmnet_1se_confusion <- table(y_test, glmnet_1se_pred, dnn = c("Attrition", "Prediction"))
glmnet_1se_confusion

##          Prediction
## Attrition  No Yes
##       No  181   1
##       Yes  21   8

perfomance(glmnet_1se_confusion)

##                             [,1]
## accuracy               0.8957346
## misclassification_rate 0.1042654
## recall                 0.2758621
## precision              0.8888889
## null_error_rate        0.8578199

GLMNET Conclusions

Overall, the GLMNET model has similar results to the H2O model, but it does tend to focus more on accuracy and precision at the expense of recall. It’s important to keep these things in mind when developing a model and ensuring you’re optimizing for the task at hand.

We’ll return now to the original article.

Feature Importance

Here we’ll use the lime package to feature importance plots.

Just to note an appealing aspect of lime is it being model agnostic. We can apply the same technique regardless of it being random forest or neural network.

Since lime is still only a few years old I’ll give a brief explanation of what lime does. The first step is taking one of our predicted observations and creating slight perturbations of the original observation and finding the predictions using the original model. Finally, it fits a linear model based on the local characteristics of the model.

Lime Setup

require('lime')
model_type.H2OBinomialModel <- function(x, ...) return('classification')

predict_model.H2OBinomialModel <- function(x, newdata, type, ...) {
  pred <- h2o::h2o.predict(x, h2o::as.h2o(newdata))
  
  return(as.data.frame(pred[, -1]))
  
}

predict_model(x = automl_leader, newdata = as.data.frame(test_h2o[, -1]), type = 'raw') %>% 
  tibble::as_tibble()

## # A tibble: 211 x 2
##        No    Yes
##     <dbl>  <dbl>
##  1 0.858  0.142 
##  2 0.957  0.0426
##  3 0.0563 0.944 
##  4 0.966  0.0341
##  5 0.890  0.110 
##  6 0.956  0.0442
##  7 0.138  0.862 
##  8 0.939  0.0612
##  9 0.909  0.0912
## 10 0.456  0.544 
## # … with 201 more rows

explainer <- lime::lime(
  as.data.frame(train_h2o[, -1]),
  model = automl_leader,
  bin_continuous = FALSE
)

explanation <- lime::explain(
  as.data.frame(test_h2o[1:10, -1]),
  explainer = explainer,
  n_labels = 1,
  n_features = 4,
  kernel_width = 0.5
)

Let’s take a look at the feature importance plots from lime.

Note these differ quite a bit from those found in the article. This highlights, an important point about reproducibilty. Since the authors didn’t disclose or set a seed value for the AutoML we cannot reproduce their results; it’s been left up to chance. As you can see below the feature importance gives a very different perspective on the possible interventions of attrition.

lime::plot_features(explanation) + 
  ggplot2::labs(title = "Employee Attrition Prediction: Feature Importance",
       subtitle = "First 10 Cases From Test Set")

Author’s Feature Importance

Feature Evaluation

From the feature plots we generated, a few things stand out as important features in which we can explore and possible create intervention plans.

attrition_critical_features <- employee_attrition %>% 
  tibble::as_tibble() %>% 
  dplyr::select(Attrition, NumCompaniesWorked, YearsSinceLastPromotion, OverTime) %>% 
  dplyr::mutate(Case = dplyr::row_number()) %>% 
  dplyr::select(Case, dplyr::everything())

attrition_critical_features

## # A tibble: 1,470 x 5
##     Case Attrition NumCompaniesWorked YearsSinceLastPromotion OverTime
##    <int> <fct>                  <dbl>                   <dbl> <fct>   
##  1     1 Yes                        8                       0 Yes     
##  2     2 No                         1                       1 No      
##  3     3 Yes                        6                       0 Yes     
##  4     4 No                         1                       3 Yes     
##  5     5 No                         9                       2 No      
##  6     6 No                         0                       3 No      
##  7     7 No                         4                       0 Yes     
##  8     8 No                         1                       0 No      
##  9     9 No                         0                       1 No      
## 10    10 No                         6                       7 No      
## # … with 1,460 more rows

Number of Companies Worked

We can see from the rankings that in general the more companies worked at the more inclined it is that the individual would leave. Oddly, I would have assumed this to be monotonic and capture job hoppers. As an intervention, I’m not sure what you could possible do; you can’t do much about a person’s previous job history. As a prehire check this could be useful information if you’re looking for a long term employee.

attrition_critical_features %>% 
  dplyr::group_by(NumCompaniesWorked, Attrition) %>% 
  dplyr::summarise(n = dplyr::n()) %>% 
  dplyr::mutate(freq = n / sum(n)) %>%
  dplyr::filter(Attrition == 'Yes') %>% 
  ggplot2::ggplot(ggplot2::aes(forcats::fct_reorder(as.factor(NumCompaniesWorked), freq), freq)) +
  ggplot2::geom_bar(stat = 'identity') +
  ggplot2::coord_flip() +
  ggplot2::ylim(0, 1) + 
  ggplot2::ylab("Attrition Percentage (Yes / Total)") +
  ggplot2::xlab("NumCompaniesWorked")

Overtime

No surprises here. A large number of employees who stay are not working overtime. Interventions here could include hiring more staff or offering incentives for those times when overtime is necessary.

attrition_critical_features %>% 
  dplyr::mutate(OverTime = ifelse(OverTime == 'Yes', 1, 0)) %>% 
  ggplot2::ggplot(ggplot2::aes(Attrition, OverTime)) +
  ggplot2::geom_violin(trim=TRUE) +
  ggplot2::geom_jitter(shape = 16, position = ggplot2::position_jitter(0.4))

Years Since Last Promotion

This is also interestingly not monotonic. It’s clear at the top that employees leave if they haven’t been promoted for a long time.

Possible interventions here would include promotions or in lieu of that recognition programs.

attrition_critical_features %>% 
  dplyr::group_by(YearsSinceLastPromotion, Attrition) %>% 
  dplyr::summarise(n = dplyr::n()) %>% 
  dplyr::mutate(freq = n / sum(n)) %>%
  dplyr::filter(Attrition == 'Yes') %>% 
  ggplot2::ggplot(ggplot2::aes(forcats::fct_reorder(as.factor(YearsSinceLastPromotion), freq), freq)) +
  ggplot2::geom_bar(stat = 'identity') +
  ggplot2::coord_flip() +
  ggplot2::ylim(0, 1) + 
  ggplot2::ylab("Attrition Percentage (Yes / Total)") +
  ggplot2::xlab("YearsSinceLastPromotion")

Conclusion

Working through this example has been interesting. Not only were we able to explore interesting employee data, but also see how models compare and the pitfalls of not ensuring your work is reproducible.

Employee Attrition Example