Evaluating Machine Learning Classifiers: An Attrition Prediction Perspective

Togy Jose
Published in hrness.ai
5 min read · Feb 13, 2020


While there are quite a few generic articles and blogs on how to evaluate Machine Learning Classifiers, there are hardly any that explain model evaluation in the context of a specific use case, for example Attrition Prediction. I would like to use this article to do exactly that. That said, the high-level principles can be extended to any use case.

Why is it important for everyone (even folks who are not Developers / Data Scientists) to know how to evaluate an ML model?

With the increasing adoption of predictive solutions, there is a pretty good chance that you or your organization will soon evaluate a product that claims to be AI/ML-enabled, or a consultant who claims to provide predictive capabilities. In addition to being highly complex (or opaque, as with Neural Networks), predictive models also need to pass muster with regulatory agencies that often view them as inherently biased and difficult to interpret.

Keeping this in mind, it is imperative that the end-users of these services (including those with a non-technical background) know what questions to ask when such a solution is being pitched, and how to continuously track the effectiveness of a product or service after go-live.

Let's start with the basics. If we are trying to evaluate a model that predicts attrition likelihood as “Yes” or “No”, how would we measure model accuracy? A simple approach would be to ask, “How many times did the model predict correctly?” (counting both Yes and No predictions). Within R, this view can be easily generated using the “CrossTable” function (from the gmodels package).
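As a minimal sketch, assuming you have the model's predicted labels and the actual outcomes as two vectors (the values below are made up for illustration), the cross-tabulation looks like this:

```r
# Minimal sketch: cross-tabulating predictions vs. actuals with gmodels::CrossTable.
# `actual` and `predicted` are hypothetical vectors of "Yes"/"No" labels.
library(gmodels)

actual    <- factor(c("No", "No", "Yes", "No", "Yes", "No"), levels = c("No", "Yes"))
predicted <- factor(c("No", "Yes", "Yes", "No", "No",  "No"), levels = c("No", "Yes"))

# Rows = actual outcome, columns = model prediction; the cell proportions give
# the percentages discussed below (e.g. the share of correct "No" predictions).
CrossTable(actual, predicted, prop.chisq = FALSE)
```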

Let’s assume that in a group of 1,390 employees, 183 have actually quit. The breakup of predictions vs. actuals is given below and is referred to as the “Confusion Matrix”.

Confusion Matrix (n = 1,390):

                        Predicted: No      Predicted: Yes
Actual: No  (1,207)     1,201 (86.4%)      6 (0.4%)
Actual: Yes (183)       30 (2.2%)          153 (11.0%)

The “Accuracy” of the model is 97.4%, i.e. 86.4% (the 1,201 employees correctly predicted as not resigning) + 11.0% (the 153 employees correctly predicted as resigning).
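As a quick sketch, using the counts from the matrix above (the FP and FN counts are derived from the row totals):

```r
# Counts taken from the confusion matrix above.
TN <- 1201  # correctly predicted "will not resign"
TP <- 153   # correctly predicted "will resign"
FP <- 6     # predicted "will resign" but stayed (1,207 - 1,201)
FN <- 30    # predicted "will not resign" but quit (183 - 153)

accuracy <- (TP + TN) / (TP + TN + FP + FN)
accuracy  # ~0.974
```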

While Accuracy sounds like a fairly robust metric, it has a few drawbacks, one being the Class Imbalance Problem.

Imagine you’re in an organization or business unit where attrition is low: for example, in a population of 1,390 employees only 140 quit (~10%). If the overall model accuracy is 85%, there is a very real possibility that the predictions around who will quit are completely wrong, since the 15% of cases the model gets wrong is larger than the 10% of employees who actually leave. Another, more practical, issue with relying too much on “Accuracy” is that it doesn’t give granular inputs such as “What percentage of the employees who actually quit were predicted correctly?” In our example, that number is 153, which is 84% of overall attrition (i.e. 183).
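To see how misleading this can be, here is a small illustration with made-up labels matching the 140-out-of-1,390 scenario: a “model” that simply predicts “No” for everyone is about 90% accurate while detecting none of the actual quits.

```r
# Illustrative class-imbalance example: 140 quits in a population of 1,390.
actual     <- factor(c(rep("Yes", 140), rep("No", 1250)), levels = c("No", "Yes"))
naive_pred <- factor(rep("No", 1390), levels = c("No", "Yes"))

mean(naive_pred == actual)                  # ~0.90 accuracy
sum(naive_pred == "Yes" & actual == "Yes")  # 0 quits detected
```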

Given these limitations, it is important to get a more nuanced understanding of model performance. Here are a few metrics to consider:

1) False Positives (FP) and False Negatives (FN): In the confusion matrix above, a FP (0.4% overall probability) is the model predicting that someone is likely to quit (a ‘Positive’ event) when she does not quit. A FN (2.2% overall probability) is the model predicting that someone is not likely to quit (a ‘Negative’ event) when she does quit.

Which error you want to focus on depends on your organization’s overall Retention Strategy. For example, if your organization gives financial incentives, overseas opportunities, etc. to encourage employees to reverse resignations, it’s important to keep FP low, since every false alarm has a direct financial cost. On the other hand, if the cost of attrition is excessively high (i.e. the exit of a key employee has a direct impact on the bottom line), then we need to keep FN low to prevent impacting the organization’s performance.

Most ML classifiers deployed in these use cases allow you to introduce an “error cost” that penalizes the model during the learning stage: if we introduce an error cost for FP, the model adjusts itself to bring FP down, but this almost always has an adverse effect on FN. So organizations have to carefully decide which error type is more critical.
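As a minimal sketch of this idea, assuming an `employees` data frame with an `Attrition` factor ("No"/"Yes") and some predictor columns (both names are hypothetical), a decision tree in R can be trained with a loss matrix that penalizes one error type more heavily:

```r
# Cost-sensitive training sketch using rpart's loss matrix.
# `employees` and `Attrition` are assumed names, used here for illustration only.
library(rpart)

# rpart convention: rows = actual class, columns = predicted class,
# with factor levels ordered ("No", "Yes"). Here a false negative
# (actual "Yes" predicted "No") costs 4x a false positive; the weights
# are illustrative, not a recommendation.
loss <- matrix(c(0, 1,
                 4, 0), nrow = 2, byrow = TRUE)

model <- rpart(Attrition ~ ., data = employees,
               method = "class",
               parms  = list(loss = loss))
```

Raising the FN penalty typically pushes the model to flag more potential leavers, which, as noted above, tends to increase FP.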

2) Sensitivity / Specificity: Continuing with the above theme, it is important to understand and manage the tradeoff between False Positives and False Negatives.

Sensitivity is the ability to correctly predict Positive cases (i.e. exits): the ratio of correctly predicted exit cases (153) to actual exits (183), which is 83.6%. Likewise, Specificity is the ability to correctly predict Negative cases (i.e. will-not-exit): the ratio of correctly predicted will-not-exit cases (1,201) to total employees who did not exit (1,207), which is 99.5%.
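Using the same counts as before, these two ratios are straightforward to compute (a small sketch; packages such as caret report them directly from a confusion matrix as well):

```r
# Sensitivity and specificity from the confusion-matrix counts above.
TP <- 153; FN <- 30    # actual exits: predicted correctly / missed
TN <- 1201; FP <- 6    # actual stays: predicted correctly / falsely flagged

sensitivity <- TP / (TP + FN)   # 153 / 183   ~= 0.836
specificity <- TN / (TN + FP)   # 1201 / 1207 ~= 0.995
```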

Here Specificity is significantly higher than Sensitivity, which is good in the sense that very few employees are flagged as flight risks when they were never going to leave, so retention spend stays well targeted. The relatively lower Sensitivity, however, means roughly one in six employees who actually quit is missed by the model. These two measures need to be reviewed constantly to make mid-course corrections to your organization’s overall retention strategy.

3) Additional Metrics: In addition to these, there are other metrics which provide a more nuanced understanding, for example Precision / Recall (which measure how much noise there is in your positive predictions and how many of the actual positives you capture), ROC curves (used to visually compare the effectiveness of multiple models by reviewing the sensitivity/specificity tradeoff) and the F-measure (which consolidates Precision and Recall into a single number for comparing models).
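For completeness, here is a small sketch of these metrics computed from the same counts. The ROC/AUC part assumes the model also outputs probability scores (a hypothetical `scores` vector) and uses the pROC package:

```r
# Precision, recall and F-measure from the confusion-matrix counts above.
TP <- 153; FN <- 30; TN <- 1201; FP <- 6

precision <- TP / (TP + FP)               # 153 / 159 ~= 0.962
recall    <- TP / (TP + FN)               # same as sensitivity, ~0.836
f_measure <- 2 * precision * recall / (precision + recall)

# ROC / AUC sketch: assumes `actual` labels and a hypothetical `scores`
# vector of predicted probabilities (one value per employee).
library(pROC)
# roc_obj <- roc(response = actual, predictor = scores)
# plot(roc_obj)
# auc(roc_obj)
```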

A final word of caution: while model creation and testing is typically associated with splitting the data into Training and Test sets (a 75/25 or 67/33 split), if we are comparing and tuning multiple models built from the same data, repeatedly evaluating them against the same “Test” data risks producing a biased estimate of their performance. This is where the Validation dataset enters the picture. The overall data is split three ways into Training (50%), Validation (25%) and Test (25%) sets. During the modelling phase, the first 75% of the data is split (possibly multiple ways) into Training and Validation data to create and tune the candidate models; only in the final phase are the models checked, once, against the held-out Test data.
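A minimal sketch of such a three-way split in R, again assuming a hypothetical `employees` data frame:

```r
# 50/25/25 train/validation/test split of a hypothetical `employees` data frame.
set.seed(42)                        # for a reproducible split
n   <- nrow(employees)
idx <- sample(seq_len(n))           # shuffle the row indices

train_idx <- idx[1:floor(0.50 * n)]
valid_idx <- idx[(floor(0.50 * n) + 1):floor(0.75 * n)]
test_idx  <- idx[(floor(0.75 * n) + 1):n]

train_set <- employees[train_idx, ]
valid_set <- employees[valid_idx, ]
test_set  <- employees[test_idx, ]
```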

I hope this article provides clarity on how to effectively evaluate an ML classifier and gives you the confidence to engage meaningfully with vendors providing ML-enabled products.

Please feel free to provide feedback.


Togy Jose
hrness.ai

Founder @ hrness.ai #graphanalytics #ml #ai #peopleanalytics #startup #networks #communities. Twitter: @togyjose