Model selection is an issue we face on every machine learning (ML) project. Random Forest is one of our favourites for a number of reasons: it tends to have very good accuracy, it’s exceptional at handling imbalanced datasets, and it’s easy to extract the features of the data that are most important to the outcome of the model. This last point is often one of our clients’ key interests. It’s one thing to predict business outcomes, but if the client wants to influence them at all they need to know what factors are at play and how big their influence is. Feature importance is your friend…
It’s never that easy
This last point is not as clear-cut as it may seem, however. The feature importance produced by Random Forests (and similar techniques like XGBoost) doesn’t rank features by how directly they affect the accuracy of the model against our test set, but by how important they were to the trees that have been built. The technicalities of this are explained here so I won’t repeat them.
There’s a module for that!
Fortunately for us, there are ways around this. The ELI5 permutation importance implementation is our weapon of choice. It takes a much more direct path, determining which features are important against a specific test set by systematically removing them (or, more accurately, replacing them with random noise by shuffling their values) and measuring how this affects the model’s performance. It works with both classification and regression models.
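To make the mechanism concrete, here is a minimal sketch of permutation importance using only the Python standard library. It is not ELI5’s code (ELI5 exposes the technique through the PermutationImportance class in eli5.sklearn); the toy_model, the synthetic data and the helper names below are all hypothetical, chosen purely for illustration.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the black-box `predict` matches the label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column so it keeps its distribution but loses
            # any relationship with the labels.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: two random features, the label depends only on the first.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

def toy_model(row):
    """Stand-in for a trained classifier; it only looks at feature 0."""
    return int(row[0] > 0.5)

scores = permutation_importance(toy_model, rows, labels)
print(scores)  # feature 0 should score well above feature 1
```

Shuffling the feature the toy model relies on costs it a large share of its accuracy, while shuffling the feature it ignores costs nothing, which is exactly the signal we want to show a client.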
As is often the case, the best way to compare these methods is with real-world data. Below are two feature importance plots produced from a real (but anonymised) binary classifier for a customer project:
The built-in RandomForestClassifier feature importance
The improved ELI5 permutation importance
They both agree on the most important feature by far; however, C has dropped off almost entirely and D has surpassed both B and C to take second place. When a client is making long-term business plans this could have a significant impact! Also note that all features further down the hierarchy drop off to effective insignificance, further reinforcing the importance of the top three features.
Another point worth noting is that there are often multiple feature importance measures built into ML models, and these are often not consistent between models. For example, XGBoost offers gain, cover and frequency, all of which are difficult to interpret, and it is equally difficult to know which is most relevant. With ELI5, however, it’s clear exactly how the importance is ascertained, which is critical when we’re explaining abstract and abstruse findings to clients. A ground-breaking insight that cannot be communicated clearly in business terms to non-technical stakeholders isn’t worth anything!
A further distinction from built-in feature importance is that ELI5 finds the importance of the features by working with the features themselves, rather than the inner workings of the model. The benefit is that ELI5 treats the ML model as a ‘black box’. This makes it applicable to any model we create, giving us a standard that’s portable between projects.
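The ‘black box’ point can be sketched in a few lines: the routine below only ever calls a scoring function, so the same code serves a classifier and a regressor alike. This is an illustrative sketch rather than ELI5’s implementation, and the two hand-written “models”, their scoring functions and the synthetic data are all hypothetical.

```python
import random

def importance(score, rows, targets, seed=0):
    """Drop in `score` after shuffling each column once.

    `score(rows, targets)` can be any callable where higher is better;
    nothing here inspects the model behind it.
    """
    rng = random.Random(seed)
    baseline = score(rows, targets)
    result = []
    for col in range(len(rows[0])):
        column = [row[col] for row in rows]
        rng.shuffle(column)
        permuted = [row[:col] + [value] + row[col + 1:]
                    for row, value in zip(rows, column)]
        result.append(baseline - score(permuted, targets))
    return result

# Hypothetical data: two random features per row.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(300)]

# A stand-in classifier that uses feature 0, scored by accuracy.
class_targets = [int(r[0] > 0.5) for r in rows]

def classifier_accuracy(rows, targets):
    return sum(int(r[0] > 0.5) == t for r, t in zip(rows, targets)) / len(rows)

# A stand-in regressor that uses feature 1, scored by negative mean absolute error.
reg_targets = [3.0 * r[1] for r in rows]

def regressor_neg_mae(rows, targets):
    return -sum(abs(3.0 * r[1] - t) for r, t in zip(rows, targets)) / len(rows)

clf_scores = importance(classifier_accuracy, rows, class_targets)
reg_scores = importance(regressor_neg_mae, rows, reg_targets)
print(clf_scores, reg_scores)
```

Because only the scoring function changes between the two calls, the same report can be produced for whichever model a project ends up using.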
Worth the cost
While there is a time penalty to pay for running ELI5 (it does have to re-score the model for every feature, after all), it’s more than worthwhile for the value it adds to our projects. Due to the increased confidence we can place in the results of ELI5, and its applicability to a variety of machine learning algorithms, it has quickly become a standard part of our toolkit.