Deep learning models are often criticized for being complex and opaque. They are called black boxes because they deliver predictions and insights, but the logic behind their outputs is hard to comprehend. Because these models are built from complex, multilayer, nonlinear neural networks, data scientists find it difficult to ascertain which factors drove a particular prediction.
This unintelligibility makes people wary of making important decisions based on the models’ outputs. As human beings, we trust what we understand and what we can verify, and over time this has served us well. So being able to show how models go about solving a problem to produce insights helps build trust, even among people with only a cursory knowledge of data science.
To achieve this, it is imperative to develop computational methods that can interpret, audit, and debug such models. Debugging is essential to understanding how models identify patterns and generate predictions, and it also helps us identify and rectify bugs and defects.
In this article we delve into the different methods used to debug machine learning models.
Permutation Importance
Also known as permutation feature importance, this is an algorithm that measures how sensitive a model is to permutations (random shuffles) of a feature’s values. Ordinary feature importance evaluates each feature of your data and scores it based on its relevance to the output. Permutation feature importance, by contrast, measures how much the model’s performance changes after a feature’s values have been shuffled, and scores the feature based on that change.
For instance, let us randomly permute (shuffle) the values of a single column in the validation dataset while leaving all the other columns intact. If the model’s accuracy drops substantially, causing an increase in error, that feature is considered “important”. On the other hand, a feature is considered “unimportant” if shuffling its values barely affects the model’s accuracy.
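Here is a minimal hand-rolled sketch of that shuffle-and-score idea (not the ELI5 implementation discussed below); it assumes a fitted scikit-learn-style model `model`, a validation DataFrame `val_X` with labels `val_y`, and a score-style metric such as `sklearn.metrics.r2_score`, all of which are illustrative names rather than code from this article:

```python
# Hand-rolled permutation importance: shuffle one column at a time and
# record how much the validation score drops relative to the baseline.
import numpy as np

def permutation_importance(model, val_X, val_y, metric):
    baseline = metric(val_y, model.predict(val_X))
    importances = {}
    for col in val_X.columns:
        shuffled = val_X.copy()
        # Shuffle a single column while leaving all other columns intact.
        shuffled[col] = np.random.permutation(shuffled[col].values)
        score = metric(val_y, model.predict(shuffled))
        # A large drop from the baseline marks the feature as important.
        importances[col] = baseline - score
    return importances
```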
Debugging ML models using ELI5
ELI5 is a Python library that supports several ML frameworks and makes it easy to visualize and debug black-box models through a unified API. Among other things, it can compute permutation importance. Note that permutation importance is computed on held-out (test or validation) data only after the model has been built.
After our model is ready, we import ELI5 to calculate permutation importance.
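A minimal sketch of that step using eli5’s PermutationImportance is shown below; it assumes the Kaggle House Prices training data is already loaded into a pandas DataFrame `data` with target column "SalePrice", and the model choice and variable names are illustrative rather than the article’s original code:

```python
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Keep only numeric columns and fill missing values, purely to simplify the sketch.
X = data.select_dtypes(include="number").drop(columns=["SalePrice"]).fillna(0)
y = data["SalePrice"]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(train_X, train_y)

# Shuffle each validation column in turn and measure the drop in score.
perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())
```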
The output of the above code is shown below:
The features at the top are the most important: shuffling their values reduces the model’s accuracy significantly. The features at the bottom of the list are unimportant, in that permuting their values does not reduce the model’s accuracy. In this example, OverallQual was the most important feature.
Debugging CNN-based models using Grad-CAM (Gradient-weighted Class Activation Mapping)
Grad-CAM is a technique that produces visual explanations for the outputs of Convolutional Neural Network (CNN) based models, making them more transparent. It examines the gradients flowing into the final convolutional layer of the network to understand which parts of the input influenced the output. Grad-CAM can be used for image classification, image captioning, and visual question answering. Its output is a heatmap visualization, which lets you visually verify that your model has been trained to look at the right patterns in an image.
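The exact implementation depends on the framework; below is a minimal sketch of the core Grad-CAM computation in TensorFlow/Keras, where the trained model, the name of its last convolutional layer, and the preprocessed image batch are assumed inputs rather than code from this article:

```python
import tensorflow as tf

def grad_cam_heatmap(model, img_array, last_conv_layer_name, class_index=None):
    # Build a model that maps the input image to the last conv layer's
    # activations and to the final predictions.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_output, predictions = grad_model(img_array)
        if class_index is None:
            class_index = int(tf.argmax(predictions[0]))
        class_score = predictions[:, class_index]

    # Gradients of the class score w.r.t. the last conv layer's feature maps.
    grads = tape.gradient(class_score, conv_output)
    # Global-average-pool the gradients: one weight per feature map.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted combination of the feature maps, then ReLU and normalization.
    heatmap = tf.reduce_sum(conv_output[0] * weights, axis=-1)
    heatmap = tf.maximum(heatmap, 0) / (tf.reduce_max(heatmap) + 1e-8)
    return heatmap.numpy()
```

The returned heatmap can then be resized to the input image and overlaid on it to inspect which regions the network relied on for its prediction.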
Debugging ML models using SHAP (SHapley Additive exPlanations)
SHAP is a game-theoretic approach that aims to explain a prediction by calculating the contribution of each feature towards that prediction. The SHAP library uses Shapley values at its core to explain individual predictions. Lloyd Shapley introduced the concept of Shapley values in 1953, and it was later applied to the field of machine learning.
Shapley values are derived from game theory: each feature in the data is a player, and the prediction is the final reward. Based on each player’s contribution towards the reward, Shapley values tell us how to distribute that reward fairly among the players.
We use SHAP quite often, especially for models in which explainability is critical, and the resulting explanations are quite accurate.
SHAP can explain:
- The general feature importance of the model, using all the data
- Why the model calculates a particular score for a specific row/record
- Which features are most dominant for a segment/group of the data
Shapley values calculate a feature’s importance by comparing two predictions: one with the feature included and one without it. Positive SHAP values push the prediction (target variable) higher, whereas negative SHAP values push it lower.
Here is an example to illustrate this. For this purpose, I am using the red wine quality dataset from Kaggle.
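A minimal sketch of this setup is shown below; it assumes the Kaggle CSV has been saved locally as winequality-red.csv, and the file name, model choice, and variable names are illustrative rather than the article’s original code:

```python
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Adjust the path/separator to match your copy of the data.
data = pd.read_csv("winequality-red.csv")
X = data.drop(columns=["quality"])
y = data["quality"]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(train_X, train_y)

# TreeExplainer computes SHAP values efficiently for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(val_X)
```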
Now we produce a variable importance plot, which lists the most significant variables in descending order, with the top variable contributing most to the model.
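Continuing the sketch above, the plot can be produced with a single call, using the `shap_values` and `val_X` defined earlier:

```python
# Summary plot: features sorted by mean |SHAP value|, colored by feature value.
shap.summary_plot(shap_values, val_X)
```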
In the above figure, all the variables are plotted in descending order of importance. The color of each point represents the feature value in that observation: high values in red, low values in blue. A high “sulphates” content has a strong positive impact on the quality rating; the x-axis shows the SHAP value, i.e. the impact on the model output, with points to the right indicating a positive impact. Similarly, we can see that “chlorides” is negatively correlated with the target variable.
Now, I would like to show how SHAP values are computed for individual cases. We compute them for several observations and pick a few of those observations at random.
After choosing random observations, we initialize the notebook’s JavaScript rendering with initjs() and draw a force plot.
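Continuing the same sketch, a force plot for one arbitrarily chosen validation row might look like this (in some SHAP versions `explainer.expected_value` is a one-element array, in which case use `explainer.expected_value[0]`):

```python
# Enable JavaScript rendering for SHAP's interactive plots in the notebook.
shap.initjs()

i = 10  # an arbitrary observation from the validation set
shap.force_plot(explainer.expected_value, shap_values[i, :], val_X.iloc[i, :])
```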
Explanation for some of the terms seen in the plot above:
- Output value: Prediction for that observation, which in this case is 5.35
- Base value: The mean prediction, or mean(yhat); here it is 5.65
- Blue/red: Features that push the prediction higher are shown in red, and those that push it lower are shown in blue
For more information and to test your skills, check out Kaggle.