Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra | ICCV 2017
In a Nutshell 🥜
Selvaraju et al.1 introduce Grad-CAM, a visualization technique for Convolutional Neural Networks (CNNs). The paper is motivated by the observation that while CNNs have achieved breakthroughs in many computer vision tasks, their lack of interpretability makes their failures hard to understand and diagnose (i.e., they fail ungracefully).
The method of Grad-CAM may be summarized as follows. Given an image and a class of interest, the image is first passed through the network to compute the class scores. The gradient of the score for the class of interest is set to one (and all other class gradients to zero), and this signal is backpropagated to the rectified feature maps of a convolutional layer. These gradients are global-average-pooled to obtain an importance weight for each feature map, and the heatmap is the weighted combination of the feature maps followed by a ReLU. Grad-CAM is typically applied to the last convolutional layer, which offers the best trade-off between high-level semantics and spatial detail. The heatmap can further be pointwise multiplied with Guided Backpropagation to obtain Guided Grad-CAM visualizations, which are both high-resolution and class-discriminative.
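In symbols (following the paper), the importance weight of feature map $A^k$ for class $c$ is the global-average-pooled gradient of the class score, and the heatmap is a ReLU over the weighted combination:

$$\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{ij}}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\Big(\sum_k \alpha_k^c A^k\Big)$$

A minimal sketch of this procedure in PyTorch (my own illustration, not the authors' code; the torchvision VGG-16, the chosen layer index, and the preprocessed input tensor are all assumptions):

```python
import torch
from torchvision import models

# Illustrative Grad-CAM sketch. Assumes a torchvision VGG-16 and that `image`
# is already a preprocessed (1, 3, 224, 224) tensor.
model = models.vgg16(weights="IMAGENET1K_V1").eval()
target_layer = model.features[29]  # ReLU after the last conv layer of VGG-16 (assumed index)

def grad_cam(image, class_idx):
    feats = {}
    handle = target_layer.register_forward_hook(lambda mod, inp, out: feats.update(A=out))
    score = model(image)[0, class_idx]             # forward pass; score y^c for the class of interest
    handle.remove()
    A = feats["A"]                                 # rectified feature maps, shape (1, K, h, w)
    grads = torch.autograd.grad(score, A)[0]       # dy^c / dA^k, same shape as A
    alpha = grads.mean(dim=(2, 3), keepdim=True)   # global-average-pool gradients -> weights alpha_k^c
    cam = torch.relu((alpha * A).sum(dim=1))       # weighted combination of feature maps + ReLU
    return cam / (cam.max() + 1e-8)                # normalized coarse heatmap, shape (1, h, w)
```

The coarse heatmap would then be upsampled to the input resolution and overlaid on the image; Guided Grad-CAM additionally multiplies it pointwise with the Guided Backpropagation map.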
The paper then demonstrates Grad-CAM’s strong performance in localization and visualization: localization is evaluated on the ImageNet localization challenge, and visualizations are evaluated through user studies. The paper also discusses various applications of Grad-CAM, including analyzing failure modes, identifying bias in a dataset, image captioning, and Visual Question Answering (VQA).
Some Thoughts 💭
The paper has a strong and extensive evaluation section, both qualitative and quantitative, which I thoroughly enjoyed reading. The illustrative figures were also helpful.
As a technique, I like how generalizable Grad-CAM is, being applicable to localization, visualization, image captioning, VQA, and more.
I found the paper’s argument for why transparency is essential at three stages of AI quite interesting. The three stages are: (1) when the AI is weaker than humans, transparency identifies failure modes for researchers; (2) when the AI is on par with humans, transparency helps establish trust from users; (3) when the AI is stronger than humans, transparency helps teach humans how to make better decisions.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (pp. 618-626).