Adversarial Examples, natural inputs perturbed imperceptibly so that they induce erroneous predictions, have a natural corollary in Fooling Images [2]: unnatural inputs that induce high-confidence model predictions yet are unrecognizable to humans.
These two related but systematically different phenomena provide unique insights into how Deep Neural Networks (DNNs) learn differently from Humans.
“Adversarial Examples” show how adding a small amount of data, which human perception registers as utterly incomprehensible noise, drastically alters the model’s perception. To me, this highlights the over-confidence, discriminatory narrow-sightedness, and lack of intentionality of a simple DNN.
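To make the perturbation idea concrete, here is a minimal sketch on a toy linear model rather than a real DNN. The "panda detector", its weights, the input, and the step size `EPS` are all hypothetical values chosen for illustration. The step follows the sign of the gradient (in the style of the fast gradient sign method), which for a linear model is simply the sign of each weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

# Hypothetical linear "panda detector" over 1000 pixel-like features.
DIM = 1000
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(DIM)]

# A "natural" input: mid-gray pixels with a faint pattern aligned with the weights.
x = [0.5 + 0.002 * w for w in weights]

# Sign-of-gradient step: move every feature by at most EPS against the prediction.
# For a linear model the gradient of the logit w.r.t. the input is just `weights`.
EPS = 0.005
x_adv = [xi - EPS * math.copysign(1.0, w) for xi, w in zip(x, weights)]

p_clean = sigmoid(logit(weights, x))
p_adv = sigmoid(logit(weights, x_adv))
max_change = max(abs(a - b) for a, b in zip(x, x_adv))
print(f"clean: {p_clean:.3f}  adversarial: {p_adv:.3f}  max change: {max_change:.3f}")
```

No single feature moves by more than 0.005 on a scale where pixels sit near 0.5, yet the shifts add up across 1000 dimensions and the predicted probability drops from roughly 0.88 to roughly 0.05.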
Over-confidence - Our intuitive understanding of confidence is misaligned with a DNN’s predicted class probability. A human is likely to assign low confidence to an out-of-sample input, aware that it is unlike anything they have seen before. The model, meanwhile, is unable to detect that its frame of reference no longer applies, and hence will not mark the answer as probably wrong.
- Notably, contrastive loss approaches and “robust” training are both techniques that combat adversarial inputs at this level. They do not attempt to solve the fundamental problem, but they can make the model less likely to be overconfident on “strange” samples.
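The over-confidence point can be sketched with a hand-built two-class linear softmax model. The weights and both inputs below are assumptions made up for this example, not taken from any trained network. The sketch only shows that softmax confidence tracks distance from the decision boundary, not familiarity with the input.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-class linear model over two features (weights chosen by hand).
WEIGHTS = [[1.0, -1.0],   # class 0: "panda"
           [-1.0, 1.0]]   # class 1: "gibbon"

def confidence(x):
    """Return the probability the model assigns to its top class."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]
    return max(softmax(logits))

in_sample = [0.6, 0.4]          # resembles an assumed training range of [0, 1]
out_of_sample = [30.0, -20.0]   # far outside anything "seen" in training

conf_in = confidence(in_sample)
conf_out = confidence(out_of_sample)
print(f"in-sample confidence: {conf_in:.3f}, out-of-sample confidence: {conf_out:.3f}")
```

In this toy, the stranger input receives the higher confidence, the opposite of what a human would report.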
Discriminatory Narrow-sightedness - Humans make many high-level inferences, in part because we discard much of the input to our visual system. Even at the most basic level, we go briefly blind about three times per second during saccades and don’t even notice. Instead, we fill in many of the blanks from memory.
- Meanwhile, Deep Neural Nets use no such tricks. They dredge through the data looking for any possible pattern. Anything with discriminatory power is fair game, even if the feature learned is utterly inhuman (i.e. not a feature we’d intended for the model to learn). These non-robust features are highly predictive, yet brittle when considered in the context of the broader task of object recognition (as opposed to the dataset-localized task).
- Consider marking every panda image with a red square and every gibbon image with a green square. The model will learn to look for the square and will be highly confident in its prediction. However, the square is not a feature representative of the task of telling a panda from a gibbon outside of our “augmented dataset”.
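The colored-square thought experiment can be simulated in a few lines. This is a sketch under loud assumptions: a two-feature logistic regression stands in for the DNN, `fur` is a made-up weak and noisy "real" cue, `square` is the perfectly predictive marker, and the data sizes, learning rate, and seed are arbitrary.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-z))

def make_data(n, rng, square_matches_label=True):
    """Each sample is ([fur, square], label); label 1 = panda, 0 = gibbon."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        fur = 0.5 * sign + rng.gauss(0.0, 1.0)             # weak, noisy cue
        square = sign if square_matches_label else -sign   # perfect shortcut
        data.append(([fur, square], label))
    return data

def train(data, epochs=300, lr=0.5):
    """Full-batch gradient descent on the logistic loss, no bias term."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x, y in data:
            err = sigmoid(w[0] * x[0] + w[1] * x[1]) - y
            grad[0] += err * x[0]
            grad[1] += err * x[1]
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def accuracy(w, data):
    hits = sum((w[0] * x[0] + w[1] * x[1] > 0) == (y == 1) for x, y in data)
    return hits / len(data)

rng = random.Random(0)
w_fur, w_square = train(make_data(200, rng))
acc_with_shortcut = accuracy([w_fur, w_square], make_data(200, rng))
acc_swapped = accuracy([w_fur, w_square],
                       make_data(200, rng, square_matches_label=False))
print(f"weights: fur={w_fur:.2f}, square={w_square:.2f}")
print(f"accuracy with squares intact: {acc_with_shortcut:.2f}, swapped: {acc_swapped:.2f}")
```

The learned weight on `square` dominates the weight on `fur`, so accuracy is high while the squares agree with the labels and collapses once they are swapped, even though the "fur" signal is unchanged.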
Intentionality Failure - In my view, however, these examples are not a failing of our model. Instead, they reflect a challenge of intention. We trained the classification model intending it to learn the tricks we use (“pandas have white and black fur”), without any prior within the model to enforce that restriction. However, I don’t think the right approach would be to enforce such a restriction - we might get more human-interpretable models, but we’d lose several valuable, robust, and perhaps inhuman features along the way.
It’s well known that Neural Networks pay an inordinate amount of attention to texture and color when classifying compared to humans. This is not without good reason. These are highly predictive features, but they’re also features that humans aren’t very good at detecting.
Another clear example of inhuman but valuable features is the set of artifacts that a zoom lens leaves in an image. These artifacts turn out to have discriminatory power for the class of dogs. Commonly we photograph dogs outside and at a distance, while cats are photographed indoors (and wear bowties!). These features are valuable to learn, and an overly restrictive model would lose out on these rich and fascinating features.
The “Fooling Image” is interesting in part because it is more human-interpretable than directly encoded noise. Rather than inhuman features, it surfaces recognizable ones (black and yellow stripes are a good indicator for school buses), and in doing so shows a more fundamental issue common to all these examples:
- A Lack of Sufficiency - DNNs assume all discriminatory features are sufficient. That is, if the only images the network sees with black and yellow stripes are school buses, then stripes are treated as sufficient to classify a school bus. A human, however, would consider them a mere predictive indicator, without jumping to a conclusion. I suspect the necessary conditions are often not the most discriminatory ones (a school bus must be a bus taking children to school and could be any color). The current brand of DNNs does not even attempt to find multiple conditions if one is sufficiently predictive.
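A fooling input can be sketched by running gradient ascent on the input itself instead of on the weights. The classifier below is a hand-built linear softmax model, not a trained DNN. Its stripe-like weights, the class index `TARGET`, the step size, and the number of steps are all assumptions made for illustration.

```python
import math

NUM_CLASSES = 3
DIM = 30
TARGET = 0  # hypothetical "school bus" class

# Hand-built linear classifier: class c responds to every third feature
# (its "stripes") and is mildly inhibited by the rest.
W = [[1.0 if i % NUM_CLASSES == c else -0.5 for i in range(DIM)]
     for c in range(NUM_CLASSES)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x):
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

x = [0.5] * DIM  # start from a featureless gray "image"
start_conf = predict(x)[TARGET]

for _ in range(100):
    p = predict(x)
    for i in range(DIM):
        # d log p[TARGET] / d x_i for a linear softmax model
        grad = W[TARGET][i] - sum(p[c] * W[c][i] for c in range(NUM_CLASSES))
        x[i] = min(1.0, max(0.0, x[i] + 0.1 * grad))  # stay in pixel range

final_conf = predict(x)[TARGET]
print(f"confidence before: {start_conf:.3f}, after: {final_conf:.3f}")
```

The optimized input is nothing but an amplified stripe pattern, yet the model's confidence in the target class climbs from one third to well above 0.95. In this toy, the stripes alone are treated as sufficient.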
[1] Ilyas, Andrew, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. “Adversarial examples are not bugs, they are features.” Advances in Neural Information Processing Systems 32 (2019).
[2] Nguyen, Anh, Jason Yosinski, and Jeff Clune. “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.