A Discussion of ‘Adversarial Examples Are Not Bugs, They Are Features’: Robust Feature Leakage
Ilyas et al. report a surprising result: a model trained on adversarial examples is effective on clean data. They suggest this transfer is driven by non-robust cues overlaid on the dataset by the adversarial examples. However, an alternate mechanism for the transfer could be a kind of “robust feature leakage” where the model picks up on faint robust cues in the attacks.
Lower Bounding Leakage
Our technique for quantifying leakage consists of two steps:
- First, we construct features that are provably robust, in a sense we will soon specify.
- Next, we train a linear classifier on the datasets D_rand and D_det (adversarial examples generated toward random and deterministically chosen target labels, respectively) using these robust features only; a minimal code sketch of the procedure follows this list.
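Below is a minimal, self-contained sketch of this two-step procedure. It is not the comment's actual implementation: the synthetic arrays stand in for the adversarially constructed training set and the clean test set, and the fixed random directions W are only a placeholder for whatever provably robust features are actually used.

```python
# Sketch of the leakage lower bound. Placeholder data and features throughout;
# the real experiment would use D_rand / D_det and the clean test set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, k, n = 3072, 128, 5000   # input dim (flattened 32x32x3 image), #features, #examples

# Placeholders: x_adv/y_adv stand in for an adversarially constructed training
# set (e.g. D_rand); x_clean/y_clean stand in for the clean test set.
x_adv, y_adv = rng.normal(size=(n, d)), rng.integers(0, 10, size=n)
x_clean, y_clean = rng.normal(size=(n, d)), rng.integers(0, 10, size=n)

# Step 1: provably robust features. A linear feature f_i(x) = <w_i, x> satisfies
# |f_i(x + delta) - f_i(x)| <= eps * ||w_i||_2 for any ||delta||_2 <= eps, so the
# adversarial perturbation itself can only move it by a bounded amount.
W = rng.normal(size=(k, d)) / np.sqrt(d)

def robust_features(x):
    return x @ W.T

# Step 2: train a linear classifier on these robust features of the adversarial
# dataset, then evaluate on clean data. Any above-chance clean accuracy must come
# from robust signal that leaked into the dataset, so it lower-bounds the leakage.
clf = LogisticRegression(max_iter=1000).fit(robust_features(x_adv), y_adv)
print("clean accuracy (leakage lower bound):", clf.score(robust_features(x_clean), y_clean))
```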
The D_rand dataset was meant to show that (a) PGD-based adversarial examples actually alter features in the data and (b) models can learn from human-meaningless/mislabeled training data. The D_det dataset, on the other hand, illustrates that non-robust features are sufficient for generalization and can be preferred over robust ones in natural settings.
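For context, the following is a hedged sketch of how such relabeled datasets can be constructed, paraphrasing the paper's description rather than reproducing the authors' released code: each training image is perturbed with targeted PGD toward a target class and then stored with that target as its new label (uniformly random targets for D_rand, a fixed permutation of the true labels for D_det). The attack radius, step size, and iteration count below are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def targeted_pgd_l2(model, x, target, eps=0.5, step=0.1, iters=20):
    """Push a batch x toward `target` classes under a pretrained `model` with L2-bounded PGD."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), target)
        grad, = torch.autograd.grad(loss, delta)
        # Gradient *descent* on the targeted loss: move toward the target class.
        g_norm = grad.flatten(1).norm(dim=1).clamp(min=1e-12).view(-1, 1, 1, 1)
        delta = delta - step * grad / g_norm
        # Project the perturbation back into the L2 ball of radius eps.
        d_norm = delta.flatten(1).norm(dim=1).clamp(min=1e-12).view(-1, 1, 1, 1)
        delta = (delta * (eps / d_norm).clamp(max=1.0)).detach().requires_grad_(True)
    return (x + delta).detach()

# Usage sketch (x: image batch, y: true labels, model: a standard classifier):
#   t_rand = torch.randint(0, 10, y.shape)       # D_rand: uniformly random targets
#   t_det  = (y + 1) % 10                        # D_det: one possible fixed permutation
#   x_new  = targeted_pgd_l2(model, x, t_rand)   # stored with label t_rand, not y
```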
Conclusion
The experiment put forth in the comment is a clever way of showing that such leakage is indeed possible. However, we want to stress (as the comment itself does) that robust feature leakage does not have an impact on our main thesis: the D_det dataset explicitly controls for robust feature leakage (and in fact allows us to quantify models' preference for robust features vs. non-robust features; see Appendix D.6 in the paper).
Acknowledgments
Shan Carter (started the project), Preetum (technical discussion), Chris Olah (technical discussion), Ria (technical discussion), Aditya (feedback)
References
- Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175.
Updates and Corrections
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Reuse
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution in academic contexts, please cite this work as
Goh, "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage", Distill, 2019.
@article{goh2019a,
author = {Goh, Gabriel},
title = {A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage},
journal = {Distill},
year = {2019},
note = {https://distill.pub/2019/advex-bugs-discussion/response-2},
doi = {10.23915/distill.00019.2}
}

