Constructing Non-transferable Targeted Adversarial Examples
We demonstrate that there exist adversarial examples which are just “bugs”: aberrations in the classifier that are not intrinsic properties of the data distribution. In particular, we give a new method for constructing adversarial examples which:
- Do not transfer between models
- Do not leak “non-robust features” which allow for learning, in the sense of Ilyas-Santurkar-Tsipras-Engstrom-Tran-Madry
Background
Many have understood Ilyas et al. to claim that adversarial examples are not “bugs”, but are “features”. Specifically, Ilyas et al. postulate the following two worlds:
- World 1: Adversarial examples exploit directions irrelevant for classification (“bugs”). In this world, adversarial examples occur because classifiers behave poorly off-distribution, when they are evaluated on inputs that are not natural images. Here, adversarial examples would occur in arbitrary directions, having nothing to do with the true data distribution.
- World 2: Adversarial examples exploit useful directions for classification (“features”). In this world, adversarial examples occur in directions that are still “on-distribution”, and which contain features of the target class. For example, consider a perturbation that causes an image of a dog to be classified as a cat. In World 2, this perturbation is not purely random, but has something to do with cats. Moreover, we expect this perturbation to transfer to other classifiers trained to distinguish cats from dogs.
Our main contribution is demonstrating that these worlds are not mutually exclusive – and in fact, we are in both. Ilyas et al. show that there exist adversarial examples in World 2, and we show there exist examples in World 1.
Our Construction
Let $f$ be the classifier we wish to attack, and let $\{f_i : \mathbb{R}^n \to \mathcal{Y}\}_i$ be an ensemble of classifiers trained for the same classification problem as $f$. For example, we can let $\{f_i\}$ be a collection of ResNet18s trained from different random initializations.
For an input example $(x, y)$ and target class $y_{\text{targ}}$, we perform iterative updates $x_t$ (starting from $x_0 = x$) to find adversarial examples, as in projected gradient descent (PGD). However, instead of stepping directly in the gradient direction of the target-class loss, we step in the direction:
$$-\left( \nabla_x L(f, x_t, y_{\text{targ}}) + \mathbb{E}_i\!\left[ \nabla_x L(f_i, x_t, y) \right] \right)$$
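Intuitively, the first term pushes $f$ toward the target class, while the second term keeps every classifier in the ensemble confident in the true label $y$, so the resulting perturbation fools $f$ without fooling independently trained models. Below is a minimal PyTorch sketch of one way to implement this step; the function name, hyperparameters (`eps`, `step_size`, `steps`), and the signed-gradient step with $\ell_\infty$ projection are illustrative PGD conventions assumed here, not part of the construction above.

```python
import torch
import torch.nn.functional as F

def non_transferable_targeted_attack(f, ensemble, x, y, y_targ,
                                     eps=8/255, step_size=1/255, steps=50):
    """Targeted PGD toward y_targ on f, while keeping the ensemble confident
    in the true label y (names and hyperparameters are illustrative)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Loss whose negative gradient is the step direction above:
        # target-class loss on f, plus the ensemble's average loss on y.
        loss = F.cross_entropy(f(x_adv), y_targ)
        loss = loss + sum(F.cross_entropy(g(x_adv), y) for g in ensemble) / len(ensemble)
        grad, = torch.autograd.grad(loss, x_adv)
        # Signed step (standard L-infinity PGD convention), then project back
        # into the eps-ball around the original input and the valid pixel range.
        x_adv = x_adv.detach() - step_size * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv
```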
Adversarial Squares: Adversarial Examples from Robust Features
To further illustrate that adversarial examples can be “just bugs”, we show that they can arise even when the true data distribution has no “non-robust features” – that is, no intrinsically vulnerable directions. We are unaware of a satisfactory definition of “non-robust feature”, but we claim that for any reasonable intrinsic definition, this problem has no non-robust features.
In the following toy problem, adversarial vulnerability arises as a consequence of finite-sample overfitting and label noise.
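The exact toy construction is not reproduced in this excerpt, so the sketch below assumes one simple instantiation: each image is a uniformly white ($+1$) or black ($-1$) “square” with small i.i.d. pixel noise, the observed label equals the color but is flipped 10% of the time, and the classifier is a minimum-norm linear interpolator fit to a small number of noisy samples. All sizes and the perturbation budget are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
d, n_train, n_test = 10_000, 100, 1_000   # 100x100 "square" images, few samples
sigma, label_noise, eps = 0.1, 0.1, 0.05  # illustrative parameters

def sample(n):
    # Each image is uniformly "white" (+1) or "black" (-1) with small pixel
    # noise; the observed label equals the color, flipped with prob. label_noise.
    y = (torch.randint(0, 2, (n,)) * 2 - 1).float()
    x = y[:, None] * torch.ones(n, d) + sigma * torch.randn(n, d)
    flip = torch.rand(n) < label_noise
    y_obs = torch.where(flip, -y, y)
    return x, y, y_obs

# Overfit: minimum-norm linear interpolator of the *noisy* training labels.
x_tr, _, y_tr = sample(n_train)
w = x_tr.T @ torch.linalg.solve(x_tr @ x_tr.T, y_tr)

# Worst-case L-infinity perturbation of size eps for a linear classifier.
x_te, y_te, _ = sample(n_test)
x_adv = x_te - eps * y_te[:, None] * w.sign()

print("clean acc:", ((x_te @ w).sign() == y_te).float().mean().item())
print("adv acc:  ", ((x_adv @ w).sign() == y_te).float().mean().item())
```

In this sketch, fitting the flipped labels forces the interpolator to put most of its weight on the noise directions, so the small perturbation along $\mathrm{sign}(w)$ should flip most predictions; setting `label_noise = 0` in the same script should leave the classifier robust at this `eps`, even though the underlying distribution is identical up to label noise.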
Addendum: Data Poisoning via Adversarial Examples
As an addendum, we observe that the “non-robust features” experiment of Ilyas et al. (Section 3.2) directly implies data-poisoning attacks: An adversary that is allowed to imperceptibly change every image in the training set can destroy the accuracy of the learnt classifier – and can moreover apply an arbitrary permutation to the classifier output labels (e.g. swapping cats and dogs).
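Below is a hedged sketch of how such a poisoning attack could be assembled from a standard targeted PGD attack. The source model, dataset tensors, and all hyperparameters are hypothetical; the key step is that a training image with label $y$ is perturbed toward class $\pi^{-1}(y)$, so that a model trained on the perturbed images with their original labels associates the non-robust features of class $c$ with label $\pi(c)$.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, y_targ, eps=8/255, step_size=1/255, steps=50):
    """Standard targeted L-infinity PGD toward class y_targ
    (illustrative hyperparameters)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_targ)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() - step_size * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv

def poison_training_set(source_model, images, labels, perm):
    """Imperceptibly perturb every training image so that a classifier trained
    on (poisoned images, ORIGINAL labels) learns the permuted labelling: an
    image with label y is pushed toward class perm^{-1}(y), so the victim
    associates the non-robust features of class c with label perm(c).
    In practice this would be run over minibatches, not the whole set at once."""
    inv_perm = torch.argsort(perm)      # inverse permutation perm^{-1}
    targets = inv_perm[labels]          # class whose non-robust features we inject
    poisoned = targeted_pgd(source_model, images, targets)
    return poisoned, labels             # labels are left untouched

# Hypothetical usage (source_model, train_x, train_y assumed to exist):
# perm = torch.randperm(num_classes)   # e.g. a permutation swapping cats and dogs
# poisoned_x, poisoned_y = poison_training_set(source_model, train_x, train_y, perm)
# A victim trained on (poisoned_x, poisoned_y) would then be expected to label
# clean test images of class c as perm[c], per the Ilyas et al. Section 3.2 effect.
```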
Conclusions
Our results demonstrate that adversarial examples can be constructed which are non-transferable and do not leak non-robust features. We also show that adversarial examples can arise even when the true data distribution has no non-robust features, and that data-poisoning attacks can be performed using imperceptible changes to the training set.

