Constructing Non-transferable Targeted Adversarial Examples
We demonstrate that there exist adversarial examples which are just “bugs”: aberrations in the classifier that are not intrinsic properties of the data distribution. In particular, we give a new method for constructing adversarial examples which:
- Do not transfer between models
- Do not leak “non-robust features” which allow for learning, in the sense of Ilyas-Santurkar-Tsipras-Engstrom-Tran-Madry
Background
Many have understood Ilyas et al. to claim that adversarial examples are not “bugs”, but are “features”. Specifically, Ilyas et al. postulate the following two worlds:
- World 1: Adversarial examples exploit directions irrelevant for classification (“bugs”). In this world, adversarial examples occur because classifiers behave poorly off-distribution, when they are evaluated on inputs that are not natural images. Here, adversarial examples would occur in arbitrary directions, having nothing to do with the true data distribution.
- World 2: Adversarial examples exploit useful directions for classification (“features”). In this world, adversarial examples occur in directions that are still “on-distribution”, and which contain features of the target class. For example, consider a perturbation that causes an image of a dog to be classified as a cat. In World 2, this perturbation is not purely random, but has something to do with cats. Moreover, we expect this perturbation to transfer to other classifiers trained to distinguish cats from dogs.
Our main contribution is demonstrating that these worlds are not mutually exclusive – and in fact, we are in both. Ilyas et al. show that there exist adversarial examples in World 2, and we show there exist examples in World 1.
Our Construction
Let $f$ be the classifier we wish to attack, and let $\{f_i : \mathbb{R}^n \to \mathcal{Y}\}_i$ be an ensemble of classifiers trained for the same classification problem as $f$. For example, we can let $\{f_i\}$ be a collection of ResNet18s trained from different random initializations.
For an input example $(x, y)$ and target class $y_{\text{targ}}$, we perform iterative updates $x_t$ (starting from $x_0 = x$) to find adversarial examples, as in projected gradient descent (PGD). However, instead of stepping directly in the gradient direction of the target-class loss, we step in the direction:
$$-\left( \nabla_x L(f, x_t, y_{\text{targ}}) + \mathbb{E}_i\!\left[ \nabla_x L(f_i, x_t, y) \right] \right)$$
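Intuitively, the first term pushes $f$ toward the target class, while the second term keeps every classifier in the ensemble confident in the true label $y$, so the resulting perturbation fools $f$ without fooling independently trained models. Below is a minimal PyTorch sketch of one way to implement this step; the function name, hyperparameters (`eps`, `step_size`, `steps`), and the signed-gradient step with $\ell_\infty$ projection are illustrative PGD conventions assumed here, not part of the construction above.

```python
import torch
import torch.nn.functional as F

def non_transferable_targeted_attack(f, ensemble, x, y, y_targ,
                                     eps=8/255, step_size=1/255, steps=50):
    """Targeted PGD toward y_targ on f, while keeping the ensemble confident
    in the true label y (names and hyperparameters are illustrative)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Loss whose negative gradient is the step direction above:
        # target-class loss on f, plus the ensemble's average loss on y.
        loss = F.cross_entropy(f(x_adv), y_targ)
        loss = loss + sum(F.cross_entropy(g(x_adv), y) for g in ensemble) / len(ensemble)
        grad, = torch.autograd.grad(loss, x_adv)
        # Signed step (standard L-infinity PGD convention), then project back
        # into the eps-ball around the original input and the valid pixel range.
        x_adv = x_adv.detach() - step_size * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv
```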
Adversarial Squares: Adversarial Examples from Robust Features
To further illustrate that adversarial examples can be “just bugs”, we show that they can arise even when the true data distribution has no “non-robust features” – that is, no intrinsically vulnerable directions. We are unaware of a satisfactory definition of “non-robust feature”, but we claim that for any reasonable intrinsic definition, this problem has no non-robust features.
In the following toy problem, adversarial vulnerability arises as a consequence of finite-sample overfitting and label noise.
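The exact toy construction is not reproduced in this excerpt, so the sketch below assumes one simple instantiation: each image is a uniformly white ($+1$) or black ($-1$) “square” with small i.i.d. pixel noise, the observed label equals the color but is flipped 10% of the time, and the classifier is a minimum-norm linear interpolator fit to a small number of noisy samples. All sizes and the perturbation budget are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
d, n_train, n_test = 10_000, 100, 1_000   # 100x100 "square" images, few samples
sigma, label_noise, eps = 0.1, 0.1, 0.05  # illustrative parameters

def sample(n):
    # Each image is uniformly "white" (+1) or "black" (-1) with small pixel
    # noise; the observed label equals the color, flipped with prob. label_noise.
    y = (torch.randint(0, 2, (n,)) * 2 - 1).float()
    x = y[:, None] * torch.ones(n, d) + sigma * torch.randn(n, d)
    flip = torch.rand(n) < label_noise
    y_obs = torch.where(flip, -y, y)
    return x, y, y_obs

# Overfit: minimum-norm linear interpolator of the *noisy* training labels.
x_tr, _, y_tr = sample(n_train)
w = x_tr.T @ torch.linalg.solve(x_tr @ x_tr.T, y_tr)

# Worst-case L-infinity perturbation of size eps for a linear classifier.
x_te, y_te, _ = sample(n_test)
x_adv = x_te - eps * y_te[:, None] * w.sign()

print("clean acc:", ((x_te @ w).sign() == y_te).float().mean().item())
print("adv acc:  ", ((x_adv @ w).sign() == y_te).float().mean().item())
```

In this sketch, fitting the flipped labels forces the interpolator to put most of its weight on the noise directions, so the small perturbation along $\mathrm{sign}(w)$ should flip most predictions; setting `label_noise = 0` in the same script should leave the classifier robust at this `eps`, even though the underlying distribution is identical up to label noise.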
Addendum: Data Poisoning via Adversarial Examples
As an addendum, we observe that the “non-robust features” experiment of Ilyas et al. (Section 3.2) directly implies data-poisoning attacks: An adversary that is allowed to imperceptibly change every image in the training set can destroy the accuracy of the learnt classifier – and can moreover apply an arbitrary permutation to the classifier output labels (e.g. swapping cats and dogs).
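Below is a hedged sketch of how such a poisoning attack could be assembled from a standard targeted PGD attack. The source model, dataset tensors, and all hyperparameters are hypothetical; the key step is that a training image with label $y$ is perturbed toward class $\pi^{-1}(y)$, so that a model trained on the perturbed images with their original labels associates the non-robust features of class $c$ with label $\pi(c)$.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, y_targ, eps=8/255, step_size=1/255, steps=50):
    """Standard targeted L-infinity PGD toward class y_targ
    (illustrative hyperparameters)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_targ)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() - step_size * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv

def poison_training_set(source_model, images, labels, perm):
    """Imperceptibly perturb every training image so that a classifier trained
    on (poisoned images, ORIGINAL labels) learns the permuted labelling: an
    image with label y is pushed toward class perm^{-1}(y), so the victim
    associates the non-robust features of class c with label perm(c).
    In practice this would be run over minibatches, not the whole set at once."""
    inv_perm = torch.argsort(perm)      # inverse permutation perm^{-1}
    targets = inv_perm[labels]          # class whose non-robust features we inject
    poisoned = targeted_pgd(source_model, images, targets)
    return poisoned, labels             # labels are left untouched

# Hypothetical usage (source_model, train_x, train_y assumed to exist):
# perm = torch.randperm(num_classes)   # e.g. a permutation swapping cats and dogs
# poisoned_x, poisoned_y = poison_training_set(source_model, train_x, train_y, perm)
# A victim trained on (poisoned_x, poisoned_y) would then be expected to label
# clean test images of class c as perm[c], per the Ilyas et al. Section 3.2 effect.
```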
Conclusions
Our results demonstrate that adversarial examples can be constructed which are non-transferable and do not leak non-robust features. We also show that adversarial examples can arise even when the true data distribution has no non-robust features, and that data-poisoning attacks can be performed using imperceptible changes to the training set.

