
Adversarial Examples are Just Bugs, Too

Constructing Non-transferable Targeted Adversarial Examples

We demonstrate that there exist adversarial examples which are just “bugs”: aberrations in the classifier that are not intrinsic properties of the data distribution. In particular, we give a new method for constructing adversarial examples which:

  1. Do not transfer between models
  2. Do not leak “non-robust features” which allow for learning, in the sense of Ilyas-Santurkar-Tsipras-Engstrom-Tran-Madry

Background

Many have understood Ilyas et al. to claim that adversarial examples are not “bugs”, but are “features”. Specifically, Ilyas et al. postulate the following two worlds:

  • World 1: Adversarial examples exploit directions irrelevant for classification (“bugs”). In this world, adversarial examples occur because classifiers behave poorly off-distribution, when they are evaluated on inputs that are not natural images. Here, adversarial examples would occur in arbitrary directions, having nothing to do with the true data distribution.
  • World 2: Adversarial examples exploit useful directions for classification (“features”). In this world, adversarial examples occur in directions that are still “on-distribution”, and which contain features of the target class. For example, consider a perturbation that causes an image of a dog to be classified as a cat. In World 2, this perturbation is not purely random, but has something to do with cats. Moreover, we expect that this perturbation transfers to other classifiers trained to distinguish cats vs. dogs.

Our main contribution is demonstrating that these worlds are not mutually exclusive – and in fact, we are in both. Ilyas et al. show that there exist adversarial examples in World 2, and we show there exist examples in World 1.

Our Construction

Let $f$ be the target classifier we wish to attack, and let $\{f_i : \mathbb{R}^n \to \mathcal{Y}\}_i$ be an ensemble of classifiers trained for the same classification problem as $f$. For example, we can let $\{f_i\}$ be a collection of ResNet18s trained from different random initializations.

For an input example $(x, y)$ and target class $y_{targ}$, we perform iterative updates to find adversarial examples, as in projected gradient descent (PGD). However, instead of stepping directly in the gradient direction, we step in the direction:

$$-\left( \nabla_x L(f, x_t, y_{targ}) + \mathbb{E}_i\left[ \nabla_x L(f_i, x_t, y) \right] \right)$$

That is, each step decreases the loss of the target model $f$ on the target class $y_{targ}$, while also decreasing the loss of the ensemble on the true class $y$. The second term steers the attack away from perturbations that would also fool independently trained classifiers, which is what makes the resulting adversarial examples non-transferable.
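To make the update concrete, here is a minimal PyTorch-style sketch of this attack. The cross-entropy loss, the signed $\ell_\infty$ step, the projection, and the budget/step-size values are illustrative assumptions on our part; the text above specifies only the step direction.

```python
import torch
import torch.nn.functional as F

def non_transferable_attack(f, ensemble, x, y, y_targ,
                            eps=8 / 255, step_size=2 / 255, n_steps=40):
    """PGD-style targeted attack on f whose perturbation is also penalized for
    fooling an independently trained ensemble {f_i} on the true label y."""
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        # L(f, x_t, y_targ): push the target model toward the target class.
        loss_target = F.cross_entropy(f(x_adv), y_targ)
        # E_i[L(f_i, x_t, y)]: keep the ensemble correct on the true class,
        # which discourages perturbations that would transfer.
        loss_ensemble = torch.stack(
            [F.cross_entropy(g(x_adv), y) for g in ensemble]).mean()
        grad, = torch.autograd.grad(loss_target + loss_ensemble, x_adv)
        # Signed l_inf step in the negative-gradient direction (a standard PGD choice).
        x_adv = x_adv.detach() - step_size * grad.sign()
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```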

Adversarial Squares: Adversarial Examples from Robust Features

To further illustrate that adversarial examples can be “just bugs”, we show that they can arise even when the true data distribution has no “non-robust features” – that is, no intrinsically vulnerable directions. We are unaware of a satisfactory definition of “non-robust feature”, but we claim that for any reasonable intrinsic definition, this problem has no non-robust features.

In this toy problem, adversarial vulnerability arises purely as a consequence of finite-sample overfitting and label noise.
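As an illustration of this mechanism (not the exact experiment), the following numpy sketch instantiates a “squares”-style problem; the dimensions, noise level, label-noise rate, and the choice of a minimum-norm linear interpolator are all our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 100                      # many pixels, few training samples (assumed sizes)

# "Squares": every pixel equals the label y in {-1, +1}, plus small pixel noise.
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] + 0.1 * rng.standard_normal((n, d))

# Label noise: flip roughly 10% of the training labels.
y_noisy = np.where(rng.random(n) < 0.1, -y, y)

# Overfit: the minimum-norm linear interpolator fits the noisy labels exactly (n < d).
w = np.linalg.pinv(X) @ y_noisy

# A "robust" classifier that uses only the robust feature: the mean pixel value.
robust = lambda x: np.sign(x.mean())

# A fresh white square (true label +1) and a small l_inf perturbation against w.
x_test = np.ones(d) + 0.1 * rng.standard_normal(d)
x_adv = x_test - 0.2 * np.sign(w)

print("overfit model:", np.sign(x_test @ w), "->", np.sign(x_adv @ w))   # typically +1 -> -1
print("robust model :", robust(x_test), "->", robust(x_adv))             # stays +1
```

Here the only informative direction in the data is the mean pixel value, which remains robust; the perturbation instead exploits directions the interpolating classifier picked up from pixel noise and mislabeled training examples.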

Addendum: Data Poisoning via Adversarial Examples

As an addendum, we observe that the “non-robust features” experiment of Ilyas et al. (Section 3.2) directly implies data-poisoning attacks: An adversary that is allowed to imperceptibly change every image in the training set can destroy the accuracy of the learnt classifier – and can moreover apply an arbitrary permutation to the classifier output labels (e.g. swapping cats and dogs).
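The sketch below shows the shape of such a poisoning attack, in the spirit of that experiment: the adversary perturbs each training image so that its non-robust features point at a permuted class $\pi(y)$, while leaving the label untouched. The pretrained surrogate model and the `targeted_pgd` helper are hypothetical details we introduce for illustration; any standard targeted attack would do.

```python
import torch

def poison_training_set(dataset, surrogate, perm, targeted_pgd, eps=8 / 255):
    """Imperceptibly perturb every training image (labels untouched) so that a
    classifier trained on the result permutes its outputs at test time.

    `surrogate` is a standard pretrained classifier the adversary uses to find
    non-robust features; `targeted_pgd(model, x, target, eps)` is any targeted
    attack. Both are assumptions made for this sketch.
    """
    poisoned = []
    for x, y in dataset:
        # Give x the non-robust features of class perm[y], but keep label y.
        # A model trained on (x_adv, y) then associates "non-robust features of
        # perm[y]" with the label y, so at test time natural images of class c
        # tend to be predicted as the inverse permutation applied to c.
        target = torch.tensor([perm[int(y)]])
        x_adv = targeted_pgd(surrogate, x.unsqueeze(0), target, eps).squeeze(0)
        poisoned.append((x_adv, y))
    return poisoned
```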

Conclusions

Our results demonstrate that adversarial examples can be constructed which are non-transferable and do not leak non-robust features. We also show that adversarial examples can arise even when the true data distribution has no non-robust features, and that data-poisoning attacks can be performed using imperceptible changes to the training set.

Frequently Asked Questions

Q1: What are adversarial examples?

A1: Adversarial examples are inputs to a machine learning model that are designed to cause the model to make a mistake. They are often used to test the robustness of a model to attacks.

Q2: What is the difference between a bug and a feature?

A2: A bug is an error or flaw in a system, whereas a feature is an intended, useful characteristic. In the context of this discussion, a “bug” is an aberration of a particular trained classifier that is not an intrinsic property of the data distribution, while a “feature” is a genuinely predictive property of the data distribution that classifiers learn to use.

Q3: How do adversarial examples work?

A3: Adversarial examples work by adding a small, carefully optimized perturbation to the original input. The perturbation is typically imperceptible to a human, but it is chosen, for example by following the gradient of the model’s loss, to push the model’s prediction toward an incorrect class.

Q4: Why are adversarial examples important?

A4: Adversarial examples are important because they expose weaknesses in a model that ordinary test accuracy does not reveal. They are used to evaluate a model’s robustness to attacks, and they can also be used to improve it, for example through adversarial training, in which adversarial examples are included in the training data.

Q5: What are some examples of adversarial examples?

A5: Some examples of adversarial examples include images that are designed to be misclassified by a neural network, or text that is designed to be misclassified by a natural language processing model. These types of examples can be used to test the robustness of a model and to identify weaknesses in the model.

Q6: How do I create adversarial examples?

A6: Creating adversarial examples requires some understanding of the model being attacked and of the threat model. Common approaches include gradient-based methods such as the fast gradient sign method (FGSM) and projected gradient descent (PGD) when the model’s gradients are available, and black-box methods that rely on query access to the model or on transfer from a substitute model.
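For instance, the fast gradient sign method (FGSM) perturbs the input with a single signed step along the gradient of the loss. A minimal PyTorch sketch follows; the model, labels, and the budget `eps` are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Fast gradient sign method: a single signed-gradient step that increases
    the model's loss on the true label y, within an l_inf budget of eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    # Move each pixel by eps in the direction that increases the loss.
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()
```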
