Learning from Errors
Section 3.2 of Ilyas et al. (2019) shows that training a model only on adversarial errors leads to non-trivial generalization on the original test set. We show that these experiments are a special case of a more general phenomenon: learning from errors.
A Counterintuitive Result
We take a completely mislabeled training set (without modifying the inputs) and use it to train a model that generalizes to the original test set. We then show that this result, and the results of Ilyas et al. (2019), are special cases of model distillation.
We begin with the following question: what if we took the images in the training set (without any adversarial perturbations) and mislabeled them? Since the inputs are untouched and only the labels are wrong, intuition says that a model trained on this dataset should not generalize to the correctly-labeled test set. Nevertheless, we show that this intuition fails: such a model can generalize.
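To make the construction concrete, here is a minimal sketch of how such a mislabeled dataset could be assembled, under one illustrative assumption: the incorrect labels come from a previously trained model's own mistaken predictions on the unmodified inputs, which is one concrete way information about that model can leak into the labels. The helper names train_model and evaluate are hypothetical placeholders, not part of any library, and this is not the exact experimental pipeline.

```python
import torch

@torch.no_grad()
def build_mislabeled_dataset(teacher, X_train, y_train):
    """Keep only the clean inputs the teacher model misclassifies,
    relabeled with the teacher's (incorrect) predictions.
    This is one illustrative way to obtain a fully mislabeled,
    unmodified-input training set."""
    preds = teacher(X_train).argmax(dim=1)
    wrong = preds != y_train
    return X_train[wrong], preds[wrong]

# Hypothetical usage (train_model / evaluate are placeholder helpers):
#   teacher = train_model(X_train, y_train)            # original model
#   X_mis, y_mis = build_mislabeled_dataset(teacher, X_train, y_train)
#   student = train_model(X_mis, y_mis)                # sees only wrong labels
#   evaluate(student, X_test, y_test)                  # non-trivial accuracy
```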
Two-dimensional Illustration of Model Distillation
We construct a dataset of adversarial examples using a two-dimensional binary classification problem. We generate 32 random two-dimensional data points in [0,1]^2 and assign each point a random binary label. We then train a small feed-forward neural network on these examples; it predicts all 32 points correctly (panel (a) of the Figure below).
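A minimal PyTorch sketch of this setup appears below. The architecture (one 64-unit hidden layer), optimizer, and step count are illustrative choices, not necessarily the configuration behind the Figure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 32 random points in [0, 1]^2, each with a random binary label.
X = torch.rand(32, 2)
y = torch.randint(0, 2, (32,))

# A small feed-forward network for binary classification.
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Fit until the network memorizes the training set (panel (a)).
for step in range(2000):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"training accuracy: {train_acc:.2f}")  # typically 1.00 (32/32)
```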
Next, we create adversarial examples for the original model using an l∞ ball of radius 0.12. In panel (a) of the Figure, we display the ϵ-ball around each training point. In panel (b), we show the adversarial examples that cause the model to change its prediction from correct to incorrect. We train a new feed-forward neural network on this dataset, resulting in the model shown in panel (c).
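Continuing the sketch above, the code below illustrates one way to carry out this step. It assumes an untargeted l∞ PGD attack of radius 0.12 and assumes each retained adversarial point is labeled with the original model's new (flipped) prediction; both choices are assumptions for illustration, since the attack and the labeling of the new dataset are not spelled out above.

```python
def pgd_linf(model, X, y, eps=0.12, alpha=0.02, steps=40):
    """Projected gradient ascent on the loss within an l∞ ball of radius eps
    (an illustrative untargeted attack; step size and iteration count are
    arbitrary choices)."""
    loss_fn = nn.CrossEntropyLoss()
    X_adv = X.clone().detach()
    for _ in range(steps):
        X_adv.requires_grad_(True)
        loss = loss_fn(model(X_adv), y)
        grad, = torch.autograd.grad(loss, X_adv)
        with torch.no_grad():
            X_adv = X_adv + alpha * grad.sign()
            X_adv = torch.max(torch.min(X_adv, X + eps), X - eps)  # project into the ε-ball
            X_adv = X_adv.clamp(0.0, 1.0)                          # stay inside [0, 1]^2
    return X_adv.detach()

X_adv = pgd_linf(model, X, y)
preds_adv = model(X_adv).argmax(dim=1)

# Keep only the points whose prediction flipped from correct to incorrect,
# labeled with the model's new (incorrect) prediction (panel (b)).
flipped = preds_adv != y
X_new, y_new = X_adv[flipped], preds_adv[flipped]

# Train a fresh network on this adversarial dataset (panel (c)).
new_model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(new_model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for step in range(2000):
    opt.zero_grad()
    loss_fn(new_model(X_new), y_new).backward()
    opt.step()
```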
Conclusion
Our experiments show that a model can generalize from a completely mislabeled training set, even when the inputs themselves are unmodified. This is a special case of model distillation, in which information about the original model is “leaked” into the mislabeled examples.
FAQs
Q: Why does this happen?
A: The phenomenon is a form of model distillation: information about the original model is “leaked” into the mislabeled examples through their labels, and the new model recovers that information during training.
Q: Is this a problem?
A: Not in itself. It shows that a model can generalize from a mislabeled training set because the labels carry information beyond their face value; whether that is useful or harmful depends on how it is exploited (see the next question).
Q: Can this be used for good or bad?
A: This phenomenon can be used for both good and bad. For example, labels produced by one model can be used to train another, as in model distillation; conversely, carefully constructed labels could be used to create a backdoor in a model.
Q: Is this a new phenomenon?
A: No. The underlying mechanism has been studied before in the context of model distillation. Our experiments show that it also operates when training on completely mislabeled datasets whose inputs are left unmodified.

