Rethinking Adversarial Examples: How Errors Reveal True Features in Neural Networks
<p>Adversarial examples have long puzzled researchers, but work by Ilyas and colleagues suggests these perturbations are not mere bugs: they exploit genuinely predictive, if non-robust, features of the data. This Q&A explores how training a model exclusively on adversarial errors can still yield meaningful generalization, and what that implies for robustness and learning. Jump to: <a href="#q1">Why are adversarial examples considered features?</a> | <a href="#q2">What does training on adversarial errors reveal?</a> | <a href="#q3">How does Section 3.2 demonstrate generalization?</a> | <a href="#q4">What are the implications for adversarial robustness?</a> | <a href="#q5">How does this link to learning from incorrectly labeled data?</a></p>
<h2 id="q1">Why are adversarial examples considered features rather than bugs?</h2>
<p>Adversarial examples are often viewed as flaws in neural networks, but Ilyas et al. argue they are better understood as <strong>features</strong>: predictive patterns that arise from the data distribution itself. In standard training, models latch onto non-robust features that correlate with the labels but are easily perturbed, and adversarial perturbations exploit exactly these fragile correlations. When a model is trained solely on such perturbed examples (which are mislabeled with respect to the true distribution), it still learns features that generalize to the original test set. This suggests that the perturbations encode meaningful structure rather than random noise. Adversarial examples, then, are not bugs; they are indicators of how models rely on specific, often brittle, characteristics of the data.</p>
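<p>To make "exploiting fragile correlations" concrete, here is a minimal sketch of projected gradient descent (PGD), a standard way to craft such perturbations. It is an illustration rather than the authors' exact setup; the function name, epsilon, step size, and step count are illustrative assumptions.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Craft a small L-infinity-bounded perturbation of x that increases the loss on (x, y)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Step in the direction that most increases the loss,
            # then project back into the eps-ball around x and the valid pixel range.
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
        x_adv = x_adv.detach()
    return x_adv
</code></pre>
<p>The perturbation stays imperceptibly small, yet it can flip the prediction because it targets the features the model relies on rather than the image's true content.</p>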
<h2 id="q2">What does training on adversarial errors reveal about model learning?</h2>
<p>Training a model exclusively on adversarial errors (cases where the original model misclassifies an adversarially perturbed input) shows that even these mislabeled examples contain learnable patterns. The new model achieves non-trivial accuracy on the original test set, implying that the errors are systematic rather than random. This aligns with the idea that adversarial perturbations highlight latent features that are <em>predictive</em> but non-robust: the model effectively learns from mistakes, capturing the underlying data structure despite the incorrect labels. It also challenges the assumption that clean labels are essential for generalization and underscores that robustness is a property distinct from accuracy.</p>
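<p>A rough sketch of how such an "errors only" training set could be assembled (the helper names are hypothetical, not the paper's code): keep only the perturbed inputs the original model gets wrong, and pair each with the model's incorrect prediction instead of the true label.</p>
<pre><code class="language-python">import torch
from torch.utils.data import TensorDataset

def build_error_dataset(model, loader, attack):
    """Collect perturbed inputs the original model misclassifies,
    labeled with the model's own (wrong) predictions."""
    xs, ys = [], []
    model.eval()
    for x, y in loader:
        x_adv = attack(model, x, y)          # e.g. the pgd_perturb sketch above
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        wrong = pred != y                    # keep only the model's errors
        xs.append(x_adv[wrong])
        ys.append(pred[wrong])               # assign the incorrect predicted label
    return TensorDataset(torch.cat(xs), torch.cat(ys))
</code></pre>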
<h2 id="q3">How does Section 3.2 of Ilyas et al. (2019) demonstrate generalization from adversarial errors?</h2>
<p>In Section 3.2, the authors train a new model using only inputs that were adversarially perturbed and misclassified by a pretrained model. Although the training labels are incorrect relative to the true dataset, the new model performs significantly above chance on the original test set. This is a specific case of <strong>learning from errors</strong>: the incorrect labels are not random but reflect consistent feature correlations. The result shows that the information needed for generalization is still present in these perturbed examples, reinforcing the view that adversarial examples are structured signals rather than noise.</p>
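<p>The overall shape of the experiment, sketched below under assumed architecture, optimizer, and hyperparameters (the paper's actual configuration differs): train a fresh model on the mislabeled set from the previous sketch, then measure its accuracy on the untouched, correctly labeled test set.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train_on_errors_and_eval(fresh_model, error_dataset, clean_test_loader,
                             epochs=10, lr=0.1, device="cpu"):
    """Train a new model only on the mislabeled adversarial set,
    then report its accuracy on the original, correctly labeled test set."""
    fresh_model.to(device).train()
    opt = torch.optim.SGD(fresh_model.parameters(), lr=lr, momentum=0.9)
    loader = DataLoader(error_dataset, batch_size=128, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            F.cross_entropy(fresh_model(x), y).backward()
            opt.step()

    # Accuracy well above chance here is the evidence that the "errors"
    # carry structured, transferable features rather than noise.
    fresh_model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in clean_test_loader:
            x, y = x.to(device), y.to(device)
            correct += (fresh_model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
</code></pre>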
<h2 id="q4">What are the implications of this finding for adversarial robustness?</h2>
<p>If adversarial examples are features, then achieving robustness requires training models to rely on robust features (e.g., object shape) rather than non-robust ones (e.g., texture). Standard adversarial training pushes in this direction by augmenting the data with adversarial examples, forcing the model to discard brittle correlations. However, the Ilyas et al. finding suggests that simply training on adversarial errors does not by itself guarantee robustness; it may merely swap one set of non-robust features for another. For robust generalization, models must learn to prioritize stable patterns. This reframes adversarial robustness as a feature-selection problem and points future defenses toward understanding and leveraging robust feature representations.</p>
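<p>For contrast, here is a bare-bones sketch of one adversarial training step (a simplification; practical defenses tune the attack strength, schedule, and regularization carefully). Unlike the error-training experiment, the perturbed inputs keep their <em>true</em> labels, so the model is penalized whenever it leans on features the attack can flip.</p>
<pre><code class="language-python">import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, attack):
    """One training step on perturbed inputs that keep their TRUE labels."""
    model.eval()                  # craft the perturbation against the current model
    x_adv = attack(model, x, y)   # e.g. the pgd_perturb sketch above
    model.train()
    optimizer.zero_grad()
    # Penalize mistakes on the perturbed inputs so the model cannot keep
    # relying on correlations the attack is able to flip.
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
</code></pre>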
<h2 id="q5">How does this work connect to learning from incorrectly labeled data?</h2>
<p>The experiment of training on adversarial errors is a direct instance of learning from mislabeled data. Although the labels are wrong (e.g., a perturbed cat image labeled as a dog), the model still extracts correlations that transfer to clean data, which indicates that some errors are <em>informative</em>: they preserve the underlying data structure. The authors show that adversarial errors are not random but arise from specific, repeated feature patterns. This connects to research on noisy labels: not all mislabeling is detrimental, and structured noise can still teach a model useful representations. In some settings, learning from errors may even be more efficient than learning from purely clean data, provided the errors reflect systematic features.</p>
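<p>One way to see why structured errors differ from arbitrary ones, sketched with a hypothetical helper: build a control set with the same inputs but uniformly random labels. A model trained on the adversarially relabeled set transfers to clean data, while a model trained on the random-label control should sit near chance; that gap is what makes the adversarial errors "informative."</p>
<pre><code class="language-python">import torch
from torch.utils.data import TensorDataset

def random_label_control(error_dataset, num_classes):
    """Control condition: the same inputs, but labels drawn uniformly at random.

    Training on this set is expected to give near-chance accuracy on clean data,
    whereas the adversarially relabeled set (structured noise) transfers."""
    xs = torch.stack([x for x, _ in error_dataset])
    ys = torch.randint(0, num_classes, (len(error_dataset),))
    return TensorDataset(xs, ys)
</code></pre>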