Picture a group of medical professionals relying on an advanced neural network to identify cancer in mammogram scans. While this artificial intelligence system might appear to function accurately, it could be fixating on image elements coincidentally linked to tumors—such as watermarks or timestamps—rather than genuine indicators of cancerous cells.
To evaluate such AI models, researchers use "feature-attribution methods," techniques intended to reveal which parts of an image most strongly influence the neural network's prediction. But what if the attribution method misses features that are important to the model? Because researchers don't know in advance which features actually matter, they have no way of telling whether the attribution method itself is working.
To address this fundamental challenge, MIT scientists have developed an innovative procedure that alters original data to guarantee certainty about which features genuinely matter to the model. Subsequently, they utilize this modified dataset to assess whether feature-attribution techniques can accurately pinpoint those significant features.
They found that even the most widely used attribution methods often fail to identify the features that are critical to a model's prediction, with some techniques performing scarcely better than chance. These findings matter because neural networks are increasingly deployed in high-stakes settings such as healthcare diagnostics. If the network is malfunctioning, and the tools meant to catch that malfunction also fail, human specialists may never realize they are being misled by a flawed model, explains lead author Yilun Zhou, an electrical engineering and computer science graduate student at the Computer Science and Artificial Intelligence Laboratory (CSAIL).
"These attribution methods enjoy widespread application, particularly in extremely high-stakes scenarios like cancer detection through X-rays or CT scans. Yet our research demonstrates that these feature-attribution approaches may be fundamentally flawed. They might highlight elements that don't align with the actual features the model employs to make predictions—a situation we've found occurs frequently. If you intend to use these feature-attribution methods to validate a model's correct operation, you must first ensure the attribution method itself functions properly," he explains.
Zhou authored the paper alongside EECS graduate student Serena Booth, Microsoft Research scientist Marco Tulio Ribeiro, and senior author Julie Shah, who serves as an MIT professor of aeronautics and astronautics and directs the Interactive Robotics Group within CSAIL.
Concentrating on Critical Features
In image classification tasks, every pixel represents a potential feature the neural network might leverage for predictions—creating millions of possibilities. For instance, if researchers aim to design an algorithm helping emerging photographers enhance their skills, they might train a model to differentiate between professional and amateur photographs. This system could evaluate how closely amateur images resemble professional ones and even offer specific improvement suggestions. Researchers would want this model to focus on identifying artistic elements during training—such as color composition, framing techniques, and post-processing approaches. However, professional photographs typically contain the photographer's watermark, while tourist photos generally don't—creating a situation where the model might simply take the shortcut of identifying the watermark.
"Clearly, we wouldn't want to suggest to aspiring photographers that watermark placement alone guarantees professional success. Instead, we need to ensure our model concentrates on genuine artistic elements rather than watermark presence. While using feature attribution methods to analyze our model seems tempting, ultimately, no guarantee exists that they function correctly—since the model might utilize artistic features, the watermark, or any other characteristics," Zhou notes.
"We remain unaware of what spurious correlations exist within datasets. Numerous factors might prove completely imperceptible to humans—such as image resolution," Booth adds. "Even when undetectable to us, neural networks can likely extract these features and employ them for classification. This represents the underlying challenge: we don't fully comprehend our datasets, yet achieving complete understanding remains practically impossible."
The researchers modified the dataset by removing all correlations between the original images and their labels, which ensured that none of the original features would matter to the model anymore.
Next, they introduced a new feature into images so conspicuous that neural networks had to focus on it to make predictions—such as brightly colored rectangles in different hues for various image classes.
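A minimal sketch of how such a modified dataset might be constructed, assuming images are stored as NumPy arrays; the function names, rectangle placement, and color choices below are illustrative, not the researchers' actual code:

```python
import numpy as np

# One distinct, conspicuous color per class; after the labels are shuffled,
# this rectangle is the only feature that predicts the label.
CLASS_COLORS = {0: (255, 0, 0), 1: (0, 0, 255)}

def make_modified_dataset(images, labels, rect_size=16, seed=0):
    """images: uint8 array of shape (N, H, W, 3); labels: int array of shape (N,)."""
    rng = np.random.default_rng(seed)
    new_labels = rng.permutation(labels)        # break every image-label correlation
    new_images = images.copy()
    for img, lab in zip(new_images, new_labels):
        img[:rect_size, :rect_size] = CLASS_COLORS[int(lab)]  # stamp the class cue
    return new_images, new_labels
```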
"We can confidently assert that any model achieving high prediction accuracy must focus on the colored rectangle we introduced. This allows us to observe whether these feature-attribution methods prioritize highlighting that specific location rather than everything else," Zhou explains.
Particularly Concerning Findings
They applied this technique to numerous feature-attribution methods. For image classification tasks, these approaches generate what experts call saliency maps—visualizations displaying the concentration of important features across an entire image. For example, when neural networks classify bird images, the saliency map might indicate that 80% of important features cluster around the bird's beak.
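As a rough illustration of how such a map can be produced, here is a generic gradient-based saliency sketch in PyTorch; it is one of the simplest attribution techniques and stands in for, rather than reproduces, the methods evaluated in the paper:

```python
import torch

def gradient_saliency(model, image):
    """image: float tensor of shape (1, 3, H, W); returns an (H, W) saliency map."""
    image = image.clone().requires_grad_(True)
    score = model(image).max(dim=1).values.sum()   # score of the predicted class
    score.backward()                               # gradients w.r.t. input pixels
    return image.grad.abs().max(dim=1).values.squeeze(0)  # per-pixel importance
```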
After removing all correlations in the image data, the researchers manipulated the photographs in several ways, such as blurring parts of an image, adjusting its brightness, or adding a watermark. If a feature-attribution method works correctly, nearly 100% of the important features should be located around the area the researchers modified.
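Under the assumption that the modified region is known exactly, that check can be phrased as a simple coverage score. The sketch below uses illustrative names and is not the paper's actual metric:

```python
import numpy as np

def saliency_coverage(saliency, region_mask):
    """saliency: (H, W) non-negative importance scores; region_mask: (H, W) bool
    array marking the modified pixels. Returns the fraction of saliency mass
    inside the modified region (ideally close to 1.0)."""
    saliency = np.abs(np.asarray(saliency, dtype=float))
    total = saliency.sum()
    return float(saliency[region_mask].sum() / total) if total > 0 else 0.0
```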
The results were discouraging. No feature-attribution method came close to the 100% goal; most barely reached the random baseline of 50%, and some performed even worse in certain instances. Even though the new feature was the only one the model could use to make a prediction, the feature-attribution methods sometimes failed to identify it.
"None of these methods demonstrate consistent reliability across different types of spurious correlations. This proves particularly alarming because, with real-world datasets, we don't know which spurious correlations might apply," Zhou states. "Numerous factors could be involved. We believed we could trust these methods to inform us, but our experiments suggest placing trust in them remains challenging."
All of the feature-attribution methods they studied were better at detecting an anomaly than at recognizing its absence. In other words, these methods could find a watermark more easily than they could confirm that an image contained no watermark. As a result, it would be harder for humans to trust a model that makes a negative prediction.
The team's work underscores the critical importance of testing feature-attribution methods before implementing them in real-world models—especially in high-stakes applications.
"Researchers and practitioners might employ explanation techniques like feature-attribution methods to build trust in models, but that trust lacks foundation unless the explanation technique undergoes rigorous evaluation first," Shah notes. "While explanation techniques might help calibrate a person's trust in a model, it's equally vital to calibrate trust in the model's explanations themselves."
Looking ahead, the researchers aim to apply their evaluation procedure to investigate more subtle or realistic features potentially causing spurious correlations. Another research direction they hope to explore involves helping humans better understand saliency maps, enabling improved decision-making based on neural network predictions.
This research received partial support from the National Science Foundation.