New research uncovers hidden evidence of spurious correlations, and offers a technique for improving accuracy.
MIT researchers have identified significant examples of machine-learning models failing when applied to data different from the data they were trained on, raising questions about the need to test models each time they are deployed in a new setting.
“We show that even when you train models on large amounts of data and select the best average model, in a new setting this ‘best model’ will be the worst model for 6 to 75 percent of the new data,” says Marzyeh Ghassemi, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science, and a principal investigator at the Laboratory for Information and Decision Systems.
In a paper presented at the Neural Information Processing Systems (NeurIPS 2025) conference in December, the researchers show that models trained to diagnose disease in chest X-rays at one hospital, for example, may appear effective at a different hospital on average. Their evaluation, however, revealed that some of the best-performing models at the first hospital were the worst-performing on up to 75 percent of patients at the second hospital; when all of the second hospital’s patients are aggregated, high average performance hides this failure.
Their findings show that spurious correlations, which are often thought to be mitigated simply by improving a model’s overall performance on observed data, still occur and remain a risk to a model’s trustworthiness in new settings. A simple example of a spurious correlation is when a machine-learning system, not having “seen” many cows pictured on a beach, classifies a photo of a beach-going cow as an orca purely because of its background. In many cases, including the areas the researchers examined, such as chest X-rays, cancer histopathology images, and hate speech detection, such spurious correlations are much harder to detect.
In the case of a medical diagnosis model trained on chest X-rays, for example, the model may have learned to correlate a specific, irrelevant marking on one hospital’s X-rays with a certain pathology. At another hospital where the marking is not used, that pathology could be missed.
Previous research by Ghassemi’s group has shown that models can spuriously correlate factors such as age, gender, and race with medical findings. If, for example, a model has been trained on chest X-rays of mostly older people with pneumonia and hasn’t “seen” as many X-rays from younger people, it may predict that only older patients have pneumonia.
“We want models to learn to look at the anatomical features of the patient and then make a decision based on that,” says Olawale Salaudeen, an MIT postdoc and the lead author of the paper, “but anything in the data that’s correlated with a decision can be used by the model. And those correlations may not be robust to changes in the environment, making the model’s predictions unreliable sources of decision-making.”
Spurious correlations contribute to the risks of biased decision-making. In the NeurIPS paper, the researchers showed, for example, that chest X-ray models that improved overall diagnostic performance actually performed worse on patients with pleural conditions or enlarged cardiomediastinum, meaning enlargement of the heart or central chest cavity.
Other authors of the paper include PhD students Haoran Zhang and Kumail Alhamoud, EECS Assistant Professor Sara Beery, and Ghassemi.
While previous work has largely assumed that models ordered best-to-worst by performance will keep that order when applied in new settings, a phenomenon referred to as accuracy-on-the-line, the researchers were able to demonstrate examples in which the best-performing models in one setting were the worst-performing in another.
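One informal way to see whether accuracy-on-the-line holds for a collection of trained models is to compare how the models rank on data from the training setting versus data from a new setting. The short sketch below illustrates that check; the accuracy values are made up for illustration and are not results from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical accuracies for the same five models, measured on data from the
# training setting (in-distribution) and from a new setting (out-of-distribution).
id_accuracy = np.array([0.92, 0.89, 0.85, 0.81, 0.78])
ood_accuracy = np.array([0.64, 0.70, 0.69, 0.73, 0.75])

# If accuracy-on-the-line held, the best in-distribution models would also be
# the best out-of-distribution, and this rank correlation would be strongly positive.
rho, _ = spearmanr(id_accuracy, ood_accuracy)
print(f"Rank correlation between ID and OOD accuracy: {rho:.2f}")

# A negative correlation, as in this toy example, is the kind of reversal the
# researchers report: models that look best in the first setting do worst in the second.
if rho < 0:
    print("Model ordering reverses across settings; average accuracy is a poor guide here.")
```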
Salaudeen devised an algorithm called OODSelect to find examples where accuracy-on-the-line breaks down. In essence, he trained many models on in-distribution data, meaning data from the first setting, and calculated their accuracy. He then applied the models to data from the second setting. When the models with the best accuracy on the first-setting data were wrong on a large percentage of examples in the second setting, those examples identified the problem subsets, or sub-populations. Salaudeen also emphasizes the dangers of evaluating with aggregate statistics, which can obscure more granular and consequential information about model performance.
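As a rough illustration of this selection idea, here is a minimal Python sketch. The function name, matrix layout, and thresholds are assumptions made for the example; they are not the authors’ released implementation.

```python
import numpy as np

def select_problem_subset(id_accuracies, ood_correct, top_k=10, frac=0.25):
    """Sketch of the idea behind OODSelect (details assumed, not the released code).

    id_accuracies : (n_models,) accuracy of each trained model on in-distribution data
    ood_correct   : (n_models, n_examples) 0/1 matrix, whether each model classified
                    each out-of-distribution example correctly
    top_k         : how many of the best in-distribution models to inspect
    frac          : fraction of OOD examples to return as the problem subset
    """
    # Rank models by their in-distribution accuracy, best first.
    best_models = np.argsort(id_accuracies)[::-1][:top_k]

    # For each OOD example, how often do the top in-distribution models get it right?
    per_example_acc = ood_correct[best_models].mean(axis=0)

    # The problem subset: examples that the "best" models miss most often.
    n_select = int(frac * ood_correct.shape[1])
    return np.argsort(per_example_acc)[:n_select]
```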
In the course of their work, the researchers separated out the most misclassified examples so as not to conflate spurious correlations within a dataset with cases that are simply hard to classify.
With the NeurIPS paper, the researchers are releasing their code and a number of the identified subsets for future work.
Once a hospital, or any organization using machine learning, identifies subsets on which a model performs poorly, that information can be used to improve the model for its particular task and setting. The researchers suggest that future work adopt OODSelect to highlight targets for evaluation and to design approaches that improve performance more consistently.
“We hope the released code and OODSelect subsets become a stepping stone,” the researchers write, “toward benchmarks and models that confront the adverse effects of spurious correlations.”