
The artificial intelligence model showed great promise in predicting which patients treated in U.S. Veterans Affairs hospitals would experience a sudden decline in kidney function. But it also came with a crucial caveat: Women represented only about 6% of the patients whose data were used to train the algorithm, and it performed worse when tested on women.

The shortcomings of that high-profile algorithm, built by the Google sister company DeepMind, highlight a problem that machine learning researchers working in medicine are increasingly worried about. And it’s an issue that may be more pervasive — and more insidious — than experts previously realized, new research suggests.


The study, led by researchers in Argentina and published Monday in the journal PNAS, found that when female patients were excluded from or significantly underrepresented in the training data used to develop a machine learning model, the algorithm performed worse in diagnosing them when tested across a wide range of medical conditions affecting the chest area. The same pattern was seen when men were left out or underrepresented.

“It’s such a valuable cautionary tale about how bias gets into algorithms,” said Ziad Obermeyer, a University of California, Berkeley, physician who studies machine learning and its clinical and health policy applications. “The combination of their results, with the fact that the datasets that these algorithms are trained on often don’t pay attention to these measures of diversity, feels really important,” added Obermeyer, who was not involved in the study.

The researchers in Argentina focused on one of the most popular applications of AI in medicine: analyzing images to try to make a diagnosis. The systems they examined were tasked with analyzing an X-ray image of the chest region to detect the presence or absence of 14 medical conditions including hernias, pneumonia, and an enlargement of the heart.


The researchers evaluated three open-source machine learning algorithms — known as DenseNet-121, ResNet, and Inception-v3 — that are widely used experimentally in the research community, but are not yet deployed commercially for use in the clinic. The researchers set out to train the models on data from two open-source datasets — maintained by the National Institutes of Health and Stanford University — containing chest X-ray images from tens of thousands of patients.

Those two datasets are reasonably balanced by sex — 56.5% of the images in the NIH dataset are from male patients, compared to roughly 60% in the Stanford dataset — and so in a normal research setting, AI researchers using these datasets wouldn’t have to worry much about sex skew in their training data.

But for the purposes of their experiment, the researchers purposefully introduced skew by looking at just a subset of those images broken down starkly along the lines of biological sex. They set up five different training datasets featuring varying breakdowns: 100% images from female patients, 100% images from male patients, 75% images from female patients and 25% images from male patients, 75% images from male patients and 25% images from female patients, and a 50/50 split. (The researchers accounted for sample size by using the same number of images in each dataset.)
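To make the setup concrete, here is a minimal sketch of how five equally sized training sets with those sex breakdowns could be drawn from a larger pool. It is an illustration under stated assumptions, not the researchers' code: the `records` list, the `make_skewed_subset` helper, and the subset size of 400 are all hypothetical stand-ins.

```python
import random
from collections import Counter

def make_skewed_subset(records, female_fraction, size, seed=0):
    """Draw `size` records so that `female_fraction` of them come from female patients."""
    rng = random.Random(seed)
    females = [r for r in records if r["sex"] == "F"]
    males = [r for r in records if r["sex"] == "M"]
    n_female = round(size * female_fraction)
    n_male = size - n_female
    if n_female > len(females) or n_male > len(males):
        raise ValueError("not enough images of one sex for this split")
    subset = rng.sample(females, n_female) + rng.sample(males, n_male)
    rng.shuffle(subset)
    return subset

# Hypothetical stand-in for an image index (assumption: 1,000 images per sex).
records = [{"image_id": i, "sex": "F" if i % 2 else "M"} for i in range(2000)]

# The five breakdowns described above, expressed as the share of female images.
SPLITS = {
    "100% female": 1.0,
    "75% female": 0.75,
    "50/50": 0.5,
    "75% male": 0.25,
    "100% male": 0.0,
}
SIZE = 400  # assumed; the same size for every subset, so only the sex mix varies

subsets = {name: make_skewed_subset(records, frac, SIZE) for name, frac in SPLITS.items()}
for name, subset in subsets.items():
    print(name, dict(Counter(r["sex"] for r in subset)))
```

Holding `SIZE` fixed across all five subsets mirrors the point in the parenthetical above: any difference in performance can then be attributed to the sex mix rather than to one model simply seeing more images.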

The researchers trained the algorithms on each dataset, and then tested them on images from either male or female patients. The trend was clear: Across medical conditions, the algorithms performed worse when tested in patients whose sex was underrepresented in the training data. And being overrepresented also didn’t seem to put either sex at an advantage: When the model was trained exclusively or mostly on women, it didn’t perform better when tested on women compared to when the training data were evenly split by sex.
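A comparison like this comes down to scoring the same model separately on each sex. The sketch below shows one way to do that, assuming a single condition and using area under the ROC curve as the metric. The article does not say which metric the study used, so that choice, the helper names `auc` and `auc_by_sex`, and the scores themselves are assumptions made up for illustration; they are not results from the paper.

```python
def auc(labels, scores):
    """Chance that a random positive case outscores a random negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_sex(rows):
    """rows: iterable of (sex, true_label, model_score). Returns {sex: AUC}."""
    groups = {}
    for sex, label, score in rows:
        labels, scores = groups.setdefault(sex, ([], []))
        labels.append(label)
        scores.append(score)
    return {sex: auc(labels, scores) for sex, (labels, scores) in groups.items()}

# Made-up test-set scores for illustration only; not data from the study.
rows = [
    ("M", 1, 0.9), ("M", 0, 0.2), ("M", 1, 0.7), ("M", 0, 0.4),
    ("F", 1, 0.6), ("F", 0, 0.7), ("F", 1, 0.8), ("F", 0, 0.3),
]
per_sex = auc_by_sex(rows)
print(per_sex)
```

Reporting the metric per group, rather than one pooled number, is what lets a gap between the sexes show up at all; a pooled score can look healthy while one group fares noticeably worse.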

Enzo Ferrante, the senior author of the study, said the research was inspired in part by an uproar sparked in 2015 when a Google image recognition algorithm — inadequately trained on images of people with dark skin — mistakenly characterized photos of Black people as gorillas.

The findings published in the PNAS paper should reiterate to AI researchers how important it is to use diverse training data that “includes all the characteristics of the people in which you will test the model,” said Ferrante, who studies medical image computing at a top university and a leading research institute in Argentina.

The research did not interrogate why, exactly, a model trained on men fared worse when tested on women. Some of it is likely physiological (men’s and women’s chests are anatomically different, after all), but there may also be other factors at play.

For example, women may get diagnosed or get X-rays taken earlier or later in the progression of their disease compared to men, which could affect how those images appear in aggregate in the training data, said Irene Chen, a Ph.D. student at MIT’s Computer Science and Artificial Intelligence Lab who studies equitable machine learning for clinical decision making and was not involved in the PNAS paper.

Those potential differences in the way men and women are diagnosed represent “a much more troubling mechanism, because it means that those biases are built into the outcomes that are coded in the dataset,” Obermeyer said.

The challenges of equitable sex representation in training data loom larger in certain cases. For some diseases and populations, even the most conscientious AI researchers have no choice but to work with a dataset that’s extremely skewed in terms of sex.

Take autism, which is diagnosed at a significantly higher rate in boys than in girls, in part due to the fact that the condition manifests differently between sexes. And researchers studying military populations — such as the DeepMind team, which unveiled its algorithm to predict a kidney condition last summer in the journal Nature — must also work with data that skews heavily male.

On the flip side, researchers developing algorithms for use in breast cancer must sift through data almost entirely from female patients, and their models may not work as well when used in men with breast cancer. Moreover, it may prove difficult to build algorithms for medical conditions affecting the intersex community, as with other rare conditions, because there are just not enough patients to supply the training data necessary for models to be accurate.

Companies have been built and academic careers devoted to finding technical ways to get around such challenges. A key first step, the researchers consulted by STAT said, is awareness of the limitations imposed by training datasets that are not representative. “We should communicate them, we should quantify them, and then we should work with clinicians and anyone who wants to use these algorithms to figure out a way to accommodate any of the limitations of artificial intelligence,” Chen said.

The PNAS paper is the latest research to explore the impact of bias in algorithms meant for clinical use. In a paper posted on a preprint server earlier this year, a team led by researchers at the University of Toronto also examined chest X-ray data used to train diagnostic algorithms; they found that factors like the sex, age, race, and the insurance type of patients in that training data were associated with how well the models performed.

In a widely circulated paper published last fall, a team led by researchers at the University of Chicago found racial bias in a UnitedHealth Group algorithm widely used by U.S. health systems to identify patients who might need extra support. The algorithm, sold by UnitedHealth’s Optum unit, relied on insurance claims data and cost projections that ended up classifying white patients as being sicker than their Black peers — even if the Black patients were just as ill.

This is part of a yearlong series of articles exploring the use of artificial intelligence in health care that is partly funded by a grant from the Commonwealth Fund.