
The study was a page-turner: Researchers at Google showed that an artificial intelligence system could predict acute kidney injury, a common killer of hospitalized patients, up to 48 hours in advance.

The results were so promising that the Department of Veterans Affairs, which supplied de-identified patient data to help build the AI, said in 2019 that it would immediately start work to bring it to the bedside.


But a new study shows how treacherous that journey can be. Researchers found that a replica of the AI system, trained on a predominantly male population of veterans, does not perform nearly as well on women. Their study, published recently in the journal Nature, reports that a model built to approximate Google’s AI overestimated the risk for women in certain circumstances and was less accurate in predicting the condition for women overall.

“If we have this problem, then half the population won’t benefit,” said Jie Cao, a Ph.D. student at the University of Michigan and the lead author of the paper. She said the results reinforce the need to train models on diverse groups of patients and test them on local populations, where demographic differences among patients and variations in how health care is delivered may undermine an AI system’s accuracy.

None of this is necessarily news to Google, which flagged performance issues in women in its initial paper and emphasized the need for additional testing. But the Michigan paper quantifies the extent of the problem at various stages of kidney injury, and goes further to point out how challenging the problem is to fix.


When the researchers retrained their model on sex-balanced data from Michigan and VA facilities, it performed better on women in the Michigan data but continued to struggle within the VA. That suggests its accuracy problem may be tied to issues more complicated than limited exposure to female patients with kidney damage, such as differences in treatment practices within VA facilities.

Acute kidney injury kills about 1.7 million people around the world annually. It is difficult to recognize and often causes patients to deteriorate rapidly, before lifesaving treatment can be delivered. Within the VA, 28% of patients who develop the condition die within one year, and about 6% die in the hospital, according to a recent study.

All of which makes it an alluring use case for AI.

By analyzing patient data, Google’s system could tell clinicians which patients would develop the condition up to two days in advance. The system was developed by the Alphabet research unit DeepMind Health, which has since been merged with Google’s health division. In the initial study, it correctly identified 9 out of 10 VA patients whose kidney function ultimately declined to the point of needing dialysis.

In the business of applying AI in medicine, such results quickly get labeled as a “game-changer” or a “breakthrough.” But that initial excitement ignores the much more difficult work that lies ahead in making sure a newly minted AI works equally well across health care settings and in different kinds of patients, regardless of their race, income level, or gender.

“We still have a long way to go in terms of using these models to change how health care works and to change a patient’s health,” said Cao, adding that premature implementation of Google’s model could lead to poor treatment decisions by clinicians who might not understand its weaknesses.

The VA was planning a clinical trial of the AI in its Palo Alto health system, but the status of that effort is unclear. The VA, which has funded multiple projects to predict kidney injury, including one at the University of Michigan, did not respond to questions about the status of the clinical trial. “VA is continuing to study various approaches before making a determination on the different models’ efficacy and/or suitability for any specific uses,” the agency said in a statement.

In emailed responses to STAT, one of Google’s researchers, Alan Karthikesalingam, wrote that he “agree(s) with the importance of the observations” made by the Michigan researchers about the model’s struggles with female patients. He noted, as Google’s initial paper did, that the company’s AI models are prototypes in need of further refinement and evaluation for effectiveness and fairness.

Karthikesalingam also pointed out that the Michigan study approximates the lower-performing of two AI models his research team built for acute kidney injury prediction. The Michigan researchers said they tested only one of the models — which use different architectures to process data — because of limited computing capacity in the VA’s system.

Google said it is now taking steps to advance the development of its AI. Its researchers have published protocols for implementing the better-performing of its two models, along with open source code. “We hope this enables other clinical researchers to build upon our promising initial findings and take further research steps to improve performance in representative populations in multiple clinical environments,” they wrote.

Google Cloud, a separate business unit of the company, is also making a pilot-ready version of the model available to health systems around the world that wish to explore its application to patients in their hospitals.

This story is part of a series examining the use of artificial intelligence in health care and practices for exchanging and analyzing patient data. It is supported with funding from the Gordon and Betty Moore Foundation.
