
A top researcher at the Massachusetts Institute of Technology on Thursday said that artificial intelligence systems developed for medicine must be more transparent and judged against a set of common standards to ensure fairness and equity.
“It’s really important, whenever we are developing these models, to train them on diverse populations and report their accuracy, slicing and dicing for different subpopulations,” Regina Barzilay, a professor of engineering and computer science at MIT, said at the STAT Summit. “It’s very easy with these models, if they’re trained on one population and then applied on another…to provide inequitable care.”
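In machine learning terms, “slicing and dicing” means computing the same performance metric separately for each subgroup rather than reporting only a pooled number. Below is a minimal sketch of that idea using scikit-learn’s ROC AUC; the column names and synthetic data are entirely illustrative and are not drawn from Barzilay’s actual pipeline.

```python
# A minimal sketch of per-subgroup evaluation, assuming a fitted binary
# classifier whose risk scores are already in the dataframe. All names
# here are illustrative, not from any published model.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def report_subgroup_auc(df: pd.DataFrame, group_col: str,
                        label_col: str, score_col: str) -> pd.DataFrame:
    """Compute ROC AUC separately for each subpopulation.

    An AUC near 0.5 for any group means the model performs close to
    random for that group, even if the pooled AUC looks good.
    """
    rows = []
    for group, sub in df.groupby(group_col):
        # Skip groups with only one outcome class; AUC is undefined there.
        if sub[label_col].nunique() < 2:
            continue
        rows.append({
            "group": group,
            "n": len(sub),
            "auc": roc_auc_score(sub[label_col], sub[score_col]),
        })
    return pd.DataFrame(rows)

# Illustrative usage with synthetic data:
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "race": rng.choice(["White", "Black", "Asian"], size=3000),
    "label": rng.integers(0, 2, size=3000),
})
# Fake risk scores: mildly informative overall. In a real audit, a model
# trained on one population may show a sharp AUC drop for the others.
df["score"] = df["label"] * 0.3 + rng.normal(0, 1, size=3000)
print(report_subgroup_auc(df, "race", "label", "score"))
```

Reporting a table like this alongside the pooled metric is the kind of disclosure Barzilay is arguing should be standard.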
The risk of unfair treatment is rising, she warned, because many AI developers are not validating their products on different groups of people or open-sourcing the code so their accuracy can be compared with that of other models.
She spoke of her efforts to build a machine learning tool to predict a patient’s risk of developing breast cancer based on imaging data. She said a commonly used model for that task, known as the Tyrer-Cuzick model, performs with modest accuracy in populations of white women but is much less effective when applied to women of African and Asian descent.
“For Asian women and for African American women, it was working close to random,” she said. “And it’s not surprising because that statistical model was developed on white women in London, but then it was applied in a diverse population.”
Barzilay, herself a breast cancer survivor, said the AI system she developed with clinicians at Massachusetts General Hospital has been tested on groups of patients in Sweden and Taiwan. The researchers are also seeking to expand testing among African American patients and other groups.
But under current industry norms, the extent of testing and reporting on diversity is largely left up to the developers of the models. While publication and commercialization of machine learning products in medicine have exploded, neither medical journals nor the U.S. Food and Drug Administration has developed clear standards for reporting performance across race, gender, and age groups.
Furthermore, Barzilay said, since AI developers are building their products on disparate datasets, it is difficult to compare their performance or clinical usefulness.
“We need to have standards on what we expect to see when we say the [AI product] is ready for deployment,” she said. “Today in the United States, there is no large corpus of mammograms for training and testing. It means there is no way for me to compare my algorithm against Google’s published algorithms. We are publishing numbers on different datasets and there is no way for us to do the testing. This is really a serious issue.”