An algorithm commonly used by hospitals and other health systems to predict which patients are most likely to need follow-up care classified white patients overall as being more ill than black patients — even when they were just as sick, a new study finds.
Overall, only 18% of the patients identified by the algorithm as needing more care were black, compared with about 82% who were white. If the algorithm had reflected the true proportion of the sickest black and white patients, those figures would have been about 46% and 53%, respectively. The research was published Thursday in Science.
All told, health systems use the algorithm for some 100 million people across the country.
The study’s authors then trained a new algorithm using patients’ biological data, rather than the insurance claims data that the original program used, and found an 84% reduction in bias. Previously, the algorithm was failing to account for nearly 50,000 chronic conditions collectively experienced by black patients. After the algorithm was retrained, that number dropped to fewer than 8,000. The reduction in bias underscored what many in the health technology field believe: Algorithms may only be as good as the data behind them.
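As a rough check on the arithmetic, the two rounded counts quoted above are consistent with the reported 84% figure. The sketch below assumes the reduction is measured as the relative drop in unaccounted-for chronic conditions; the function name and the rounded inputs are illustrative and are not taken from the study itself.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Rounded figures from the article ("nearly 50,000" and "fewer than 8,000"),
# used here only as approximations.
missed_before = 50_000
missed_after = 8_000

reduction = relative_reduction(missed_before, missed_after)
print(f"Relative reduction: {reduction:.0%}")
```

Because both inputs are rounded, the result should be read as approximate agreement with the published figure, not a reproduction of the study's calculation.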
STAT spoke with Sendhil Mullainathan, a computational and behavioral science researcher at the University of Chicago’s Booth School of Business and senior author of the new study, to learn more. This interview has been condensed and lightly edited for clarity.
What was the inspiration for this study?
There’s a lot of discussion about algorithmic bias in the news, but we actually don’t know much about algorithms that are already implemented at a large scale. You don’t really get access from the inside. Part of our interest was [in knowing if there] is bias, how these things could play out.
Why is there a suspicion of bias in such programs?
I think the suspicion comes because, in general, these algorithms are built on data and those reflect systemic biases, and so won’t the algorithm also reflect the biases?
The other part is that algorithms are only as good as the objectives we give them. It’s much as with people: You can’t ask a person to do “A” and then be disappointed that they didn’t do “B.” Algorithms are the same: You give them a very narrow objective. But we haven’t told algorithms yet to do things without racial bias. We haven’t learned yet that this is an objective that we have to build into them.
Who uses this algorithm you examined?
It’s a category of algorithms of which there are several manufacturers. [Looking at] past health records, [they] predict future health. It’s applied by health systems to over 100 million people, so quite widespread. And it’s used for care coordination programs. These algorithms are meant to flag people for the chance that, in the next year, they’ll need extra care. Health care systems take these algorithms and use them to rank people and say these are some of the people who might need extra help.
For example, if you have diabetes and heart disease, you’re going to be at the hospital a lot. We might give you a dedicated nurse who can tell you whether to come in or not. Sometimes it’s making sure they don’t come in to the [emergency room], but go to a special desk instead.
What did you find?
We thought, if this algorithm were unbiased, it shouldn’t matter whether a person is black or white. But we found that similarly sick black patients are ranked much lower [by the algorithm] — as less sick — than white patients.
This basically says that care coordination programs, this part of how we’re coordinating such programs, is having massive gaps.
Was this surprising?
[This] is a gigantic number, and the magnitude of it is huge. But it’s not immediately obvious why there should have been bias going in. We always fear bias, but I think you’d have to be particularly pessimistic to believe there’s bias everywhere.
It’s not clear why there would be bias in this direction. If I told you that the American health care system doesn’t serve black patients very well, then you would expect them to be sicker. But if they’re sicker, then you’d expect them to be flagged more by the algorithm. The bias should have been going in the other direction.
How do you account for genetic predispositions?
The goal [of the algorithm] isn’t to get to the causes of risk, but simply to identify people at risk. Whatever the reason for the gap, [our findings] mean that these sick people we are trying to target, we end up missing.
How do you think the bias was introduced?
It kind of arose in a subtle way. When you look at the literature, “sickness” could be defined by the health care we provide you. In other words, you’re sicker the more dollars we spend on you. The other way we could define sickness is physiologically, with things like high blood pressure.
It turns out that the algorithm was trained on dollars spent and the kind of care we deliver rather than on the underlying physiology. It looks like the algorithm took the system-wide problem of the difference between these two [definitions], which we often take to be synonymous but which differ, especially for black patients versus white patients, and expanded and magnified it.
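The mechanism described here can be illustrated with a small, entirely synthetic sketch. It assumes two hypothetical groups, "A" and "B", with identical illness levels, where group B has less spent on its care at the same level of illness (the `access_factor_b` parameter is an invented stand-in for that gap). For simplicity the score is the label itself, standing in for a model that predicts its training label perfectly. None of the names or numbers come from the study.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    group: str       # hypothetical group label, "A" or "B"
    conditions: int  # active chronic conditions (physiological definition)
    cost: float      # dollars spent on care (spending definition)


def make_patients(access_factor_b: float = 0.7) -> list[Patient]:
    """Build two groups with identical illness but lower spending in group B."""
    patients = []
    for conditions in range(10):
        patients.append(Patient("A", conditions, 1000.0 * conditions))
        patients.append(
            Patient("B", conditions, 1000.0 * conditions * access_factor_b)
        )
    return patients


def flag_top(patients, key, k):
    """Flag the k highest-ranked patients under the given definition of risk."""
    return sorted(patients, key=key, reverse=True)[:k]


def share_of_group(flagged, group):
    """Fraction of flagged patients who belong to `group`."""
    return sum(p.group == group for p in flagged) / len(flagged)


patients = make_patients()
by_cost = flag_top(patients, key=lambda p: p.cost, k=6)
by_health = flag_top(patients, key=lambda p: p.conditions, k=6)

print("Group B share when ranking by cost:      ", share_of_group(by_cost, "B"))
print("Group B share when ranking by conditions:", share_of_group(by_health, "B"))
```

In this toy setup, ranking by conditions flags the two groups equally, while ranking by cost flags fewer group B patients even though both groups are equally sick. The real algorithm and data are far more complex, so this shows only the direction of the effect.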
Is there a way to fix it?
When we retrained a new algorithm on actual physiological variables, the gap then completely disappears. And it turns out that you can get almost all of the efficiency properties of the original algorithm without the racial gap. This, to me, is the bigger lesson for algorithms as a whole: Concepts that we as humans tend to take synonymously — like care in dollars and care in biological terms — algorithms take them literally.
How can we prevent things like this in the future?
When there’s new technology like this, it takes time for it to go from prototype to a larger scale, and some of these things don’t show up until later. If we have a prototype, there are more questions [like about racial bias] that need to be asked. When these things go to scale, we just need a different way to be smarter about them.
Maybe the lesson here is to be very careful about the data the algorithm is being trained on. We should also spend a lot of time defining what our goals are. Do we really care only about insured people [if we’re using claims data to train algorithms], or also about the uninsured? Have I asked [the algorithm to do] everything I care about?
I think we understate the enormity of the problem in front of us. Because it’s new technology, the learning phase of it is not in the algorithm learning but in us learning about bigger, structural implications.