
Even as machine learning and artificial intelligence are drawing substantial attention in health care, overzealousness for these technologies has created an environment in which other critical aspects of the research are often overlooked.

There’s no question that the increasing availability of large data sources and off-the-shelf machine learning tools offers tremendous resources to researchers. Yet a lack of understanding about the limitations of both the data and the algorithms can lead to erroneous or unsupported conclusions.


Given that machine learning in the health domain can have a direct impact on people’s lives, broad claims emerging from this kind of research should not be embraced without serious vetting. Whether conducting health care research or reading about it, make sure to consider what you don’t see in the data and analyses.

Be critical of the data

One key question to ask is: Whose information is in the data and what do these data reflect?

Common forms of electronic health data, such as billing claims and clinical records, contain information only on individuals who have encounters with the health care system. But many individuals who are sick don’t — or can’t — see a doctor or other health care provider and so are invisible in these databases. This may be true for individuals with lower incomes or those who live in rural communities with rising hospital closures. As University of Toronto machine learning professor Marzyeh Ghassemi said earlier this year:


Even among patients who do visit their doctors, health conditions are not consistently recorded. Health data also reflect structural racism, which has devastating consequences.

Data from randomized trials are not immune to these issues. As a ProPublica report demonstrated, black and Native American patients are drastically underrepresented in cancer clinical trials. This is important to underscore given that randomized trials are frequently highlighted as superior in discussions about machine learning work that leverages nonrandomized electronic health data.

In interpreting results from machine learning research, it’s important to be aware that the patients in a study often do not represent the population we wish to draw conclusions about and that the information collected is far from complete.

Be critical of the metrics

It has become commonplace to evaluate machine learning algorithms based on overall measures like accuracy or area under the curve. However, one evaluation metric cannot capture the complexity of performance. Be wary of research that claims to be ready for translation into clinical practice but only presents a “leader board” of tools that are ranked based on a single metric.

As an extreme illustration, an algorithm designed to predict a rare condition found in only 1% of the population can be extremely accurate by labeling all individuals as not having the condition. This tool is 99% accurate, but completely useless. Yet, it may “outperform” other algorithms if accuracy is considered in isolation.
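
The arithmetic behind this illustration can be checked in a few lines of Python. This is a minimal sketch on a made-up cohort of 1,000 people (the cohort size and the helper names `accuracy` and `sensitivity` are assumptions for illustration, not taken from any study):

```python
# Hypothetical cohort: 1,000 people, 10 of whom (1%) have the condition.
y_true = [1] * 10 + [0] * 990

# The "algorithm" labels every individual as not having the condition.
y_pred = [0] * len(y_true)


def accuracy(truth, pred):
    """Share of individuals whose label was predicted correctly."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def sensitivity(truth, pred):
    """Share of people with the condition whom the tool actually flags."""
    true_positives = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    actual_positives = sum(truth)
    return true_positives / actual_positives


acc = accuracy(y_true, y_pred)
sens = sensitivity(y_true, y_pred)

print(f"accuracy:    {acc:.0%}")
print(f"sensitivity: {sens:.0%}")
```

Accuracy comes out at 99% while sensitivity is 0%: the tool never identifies a single person who has the condition, which is why accuracy alone says little here.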

What’s more, algorithms are frequently not evaluated on multiple hold-out samples, as is done in cross-validation. Relying on a single hold-out sample, as many published papers do, often yields higher-variance estimates and a misleading picture of performance.
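
To make the contrast concrete, here is a rough sketch of 5-fold cross-validation in plain Python. Everything in it is an assumption for illustration: the data are simulated, the "model" is a toy threshold rule, and the function names are hypothetical. It is not the evaluation procedure of any particular paper.

```python
import random
import statistics


def make_data(n, rng):
    """Simulated records: one noisy feature whose mean depends on the label."""
    data = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        feature = rng.gauss(1.0 if label else 0.0, 1.0)
        data.append((feature, label))
    return data


def fit_threshold(train):
    """Toy 'training': use the midpoint between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def accuracy(threshold, test):
    """Share of held-out records classified correctly by the threshold."""
    return sum((x > threshold) == (y == 1) for x, y in test) / len(test)


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_accuracy(data, k, rng):
    """Hold out each fold once, train on the rest, return one score per fold."""
    scores = []
    for held_out in k_fold_indices(len(data), k, rng):
        held = set(held_out)
        train = [d for i, d in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = make_data(200, rng)
fold_scores = cross_validated_accuracy(data, k=5, rng=rng)

print("per-fold accuracy:", [round(s, 3) for s in fold_scores])
print("mean across folds:", round(statistics.mean(fold_scores), 3))
print("spread (std dev): ", round(statistics.stdev(fold_scores), 3))
```

Each entry in `fold_scores` is what a paper would have reported had it used that fold as its only hold-out sample. The gap between the lowest and highest fold gives a sense of how much a single-split result can depend on the luck of the split.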

Beyond examining multiple overall metrics of performance for machine learning, we should also assess how tools perform in subgroups as a step toward avoiding bias and discrimination. For example, artificial intelligence-based facial recognition software performed poorly when analyzing darker-skinned women. Many measures of algorithmic fairness center on performance in subgroups.
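
A subgroup check of this kind can be sketched as follows. The records, the group labels "A" and "B" and the helper names are invented for illustration. The point is only that a respectable overall number can hide poor performance in a smaller group.

```python
from collections import defaultdict

# Hypothetical records: (group, true label, predicted label).
# Group A: 80 people, every one classified correctly.
# Group B: 20 people, and 8 of its 10 true cases are missed.
records = (
    [("A", 1, 1)] * 20 + [("A", 0, 0)] * 60
    + [("B", 1, 1)] * 2 + [("B", 1, 0)] * 8 + [("B", 0, 0)] * 10
)


def accuracy(rows):
    """Share of rows whose predicted label matches the true label."""
    return sum(t == p for _, t, p in rows) / len(rows)


def sensitivity(rows):
    """Share of true cases that the tool flags as cases."""
    cases = [(t, p) for _, t, p in rows if t == 1]
    return sum(p == 1 for _, p in cases) / len(cases)


# Split the records by subgroup.
by_group = defaultdict(list)
for row in records:
    by_group[row[0]].append(row)

overall_accuracy = accuracy(records)
group_sensitivity = {g: sensitivity(rows) for g, rows in by_group.items()}

print(f"overall accuracy: {overall_accuracy:.0%}")
for group, value in sorted(group_sensitivity.items()):
    print(f"sensitivity in group {group}: {value:.0%}")
```

Overall accuracy here is 92%, yet sensitivity is 100% in group A and only 20% in group B. Reporting the overall figure alone would conceal that the tool misses most cases in the smaller group.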

Bias in algorithms has largely not been a focus in health care research. That needs to change. A new study found substantial racial bias against black patients in a commercial algorithm used by many hospitals and other health care systems. Other work developed algorithms to improve fairness for subgroups in health care spending formulas.

Subjective decision-making pervades research. Who decides what the research question will be, which methods will be applied to answering it, and how the techniques will be assessed all matter. Diverse teams are needed not just because they yield better results. As Rediet Abebe, a junior fellow of Harvard’s Society of Fellows, has written, “In both private enterprise and the public sector, research must be reflective of the society we’re serving.”

Going forward

The influx of so-called digital data that’s available through search engines and social media may be one resource for understanding the health of individuals who do not have encounters with the health care system. There have, however, been notable failures with these data. But there are also promising advances using online search queries at scale where traditional approaches like conducting surveys would be infeasible.

Increasingly granular data are now becoming available thanks to wearable technologies such as Fitbit trackers and Apple Watches. Researchers are actively developing and applying techniques to summarize the information gleaned from these devices for prevention efforts.

Much of the published clinical machine learning research, however, focuses on predicting outcomes or discovering patterns. Although machine learning for causal questions in health and biomedicine is a rapidly growing area, we don’t see a lot of this work yet because it is new. Recent examples of it include the comparative effectiveness of feeding interventions in a pediatric intensive care unit and the effectiveness of different types of drug-eluting coronary artery stents.

Understanding how the data were collected and using appropriate evaluation metrics will also be crucial for studies that incorporate novel data sources and those attempting to establish causality.

In our drive to improve health with (and without) machine learning, we must not forget to look for what is missing: What information do we not have about the underlying health care system? Why might an individual or a code be unobserved? What subgroups have not been prioritized? Who is on the research team?

Giving these questions a place at the table will be the only way to see the whole picture.

Sherri Rose, Ph.D., is associate professor of health care policy at Harvard Medical School and co-author of the first book on machine learning for causal inference, “Targeted Learning” (Springer, 2011).

  • We need a close look at the algorithms used for deciding clinical trial drug efficacy, as they are extremely important and may not be statistically correct based on the low number of patients studied, racism, genealogy, epigenetics, location, age and study duration. Bayes’ Theorem is used in risk analysis and decision making in Science and Engineering. Why not for Clinical Trials? Further, to improve Clinical Trials, lower drug costs, reduce the number of studies and enhance Physician use of drugs, drugs should be approved for Biochemical Pathways and not Medical Conditions. All existing drugs should be assigned Biochemical Pathways…

  • The best article on AI and tech that has been written!

    Mass media, marketers and corporate propagandists rely on distorting perceptions, creating false correlations, and distorting reality to increase profits and deceive. When looking at any data set, one should always ask what is not there, who is not represented and who benefits.

  • Spot-on. There is also a vast gap, in my observation, between clinical practice and academic research in terms of an understanding of the actions of clinicians per protocols, training and muscle memory. It is one thing to study data in a quiet setting where time and urgency are not drivers to action, and discussions of sensitivity and specificity can be perused theoretically. It is quite another when action is required to save a patient’s life, time is of the essence, and there is no luxury afforded to debating at the margins. The clinician is the decision maker upon whom direction is determined and responsibility is levied – both for successful treatment of the patient and blame in the case of adverse or deleterious events. The training in terms of years and experience in the brain of the clinician have yet to be recreated successfully in the laboratory, and involves coalescing data from many sources, as well as incorporating the more visceral reactions of empathy and understanding. This whole topic seems worthy of much more focus.

  • Great advice. I worked in IT in the area of database administration for almost 40 years. Over the years increased computing power and sophisticated data mining and analysis tools held out the promise of being able to glean magnificent insights from vast stores of existing data. Many of those projects ended in disappointment because the data had not been collected with those goals in mind. Key pieces were missing, data had been collected and encoded inconsistently, data wasn’t granular enough, etc. Sometimes existing data can be scrubbed, recoded and improved to an extent. More often, efforts were launched to begin collecting new data that would support the type of analysis desired.
