
“Mining” is a popular shorthand expression for how we uncover important insights buried in big data. So when I read recently that a record 1,040 students signed up for Stanford University’s CS229 Machine Learning course by the first day of class, I thought of the 19th-century Gold Rush to the same part of the country.

Consider the parallels beyond geography: Something valuable and hidden is suddenly found, and appears to be everywhere. Early adopters use simple methods to get rich. As word spreads, people with scant knowledge make huge personal investments as the reward seems so attainable. The 1840s equivalent of CS229 was San Francisco, which grew 78 times larger in five years.

Like prospectors staking claim after claim in search of a strike, organizations today rush to amass newer and bigger data sets. While having more data can open tremendous new opportunities, I’m troubled that there’s far less conversation around what to do with that data. As consumers of data, we need to acknowledge that how data are analyzed can be as important as having this information in the first place.


Few 49ers made a fortune. For the multitude who endured financial ruin, the disaster was usually theirs alone. The same isn’t always true when it comes to big data in health care. Bad choices and bad data science can have consequences that diminish the health of far too many people, whether the loss is time spent on less efficient clinical trials or mistaken conclusions about the effectiveness of a treatment.

I see three big risks to avoid if we want a different, better outcome from mining health data.


Asking the wrong question

The private sector has taught us that the most successful data mining applications occur when there is a focused business question to answer. No algorithm can overcome a question that is too broad or misplaced. At the Duke Clinical Research Institute, where I work, we draw on the experiences of practicing faculty clinicians to frame sharper questions.

Asking the wrong question can cause us to undervalue data we have. For example, our hope that genetic markers can help us determine who is at highest risk of having a heart attack or developing other forms of cardiovascular disease has yet to be realized. However, colleagues and I are using the same data and machine learning techniques to probe a related question that cardiologists and patients undoubtedly will benefit from knowing the answer to: Who will benefit most from certain cardiovascular treatments while experiencing the fewest side effects and who will benefit the least with the greatest side effects?

Using the wrong data

The flip side of undervaluing data is assuming that the data we have or can most readily gather can answer any question we form. Too often today, the data lead and the question follows. We must acknowledge the limitations in data.

“The more data, the better” is an accepted premise. More data may improve precision, but they don’t necessarily expand the range of experiences captured. Wired magazine left that out when it asserted almost a decade ago that the correlations we identify with big data would make the search for causes obsolete. Here’s the challenge: If we had electronic health records for every American woman from ages 20 to 40 and tried to generalize their data to all women, we would be stymied by the fact that most women that age visit the doctor only when they are sick or pregnant. We’ve dubbed that “informed presence bias,” meaning that the data in the electronic health records of these women arise from important preconditions that must be acknowledged and addressed in any analysis. More data help only if they are representative of what we want to study.
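To make the point concrete, here is a minimal simulation sketch in Python. It is a toy illustration, not an analysis of real records: the 10 percent true prevalence and the visit probabilities (80 percent for sick people, 20 percent for healthy ones) are assumptions chosen for the example, and the function name `simulate_records` is hypothetical.

```python
import random


def simulate_records(n_people, seed=0):
    """Toy population in which sick people are more likely to have a record.

    Returns (true_prevalence, observed_prevalence, n_records).
    All probabilities below are illustrative assumptions.
    """
    rng = random.Random(seed)
    sick_flags = []   # illness status for everyone in the population
    recorded = []     # illness status only for people who visited a doctor
    for _ in range(n_people):
        sick = rng.random() < 0.10            # assumed true prevalence: 10%
        visit_prob = 0.80 if sick else 0.20   # assumed: illness drives visits
        sick_flags.append(sick)
        if rng.random() < visit_prob:
            recorded.append(sick)
    true_prev = sum(sick_flags) / n_people
    obs_prev = sum(recorded) / len(recorded)
    return true_prev, obs_prev, len(recorded)


if __name__ == "__main__":
    for n in (1_000, 100_000):
        true_prev, obs_prev, n_rec = simulate_records(n)
        print(f"population={n:>7}  records={n_rec:>6}  "
              f"true={true_prev:.3f}  observed in records={obs_prev:.3f}")
```

Under these assumptions, the share of sick people among those with a record settles near 31 percent (0.08 / 0.26), about three times the true rate. Growing the population from a thousand to a hundred thousand makes that estimate steadier, but it does not move it toward the truth.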

But the draw of “new” and “more” data frequently overshadows substantive realities like these. Perhaps that isn’t surprising. When the consulting firm KPMG surveyed leaders in more than 2,000 companies, nearly twice as many leaders trusted their organizations’ capacity to get data as trusted their capacity to analyze the data accurately, or even to know whether they had done the math right.

Collecting new data for every clinical research question is impractical and inefficient. That’s at the heart of the interest in big data for pragmatic clinical research. Too often, the response today is to use data of convenience. But they are often missing something vital to the answer. We must identify those gaps and understand their impact on questions and answers.

Using the wrong tools

Clinical research tends to rely on precedent — what’s tried and true. On top of that, it is difficult for nonexpert data consumers to know which methods to use to answer a question. As a consequence, assembly-line analysis is common; methods that amount to mass production are used over and over. And too many investigators count on the size of the data pool to straighten out any problems, which doesn’t always happen.

For example, randomized clinical trials have shown that patients who take medications to lower their cholesterol or their blood pressure reduce their risk of having a heart attack, stroke, or other cardiovascular event. But when non-randomized observational data (like those collected as part of the Framingham Heart Study) are analyzed using conventional methods, the findings indicate that taking these medications is associated with the same or higher risk instead. That is a result of incorrectly applying statistical methods. When the observational data are analyzed using state-of-the-art patient matching and weighting techniques developed by Duke and other researchers, we come close to the results obtained in the randomized clinical trials. Simpler statistical analysis doesn’t get us to the answer.
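A small simulation can show why this happens. The sketch below is not the Duke methodology and uses no real patient data. It applies generic inverse probability weighting to a toy population with a single binary confounder, and every number in it is an assumption made for illustration: half of patients are high-risk, high-risk patients are treated 80 percent of the time versus 20 percent for the others, and treatment cuts event risk by 30 percent. The function names are hypothetical.

```python
import random


def simulate(n, seed=0):
    """Return a list of (high_risk, treated, event) tuples for a toy cohort."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_risk = rng.random() < 0.5
        # Assumed confounding: sicker patients are treated more often.
        treated = rng.random() < (0.8 if high_risk else 0.2)
        base_risk = 0.40 if high_risk else 0.10
        # Assumed true effect: treatment multiplies event risk by 0.7.
        risk = base_risk * (0.7 if treated else 1.0)
        event = rng.random() < risk
        rows.append((high_risk, treated, event))
    return rows


def naive_risk_difference(rows):
    """Event rate among treated minus event rate among untreated."""
    treated = [e for _, t, e in rows if t]
    control = [e for _, t, e in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_risk_difference(rows):
    """Same contrast, with each patient weighted by 1 / P(own treatment | risk)."""
    # Estimate the probability of treatment within each risk stratum.
    propensity = {}
    for level in (True, False):
        stratum = [t for h, t, _ in rows if h == level]
        propensity[level] = sum(stratum) / len(stratum)
    w_treated = w_control = ev_treated = ev_control = 0.0
    for high_risk, treated, event in rows:
        p = propensity[high_risk]
        if treated:
            w = 1.0 / p
            w_treated += w
            ev_treated += w * event
        else:
            w = 1.0 / (1.0 - p)
            w_control += w
            ev_control += w * event
    return ev_treated / w_treated - ev_control / w_control


if __name__ == "__main__":
    cohort = simulate(200_000)
    print(f"naive risk difference:    {naive_risk_difference(cohort):+.3f}")
    print(f"weighted risk difference: {ipw_risk_difference(cohort):+.3f}")
```

With these assumed numbers, the unadjusted comparison makes the treatment look harmful (a risk difference near +0.08) because treated patients were sicker to begin with. Weighting brings the estimate close to the built-in benefit of about -0.075. Real analyses face many confounders, some of them unmeasured, so this only shows the mechanism in its simplest form.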

Moving forward

As insights from data become more important in the decisions that physicians and patients make, we must create new methods of analysis and appropriately apply them, along with existing methods. With federal policymakers pledging to speed up drug approval and use real-world evidence in the process, the application and development of appropriate analytic tools become even more essential. Fortunately, an investment to improve analytics would cost a fraction of deploying an electronic health record platform across a single health care system.

At Duke, we say “the decision is in the question.” To answer the right questions that improve patients’ health and lives, we need the right people using the right methods on the right data. Without that, the current rush to collect big data may not pan out in the way we had hoped.

Michael J. Pencina, Ph.D., is a professor in the Department of Biostatistics and Bioinformatics at Duke University School of Medicine and directs research analytics and biostatistics for the Duke Clinical Research Institute.