At a time when data and data science are increasingly essential to improving cancer care, oncologists and cancer researchers often lack the training needed to understand and leverage the data to their fullest extent. Similarly, data scientists often lack an understanding of cancer biology and a patient’s journey through the disease, both of which are necessary to gather and query data appropriately to answer a myriad of important biological and clinical questions.

Take, for example, a case recently discussed at a big data meeting hosted by Susan G. Komen, the organization we work for: A data scientist constructed a survey to gather information about a sample of women. One question asked about the number of children the respondent had. If she skipped the question, the data scientist dealt with the missing data by assigning zero to the respondent’s answer. A cancer researcher then analyzed the data to glean information about family size and was surprised that a large percentage of respondents appeared to have no children.

The data set was built with rules that made sense to the data scientist, given what he was seeking to measure. The cancer researcher, however, lacked information about how the data were constructed and unknowingly drew incorrect conclusions about how many participants had children.
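To make the pitfall concrete, here is a minimal sketch in Python of how that coding choice can distort a result. The survey answers, function names, and numbers are invented for illustration and are not taken from the case discussed at the meeting; the sketch assumes a skipped question is stored as `None`.

```python
from typing import List, Optional, Tuple

# Hypothetical answers to "How many children do you have?"
# None means the respondent skipped the question.
responses: List[Optional[int]] = [2, None, 0, 3, None, 1, None, 2, 0, None]


def share_childless_zero_filled(answers: List[Optional[int]]) -> float:
    """Code every skipped answer as 0, then compute the share with no children."""
    filled = [0 if a is None else a for a in answers]
    return sum(1 for a in filled if a == 0) / len(filled)


def share_childless_observed(answers: List[Optional[int]]) -> Tuple[float, float]:
    """Use only respondents who answered; also report how many skipped."""
    observed = [a for a in answers if a is not None]
    if not observed:
        raise ValueError("no respondent answered the question")
    childless = sum(1 for a in observed if a == 0) / len(observed)
    skip_rate = (len(answers) - len(observed)) / len(answers)
    return childless, skip_rate


if __name__ == "__main__":
    zero_filled = share_childless_zero_filled(responses)
    observed, skipped = share_childless_observed(responses)
    print(f"Childless, skips coded as 0: {zero_filled:.0%}")
    print(f"Childless, among those who answered: {observed:.0%}")
    print(f"Skipped the question: {skipped:.0%}")
```

In this made-up sample, coding skips as zero makes 60 percent of respondents look childless, while only a third of those who actually answered reported no children. Keeping missing answers distinct from zero, and reporting the skip rate alongside the result, gives the next analyst what they need to interpret the data correctly.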


There’s a growing belief that integrating “big data” (large amounts of different types of data, including electronic health records, administrative and health insurance claims databases, large data repositories, and “-omics” information) can provide a more complete picture of people with cancer. The information available for each patient would then include their conditions, their care (including which medicines and treatments they’ve tried), and their actual outcomes. But to use this information correctly to guide patient care, changes are needed in the way data are gathered and made available to patients, their care providers, and researchers.

The retail and financial industries have excelled at using big data to get to know their customers and closely track their needs and habits. It isn’t a coincidence that the pair of shoes you looked at online but didn’t purchase later appeared in your Facebook feed for several days. It’s also why you get text messages from your credit card company asking if it was really you who just made that large and unusual purchase.


The health care industry hasn’t kept pace. So far, electronic data in health care have primarily been limited to coding procedures so medical providers can bill the appropriate parties and track the collection of payments.

A lack of specialized disease- and care-focused training in data science is one of the reasons the health care industry has been slow to use big data to improve patient care. While some graduate and other academic training programs offer courses that teach data science to researchers and medical students, and some data scientists study biology, students generally graduate with degrees in one field or the other but rarely in both. M.D./Ph.D. programs cross-train clinician-scientists to care for patients and run research projects, but there is little equivalent for data science: most universities and science departments currently lack the infrastructure to implement the kind of pedagogical change necessary to develop a cadre of graduates who understand biology and data science equally well.

Komen’s Big Data for Breast Cancer (BD4BC) initiative is highlighting the need to accelerate the use of big data in cancer research and patient care to discover better therapies for breast cancer patients, improve their outcomes, reduce health care disparities, and optimize precision medicine. BD4BC is creating opportunities for cancer researchers and data scientists to become familiar with each other’s field, fostering strong collaborations and ultimately creating a new “bilingual” workforce populated with individuals who understand breast cancer risks, onset, and progression and can apply data science methods to answer the challenges faced by breast cancer patients.

As a starting point, the initiative provides financial scholarships for breast cancer researchers to attend conferences or workshops focused on data science. These meetings expose biomedical scientists to research conducted with larger sets of data, the types of questions that can be answered with these data, the methods used to find meaningful patterns in data, and the pitfalls of these types of efforts. Attendance at such conferences also introduces biomedical scientists to data science experts with whom they can form meaningful collaborations.

Along with the scholarship program, other programs in the BD4BC initiative are focused on answering some of the most pressing challenges around using big data in breast cancer research and patient care, including:

  • How do you talk to patients about the benefits of sharing their data with researchers?
  • How can we better engage patients and their data in research and clinical care?
  • How can continuous data from wearable devices such as fitness trackers be applied to the medical field to learn more about breast cancer and its treatments?
  • How can data science methods, such as machine learning, predict patients’ responses to therapy and outcomes?

Although BD4BC is focused on breast cancer, it could serve as a model for big data initiatives focused on other types of cancer. Together, such initiatives will pave the way for doctors to give patients with any type of cancer an individually tailored picture of the risks and benefits of various treatment options, including their likelihood of responding and of experiencing side effects, to inform decision-making and deliver the best results for each patient. Big data is one more tool for making that future a reality. Exposing more cancer researchers and oncologists to data science, and more data scientists to the complexity of cancer, is a critical step toward getting there.

Stephanie Birkey Reffey, Ph.D., is the senior director of data science and impact at Susan G. Komen, where Jerome Jourquin, Ph.D., is the senior manager of data science. They oversee Komen’s Big Data for Breast Cancer initiative.
