Kendall Squared brings you dispatches from the world’s epicenter for biotechnology and drug discovery.
Your Gmail is there, and so are documents that colleagues share with you in Dropbox, not to mention movies that studios send to Netflix. Now the cloud has another inhabitant: the complete DNA data and other molecular and medical information on cancers from 11,000 patients.
Seven Bridges, a biomedical data analysis company in Cambridge, Mass., announced this week that it had put that voluminous data (more than 1 petabyte's worth, equivalent to about 1 million hours of streaming video) in the cloud and made it available to any scientist. The 7-year-old startup also said it received $45 million in investor backing to support further development of the company’s genomics research platform.
The rollout of the Seven Bridges tool comes as the White House’s call for a cancer “moonshot” has identified the inaccessibility of important cancer data as one reason the country has not made more progress against the disease, which is expected to kill some 600,000 people in the United States this year.
The choice of storage site — cloud versus the local supercomputers where cancer DNA data has historically resided — might not seem like a big deal, but it matters far more than whether people keep their personal files in Google Drive or on a home computer.
Migrating the National Institutes of Health’s Cancer Genome Atlas (TCGA, as the database is called) to the cloud could bring faster discoveries in the genetics of cancer, and therefore more successful treatments.
“I think it’s going to be very, very helpful to scientists and clinicians who have not been able to get to the data before,” said Anna Barker of Arizona State University, who helped develop the cancer atlas when she was deputy director of the NIH’s National Cancer Institute.
“The raw [DNA] sequence data has been difficult to access,” Barker said, “and if you’re not a pretty informed genomicist, even if you could get to the data, it’s been hard to know what to do with it.” Having the data in the cloud promises to improve how well physicians match cancer treatments to patients by looking for patterns in abstruse data on, for instance, which tumor genes are turned on and which are off, which genes are aberrant and which are normal.
The Cancer Genome Atlas, launched in 2005, is the world’s largest trove of information on which genes are mutated in ovarian cancer, leukemia, brain tumors, and 30 other cancer types. The data have resided at computing centers in San Diego and at the NCI in Maryland, and researchers who wanted to work with the data, for example by running their own analysis software on them, had to download the whole catalog and store it on their institution’s server.
Storage cost so much, about $2 million a year, that few institutions, let alone individual scientists, could afford to host the bulk downloads. And obtaining the files involved a long, arduous wait.
“Just to download it took six weeks,” said computational biologist Josh Stuart of the University of California, Santa Cruz. “And you needed a ton of storage space to hold the data.”
Those were the obstacles the NCI hoped to overcome when, in October 2014, it awarded a total of $20 million to three groups: the Institute for Systems Biology (ISB) in Seattle, Seven Bridges, and the Broad Institute of MIT and Harvard.
The marching orders were fairly general: develop pilot platforms that store TCGA data in the cloud (the Broad and ISB use Google’s cloud, while Seven Bridges uses Amazon Web Services).
“Instead of defining the work specifically, we wanted each group to innovate,” said Tony Kerlavage, the NCI’s chief of cancer informatics.
The ISB made its partially completed cloud platform available to researchers last November, offering users a way to more easily query TCGA data. The Broad’s platform opened to its own researchers and to scientists at partner institutions in Boston last month, with the goal of a wider release by April and the promise of more to come.
“When we built our cloud pilot we saw it as only a first step,” said Dr. Anthony Philippakis, chief data officer at the Broad. The platform, he noted, is “very generalizable and not specific to TCGA,” meaning future genomics databases could eventually be included.
Seven Bridges began asking for feedback on its platform-in-progress late last year and opened it to all comers on Tuesday. Its system lets researchers upload data on their own cancer patients to see, for instance, how their tumor mutations compare to those in TCGA. Researchers can also use the platform to search TCGA by more than 100 properties — including the type of tumor sequenced, patient demographics, and treatment history — as well as to find cancer cases based on what mutations and other genomic glitches patients harbor.
“NCI told us, don’t just make [the data] accessible, make it useful,” said Seven Bridges president James Sietstra. “We think this will let scientists uncover insights they might not have, if they’d had access to less data.”
During initial testing, all three cloud-pilot developers heard suggestions from users about what searches and analytic capabilities they would like. The NCI will evaluate the three platforms for nine months and “see what’s the best in each,” Kerlavage said.
It’s not clear whether the agency will combine the best of each into a single product, but eventually all or part of the three platforms will be incorporated into a mega-site called the Genomic Data Commons, developed by the University of Chicago to house molecular and clinical data from current and future NCI cancer genomics projects. That resource is expected to launch this spring.
Google Cloud and Amazon Web Services both charge for access to databases they host, but the NCI has made more than $1 million in credits available to scientists who wish to use the new cancer genomics cloud.