
Last January, Samuel Scarpino wasn’t sure what to make of Covid-19. The director of Northeastern University’s Emergent Epidemics Lab, he, along with every other epidemiologist in the world, was trying to interpret the earliest data on the new virus.

He was soon pulled into working on a spreadsheet, started by a group of international epidemiologists, to collect and openly share granular data on individual Covid-19 cases around the world. Today, that project launched its complete website, Global.health, which will enable open access to more than 5 million anonymized Covid-19 records from 160 countries. Each record can contain dozens of data points about the case, including demographics, travel history, testing dates, and outcomes.

The project is supported by $1.25 million in funding and other resources from Google.org, with additional support from the Rockefeller Foundation, and is led by academics from the University of Oxford, Harvard, Northeastern, Boston Children’s Hospital, Georgetown, the University of Washington, and the Johns Hopkins Center for Health Security.


STAT spoke with Scarpino about the challenges involved in bringing the project to scale and what it will take to keep this kind of massive epidemiological data collection going long enough to help avert the next pandemic. 

How did this project evolve from a simple spreadsheet to the millions of records you have today?


About this time last year, there was a group of volunteers manually entering Covid-19 records as they were reported. A news source in Japan would break that a case was there, and we’d enter it into a Google spreadsheet and share it publicly via the spreadsheet and also GitHub.

We hit the size limit of what you can have in a Google spreadsheet, which is about 80,000 cases. So we reached out to Google.org about a fellowship program where Googlers will spend three to six months working on a project that has some kind of broadly defined positive social impact. We pitched this idea to them and they bit.

I thought we were going to get, you know, like a couple of engineers to work with us for a month or something. We got about 12 people for six months, across the product and engineering stack, from designers, researchers, to engineers.

We spent six months interviewing journalists, public health officials, and researchers who are our target audience, and then built a cloud data platform to house individual-level anonymized case records — initially, for Covid, but with the vision being that this would be the kind of rapid response data system that could be deployed in real time after an event.

What’s the value of recording those individual, line-level records compared to the aggregate case and death data we’re all used to seeing?

It’s interesting, because we actually pivoted off of trying to do aggregate data when we saw that Johns Hopkins had a system that was taking off for aggregate data. We need both kinds of systems, and it is clear you wouldn’t have the same infrastructure handling these two kinds of data.

Our data often contains much more granular information about each case. Things like travel history. Age distribution, race, ethnicity, if they’re reported. Symptoms, if they’re reported. If there’s an outcome reported, like a death or a hospitalization. And so it allows you to see a much more granular picture of what’s happening.

What kind of infrastructure did you need to build to track all those variables? 

One of the things that we tried to engineer in was the assumption that the data models were going to need to change, because this is an emerging infectious disease and we don’t know what we don’t know. When we launch we’re expecting to have 10 million records from 160 countries. Ninety percent of them have at least 12 fields complete; 50% of the cases have about 25 fields.

You won’t see it, but we built a front-end system for entering records, because we understand that a lot of times people are going to be collecting data off informal sources, maybe even manually as they get recorded from press releases and hospitals and on social media.

What’s similar now actually to what happened early with Wuhan, is that we started seeing most of these variants of concern reported in newspapers. “Individual from Czech Republic tests positive for B.1.1.7.” And so we just added a field — is there a variant of concern tied to this case — then started recording them, so we now have maps associated with variants.
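To make the idea of a changeable data model concrete, here is a minimal Python sketch of a case record whose optional fields can grow over time, along with the kind of completeness figure Scarpino cites. It is not Global.health's actual schema or code: the class, the function, every field name, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class CaseRecord:
    """One anonymized case record. All field names are hypothetical."""
    case_id: str
    country: str
    # Open-ended optional attributes, so a new field (for example a
    # variant of concern) can be added without changing the class.
    attributes: Dict[str, Any] = field(default_factory=dict)

    def completed_fields(self) -> int:
        # The two required fields plus every attribute that has a value.
        return 2 + sum(1 for v in self.attributes.values() if v is not None)


def share_with_at_least(records: List[CaseRecord], n: int) -> float:
    """Fraction of records with at least n completed fields."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.completed_fields() >= n) / len(records)


# Made-up sample data, for illustration only.
records = [
    CaseRecord("c1", "JP", {"age_range": "30-39", "travel_history": "yes", "outcome": None}),
    CaseRecord("c2", "CZ", {"variant_of_concern": "B.1.1.7"}),
]
share = share_with_at_least(records, 4)
```

Keeping optional attributes in an open-ended mapping is one simple way to let the model change as the disease does; a production system would more likely pair this with schema versioning and validation.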

But not every country will be able to provide that granularity of data. 

There’s actually a view that shows what percentage of all the reported cases we have in our system. You might think of it as, “this is a map of how poor a job we’re doing in different places.” I don’t see it that way. We have all of the data that is readily accessible, reported publicly from the various ministries of health and health departments. So I see this as a map of global public health data quality.

One of the things that you will see is that many of the places where we have a higher percentage of the data, there are fewer cases. And it’s because the quality of one’s data system is going to be inextricably linked to the quality of one’s response in almost all cases.

Why didn’t the World Health Organization do this kind of tracking? 

Yeah, a couple reasons. First, it’s expensive. It requires software engineers, which are expensive. I have no idea what it would really cost to hire the kinds of — actually, I do have an idea, probably about $5 million for six months. WHO would have had to find those kinds of resources. Then, as soon as this thing gets ramped up, they’re massively understaffed to even be responding to everything they’re supposed to be doing, much less maintaining a massive international data system.

And then, on top of all that, the politics around international data sharing are a mess. We took advantage of the fact of being a volunteer organization and we just did it. We’ve engaged pretty heavily on the legal and ethical side of this. So we will find out, probably the hard way, but we think everything we’re doing is actually legal given the data use agreements and data sharing agreements and regulations around data scraping.

What’s the role of Global.health at this point in the pandemic? This data has been available to researchers the whole time — why launch publicly now? 

Right now there are so many cases that we might as well just be doing aggregated cases. But Covid-19 is gonna become rare. It will fall back into the milieu of things that cause respiratory illness.

As a result, we’re going to need higher-fidelity systems that are capturing lots of information and informing rapid public health response, identifying new variants and capturing information on their spread. So the one-to-two-year plan is ensuring that we have the data being captured as we move into the more complicated phase. Eventually we’ll go back down into the realm where we’re looking at travel histories, age distributions, and we’re going to be there tracking this the whole time.

The five-year plan is tuberculosis, malaria. And the next time somebody shows up in a seafood market with an emerging infectious disease …

What’s it going to take to keep this supported for however long until the next pandemic — to make it actually work the way that you intended it to?

To be honest, that’s one of the things that we’re looking to figure out. How do we actually organize ourselves for the next five years so that we don’t get sideswiped by the political and financing issues that we talked about for the WHO. A good point would be, you did this successfully because nobody was paying attention, and everybody was looking at Hopkins and you’re running the other way grabbing all the data, and then a whole bunch of software engineers donated $10 million worth of their time to build this thing. That’s not going to happen again.

We’ve had very generous support, both in terms of finances and in terms of expertise and resources, from Rockefeller, Google.org, MapBox. But I don’t know what that gets: two, three, four, or five years, maybe more than that. Maybe somebody decides they’re just going to infuse enough capital that we’re self-sustaining. But there are business models that don’t involve one necessarily becoming Palantir that still allow you to be self-sustaining, and that’s one of the things that we’re also working on.
