Five organizations on Monday released a new open dataset of more than 29,000 scientific articles published in journals and on preprint servers, in the hope of spurring America’s artificial intelligence experts to develop new techniques for mining data and text that could help answer some of the most pressing questions about the novel coronavirus and the disease it causes.
The dataset is believed to be the most extensive collection of its kind concerning the coronavirus and, crucially, it’s machine-readable: its contents are in a format that a computer can process easily, which makes the collection much easier for AI specialists to work with.
However, in a common hurdle for machine-learning researchers looking for usable data, the articles in the database vary in how complete they are. Only about 13,000 of the articles include full text, meaning that all of the words and figures in the article are available. The other roughly 16,000 include only metadata, such as the authors’ names or the abstract of the paper, in large part because they are behind paywalls.
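For researchers, that split means a first step is often separating the articles that can be mined in full from those that offer only metadata. The following is a minimal sketch of that step, not the dataset’s actual schema: the `Article` record, its field names, and the sample entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Article:
    # Hypothetical record layout; the real dataset's fields may differ.
    title: str
    abstract: Optional[str] = None
    full_text: Optional[str] = None  # None when only metadata is available


def split_by_coverage(articles: List[Article]) -> Tuple[List[Article], List[Article]]:
    """Return (articles with full text, articles with metadata only)."""
    full = [a for a in articles if a.full_text]
    metadata_only = [a for a in articles if not a.full_text]
    return full, metadata_only


# Toy records invented for this example, not drawn from the dataset.
sample = [
    Article("Paper A", abstract="Summary of A", full_text="Body of paper A"),
    Article("Paper B", abstract="Summary of B"),
    Article("Paper C"),
]

full, metadata_only = split_by_coverage(sample)
print(f"{len(full)} with full text, {len(metadata_only)} with metadata only")
```

A text-mining pipeline would typically run on the first group, while the second can still be searched by title, author, or abstract where those are present.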
The dataset, which has been dubbed CORD-19, short for COVID-19 Open Research Dataset, was built by a collaboration of organizations spanning different sectors:
- Microsoft contributed its literature curation tools.
- The Allen Institute for AI, one of the research institutes founded by the late Microsoft co-founder Paul Allen, transformed the content into a machine-readable form.
- The National Institutes of Health’s National Library of Medicine provided access to literature content.
- The Chan Zuckerberg Initiative — the philanthropic vehicle launched by Facebook founder Mark Zuckerberg and his wife, the pediatrician Priscilla Chan — provided access to articles that have been posted on preprint servers but not yet peer-reviewed.
- Georgetown University’s Center for Security and Emerging Technology coordinated the initiative.
The creation of the dataset was requested by the White House’s Office of Science and Technology Policy, which hosted a call for reporters on Monday to publicize it.
Michael Kratsios, the U.S. chief technology officer, told reporters that the Trump administration is issuing “a call to action” to the tech community to use the dataset to develop AI techniques and insights that could be useful in the response to the coronavirus.
As part of the initiative, 10 high-level research questions have been posted on Kaggle, an online community for AI researchers that’s owned by Google’s cloud business. Among them: “What do we know about virus genetics, origin, and evolution?” “What do we know about COVID-19 risk factors?” And “What has been published about ethical and social science considerations?”
The dataset, which covers the coronavirus, the disease it causes, and the family of viruses it belongs to, will be updated as more articles become available.