Tuesday, June 13, 2017

Wide-Open

[Figure: Number of samples in the NCBI GEO]
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data.

Researchers routinely deposit data in online repositories. But they are only human, and it's not rare for them to forget to ask a repository to release their data once a paper is published. That matters because open data lets other researchers reproduce results and reuse the same datasets to make novel discoveries. While many scientific journals now require authors to make the data underlying their findings publicly available, these policies often go unenforced. The challenge is substantial: the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) repository alone contains 80,985 public datasets, spanning hundreds of tissue types in thousands of organisms, and the rapid growth in data makes it difficult for journals or data repositories to "police" whether datasets that should be publicly available actually are.

A new tool, developed by University of Washington and Microsoft researchers, automatically identifies datasets overdue for public release. It applies text mining to find dataset references in published articles, then parses query results from the repositories to determine whether those datasets remain private. The system is called Wide-Open and is available under an open source license on GitHub.
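To make the idea concrete, here is a minimal sketch of the two steps described above. It is not Wide-Open's actual code. It assumes that GEO series accessions have the form "GSE" plus digits and that SRA study accessions have the form "SRP", "ERP", or "DRP" plus digits. The function names and the `is_public` callback are hypothetical; the callback stands in for a real repository query, whose response a real tool would have to parse.

```python
import re

# Assumed accession formats: GEO series (GSE + digits) and
# SRA studies (SRP/ERP/DRP + digits).
ACCESSION_RE = re.compile(r"\b(GSE\d+|[SED]RP\d+)\b")


def extract_accessions(article_text):
    """Return unique accession IDs found in the text, in order of first appearance."""
    seen = []
    for match in ACCESSION_RE.finditer(article_text):
        accession = match.group(1)
        if accession not in seen:
            seen.append(accession)
    return seen


def find_overdue(article_text, is_public):
    """Return accessions cited in a published article that are not public.

    `is_public` is a hypothetical callable (accession -> bool) that stands in
    for querying the repository and parsing its response.
    """
    return [acc for acc in extract_accessions(article_text) if not is_public(acc)]


# Made-up article text and a made-up set of released datasets, for illustration only.
text = "Data are available in GEO under GSE12345 and in SRA under SRP067890. See GSE12345."
released = {"GSE12345"}
print(find_overdue(text, lambda acc: acc in released))  # ['SRP067890']
```

Passing the repository check in as a function keeps the text-mining step testable without network access; any real lookup can be swapped in later.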

The researchers tested their tool on two popular data repositories maintained by the NCBI: GEO and the Sequence Read Archive (SRA). Wide-Open identified a large number of overdue datasets, which spurred repository administrators to release 400 datasets within one week.
