Scientific discovery could be accelerated with more open access to genomic data, says an article in the latest issue of the journal Science by a group of research leaders from across the globe that includes David W. Ussery, Ph.D., at the University of Arkansas for Medical Sciences (UAMS).
“We argue that the publicly available data should be treated as open data, a shared resource with unrestricted use for analysis, interpretation and publication,” states the article, which appears in the journal’s Policy Forum under the title “Toward unrestricted use of public genomic data.”
The article, with 51 authors, challenges long-standing customs and guidelines that have allowed the producers of genomics data to keep it for analysis and publication before outside researchers can study it.
As a specialist in bacterial genomics, Ussery, a professor in the UAMS College of Medicine Department of Biomedical Informatics, said a better understanding of genome sequences will help scientists more easily determine where outbreaks originate and how they can be treated.
“In my field, it is critical to have unrestricted access to this kind of genomic data,” said Ussery, a member of the international Genomic Standards Consortium. “Some of our biggest scientific advances are likely to come from genomics research, and we need to remove barriers that could delay discoveries.”
The article calls for revising the landmark 2003 Fort Lauderdale Agreement, which is a public declaration by scientists supporting free and unrestricted use of genome sequencing data. The agreement, the authors say, is “self-contradictory” because it also recommends a hands-off approach to publicly available data so that those who produced the data have a chance to analyze and publish it.
A key factor in the article’s push is the growing wave of raw data from faster, inexpensive third-generation genome sequencing devices, said Ussery, who holds the Helen Adams & Arkansas Research Alliance Endowed Chair in Bioinformatics.
“By 2025, the amount of data from third-generation sequencing will dwarf other big data generators like YouTube and Twitter,” Ussery said. “YouTube is expected to reach 2 exabytes, but third-generation sequencing will produce about 20 zettabytes of data.” A zettabyte is 1,000 times larger than an exabyte.
In a recent presentation, Ussery cited the 20 zettabyte projection for genetic sequencing data, noting that the estimated cost to store that much data is $2 trillion.
In fact, with the advent of large global data analysis studies, the article says, the amount of publicly available data is at the scale of yottabytes (1,000 times larger than a zettabyte).
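To put those figures side by side, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes decimal (SI) units, takes the 2 exabyte, 20 zettabyte and $2 trillion figures exactly as quoted above, and derives a per-terabyte storage cost that the article itself does not state. The variable names are hypothetical.

```python
# Decimal (SI) byte units; an assumption, since the article does not say
# whether decimal or binary prefixes are meant.
TERABYTE = 10**12
EXABYTE = 10**18
ZETTABYTE = 10**21   # 1,000 exabytes
YOTTABYTE = 10**24   # 1,000 zettabytes

# Figures as quoted in the article.
youtube_bytes = 2 * EXABYTE
sequencing_bytes = 20 * ZETTABYTE
storage_cost_usd = 2 * 10**12  # $2 trillion

# How many times larger the sequencing projection is than the YouTube one.
ratio = sequencing_bytes // youtube_bytes

# Implied storage cost per terabyte (derived here, not stated in the article).
sequencing_terabytes = sequencing_bytes // TERABYTE
cost_per_terabyte_usd = storage_cost_usd / sequencing_terabytes

print(f"Sequencing data is {ratio:,} times the YouTube projection")
print(f"Implied storage cost: ${cost_per_terabyte_usd:,.2f} per terabyte")
```

Under these assumptions, the projected sequencing output is 10,000 times the YouTube figure, and the $2 trillion estimate works out to roughly $100 per terabyte stored.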
Scientific analysis of so much data requires costly computing resources and advanced analytical capabilities, and some scientists who produce genomic data don’t have those advanced capabilities. In those cases, outside researchers should be allowed free access to the data without restriction.
“For example,” the article states, “the outsider team may have better analytical capabilities and/or overarching protocols for analyzing more comprehensive sets of data, pre- or post-publication. Also, sequence datasets can be interrogated by means of numerous value-added platforms and tools from multiple groups.”
The article cites three guiding principles for its recommendations:
- Public genomics data that have ethics approval for release should be open data – available for unrestricted use, together with associated metadata – with the exception of sensitive human data to which additional ethics restrictions may apply.
- Science advances through open competition with clear-cut, transparent rules, not through posing restrictions and limitations.
- Credit should be given appropriately to resource producers (those who produce the data) and should be transparent.
“These recommendations should not impede protection of sensitive human data,” the article states. “We acknowledge that for existing sensitive human data, some restrictions may be appropriate.”
The article is available here: http://science.sciencemag.org/content/363/6425/350.