PLoS ONE Paper on Sizing Discovery and Access Challenges
I want to share a pointer to a paper published in PLoS ONE on July
24, 2015, titled "Sizing the Problem of Improving Discovery and
Access to NIH-Funded Data: A Preliminary Study" by Kevin Read et
al.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132735
This is an excellent example of work that is badly needed to help
us better understand the scale of the challenge of managing
research data to facilitate its discovery and reuse by other scholars,
and to illuminate the roles that repositories of various types may
play in this effort. I've reproduced the abstract below.
Clifford Lynch
Director, CNI
---------------------------------
Objective
This study informs efforts to
improve the discoverability of and access to biomedical datasets by
providing a preliminary estimate of the number and type of datasets
generated annually by research funded by the U.S. National Institutes
of Health (NIH). It focuses on those datasets that are "invisible"
or not deposited in a known repository.
Methods
We analyzed NIH-funded journal
articles that were published in 2011, cited in PubMed and deposited in
PubMed Central (PMC) to identify those that indicate data were
submitted to a known repository. After excluding those articles, we
analyzed a random sample of the remaining articles to estimate how
many and what types of invisible datasets were used in each
article.
Results
About 12% of the articles
explicitly mention deposition of datasets in recognized repositories,
leaving 88% that are invisible datasets. Among articles with invisible
datasets, we found an average of 2.9 to 3.4 datasets, suggesting there
were approximately 200,000 to 235,000 invisible datasets generated
from NIH-funded research published in 2011. Approximately 87% of the
invisible datasets consist of data newly collected for the research
reported; 13% reflect reuse of existing data. More than 50% of the
datasets were derived from live human or non-human animal
subjects.
Conclusion
In addition to providing a
rough estimate of the total number of datasets produced per year by
NIH-funded researchers, this study identifies additional issues that
must be addressed to improve the discoverability of and access to
biomedical research data: the definition of a "dataset,"
determination of which (if any) data are valuable for archiving and
preservation, and better methods for estimating the number of datasets
of interest. Lack of consensus amongst annotators about the number of
datasets in a given article reinforces the need for a principled way
of thinking about how to identify and characterize biomedical
datasets.
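
A quick back-of-envelope check of the Results figures may help readers see how the per-article averages relate to the totals. The abstract does not state how many articles had invisible datasets, so the sketch below only backs out the count implied by the published numbers (200,000 to 235,000 datasets at 2.9 to 3.4 datasets per article). The function name and the derived article count are my own illustration, not figures from the paper.

```python
# Back-of-envelope sketch using only numbers quoted in the abstract.
# The implied article count is derived here for illustration; the
# abstract itself does not report it.

def implied_articles(total_datasets: float, datasets_per_article: float) -> float:
    """Articles implied by a dataset total and a per-article average."""
    return total_datasets / datasets_per_article


# Low and high ends of the ranges reported in the Results section.
LOW_TOTAL, HIGH_TOTAL = 200_000, 235_000
LOW_AVG, HIGH_AVG = 2.9, 3.4

low_articles = implied_articles(LOW_TOTAL, LOW_AVG)
high_articles = implied_articles(HIGH_TOTAL, HIGH_AVG)

print(f"Implied articles (low end):  {low_articles:,.0f}")
print(f"Implied articles (high end): {high_articles:,.0f}")
```

Both ends of the range point to roughly 69,000 articles with invisible datasets, which suggests the reported totals are that article count multiplied by the low and high per-article averages and then rounded.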