By Shana R. Spindler, PhD
“What would Yoda say to investigators about sharing data?” asked Dr. Rohan Hazra, medical officer in the Maternal and Pediatric Infectious Disease Branch. “Train yourself to let go of everything you fear to lose.”
A few decades ago, the idea that we could store terabytes of data on our personal computers seemed unfathomable. But it just goes to show that scientists benefit from being forward thinking, because what might seem impossible today could easily be the norm tomorrow. So what do we do when the impossible becomes, in fact, reality?
With the ability to collect, store, and access large datasets from research spanning tens of years, scientists and doctors must establish dataset ground rules and, quite frankly, a new culture of data sharing. Four experienced researchers who are involved with large datasets met on July 10, 2013, at an NICHD Exchange meeting to discuss the role that “big data” will play in the future of NICHD and science in general.
The term “big data” refers to the vast amounts of structured and unstructured data, accumulated from large and encompassing studies, that require nontraditional techniques to process and analyze. In his introductory talk, Dr. Hazra likened the process of establishing how to use big data to the rules of early childhood: share everything and play fair.
Several achievements in the past few years have set the stage for big data sharing. The Big Data to Knowledge (BD2K) initiative (https://commonfund.nih.gov/bd2k/overview.aspx) and the Yale Open Data Access (YODA) project (http://medicine.yale.edu/core/projects/yodap/index.aspx) are good examples of facilitating the use of big data and successfully sharing data, respectively.
The goals of sharing big data are noble. Sharing large datasets encourages diversity of analysis, promotes new research, avoids duplication, and honors research as a public good. It also saves money and time and increases the power of statistical arguments.
However, before researchers rush onto the big data scene, several concerns must be considered. Erroneous secondary analysis of datasets, privacy and confidentiality issues, proprietary interests, academic credit, and the balance of data preservation with administrative burdens all must factor into the equation. To address these pitfalls, researchers must “collect data with sharing issues in mind all along,” said Dr. Hazra.
Captain Steven Hirschfeld, director of the National Children’s Study (NCS), emphasized a standards-based approach to big data. The NCS is an NICHD-led study to investigate environmental and genetic effects on the growth, development, and health of children from birth until age 21 years. Drawing upon his experience with the NCS, Captain Hirschfeld approaches big data from a “big picture” point of view, in which data exists within a life cycle of approach, acquisition, analysis, and use.
Ultimately, different groups will have varying end-point requirements of the data, and Captain Hirschfeld aims to satisfy the needs of all data users via the integration of multiple standards for each step of the data life cycle. He also acknowledged that data access must be structured, robust, and tightly controlled.
When the Division of Epidemiology, Statistics, and Prevention Research (DESPR) of the NICHD standardized access to its extensive dataset collection, it established an internal Biorepository Access and Data Sharing (BRADS) committee to oversee the process. Dr. Jennifer Weck, Scientific Program Specialist with DESPR and chair of the BRADS committee, continued the Exchange meeting with the “theory versus practice” of dataset and biospecimen sharing.
Because DESPR researchers established their datasets to answer specific questions, Dr. Weck noted that several areas of data sharing have been tricky to navigate: finding a common data structure for diverse projects, establishing common dataset search terms, examining data ownership issues, and establishing ethical guidelines for the additional use of biospecimens*. To address some of these issues, the BRADS committee reviews all requests for access to the BRADS data, and biospecimen use requires a material transfer agreement along with a 10-page proposal that is reviewed by the committee as well as outside experts.
Problems do arise, said Dr. Weck, in areas such as sufficient review of proposals, specimen use costs, and compliance with data submission. The BRADS program addresses these issues with an oversight committee and reviewers, alongside an in-house data management and web development team.
Clearly, big data is not a one-person job. “Data science really requires a team approach,” said Dr. Regina Bures, final speaker of the Exchange meeting. Dr. Bures manages portfolios on population health and the environment for the Population Dynamics Branch in addition to her managing role in the Educational Programs for the Demography and Population Science Research Grant Program. Dr. Bures emphasized the need to encourage cultural change through funding policies, data sharing policies, and the education of researchers across all disciplines.
The take-home message from the meeting: it is important to be forward thinking and thoughtful now in the design of data collection and analysis because there’s a big world out there full of big data.
*During the discussion session, one of the audience members asked how to deal with consent for the use of biospecimens when a study included thousands of people who are no longer available for contact. After some discussion, the panel members agreed that the ethicists’ consensus is that researchers cannot use biospecimens for additional research without consent from the donating individual, although different institutes have varying guidelines on this practice. Food for thought in the design of biospecimen consent forms from this point forward…