Have you ever tried to re-run an analysis done by another research group, your team members or even yourself a few years later? Even with available datasets and documentation, reproducing results from previous studies is not always as simple as one might expect. Data reproducibility is a measure of whether published results can be regenerated using the original researchers’ data and analysis code. It evaluates whether a research finding can be validated, which is critical for peer review and quality control of any novel claims within the scientific community.
To ensure accuracy and to minimize research biases, novel scientific claims need to be validated by reproducing results with transparent access to the original data and methodologies. A 2016 Nature survey of over 1,500 researchers reported that more than 70% had failed to reproduce others’ experiments and more than half had failed to reproduce their own [1]. Approximately 90% of the surveyed researchers acknowledged that we are now in a “reproducibility crisis” [1].
In computational biology research, reproducibility is further complicated by inherent issues in the development and execution of scientific software. Latent software errors are hidden flaws in software that may not lead to a failure and may remain undetected even after the software is released. An estimated 15 to 50 errors are present in every 1,000 lines of delivered code, and it is exceedingly rare to achieve fewer than 1 error per 1,000 lines [2]. Other software-related issues impacting data reproducibility include the lack of software verification (when software behaviours differ from developers’ assumptions), non-deterministic algorithms (inconsistent behaviours across runs using the same input), and differences in the computational resources used by different users.
As in silico experiments become more popular with the advancement of sequencing technologies, it is necessary to ensure data reproducibility at every stage of computational research, from software development to data analysis and interpretation.
The lack of reproducibility in science can be attributed to numerous factors. A lack of open data is a common issue: data and code from a publication may simply be inaccessible. An editor of the peer-reviewed journal Molecular Brain reported that more than 97% of submitted manuscripts did not present raw data as requested [3]. Most of these manuscripts were consequently rejected due to either a complete absence of data or insufficient data. Even when data are publicly available, they may not be in formats that are easy to access or work with. For example, it is not uncommon to see large data tables embedded as multi-page PDF files, rather than as easy-to-parse text files, in the supplementary materials of a publication.
Making raw data and metadata available and readily accessible is key to enabling the reproducibility of your research; it is impossible to validate any research without the original data. While small datasets can generally be published with the corresponding manuscripts, larger datasets such as sequencing data require dedicated public repositories, such as the National Center for Biotechnology Information (NCBI)’s Sequence Read Archive (SRA) for next-generation sequencing data and Gene Expression Omnibus (GEO) for high-throughput gene expression data.
A lack of documentation for the analysis code creates another problem for reproducibility. A majority of scripts require specific user configuration before they can be successfully run on a machine. Without knowing what needs to be configured, it is difficult to run someone else’s analysis without encountering errors.
Detailed documentation entails a clear description of the bioinformatic tools, software versions, parameter settings (default or customized), computational resources (memory, runtime, nodes, CPU cores, GPUs, etc.) and the operating system used in the original study. Raw data often require cleaning, wrangling, and quality control before they can be used for the primary analyses, and these pre-processing steps should also be adequately explained. When sharing analysis scripts, keeping your code clean and using comments to describe complex functions will help others re-run your analyses. It is always better to provide more information rather than less, so that others can precisely reproduce an analysis using the available data and instructions.
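As a minimal sketch of machine-readable documentation (the parameter names and values below are hypothetical), an analysis script can record its own software versions and settings alongside the results:

```python
import json
import platform
import sys
from datetime import datetime, timezone

def write_run_metadata(path, parameters):
    """Record the environment and parameter settings of an analysis run."""
    metadata = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version,
        "operating_system": platform.platform(),
        "parameters": parameters,  # the exact settings used in this run
    }
    with open(path, "w") as handle:
        json.dump(metadata, handle, indent=2)
    return metadata

# Hypothetical settings for an alignment step
meta = write_run_metadata("run_metadata.json",
                          {"aligner": "bwa-mem", "threads": 8, "seed": 42})
```

Publishing such a metadata file with the results lets others see exactly which environment and parameters produced them.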
Many bioinformatic tools that support a graphical user interface (GUI) also support a command line interface (CLI), which can be automated as an individual analysis or as a component of a larger analysis pipeline. A GUI requires manual selection of parameters and upload of input files, and these sequential actions are generally not trackable. CLI tools, on the other hand, require all user input to be included in a script prior to running, which eases future revisions or troubleshooting if problems occur during the analysis. For repeated analyses using multiple datasets or parameter settings, automating the analysis through the CLI also minimizes manual errors and inconsistencies during data processing; it is likewise more efficient and scalable for large, complex analyses.
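As an illustration (using `wc -l` as a stand-in for a real bioinformatic CLI tool, with made-up sample names), a short script can run the same command over several datasets while logging every invocation:

```shell
set -euo pipefail

# Stand-in input files; a real analysis would start from raw sequencing data
printf 'ACGT\nTTGA\n' > sample_A.fa
printf 'GGCC\n' > sample_B.fa

for sample in sample_A sample_B; do
    cmd="wc -l ${sample}.fa"        # 'wc -l' stands in for a real CLI tool
    echo "$cmd" >> commands.log     # record the exact command that was run
    $cmd > "${sample}.count"
done
```

Because every parameter is fixed in the script and every command is logged, the whole run can be repeated or audited later.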
Whether you are a bioinformatic user or developer, a portable and reproducible software system will ease the installation and use of a tool or pipeline. Too often, researchers spend an unexpected amount of time on software installation, especially when they encounter dependency issues (i.e., missing libraries required for proper installation). Software containers are a way to prevent this headache. Containers, such as Docker images, are lightweight, standalone, executable packages of software that contain all the necessary components to run an application. They remove one of the common barriers to reproducibility in science by easing the installation of the tools required for a particular analysis. Similarly, workflow management systems are tools for ensuring reproducibility in an analysis pipeline that comprises multiple software tools. They compose and execute a series of computational steps, from data processing to analysis and result visualization. A few popular workflow systems used by genomic researchers include Galaxy, Snakemake and Nextflow.
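As a minimal sketch of a container recipe (the package names and the entry-point script are illustrative, not a tested build), a Dockerfile fixes both the operating system and the tools an analysis depends on:

```dockerfile
# Illustrative only: pin a base image so every build starts from the same OS
FROM ubuntu:22.04

# Install the tools the analysis needs inside the container
RUN apt-get update && \
    apt-get install -y --no-install-recommends samtools python3 && \
    rm -rf /var/lib/apt/lists/*

# Bundle the analysis script with its environment
COPY analysis.py /opt/analysis.py
ENTRYPOINT ["python3", "/opt/analysis.py"]
```

Anyone who builds and runs this image obtains the same environment, regardless of what is installed on their own machine.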
The quality of the underlying software used in scientific research tends to be overshadowed by other priorities such as experimental design and research findings. Even with a high-quality dataset and a refined research plan, the end results can be drastically impacted by the analysis tools. In silico predictions may be inaccurate or non-reproducible if rigorous software testing is not properly performed during software development. Software testing ensures that the program is correctly implemented (i.e., verification) and behaves in accordance with the user requirements (i.e., validation). It also ensures that the software is reliable and will generate the same results when an analysis is repeated. If any unexpected behaviours are detected, reproducible errors will also be useful for troubleshooting.
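As a small sketch of verification (the `gc_content` helper is hypothetical), a unit test checks a function against inputs with known answers, including edge cases:

```python
def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence (hypothetical helper)."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def test_gc_content():
    assert gc_content("GGCC") == 1.0             # all bases are G/C
    assert gc_content("ATAT") == 0.0             # no G/C bases
    assert abs(gc_content("acgt") - 0.5) < 1e-9  # lower-case input handled
    try:
        gc_content("")
    except ValueError:
        pass  # empty input is rejected as expected
    else:
        raise AssertionError("empty input should raise ValueError")

test_gc_content()
```

Running such tests on every change helps catch latent errors before they reach a published analysis.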
Software stability over time is another contributing factor to the reproducibility crisis. A 2019 study evaluated the archival stability of omics software published between 2005 and 2017 [4]: almost 28% of the 36,702 software tools surveyed were no longer accessible through their uniform resource locators (URLs). Another study reported that software availability gradually decreases, from 95% after two years to 84% after three years from the publication date [5]. Even when they are open source and instrumental to sequencing-based research, many bioinformatic tools, primarily those developed in academia, are poorly maintained. Unlike the NCBI and the European Bioinformatics Institute (EMBL-EBI), where the continuous maintenance of databases and tools is supported by government funding, independent research labs do not always receive such support from their funding agencies. Although the US government spends an estimated $16 billion per year on basic life science research, there are relatively limited funding opportunities for the development and maintenance of genomic software to support the continual growth of omics research [6]. While there is no immediate solution to this, opening these tools to collaboration by the scientific community may be a viable option for maintaining their functional longevity.
Version control is the practice of tracking and managing changes to computer programs, documents or other information. Each version represents a snapshot of the resource at a specific point in time. You can always return to a particular version to revert changes, troubleshoot bugs or compare to another version. Version control also documents what changes were made, who made them and when they were made. Git is a useful version control system that is both open-source and easily integrated into your projects. It is also a collaborative platform for teams to work on the same project while being able to track changes made by each member.
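A minimal Git session might look like the following (the repository name, identity and commit messages are placeholders):

```shell
git init -q analysis-repo                     # create a new repository
git -C analysis-repo config user.name "A. Researcher"
git -C analysis-repo config user.email "researcher@example.org"

echo 'print("first analysis")' > analysis-repo/analysis.py
git -C analysis-repo add analysis.py
git -C analysis-repo commit -q -m "Add initial analysis script"

echo 'print("revised analysis")' > analysis-repo/analysis.py
git -C analysis-repo commit -q -am "Revise analysis"   # a second snapshot

git -C analysis-repo log --oneline            # lists both versions, with author and date
```

Each commit is a recoverable snapshot, so an earlier version of the analysis can always be checked out and compared.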
Finally, one of the fundamental needs in addressing the reproducibility crisis is to improve awareness of reproducible science. It is important for academic, government and industry entities to provide comprehensive training to researchers new to the computational biology field. Understanding how to design reproducible research before conducting it will ensure that best practices of data reproducibility are implemented right from the beginning.
While these eight ways can help improve data reproducibility, it is important to note that data reproducibility does not always equal the quality and validity of a research study. The ability to reproduce a published analysis is only the first step in validating novel research findings. Additional review of the research design and result interpretation is required to determine whether such findings truly represent novel scientific knowledge with high confidence.
1. Baker M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454. https://doi.org/10.1038/533452a
2. Soergel D. A. (2014). Rampant software errors may undermine scientific results. F1000Research, 3, 303. https://doi.org/10.12688/f1000research.5930.2
3. Miyakawa T. (2020). No raw data, no science: another possible source of the reproducibility crisis. Molecular Brain, 13(1), 24. https://doi.org/10.1186/s13041-020-0552-2
4. Mangul, S., Mosqueiro, T., Abdill, R. J., Duong, D., Mitchell, K., Sarwal, V., Hill, B., Brito, J., Littman, R. J., Statz, B., Lam, A. K., Dayama, G., Grieneisen, L., Martin, L. S., Flint, J., Eskin, E., & Blekhman, R. (2019). Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biology, 17(6), e3000333. https://doi.org/10.1371/journal.pbio.3000333
5. Ősz, Á., Pongor, L. S., Szirmai, D., & Győrffy, B. (2019). A snapshot of 3649 Web-based services published between 1994 and 2017 shows a decrease in availability after 2 years. Briefings in Bioinformatics, 20(3), 1004–1010. https://doi.org/10.1093/bib/bbx159
6. Siepel A. (2019). Challenges in funding and developing genomic software: roots and remedies. Genome Biology, 20(1), 147. https://doi.org/10.1186/s13059-019-1763-7