Bioinformatics is an interdisciplinary field combining expertise and best practices primarily from biology and computer science. Experimental design and laboratory techniques, such as the use of appropriate sample controls and routine instrument calibration, are important for hypothesis testing and data collection in wet lab research. Likewise, software design, validation and verification, regular updates and bug fixes are important for reproducible in silico analyses of sequencing and other biological data. While omics research entails both the laboratory experiments and the downstream computational analyses, research efforts are often focused on the former. The importance of quality standards and of selecting bioinformatic software appropriate to the research application is relatively under-appreciated.

Bioinformatics:

The collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied to molecular genetics and genomics (Merriam-Webster)

Big data scientific research relies on appropriate software for data management, processing and analysis. Reproducible and reliable results depend on both quality wet lab and quality dry lab practices. Bioinformatics software should therefore be held to the same best practices as any other software engineering project.

Below are a few best practices for software development and testing that are equally applicable to the scientific research field:

Coding standards - Making your code readable and self-documenting enables more efficient code review, whether by yourself at a later time or by others during collaboration or the peer review process for publications. Have you ever looked back at your code and had trouble understanding a convoluted function you wrote a few years ago? Avoid that by writing simpler code, and less of it. This also makes errors easier to detect and existing code easier to re-use for different applications.
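As a small illustration of what readable, self-documenting code can look like, here is a minimal sketch in Python. The function name gc_content and its handling of empty input are assumptions chosen for this example, not part of any particular library.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence.

    Hypothetical helper for illustration. An empty sequence
    returns 0.0 instead of raising a division error.
    """
    if not sequence:
        return 0.0

    # Normalize case so that "atgc" and "ATGC" give the same answer.
    normalized = sequence.upper()
    gc_count = normalized.count("G") + normalized.count("C")
    return gc_count / len(normalized)
```

A descriptive name, a type hint and a short docstring tell a reviewer what the function is meant to do without having to trace every line.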

Semantic versioning - The way version numbers are assigned and incremented helps developers and users predict the kind of changes made in each software release. It also keeps track of each transition or improvement in the software development process. Software releases are represented by a version number in the format Major.Minor.Patch (e.g., version 3.2.1):

  • Major version change is not backward compatible. The new release changes the software's public interface in ways that may break scripts or pipelines written against previous versions of the same software.
  • Minor version change adds functionality to the software while maintaining backward compatibility.
  • Patch version change implements backward compatible bug fixes with no functionality change.
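The three rules above can be sketched in Python as follows. The names Version, parse_version and classify_change are hypothetical, and the sketch assumes plain Major.Minor.Patch strings with no pre-release or build suffixes.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a plain 'Major.Minor.Patch' string such as '3.2.1'."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def classify_change(old: str, new: str) -> str:
    """Report which component differs first between two versions."""
    previous, current = parse_version(old), parse_version(new)
    if current.major != previous.major:
        return "major"  # may break backward compatibility
    if current.minor != previous.minor:
        return "minor"  # new functionality, backward compatible
    if current.patch != previous.patch:
        return "patch"  # backward compatible bug fixes only
    return "none"
```

For example, a pipeline pinned to version 3.2.1 could use classify_change("3.2.1", "4.0.0") to flag an upgrade that needs a closer look before re-running an analysis.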

Continuous integration and continuous delivery or deployment (CI/CD) - CI/CD refers to automating the build, testing and preparation of code changes for deployment into a test or production environment before a software release. This automation allows code changes to be implemented continually and software behaviour to be monitored during development. Changes from multiple collaborators are merged into a centralized repository where builds and tests run automatically prior to a software release. Frequently integrating smaller changes from the different sources minimizes complex merge conflicts. Code can be rapidly integrated and tested for errors prior to delivery and deployment. This helps to speed up software validation before release and prevents the accumulation of errors that would otherwise make troubleshooting progressively more difficult.
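The automated tests that a CI system runs on every merge can be quite small. Below is a minimal sketch using Python's built-in unittest module. The reverse_complement function and the test cases are hypothetical examples, and the CI service configuration itself is not shown.

```python
import unittest

# Translation table mapping each DNA base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


class ReverseComplementTest(unittest.TestCase):
    def test_known_sequence(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_round_trip_restores_input(self):
        sequence = "AACCGGTT"
        twice = reverse_complement(reverse_complement(sequence))
        self.assertEqual(twice, sequence)

    def test_empty_sequence(self):
        self.assertEqual(reverse_complement(""), "")
```

A CI job would typically run such tests with a command like python -m unittest on each proposed change, so that a regression is caught when it is introduced and not weeks later.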

Open-source development - Publicly available source code allows for full transparency of the functionality and design of a piece of software. Open-source software can be reviewed continually and independently by experts in the community. These collective efforts can help identify errors and suggest improvements to help create and maintain high quality software. Welcoming contributions from anyone in the open-source community, as opposed to restricting management to a private team or entity, can increase the reliability of the software.

As bioinformatics and omics data become an integral part of scientific research, we need to carefully evaluate the validity of each step of the research process, from laboratory data collection to computational analyses. The development and selection of bioinformatic software should be subjected to the same scrutiny as laboratory assays. Software engineering best practices, already well established in computer science, should also be applied in the life sciences, where data analysis and interpretation may have a profound impact on scientific findings.