Benchmarking is a test that measures the performance of software against a standard. It is used to compare the quality of a piece of software against similar products, a software update against previous versions, or the same software under different parameters or computing environments. Benchmarking results can inform us of the risks related to software accuracy, stability and performance (e.g., runtime and resource usage) so we can implement changes to mitigate any detected issues. This also helps users select the proper tools for their scientific applications.
A well-designed benchmarking study will provide accurate, unbiased and evidence-based information on both the strengths and weaknesses of a tool. The quality and complexity of the input data (e.g., noise or low read depth) can have a significant impact on software performance, so it is important to select an appropriate test dataset that is representative of the characteristics of the real-life data the software will handle. Test datasets may be sourced from real data collected from experimental samples. Alternatively, if experimental data is not available, synthetic data can be generated to simulate a sample condition that is relevant to the intended use case of the software. For example, one may generate synthetic data by spiking in synthetic RNA at known concentrations in RNA-Seq experiments, or by using 16S rRNA from a bacterial community of known composition in metagenomic studies. While the ground truth is often absent in experimental data, we can still benchmark software by comparing test results against each other or against a currently accepted “gold standard.”
Benchmarking results can be summarized by quantitative metrics that evaluate different aspects of software performance. In computational biology, benchmarking studies are usually published by software developers when a new scientific or bioinformatic tool is released. Likewise, researchers may also conduct a comprehensive assessment of multiple current state-of-the-art tools for a particular application, such as commonly used sequence assemblers, variant callers or taxonomic classifiers.
When should we benchmark?
During software development - Verification and validation are important for detecting flaws and implementing fixes prior to the software release. A repeated, systematic benchmarking process, also known as continuous testing, allows developers to receive immediate feedback on the software during development. Is the software behaviour consistent with the developer’s expectations and the user specification? Does it perform consistently across different computing environments or on different test datasets?
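As a minimal sketch of continuous testing, a regression test can compare a tool's output on a small fixed dataset against a result established on a previous, trusted version; any code change that breaks the expectation fails the test immediately. The `align_reads` function and its expected value here are hypothetical stand-ins for a real tool and dataset:

```python
def align_reads(reads):
    # Hypothetical stand-in for the tool under test: counts reads
    # that match a fixed reference prefix.
    reference = "ACGT"
    return sum(1 for r in reads if r.startswith(reference))

def test_align_reads_is_stable():
    test_reads = ["ACGTTT", "ACGTAA", "TTTTTT"]
    # Expected result established on a previous, trusted version.
    assert align_reads(test_reads) == 2

test_align_reads_is_stable()
print("regression test passed")
```

In practice such a test would run automatically (e.g., in a CI pipeline) on every commit, so performance or correctness regressions are caught as soon as they are introduced.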
After a software update - After new features, improvements or bug fixes are implemented, we need to verify that the software performs as well as or better than previous versions, and make sure that code changes do not break the software or negatively affect its performance.
Prior to software selection for an analysis - This is an important step in experimental design for any computational research study. There is an ever-growing number of bioinformatic tools with similar functions and applications. Rather than choosing a tool based on popularity or usage frequency, researchers should objectively select the software or computational method that is most appropriate to their data and research questions. Recently published comprehensive benchmarking studies can serve as a reference. Alternatively, researchers may design their own benchmarking study using their own datasets to ensure their choice of bioinformatic software is evidence-based.
Common benchmarking metrics in bioinformatics
Concordance measures the level of identity between two measurements. It shows how similar the predictions are among different analyses.
Precision measures the proportion of positive predictions (true positives and false positives) that are actually correct (true positives).
Recall, also known as sensitivity, measures the proportion of actual positives (true positives and false negatives) that are correctly identified (true positives).
Accuracy reflects how well a software/model measures what it is supposed to measure. Specifically, it is the ratio of correctly predicted observations (true positives and true negatives) to the total number of observations.
F1 score is a combined metric that reflects the harmonic mean of precision and recall. A high (close to 1) or low (close to 0) F1 score indicates that both precision and recall are high or low, respectively. A medium F1 score can indicate an imbalance, such as high recall with low precision or vice versa.
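The metrics above can be computed directly from the counts of true/false positives and negatives. The following is a minimal sketch; the example counts (80 TP, 10 FP, 20 FN, 90 TN) are illustrative, not from any real study:

```python
def precision(tp, fp):
    # Proportion of positive predictions that are true positives.
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of actual positives that are correctly identified.
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Proportion of all observations that are correctly predicted.
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def concordance(calls_a, calls_b):
    # Proportion of positions where two analyses agree.
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return matches / len(calls_a)

# Illustrative counts: 80 TP, 10 FP, 20 FN, 90 TN
print(precision(80, 10))         # 0.888...
print(recall(80, 20))            # 0.8
print(accuracy(80, 90, 10, 20))  # 0.85
print(f1_score(80, 10, 20))      # ~0.842
```

Note how the F1 score (~0.842) sits between the precision (~0.889) and recall (0.8) values, but closer to the lower one: the harmonic mean penalizes imbalance between the two.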