Review
2017 Jul 20;15:379-386.
doi: 10.1016/j.csbj.2017.07.002. eCollection 2017.

Scalability and Validation of Big Data Bioinformatics Software


Andrian Yang et al. Comput Struct Biotechnol J.

Abstract

This review examines two aspects that are central to modern big data bioinformatics analysis: software scalability and validity. We argue that the issues of scalability and validation are not only common to all big data bioinformatics analyses but can also be tackled by conceptually related methodological approaches, namely divide-and-conquer (scalability) and multiple executions (validation). Scalability is defined as the ability of a program to scale based on workload. It has always been an important consideration when developing bioinformatics algorithms and programs. Nonetheless, the surge in the volume and variety of biological and biomedical data has posed new challenges. We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to implement divide-and-conquer effectively in a distributed computing environment. Validation of software is another important issue in big data bioinformatics, and one that is often ignored. Software validation is the process of determining whether the program under test fulfils the task for which it was designed. Determining the correctness of the computational output of big data bioinformatics software is especially difficult because of the large input space and the complex algorithms involved. We discuss how state-of-the-art software testing techniques based on the idea of multiple executions, such as metamorphic testing, can be used to implement an effective bioinformatics quality assurance strategy. We hope this review will raise awareness of these critical issues in bioinformatics.
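
As an illustration of the divide-and-conquer idea described in the abstract, the following is a minimal single-machine sketch of the map and reduce phases applied to k-mer counting. It is not taken from the review: the function names (`split_into_chunks`, `map_count_kmers`, `reduce_counts`, `count_kmers`) and the choice of k-mer counting as the task are assumptions made for illustration, and a real deployment would hand the chunks to a framework such as MapReduce or Spark rather than loop over them locally.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(reads, n_chunks):
    """Divide: partition the reads into n_chunks interleaved sub-lists."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def map_count_kmers(chunk, k):
    """Map: count every k-mer in one chunk, independently of the others."""
    counts = Counter()
    for read in chunk:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def reduce_counts(left, right):
    """Reduce: merge two partial count tables by summing per k-mer."""
    return left + right


def count_kmers(reads, k, n_chunks=4):
    """Hypothetical driver: split, map each chunk, then fold the partial results.

    The list comprehension stands in for distributed workers; each call to
    map_count_kmers touches only its own chunk, so it could run elsewhere.
    """
    chunks = split_into_chunks(reads, n_chunks)
    partials = [map_count_kmers(chunk, k) for chunk in chunks]
    return reduce(reduce_counts, partials, Counter())
```

Because each map call depends only on its own chunk and the reduce step is associative, the result should not depend on how many chunks are used, which is the property that lets the workload be spread across machines.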

Figures

Fig. 1. Scalability and validation – two important aspects of big data bioinformatics.
Fig. 2. Examples of MapReduce-based (a) and Spark-based (b) big data bioinformatics analysis frameworks.
Fig. 3. An example to illustrate how metamorphic testing (MT) can be used to test the correctness of a RNA-seq feature quantification pipeline.
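
The multiple-executions idea behind metamorphic testing can be sketched in a few lines. The example below is a toy, not the pipeline shown in Fig. 3: `quantify` is a hypothetical stand-in for a feature quantification program, reads are reduced to integer start positions, and the two metamorphic relations (read order should not matter; duplicating every read should double every count) are assumptions chosen because they are easy to state without knowing the correct output.

```python
import random


def quantify(reads, genes):
    """Toy program under test: count reads whose position falls in each gene.

    reads is a list of integer positions; genes maps a gene name to a
    half-open interval (start, end). Both are illustrative simplifications.
    """
    counts = {name: 0 for name in genes}
    for pos in reads:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


def check_permutation_relation(reads, genes, seed=0):
    """Relation 1: shuffling the input reads must not change any count."""
    shuffled = reads[:]
    random.Random(seed).shuffle(shuffled)
    return quantify(shuffled, genes) == quantify(reads, genes)


def check_duplication_relation(reads, genes):
    """Relation 2: feeding every read twice must exactly double every count."""
    base = quantify(reads, genes)
    doubled = quantify(reads + reads, genes)
    return all(doubled[name] == 2 * base[name] for name in genes)
```

Neither check needs a known-correct expected output. Each one runs the program more than once on related inputs and compares the outputs with each other, so a violated relation signals a defect even when the true counts are unknown.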

