NAR Genom Bioinform. 2023 Aug 21;5(3):lqad074.
doi: 10.1093/nargab/lqad074. eCollection 2023 Sep.

BioConvert: a comprehensive format converter for life sciences

Hugo Caro et al. NAR Genom Bioinform.

Abstract

Bioinformatics is a field known for the numerous standards and formats that have been developed over the years. This plethora of formats, sometimes complementary and often redundant, poses many challenges to bioinformatics data analysts. They constantly need to find the best tool to convert their data into a suitable format, which is often a complex, technical and time-consuming task. Moreover, these small yet important tasks are often difficult to make reproducible. To overcome these difficulties, we initiated BioConvert, a collaborative project to facilitate the conversion of life science data from one format to another. BioConvert aggregates existing software within a single framework and complements it with original code when needed. It provides a common interface that streamlines the user experience, so that users no longer have to learn tens of separate tools. Currently, BioConvert supports about 50 formats and 100 direct conversions in areas such as alignment, sequencing, phylogeny, and variant calling. In addition to being useful for end-users, BioConvert can also be used by developers as a universal benchmarking framework for evaluating and comparing numerous conversion tools. We also provide a web server implementing a user-friendly online interface to BioConvert, allowing direct use by the community.

Figures

Figure 1.
Template of a new converter performing conversion from format A to format B. Methods are implemented using Python and optionally external binaries.
Figure 2.
Example of an implicit conversion where extensions suffice for inferring the type of conversion required.
Figure 3.
Example of an explicit conversion where extensions cannot be resolved automatically.
Figure 4.
Example of an explicit conversion with an implicit output.
Figure 5.
Example of an explicit conversion with an explicit output.
Figure 6.
Formats and conversions available in BioConvert are represented as a directed acyclic graph. Nodes correspond to formats and edges correspond to conversions. Colors indicate the degree of each format (number of connections/conversions that a node/format has in the graph).
Figure 7.
The formats included in BioConvert cover NGS data. In this graph, nodes (formats) are clustered by field of application. Several topics can be identified, including variant calling, phylogeny, sequencing data, alignments, ...
Figure 8.
Example of conversion that produces two output files.
Figure 9.
Number of methods implemented in each conversion. Most have only 1 or 2 methods.
Figure 10.
Single-mode benchmarking. BioConvert provides a sub-command to compare the computational time of all methods available within a given converter (here FastQ to FastA). Each method is run several times to estimate its average time and standard error. In this instance, the mawk method gives the best performance. Results and error bars may fluctuate depending on hardware performance and concurrently running processes. Benchmark obtained on an SSD, with compressed input and uncompressed output files.
Figure 11.
The multi-mode benchmark of the fastq2fasta conversion involves repeating the single-mode benchmark multiple times to better understand the variability within a method and provide more confidence in determining the fastest method. The mawk method has consistently low variation and is one of the fastest methods.
Figure 12.
Benchmarking of the bam2sam converter with four methods implemented in BioConvert. The only difference between the top and bottom panels is the version of samtools used for the benchmark (1.7 and 1.15, respectively). In the top panel, the performance of samtools and sambamba was similar, while in the bottom panel, samtools was 2-3 times faster. This difference can be attributed to the significantly improved performance of the newer samtools version.
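
The single-mode benchmark described in the Figure 10 caption (run each method several times, then report the average time and its standard error) can be illustrated with a minimal Python sketch. This is not BioConvert's code or API: the two FastQ-to-FastA functions, the `benchmark` helper, and the `SAMPLE` data are hypothetical stand-ins chosen only to show the timing logic, and the sketch assumes well-formed four-line FastQ records held in memory.

```python
import statistics
import time
from typing import Callable, Dict, Tuple


def fastq_to_fasta_loop(fastq: str) -> str:
    """Hypothetical method 1: walk the records four lines at a time."""
    lines = fastq.splitlines()
    out = []
    for i in range(0, len(lines) - 3, 4):
        out.append(">" + lines[i][1:])  # replace the leading '@' with '>'
        out.append(lines[i + 1])        # keep the sequence line
    return "\n".join(out) + "\n"


def fastq_to_fasta_slices(fastq: str) -> str:
    """Hypothetical method 2: pick headers and sequences with slices."""
    lines = fastq.splitlines()
    pairs = zip(lines[0::4], lines[1::4])
    return "".join(f">{header[1:]}\n{seq}\n" for header, seq in pairs)


def benchmark(
    methods: Dict[str, Callable[[str], str]],
    data: str,
    repeats: int = 5,
) -> Dict[str, Tuple[float, float]]:
    """Return {method name: (mean seconds, standard error of the mean)}."""
    if repeats < 2:
        raise ValueError("at least two repeats are needed for a standard error")
    results = {}
    for name, func in methods.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            timings.append(time.perf_counter() - start)
        mean = statistics.mean(timings)
        sem = statistics.stdev(timings) / len(timings) ** 0.5
        results[name] = (mean, sem)
    return results


# Tiny made-up input: two reads in four-line FastQ layout.
SAMPLE = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n"
METHODS = {"loop": fastq_to_fasta_loop, "slices": fastq_to_fasta_slices}

if __name__ == "__main__":
    results = benchmark(METHODS, SAMPLE * 1000)
    # List methods from fastest to slowest by mean time.
    for name, (mean, sem) in sorted(results.items(), key=lambda kv: kv[1][0]):
        print(f"{name}: {mean:.6f} s +/- {sem:.6f} s")
```

As the caption notes, the measured times depend on the hardware and on concurrently running processes, so the ranking printed by such a sketch may change from run to run. Repeating the whole benchmark several times, as in the multi-mode benchmark of Figure 11, gives a better picture of that variability.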
