Front Plant Sci. 2021 Apr 16:12:657240.
doi: 10.3389/fpls.2021.657240. eCollection 2021.

Comparison of Short-Read Sequence Aligners Indicates Strengths and Weaknesses for Biologists to Consider

Ryan Musich et al. Front Plant Sci.

Abstract

Aligning short-read sequences is the foundational step of most genomic and transcriptomic analyses, but not all tools perform equally, and choosing among the growing body of available tools can be daunting. Here, to increase awareness in the research community, we discuss the merits of common algorithms and programs in a way that should be approachable to biologists with limited experience in bioinformatics. We consider only in passing the effects of data cleanup, a precursor analysis to most alignment tools, and give no consideration to downstream processing of the aligned fragments. To compare aligners [Bowtie2, Burrows Wheeler Aligner (BWA), HISAT2, MUMmer4, STAR, and TopHat2], we used an RNA-seq dataset containing data from 48 geographically distinct samples of the grapevine powdery mildew fungus Erysiphe necator. Based on alignment rate and gene coverage, all aligners performed well with the exception of TopHat2, which has been superseded by HISAT2. BWA perhaps had the best performance in these metrics, except for longer transcripts (>500 bp), for which HISAT2 and STAR performed well. HISAT2 ran ~3-fold faster than the next fastest aligner, although we consider runtime a secondary factor in most alignments. Ultimately, this direct comparison of commonly used aligners illustrates key considerations when choosing which tool to use for specific sequencing data and objectives. No single tool meets all needs for every user, and there are many quality aligners available.

Keywords: accuracy; alignment; comparison; runtime; short-read sequencing.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Creating an FM-index of the word “knickknack.” The first step is to generate all rotations of the reference sequence and sort them lexicographically. The last column of the sorted matrix is stored as the Burrows-Wheeler transform (BWT), shown in (A), with the corresponding suffix array shown in (B). A rank table, shown in (C), is created from the BWT and lists the occurrence and order of each unique character. (D) shows the lookup table, which lists the index of the first occurrence of each character in the first column of the sorted matrix.
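
The construction described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the aligners compared here. The `$` terminator, the 0-based indexes, and the function names `build_fm_index` and `count_occurrences` are assumptions made for the example. The rank table is stored in full for clarity, whereas real tools typically sample it to save memory.

```python
def build_fm_index(text, terminator="$"):
    """Return (bwt, suffix_array, rank, first) for text + terminator.

    Assumption: the terminator sorts before every character in the text.
    """
    s = text + terminator
    n = len(s)
    # Sort all rotations lexicographically, keeping each rotation's start offset.
    # With a unique terminator this ordering is also the suffix array.
    suffix_array = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The last column of the sorted matrix is the character just before each start.
    bwt = "".join(s[i - 1] for i in suffix_array)
    # Rank table: rank[c][i] is the number of occurrences of c in bwt[:i].
    alphabet = sorted(set(s))
    rank = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            rank[c][i + 1] = rank[c][i] + (1 if c == ch else 0)
    # Lookup table: row of the first occurrence of c in the first column.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += rank[c][n]
    return bwt, suffix_array, rank, first


def count_occurrences(pattern, bwt, rank, first):
    """Count matches of pattern using backward search over the FM-index."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + rank[ch][lo]
        hi = first[ch] + rank[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt, suffix_array, rank, first = build_fm_index("knickknack")
print(bwt)
print(count_occurrences("kn", bwt, rank, first))
```

Backward search consumes the pattern from its last character to its first, narrowing a range of rows in the sorted matrix at each step, so the cost of counting matches depends on the pattern length rather than on the reference length.
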
Figure 2
Creating a suffix array and tree from the word “knickknack.” The overall word with its associated indexes is shown in (A). (B) shows the unsorted list of suffixes with their associated indexes. (C) shows the completed suffix tree.
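
As a rough companion to this caption, the sketch below builds a suffix array for the same word and uses binary search to locate a pattern. It covers only the suffix array, not the suffix tree in (C). The 0-based indexes and the function names `build_suffix_array` and `find_all` are assumptions for illustration, and the naive sort is only suitable for short strings.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    """Return the sorted start offsets of every occurrence of pattern."""
    m = len(pattern)

    def prefix(row):
        # First m characters of the suffix stored at this row.
        start = suffix_array[row]
        return text[start:start + m]

    # Lower bound: first row whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    begin = lo
    # Upper bound: first row whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[begin:lo])


word = "knickknack"
sa = build_suffix_array(word)
print(sa)
print(find_all(word, sa, "ck"))
```

Because the suffixes are sorted, every occurrence of a pattern occupies one contiguous block of rows, which is why two binary searches are enough to find them all.
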
Figure 3
Alignment rate distributions for each of the seven aligners tested across 48 RNA-seq samples. Aligners with the same character above the boxplots were not significantly different based on Tukey’s HSD with 95% confidence level.
Figure 4
Runtime per read distributions for each of the seven aligners tested across 48 RNA-seq samples. Aligners with the same character above the boxplots were not significantly different based on Tukey’s HSD with 95% confidence level.
Figure 5
Transcriptome coverage based on a minimum alignment length cutoff. This figure shows the transcriptome coverage for each aligner, calculated from the BLAST+ alignment results. The y-axis shows the transcriptome coverage calculated for alignment length cutoffs ranging from 100 to 2,500 base pairs (x-axis). Coverage for an aligner was calculated by taking the total number of alignments returned by BLAST+ with length greater than or equal to the cutoff value and dividing it by the total number of transcripts in the Erysiphe necator reference transcriptome. Each line is a different aligner except for the “Reference” line, which represents the percentage of transcripts in the reference transcriptome whose length meets each cutoff. This line represents the theoretical maximum for each cutoff value.
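
The coverage calculation described in this caption amounts to a count and a division, sketched below. The function name `coverage_by_cutoff` and the input values are invented for illustration and are not data from the study. Parsing of BLAST+ output is left out, and the alignment lengths are assumed to have already been extracted into a list.

```python
def coverage_by_cutoff(alignment_lengths, n_reference_transcripts, cutoffs):
    """Fraction of reference transcripts covered at each length cutoff.

    For each cutoff, count the alignments whose length is at least the
    cutoff and divide by the number of reference transcripts.
    """
    return {
        cutoff: sum(1 for length in alignment_lengths if length >= cutoff)
        / n_reference_transcripts
        for cutoff in cutoffs
    }


# Made-up example values, not results from the paper.
toy_lengths = [120, 480, 900, 2600]
print(coverage_by_cutoff(toy_lengths, 5, [100, 500, 2500]))
```

Passing the lengths of the reference transcripts themselves as `alignment_lengths` would give the kind of upper bound that the “Reference” line describes.
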
Figure 6
Average runtime per read vs. average alignment rate for modern aligners. TopHat2 was also used but the results were excluded from this figure as it was found to be an outlier. The x-axis shows the average percentage of reads successfully mapped to the E. necator reference genome from all 48 samples. The y-axis shows the average runtime per read from all 48 samples.

