Genome Biol. 2021 Jan 4;22(1):8.
doi: 10.1186/s13059-020-02229-3.

Reference flow: reducing reference bias using multiple population genomes

Nae-Chyun Chen et al. Genome Biol.

Abstract

Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.
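The commit/defer logic summarized in the Fig. 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Alignment` record, the `align` and `lift_over` callables, the `mapq_threshold` default of 10, and the use of an alignment score to pick the "best" alignment are all hypothetical choices made for illustration. The sketch also simplifies the flow to two passes, in which every deferred read is re-aligned to each additional reference.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alignment:
    """Hypothetical alignment record; field names are illustrative only."""
    read_id: str
    reference: str      # name of the reference the read was aligned to
    position: int       # coordinate on that reference
    score: int          # alignment score, higher is better (assumed criterion)
    mapq: int           # mapping quality
    aligned: bool = True


# Assumed interfaces: a real pipeline would wrap an aligner and a liftover tool.
Aligner = Callable[[str, str], Alignment]     # (read_id, reference) -> Alignment
LiftOver = Callable[[Alignment], Alignment]   # to standard-reference coordinates


def reference_flow(
    read_ids: List[str],
    first_ref: str,
    later_refs: List[str],
    align: Aligner,
    lift_over: LiftOver,
    mapq_threshold: int = 10,  # hypothetical cutoff, not taken from the source
) -> Dict[str, Alignment]:
    best: Dict[str, Alignment] = {}
    deferred: List[str] = []

    # First pass: commit confident alignments, defer the rest.
    for rid in read_ids:
        aln = align(rid, first_ref)
        best[rid] = aln
        if not aln.aligned or aln.mapq < mapq_threshold:
            deferred.append(rid)

    # Second pass: re-align deferred reads to each additional reference.
    for ref in later_refs:
        for rid in deferred:
            aln = align(rid, ref)
            current = best[rid]
            # Strict '>' keeps the earlier alignment on ties (an arbitrary choice).
            if aln.aligned and (not current.aligned or aln.score > current.score):
                best[rid] = aln

    # Merge: report one alignment per read, lifted over to the standard reference.
    return {
        rid: lift_over(aln) if aln.aligned else aln
        for rid, aln in best.items()
    }
```

Committed reads are never re-aligned, which is where the sketch saves work relative to aligning every read against every reference.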

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
The reference flow workflow. Reads are aligned to a reference genome in the first pass. Reads with high mapping quality alignments are “committed.” Unaligned reads or reads with low mapping quality are “deferred” and re-aligned to one or more additional references. The process can iterate, with similar logic for how reads are committed or deferred to another pass. Deferrals could follow the shape of an overall “reference flow graph.” Once all alignments are complete, they are merged. For a read aligning to more than one reference, only the best alignment is reported, with ties broken arbitrarily. Alignments are then translated (“lifted over”) to the coordinates of a standard reference such as GRCh38.
Fig. 2
Alignment results using different methods. (a) Alignment sensitivity for 100 samples selected from the 1000 Genomes Project; 2 million reads are simulated from each sample. (b) The number of strongly biased heterozygous sites and (c) the overall REF-to-ALT ratio for 25 samples; 20 million reads are simulated for each sample. The columns are sorted by median alignment sensitivity.
Fig. 3
Histograms of allelic balance using a high-coverage real WGS dataset of individual NA12878 (SRR622457). Experiments are performed using GRCh38 (GRC), global major reference (Major), diploid personalized genome (Personalized), vg using alleles with frequency ≥ 10% (vg), reference flow using 1000-bp phased blocks with 5 super populations (RandFlow-LD), and reference flow using 1000-bp phased blocks with 26 populations (RandFlow-LD-26).
Fig. 4
Number of strongly biased heterozygous (HET) sites stratified by RepeatMasker class, after aligning single-end reads from SRR622457. HET sites are determined using 1000 Genomes Project calls for NA12878, the individual sequenced in SRR622457. RandFlow methods and vg substantially reduce the number of biased sites for L1, Alu, and ERV1. RandFlow-LD-26 reduces the number of biased sites the most among the methods tested.
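The bias measures named in the figure captions (allelic balance, the overall REF-to-ALT ratio, and the count of strongly biased HET sites) can be illustrated with a small Python sketch. It assumes allelic balance is the fraction of reads at a HET site that carry the reference allele. The `low` and `high` cutoffs for calling a site "strongly biased" are hypothetical parameters, since this record does not state the thresholds used, and the function names are invented for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def allelic_balance(ref_count: int, alt_count: int) -> Optional[float]:
    """Fraction of reads supporting the REF allele; None if the site is uncovered."""
    total = ref_count + alt_count
    return ref_count / total if total else None


def summarize_bias(
    site_counts: Iterable[Tuple[int, int]],  # (REF reads, ALT reads) per HET site
    low: float = 0.2,    # hypothetical cutoff
    high: float = 0.8,   # hypothetical cutoff
) -> Dict[str, float]:
    total_ref = 0
    total_alt = 0
    covered_sites = 0
    strongly_biased = 0

    for ref_count, alt_count in site_counts:
        total_ref += ref_count
        total_alt += alt_count
        balance = allelic_balance(ref_count, alt_count)
        if balance is None:
            continue  # skip sites with no overlapping reads
        covered_sites += 1
        if balance <= low or balance >= high:
            strongly_biased += 1

    # Overall REF-to-ALT ratio pooled across sites; 1.0 would indicate no net bias.
    ratio = total_ref / total_alt if total_alt else float("inf")
    return {
        "covered_sites": covered_sites,
        "strongly_biased_sites": strongly_biased,
        "ref_to_alt_ratio": ratio,
    }
```

At an unbiased HET site the two alleles are expected to be sampled about equally, so allelic balance near 0.5 and a pooled ratio near 1.0 are the reference points against which bias is read.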

