Genome Biol. 2024 Apr 19;25(1):101.

doi: 10.1186/s13059-024-03240-8.

Measuring, visualizing, and diagnosing reference bias with biastools

Mao-Jan Lin¹, Sheila Iyer², Nae-Chyun Chen², Ben Langmead³

Affiliations

¹ Department of Computer Science, Johns Hopkins University, Baltimore, USA. mlin77@jhu.edu.
² Department of Computer Science, Johns Hopkins University, Baltimore, USA.
³ Department of Computer Science, Johns Hopkins University, Baltimore, USA. langmea@cs.jhu.edu.

PMID: 38641647
PMCID: PMC11027314
DOI: 10.1186/s13059-024-03240-8

Mao-Jan Lin et al. Genome Biol. 2024.

Abstract

Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias. It works in various scenarios: when the donor's variants are known and reads are simulated; when donor variants are known and reads are real; and when variants are unknown and reads are real. Using biastools, we observe that more inclusive graph genomes result in fewer biased sites. We find that end-to-end alignment reduces bias at indels relative to local aligners. Finally, we use biastools to characterize how T2T references improve large-scale bias.

Keywords: Pangenomics; Reference bias; Sequence alignment.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Illustration of the types of balance measurement (simulation balance, SB; mapping balance, MB; assignment balance, AB) with respect to read simulation, read mapping, and haplotype assignment. Note that mismapped reads are excluded when calculating MB, and reads assigned “Others” are also excluded when calculating AB. Columns indicate distinct types of bias event. “Loss^∗” indicates a bias event due to reads with ALT alleles failing to align. “Loss^∗∗” indicates a bias event due to reads mapping elsewhere than their true point of origin. “Flux” indicates bias from gaining mismapped reads from other sites. “Local” indicates that local repeat content and sequencing errors combine to make a gap placement ambiguous.
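The relationship between the three measures in Fig. 1 can be sketched in a few lines of Python. This is an illustration only, not the biastools implementation: the `Read` record and its field names are hypothetical, and balance is expressed here as the fraction of ALT reads, an assumption based on the wording of the Fig. 4 caption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Read:
    # All field names are hypothetical, chosen for this sketch only.
    true_allele: str      # "REF" or "ALT": haplotype the read was simulated from
    simulated_here: bool  # True if the read truly originates from this site
    mapped_here: bool     # True if the aligner placed the read over this site
    assigned: str         # "REF", "ALT", or "Others" after haplotype assignment


def alt_fraction(alleles: Iterable[str]) -> Optional[float]:
    """Fraction of ALT among REF/ALT labels; None if no such labels exist."""
    labels = list(alleles)
    ref, alt = labels.count("REF"), labels.count("ALT")
    return alt / (ref + alt) if ref + alt else None


def balances(reads):
    """Return (SB, MB, AB) for one variant site under the assumptions above."""
    reads = list(reads)
    # SB: every read simulated from the site, labelled by its true haplotype.
    sb = alt_fraction(r.true_allele for r in reads if r.simulated_here)
    # MB: reads from the site that also mapped there; mismapped reads are excluded.
    mb = alt_fraction(
        r.true_allele for r in reads if r.simulated_here and r.mapped_here
    )
    # AB: every read mapped to the site, labelled by assignment; "Others" excluded.
    ab = alt_fraction(
        r.assigned for r in reads if r.mapped_here and r.assigned != "Others"
    )
    return sb, mb, ab
```

In this sketch an ALT read that fails to align lowers MB relative to SB (a "loss" event), while a REF read gained from another site leaves MB unchanged but lowers AB (a "flux" event).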

**Fig. 2**
Normalized mapping balance to normalized assignment balance (NMB-NAB) plot of **a** SNV sites with the naive assignment method, **b** SNV sites with the context-aware assignment method, **c** insertion and deletion sites with the naive assignment method, and **d** insertion and deletion sites with the context-aware assignment method. Each dot represents a variant site on HG002 chromosome 20. The simulated reads are aligned using Bowtie 2 with default parameters. The balance and bias subcategories are classified based on the position of the dots (“Biased-site classification” section). For visual clarity, sites with no correctly mapped REF reads are omitted; the full plot including these sites is available as Additional file 1: Fig. S1.

**Fig. 3**
Bias-by-allele-length plots considering only Simulation Balance (blue), Mapping Balance (orange), Assignment Balance using context-aware assignment (green), and Assignment Balance using naive assignment (red). Variant length varies along the x-axis, with positive values standing for insertions, negative values for deletions, and 0 for SNVs. Alignment is done with Bowtie 2 on HG002 simulated data. Top: balance for all four measures. Dots represent the median of the distribution and whiskers indicate the first and third quartiles. Middle: zoom-in on Mapping Balance and context-aware Assignment Balance, with data normalized by subtracting the median SB in each stratum. Bottom: number of variants of each length. Gaps exceeding 25 bp are collapsed into the $-25$ or $25$ strata.
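The stratification and normalization described in the Fig. 3 caption can be sketched as follows. This is a minimal illustration rather than biastools code: the function names and the `(ref, alt, sb, mb)` tuple layout are hypothetical, and only the two operations stated in the caption are shown (collapsing lengths beyond 25 bp and subtracting the per-stratum median SB).

```python
from collections import defaultdict
from statistics import median


def length_stratum(ref: str, alt: str, cap: int = 25) -> int:
    """Signed allele-length difference, clamped to [-cap, cap].

    Positive values are insertions, negative values deletions, 0 is an SNV.
    """
    return max(-cap, min(cap, len(alt) - len(ref)))


def normalized_mb_by_stratum(sites, cap: int = 25):
    """Subtract the median SB of each length stratum from every MB value in it.

    `sites` is an iterable of (ref, alt, sb, mb) tuples; this input layout is
    an assumption made for the sketch.
    """
    by_stratum = defaultdict(list)
    for ref, alt, sb, mb in sites:
        by_stratum[length_stratum(ref, alt, cap)].append((sb, mb))

    normalized = {}
    for stratum, pairs in by_stratum.items():
        median_sb = median(sb for sb, _ in pairs)
        normalized[stratum] = [mb - median_sb for _, mb in pairs]
    return normalized
```

Normalizing against the median SB of the same stratum removes any imbalance already present in the simulated reads, so that what remains reflects the effect of mapping.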

**Fig. 4**
Bias-by-allele-length for 8 alignment workflows. We used simulated and real WGS datasets derived from HG002. We subsetted to reads aligning to HET sites on chromosome 20. Variants are arranged according to their length, with positive values standing for insertions and negative values standing for deletions. Zero indicates SNVs. **a** Fraction of ALT alleles in the simulation (blue) and after mapping of simulated reads (other colors). **b** Fraction of ALT alleles after mapping and context-aware assignment using simulated reads. **c** Fraction of ALT alleles after mapping and context-aware assignment using real reads. **d** The number of incidents of each size.

**Fig. 5**
The receiver operating characteristic (ROC) curve and the precision-recall (PR) curve of the biastools classifier on Bowtie 2 alignments. **a** ROC curve for SNVs, **b** PR curve for SNVs, **c** ROC curve for gaps, **d** PR curve for gaps. The four lines show simulated data (blue and orange) and real data (green and red) under multiplication scoring (mul) and addition scoring (add). auc: area under curve.

**Fig. 6**
Bias regions of HG002 called by biastools with two different methods. The tracks, from top to bottom, are: combined Z-score for direct-to-GRC alignment, combined Z-score for Leviosam2 alignment, IGV read arrangement for direct alignment, read arrangement for Leviosam2, “Biased region” for direct alignment, and “Biased region” for Leviosam2. Combined Z-scores include read depth, variant density, and non-diploid variants. Scores above 10 are truncated in the panel to show the details between 0 and 10. Note that the read coverage tracks use different scales: for direct alignment the track ranges from 0 to 254, while for Leviosam2 it ranges from 0 to 60.

**Fig. 7**
The aligned reads and variants in alignment coordinates and expansion coordinates. In expansion coordinates, the expansion can be anchored on either the left side or the right side of the variant.

**Fig. 8**
Two examples of repetitive context. **a** The repeat extends to the right, so the effective variant is extended to the right so that the ALT context sequence is no longer a prefix of the REF context sequence. **b** The case where the original ALT context sequence is a substring of the REF context sequence. There are two choices of effective variant; biastools chooses the shorter one (choice 2).

**Fig. 9**
Illustration of bias categorization with NAB and NMB. Variants positioned within the green circle of radius 0.1 centered at the origin are classified as balanced. Variants in the yellow region along the diagonal are categorized as bias “loss”. Variants in the blue region, where $|\mathrm{NMB}| > 0.1$ excluding the bias “loss” region, are classified as either bias “flux” or bias “local”, depending on whether more than 5 reads are mismapped to the site. Variants falling outside these categories are classified as outliers. NAB: normalized assignment balance; NMB: normalized mapping balance.
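The decision rule in the Fig. 9 caption can be written as a short function. The sketch below is not taken from biastools. The caption does not give the width of the diagonal "loss" band, so `diag_tol` is a hypothetical parameter with an arbitrary default, and the band is assumed to be symmetric around the line NAB = NMB. Treating more than 5 mismapped reads as "flux" (rather than "local") follows the Fig. 1 description of flux as gaining mismapped reads.

```python
import math


def classify_site(
    nmb: float,
    nab: float,
    n_mismapped: int,
    radius: float = 0.1,      # radius of the "balanced" circle (from the caption)
    nmb_thresh: float = 0.1,  # |NMB| threshold for flux/local (from the caption)
    diag_tol: float = 0.1,    # ASSUMPTION: half-width of the diagonal "loss" band
    flux_reads: int = 5,      # mismapped-read cutoff (from the caption)
) -> str:
    """Classify one variant site from its normalized balances."""
    # Inside the circle around the origin: no appreciable bias.
    if math.hypot(nmb, nab) <= radius:
        return "balanced"
    # Near the diagonal NAB = NMB: classified as "loss".
    if abs(nab - nmb) <= diag_tol:
        return "loss"
    # Large |NMB| away from the diagonal: flux or local.
    if abs(nmb) > nmb_thresh:
        return "flux" if n_mismapped > flux_reads else "local"
    return "outlier"
```

The checks are ordered so that each region excludes the ones tested before it, which mirrors the caption's wording that the flux/local region excludes the "loss" region.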

Update of

Measuring, visualizing and diagnosing reference bias with biastools.
Lin MJ, Iyer S, Chen NC, Langmead B. bioRxiv [Preprint]. 2024 Feb 15:2023.09.13.557552. doi: 10.1101/2023.09.13.557552. Update in: Genome Biol. 2024 Apr 19;25(1):101. doi: 10.1186/s13059-024-03240-8. PMID: 37745608.
