On the complexity of haplotyping a microbial community

Samuel M Nicholls^{1

2

3

4}, Wayne Aubrey¹, Kurt De Grave^{2

5}, Leander Schietgat^{2

6}, Christopher J Creevey^{3

7}, Amanda Clare¹

Affiliations

¹ Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK.
² Department of Computer Science, Katholieke Universiteit Leuven, 3001 Leuven, Belgium.
³ Institute of Biological, Rural and Environmental Sciences, Aberystwyth University, Aberystwyth SY23 3DA, UK.
⁴ Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham B15 2TT, UK.
⁵ Flanders Make, 3920 Lommel, Belgium.
⁶ Artificial Intelligence Lab, Vrije Universiteit Brussel, 1050 Ixelles, Belgium.
⁷ Institute of Global Food Security, School of Biological Sciences, Queen's University, Belfast BT9 5DL, UK.

PMID: 33444437
PMCID: PMC8208737
DOI: 10.1093/bioinformatics/btaa977

On the complexity of haplotyping a microbial community

Samuel M Nicholls et al. Bioinformatics. 2021.

. 2021 Jun 16;37(10):1360-1366.

doi: 10.1093/bioinformatics/btaa977.

Authors

Samuel M Nicholls^{1

2

3

4}, Wayne Aubrey¹, Kurt De Grave^{2

5}, Leander Schietgat^{2

6}, Christopher J Creevey^{3

7}, Amanda Clare¹

Affiliations

¹ Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK.
² Department of Computer Science, Katholieke Universiteit Leuven, 3001 Leuven, Belgium.
³ Institute of Biological, Rural and Environmental Sciences, Aberystwyth University, Aberystwyth SY23 3DA, UK.
⁴ Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham B15 2TT, UK.
⁵ Flanders Make, 3920 Lommel, Belgium.
⁶ Artificial Intelligence Lab, Vrije Universiteit Brussel, 1050 Ixelles, Belgium.
⁷ Institute of Global Food Security, School of Biological Sciences, Queen's University, Belfast BT9 5DL, UK.

PMID: 33444437
PMCID: PMC8208737
DOI: 10.1093/bioinformatics/btaa977

Abstract

Motivation: Population-level genetic variation enables competitiveness and niche specialization in microbial communities. Despite the difficulty in culturing many microbes from an environment, we can still study these communities by isolating and sequencing DNA directly from an environment (metagenomics). Recovering the genomic sequences of all isoforms of a given gene across all organisms in a metagenomic sample would aid evolutionary and ecological insights into microbial ecosystems with potential benefits for medicine and biotechnology. A significant obstacle to this goal arises from the lack of a computationally tractable solution that can recover these sequences from sequenced read fragments. This poses a problem analogous to reconstructing the two sequences that make up the genome of a diploid organism (i.e. haplotypes) but for an unknown number of individuals and haplotypes.

Results: The problem of single individual haplotyping was first formalized by Lancia et al. in 2001. Now, nearly two decades later, we discuss the complexity of 'haplotyping' metagenomic samples, with a new formalization of Lancia et al.'s data structure that allows us to effectively extend the single individual haplotype problem to microbial communities. This work describes and formalizes the problem of recovering genes (and other genomic subsequences) from all individuals within a complex community sample, which we term the metagenomic individual haplotyping problem. We also provide software implementations for a pairwise single nucleotide variant (SNV) co-occurrence matrix and greedy graph traversal algorithm.

Availability and implementation: Our reference implementation of the described pairwise SNV matrix (Hansel) and greedy haplotype path traversal algorithm (Gretel) is open source, MIT licensed and freely available online at github.com/samstudio8/hansel and github.com/samstudio8/gretel, respectively.

PubMed Disclaimer

Figures

**Fig. 1.**
An example SNP matrix. Read fragments represented by grey boxes (left) are aligned to some reference with known SNP loci. The alleles at the SNP loci are represented by white and grey circles. These reads can be alternatively represented by an m × n SNP matrix (right). Each row of the matrix models one of the m read fragments and each column corresponds to one of the n SNPs. Elements encode the allele at a given SNP for a particular read fragment as a 0 or 1, or a—if the read does not cover that position. A column containing only one element indicates the corresponding SNP site is homozygous, otherwise it is heterozygous

**Fig. 2.**
Three corresponding representations, (a) a set of aligned reads r1.r4, with called variants s1.s3, (b) the pairwise SNV co-occurrence matrix H where each possible pair of symbols (00, 01, 10, 11) has a matrix storing counts of occurrences of that ordered symbol pair between two positions across the aligned reads, (c) a simple graph that can be constructed by considering the evidence provided by adjacent variants. H is represented by an upper triangular matrix, as it is unnecessary to store the same observations with reversed positions in the lower diagonal nor observations of transitions to self along the main diagonal. Note for simplicity this example uses an alphabet of only two symbols, but in practice we consider an alphabet $Σ = {A, C, G, T, N, -}$

**Fig. 3.**
Considering only adjacent SNVs, one may create paths for which there was no actual observed evidence. Here, the reads {0011, 0001, 0100} do not support either of the results {0000, 0101}, but both are valid paths through a graph that permits edges between pairs of adjacent SNVs

See this image and copyright information in PMC

References

1. Aguiar D., Istrail S. (2013) Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics, 29, i352–360. - PMC - PubMed
1. Churchill G.A., Waterman M.S. (1992) The accuracy of DNA sequences: estimating sequence quality. Genomics, 14, 89–98. - PubMed
1. Cilibrasi R. et al. (2005) On the complexity of several haplotyping problems. In: Casadio,R. and Myers,G. (eds) Algorithms in Bioinformatics. Springer, Berlin, Heidelberg, pp. 128–139.
1. Ebler J. et al. (2019) Haplotype-aware diplotyping from noisy long reads. Genome Biol., 20, 116. - PMC - PubMed
1. Forster S.C. et al. (2019) A human gut bacterial genome and culture collection for improved metagenomic analyses. Nat. Biotechnol., 37, 186–192. - PMC - PubMed

Grants and funding

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

On the complexity of haplotyping a microbial community

Affiliations

On the complexity of haplotyping a microbial community

Authors

Affiliations

Abstract

Figures

References

Grants and funding

LinkOut - more resources

Full Text Sources

Miscellaneous