Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Jun 16;37(10):1360-1366.
doi: 10.1093/bioinformatics/btaa977.

On the complexity of haplotyping a microbial community

Affiliations

On the complexity of haplotyping a microbial community

Samuel M Nicholls et al. Bioinformatics. .

Abstract

Motivation: Population-level genetic variation enables competitiveness and niche specialization in microbial communities. Despite the difficulty in culturing many microbes from an environment, we can still study these communities by isolating and sequencing DNA directly from an environment (metagenomics). Recovering the genomic sequences of all isoforms of a given gene across all organisms in a metagenomic sample would aid evolutionary and ecological insights into microbial ecosystems with potential benefits for medicine and biotechnology. A significant obstacle to this goal arises from the lack of a computationally tractable solution that can recover these sequences from sequenced read fragments. This poses a problem analogous to reconstructing the two sequences that make up the genome of a diploid organism (i.e. haplotypes) but for an unknown number of individuals and haplotypes.

Results: The problem of single individual haplotyping was first formalized by Lancia et al. in 2001. Now, nearly two decades later, we discuss the complexity of 'haplotyping' metagenomic samples, with a new formalization of Lancia et al.'s data structure that allows us to effectively extend the single individual haplotype problem to microbial communities. This work describes and formalizes the problem of recovering genes (and other genomic subsequences) from all individuals within a complex community sample, which we term the metagenomic individual haplotyping problem. We also provide software implementations for a pairwise single nucleotide variant (SNV) co-occurrence matrix and greedy graph traversal algorithm.

Availability and implementation: Our reference implementation of the described pairwise SNV matrix (Hansel) and greedy haplotype path traversal algorithm (Gretel) is open source, MIT licensed and freely available online at github.com/samstudio8/hansel and github.com/samstudio8/gretel, respectively.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
An example SNP matrix. Read fragments represented by grey boxes (left) are aligned to some reference with known SNP loci. The alleles at the SNP loci are represented by white and grey circles. These reads can be alternatively represented by an m × n SNP matrix (right). Each row of the matrix models one of the m read fragments and each column corresponds to one of the n SNPs. Elements encode the allele at a given SNP for a particular read fragment as a 0 or 1, or a—if the read does not cover that position. A column containing only one element indicates the corresponding SNP site is homozygous, otherwise it is heterozygous
Fig. 2.
Fig. 2.
Three corresponding representations, (a) a set of aligned reads r1.r4, with called variants s1.s3, (b) the pairwise SNV co-occurrence matrix H where each possible pair of symbols (00, 01, 10, 11) has a matrix storing counts of occurrences of that ordered symbol pair between two positions across the aligned reads, (c) a simple graph that can be constructed by considering the evidence provided by adjacent variants. H is represented by an upper triangular matrix, as it is unnecessary to store the same observations with reversed positions in the lower diagonal nor observations of transitions to self along the main diagonal. Note for simplicity this example uses an alphabet of only two symbols, but in practice we consider an alphabet Σ={A,C,G,T,N,}
Fig. 3.
Fig. 3.
Considering only adjacent SNVs, one may create paths for which there was no actual observed evidence. Here, the reads {0011, 0001, 0100} do not support either of the results {0000, 0101}, but both are valid paths through a graph that permits edges between pairs of adjacent SNVs

References

    1. Aguiar D., Istrail S. (2013) Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics, 29, i352–360. - PMC - PubMed
    1. Churchill G.A., Waterman M.S. (1992) The accuracy of DNA sequences: estimating sequence quality. Genomics, 14, 89–98. - PubMed
    1. Cilibrasi R. et al. (2005) On the complexity of several haplotyping problems. In: Casadio,R. and Myers,G. (eds) Algorithms in Bioinformatics. Springer, Berlin, Heidelberg, pp. 128–139.
    1. Ebler J. et al. (2019) Haplotype-aware diplotyping from noisy long reads. Genome Biol., 20, 116. - PMC - PubMed
    1. Forster S.C. et al. (2019) A human gut bacterial genome and culture collection for improved metagenomic analyses. Nat. Biotechnol., 37, 186–192. - PMC - PubMed