Front Microbiol. 2016 Mar 4:7:269.
doi: 10.3389/fmicb.2016.00269. eCollection 2016.

GenSeed-HMM: A Tool for Progressive Assembly Using Profile HMMs as Seeds and its Application in Alpavirinae Viral Discovery from Metagenomic Data


João M P Alves et al. Front Microbiol. 2016.

Abstract

This work reports the development of GenSeed-HMM, a program that implements seed-driven progressive assembly, an approach to reconstruct specific sequences from unassembled data, starting from short nucleotide or protein seed sequences or profile Hidden Markov Models (HMMs). The program can use any one of a number of sequence assemblers. Assembly is performed in multiple steps and relatively few reads are used in each cycle; consequently, the program demands few computational resources. As a proof-of-concept and to demonstrate the power of HMM-driven progressive assemblies, GenSeed-HMM was applied to metagenomic datasets in the search for diverse ssDNA bacteriophages from the recently described Alpavirinae subfamily. Profile HMMs were built from Alpavirinae-specific regions of multiple sequence alignments (MSAs) of either viral protein 1 (VP1; major capsid protein) or VP4 (genome replication initiation protein). These profile HMMs were used by GenSeed-HMM (running the Newbler assembler) as seeds to reconstruct viral genomes from sequencing datasets of human fecal samples. All contigs obtained were annotated and taxonomically classified using similarity searches and phylogenetic analyses. The most specific profile HMM seed enabled the reconstruction of 45 partial or complete Alpavirinae genomic sequences. A comparison with conventional (global) assembly of the same original dataset, using Newbler in a standalone execution, revealed that GenSeed-HMM outperformed global genomic assembly in several of the metrics employed. This approach can detect organisms that were not used to build the profile HMM, which opens up the possibility of diagnosing novel viruses without prior specific information, that is, de novo diagnosis. Additional applications include, but are not limited to, the specific assembly of extrachromosomal elements such as plastid and mitochondrial genomes from metagenomic data. Profile HMM seeds can also be used to reconstruct specific protein-coding genes for gene diversity studies, and to determine all possible gene variants present in a metagenomic sample. Such surveys could be useful to detect the emergence of drug-resistance variants in sensitive environments such as hospitals and animal production facilities, where antibiotics are regularly used. Finally, GenSeed-HMM can be used as an adjunct for gap closure in assembly finishing projects, by using multiple contig ends as anchored seeds.

Keywords: Alpavirinae; de novo diagnosis; metagenomic analysis; sequence assembly; viral discovery.

Figures

Figure 1
Workflow of the seed-driven progressive assembly process. GenSeed-HMM automatically identifies the type of starting seed (A). The sequencing read database is indexed and, if needed, translated (B). DNA, protein, or profile HMM seeds are then used to select reads from the database using blastn, tblastn, or hmmsearch, respectively. The list of positive reads is introduced into the progressive assembly cycle (C). The reads retrieved from the database are assembled, and the contig ends are extracted and used as new seeds in an iterative process. The progressive assembly contains several checkpoints and is completed when a set of finishing criteria is fulfilled. In the final procedure (D), all contigs are checked for the presence of the starting seed and the final files are stored.
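The iterative cycle described in this caption can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not GenSeed-HMM's implementation: reads are plain strings, shared k-mers stand in for blastn/tblastn/hmmsearch read selection, a greedy overlap merge stands in for a real assembler such as Newbler, and the only finishing criterion is "the contig stopped growing". All function names, parameters, and the example data are hypothetical.

```python
"""Toy sketch of seed-driven progressive assembly (not GenSeed-HMM itself)."""


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_reads(seeds, reads, k):
    """Stand-in for the similarity search: keep reads sharing a k-mer with any seed."""
    seed_kmers = set()
    for seed in seeds:
        seed_kmers |= kmers(seed, k)
    return [read for read in reads if kmers(read, k) & seed_kmers]


def extend(contig, read, min_overlap):
    """Stand-in for the assembler: merge read onto either contig end if they overlap."""
    if read in contig:
        return contig
    for n in range(min(len(contig), len(read)), min_overlap - 1, -1):
        if contig.endswith(read[:n]):      # read extends the right end
            return contig + read[n:]
        if read.endswith(contig[:n]):      # read extends the left end
            return read + contig[n:]
    return contig


def progressive_assembly(seed, reads, k=4, min_overlap=4, end_len=6, max_cycles=20):
    """Grow a contig from a seed, re-seeding each cycle with the contig ends."""
    contig, seeds = seed, [seed]
    for _ in range(max_cycles):
        grown = contig
        for read in select_reads(seeds, reads, k):
            grown = extend(grown, read, min_overlap)
        if grown == contig:                # toy finishing criterion: no growth
            break
        contig = grown
        seeds = [contig[:end_len], contig[-end_len:]]  # contig ends become new seeds
    return contig


# Hypothetical example data: a 30-nt "genome" with no repeated 4-mers,
# tiled into 10-nt reads, plus one unrelated read.
GENOME = "ACGTTGCATGGCCTAAGTCGATCCAGGTTA"
READS = [GENOME[i:i + 10] for i in range(0, 21, 4)] + ["TTTTTTTTTT"]
SEED = GENOME[12:20]

if __name__ == "__main__":
    print(progressive_assembly(SEED, READS) == GENOME)
```

Because only reads matching the current seeds are handed to the merge step in each cycle, the unrelated read is never touched, which mirrors why the approach needs few resources per cycle.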
Figure 2
Comparison of progressive assembly using different HMM seeds. Contig profiles obtained by progressive assembly with GenSeed-HMM using a 454 dataset from fecal samples from human patients (Reyes et al., 2010) and profile HMM seeds derived from Alpavirinae major capsid protein VP1 (A) and replication initiation protein VP4 (B). Contigs are ranked in decreasing order of size. Each marker represents a distinct contig. Profile HMMs used as seeds are depicted.
Figure 3
Consistency among HMM seeds. Venn diagram representing shared contigs reconstructed by progressive assembly using GenSeed-HMM with profile HMM seeds VP1R1, VP1R4, VP4R1, and VP4R3. Contigs were included in the same cluster when they showed at least 90% similarity at the nucleotide level over at least 90% of the length of the shortest contig. Contigs were then taxonomically classified by blastx against reference proteins from Microviridae and searched for the presence of the VP1R4 seed using hmmsearch. A large percentage of contigs is shared among all four seeds; these belong to Alpavirinae genomes and cover the VP1R4 seed. Note that contigs absent from the VP1R4 set were usually either not assigned to Alpavirinae (low precision) or did not contain the VP1R4 region, suggesting shorter, non-overlapping contigs.
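The clustering rule in this caption can be written as a small predicate. The sketch below is illustrative only: it assumes the percent identity and alignment length of a pairwise alignment are already available (for example, the `pident` and `length` columns of BLAST tabular output), and the function and parameter names are hypothetical rather than taken from the paper.

```python
def same_cluster(len_a, len_b, aln_len, pct_identity,
                 min_identity=90.0, min_coverage=0.90):
    """Return True if two contigs meet the similarity and coverage thresholds.

    len_a, len_b  -- contig lengths in nucleotides
    aln_len       -- length of the aligned region
    pct_identity  -- percent identity over the aligned region
    """
    shortest = min(len_a, len_b)
    coverage = aln_len / shortest          # fraction of the shortest contig aligned
    return pct_identity >= min_identity and coverage >= min_coverage


if __name__ == "__main__":
    print(same_cluster(1000, 5000, 950, 92.0))   # 95% coverage, 92% identity
    print(same_cluster(1000, 5000, 800, 99.0))   # only 80% coverage
```

Keeping both thresholds as parameters lets the same check express the stricter 97% identity criterion mentioned in the Figure 6 caption.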
Figure 4
Phylogenetic analysis. Maximum likelihood phylogenetic analysis of (A) full-length VP1 protein and (B) a shorter region comprising only the VP1R4 HMM region. Sequences were translated from the contigs reconstructed by GenSeed-HMM using the VP1R4 seed. Different subfamilies of the Microviridae family are depicted in distinct colors; references were obtained from Roux et al. (2012). Branches represented by sequences derived in this work are labeled in black. Asterisks at the nodes indicate bootstrap values higher than 70%. Numbers represent contig numbers as shown in Figure 5.
Figure 5
Read abundance of contigs. Heatmap diagram representing read abundance in contigs reconstructed by progressive assembly with GenSeed-HMM and the VP1R4 seed. Fecal biospecimens were collected from different families (F1–F4) composed of monozygotic twins (T1 and T2) and their respective mothers (M). Time points of sample collection and technical replicates (R) are depicted. Data source: 454 dataset from fecal samples of human patients (Reyes et al., 2010).
Figure 6
Comparison between global and progressive assembly. Comparison of cumulative contig lengths using progressive assembly with GenSeed-HMM and the VP1R4 HMM seed and global assembly with Newbler. Data sources: (A,C,D) 454 dataset from fecal samples of human patients (Reyes et al., 2012); (B) Illumina dataset from a sewage treatment plant in the municipality of Taboão da Serra, São Paulo, Brazil (unpublished data). Contigs from progressive assembly and VP1R4-positive contigs from global assembly were clustered at 97% identity over at least 90% of the shortest contig; each cluster consisted of at most one contig from each dataset. A total of 53 clusters were generated, nine unique to the progressive assembly and eight unique to the global assembly. Panels C and D compare the lengths (C) and coverage (D) of related contigs obtained by progressive and global assemblies, ranked by size.
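A cumulative contig length curve of the kind compared in this figure can be computed as follows. This is a minimal sketch that assumes contigs are ranked from longest to shortest before summing, as in the size-ranked panels; the function name and the example lengths are hypothetical.

```python
from itertools import accumulate


def cumulative_lengths(contig_lengths):
    """Rank contigs by decreasing length and return the running total of lengths."""
    ranked = sorted(contig_lengths, reverse=True)
    return list(accumulate(ranked))


if __name__ == "__main__":
    # Hypothetical contig lengths from two assemblies of the same data.
    print(cumulative_lengths([300, 1000, 500]))
    print(cumulative_lengths([900, 400]))
```

Plotting the two resulting lists against contig rank gives one curve per assembly, so the assembly that recovers more sequence with fewer contigs rises faster.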
