Front Genet. 2015 Jul 21:6:234.
doi: 10.3389/fgene.2015.00234. eCollection 2015.

Remote homology and the functions of metagenomic dark matter

Briallen Lobb et al. Front Genet.

Abstract

Predicted open reading frames (ORFs) that lack detectable homology to known proteins are termed ORFans. Despite their prevalence in metagenomes, the extent to which ORFans encode real proteins, the degree to which they can be annotated, and their functional contributions, remain unclear. To gain insights into these questions, we applied sensitive remote-homology detection methods to functionally analyze ORFans from soil, marine, and human gut metagenome collections. ORFans were identified, clustered into sequence families, and annotated through profile-profile comparison to proteins of known structure. We found that a considerable number of metagenomic ORFans (73,896 of 484,121, 15.3%) exhibit significant remote homology to structurally characterized proteins, providing a means for ORFan functional profiling. The extent of detected remote homology far exceeds that obtained for artificial protein families (1.4%). As expected for real genes, the predicted functions of ORFans are significantly similar to the functions of their gene neighbors (p < 0.001). Compared to the functional profiles predicted through standard homology searches, ORFans show biologically intriguing differences. Many ORFan-enriched functions are virus-related and tend to reflect biological processes associated with extreme sequence diversity. Each environment also possesses a large number of unique ORFan families and functions, including some known to play important community roles such as gut microbial polysaccharide digestion. Lastly, ORFans are a valuable resource for finding novel enzymes of interest, as we demonstrate through the identification of hundreds of novel ORFan metalloproteases that all possess a signature catalytic motif despite a general lack of similarity to known proteins. Our ORFan functional predictions are a valuable resource for discovering novel protein families and exploring the boundaries of protein sequence space. 
All remote homology predictions are available at http://doxey.uwaterloo.ca/ORFans.

Keywords: ORFan; comparative metagenomics; functional annotation; metagenome; metaproteome; orphan; profile-profile comparison; remote homology.

Figures

Figure 1
Pipeline for detection and functional annotation of metagenomic ORFan proteins. Protein-coding sequences (CDSs) were predicted from assembled metagenomic contigs, and searched against conserved domain databases. CDSs that could not be annotated by domain homology were further clustered, and representatives were BLASTed against the NCBI nr database. Remaining CDS clusters lacking detected homologs were considered ORFans, and these were subjected to remote homology detection using HHblits and HHsearch, which were used to perform profile-profile searches against the Protein Data Bank.
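The filtering cascade in this caption can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `find_orfan_clusters` and the three callbacks (`has_domain_hit`, `cluster_of`, `has_nr_hit`) are hypothetical stand-ins for the conserved-domain search, the clustering step, and the BLAST search against NCBI nr, and taking the first member of each cluster as its representative is an assumption.

```python
from typing import Callable, Dict, Iterable, List


def find_orfan_clusters(
    cds_ids: Iterable[str],
    has_domain_hit: Callable[[str], bool],   # hypothetical: conserved-domain search result
    cluster_of: Callable[[str], str],        # hypothetical: cluster ID assigned to a CDS
    has_nr_hit: Callable[[str], bool],       # hypothetical: BLAST-vs-nr result for a representative
) -> Dict[str, List[str]]:
    """Return clusters (cluster ID -> member CDSs) that survive every homology filter."""
    # Step 1: drop CDSs that were annotated by domain homology.
    unannotated = [c for c in cds_ids if not has_domain_hit(c)]

    # Step 2: group the remaining CDSs into sequence clusters.
    clusters: Dict[str, List[str]] = {}
    for cds in unannotated:
        clusters.setdefault(cluster_of(cds), []).append(cds)

    # Step 3: keep clusters whose representative has no detectable homolog in nr.
    # Assumption: the first member of each cluster serves as its representative.
    return {cid: members for cid, members in clusters.items()
            if not has_nr_hit(members[0])}


# Toy inputs (made up for illustration only).
toy_domain_hits = {"a"}
toy_clusters = {"b": "k1", "c": "k1", "d": "k2", "e": "k3"}
toy_nr_hits = {"d"}

toy_orfans = find_orfan_clusters(
    ["a", "b", "c", "d", "e"],
    has_domain_hit=lambda c: c in toy_domain_hits,
    cluster_of=lambda c: toy_clusters[c],
    has_nr_hit=lambda c: c in toy_nr_hits,
)
print(toy_orfans)
```

In the pipeline described in the caption, the surviving clusters would then go on to profile-profile search with HHblits and HHsearch; that step is not modeled here.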
Figure 2
Estimated false discovery rate of ORFan remote homology detection and functional prediction. (A) Distributions of HHsearch probability scores for ORFans from three metagenomes, and shuffled sequences, searched against a PDB-derived HMM library. There is an abundance of high-scoring predictions (i.e., above 80% probability) for ORFan proteins compared to the expected (null) distribution. This separation becomes even greater when an HHsearch E-value threshold of 1 is applied (see inset). (B) The number of shared GO terms between functionally annotated ORFans (probability scores >80%) and their metagenomic neighbors (see Materials and Methods) is shown for three metagenomes. The null distributions, as estimated by randomly shuffling ORFan identities/positions, are shown along with the z-scores relative to these distributions. The mean values for the random distributions are: GOS (486.3), GPC (57.8), and HG (494.4).
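The neighbor analysis in panel (B) compares an observed count of shared GO terms with a null distribution built by shuffling. A minimal sketch of that kind of permutation z-score follows. The exact shuffling scheme is given in the paper's Materials and Methods and is not reproduced here, so this version assumes one ORFan paired with one neighbor annotation set and simply permutes the ORFan annotations across positions. The function names and the placeholder term labels are hypothetical.

```python
import random
import statistics
from typing import List, Set, Tuple


def shared_go_count(orfan_go: List[Set[str]], neighbor_go: List[Set[str]]) -> int:
    """Total number of GO terms shared between each ORFan and its paired neighbor."""
    return sum(len(o & n) for o, n in zip(orfan_go, neighbor_go))


def permutation_z_score(
    orfan_go: List[Set[str]],
    neighbor_go: List[Set[str]],
    n_permutations: int = 1000,
    seed: int = 0,
) -> Tuple[int, float, float]:
    """Return (observed count, null mean, z-score) under shuffled ORFan positions."""
    rng = random.Random(seed)
    observed = shared_go_count(orfan_go, neighbor_go)

    shuffled = list(orfan_go)
    null_counts = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the ORFan/neighbor pairing
        null_counts.append(shared_go_count(shuffled, neighbor_go))

    null_mean = statistics.mean(null_counts)
    null_sd = statistics.stdev(null_counts)
    z = (observed - null_mean) / null_sd if null_sd > 0 else float("nan")
    return observed, null_mean, z


# Toy data with placeholder term labels (not real GO identifiers).
toy_orfan_go = [{"term_a"}, {"term_b"}, {"term_c"}, {"term_d"}]
toy_neighbor_go = [{"term_a"}, {"term_b"}, {"term_c"}, {"term_d"}]

toy_observed, toy_null_mean, toy_z = permutation_z_score(toy_orfan_go, toy_neighbor_go)
print(toy_observed, toy_null_mean, toy_z)
```

A large positive z-score indicates that ORFans share more GO terms with their real neighbors than with randomly assigned ones, which is the pattern the caption reports for all three metagenomes.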
Figure 3
Metagenome-specific ORFan families and functions. Shown are projections of three-dimensional scatterplots in which each axis indicates the proportion of ORFans from a specific metagenome with a specific annotation (left panel: families; right panel: functions). ORFan families are defined by their top remote homology match in the PDB, and functions are defined by GO terms as described in the Methods. Data points that project uniquely along one axis therefore indicate metagenome-specific ORFan families or functions, while those close to the origin indicate similar proportions among all three metagenomes. Cases described in the text have been labeled.
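The coordinates plotted in this figure are per-metagenome proportions, which can be sketched as follows. This is an illustrative reading of the caption under stated assumptions: the counts are invented, the family labels are placeholders, and "metagenome-specific" is simplified here to mean a nonzero proportion in exactly one metagenome.

```python
from typing import Dict


def annotation_proportions(
    counts: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, float]]:
    """Map annotation -> {metagenome: proportion of that metagenome's ORFans}."""
    props: Dict[str, Dict[str, float]] = {}
    for metagenome, by_annotation in counts.items():
        total = sum(by_annotation.values())
        if total == 0:
            continue  # nothing annotated in this metagenome
        for annotation, n in by_annotation.items():
            props.setdefault(annotation, {})[metagenome] = n / total
    # Annotations absent from a metagenome get an explicit zero on that axis.
    for by_metagenome in props.values():
        for metagenome in counts:
            by_metagenome.setdefault(metagenome, 0.0)
    return props


def metagenome_specific(props: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Annotations with a nonzero proportion in exactly one metagenome."""
    specific = {}
    for annotation, by_metagenome in props.items():
        present = [m for m, p in by_metagenome.items() if p > 0]
        if len(present) == 1:
            specific[annotation] = present[0]
    return specific


# Invented counts; GOS, GPC, and HG are the metagenome labels used in Figure 2.
toy_counts = {
    "GOS": {"fam_x": 8, "fam_y": 2},
    "GPC": {"fam_y": 5, "fam_z": 5},
    "HG": {"fam_y": 10},
}
toy_props = annotation_proportions(toy_counts)
toy_specific = metagenome_specific(toy_props)
print(toy_specific)
```

Points such as `fam_x` lie along a single axis of the scatterplot, while an annotation with similar proportions everywhere would sit near the diagonal or, if rare, near the origin.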
Figure 4
One example of 257 predicted metalloprotease ORFan sequence clusters. The example shown is a predicted metalloprotease ORFan from the HG metagenome with similarity to the protease domain of the anthrax toxin. The zinc-metalloprotease catalytic motif (HExxH) is conserved between the query and template; the remaining sequence similarity, however, is weak. In general, ORFan metalloproteases were predicted based on detected remote homology to protein structures of known or putative proteases and peptidases, as well as the presence of the HExxH motif.
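The motif criterion in this caption is simple to check computationally. The sketch below scans a protein sequence for HExxH, where x is any residue. It covers only the motif test, not the remote homology evidence that the caption says was also required, and the function name and example sequences are made up for illustration.

```python
import re
from typing import List, Tuple

# HExxH: His, Glu, any two residues, His. The lookahead also reports overlapping hits.
HEXXH = re.compile(r"(?=(HE..H))")


def find_hexxh(sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start position, matched residues) for every HExxH occurrence."""
    return [(m.start(), m.group(1)) for m in HEXXH.finditer(sequence.upper())]


# Invented sequences, not taken from the paper.
toy_with_motif = "MKTAHELGHALGLWQ"
toy_without_motif = "MKTAAELGAALG"

print(find_hexxh(toy_with_motif))
print(find_hexxh(toy_without_motif))
```

Because a five-residue pattern occurs by chance in many unrelated proteins, a hit on its own is weak evidence, which is consistent with the caption pairing the motif with structural remote homology.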
