Mol Syst Biol. 2019 Feb 22;15(2):e8290.
doi: 10.15252/msb.20188290.

Unraveling the hidden universe of small proteins in bacterial genomes

Samuel Miravet-Verde et al. Mol Syst Biol. 2019.

Abstract

Identification of small open reading frames (smORFs) encoding small proteins (≤ 100 amino acids; SEPs) is a challenge in the fields of genome annotation and protein discovery. Here, by combining a novel bioinformatics tool (RanSEPs) with "-omics" approaches, we were able to describe 109 bacterial small ORFomes. Predictions were first validated by performing an exhaustive search of SEPs present in the Mycoplasma pneumoniae proteome via mass spectrometry, which illustrated the limitations of shotgun approaches. Then, RanSEPs predictions were validated and compared with other tools using proteomic datasets from different bacterial species and SEPs from the literature. We found that up to 16 ± 9% of proteins in an organism could be classified as SEPs. Integration of RanSEPs predictions with transcriptomics data showed that some annotated non-coding RNAs could in fact encode SEPs. A functional study of SEPs highlighted an enrichment in the membrane, translation, metabolism, and nucleotide-binding categories. Additionally, 9.7% of the SEPs included a predicted N-terminal signal peptide. We envision RanSEPs as a tool to unmask the hidden universe of small bacterial proteins.

Keywords: mass spectroscopy; mycoplasmas; protein prediction; random forest classifier; small proteins.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

Figure 1. General workflow
First, we generated databases of all the putative ORFs encoded by the genomes of 109 different bacteria. The database of M. pneumoniae was used to perform the shotgun MS and RNA‐Seq studies that were aimed at evaluating the coverage and performance of experimental approaches in the discovery of SEPs. In a parallel, experiment‐independent manner, RanSEPs performed in silico predictions of potential novel proteins in the database. Results from both experimental and computational approaches are integrated in a validation step using a set of 570 SEPs characterized both in this work and in previous studies. Finally, RanSEPs predictions for the 109 bacterial genomes are combined to assess the functional diversity and importance of predicted SEPs. The second part of the figure highlights how RanSEPs functions. In step 0 (gray box), RanSEPs detects annotated standard proteins (purple) and SEPs (yellow). By BLASTP, non‐conserved standard and SEP proteins are detected (pink and light pink, respectively). In parallel, protein features are computed and filtered by Recursive Feature Elimination. These features are combined with general features of biological interest. In step 1 (yellow box), RanSEPs randomly subsets annotated standard and small proteins into a positive (green and yellow), a feature (blue and yellow), and a negative (pink and light pink) set from the bulk of non‐conserved sequences. During step 2 (blue box), specific features that vary with each iteration are appended. In step 3 (purple box), the labeled positive and negative sets are divided into training and test sets. Step 4 (green box) consists of collecting the classifiers and classification task results, and computing the final statistics and scores for all the sequences. Step 0 is run only once and is not part of the iteration process. Steps 1–3 are repeated for the number of iterations selected by the user. Step 4 is computed at the end to integrate the results of each iteration.
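The iteration scheme in this caption (random subsetting, a training/test split, and a final step that integrates scores across iterations) can be illustrated with a minimal Python sketch. This is not the RanSEPs implementation: the classifier here is a hypothetical nearest-centroid stand-in rather than a random forest, the feature set and the per-iteration features of step 2 are omitted, and every function name, parameter, default, and toy feature vector below is an assumption made for illustration.

```python
import random
from statistics import mean


def centroid(rows):
    """Column-wise mean of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_stand_in(pos, neg):
    """Hypothetical stand-in classifier (NOT a random forest).

    Returns a function that scores a vector in [0, 1]; values near 1 mean
    the vector lies closer to the positive centroid than to the negative one.
    """
    c_pos, c_neg = centroid(pos), centroid(neg)

    def score(v):
        d_pos, d_neg = sq_dist(v, c_pos), sq_dist(v, c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return score


def run_iterations(annotated, non_conserved, candidates,
                   n_iter=25, subset=0.5, test_frac=0.2, seed=0):
    """Repeat steps 1 and 3, then integrate the results (step 4)."""
    rng = random.Random(seed)
    collected = {name: [] for name in candidates}
    accuracies = []
    for _ in range(n_iter):
        # Step 1: random positive and negative subsets.
        pos = rng.sample(annotated, max(2, int(len(annotated) * subset)))
        neg = rng.sample(non_conserved, max(2, int(len(non_conserved) * subset)))
        # Step 3: hold out the first items of each shuffled subset as a test set.
        n_pos = max(1, int(len(pos) * test_frac))
        n_neg = max(1, int(len(neg) * test_frac))
        score = train_stand_in(pos[n_pos:], neg[n_neg:])
        hits = (sum(score(v) >= 0.5 for v in pos[:n_pos])
                + sum(score(v) < 0.5 for v in neg[:n_neg]))
        accuracies.append(hits / (n_pos + n_neg))
        for name, vec in candidates.items():
            collected[name].append(score(vec))
    # Step 4: average each candidate's score over all iterations.
    final = {name: mean(vals) for name, vals in collected.items()}
    return final, mean(accuracies)


# Toy two-feature vectors (invented values, not real protein features).
annotated = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.7, 1.0]]
non_conserved = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.0]]
candidates = {"orf_a": [0.85, 0.85], "orf_b": [0.1, 0.1]}

scores, test_accuracy = run_iterations(annotated, non_conserved, candidates)
print(scores, test_accuracy)
```

Averaging over many random subsets means that no single choice of training sequences determines a candidate's final score, which is the point of running steps 1–3 repeatedly before the integration step.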
Figure 2. Assessment of the detection coverage by “‐omics” approaches
  (A) Evaluation of expression by RNA‐Seq and number of peptides required to detect an annotated protein by MS in M. pneumoniae. The plot represents the relationship between expression levels (average expression from RNA‐Seq data) and number of possible unique tryptic peptides (UTPs) for two sets of studied proteins: detected (blue dots) and not detected (orange dots) by MS.

  (B) Evaluation of thresholds and artefactual signals in MS data. The histogram represents the total number of SEP proteins detected in 116 shotgun MS experiments with 1 UTP, 1 UTP and 1 NUTP, or ≥ 2 UTPs for three categories. Color code: annotated (blue bars), putative new (orange bars), and decoy set (gray bars).

  (C) Number of SEPs detected by increasing the number of experiments. Color code is the same as in panel (B). Each line represents the accumulated number of different SEPs detected (y‐axis) when combining 1–116 MS datasets (x‐axis) from M. pneumoniae. Each line has an associated error that is shaded and represents the standard deviation within combinations of datasets (e.g., x = 80 will present the average number of proteins detected taking every combination of datasets in groups of 80 samples).
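The accumulation curve described for Figure 2 (average number of distinct SEPs over every combination of datasets of a given size, with a standard deviation) can be sketched as follows. This is an illustrative example only: the function name, the toy dataset contents, and the use of the population standard deviation are assumptions, not details taken from the paper.

```python
from itertools import combinations
from statistics import mean, pstdev


def accumulation_curve(datasets):
    """For each group size k, summarize the number of distinct proteins
    detected over every k-sized combination of datasets.

    Returns a list of (k, mean_count, std_count) tuples.
    """
    curve = []
    for k in range(1, len(datasets) + 1):
        counts = [len(set().union(*combo))
                  for combo in combinations(datasets, k)]
        # Population standard deviation is an assumption of this sketch.
        curve.append((k, mean(counts), pstdev(counts)))
    return curve


# Toy example: three MS datasets, each a set of detected SEP identifiers.
toy_datasets = [{"sep1", "sep2"}, {"sep2", "sep3"}, {"sep1"}]
curve = accumulation_curve(toy_datasets)
for k, avg, sd in curve:
    print(k, round(avg, 3), round(sd, 3))
```

The number of combinations grows combinatorially with the number of datasets, so for a collection as large as 116 experiments a practical run would probably need to sample combinations at each group size rather than enumerate them all.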

Figure 3. RanSEPs predictions
  (A) Feature weight prediction in M. pneumoniae. Weights of the different features considered in the classification by RanSEPs. Bars indicate the global averaged variance that each feature explains by itself along with its associated standard deviation (black line) (25 iterations to estimate the error).

  (B) Comparison of method accuracy. Receiver operating characteristic curve for RanSEPs (orange) and five additional tools (blue gradient). The closer a curve to the left‐hand border, the more accurate the tool. The area under the curve (AUC) associated with each method is presented, with values closer to 1 indicating a more accurate method. The dashed gray line represents a classifier that assigns the coding class randomly.

  (C) Boxplot representing the relationship between RanSEPs‐positive (“RanSEPs+”, score ≥ 0.5) and RanSEPs‐negative (“RanSEPs−”, score < 0.5) SEPs predictions and associated RCV (ribosome profiling ratio coverage, in log2) in Escherichia coli. Only annotations ≤ 300 nucleotides in length were included. As positive and negative controls, we considered annotated SEPs (“Annotated”) and non‐coding RNAs (“ncRNAs”), respectively. Annotations within RanSEPs+, RanSEPs−, and ncRNAs overlapping with known annotated genes were excluded. Annotations with RCV = 0.0 are filtered out, and the number within the box represents the percentage of values in that class that are kept in the comparison. Along the top, P‐values computed by Mann–Whitney rank test are indicated.
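The receiver operating characteristic comparison in Figure 3 rests on two standard computations: sweeping a score threshold to obtain (false-positive rate, true-positive rate) points, and integrating those points to get the AUC. A minimal sketch using only the Python standard library is given below; the labels and scores are invented toy values, and the function names are assumptions of this example rather than part of any tool mentioned in the caption.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, lowering the threshold from the highest
    score to the lowest. Labels are 1 (coding) or 0 (non-coding)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every item tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: three coding (1) and three non-coding (0) sequences.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(toy_labels, toy_scores))
print(round(toy_auc, 4))
```

An AUC of 1 corresponds to a curve that hugs the left-hand and top borders, while a classifier that assigns classes at random follows the diagonal and yields an AUC of about 0.5.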

Figure 4. A comparison of the feature weights used for the prediction of SEPs in 109 bacterial genomes
Clustered heat map using nearest point algorithm and representing the weights of different features in 109 bacterial genomes, and the clustering relations between features (top dendrogram) and species (side dendrogram). Rightmost light‐orange and light‐blue bars are included to differentiate the two main clusters. Numbers in the right vertical axis are short references representing the names of the bacterial genomes (Dataset EV15). The right three columns represent biological features not used in the classification. The ratio of the percentage of SEPs compared to the median value is colored as blue and orange for ≤ 13.16% of SEPs and > 13.16%, respectively. Blue and orange colors in the %GC column represent genomes with ≤ 38 and > 38% GC content (median value = 38), respectively. Genome size column separates species into small‐genome bacteria (≤ 1.5 Mb, blue) and large‐genome bacteria (> 1.5 Mb, orange).
Figure 5. Functional assessment of RanSEPs results
  (A) Landscape of the SEPs with functional annotations in NCBI considering 109 bacterial genomes (N = 25,229 SEPs).

  (B) Functional inference of the predicted SEPs (N = 11,238) as determined using BLASTP against NCBI‐annotated SEPs having an associated function (N = 5,175). The color code associated with each category is the same as in panel (A).
