PLoS One. 2021 Jan 6;16(1):e0244858.
doi: 10.1371/journal.pone.0244858. eCollection 2021.

Probabilistic models of biological enzymatic polymerization

Marshall Hampton et al. PLoS One. 2021.

Abstract

In this study, hierarchies of probabilistic models are evaluated for their ability to characterize the untemplated addition of adenine and uracil to the 3' ends of mitochondrial mRNAs of the human pathogen Trypanosoma brucei, and for their generative ability to reproduce populations of these untemplated adenine/uridine "tails". We determined the most suitable Hidden Markov Models (HMMs) for this biological system. While our HMMs were not able to generatively reproduce the length distribution of the tails, they fared better in reproducing the nucleotide composition of the tail populations. The HMMs robustly identified distinct states of nucleotide addition that correlate with experimentally verified differences in tail nucleotide composition. However, they also identified a surprising subclass of tails among the ND1 gene transcript populations that is unexpected given the current idea of sequential enzymatic action in untemplated tail addition in this system. Therefore, these models can be used not only to reflect biological states that we already know about, but also to identify hypotheses to be experimentally tested. Finally, our HMMs supplied a way to correct a portion of the sequencing errors present in our data. Importantly, because of their binary emissions, these models constitute rare, simple pedagogical examples of applied bioinformatic HMMs.
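To make the "binary emissions" point concrete, the following is a minimal sketch of the forward algorithm for an HMM whose states emit only A or U. It is not the authors' implementation: the function name, the explicit end-probability vector, and all numeric parameters are assumptions chosen for illustration, not fitted values from the study.

```python
from typing import Dict, List


def forward_likelihood(seq: str,
                       start: List[float],
                       trans: List[List[float]],
                       emit: List[Dict[str, float]],
                       end: List[float]) -> float:
    """Probability of the tail `seq` under an HMM with explicit end probabilities.

    start[i]    : probability of entering state i from the begin state
    trans[i][j] : probability of moving from state i to state j
    end[i]      : probability of stopping after state i
    emit[i][c]  : probability that state i emits nucleotide c ('A' or 'U')
    For each state i, sum(trans[i]) + end[i] is expected to equal 1.
    """
    if not seq:
        return 0.0  # assumption: this sketch does not model empty tails
    n = len(start)
    # alpha[i] = P(observed prefix so far, currently in state i)
    alpha = [start[i] * emit[i][seq[0]] for i in range(n)]
    for c in seq[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][c]
            for j in range(n)
        ]
    # Close the path by stopping after the last emitted nucleotide.
    return sum(alpha[i] * end[i] for i in range(n))


# Hypothetical two-state parameters (illustrative only, not from the paper).
START = [0.9, 0.1]
TRANS = [[0.7, 0.1], [0.05, 0.85]]
END = [0.2, 0.1]
EMIT = [{"A": 0.95, "U": 0.05}, {"A": 0.3, "U": 0.7}]

if __name__ == "__main__":
    print(forward_likelihood("AAAUUU", START, TRANS, EMIT, END))
```

Because each state has only two possible emissions, every quantity in the recursion can be checked by hand, which is what makes models of this kind convenient teaching examples.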

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Tail composition by length.
U/A composition of tails at each nucleotide position, indicated on the x axis (position 1 is the first non-encoded nucleotide attached to the mRNA’s 3’ end), for the population of tails of exactly the length specified on the y axis. Red and green indicate 100 percent U and 100 percent A, respectively.
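The tabulation behind a plot of this kind can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name is hypothetical, and tails are assumed to be strings containing only A and U, so the U fraction at each position is one minus the A fraction.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def a_fraction_by_position(tails: Iterable[str]) -> Dict[int, List[float]]:
    """Map tail length -> fraction of A at each position (index 0 is position 1).

    Tails are grouped by exact length, so each row describes only the
    population of tails of that length.
    """
    counts_a: Dict[int, List[int]] = {}
    totals: Dict[int, int] = defaultdict(int)
    for tail in tails:
        n = len(tail)
        if n == 0:
            continue  # empty tails have no positions to tabulate
        row = counts_a.setdefault(n, [0] * n)
        for i, c in enumerate(tail):
            if c == "A":
                row[i] += 1
        totals[n] += 1
    return {n: [a / totals[n] for a in row] for n, row in counts_a.items()}


if __name__ == "__main__":
    print(a_fraction_by_position(["AAU", "AUU", "AU"]))
```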
Fig 2. Homopolymer distribution in tails.
Distribution of lengths of A-homopolymers (left) and U-homopolymers (right) found within tail populations of the three transcripts that are expected to possess only the primarily homopolymer-containing in-tails. Y axis scales differ in the left and right graphs.
Fig 3. CO1B and CO1P A-homopolymer distributions.
The frequency distribution of lengths of A-homopolymers in CO1B and CO1P transcript tail populations. A1 indicates the first A-homopolymer encountered starting from the first nucleotide of non-templated addition, A2 the second, and so on up to the eighth (A8), in the populations of tails in which they occur.
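The homopolymer bookkeeping described in these captions amounts to run-length encoding a tail and keeping the A runs in order of occurrence. A minimal sketch, with hypothetical function names and assuming tails are plain strings read from the first non-templated nucleotide:

```python
from itertools import groupby
from typing import List, Tuple


def homopolymer_runs(tail: str) -> List[Tuple[str, int]]:
    """Run-length encode a tail, e.g. 'AAUUU' -> [('A', 2), ('U', 3)]."""
    return [(nt, sum(1 for _ in grp)) for nt, grp in groupby(tail)]


def a_run_lengths(tail: str) -> List[int]:
    """Lengths of successive A-homopolymers: index 0 is A1, index 1 is A2, ..."""
    return [length for nt, length in homopolymer_runs(tail) if nt == "A"]


if __name__ == "__main__":
    print(homopolymer_runs("AAAUUAUUUA"))
    print(a_run_lengths("AAAUUAUUUA"))
```

Collecting `a_run_lengths(tail)[k]` over all tails that have at least k + 1 A runs gives the per-rank length distribution for that rank.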
Fig 4. Initial A-homopolymer distributions.
Distribution of initial A-homopolymer lengths for tail populations of all analyzed transcripts (solid lines) and poly(A) tail-only sub-populations (dotted lines) for each transcript tail dataset.
Fig 5. Model B5.
5-state model (B5) used to determine tail addition in previous studies.
Fig 6. Model B1.
The 1-state model (B1) of nontemplated nucleotide addition on Trypanosoma brucei mitochondrial mRNAs.
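As a rough illustration of what a one-state model implies for tail length, the sketch below assumes a single emitting state with a self-loop probability `stay` and an end probability `1 - stay`, and tails of at least one nucleotide. The topology of B1 is not shown in this record, so this layout and the names `stay` and `b1_length_pmf` are assumptions. Under them, tail length follows a geometric distribution.

```python
def b1_length_pmf(n: int, stay: float) -> float:
    """P(tail length = n) for a single emitting state (assumed layout).

    After each emitted nucleotide the model either stays (probability `stay`)
    and emits another, or ends (probability 1 - stay). A tail of length n
    therefore needs n - 1 stays followed by one end.
    """
    if n < 1:
        return 0.0
    return stay ** (n - 1) * (1.0 - stay)


def b1_mean_length(stay: float) -> float:
    """Mean of the geometric length distribution, 1 / (1 - stay)."""
    return 1.0 / (1.0 - stay)


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, b1_length_pmf(n, stay=0.9))
    print("mean length:", b1_mean_length(0.9))
```

In this sketch the probability mass always decreases with length, so the most likely tail length is always 1.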
Fig 7. Model equivalence.
Equivalence of mixed hidden states with a distributive process of nontemplated nucleotide addition.
Fig 8. Model B2.
2-state model of nontemplated tail addition (B2), with five independent transition parameters a, b, c, d, and e. This model cannot be used to distinguish between in-tail and ex-tail addition.
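One layout that yields exactly five free transition parameters is a begin/end state B plus two emitting states, with each row of transition probabilities summing to one. The sketch below uses that layout. Which arrow each of the letters a through e labels in the figure is not stated here, so the mapping in the code, the state names `S1` and `S2`, and the emission values are all assumptions for illustration.

```python
import random
from typing import Dict


def b2_transitions(a: float, b: float, c: float,
                   d: float, e: float) -> Dict[str, Dict[str, float]]:
    """Hypothetical B2 layout: begin/end state 'B', emitting states 'S1', 'S2'.

    a    = P(B  -> S1)                  (B  -> S2 is 1 - a)
    b, c = P(S1 -> S1), P(S1 -> S2)     (S1 -> B  is 1 - b - c)
    d, e = P(S2 -> S1), P(S2 -> S2)     (S2 -> B  is 1 - d - e)
    """
    table = {
        "B": {"S1": a, "S2": 1.0 - a},
        "S1": {"S1": b, "S2": c, "B": 1.0 - b - c},
        "S2": {"S1": d, "S2": e, "B": 1.0 - d - e},
    }
    for state, row in table.items():
        # Small tolerance so rounding error in 1 - b - c is not rejected.
        if any(p < -1e-12 for p in row.values()):
            raise ValueError(f"negative probability in row {state}")
    return table


def sample_tail(table: Dict[str, Dict[str, float]],
                p_a: Dict[str, float],
                rng: random.Random) -> str:
    """Generate one tail; p_a[state] is that state's probability of emitting A."""
    row = table["B"]
    state = rng.choices(list(row), weights=list(row.values()))[0]
    tail = []
    while state != "B":
        tail.append("A" if rng.random() < p_a[state] else "U")
        row = table[state]
        state = rng.choices(list(row), weights=list(row.values()))[0]
    return "".join(tail)


if __name__ == "__main__":
    table = b2_transitions(0.8, 0.6, 0.2, 0.1, 0.7)  # illustrative values
    rng = random.Random(0)
    print([sample_tail(table, {"S1": 0.95, "S2": 0.2}, rng) for _ in range(5)])
```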
Fig 9. Model B3.
3-state model of nontemplated tail addition (B3), with state transition percentages from the model trained on combined CO1P, ND1B, and ND1P experimentally derived, Illumina-sequenced tail data.
Fig 10. Actual and model tail length distributions.
Observed and model output tail length distributions using B-series HMMs of all tested complexity levels.
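A comparison of observed and model-generated length distributions can be sketched as below. The normalized histogram is straightforward. The total variation distance is one possible summary of the mismatch and is an assumption here, not a metric taken from the study. Both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def length_distribution(tails: Iterable[str]) -> Dict[int, float]:
    """Normalized histogram of tail lengths: length -> fraction of tails."""
    counts = Counter(len(t) for t in tails)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: k / total for n, k in sorted(counts.items())}


def total_variation(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Half the summed absolute difference between two length distributions.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    lengths = set(p) | set(q)
    return 0.5 * sum(abs(p.get(n, 0.0) - q.get(n, 0.0)) for n in lengths)


if __name__ == "__main__":
    observed = length_distribution(["A", "AU", "UU", "AAA"])
    simulated = length_distribution(["A", "A", "AU", "AAAA"])
    print(observed, simulated, total_variation(observed, simulated))
```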
Fig 11. Model B3 A1-homopolymers.
Distribution of initial (A1) homopolymer lengths generated by model B3.
Fig 12. Error corrected tail lengths.
Histograms of tail lengths for the ex-tail-containing datasets CO1P, ND1P, and ND1B, including and excluding sequences in which erroneous G and C nucleotides were corrected.
Fig 13. Unstructured model topologies.
Post-training unstructured model topologies and emissions of U and A nontemplated tail addition to populations of 3’ mRNA ends of the mitochondrial genes indicated at the top. Models go from simplest in the top row (G1, 1 state) to most complex in the bottom row (G5, 5 states). The areas covered by the separate colors in each state circle are proportional to their emissions: uncolored circle labeled ‘B’ indicates the beginning/end state, red is single adenine addition, and blue is single uracil addition. The thickness of the arrows connecting states is proportional to the transition probability.
Fig 14. Final model topologies.
Pre-training model topologies and emissions for the best unstructured models for each tail dataset. The areas covered by the separate colors in each state circle are proportional to their emissions: uncolored circle labeled ‘B’ indicates the beginning/end state, red is a single adenine addition, and blue is a single uracil addition.
Fig 15. Post-training final models.
Post-training model topologies and emissions for the best unstructured models for each tail dataset. The three models in the top row are in-tail only models, while the three bottom row models include ex-tails. The areas covered by the separate colors in each state circle are proportional to their emissions: uncolored circle labeled ‘B’ indicates the beginning/end state, red is a single adenine addition, and blue is a single uracil addition. Line thickness indicates transition probability, with thicker arrows indicating higher probability and thinner arrows indicating lower probability.
