Int J Mol Sci. 2022 Feb 3;23(3):1735.
doi: 10.3390/ijms23031735.

Genome-Wide Prediction of Transcription Start Sites in Conifers

Eugeniya I Bondar et al. Int J Mol Sci. 2022.

Abstract

The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.
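One nucleotide-composition feature of predicted promoters is how often a TATA-box-like motif occurs at each offset from the predicted TSS. The sketch below is a minimal illustration of that kind of positional count, not the authors' pipeline. It uses the consensus TATA(A/T)A(A/T) named in the Figure 1 caption; the function names, the input format (equal-length, TSS-centered sequences), and the `tss_index` parameter are assumptions made for the example.

```python
import re
from collections import Counter

# TATA(A/T)A(A/T) as a regular expression. The zero-width lookahead
# lets overlapping occurrences be counted separately.
TATA_PATTERN = re.compile(r"(?=TATA[AT]A[AT])")


def motif_position_counts(promoters, tss_index):
    """Count motif start positions relative to the TSS.

    promoters: iterable of TSS-centered sequences (hypothetical input)
    tss_index: 0-based index of the TSS within each sequence
    """
    counts = Counter()
    for seq in promoters:
        for match in TATA_PATTERN.finditer(seq.upper()):
            counts[match.start() - tss_index] += 1
    return counts


def motif_frequency(promoters, tss_index):
    """Fraction of promoters with a motif starting at each relative position."""
    promoters = list(promoters)
    counts = motif_position_counts(promoters, tss_index)
    return {pos: n / len(promoters) for pos, n in sorted(counts.items())}
```

Called on a list of promoter windows, `motif_frequency` returns a mapping from relative position to frequency that can be plotted directly as a positional profile.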

Keywords: TATA-box; conifer; gymnosperms; promoter prediction; transcription factor binding site; transcription start site.

Conflict of interest statement

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Figures

Figure 1
Frequency of the TATA(A/T)A(A/T) motif in the TSS-centered promoter region.
Figure 2
Distribution of DNA free energy around TSS position predicted by TSSPlant.
Figure 3
Positional distribution of transcription factor binding sites (TFBS) in Larix sibirica, Picea abies, Picea glauca, and Pinus taeda based on PWM scanning using TRANSFAC. (a) AP2/EREBP-related factors; (b) Homeodomain; (c) Heat shock transcription factors; (d) Myb transcription factors.
Figure 4
Orthologous genes of FLORICAULA/LEAFY-like proteins in L. sibirica, P. taeda, P. abies, and P. glauca with corresponding predicted TSS positions (depicted by the vertically-oriented labels) in their upstream regions are aligned using the genome browser Persephone. Red, yellow, green, and blue boxes represent exons. Light blue ribbon-like connectors indicate identical areas, blue lines mark nucleotide substitutions, and red lines indicate indels. The visualization is available at https://web.persephonesoft.com/?bookmark=43C6DEFD15C23F5F40A8AFF25F844042 (accessed on 31 January 2022).
Figure 5
Orthologous genes of WLIM2a in L. sibirica, P. abies, and P. glauca with corresponding predicted TSS positions (depicted by the vertically-oriented labels) in their upstream regions. Red, green, and blue boxes represent exons. Light blue ribbon-like connectors indicate identical areas, blue lines mark nucleotide substitutions, and red lines indicate indels. The visualization is available at https://web.persephonesoft.com/?bookmark=4239E3155493E8E21C61A9932BD502EE (accessed on 31 January 2022).
Figure 6
Some GC statistics for four conifer species, Larix sibirica, Picea abies, Picea glauca, Pinus taeda, and two model plant species, Arabidopsis thaliana and Oryza sativa: (a) GC3 gradient of coding sequences, (b) GC3 gradient slope, (c) GC3 distribution across all CDSs, (d) CG-skew around TSSs.
Figure 7
The difference in coding sequence length between GC3-poor and GC3-rich genes; 10% and 90% quantiles were used to divide genes into GC3-poor and GC3-rich classes (blue and red, respectively).
Figure 8
Distribution of the exon number per gene in GC3-poor and GC3-rich genes in L. sibirica, P. abies, P. glauca, and P. taeda. The number of genes in the GC3-poor and GC3-rich categories was the same within each organism.
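
The GC3 statistic used in Figures 6 to 8 is the fraction of G or C nucleotides at third codon positions of a coding sequence, and the Figure 7 caption states that the 10% and 90% quantiles separate GC3-poor from GC3-rich genes. The following is a minimal sketch of that computation under stated assumptions: the input is a hypothetical mapping from gene name to in-frame CDS string, the function names are invented for the example, and the quantile interpolation method (`inclusive`) and the handling of ties at the thresholds are choices made here, not details confirmed by the source.

```python
from statistics import quantiles


def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    third = cds[2:usable:3]               # every third base, starting at index 2
    if not third:
        raise ValueError("CDS shorter than one codon")
    return sum(base in "GC" for base in third) / len(third)


def split_by_gc3(cds_by_gene):
    """Split genes into GC3-poor and GC3-rich classes at the 10% and 90% quantiles.

    cds_by_gene: dict mapping gene name -> CDS string (hypothetical input)
    Returns (poor, rich) as two sets of gene names.
    """
    scores = {gene: gc3(seq) for gene, seq in cds_by_gene.items()}
    # n=10 yields nine cut points; the first and last are the 10% and 90% quantiles.
    deciles = quantiles(scores.values(), n=10, method="inclusive")
    low, high = deciles[0], deciles[-1]
    poor = {gene for gene, score in scores.items() if score <= low}
    rich = {gene for gene, score in scores.items() if score >= high}
    return poor, rich
```

With the two classes in hand, per-class summaries such as CDS length or exon count can be compared gene set against gene set.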
