PLoS One. 2021 Jan 12;16(1):e0244202.
doi: 10.1371/journal.pone.0244202. eCollection 2021.

Revising transcriptome assemblies with phylogenetic information

August Guang et al. PLoS One.

Abstract

A common transcriptome assembly error is to mistake different transcripts of the same gene for transcripts from multiple closely related genes. This error is difficult to identify during assembly, but in a phylogenetic analysis it can be diagnosed from gene phylogenies, where such errors appear as clades of tips from the same species with improbably short branch lengths. treeinform is a method that uses phylogenetic information across species to refine transcriptome assemblies within species. It identifies transcripts of the same gene that were incorrectly assigned to multiple genes and reassigns them as transcripts of the same gene. The treeinform method is implemented in Agalma, available at https://bitbucket.org/caseywdunn/agalma, and the general approach is relevant in a variety of other contexts.
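The idea in the abstract can be illustrated with a minimal sketch. This is not the Agalma implementation: the `Node` class, the function names, the definition of subtree length as the sum of branch lengths below a node, and the threshold value are all assumptions made for illustration. The sketch finds maximal clades whose tips all come from one species and whose subtree length falls below a threshold, then maps the genes in each such clade to a single gene.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical minimal node of a rooted gene phylogeny."""
    branch_length: float = 0.0
    species: Optional[str] = None     # set on tips only
    transcript: Optional[str] = None  # set on tips only
    gene: Optional[str] = None        # gene assigned by the assembler
    children: List["Node"] = field(default_factory=list)


def tips(node: Node) -> List[Node]:
    """Collect all tip descendants of a node."""
    if not node.children:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(tips(child))
    return found


def subtree_length(node: Node) -> float:
    """Sum of branch lengths below a node (assumed definition)."""
    return sum(c.branch_length + subtree_length(c) for c in node.children)


def candidate_clades(node: Node, threshold: float) -> List[List[Node]]:
    """Maximal same-species clades with subtree length below threshold."""
    if not node.children:
        return []
    clade = tips(node)
    if len({t.species for t in clade}) == 1 and subtree_length(node) < threshold:
        return [clade]
    found: List[List[Node]] = []
    for child in node.children:
        found.extend(candidate_clades(child, threshold))
    return found


def reassign(root: Node, threshold: float) -> Dict[str, str]:
    """Map each gene in a candidate clade to one representative gene."""
    mapping: Dict[str, str] = {}
    for clade in candidate_clades(root, threshold):
        target = min(t.gene for t in clade)  # arbitrary but deterministic
        for t in clade:
            mapping[t.gene] = target
    return mapping


# Toy tree: four near-identical Hydra tips plus one tip from another species.
hydra = Node(branch_length=0.3, children=[
    Node(1e-6, "Hydra", "t1", "g1"),
    Node(1e-6, "Hydra", "t2", "g2"),
    Node(branch_length=1e-6, children=[
        Node(1e-6, "Hydra", "t3", "g3"),
        Node(1e-6, "Hydra", "t4", "g4"),
    ]),
])
root = Node(children=[hydra, Node(0.4, "Nematostella", "t5", "g5")])

# The threshold here is illustrative, not the treeinform default.
mapping = reassign(root, threshold=1e-3)
print(mapping)
```

In this toy tree the four Hydra genes collapse to `g1`, while the gene from the other species is left untouched because it does not sit in a same-species clade.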

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. An example gene phylogeny from the test dataset before running treeinform.
Each tip is an exemplar transcript that was initially assigned to a different gene. The corresponding multiple sequence alignment is shown in front, with sites ordered from highest to lowest identity to the inferred ancestral site to make sequence diversity clear. Black indicates a difference from the ancestral sequence. The four Hydra transcripts in color were assigned to different genes by Trinity [2], even though two of the transcripts share exactly the same sequence and the other two transcripts differ by a small gap. After treeinform, all transcripts from these four genes are reassigned to a single gene.
Fig 2. Histogram of subtree lengths for internal nodes in each gene phylogeny from the test dataset containing tip descendants from the same species.
Subtree lengths greater than 1 were filtered out for clarity.
Fig 3. Cluster size counts for Trinity assembly and Corset clustering algorithm on Trinity contigs.
There are 3 Trinity clusters with size greater than 30, while there are 20 Corset clusters with size greater than 30.
Fig 4. Histogram of subtree lengths for internal nodes in each Siphonophora subset gene tree from Agalma with Corset clusterings containing tip descendants from the same species.
Subtree lengths greater than 1 were filtered out for clarity.
Fig 5. Histogram of the inferred duplication times with an overlaid mixture model.
Component 1 of the mixture model (red) captures the technical issues we address here, where transcripts from the same gene are assigned to different genes, and component 2 (blue) captures the true biological pattern, where transcripts from different genes are correctly assigned to different genes. We first ran phyldog [16] on the test dataset using the multiple sequence alignments and a given species phylogeny [20]. This provided gene phylogenies with internal nodes annotated as duplication or speciation events. We then used the annotations to time-calibrate the gene phylogenies for the mixture model.
Fig 6. Percentage of reassigned transcripts (log scale).
47,688 genes were included in the gene phylogenies, of which 23,396 (49.06%) were in gene families of 2 or more, and thus candidates for reassignment. The default threshold for treeinform is marked by the grey vertical dashed line.
Fig 7. Theoretical density compared with empirical densities before treeinform was run and under 3 different thresholds.
The distribution before treeinform has a large peak on the left that is removed by treeinform with all examined thresholds. The black line represents the theoretical density.
Fig 8. Histogram of subtree lengths for internal nodes in each Drosophila and Echinoidea subset gene tree from Agalma containing tip descendants from the same species.
Top is Drosophila and bottom is Echinoidea. Subtree lengths greater than 1 were filtered out for clarity.
Fig 9. Precision vs. recall for pairs of transcripts with regard to known CDS as the treeinform threshold increases.
Top: Plot for Drosophila with CDS. Precision does not increase with any threshold, only recall. Bottom: Plot for Strongylocentrotus purpuratus. The biggest improvement is made at a threshold value of 5e-06, with precision and recall both increasing up to a threshold of 0.05.
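One plausible way to compute pairwise precision and recall of the kind plotted in Fig 9 is sketched below. The exact definition used in the paper is not given in this record, so this is an assumption: a pair of transcripts counts as a predicted positive when the assembly places both in the same gene, and as a true positive when the reference CDS annotation also places both in the same gene. The function name and the toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def pair_precision_recall(
    predicted: Dict[str, str], reference: Dict[str, str]
) -> Tuple[float, float]:
    """Pairwise precision and recall of gene assignments.

    predicted and reference map transcript id -> gene id. Only
    transcripts present in both mappings are compared.
    """
    shared = sorted(set(predicted) & set(reference))
    tp = fp = fn = 0
    for a, b in combinations(shared, 2):
        same_pred = predicted[a] == predicted[b]
        same_ref = reference[a] == reference[b]
        if same_pred and same_ref:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_ref:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Toy data: the reference places t1, t2, t3 in gene A and t4 in gene B.
reference = {"t1": "A", "t2": "A", "t3": "A", "t4": "B"}
# Assignment after a hypothetical reassignment that merged only t1 and t2.
after = {"t1": "g1", "t2": "g1", "t3": "g3", "t4": "g4"}

precision, recall = pair_precision_recall(after, reference)
print(precision, recall)
```

Under this definition, merging more correctly related transcripts raises recall, while precision only drops if transcripts from different reference genes are merged.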

References

    1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63. doi: 10.1038/nrg2484
    2. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech. 2011;29(7):644–652. doi: 10.1038/nbt.1883
    3. Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28(8):1086–1092. doi: 10.1093/bioinformatics/bts094
    4. Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014. doi: 10.1093/bioinformatics/btu077
    5. Iñiguez LP, Hernández G. The evolutionary relationship between alternative splicing and gene duplication. Frontiers in Genetics. 2017;8(FEB):1–7. doi: 10.3389/fgene.2017.00014
