Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs

Bastien Chevreux¹, Thomas Pfisterer, Bernd Drescher, Albert J Driesel, Werner E G Müller, Thomas Wetter, Sándor Suhai

PMID: 15140833
PMCID: PMC419793
DOI: 10.1101/gr.1917404

Bastien Chevreux et al. Genome Res. 2004 Jun;14(6):1147-59.

doi: 10.1101/gr.1917404. Epub 2004 May 12.

Affiliation

¹ Department of Molecular Biophysics, German Cancer Research Centre Heidelberg, 69120 Heidelberg, Germany. bastien@chevreux.org

Abstract

We present an EST sequence assembler that specializes in reconstruction of pristine mRNA transcripts, while at the same time detecting and classifying single nucleotide polymorphisms (SNPs) occurring in different variations thereof. The assembler uses iterative multipass strategies centered on high-confidence regions within sequences and has a fallback strategy for using low-confidence regions when needed. It features special functions to assemble large numbers of highly similar sequences without prior masking, an automatic editor that edits and analyzes alignments by inspecting the underlying traces, and detection and classification of sequence properties like SNPs with a high specificity and a sensitivity down to one mutation per sequence. In addition, it includes options for using incorrectly preprocessed sequences, routines to make use of additional sequencing information such as base-error probabilities, template insert sizes, strain information, etc., and functions to detect and resolve possible misassemblies. The assembler is routinely used for tasks as varied as mutation detection in different cell types, similarity analysis of transcripts between organisms, and pristine assembly of sequences from various sources for oligo design in clinical microarray experiments.

Figures

**Figure 1**
Example of a misassembled transcript when SNPs are disregarded. The assembly of three input sequences is shown at *left*; the resulting transcripts of this assembly are shown at *right*. The three sequences s₁, s₁*, and s₂ contain different homologous parts, represented by the different shades of gray, and exactly one SNP position. A normal assembly algorithm will first assemble s₁ and then s₂ (because of the long overlapping alignment in the white part), and then might try to align s₁*, but fail because of the large mismatch. The SNP position with G in sequence s₁ and A in s₂ is treated as typical noise by the alignment algorithms and ignored. The resulting transcript sequences are therefore wrong, as they do not represent the sequences found in vivo: t₁ is a mix of two transcripts and does not code for a true protein.

**Figure 2**
The same example as in Figure 1, but in this example, the assembly algorithm honors SNP positions that were detected during earlier iterations of the assembly process. The alignment between s₁ and s₂ will, therefore, not be made, as there is a mismatch at the SNP position, even with the long overlap between both sequences. Instead, the assembler will align s₁ and s₁*, as they do not contain mismatches at SNP positions. The result is a correct representation of the transcriptome.
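
The rule described in Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not miraEST's implementation: the function name `overlap_allowed`, the column-offset convention, and the toy sequences are assumptions made only for illustration. The sketch vetoes an overlap whenever two reads disagree at a column already flagged as a SNP site.

```python
from typing import Set


def overlap_allowed(read_a: str, read_b: str, offset: int,
                    snp_columns: Set[int]) -> bool:
    """Return False if the reads disagree at any previously flagged SNP column.

    Assumed convention (hypothetical): read_a starts at contig column 0 and
    read_b starts at contig column `offset`. Mismatches outside flagged
    columns are left to the ordinary alignment scoring and are not checked.
    """
    for col in snp_columns:
        i = col            # index of the column within read_a
        j = col - offset   # index of the same column within read_b
        if 0 <= i < len(read_a) and 0 <= j < len(read_b):
            if read_a[i] != read_b[j]:
                return False
    return True


# Toy data: column 4 was flagged as a SNP site in an earlier pass.
s1 = "ACGTGCA"
s2 = "ACGTACA"      # carries A where s1 carries G at column 4
s1_star = "TGCATT"  # overlaps the end of s1 and agrees at column 4

print(overlap_allowed(s1, s2, 0, {4}))
print(overlap_allowed(s1, s1_star, 3, {4}))
```

With no flagged columns the same check accepts the s1/s2 overlap, which corresponds to the misassembly shown in Figure 1.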

**Figure 3**
The multipass and iterative nature of the assembler becomes clear in this schematic diagram of the phases of a miraEST assembly. Previously unknown information (like possible SNP sites) can be discovered and taken into account throughout all of the assembly stages. Solid arrows show imperative pathways; dashed arrows denote optional pathways that may or may not be taken, depending on assembly parameter values and the actual data.

**Figure 4**
Snapshot of a contig in the sequence assembly after the first iteration (visual representation by means of the gap4 program). All sequences were assembled together. After the assembly, miraEST searched for unresolved mismatches with good signal qualities, tagging entire columns as dangerous potential SNP sites for the next iteration. miraEST tagged strong SNP sites in bright red and weak sites in blue; bases differing from the consensus are shown in green by the gap4 program. Some bases were not tagged, although they cover a possible SNP site; these bases generally have trace signals of bad quality that the assembler deemed too dangerous to be taken as a differentiation criterion. miraEST will dismantle that contig and reassemble the sequences immediately, this time using the information gained about the potential SNP sites in the previous assembly to correctly discern between different mRNA transcripts having different SNP variants. The black rectangle amidst the sequences marks the three trace signal extracts shown as examples below; the smaller black boxes within the rectangle mark the discrepant bases that are also surrounded by black boxes in the traces. All sequences have indisputable trace curves and quality values (shown as a blue bar above the traces). One can clearly see that at least three different mRNA transcripts will have to be built, on the basis of the double-base mutation in the middle of the box, one reading CC, the next CT, and the last TC.
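
The column tagging described in Figure 4 can be sketched roughly as follows. This is a simplified illustration, not the assembler's actual criterion: the thresholds `STRONG_MIN_QUALITY` and `WEAK_MIN_QUALITY`, the function `tag_column`, and the use of a single quality value per base (instead of inspecting the traces) are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical thresholds on phred-like quality values (assumptions).
STRONG_MIN_QUALITY = 30
WEAK_MIN_QUALITY = 20


def tag_column(column: List[Tuple[str, int]]) -> Optional[str]:
    """Tag one alignment column as a "strong" or "weak" potential SNP site.

    `column` holds one (base, quality) pair per read covering the column.
    A mismatch counts only if the disagreeing base is itself well supported,
    so a low-quality discrepancy does not cause the column to be tagged.
    """
    best_quality: Dict[str, int] = {}
    for base, quality in column:
        best_quality[base] = max(best_quality.get(base, 0), quality)
    if len(best_quality) < 2:
        return None  # all reads agree, nothing to tag
    # The second-best-supported base decides how trustworthy the mismatch is.
    second_best = sorted(best_quality.values(), reverse=True)[1]
    if second_best >= STRONG_MIN_QUALITY:
        return "strong"
    if second_best >= WEAK_MIN_QUALITY:
        return "weak"
    return None


print(tag_column([("C", 40), ("C", 35), ("T", 38)]))
print(tag_column([("C", 40), ("T", 22)]))
print(tag_column([("C", 40), ("T", 8)]))
```

Columns tagged this way could then be passed to the next assembly iteration as positions where mismatches are not tolerated.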

**Figure 5**
In the last (optional) step of the EST assembly, the input sequences are given strain information; this example shows the effect when two different organism strains (named sponge1 and sponge2) are sequenced and analyzed. Here, miraEST classified the SNPs into two categories: PROS (shown in light blue) for SNPs that occur only between strains/organisms (e.g., column 661) and PIOS (shown in light green) for SNPs that occur both within a strain and between different strains (e.g., column 662). Interestingly enough, most of the SNPs shown in this example will not cause a change in the amino acids of the resulting protein, with one notable exception: the SNP of sponge1_singlet4 at base position 662 causes a TAA codon to be expressed, which is a stop codon. The SNPs of the same sequence at positions 686 and 707 would cause mutations in the amino acid sequence, but are, because of the TAA mutation earlier, in the 3′ UTR of this particular mRNA transcript.
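
The two categories named in Figure 5 can be expressed as a small Python sketch. The function `classify_snp_column`, its input layout, and the label `WITHIN_ONLY` are hypothetical; only the meanings of PROS and PIOS are taken from the caption, and the real classification in miraEST may apply further criteria.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def classify_snp_column(calls: List[Tuple[str, str]]) -> Optional[str]:
    """Classify one column from (strain, base) pairs, one pair per read.

    Returns "PROS" if the variation occurs only between strains, "PIOS" if
    it occurs both within a strain and between strains, "WITHIN_ONLY"
    (a hypothetical label, not named in the caption) if every strain shows
    the same mixture of bases, and None if all reads agree.
    """
    bases_by_strain: Dict[str, Set[str]] = defaultdict(set)
    for strain, base in calls:
        bases_by_strain[strain].add(base)
    base_sets = list(bases_by_strain.values())
    within = any(len(bases) > 1 for bases in base_sets)
    between = any(a != b for a in base_sets for b in base_sets)
    if between and within:
        return "PIOS"
    if between:
        return "PROS"
    if within:
        return "WITHIN_ONLY"
    return None


# Toy columns using the strain names from the figure.
print(classify_snp_column([("sponge1", "A"), ("sponge1", "A"), ("sponge2", "G")]))
print(classify_snp_column([("sponge1", "A"), ("sponge1", "T"), ("sponge2", "A")]))
```

The first call models a column like 661 (variation only between strains), the second a column like 662 (variation inside sponge1 as well as between strains).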
