BMC Bioinformatics. 2006 Mar 6;7:110.

doi: 10.1186/1471-2105-7-110.

IsoSVM--distinguishing isoforms and paralogs on the protein level

Michael Spitzer¹, Stefan Lorkowski, Paul Cullen, Alexander Sczyrba, Georg Fuellen

PMID: 16519805
PMCID: PMC1431569
DOI: 10.1186/1471-2105-7-110

Michael Spitzer et al. BMC Bioinformatics. 2006.

Affiliation

¹ Division of Bioinformatics, Biology Department, Schlossplatz 4, 48149 Münster, Germany. michael.spitzer@uni-muenster.de

Abstract

Background: Recent progress in cDNA and EST sequencing is yielding a deluge of sequence data. Like database search results and proteome databases, these data give rise to inferred protein sequences without ready access to the underlying genomic data. Analysis of this information (e.g. for EST clustering or phylogenetic reconstruction from proteome data) is hampered because it is not known whether two protein sequences are isoforms (splice variants) or not (i.e. paralogs/orthologs). However, even without knowledge of the intron/exon structure, visual analysis of the pattern of similarity across the alignment of the two protein sequences is usually helpful, since paralogs and orthologs feature substitutions with respect to each other, whereas isoforms do not.

Results: The IsoSVM tool introduces an automated approach to identifying isoforms on the protein level using a support vector machine (SVM) classifier. Based on three specific features used as input to the SVM classifier, isoforms can be identified automatically with little effort and with an accuracy of more than 97%. We show that the SVM is superior to a radial basis function network and to a linear classifier. As an example application, we use IsoSVM to estimate that a set of Xenopus laevis EST clusters consists of approximately 81% of cases in which the sequences are paralogs of each other and 19% of cases in which they are isoforms of each other. The number of isoforms and paralogs in this allotetraploid species is of interest in the study of evolution.

Conclusion: We developed an SVM classifier that can be used to distinguish isoforms from paralogs with high accuracy and without access to the genomic data. It can be used to analyze, for example, EST data and database search results. Our software is freely available on the Web, under the name IsoSVM.

Figures

**Figure 1**
**Visualization of a part of an alignment of (A)** two paralogous sequences (the human ABCB4 and ABCB1 protein) and **(B)** two isoforms (the human ABCB4 protein and its isoform c), representing an ideal case. Positions with matches between the two sequences are indicated by "|", mismatches by "#" and amino acids vs. gap characters by ":". The values of the three features (cf. ***Methods***, section *Features*) for the *full-length* sequences compared in panel (A) are (i) *sequence similarity* 75.76%, (ii) *inverse CBIN count* 0.0027, (iii) *fraction of consecutive matches and mismatches* 0.7111. For the *full-length* sequences compared in panel (B) we have (i) sequence similarity 96.33%, (ii) inverse CBIN count 0.3333, (iii) fraction of consecutive matches and mismatches 0.9969.

**Figure 2**
**Features displayed by the samples in the canonical training dataset.** Panels **(A)** to **(C)** illustrate combinations of two of the three features. Panel **(D)** illustrates all three features at the same time. Samples arising from the comparison of paralogous sequences are shown in blue, whereas isoforms are shown in red. An *inverse CBIN count* of 1/n arises if n CBINs are featured by a given sample. Though the samples of both classes separate well in general, some samples of one class "overlap" into the other class.

**Figure 3**
**Illustration of the different cases of consecutive blocks of identities or non-identities (CBINs). (A)** CBIN of matches, **(B)** CBIN of gaps (counted as mismatches), **(C)** CBIN of mismatches, **(D)** example of a comparison of two sequences with an alignment length of 32. Matches are denoted by "|", mismatches by "#" and amino acids aligned to gaps by ":". The example alignment of length 32 features eight CBINs. The values of the three features are: (i) sequence similarity 0.594, (ii) inverse CBIN count 0.125, (iii) fraction of consecutive matches and mismatches 0.75.
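The first two feature values quoted for the example in panel (D) can be reproduced with a minimal sketch. It makes several assumptions: that sequence similarity is the number of matches divided by the alignment length (consistent with 0.594 = 19/32), that gaps are merged with mismatches when counting CBINs, and that the alignment string below has the same length, match count and block count as the example; the string itself is hypothetical and not taken from the figure. The function name `alignment_features` is illustrative only. The third feature is left out because its exact definition is not given in this text.

```python
from itertools import groupby


def alignment_features(symbols: str) -> tuple[float, float]:
    """Return (sequence similarity, inverse CBIN count) for an alignment
    written with "|" for matches, "#" for mismatches and ":" for gaps."""
    if not symbols:
        raise ValueError("alignment must not be empty")
    # Assumption: similarity = matches / alignment length.
    similarity = symbols.count("|") / len(symbols)
    # Assumption: gaps count as mismatches, so each position is reduced to
    # identity (True) or non-identity (False) before counting blocks.
    states = [ch == "|" for ch in symbols]
    cbin_count = sum(1 for _ in groupby(states))
    return similarity, 1.0 / cbin_count


# Hypothetical alignment: length 32, 19 matches, eight alternating blocks.
example = "|||||" + "###" + "|||||" + "::::" + "|||||" + "###" + "||||" + ":::"
similarity, inverse_cbin = alignment_features(example)
print(round(similarity, 3), inverse_cbin)  # prints: 0.594 0.125
```

An alignment consisting of a single block of matches would give an inverse CBIN count of 1, the largest possible value under this reading.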

**Figure 4**
**Accuracy of classifiers measured by jackknife resampling, employing all three features.** Performance of the SVM classifier is compared to classifiers based on an RBF network as well as a linear classifier. Mean accuracy and standard error of the mean were assessed by 100-fold jackknife resampling using 7604 samples resulting from a visual inspection process of protein sequences taken from Genbank.

**Figure 5**
**Visual inspection process.** Matches in the alignments are colored in blue and mismatches in red. Amino acids aligned to gaps are indicated in green. Panels **(A)** to **(D)** illustrate alignments of two protein sequences classified as isoforms (panels **(A)** and **(B)**) or as paralogs (panels **(C)** and **(D)**). The sequences shown in panel **(A)** feature a shared subsequence (a putative constitutive exon), marked in blue. The upper sequence features an additional exon at the beginning (marked in green) that is missing in the lower sequence. In contrast, a putative exon at the end (also shown in green) is found in the lower sequence only. Comparison of the two putative isoforms shown in panel **(B)** reveals two constitutive exons in the middle and towards the end of the alignment, colored in blue (the only mismatch is interpreted as a sequencing error, or a polymorphism). These are separated by a stretch of amino acids aligned to gaps, interpreted as an exon skipped in the lower sequence. At the beginning of the alignment, the upper sequence features a long stretch of amino acids aligned to gaps and a few mismatches; two mutually exclusive exons are a plausible interpretation, since the lower sequence (starting with G and not with M) is incomplete and its first exon is probably much longer. At the end of the alignment both sequences feature a stretch of mismatches and gaps (colored in red), interpreted as mutually exclusive exons (indicated by a black frame). The sequences compared in panel **(C)** give rise to a sample of the paralog class. In general, the alignment features many mismatches, interpreted as substitutions, and six stretches of amino acids aligned to gaps (putative deletions). Panel **(D)** illustrates another putative paralog. Besides a shared stretch (featuring numerous substitutions) in the middle of the alignment, the upper sequence features putative deletions, or missing exons. It may thus be a case of an isoform of a paralog.

**Figure 6**
**SVM training process.** The complete dataset generated by visual inspection was split into two parts, yielding a canonical training dataset of 3,802 samples and a canonical testing dataset of 3,802 samples, each consisting of an equal number of isoform and paralog instances. The canonical training dataset was again split into four subsets (denoted by numbers in circles) and submitted to the grid-search procedure. The resulting classifier was then tested on the canonical testing dataset.
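The data handling described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names `balanced_halves` and `four_fold_splits` are hypothetical, samples are stood in for by integer IDs, and the four subsets are assumed to serve as cross-validation folds during the grid search, which the caption does not state explicitly. The SVM fitting itself is omitted.

```python
import random


def balanced_halves(isoforms, paralogs, seed=0):
    """Split each class in half so that the training and testing halves
    both contain equal numbers of isoform and paralog samples."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in (("isoform", isoforms), ("paralog", paralogs)):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train += [(s, label) for s in shuffled[:half]]
        test += [(s, label) for s in shuffled[half:]]
    return train, test


def four_fold_splits(samples, k=4, seed=0):
    """Yield (fit_part, validation_part) pairs in which each of the k
    subsets is held out once (assumed cross-validation scheme)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        fit_part = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield fit_part, folds[i]


# 7,604 samples in total; equal class sizes in both halves imply 3,802 per class.
train, test = balanced_halves(range(3802), range(3802))
splits = list(four_fold_splits(train))
print(len(train), len(test), [len(v) for _, v in splits])
```

In a grid search, each candidate parameter setting would be fitted on `fit_part` and scored on `validation_part` for all four splits, and the best setting would then be evaluated once on `test`.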

Cited by

Ancient dynamin segments capture early stages of host-mitochondrial integration.
Purkanti R, Thattai M. Proc Natl Acad Sci U S A. 2015 Mar 3;112(9):2800-5. doi: 10.1073/pnas.1407163112. Epub 2015 Feb 17. PMID: 25691734. Free PMC article.
Phylogenomic profiles of whole-genome duplications in Poaceae and landscape of differential duplicate retention and losses among major Poaceae lineages.
Zhang T, Huang W, Zhang L, Li DZ, Qi J, Ma H. Nat Commun. 2024 Apr 17;15(1):3305. doi: 10.1038/s41467-024-47428-9. PMID: 38632270. Free PMC article.
PET/MRI Radiomics in Patients With Brain Metastases.
Lohmann P, Kocher M, Ruge MI, Visser-Vandewalle V, Shah NJ, Fink GR, Langen KJ, Galldiks N. Front Neurol. 2020 Feb 7;11:1. doi: 10.3389/fneur.2020.00001. eCollection 2020. PMID: 32116995. Free PMC article. Review.
PIC-Me: paralogs and isoforms classifier based on machine-learning approaches.
Oh J, Lee SG, Park C. BMC Bioinformatics. 2021 Oct 21;22(Suppl 11):311. doi: 10.1186/s12859-021-04229-x. PMID: 34674638. Free PMC article.
Revising transcriptome assemblies with phylogenetic information.
Guang A, Howison M, Zapata F, Lawrence C, Dunn CW. PLoS One. 2021 Jan 12;16(1):e0244202. doi: 10.1371/journal.pone.0244202. eCollection 2021. PMID: 33434218. Free PMC article.
