BMC Bioinformatics. 2006 Jun 27:7:323.

doi: 10.1186/1471-2105-7-323.

Identification of putative domain linkers by a neural network - application to a large sequence database

Satoshi Miyazaki¹, Yutaka Kuroda, Shigeyuki Yokoyama

PMID: 16800897
PMCID: PMC1538634
DOI: 10.1186/1471-2105-7-323

Satoshi Miyazaki et al. BMC Bioinformatics. 2006.

Affiliation

¹ Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan.

Abstract

Background: The reliable dissection of large proteins into structural domains represents an important issue for structural genomics/proteomics projects. To provide a practical approach to this issue, we tested the ability of a neural network to identify domain linkers from the SWISSPROT database (101602 sequences).

Results: Our search detected 3009 putative domain linkers adjacent to or overlapping with domains, as defined by sequence similarity to either Protein Data Bank (PDB) or Conserved Domain Database (CDD) sequences. Among these putative linkers, 75% were "correctly" located within 20 residues of a domain terminus, and the remaining 25% were found in the middle of a domain, and probably represented failed predictions. Moreover, our neural network predicted 5124 putative domain linkers in structurally un-annotated regions without sequence similarity to PDB or CDD sequences, which suggests the possible existence of novel structural domains. As a comparison, we performed the same analysis by identifying low-complexity regions (LCR), which are known to encode unstructured polypeptide segments, and observed that the fraction of LCRs that correlate with domain termini is similar to that of domain linkers. However, domain linkers and LCRs appeared to identify different types of domain boundary regions, as only 32% of the putative domain linkers overlapped with LCRs.

Conclusion: Overall, our study indicates that the two methods detect independent and complementary regions, and that the combination of these methods can substantially improve the sensitivity of the domain boundary prediction. This finding should enable the identification of novel structural domains, yielding new targets for large-scale protein analyses.

Figures

**Figure 1**
Classification of the predicted linkers and the low-complexity regions. (A) Schematic representation of the positions of the predicted domain boundaries relative to the putative structural domains. The four classes are: correct matches at both ends (class 1), correct matches at either end (class 2), overlaps (class 4), and unmatched locations (class 3). Percentages of putative domain linkers (B) and low-complexity regions (C) in the four classes. An error window parameter, on the horizontal axis, is used to accommodate the terminal ambiguity of the assigned sequence regions. When the distance between the ends of a putative domain linker (B) or a low-complexity region (C), and the end of a putative structural domain was smaller than the error window, we considered the position of the predicted domain boundary to be correct. The error window parameter was varied from 5 to 50 residues.
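
The four-class scheme in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `classify_region`, the inclusive `(start, end)` coordinate convention, and the exact matching rule (a region end counts as "correct" when it lies strictly within the error window of the facing terminus of any putative domain) are assumptions made for illustration, based only on the caption's wording.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # inclusive (start, end) residue positions; a convention assumed here


def classify_region(region: Region, domains: List[Region], window: int) -> int:
    """Return the class (1-4) of a predicted linker or low-complexity region.

    Class numbering follows the caption: 1 = both ends match,
    2 = one end matches, 4 = overlap without a match, 3 = unmatched.
    """
    start, end = region

    # The N-terminal end of the region is compared with domain C-termini,
    # and the C-terminal end with domain N-termini (assumed interpretation).
    left_ok = any(abs(start - d_end) < window for _, d_end in domains)
    right_ok = any(abs(d_start - end) < window for d_start, _ in domains)

    if left_ok and right_ok:
        return 1
    if left_ok or right_ok:
        return 2

    # No terminus lies within the window: separate overlaps from unmatched regions.
    overlaps = any(start <= d_end and d_start <= end for d_start, d_end in domains)
    return 4 if overlaps else 3
```

For example, with hypothetical domains at residues 1-100 and 121-250 and an error window of 20, a region at 101-120 falls in class 1, while a region at 40-60 lies inside the first domain and falls in class 4.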

**Figure 2**
Putative domain linkers and low-complexity regions assigned in SWISSPROT sequences. Each thick black horizontal bar represents a SWISSPROT sequence used as a test sequence. The SWISSPROT ID number is indicated on the top left of the corresponding sequence. In each SWISSPROT sequence, sequence regions similar to PDB and CDD sequences were assigned as putative structural domains. A green horizontal bar represents a sequence region similar to a PDB sequence. Similarly, the horizontal bars colored in blue, red and magenta represent sequence regions similar to CDD sequences, corresponding to the Pfam, SMART and LOAD (Library Of Ancient Domains) libraries, respectively. Sequence regions predicted to be putative domain linkers are designated by vertical bars in colors ranging from yellow to brown, according to the neural network output values. Low-complexity regions are designated by cyan rectangles overlaid on black bars.

**Figure 3**
Complexity distribution. The sequence entropy distributions are shown for the putative domain linkers (thick solid line) and the low-complexity regions (thick broken line) longer than 45 residues. The sequence entropy was calculated by a sliding window of 45 residues over the putative domain linkers [43, 51]. The thin solid line represents the sequence entropy of all of the putative domain linkers (including those shorter than 45 residues) calculated with a window equal to the length of the linker.
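
As a rough illustration of the sliding-window calculation described in this caption, the sketch below computes a per-window sequence entropy. The exact entropy definition used in the paper is given in its references [43, 51] and is not reproduced here; this sketch assumes plain Shannon entropy (in bits) of the residue composition, and the function names `shannon_entropy` and `sliding_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits) of the residue composition of a segment (assumed definition)."""
    n = len(segment)
    counts = Counter(segment)
    # Sum of p * log2(1/p) over the residue types present in the segment.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def sliding_entropy(sequence: str, window: int = 45) -> List[float]:
    """Entropy of every full window along the sequence.

    Sequences shorter than the window are scored once over their whole
    length, mirroring the caption's treatment of short linkers.
    """
    if not sequence:
        return []
    if len(sequence) < window:
        return [shannon_entropy(sequence)]
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]
```

A homopolymeric stretch scores zero in every window, while a segment that uses many residue types equally often scores close to the maximum of log2(20) bits, so low values indicate low-complexity sequence.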

**Figure 4**
Comparison with blind prediction. The success rate (prediction quality index) of blind prediction is plotted as a function of the error window parameter (cross marks). The prediction quality factors for domain linkers (diamonds), low-complexity regions (squares), and a combined prediction (triangles) are also shown.

**Figure 5**
Correlation between the positions of domain linkers and putative structural domains. The horizontal scale represents the number of residues in the error window between the linker termini and the corresponding putative structural domain termini. This is calculated as the number of residues separating the last residue (or the first residue) of a domain linker in Classes 1 and 2 from the first residue (or respectively the last residue) of the corresponding putative structural domain. (A) Distribution calculated for putative structural domains detected by similarity to PDB and CDD, (B) to PDB, and (C) to CDD.

References

1. O'Toole N, Raymond S, Cygler M. Coverage of protein sequence space by current structural genomics targets. J Struct Funct Genomics. 2003;4:47–55. doi: 10.1023/A:1026156025612.
2. Kim SH. Shining a light on structural genomics. Nat Struct Biol. 1998;5 Suppl:643–645. doi: 10.1038/1334.
3. Shapiro L, Lima CD. The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science. Structure. 1998;6:265–267. doi: 10.1016/S0969-2126(98)00030-6.
4. Brenner SE, Barken D, Levitt M. The PRESAGE database for structural genomics. Nucleic Acids Res. 1999;27:251–253. doi: 10.1093/nar/27.1.251.
5. Mallick P, Goodwill KE, Fitz-Gibbon S, Miller JH, Eisenberg D. Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing. Proc Natl Acad Sci U S A. 2000;97:2450–2455. doi: 10.1073/pnas.050589297.
