BMC Bioinformatics. 2006 Jun 27;7:323.
doi: 10.1186/1471-2105-7-323.

Identification of putative domain linkers by a neural network - application to a large sequence database

Satoshi Miyazaki et al. BMC Bioinformatics. 2006.

Abstract

Background: The reliable dissection of large proteins into structural domains is an important issue for structural genomics/proteomics projects. To provide a practical approach to this issue, we tested the ability of a neural network to identify domain linkers in the SWISSPROT database (101602 sequences).

Results: Our search detected 3009 putative domain linkers adjacent to or overlapping with domains, as defined by sequence similarity to either Protein Data Bank (PDB) or Conserved Domain Database (CDD) sequences. Among these putative linkers, 75% were "correctly" located within 20 residues of a domain terminus; the remaining 25% were found in the middle of a domain and probably represented failed predictions. Moreover, our neural network predicted 5124 putative domain linkers in structurally unannotated regions without sequence similarity to PDB or CDD sequences, which suggests the possible existence of novel structural domains. As a comparison, we performed the same analysis by identifying low-complexity regions (LCRs), which are known to encode unstructured polypeptide segments, and observed that the fraction of LCRs that correlate with domain termini is similar to that of domain linkers. However, domain linkers and LCRs appeared to identify different types of domain boundary regions, as only 32% of the putative domain linkers overlapped with LCRs.

Conclusion: Overall, our study indicates that the two methods detect independent and complementary regions, and that combining them can substantially improve the sensitivity of domain boundary prediction. This finding should enable the identification of novel structural domains, yielding new targets for large-scale protein analyses.

Figures

Figure 1
Classification of the predicted linkers and the low-complexity regions. (A) Schematic representation of the positions of the predicted domain boundaries relative to the putative structural domains. The four classes are: correct matches at both ends (class 1), correct matches at either end (class 2), overlaps (class 4), and unmatched locations (class 3). Percentages of putative domain linkers (B) and low-complexity regions (C) in the four classes. An error window parameter, on the horizontal axis, is used to accommodate the terminal ambiguity of the assigned sequence regions. When the distance between an end of a putative domain linker (B) or a low-complexity region (C) and the end of a putative structural domain was smaller than the error window, we considered the position of the predicted domain boundary to be correct. The error window parameter was varied from 5 to 50 residues.
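The four-class scheme in this caption can be illustrated with a minimal Python sketch. It is one reading of the caption, not the authors' code: the function name `classify_region`, the 1-based inclusive `(first, last)` coordinates, and the rule that class 4 means "overlaps a domain with no end matched" are all assumptions made here for illustration.

```python
from typing import List, Tuple

# (first_residue, last_residue), 1-based and inclusive (assumed convention)
Region = Tuple[int, int]


def classify_region(region: Region, domains: List[Region], window: int) -> int:
    """Assign a predicted linker or low-complexity region to class 1-4.

    Hypothetical helper: an end counts as "correct" when it lies closer
    than `window` residues to the facing terminus of some domain.
    """
    start, end = region
    # The region's first residue is compared with domain C-termini ...
    start_ok = any(abs(start - d_end) < window for _, d_end in domains)
    # ... and its last residue with domain N-termini.
    end_ok = any(abs(d_start - end) < window for d_start, _ in domains)

    if start_ok and end_ok:
        return 1  # correct match at both ends
    if start_ok or end_ok:
        return 2  # correct match at one end only
    # No end matched: distinguish regions lying on a domain from the rest.
    overlaps = any(start <= d_end and end >= d_start for d_start, d_end in domains)
    return 4 if overlaps else 3  # 4 = overlap, 3 = unmatched
```

With two made-up domains at residues 1-100 and 130-250 and a window of 20, a region at 105-125 would fall in class 1 under these assumptions, while a region at 40-60 would fall in class 4.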
Figure 2
Putative domain linkers and low-complexity regions assigned in SWISSPROT sequences. Each thick black horizontal bar represents a SWISSPROT sequence used as a test sequence. The SWISSPROT ID number is indicated on the top left of the corresponding sequence. In each SWISSPROT sequence, sequence regions similar to PDB and CDD sequences were assigned as putative structural domains. A green horizontal bar represents a sequence region similar to a PDB sequence. Similarly, the horizontal bars colored in blue, red and magenta represent sequence regions similar to CDD sequences, corresponding to the Pfam, SMART and LOAD (Library Of Ancient Domains) libraries, respectively. Sequence regions predicted to be putative domain linkers are designated by vertical bars in colors ranging from yellow to brown, according to the neural network output values. Low-complexity regions are designated by cyan rectangles overlaid on black bars.
Figure 3
Complexity distribution. The sequence entropy distributions are shown for the putative domain linkers (thick solid line) and the low-complexity regions (thick broken line) longer than 45 residues. The sequence entropy was calculated by a sliding window of 45 residues over the putative domain linkers [43, 51]. The thin solid line represents the sequence entropy of all of the putative domain linkers (including those shorter than 45 residues) calculated with a window equal to the length of the linker.
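A sliding-window entropy profile of the kind described in this caption might be computed as in the sketch below. The exact entropy definition comes from the cited references and is not given here, so the sketch assumes plain Shannon entropy (in bits) of the residue composition in each window; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits) of the residue composition of `segment`.

    Assumption: this is a stand-in for the paper's entropy measure.
    """
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sliding_entropy(sequence: str, window: int = 45) -> List[float]:
    """Entropy of every `window`-residue stretch of `sequence`.

    Sequences shorter than the window are scored once over their full
    length, mirroring the caption's treatment of short linkers.
    """
    if not sequence:
        return []
    if len(sequence) < window:
        return [shannon_entropy(sequence)]
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]
```

Under this definition a homopolymeric stretch scores zero and a window containing all 20 amino acids in equal proportion scores log2(20), so low-complexity regions would be expected to sit toward the low end of the distribution.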
Figure 4
Comparison with blind prediction. The success rate (prediction quality index) of blind prediction is plotted as a function of the error window parameter (cross marks). The prediction quality factors for domain linkers (diamonds), low-complexity regions (squares), and a combined prediction (triangles) are also shown.
Figure 5
Correlation between the positions of domain linkers and putative structural domains. The horizontal scale represents the number of residues in the error window between the linker termini and the corresponding putative structural domain termini. This is calculated as the number of residues separating the last residue (or the first residue) of a domain linker in Classes 1 and 2 from the first residue (or respectively the last residue) of the corresponding putative structural domain. (A) Distribution calculated for putative structural domains detected by similarity to PDB and CDD, (B) to PDB, and (C) to CDD.

References

    1. O'Toole N, Raymond S, Cygler M. Coverage of protein sequence space by current structural genomics targets. J Struct Funct Genomics. 2003;4:47–55. doi: 10.1023/A:1026156025612.
    2. Kim SH. Shining a light on structural genomics. Nat Struct Biol. 1998;5 Suppl:643–645. doi: 10.1038/1334.
    3. Shapiro L, Lima CD. The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science. Structure. 1998;6:265–267. doi: 10.1016/S0969-2126(98)00030-6.
    4. Brenner SE, Barken D, Levitt M. The PRESAGE database for structural genomics. Nucleic Acids Res. 1999;27:251–253. doi: 10.1093/nar/27.1.251.
    5. Mallick P, Goodwill KE, Fitz-Gibbon S, Miller JH, Eisenberg D. Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing. Proc Natl Acad Sci U S A. 2000;97:2450–2455. doi: 10.1073/pnas.050589297.
