Comput Struct Biotechnol J. 2022 Sep 18:20:5516-5523.
doi: 10.1016/j.csbj.2022.09.011. eCollection 2022.

Regions with two amino acids in protein sequences: A step forward from homorepeats into the low complexity landscape

Pablo Mier et al. Comput Struct Biotechnol J. 2022.

Abstract

Low complexity regions (LCRs) differ in amino acid composition from the background provided by the corresponding proteomes. The simplest LCRs are homorepeats (or polyX), regions composed mostly of one amino acid type. Homorepeats have been characterized extensively, and their taxonomic, functional and structural features depend on the amino acid type and sequence context. The next step in the study of LCRs is the regions composed of two types of amino acids, which we call polyXY. We classify polyXY into three categories based on the arrangement of the two amino acid types 'X' and 'Y': direpeats (e.g. 'XYXYXY'), joined (e.g. 'XXXYYY') and shuffled (e.g. 'XYYXXY'). We developed a script to search for polyXY, and located them in a comprehensive set of 20,340 reference proteomes. These results are available in a dedicated web server called XYs, to which users can also submit their own protein datasets to detect polyXY. We studied the distribution of polyXY types by amino acid pair XY and category, and show that polyXY in Eukaryota are mainly located within intrinsically disordered regions. Our study provides a first step towards the characterization of polyXY as protein motifs.
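The three categories can be illustrated with a short Python sketch. The abstract defines them only by example, so the rules used here are an assumed reading rather than the authors' implementation: a region whose residue type changes at every position is treated as a direpeat, a region with a single change as joined, and anything else as shuffled. The function name `classify_polyxy` is hypothetical.

```python
def classify_polyxy(region: str) -> str:
    """Classify a two-residue region as 'direpeat', 'joined' or 'shuffled'.

    Assumed rules (inferred from the examples XYXYXY, XXXYYY, XYYXXY):
    the category depends on how often the residue type switches.
    """
    if len(set(region)) != 2:
        raise ValueError("a polyXY region must contain exactly two residue types")

    # Count the boundaries where adjacent residues differ.
    switches = sum(1 for a, b in zip(region, region[1:]) if a != b)

    if switches == 1:
        return "joined"      # one block of X followed by one block of Y
    if switches == len(region) - 1:
        return "direpeat"    # strictly alternating residues
    return "shuffled"        # any other arrangement


for example in ("XYXYXY", "XXXYYY", "XYYXXY"):
    print(example, classify_polyxy(example))
```

Under these assumed rules the three examples from the abstract map to direpeat, joined and shuffled, in that order.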

Keywords: Linear motifs; Low complexity regions; Protein sequence analysis; polyXY.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Fig. 1
Detection of polyXY regions. A window of length six (bars; default parameter) slides over a protein sequence. At each position a test is run to check that only two types of residues are present and that each of them is detected at least twice. Green and red bars indicate positive and negative results of the test. The positives are overlapped to produce extended polyXY regions (thick green bars at the bottom). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
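The window test described in this caption can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the names `window_is_polyxy`, `find_polyxy` and `min_count` are hypothetical, coordinates are 0-based and half-open, and merging only those overlapping windows that share the same residue pair is an assumption about how the positives are combined.

```python
from collections import Counter


def window_is_polyxy(window: str, min_count: int = 2) -> bool:
    """True if the window has exactly two residue types, each seen at least min_count times."""
    counts = Counter(window)
    return len(counts) == 2 and all(c >= min_count for c in counts.values())


def find_polyxy(sequence: str, window: int = 6, min_count: int = 2):
    """Slide a window over the sequence and merge overlapping positive windows.

    Returns a list of (start, end, pair) tuples with half-open coordinates.
    Assumption: windows are merged only when they contain the same two residues.
    """
    regions = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if not window_is_polyxy(segment, min_count):
            continue
        pair = "".join(sorted(set(segment)))
        end = start + window
        if regions and regions[-1][2] == pair and start < regions[-1][1]:
            # Extend the previous region of the same residue pair.
            regions[-1] = (regions[-1][0], end, pair)
        else:
            regions.append((start, end, pair))
    return regions


print(find_polyxy("MKGLGLGLGLKR"))  # [(2, 10, 'GL')]
```

In the example, three consecutive windows starting at positions 2, 3 and 4 pass the test and are merged into one region covering `GLGLGLGL`.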
Fig. 2
Description of polyXY features found in major taxonomic groups. (A) Proportion of polyXY per category detected in all the proteomes considered per taxa. (B) Top 10 most prevalent polyXY per taxa. (C) Fraction of polyXY regions containing a given amino acid per taxa, compared to the background frequency of the amino acid. Labels are shown for amino acids present in more than 15 % of polyXY regions. The x = y/2 line is indicated in black.
Fig. 3
Proportion of direpeat polyXY regions per number of units, for the top 10 most abundant types per taxa.
Fig. 4
Structure models of direpeats polyGL regions. AlphaFold predictions for (A) Arabinose import protein AraG (UniProtKB:Q882I8; AlphaFold:AF-Q882I8-F1) and (B) GMP synthase GuaA (UniProtKB:Q5NG38; AlphaFold:AF-Q5NG38-F1), from bacteria Pseudomonas syringae pv. tomato and Francisella tularensis subsp. tularensis, respectively. Positionally annotated domains overlapping with the polyXY regions are colored in blue. The polyXY region, colored in red, is a direpeats polyGL. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 5
Joined polyVG prevails over joined polyGV in archaean proteins. (A) Relative position of joined polyGV and joined polyVG in archaean proteins; a non-parametric Mann–Whitney U statistical test was performed to compare the distributions. (B) Overlap of polyVG regions with any domain or with FAD binding domains, taken from the positionally annotated domains in the UniProtKB database, and considering the start position of the polyVG region before or after position 50 of the protein. (C) Positions 1–20 of the logos from the FAD binding domain 2 (PF00890) and FAD binding domain 3 (PF01494), obtained from the Pfam database.
Fig. 6
Categorization of shuffled polyXY. Number of shuffled polyXY regions per taxa containing a joined polyXY, a direpeat polyXY, or both.
Fig. 7
Overlap between polyXY, globular domains and disordered regions. Per polyXY region, a randomly-placed region with the same length was checked for overlap with a domain or a disordered region in the same protein (black circle). The top 10 most prevalent polyXY regions in Eukarya were considered. Non-parametric Mann–Whitney U statistical tests were performed to compare the distributions.
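The random-placement control in this caption can be sketched in Python as below. The code only illustrates the idea of comparing an observed region with a same-length region placed at random in the same protein. The function names, the half-open coordinate convention and the uniform choice of start position are assumptions, and the statistical comparison itself is not shown.

```python
import random


def overlaps(region, annotations):
    """True if the half-open region (start, end) overlaps any annotated interval."""
    start, end = region
    return any(start < a_end and a_start < end for a_start, a_end in annotations)


def random_region(protein_length, region_length, rng):
    """Place a region of the given length uniformly at random inside the protein."""
    start = rng.randrange(protein_length - region_length + 1)
    return (start, start + region_length)


# Hypothetical protein of 120 residues with one annotated disordered region.
rng = random.Random(0)           # fixed seed so the sketch is reproducible
disordered = [(40, 90)]
observed = (55, 63)              # an illustrative polyXY region
control = random_region(120, observed[1] - observed[0], rng)

print(overlaps(observed, disordered), overlaps(control, disordered))
```

Repeating the control placement for every polyXY region gives the background distribution against which the observed overlaps are compared.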
