Plant Physiol. 2017 Jun;174(2):886-903.
doi: 10.1104/pp.17.00294. Epub 2017 Apr 26.

Pipeline to Identify Hydroxyproline-Rich Glycoproteins

Kim L Johnson et al. Plant Physiol. 2017 Jun.

Abstract

Intrinsically disordered proteins (IDPs) are functional proteins that lack a well-defined three-dimensional structure. The study of IDPs is a rapidly growing area as the crucial biological functions of more of these proteins are uncovered. In plants, IDPs are implicated in stress responses, signaling, and regulatory processes. The hydroxyproline-rich glycoproteins (HRGPs), a superfamily of cell wall proteins, have characteristic features of IDPs. Their protein backbones are rich in the disordering amino acid proline, they contain repeated sequence motifs and extensive posttranslational modifications (glycosylation), and they have been implicated in many biological functions. HRGPs are evolutionarily ancient, having been isolated from cell walls ranging from the protein-rich walls of chlorophyte algae to the cellulose-rich walls of embryophytes. Examination of HRGPs in a range of plant species should provide valuable insights into how they have evolved. Although HRGPs are commonly divided into the arabinogalactan proteins, extensins, and proline-rich proteins, in reality a continuum of structures exists within this diverse and heterogeneous superfamily. An inability to accurately classify HRGPs leads to inconsistent gene ontologies, limiting the identification of HRGP classes in existing and emerging omics data sets. We present a novel and robust motif and amino acid bias (MAAB) bioinformatics pipeline to classify HRGPs into 23 descriptive subclasses. MAAB was validated using available genomic resources and then applied to the 1000 Plants transcriptome project (www.onekp.com) data set. Detection of HRGPs improved significantly when a multiple-k-mer transcriptome assembly methodology was used. The MAAB pipeline is readily adaptable and can be modified to optimize the recovery of IDPs from other organisms.

Figures

Figure 1.
Schematic of the predicted structures of selected HRGPs. The major angiosperm HRGP multigene families are the AGPs (A–C), cross-linking (CL)-EXTs (D and E), and PRPs (G and H). Hybrid HRGPs (F) contain motifs characteristic of more than one HRGP family and are commonly found in green algae (I–K). The protein motifs that direct the hydroxylation of Pro to Hyp and undergo subsequent O-glycosylation are as follows: (1) SP, AP, TP, VP, and GP (light blue bars), to which large type II arabinogalactan chains (type II AG; orange) are added; (2) SP3-5 glycomotif repeats (red bars) that direct the addition of short arabinose (Ara) side chains (dark red) on Hyp residues and Gal (green) on Ser residues. In CL-EXTs, these SP3-5 motifs alternate with Y cross-linking motifs (dark blue bars representing YXY, VYK, and YY) in the protein backbone. Y motifs can form both intramolecular and intermolecular cross-links. Intermolecular cross-links (gray) occur through the formation of diisodityrosine. Algal HRGPs have single Tyr residues outside the Pro-rich regions (dark blue, dashed; I–K); (3) PRP motifs (brown bars) direct minimal glycosylation of short Ara residues (G and H). Chimeric HRGPs have a recognized PFAM domain (black vertical lined box; B, E, and H) in addition to a HRGP region.
Figure 2.
Disorder prediction, sequence alignment, and phylogenetic trees of the Arabidopsis classical GPI-AGPs and CL-EXTs. A, Protein disorder (PONDR) plots (see “Materials and Methods”) for AtAGP6 (At5g14380; left), AtEXT3 (At1g21310; middle), and prolyl-4-hydroxylase (AtP4H1; At2g43080; right) using VL-XT (red), VL3 (green), and VSL2 (blue). PONDR prediction scores above the threshold line (0.5) predict disorder; below the line, they predict order. B, Sequence alignment (MUSCLE) of 16 Arabidopsis GPI-AGPs with the non-GPI-AGP AtAGP51 included for comparison. Endoplasmic reticulum (ER; N-terminal) and GPI-anchor (C-terminal) signal sequences are colored in green and orange, respectively. Glycomotifs and selected residues are highlighted as follows: AP1-3 (yellow); SP1-2 (blue); SPPP (also found in EXTs; blue underlined); TP1-3 (pink/purple); [G/V]P1-3 (gray); K (bright green); and M (olive green). This shows the diversity and lack of sequence conservation between family members. C, Maximum likelihood tree (MEGA) of Arabidopsis GPI-AGPs and AtAGP51. D, Maximum likelihood tree (MEGA) of Arabidopsis CL-EXT and AtLRX1 (chimeric CL-EXT). In C and D, numbers on the nodes represent support with 100 bootstrap replicates (70 or greater, green; 60–69, orange, 40–59, black) with subclades AGP-a to AGP-j (C) and EXT-a to EXT-g (D; denoted by horizontal lines). Scale bars for branch length measure the number of substitutions per site. The CL-EXT alignment with shaded motifs is shown in Supplemental Figure S2.
Figure 3.
Overview of the MAAB pipeline for the identification and classification of non-chimeric HRGPs. The pipeline consists of two major stages: stage 1 (1a–1f), identification; and stage 2, classification. Stage 1 largely consists of removing unwanted sequences, including chimeric HRGPs and AG peptides, and retaining sequences with the desired amino acid bias (45% or greater) and ER signal sequence. Stage 2 filters sequences into four categories based on the percentage amino acid composition that is dominant by 2% or greater: AGPs (boxed in orange) if PAST, EXTs (boxed in red) if PSKY, and PRPs (boxed in pink) if PVKY. If no clear bias exists (Δ amino acid bias < 2%) the sequence is placed in the shared bias HRGPs (boxed in yellow). The next step is HRGP motif analysis, which uses motif type and number (no.). The motifs used for AGPs are [ASVTG]P, [ASVTG]PP, [AVTG]PPP; those used for EXT are SP3, SP4, SP5, [FY]XY, KHY, VY[HKDE], VxY, and YY; and those used for PRPs are PPV[QK], PPVx[KT], and KKPCPP. A relative HRGP motif count (for AGP and PRP bias) ensures that sequences have the motifs expected for the amino acid bias class they are categorized into (see “Materials and Methods”). The number of accepted AGP motifs is calculated from the number of AGP motifs divided by 2 (since two typical AGP motifs [e.g. SPAP] have a similar length to a typical EXT motif [e.g. SPPP] and a typical PRP motif [e.g. PPVxK]). Accepted CL-EXT motifs have a minimum requirement of two SP3-5 motifs and two Y motifs that must be present in a similar ratio (SPn:Y between 0.25 and 4). An additional MAAB class (class 24) arises for proteins with less than 15% known HRGP motifs (boxed in blue). After HRGP motif classification, the sequences that do not meet the above criteria (red arrow) are analyzed separately from the classical classes and placed into classes representing hybrid HRGPs. 
Before the final classification, all sequences are analyzed for the presence of a C-terminal GPI-anchor signal sequence. Sequences are thus categorized into one of 24 classes (Table I; see Fig. 4) with 23 classes of HRGPs: classes 1 to 4 representing the classical HRGPs classes; classes 5 to 23 representing minor HRGP classes consisting of, for example, hybrid HRGPs; and a final class, MAAB class 24, likely representing either non-HRGPs or unknown HRGPs.
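The amino acid bias step described in the Figure 3 legend can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that the 45% threshold applies to the highest of the three compositions (PAST, PSKY, PVKY) and that "dominant by 2% or greater" means a difference of at least 2 percentage points between the highest and second-highest composition. The function names and the None return value for rejected sequences are hypothetical.

```python
from collections import Counter

# Residue sets named in the Figure 3 legend for each bias class.
BIAS_SETS = {"AGP": "PAST", "EXT": "PSKY", "PRP": "PVKY"}


def bias_percentages(seq):
    """Return the percentage of residues in each bias set for one sequence."""
    counts = Counter(seq.upper())
    n = len(seq)
    return {
        name: 100.0 * sum(counts[aa] for aa in residues) / n
        for name, residues in BIAS_SETS.items()
    }


def classify_bias(seq, min_bias=45.0, min_delta=2.0):
    """Assign 'AGP', 'EXT', 'PRP', 'shared', or None (bias below threshold).

    Assumption: min_bias is tested against the highest composition only,
    and min_delta is the gap between the two highest compositions.
    """
    if not seq:
        return None
    ranked = sorted(bias_percentages(seq).items(), key=lambda kv: kv[1], reverse=True)
    (top_name, top), (_, second) = ranked[0], ranked[1]
    if top < min_bias:
        return None  # would be discarded in stage 1
    if top - second < min_delta:
        return "shared"  # no clearly dominant bias
    return top_name


if __name__ == "__main__":
    # Toy sequences made up for illustration, not real HRGPs.
    print(classify_bias("APAPTPSPAPTAAPSA"))  # AGP
    print(classify_bias("SPPPPYVYKSPPPPKY"))  # EXT
    print(classify_bias("PPPPPPPPGG"))        # shared
```

In the pipeline as described, this bias assignment is only the first part of stage 2; the motif analysis then checks that a sequence carries the motifs expected for its bias class.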
Figure 4.
Illustration of the parameters used for MAAB classification of HRGPs. Where possible, for the Arabidopsis sequences, we have included both the gene names and the nomenclature designated by Showalter et al. (2010; in blue text). The total number of Arabidopsis sequences identified for a given class is shown in parentheses, and up to four examples are shown. If no Arabidopsis sequence was present in a given class, then sequences from other species, either from Phytozome or 1KP, were used. For class 18, an Arabidopsis sequence (At4G15160.1) does not have the expected features of this class due to the partially order-dependent assignment of the minor classes (see “Materials and Methods”). The columns reporting amino acid bias, as used to classify sequences into AGP bias (orange), EXT bias (red), PRP bias (purple), or shared bias (yellow), are shaded as for Figure 3. Shading of motifs is used to highlight the number of hybrid sequences that satisfy 10% or greater motifs for any given HRGP class. SPn:Y is reported as the number of SPn motifs:number of Y motifs (ratio of SPn:Y reported as a fraction). White text for the SPn:Y ratio indicates that the sequence does not satisfy at least one of the criteria for CL-EXT: at least two SPn and two Y motifs (indicated by asterisks) or a ratio of SPn:Y between 0.25 and 4 (reported here as 0 if either value is 0). The order of motif searching is CL-EXT motifs first, followed by PRP motifs, and, finally, AGP motifs. Sequences are shown with HRGP motifs (as used for classification) highlighted as follows: light blue for AGP motifs, red for EXT SP3-5 motifs, dark blue for Y-based EXT motifs, and olive green for PRP motifs. In cases where motifs overlap, as occurs frequently in the shared bias classes (18–23), shading shows only the accepted, first identified, motif.
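The CL-EXT criteria and the motif search order described in the Figure 3 and Figure 4 legends can be sketched in Python as follows. This is a hedged illustration and not the published code. The regular expressions are one possible reading of the listed motifs (with x taken as any residue), matches are counted without overlap, and masking of already accepted residues is an assumed way of letting the first identified motif take precedence. All function and variable names are hypothetical.

```python
import re

# One reading of the motif lists in the Figure 3 legend; "x" is any residue.
SPN = re.compile(r"SP{3,5}")
Y_MOTIF = re.compile(r"[FY][A-Z]Y|KHY|VY[HKDE]|V[A-Z]Y|YY")

# Search order stated in the Figure 4 legend: CL-EXT, then PRP, then AGP.
MOTIFS_IN_SEARCH_ORDER = [
    ("EXT", re.compile(SPN.pattern + "|" + Y_MOTIF.pattern)),
    ("PRP", re.compile(r"KKPCPP|PPV[QK]|PPV[A-Z][KT]")),
    ("AGP", re.compile(r"[AVTG]PPP|[ASVTG]PP|[ASVTG]P")),
]


def spn_y_summary(seq):
    """Return (n_SPn, n_Y, ratio, meets_CL_EXT_criteria) for one sequence."""
    n_sp = len(SPN.findall(seq))
    n_y = len(Y_MOTIF.findall(seq))
    # The legend reports the ratio as 0 if either count is 0.
    ratio = n_sp / n_y if n_sp and n_y else 0.0
    meets = n_sp >= 2 and n_y >= 2 and 0.25 <= ratio <= 4
    return n_sp, n_y, ratio, meets


def ordered_motif_hits(seq):
    """Collect motifs class by class, masking accepted residues with '-'.

    Residues claimed by an earlier class cannot be reused by a later one,
    so only the first identified motif is kept where motifs overlap.
    """
    masked = seq
    hits = []
    for label, pattern in MOTIFS_IN_SEARCH_ORDER:
        hits.extend((m.start(), label, m.group()) for m in pattern.finditer(masked))
        masked = pattern.sub(lambda m: "-" * len(m.group()), masked)
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequences made up for illustration, not real HRGPs.
    print(spn_y_summary("SPPPPYVYKSPPPPKYVY"))  # (2, 2, 1.0, True)
    print(ordered_motif_hits("SPPPPYVYKAPAPPPVK"))
```

In the second toy sequence the stretch APAPPPVK contains overlapping candidates; because PRP motifs are searched before AGP motifs, PPVK is accepted first and only the two remaining AP motifs are counted for AGP.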
