Hum Mol Genet. 2013 Aug 15;22(16):3227-38.
doi: 10.1093/hmg/ddt176. Epub 2013 Apr 16.

Dominant effects of the Huntington's disease HTT CAG repeat length are captured in gene-expression data sets by a continuous analysis mathematical modeling strategy

Jong-Min Lee et al. Hum Mol Genet. 2013.

Abstract

In Huntington's disease (HD), the size of the expanded HTT CAG repeat mutation is the primary driver of the processes that determine age at onset of motor symptoms. However, correlation of cellular biochemical parameters also extends across the normal repeat range, supporting the view that the CAG repeat represents a functional polymorphism with dominant effects determined by the longer allele. A central challenge to defining the functional consequences of this single polymorphism is the difficulty of distinguishing its subtle effects from the multitude of other sources of biological variation. We demonstrate that an analytical approach based upon continuous correlation with CAG size was able to capture the modest (∼21%) contribution of the repeat to the variation in genome-wide gene expression in 107 lymphoblastoid cell lines, with alleles ranging from 15 to 92 CAGs. Furthermore, a mathematical model from an iterative strategy yielded predicted CAG repeat lengths that were significantly positively correlated with true CAG allele size and negatively correlated with age at onset of motor symptoms. Genes negatively correlated with repeat size were also enriched in a set of genes whose expression was CAG-correlated in human HD cerebellum. These findings both reveal the relatively small but detectable impact of variation in the CAG allele in global data in these peripheral cells and provide a strategy for building multi-dimensional data-driven models of the biological network that drives the HD disease process by continuous analysis across allelic panels of neuronal cells vulnerable to the dominant effects of the HTT CAG repeat.

Figures

Figure 1.
CAG length-independent HTT mRNA expression levels in human lymphoblastoid cell lines. (A) Two microarray probes in the Affymetrix U133 plus 2 microarray (202389_s_at, red; 202390_s_at, blue) represent human HTT mRNA. The approximate locations of the target sequences of these two probes are shown relative to different transcripts (red and blue vertical lines). Genomic location was based on the UCSC Genome Browser GRCh37/hg19 assembly. (B) HTT expression levels determined by the two microarray probes (y-axis, log2 scale gcRMA values) were plotted against the length of the longer CAG repeat (x-axis). Red and blue circles represent 202389_s_at and 202390_s_at, respectively. (C) HTT expression levels determined by the two microarray probes (y-axis, log2 scale) were plotted against the length of the shorter CAG repeat (x-axis). (D) HTT mRNA levels determined by microarray probes may represent the sum of mRNAs expressed from both alleles (one allele with the longer CAG repeat and the other with the shorter CAG repeat). Therefore, HTT expression levels (y-axis, log2 scale) were plotted against the sum of the longer and shorter CAG repeat lengths (x-axis). (E) To determine whether the lengths of the CAG repeats were significantly associated with the expression levels of HTT, HTT gene-expression levels determined by microarray probe 202389_s_at were modeled as a function of: (i) the length of the longer CAG repeat and that of the shorter CAG repeat and (ii) the sum of the longer and shorter CAGs in a multiple regression model. (F) HTT gene-expression levels determined by microarray probe 202390_s_at were modeled as a function of: (i) the length of the longer CAG repeat and that of the shorter CAG repeat and (ii) the sum of the longer and shorter CAGs. Slope represents the estimated slope of each variable. Summary statistics suggest that the lengths of the CAG repeats are not significant determinants of HTT mRNA levels.
Figure 2.
Effects of the number of probes and the training-set size on the prediction models. (A) To understand the relationship between the number of probes used in the model and the model performance, we used fixed numbers of top probes for PLSR modeling. Among 107 discovery set samples, 97 samples were randomly chosen as training samples and the remaining 10 samples were used as test samples. Gene-expression data and CAG repeat lengths of training samples were used to calculate correlation, and fixed numbers of top probes (50–1000 probes) were used to build prediction models. For example, 1000 on the x-axis means that the 1000 probes most strongly correlated with CAG repeat length in the training samples were used for mathematical modeling. Then, CAG repeat lengths of test samples were predicted using the same parameters to calculate the error rate (root mean squared error of prediction; RMSEP). These procedures were repeated 1000 times. The frequency of error rates (RMSEP on the y-axis) was represented as smoothed color density, where points with darker color represent higher frequency compared with points with lighter color. The general trend line (red) was based on locally weighted scatterplot smoothing (LOWESS). (B) Prediction models were generated using different numbers of training samples to understand the effects of the size of the training set on the predictive power of the models. Models using true CAGs of training samples (red) or permuted CAGs (blue) were built to predict CAGs of test samples. For each split, seven prediction models were built using 200, 250, 300, 350, 400, 450 and 500 top probes, and the averages of predicted CAGs were taken as the representative predicted CAG. These procedures were repeated 1000 times. Significance of the correlation between predicted CAGs and measured CAGs was calculated using Pearson's correlation method. P-values (y-axis) were plotted against the size of the training set (x-axis). Red and blue traces represent prediction models using true CAGs and permuted CAGs, respectively. (C) Pearson's correlation coefficients (y-axis) obtained from the same models were plotted against the size of the training set (x-axis).
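The split, select, fit and score loop described in panel (A) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: it fits a one-component PLS1 model (the caption does not give the number of PLSR components), it runs on small synthetic data in place of the microarray matrix, and every function name is hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_probes(X_train, y_train, k):
    """Indices of the k probes with the largest |r| against y, using training rows only."""
    scores = []
    for j in range(len(X_train[0])):
        col = [row[j] for row in X_train]
        scores.append((abs(pearson(col, y_train)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def fit_pls1_one_component(X, y):
    """One-component PLS1 (assumed component count); returns a predict(row) function."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximal covariance between centered X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t)
    b = sum(ti * yi for ti, yi in zip(t, yc)) / tt if tt else 0.0

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + b * score

    return predict


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def one_split(X, y, n_test, k, rng):
    """Random train/test split, probe selection on training data, RMSEP on test data."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    test, train = idx[:n_test], idx[n_test:]
    X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
    keep = top_probes(X_tr, y_tr, k)
    predict = fit_pls1_one_component([[r[j] for j in keep] for r in X_tr], y_tr)
    preds = [predict([X[i][j] for j in keep]) for i in test]
    return rmsep([y[i] for i in test], preds)


# Synthetic stand-in: 40 samples, 30 probes, the first 5 weakly tracking y.
rng = random.Random(0)
y = [rng.gauss(0, 1) for _ in range(40)]
X = [[(0.8 * y[i] if j < 5 else 0.0) + rng.gauss(0, 1) for j in range(30)]
     for i in range(40)]
errors = [one_split(X, y, n_test=5, k=10, rng=rng) for _ in range(20)]
mean_rmsep = sum(errors) / len(errors)
```

Selecting probes inside each split, from training rows only, keeps the test samples out of the feature-selection step, which is what makes the RMSEP an out-of-sample estimate.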
Figure 3.
Predictive power of the CAG-correlated signature. Prediction models using true CAG (left column) or permuted CAG repeat lengths (right column) of training samples were built, and error rates and predicted CAGs of test samples were recorded during 10 000 iterations of modeling. For each data split, seven prediction models were built using 200, 250, 300, 350, 400, 450 and 500 top correlated probes, and the mean of the seven predicted CAG repeat lengths was calculated as a representative predicted CAG repeat length for each test sample. Representative predicted CAGs were compared with experimentally determined CAGs of test samples (x-axis) to calculate error rates (RMSEP). (A) Distribution of error rates (RMSEP) in test samples based on prediction models using true CAG repeat lengths. Mean and SD of RMSEP were 0.9302 and 0.179, respectively. (B) Distribution of error rates in test samples based on prediction models using scrambled CAG repeat lengths. Mean and SD of RMSEP were 1.654 and 0.347, respectively. (C) Predicted CAGs from 10 000 iterations using true CAGs of training samples were averaged to generate the final predicted CAGs for all 107 samples (y-axis). In order to understand the extent to which predicted CAG reflects the effects of the true CAG (x-axis), a regression model was constructed (red line). Predicted CAG was significantly associated with measured CAG (slope, 0.2289; P-value, 5.94E−7; R², 0.2122). CAGs on the x-axis and y-axis represent standardized CAG values. (D) Predicted CAGs of test samples from 10 000 iterations using permuted CAGs of training samples were averaged to generate the final predicted CAG for all 107 samples. x-Axis and y-axis represent predicted CAG repeat length and experimentally determined CAG repeat length, respectively. Predicted CAG was not significantly associated with measured CAG in CAG-permuted models (slope, 0.02442; P-value, 0.844; R², 0.0003725). CAGs on the x-axis and y-axis represent standardized CAG values.
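The summary steps behind panels (C) and (D), averaging each sample's predictions over iterations and then regressing predicted on measured CAG, might look like the sketch below. It assumes that "standardized" means a z-score using the sample standard deviation, which the caption does not specify, and the helper names and input layout are hypothetical.

```python
import math


def standardize(values):
    """Z-score values using the sample SD (n - 1); an assumed definition."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def average_predictions(per_iteration):
    """Average predictions per sample across iterations.

    per_iteration is a list of dicts mapping sample id -> predicted value,
    each dict covering only the samples held out in that iteration.
    """
    totals, counts = {}, {}
    for preds in per_iteration:
        for sample_id, value in preds.items():
            totals[sample_id] = totals.get(sample_id, 0.0) + value
            counts[sample_id] = counts.get(sample_id, 0) + 1
    return {sid: totals[sid] / counts[sid] for sid in totals}


def simple_regression(x, y):
    """Least-squares fit of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy usage: two iterations, with sample "a" held out twice and "b" once.
final = average_predictions([{"a": 1.0, "b": 3.0}, {"a": 3.0}])
slope, intercept, r2 = simple_regression([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

Running the same summary on models trained with permuted CAG labels gives the null comparison shown in panel (D).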
Figure 4.
Reproducible CAG-correlated gene-expression signatures. Among the 107 discovery samples, we randomly chose 20 cell lines as replication samples to evenly cover the CAG distribution of the discovery set. For those 20 samples, we prepared six sets of independent RNA samples from three cell stocks, yielding six gene-expression data sets for each sample. Thus, there were seven gene-expression data sets for each replication sample (one in the discovery set and six in the replication set). For each replication sample, we removed the seven corresponding gene-expression data sets from the discovery and replication sets and built seven prediction models using different numbers of top probes (200, 250, 300, 350, 400, 450 and 500). Then, the unused data (i.e. test samples) were fed into those seven prediction models to obtain predicted CAGs of test samples, and the averages of predicted CAGs for the given samples were calculated. Since the primary purpose of this test was not to compare experimentally determined CAGs with predicted CAGs of replication samples but to compare the predicted CAGs of the same replication samples across different cell cultures, we evaluated the similarity among predicted CAGs. To evaluate this similarity, a pairwise correlation plot was generated and significance (Pearson's correlation) was calculated (numbers in each square). D and R represent discovery set and replication set, respectively. CAGs on the x-axis and y-axis represent standardized CAG values.
Figure 5.
Genes consistently correlated with CAG repeat length. Ten thousand iterations of the model optimization procedure were performed to identify the microarray probes participating in each of the 10 000 optimized prediction models. The results were summarized so that we could distinguish probes consistently correlated with CAG regardless of the training samples (e.g. probes at the top left corner represent probes with a high frequency and a high mean rank) from probes correlated in a training sample-dependent manner (e.g. probes at the bottom of the plot represent probes with a low frequency). The red circle indicates microarray probe 202389_s_at, representing HTT (frequency of 1, mean rank of 864, mean correlation coefficient of −0.2678 and mean correlation P-value of 0.008002).
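The per-probe summary described here (how often a probe enters a model, and at what rank on average) can be sketched as follows. The input layout is an assumption: each iteration is represented as a list of probe ids ordered from strongest to weakest correlation, with rank 1 as the strongest, and the function name is hypothetical.

```python
from collections import defaultdict


def summarize_probes(iterations):
    """Frequency and mean rank of each probe across model-building iterations.

    iterations is a list of ranked probe-id lists, one per iteration,
    where position 1 is assumed to be the most strongly correlated probe.
    """
    ranks = defaultdict(list)
    for ranked in iterations:
        for position, probe in enumerate(ranked, start=1):
            ranks[probe].append(position)
    n_iter = len(iterations)
    return {
        probe: {
            "frequency": len(seen) / n_iter,      # fraction of models using the probe
            "mean_rank": sum(seen) / len(seen),   # averaged over models that used it
        }
        for probe, seen in ranks.items()
    }


# Toy usage with three iterations and hypothetical probe ids.
summary = summarize_probes([["p1", "p2"], ["p1", "p3"], ["p2", "p1"]])
```

A frequency of 1 then means the probe was selected in every iteration, matching how the caption reports 202389_s_at.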
Figure 6.
Relevance of the CAG-correlated gene-expression signature to human HD. (A) Age at onset of motor symptoms was available for 12 samples among the subjects we profiled. For those 12 samples, predicted (blue circles) or actual CAG (red circles) was compared with age at onset of motor symptoms (y-axis). Regression models are shown that describe the relationship between age at onset of motor symptoms and actual CAG repeat length (red line) and predicted CAG length (blue line) for these 12 samples. As age at onset is highly variable for a given CAG repeat length and this sample size is small, we have also plotted this relationship (black line) based upon 3674 HD samples from Lee et al. (11). For a full description of the CAG–age-at-onset relationship, refer to Lee et al. (11). Our prediction models slightly underestimated CAG for the samples with CAG in the HD range because only a modest fraction of the variance of the gene-expression signature is caused by the HTT CAG repeat length. (B) To test statistical enrichment of lymphoblast CAG-correlated genes in CAG-correlated genes in human HD brains, we compiled a positively correlated gene set (315 probes; Pearson's correlation P-value <0.01) and a negatively correlated gene set (403 probes; Pearson's correlation P-value <0.01). Probe results in lymphoblastoid expression data were matched to brain expression data using gene accession number, and the true gene-set enrichment score (red triangle) was compared with a null distribution of gene-set enrichment scores constructed by randomly choosing the same number of genes from the brain expression data. Negatively correlated genes in the lymphoblastoid gene-expression data showed a nominally significant enrichment in CAG-correlated genes in human cerebellum. Positively correlated genes in the lymphoblastoid gene-expression data were not enriched (data not shown). Also, neither prefrontal cortex nor visual cortex showed a statistical enrichment of the genes that were correlated with CAG repeat length in lymphoblastoid cell lines.
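The null-distribution comparison in panel (B) can be illustrated with a short resampling sketch. The caption does not define the enrichment score, so the score used here (the mean of a per-gene brain statistic over the matched genes) is a placeholder assumption, as are the function names, the number of null draws and the toy data.

```python
import random


def enrichment_score(genes, brain_stat):
    """Placeholder score: mean brain statistic over genes present in brain_stat."""
    matched = [brain_stat[g] for g in genes if g in brain_stat]
    return sum(matched) / len(matched) if matched else 0.0


def empirical_p(gene_set, brain_stat, n_null=1000, seed=0):
    """Compare the observed score with scores of random gene sets of the same size."""
    rng = random.Random(seed)
    matched = [g for g in gene_set if g in brain_stat]  # match on gene identifier
    observed = enrichment_score(matched, brain_stat)
    universe = sorted(brain_stat)
    exceed = 0
    for _ in range(n_null):
        draw = rng.sample(universe, len(matched))
        if enrichment_score(draw, brain_stat) >= observed:
            exceed += 1
    # Add-one correction so the empirical P-value is never exactly zero.
    return observed, (exceed + 1) / (n_null + 1)


# Toy usage: 100 hypothetical genes whose brain statistic equals their index.
brain_stat = {f"g{i}": float(i) for i in range(100)}
observed, p_value = empirical_p(["g95", "g96", "g97", "g98", "g99", "missing"], brain_stat)
```

Drawing the null sets from the brain data at the matched set size keeps the comparison fair with respect to how many genes could be mapped between the two platforms.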

References

    1. HDCRG. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group. Cell. 1993;72:971–983.
    2. Huntington G. On chorea. Med. Surg. Rep. 1872;26:320–321.
    3. Schoenfeld M., Myers R.H., Cupples L.A., Berkman B., Sax D.S., Clark E. Increased rate of suicide among patients with Huntington's disease. J. Neurol. Neurosurg. Psychiatry. 1984;47:1283–1287.
    4. Hendricks A.E., Latourelle J.C., Lunetta K.L., Cupples L.A., Wheeler V., MacDonald M.E., Gusella J.F., Myers R.H. Estimating the probability of de novo HD cases from transmissions of expanded penetrant CAG alleles in the Huntington disease gene from male carriers of high normal alleles (27-35 CAG). Am. J. Med. Genet. A. 2009;149A:1375–1381.
    5. Langbehn D.R., Brinkman R.R., Falush D., Paulsen J.S., Hayden M.R. A new model for prediction of the age of onset and penetrance for Huntington's disease based on CAG length. Clin. Genet. 2004;65:267–277.
