Biostatistics. 2018 Apr 1;19(2):185-198.
doi: 10.1093/biostatistics/kxx028.

Smooth quantile normalization

Stephanie C Hicks et al. Biostatistics.

Abstract

Between-sample normalization is a critical step in genomic data analysis to remove systematic bias and unwanted technical variation in high-throughput data. Global normalization methods are based on the assumption that observed variability in global properties is due to technical reasons and is unrelated to the biology of interest. For example, some methods correct for differences in sequencing read counts by scaling features to have similar median values across samples, but these fail to reduce other forms of unwanted technical variation. Methods such as quantile normalization transform the statistical distributions across samples to be the same and assume global differences in the distribution are induced only by technical variation. However, it remains unclear how to proceed with normalization if these assumptions are violated, for example, if there are global differences in the statistical distributions between biological conditions or groups, and external information, such as negative or control features, is not available. Here, we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is based on the assumption that the statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, while allowing that the distributions may differ between groups. We illustrate the advantages of our method on several high-throughput datasets with global differences in distributions corresponding to different biological conditions. We also perform a Monte Carlo simulation study to illustrate the bias-variance tradeoff and root mean squared error of qsmooth compared to other global normalization methods. A software implementation is available from https://github.com/stephaniehicks/qsmooth.
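To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the qsmooth implementation. It shows standard quantile normalization, which forces every sample onto one shared distribution, next to a simplified variant that applies the same step separately within each biological group, so that distributions match within groups but may differ between them. The function names `quantile_normalize` and `groupwise_quantile_normalize` are hypothetical, the input is assumed to be a features-by-samples matrix, and ties are handled by sort order rather than averaged. How qsmooth itself combines group-level and overall information is not described in the abstract and is not reproduced here.

```python
import numpy as np


def quantile_normalize(x):
    """Standard quantile normalization of a features-by-samples matrix.

    Every column (sample) ends up with the same set of values: the mean
    of each quantile taken across all samples.
    """
    x = np.asarray(x, dtype=float)
    # Sort order of each column, so values can be put back in place later.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = sorted_vals.mean(axis=1)
    out = np.empty_like(x)
    # Give the k-th smallest entry of each column the k-th reference value.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out


def groupwise_quantile_normalize(x, groups):
    """Illustrative simplification (an assumption, not qsmooth itself):
    quantile normalize the samples of each biological group separately."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(x)
    for g in np.unique(groups):
        cols = groups == g  # boolean mask selecting this group's samples
        out[:, cols] = quantile_normalize(x[:, cols])
    return out


# Toy data: two samples from group "a" and two from group "b",
# where group "b" has globally larger values.
x = np.array([[1.0, 2.0, 10.0, 12.0],
              [3.0, 4.0, 30.0, 28.0],
              [5.0, 3.0, 20.0, 24.0]])
groups = ["a", "a", "b", "b"]

qn = quantile_normalize(x)                      # global difference removed
gqn = groupwise_quantile_normalize(x, groups)   # global difference kept
```

In the toy data, `qn` gives all four samples identical sorted values, while `gqn` keeps group "b" on a larger scale than group "a".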

Figures

Fig. 1.
Using biological information to preserve global differences in distributions. Under the conditions of no global differences in distributions (A), qsmooth is similar to standard quantile normalization. Under the conditions of global differences in distributions (B) and (C), quantile normalization removes the global differences by making the distributions the same, but qsmooth preserves global differences in distributions. Examples of gene expression data with (A) PM values from n = 45 arrays comparing the gene expression of alveolar macrophages from nonsmokers, smokers and patients with asthma. (B) Gene counts from n = 7 RNA-Seq samples comparing the T. cruzi life cycle at the epimastigote (insect vector) stage and extracellular trypomastigotes. Counts have an added pseudocount of 1 and then are log2 transformed. (C) PM values from n = 82 arrays comparing brain and liver tissue samples.
Fig. 2.
Quantile normalization induces artificial differences in spike-in control genes using data with global differences in distributions. Comparing no normalization (row 1), quantile normalization (row 2) and qsmooth (row 3) applied to RNA-Seq gene counts from brain and liver tissues in the bodymapRat dataset. Column 2 contains the density plots for only the spike-in control genes. Counts have an added pseudocount of 1 and then are log2 transformed.
Fig. 3.
Scaling normalization methods do not adequately control within-group variability. Comparing density plots following either qsmooth (A), Relative Log Expression (RLE) (B), Trimmed Mean of M-Values (TMM) (C), upper quartile scaling (upperquartile) (D), library size (libSize) (E) or no (none) (F) normalization. Plotted are the artery tibial and the testis tissues from the GTEx consortium. All counts have an added pseudocount of 1 and then are log2 transformed.
Fig. 4.
Gene-specific effects induced from quantile normalization. Boxplots of the normalized expression for ENSG00000160882 (CYP11B1) and ENSG00000164532 (TBX20) are shown for 24 tissues profiled by GTEx. Top, we see CYP11B1 is more highly expressed in testis (TST) and more lowly expressed in other tissues in both (A) qsmooth and (B) raw expression profiles. However, following quantile normalization (C), CYP11B1 is relatively lowly expressed in TST but now more variably and highly expressed in the artery aorta (ATA). CYP11B1 produces 11 beta-hydroxylase, which catalyzes a final step necessary to convert 11-deoxycortisol into cortisol. Steroid 11 beta-hydroxylase deficiency is the second most common cause (5-8%) of congenital adrenal hyperplasia (Zachmann and others, 1983; Curnow and others, 1993; Joehrer and others, 1997). Bottom (D, E), TBX20 is a member of the T-box family that encodes the TBX20 transcription factor, which helps dictate cardiac chamber differentiation and in adults regulates integrity, function and adaptation (Cai and others, 2005; Singh and others, 2005; Stennard and others, 2005; Takeuchi and others, 2005; Qian and others, 2008). We see TBX20 highly expressed in both raw and qsmooth normalized heart atrial appendage and left ventricle tissues (HRA, HRV). However, following (F) quantile normalization, expression of the gene in both heart tissues is almost zero and several other tissues are more highly or variably expressed.
Fig. 5.
Density plots (column 1) and boxplots (column 2) with global changes in distributions of beta values from n = 35 Illumina 450K DNAm arrays comparing raw data (row 1), quantile normalized data (row 2) and qsmooth data (row 3) on six purified cell types from whole blood: CD14+ Monocytes (Mono), CD19+ B-cells (Bcell), CD4+ T-cells (CD4T), CD56+ NK-cells (NK), CD8+ T-cells (CD8T), and Granulocytes (Gran). Column 3 shows the first two principal components using the three normalization methods.
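Several captions above describe counts that have an added pseudocount of 1 and are then log2 transformed, and Fig. 3 includes library size scaling among the methods compared. A minimal Python sketch of both steps follows. The function names are hypothetical, the input is assumed to be a features-by-samples count matrix, and rescaling each sample to the mean library size is only one common convention, not necessarily the one used to produce the figures.

```python
import numpy as np


def log2_pseudocount(counts, pseudocount=1.0):
    """Add a pseudocount (1 by default) and take log2, so zero counts map to 0."""
    return np.log2(np.asarray(counts, dtype=float) + pseudocount)


def library_size_scale(counts):
    """Simplified library size scaling (an assumption for illustration):
    rescale each sample (column) so all samples share the mean total count."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)           # total count per sample
    return counts / lib_sizes * lib_sizes.mean()


counts = np.array([[0, 1],
                   [3, 7]])
log_counts = log2_pseudocount(counts)        # [[0., 1.], [2., 3.]]
scaled = library_size_scale(counts)          # each column now sums to 5.5
```

Scaling of this kind changes only the overall level of each sample, which is why it cannot by itself remove differences in distributional shape.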
