BMC Bioinformatics. 2008 Oct 2;9:409.
doi: 10.1186/1471-2105-9-409.

Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios

Johan Staaf et al. BMC Bioinformatics.

Abstract

Background: Illumina Infinium whole genome genotyping (WGG) arrays are increasingly being applied in cancer genomics to study gene copy number alterations and allele-specific aberrations such as loss-of-heterozygosity (LOH). Methods developed for normalization of WGG arrays have mostly focused on diploid, normal samples. However, for cancer samples genomic aberrations may confound normalization and data interpretation. Therefore, we examined the effects of the conventionally used normalization method for Illumina Infinium arrays when applied to cancer samples.

Results: We demonstrate an asymmetry in the detection of the two alleles for each SNP, which deleteriously influences both allelic proportions and copy number estimates. The asymmetry is caused by a remaining bias between the two dyes used in the Infinium II assay after using the normalization method in Illumina's proprietary software (BeadStudio). We propose a quantile normalization strategy for correction of this dye bias. We tested the normalization strategy using 535 individual hybridizations from 10 data sets from the analysis of cancer genomes and normal blood samples generated on Illumina Infinium II 300 k version 1 and 2, 370 k and 550 k BeadChips. We show that the proposed normalization strategy successfully removes asymmetry in estimates of both allelic proportions and copy numbers. Additionally, the normalization strategy reduces the technical variation for copy number estimates while retaining the response to copy number alterations.

Conclusion: The proposed normalization strategy represents a valuable tool that improves the quality of data obtained from Illumina Infinium arrays, in particular when used for LOH and copy number variation studies.
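The quantile normalization idea in the Results can be illustrated with a minimal sketch. It assumes the two dye channels of one sample are available as equally long intensity vectors X and Y, and it forces both onto a shared reference distribution (the rank-wise mean of the two sorted channels). The function name and the toy values are hypothetical, and the abstract does not say which SNP subsets or bead pools the authors normalize over, so this is generic two-channel quantile normalization rather than the authors' exact procedure.

```python
import numpy as np

def quantile_normalize_xy(x, y):
    """Quantile-normalize two equally long intensity vectors against each other.

    Assumption: one shared reference distribution, built as the mean of the
    two sorted channels at each rank. Ties are broken by position.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")

    # Reference distribution: rank-wise mean of the sorted channels.
    reference = (np.sort(x) + np.sort(y)) / 2.0

    # Rank of every value within its own channel (0 = smallest).
    x_rank = np.argsort(np.argsort(x))
    y_rank = np.argsort(np.argsort(y))

    # Replace each value by the reference value of the same rank.
    return reference[x_rank], reference[y_rank]

# Toy intensities (hypothetical values, not taken from the paper).
x = np.array([0.2, 1.0, 0.6, 0.1])
y = np.array([0.3, 1.4, 0.8, 0.1])
x_qn, y_qn = quantile_normalize_xy(x, y)
```

After this step both channels contain the same set of values, so any global intensity difference between the dyes is removed while the rank order within each channel is preserved.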

Figures

Figure 1
Occurrence of asymmetrical B allele frequencies and copy number estimates. Urothelial tumor UC152_I hybridized on an Infinium 370 k BeadChip is shown. CNV probes have been removed. (a) B allele frequency for chromosome 1. (b) Mirrored B allele frequency (mBAF) for chromosome 1, with individual SNPs colored according to BAF values: less than 0.5 (orange), above 0.5 (blue) showing the asymmetry of BAF values around 0.5. (c) BAF profile of chromosome 1, with individual SNPs colored according to genotype calls: AA (green), AB (yellow), BB (red) and no calls (gray). The cause of the BAF asymmetry also affects genotyping as seen for SNPs not assigned to a genotype (gray), which in the region 1q32.1 to qter (highlighted with a light blue background) predominantly are present with BAF < 0.5. (d) Copy number estimates (Log R ratio) for chromosome 1, with individual SNPs colored according to genotype. The cause of the BAF asymmetry also affects copy number estimates as seen for regions of gain and loss, where AA and BB SNPs do not overlap. (e) Scatter plot of normalized allele intensities X and Y with individual SNPs colored according to genotype. A lowess regression line (solid) for heterozygous SNPs and the expected X = Y line (dashed) are superimposed. (f) Boxplots of the distributions of allele intensities X (green) and Y (red).
Figure 2
Intensity transformations of X and Y by quantile normalization. HapMap sample NA06985 hybridized on an Infinium 370 k BeadChip is shown. SNPs have been colored based on individual genotype calls: AA (green), AB (yellow), and BB (red). SNPs without genotype call are excluded. (a) Scatter plot of BeadStudio allele intensities X and Y. A lowess regression line for heterozygous SNPs is superimposed (solid) together with the expected X = Y line (dashed) illustrating that the dye intensity bias affects heterozygous SNPs. (b) MR plot of BeadStudio allele intensities for chromosome 8 with superimposed lowess regression lines (solid) for each genotype population and locally fitted linear regression lines (dashed blue). The mean M value for each genotype population is indicated by horizontal dashed black lines. (c) MR plot of quantile normalized allele intensities for chromosome 8 with superimposed lowess regression lines (solid black) and locally fitted linear regression lines (dashed blue) for each genotype population, separately. (d) Scatter plot of the intensity transformation XQN/X vs X from quantile normalization. SNPs are colored by genotype. SNPs with low X intensity values (predominantly genotyped as BB) are increased significantly in intensity by QN. (e) Scatter plot of the intensity transformation YQN/Y vs Y from quantile normalization. SNPs are colored by genotype. (f) Histogram of BeadStudio X intensities. (g) Histogram of BeadStudio Y intensities.
Figure 3
Effects of quantile normalization on allelic intensity ratios. Two urothelial carcinomas, UC456_R and UC152_I, analyzed using Infinium 370 k BeadChips are shown. SNPs have been colored based on individual genotype calls: AA (green), AB (yellow), BB (red), CNV probes (blue) and no calls (gray). Horizontal dashed lines represent BAF 0.05, 0.1, 0.5, 0.9 and 0.95, respectively. (a) BeadStudio normalized B allele frequency profile for chromosome 9 of UC456_R. (b) QN normalized B allele frequency profile for chromosome 9 of UC456_R. Compared to BeadStudio (a), QN increases variation for SNPs close to 1 in BAF and decreases variation for SNPs close to 0 in BAF. (c) tQN normalized B allele frequency profile for chromosome 9 of UC456_R. Application of a threshold for the increase in intensity of X and Y by QN lowers the variation of SNPs close to 1 in BAF compared to QN alone (b), and creates BAF values that are more symmetrical around BAF = 0.5 compared to BeadStudio (a). (d) tQN normalized B allele frequency profile for chromosome 1 of UC152_I. The region 1q32.1 to qter discussed in the text is highlighted with a light blue background. CNV probes have been removed.
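The thresholding step described in panel (c) can be sketched as a cap on how much quantile normalization is allowed to raise an intensity. The caption only states that a threshold is applied to the increase in X and Y; the capping rule below (limit the ratio of normalized to raw intensity), the function name `threshold_qn`, and the value 1.5 are assumptions made for illustration, not the authors' published parameters.

```python
import numpy as np

def threshold_qn(raw, qn, max_ratio):
    """Limit the intensity increase introduced by quantile normalization.

    Assumption: the threshold acts on the ratio qn / raw, so a normalized
    value may be at most max_ratio times the raw value. Decreases are left
    untouched. Intensities are assumed to be non-negative.
    """
    raw = np.asarray(raw, dtype=float)
    qn = np.asarray(qn, dtype=float)
    return np.minimum(qn, max_ratio * raw)

# Hypothetical raw and quantile-normalized X intensities for three SNPs.
raw_x = np.array([0.05, 0.5, 1.0])
qn_x = np.array([0.20, 0.55, 0.9])
tqn_x = threshold_qn(raw_x, qn_x, max_ratio=1.5)
```

In this toy case only the first SNP, whose low raw intensity was inflated fourfold, is pulled back; the other two keep their quantile-normalized values. This mirrors the caption's point that strong increases for low-intensity SNPs are what the threshold is meant to restrain.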
Figure 4
Comparison of BAF asymmetry for regions of allelic imbalance before and after tQN across different Infinium II platforms. BAF profiles for 35 tumor samples were divided into an upper (BAF > 0.5) and lower (BAF < 0.5) part, transformed to mBAF and separately segmented. For a defined genomic region, the average difference in segmented mBAF between the upper and lower part is expected to be zero if no asymmetry is present. Genomic regions were based on segmentation breakpoints of the upper BAF part. Only regions > 30 SNPs and with a segmented mBAF value > 0.6 in the upper and/or lower part were used in the comparisons. Black squares correspond to BeadStudio data and red triangles correspond to tQN data. Error bars for each sample and normalization method show the interquartile range (IQR). (a) BAF asymmetry for 14 matched tumor-normal samples. The black bar denotes 11 paired urothelial tumors from data set 4 and the white bar denotes the paired tumor samples from data set 8. tQN data systematically show less difference between the upper and lower BAF part compared to BeadStudio for the 14 matched tumors. (b) BAF asymmetry for 21 unmatched urothelial, breast and CLL tumor samples. The black bar denotes the 5 unpaired urothelial tumors from data set 4, the blue bar denotes CLL tumors from data set 7 and the red bar denotes breast tumors from data set 6. tQN data systematically show less difference between the upper and lower BAF part compared to BeadStudio for the 21 unmatched tumors.
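The asymmetry measure in this caption can be approximated in a few lines. The sketch mirrors BAF values around 0.5 and compares the upper (BAF > 0.5) and lower (BAF < 0.5) parts of one region. It assumes the region is already defined and uses a plain mean in place of the segmentation step, and it does not filter out homozygous SNPs; both function names are hypothetical.

```python
import numpy as np

def mirrored_baf(baf):
    """Reflect B allele frequencies around 0.5 so all values lie in [0.5, 1]."""
    baf = np.asarray(baf, dtype=float)
    return np.where(baf > 0.5, baf, 1.0 - baf)

def region_asymmetry(baf):
    """Mean mBAF of the upper part minus mean mBAF of the lower part.

    Assumption: the mean over the region stands in for the segmented mBAF
    value used in the paper. A result near zero indicates no asymmetry.
    """
    baf = np.asarray(baf, dtype=float)
    upper = mirrored_baf(baf[baf > 0.5])
    lower = mirrored_baf(baf[baf < 0.5])
    if upper.size == 0 or lower.size == 0:
        raise ValueError("region needs SNPs on both sides of BAF = 0.5")
    return float(upper.mean() - lower.mean())

# Hypothetical region of allelic imbalance with a slight dye-driven skew.
skewed = np.array([0.80, 0.82, 0.78, 0.25, 0.27, 0.23])
asym = region_asymmetry(skewed)
```

For the toy region the upper band sits at 0.80 and the mirrored lower band at 0.75, so the measure is about 0.05; a perfectly symmetric region would give zero.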
Figure 5
Effects of tQN on copy number estimates across different Infinium platforms. (a) Effect of tQN on log R ratio response to CNAs compared to BeadStudio data for 36 tumor samples. For each sample the mean difference in segmented log R ratio between BeadStudio and tQN data is plotted. For segments with log R ratio > 0 (red) the difference is BeadStudio minus tQN. For segments with log R ratio < 0 (green) the difference is tQN minus BeadStudio. A positive difference therefore corresponds to a better log R ratio response to CNAs for BeadStudio normalization compared to tQN for both types of segments. Error bars for each sample show the IQR of the difference. Horizontal bars denote the investigated data sets, urothelial tumors from data set 4 (black), breast/colon tumor samples from data set 8 (white), CLL samples from data set 7 (blue) and breast tumors from data set 6 (red). Only segments > 20 SNPs have been included. Segment definition was based on breakpoints from segmentation of BeadStudio copy number data. The small difference in segmented values between BeadStudio and tQN data indicates that tQN does not affect the log R ratio response to CNAs. (b) Boxplots of sample adaptive thresholds for BeadStudio normalized data (white) and tQN data (red) for 6 data sets. Top axis indicates the number of samples in each data set. tQN results in lower sample adaptive thresholds in four out of six data sets and equal thresholds in the remaining two. (c) tQN copy number estimates for chromosome 1 for urothelial tumor UC152_I with individual SNPs colored according to genotype calls: AA (green), AB (yellow), BB (red) and no calls (gray). CNV probes have been removed. tQN removes the asymmetry between AA and BB SNPs for regions of gain and loss observed in BeadStudio normalized data (compare to figure 1d).
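The sign convention in panel (a) is easy to misread, so a small sketch may help. It assumes one segmented log R ratio per segment for each normalization, and that the segment type (gain or loss) is decided by the BeadStudio value, since the caption bases segment definition on BeadStudio data. The function name and the numbers are hypothetical.

```python
import numpy as np

def response_difference(beadstudio_seg, tqn_seg):
    """Per-segment difference in log R ratio response, following panel (a).

    Gains (BeadStudio log R ratio > 0):  BeadStudio minus tQN.
    Losses (BeadStudio log R ratio < 0): tQN minus BeadStudio.
    A positive value means BeadStudio shows the stronger response.
    Segments at exactly 0 are not covered by the caption; here they fall
    into the second branch.
    """
    bs = np.asarray(beadstudio_seg, dtype=float)
    tqn = np.asarray(tqn_seg, dtype=float)
    return np.where(bs > 0, bs - tqn, tqn - bs)

# One gained and one lost segment (hypothetical values).
diff = response_difference([0.40, -0.50], [0.38, -0.47])
```

Both toy segments give a small positive difference (0.02 and 0.03), which is the pattern the caption describes as a negligible loss of response after tQN.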

Publication types