Genome Biol. 2013;14(9):R97.
doi: 10.1186/gb-2013-14-9-r97.

Assembly of a phased diploid Candida albicans genome facilitates allele-specific measurements and provides a simple model for repeat and indel structure

Dale Muzzey et al. Genome Biol. 2013.

Abstract

Background: Candida albicans is a ubiquitous opportunistic fungal pathogen that afflicts immunocompromised human hosts. With rare and transient exceptions the yeast is diploid, yet despite its clinical relevance the respective sequences of its two homologous chromosomes have not been completely resolved.

Results: We construct a phased diploid genome assembly by deep sequencing a standard laboratory wild-type strain and a panel of strains homozygous for particular chromosomes. The assembly has 700-fold coverage on average, allowing extensive revision and expansion of the number of known SNPs and indels. This phased genome significantly enhances the sensitivity and specificity of allele-specific expression measurements by enabling pooling and cross-validation of signal across multiple polymorphic sites. Additionally, the diploid assembly reveals pervasive and unexpected patterns in allelic differences between homologous chromosomes. Firstly, we see striking clustering of indels, concentrated primarily in the repeat sequences in promoters. Secondly, both indels and their repeat-sequence substrate are enriched near replication origins. Finally, we reveal an intimate link between repeat sequences and indels, which argues that repeat length is under selective pressure for most eukaryotes. This connection is described by a concise one-parameter model that explains repeat-sequence abundance in C. albicans as a function of the indel rate, and provides a general framework to interpret repeat abundance in species ranging from bacteria to humans.

Conclusions: The phased genome assembly and insights into repeat plasticity will be valuable for better understanding allele-specific phenomena and genome evolution.

Figures

Figure 1
Sequencing of strains that contain homozygous regions can resolve genome phasing. (A) Schematic illustrating the ambiguous phasing of two adjacent SNPs from chromosome 3 of C. albicans genome Assembly 21. (B) Idealized panel of strains to resolve phasing. The wild-type (WT) strain is heterozygous for all eight chromosomes, having both the A homolog in green, and the B homolog in blue. Additional strains to sequence were selected to be homozygous for the indicated chromosomes. (C) One phasing option from (A) can be excluded by sequencing the ‘3AA’ strain, since all reads are effectively from the A allele, pairing the T and C; SNPs on the B allele are inferred. (D) Illustration of how to calculate the max-to-sum ratio, with a SNP position highlighted in orange. (E, F) Histograms of max-to-sum ratios for all positions across chromosome 5 in wild type (E) and the ‘5AA’ strain (F); bars are in linear space, and the line plot is in log space.
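The max-to-sum ratio in (D) and the phasing step in (C) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes the ratio is the read count of the most common base at a position divided by the total read count there, and the function names and the example pileups are hypothetical.

```python
from collections import Counter

def max_to_sum_ratio(bases):
    """Fraction of reads at one position that carry the most common base.

    Assumed definition: max base count divided by the sum of all base counts.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    return max(counts.values()) / total

def phase_position(wt_bases, homo_a_bases):
    """Return (A allele base, B allele base) for one SNP position.

    The A base is the majority base in a strain homozygous for homolog A;
    the B base is inferred as the other common base in wild type.
    """
    a_base = Counter(homo_a_bases).most_common(1)[0][0]
    wt_top_two = [base for base, _ in Counter(wt_bases).most_common(2)]
    others = [base for base in wt_top_two if base != a_base]
    b_base = others[0] if others else a_base  # no second base: not a SNP
    return a_base, b_base

# Hypothetical pileups at a single SNP position
wild_type = "T" * 50 + "C" * 50    # heterozygous: both alleles present
strain_3aa = "T" * 99 + "C" * 1    # homozygous for A, with one stray read

print(max_to_sum_ratio(wild_type))            # 0.5
print(max_to_sum_ratio(strain_3aa))           # 0.99
print(phase_position(wild_type, strain_3aa))  # ('T', 'C')
```

Under this definition a heterozygous SNP sits near 0.5 and a homozygous position near 1.0, which is the separation the histograms in (E) and (F) rely on.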
Figure 2
Pooling reads across heterozygous and homozygous regions clearly identified SNPs. (A) For each homozygous strain independently, positions with a max-to-sum ratio <0.7 were considered ‘putative SNPs’; the total number of putative SNPs on each chromosome was called SNPs_homo, and this number was divided by the corresponding value for wild type; to avoid confusion, the plotted number is the minimum of this quotient and 100%. (B) Putative SNP locations were identified in the wild-type strain, and the corresponding positions in homozygous strains were investigated for SNP status: if a putative SNP position from wild type was not a SNP in the indicated strain, it was shaded green (or pink, depending on the allele), whereas if both were SNPs, the latter was shaded white. (C) Scatterplot of max-to-sum ratios in heterozygous and homozygous regions for every position in the genome. Histograms at top and right show the distribution of data on each perpendicular axis as indicated, with bars in linear space and lines in log space. (D) The number of unphased SNPs in non-overlapping 50 kb windows tiled across the genome, with telomere and centromere locations as indicated.
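The arithmetic behind panel (A) is simple enough to write out. The sketch below assumes a list of max-to-sum ratios per chromosome is already available; the function names and the example values are hypothetical, and only the 0.7 cutoff and the 100% cap come from the caption.

```python
def count_putative_snps(ratios, cutoff=0.7):
    """Count positions whose max-to-sum ratio falls below the cutoff."""
    return sum(1 for ratio in ratios if ratio < cutoff)

def capped_snp_fraction(homo_ratios, wt_ratios, cutoff=0.7):
    """Putative SNPs in a homozygous strain relative to wild type, capped at 1.0."""
    snps_homo = count_putative_snps(homo_ratios, cutoff)
    snps_wt = count_putative_snps(wt_ratios, cutoff)
    if snps_wt == 0:
        raise ValueError("wild type has no putative SNPs on this chromosome")
    return min(snps_homo / snps_wt, 1.0)

# Hypothetical ratios for five positions on one chromosome
wt_ratios = [0.50, 0.52, 0.99, 0.48, 1.00]    # three putative SNPs
homo_ratios = [0.99, 1.00, 0.98, 0.55, 1.00]  # one putative SNP remains

print(count_putative_snps(wt_ratios))               # 3
print(capped_snp_fraction(homo_ratios, wt_ratios))  # one third
```

A chromosome that is truly homozygous in the sequenced strain should give a fraction near zero, while a chromosome that is still heterozygous should give a value near 1.0.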
Figure 3
Allele-specific bias in transcription is evident from pooling reads across phased SNPs. (A, C) orf19.238 (A) and orf19.3556 (C) have 8 and 11 non-overlapping regions, respectively, where RNA-seq reads include SNPs and can be attributed to either allele A in purple, or allele B in green. The bar graphs at top quantify the number of reads per SNP region, with the line graph at bottom indicating read density in a 20 nucleotide sliding window across each region. The density of reads lacking SNP information is indicated in gray. For visual clarity, the x-axis is nonlinear, such that SNP regions show data at every nucleotide, and non-SNP regions show data every 10 nucleotides. (B, D) Allele-specific biases for orf19.238 (B) and orf19.3556 (D), where histograms reflect the results from 10,000 bootstrap iterations. (D) The gray histogram shows how randomly permuting the phasing masks allele-specific bias, and the ‘max phasing’ line indicates the bias calculated if the maximum and minimum values for each bar in the top of (C) were attributed to allele B and allele A, respectively.
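The bootstrap and the permuted-phasing control in (B, D) can be sketched as follows. Several choices here are assumptions rather than details from the caption: the bias is taken to be the fraction of SNP-informative reads assigned to allele A, resampling is done over SNP regions (not individual reads), and the read counts are invented for illustration.

```python
import random

def pooled_bias(regions):
    """Fraction of SNP-informative reads assigned to allele A (assumed metric)."""
    reads_a = sum(a for a, _ in regions)
    reads_b = sum(b for _, b in regions)
    return reads_a / (reads_a + reads_b)

def bootstrap_bias(regions, iterations=10_000, seed=0):
    """Resample SNP regions with replacement and recompute the pooled bias."""
    rng = random.Random(seed)
    n = len(regions)
    return [
        pooled_bias([regions[rng.randrange(n)] for _ in range(n)])
        for _ in range(iterations)
    ]

def permuted_phasing_bias(regions, iterations=10_000, seed=0):
    """Randomly swap the A/B labels in each region, mimicking unknown phasing."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        shuffled = [(a, b) if rng.random() < 0.5 else (b, a) for a, b in regions]
        estimates.append(pooled_bias(shuffled))
    return estimates

# Hypothetical (allele A reads, allele B reads) for eight SNP regions of one gene
regions = [(30, 10), (25, 15), (40, 20), (35, 5),
           (20, 20), (28, 12), (33, 7), (29, 11)]

phased = sorted(bootstrap_bias(regions))
permuted = sorted(permuted_phasing_bias(regions))

# Central 95% of each distribution (percentile interval)
print("phased:  ", phased[250], phased[9749])
print("permuted:", permuted[250], permuted[9749])
```

With correct phasing the regions reinforce one another and the bootstrap distribution stays away from 0.5; with permuted labels the per-region biases partly cancel and the distribution is pulled toward 0.5.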
Figure 4
Indels are enriched in repeat sequences upstream of genes. (A) Close-up of 10 kb region of chromosome 1 containing several positions where hundreds of reads deviate from the reference in support of an indel. (B) Expected values for max-to-sum ratios of ‘reference’ and ‘indel’ reads in heterozygous and homozygous regions. (C) Scatterplot of max-to-sum ratios in heterozygous and homozygous regions for every putative indel in the genome. Histograms at top and right show the distribution of data on each perpendicular axis as indicated. The color of each point is based on the legend, where W and C indicate reads from the Watson and Crick strands, respectively. (D) The cutoff for indel designation, indicated in red, has a 5% false discovery rate (FDR), based on fitting the sum of gamma and Gaussian distributions, which reflect the true and false indels, respectively. The histogram in green considered only points with homozygous max-to-sum ratios <1.0 and rectilinear distances of 0.6 or less from the point [1.0,0.5]. (E) Indel density as a function of indel size and distance from the start codon. Density values were normalized to account for the fact that not all coding or intergenic regions span 1,000 nucleotides. (F) Indels are strongly enriched in repeat sequences. (G) Indels are not a sequencing artifact. The average size reported by all reads supporting an indel was calculated and then compiled into a histogram representing all indels. Random sequencing errors would have yielded density at non-integer values and, more importantly, around zero.
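The point selection described in (D) combines a threshold with a rectilinear (Manhattan) distance. The sketch below is illustrative only: it assumes the point [1.0,0.5] is ordered as (homozygous ratio, heterozygous ratio), and the function names are hypothetical.

```python
def rectilinear_distance(point, anchor=(1.0, 0.5)):
    """Manhattan distance between a point and the anchor."""
    return abs(point[0] - anchor[0]) + abs(point[1] - anchor[1])

def passes_filter(homo_ratio, het_ratio, max_distance=0.6):
    """Keep points with homozygous ratio < 1.0 that lie near the anchor.

    Assumed axis order: (homozygous ratio, heterozygous ratio).
    """
    within_reach = rectilinear_distance((homo_ratio, het_ratio)) <= max_distance
    return homo_ratio < 1.0 and within_reach

print(passes_filter(0.95, 0.55))  # True: close to the anchor
print(passes_filter(1.00, 0.50))  # False: homozygous ratio is not below 1.0
print(passes_filter(0.50, 0.90))  # False: distance 0.9 exceeds 0.6
```

Rectilinear distance treats the two axes additively, so the accepted region is a diamond around the anchor rather than a circle.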
Figure 5
Indels are clustered throughout the genome. (A) A representative multikilobase span, where ‘X’ indicates an indel and dashes signify non-polymorphic repeat sequences. (B) The number of ‘–’ characters between each indel (‘X’) was counted across the genome and compiled into a histogram in purple. In gray, the exponential distribution expected based on the observed indel probability and assuming random dispersion of indels. Inset: the analogous plot for ‘dense’ regions identified by the hidden Markov model (HMM). (C) (i) Schematic of the HMM used to distinguish indel-dense from indel-sparse regions. (ii) Fractional share of total indels (left) and number of bases in the genome (right) present in ‘dense’ (blue) and ‘sparse’ (red) regions. (D) Relative enrichment of three different sequence features between ‘dense’ and ‘sparse’ regions. Error bars indicate ±S.E.M. across regions, propagated through division. (E) The indel concentration, measured as indels-per-repeat sequence, in 7.5 kb windows centered at replication origins was calculated as a function of replication-origin offset (that is, 0 kb is the native origin location). Step size is 1 kb, and the average value across three adjacent windows is plotted. (F) The total number of repeat sequences present in non-overlapping 1 kb windows centered at replication origins.
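The gap counting in (B) can be illustrated on a short track of the kind shown in (A). This is a sketch under stated assumptions: the track string is invented, an ASCII hyphen stands in for the dash, and the random-dispersion expectation is written as a geometric distribution, which is the discrete counterpart of the exponential distribution named in the caption.

```python
def gaps_between_indels(track):
    """Count non-polymorphic repeats ('-') between consecutive indels ('X')."""
    positions = [i for i, symbol in enumerate(track) if symbol == "X"]
    return [right - left - 1 for left, right in zip(positions, positions[1:])]

def expected_gap_probability(gap, indel_probability):
    """Probability of exactly `gap` non-indel repeats before the next indel,
    assuming each repeat is independently an indel (random dispersion)."""
    return (1 - indel_probability) ** gap * indel_probability

track = "--X-X----X--XX-"  # hypothetical span: 5 indels among 15 repeats
gaps = gaps_between_indels(track)
indel_probability = track.count("X") / len(track)

print(gaps)  # [1, 4, 2, 0]
for gap in sorted(set(gaps)):
    print(gap, gaps.count(gap), expected_gap_probability(gap, indel_probability))
```

Clustering would show up as an excess of short gaps (and a matching excess of very long ones) relative to the random-dispersion expectation.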
Figure 6
One-parameter model reveals strong relationship between indel rate and repeat-sequence abundance. (A) Indel rate as a function of repeat length is plotted, with coloring indicating the inserted or deleted nucleotides as shown in the legend. Repeat length is the average of the ‘reference’ and ‘indel’ read lengths; thus, for single-base indels, repeat length is ‘x.5’ for integer values of x. (B-E) Gray dotted lines show repeat-sequence abundance as a function of length for A:T homopolymers (B, E), G:C homopolymers (C), and AT:TA dyad-repeats (D). The colored lines show the lowest-error model fit based on the indel rates in (A), with error and α values specified. To prevent overfitting at low repeat-length values, error is calculated as the average squared deviation in log space, not linear space. (F) Abundance of A:T homopolymers as a function of length in various indicated organisms. A histogram was generated for each species independently; to facilitate comparisons among species, the data were then normalized such that the abundance at length 3 is 1.0 and then scaled, to adjust for differences in genomic A:T content, such that the abundance at length 6 is 0.75. The dashed line indicates where α = 0.
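Two small calculations from this caption are easy to make concrete: the repeat length in (A) and the log-space fit error in (B-E). The sketch below does not implement the one-parameter model itself. The logarithm base (10), the function names, and the abundance values are assumptions chosen for illustration.

```python
import math

def repeat_length(reference_length, indel_length):
    """Average of the 'reference' and 'indel' read lengths, as in panel (A)."""
    return (reference_length + indel_length) / 2

def log_space_error(observed, predicted):
    """Average squared deviation between log10 abundances."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [
        (math.log10(obs) - math.log10(pred)) ** 2
        for obs, pred in zip(observed, predicted)
    ]
    return sum(squared) / len(squared)

def linear_space_error(observed, predicted):
    """Average squared deviation in linear space, for comparison."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

print(repeat_length(7, 8))  # 7.5: a single-base insertion into a 7-mer

# Hypothetical abundances at increasing repeat length
observed = [100000, 1000, 10]
fit_off_at_short = [110000, 1000, 10]  # 10% off where repeats are abundant
fit_off_at_long = [100000, 1000, 100]  # 10-fold off where repeats are rare

print(linear_space_error(observed, fit_off_at_short),
      linear_space_error(observed, fit_off_at_long))
print(log_space_error(observed, fit_off_at_short),
      log_space_error(observed, fit_off_at_long))
```

In linear space the small relative miss at the abundant short repeats dominates the error, whereas in log space the tenfold miss at the rare long repeats dominates, which is the behavior the caption's choice of log-space error is meant to secure.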

