PLoS Comput Biol. 2014 Apr 17;10(4):e1003555.
doi: 10.1371/journal.pcbi.1003555. eCollection 2014 Apr.

Enhanced methods for local ancestry assignment in sequenced admixed individuals

Robert Brown et al. PLoS Comput Biol.

Abstract

Inferring the ancestry at each locus in the genome of recently admixed individuals (e.g., Latino Americans) plays a major role in medical and population genetic inferences, ranging from finding disease-risk loci, to inferring recombination rates, to mapping missing contigs in the human genome. Although many methods for local ancestry inference have been proposed, most are designed for use with genotyping arrays and fail to make use of the full spectrum of data available from sequencing. In addition, current haplotype-based approaches are computationally demanding, requiring long runtimes even for moderately large sample sizes. Here we present new methods for local ancestry inference that leverage continent-specific variants (CSVs) to attain increased performance over existing approaches in sequenced admixed genomes. A key feature of our approach is that it incorporates the admixed genomes themselves jointly with public datasets, such as 1000 Genomes, to improve the accuracy of CSV calling. We use simulations to show that our approach attains accuracy similar to widely used, computationally intensive haplotype-based approaches with large decreases in runtime. Most importantly, we show that our method recovers local ancestries comparable to the 1000 Genomes consensus local ancestry calls in the real admixed individuals from the 1000 Genomes Project. We extend our approach to account for low-coverage sequencing and show that accurate local ancestry inference can be attained at low sequencing coverage. Finally, we generalize CSVs to sub-continental population-specific variants (sCSVs) and show that in some cases it is possible to determine the sub-continental ancestry for short chromosomal segments on the basis of sCSVs.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Example of CSVs in a 2-way admixed individual (e.g. African American).
Lines denote the true local ancestry while the dots denote CSVs. Different dot types denote the continental ancestry of each CSV. From visual inspection it is relatively easy to discern the true ancestry from the three observed patterns. Spurious CSVs appear as CSVs whose label disagrees with the true ancestry state.
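The visual reading described in this caption can be approximated by a majority vote over CSV labels within fixed windows. The following is a toy sketch for illustration only, not the Lanc-CSV algorithm from the paper; the function name `call_windows`, the window size, and the positions are all hypothetical.

```python
from collections import Counter

def call_windows(csvs, window_size, chrom_length):
    """Assign each fixed-size window the ancestry label carried by most of its CSVs.

    csvs: list of (position, ancestry_label) pairs on one haplotype,
          with 0 <= position < chrom_length.
    Returns one label per window, or None for a window with no CSVs or a tie.
    """
    n_windows = (chrom_length + window_size - 1) // window_size
    counts = [Counter() for _ in range(n_windows)]
    for pos, label in csvs:
        counts[pos // window_size][label] += 1
    calls = []
    for c in counts:
        top = c.most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            calls.append(None)  # no evidence, or evenly split evidence
        else:
            calls.append(top[0][0])
    return calls

# Hypothetical haplotype: an African segment followed by a European segment,
# with one spurious "EUR" CSV at position 30 inside the African segment.
toy = [(10, "AFR"), (30, "EUR"), (40, "AFR"), (70, "AFR"),
       (120, "EUR"), (150, "EUR"), (180, "EUR")]
calls = call_windows(toy, window_size=100, chrom_length=200)
```

In this toy input the single spurious CSV is outvoted by its neighbours, which mirrors why isolated mislabeled CSVs are easy to spot by eye.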
Figure 2. Local ancestry inference accuracy in three simulated populations.
“Array data” denotes that a method was run only on the variants present on the Illumina 1 M genotyping array. “Full genome” denotes that methods were run using all the variants. RFMix requires phased haplotype input, which was inferred using Beagle; all other methods received unphased genotype data as input. Correlation values are the mean squared correlation across SNPs of the true vs. inferred ancestry across individuals. LAMP-LD and MULTIMIX were optimized to run with genotyping array data, possibly explaining the steep drop in accuracy when they are run using full sequencing data. MULTIMIX is not plotted when run on full sequencing data because it performed very poorly, possibly due to inaccurate parameters for sequencing data. Haploid and diploid errors are reported in Table 2.
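One reading of the accuracy metric in this caption is: for each SNP, compute the squared Pearson correlation between true and inferred ancestry across individuals, then average over SNPs. The sketch below follows that reading and should not be taken as the authors' exact procedure. It assumes ancestry is coded numerically (for example, 0, 1, or 2 copies from one ancestral population) and, as a further assumption, skips SNPs where either vector has zero variance; the function name is hypothetical.

```python
def mean_squared_correlation(true_anc, inferred_anc):
    """Average r^2 across SNPs, where r is the Pearson correlation across individuals.

    true_anc, inferred_anc: lists of rows, one row per individual, one column per SNP.
    SNPs where either vector is constant are skipped because r is undefined there.
    Returns NaN if no SNP is usable.
    """
    n_ind = len(true_anc)
    n_snp = len(true_anc[0])
    r2_values = []
    for j in range(n_snp):
        x = [true_anc[i][j] for i in range(n_ind)]
        y = [inferred_anc[i][j] for i in range(n_ind)]
        mx, my = sum(x) / n_ind, sum(y) / n_ind
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        if sxx == 0 or syy == 0:
            continue
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        r2_values.append(sxy * sxy / (sxx * syy))
    return sum(r2_values) / len(r2_values) if r2_values else float("nan")

# Hypothetical data: three individuals, one SNP, two of the three calls swapped.
partial = mean_squared_correlation([[0], [1], [2]], [[0], [2], [1]])
```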
Figure 3. Runtime (in CPU days) as a function of the number of individuals in a study with sequencing data.
Lanc-CSV is always faster than LAMP-LD and MULTIMIX when run on either full genome sequencing data or genotyping array data (see Figure S3 and Table S1). The full sequencing data contained ∼30 times more alleles than the genotyping array data. Only RFMix has comparable speed for full sequencing data, and it is faster for genotyping array data. We show the runtime for RFMix with phasing time included.
Figure 4. Accuracy as a function of sequencing coverage.
Accuracy increases fastest for African Americans, who have only two distinct ancestral populations.
Figure 5. Accuracy as a function of sample size.
While accuracy increases with increasing numbers of admixed individuals, the most significant increase is seen in Mexican individuals. We report accuracy for Lanc-CSV using 200 admixed individuals, but accuracy exceeds this as the number of admixed individuals increases. This is due to the method being better able to correct for spurious CSVs and to add in new CSVs when there are more individuals.
Figure 6. Proportions of sCSVs from each population observed on a held out haplotype.
Each row represents the ancestry of the haplotype that was held out and each column represents the average number of sCSVs observed on the held out haplotype from the given population. Each row is normalized by the maximum value of the row so that the population with the most sCSVs observed has a value of 1. In each row, higher values are associated with populations in the same continental group as would be expected. The IBS have only fourteen individuals, which makes determining IBS sCSVs extremely difficult.
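The row normalization described in this caption can be sketched as follows. The counts are made-up numbers, not values from the paper, and the function name is hypothetical.

```python
def normalize_rows_by_max(matrix):
    """Divide every row by its own maximum so the largest entry in each row becomes 1."""
    normalized = []
    for row in matrix:
        row_max = max(row)
        # Guard against an all-zero row, which has nothing to scale.
        normalized.append([v / row_max if row_max > 0 else 0.0 for v in row])
    return normalized

# Rows: held-out haplotype ancestry; columns: population supplying the sCSVs.
counts = [[8.0, 2.0, 1.0],
          [1.0, 4.0, 2.0]]
scaled = normalize_rows_by_max(counts)
```

Because each row is scaled separately, values are comparable within a row but not between rows.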
Figure 7. sCSVs allow for calling the sub-continental population of a haplotype.
Randomly drawn segments of haplotypes from known populations can be accurately assigned to the population of origin. Accuracy for each population is significantly correlated with the number of reference haplotypes for that population (r = 0.65, p-value = 0.042). The highest accuracies are seen in populations that are more isolated from other populations in their continents.
Figure 8. sCSVs are able to assign the correct continental group to small haplotype segments with high accuracy.
This shows that most incorrectly called segments are still assigned to the correct continental group.
Figure 9. The average number of sCSVs from each 1000 Genomes population observed per megabase on the African-African called local ancestry regions of the real ASW individuals on chromosome 10.
The large number of YRI sCSVs seen in these regions supports the hypothesis that the African admixture component in African Americans comes from western Africa. We plot the expected number of observed sCSVs per megabase on a YRI haplotype (red diamonds) and the expected number of observed sCSVs on an LWK haplotype (green squares). The observed counts more closely resemble the count profile expected from the YRI haplotypes.
Figure 10. The average number of sCSVs from each 1000 Genomes population observed on the European-European called local ancestry regions of the real ASW individuals.
