Review
J Hum Genet. 2021 Jan;66(1):85-91.
doi: 10.1038/s10038-020-00851-4. Epub 2020 Oct 14.

A review of UMAP in population genetics

Alex Diaz-Papkovich et al. J Hum Genet. 2021 Jan.

Abstract

Uniform manifold approximation and projection (UMAP) has been rapidly adopted by the population genetics community to study population structure. It has become common in visualizing the ancestral composition of human genetic datasets, as well as searching for unique clusters of data, and for identifying geographic patterns. Here we give an overview of applications of UMAP in population genetics, provide recommendations for best practices, and offer insights on optimal uses for the technique.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

Fig. 1
UMAP with (left) and without (right) HLA regions used on the Genizon database. The cluster in the dotted lines disappears when filtering for HLA and linkage disequilibrium
Fig. 2
Visualizations of data from the 1000GP. The first two principal components (left) versus a two-dimensional UMAP embedding (right). ACB African Caribbean in Barbados, ASW Americans of African Ancestry in Southwest US, BEB Bengali from Bangladesh, CDX Chinese Dai in Xishuangbanna, China, CEU Utah residents with Northern/Western European ancestry, CHB Han Chinese in Beijing, CHS Southern Han Chinese, CLM Colombian in Medellin, Colombia, ESN Esan in Nigeria, FIN Finnish in Finland, GBR British in England and Scotland, GWD Gambian in Western Divisions in the Gambia, GIH Gujarati Indian in Houston, Texas, IBS Iberian in Spain, ITU Indian Telugu in the UK, JPT Japanese in Tokyo, KHV Kinh in Vietnam, LWK Luhya in Kenya, MSL Mende in Sierra Leone, MXL Mexican in Los Angeles, California, PEL Peruvian in Lima, PJL Punjabi in Lahore, Pakistan, PUR Puerto Rican, STU Sri Lankan Tamil in the UK, TSI Toscani in Italy, YRI Yoruba in Nigeria
Fig. 3
PCA (left) and UMAP (right) projections of the UKB data, coloured by self-identified ethnic background. Unlike PCA, UMAP focuses on preserving local relationships and emphasizes fine-scale patterns in data. Groups in the UMAP projection are less compressed, which shows, for example, the relative size of the British and Irish populations in the UKB alongside populations of other ancestries, while also showing the population structure between and within groups
Fig. 4
The Genome Aggregation Database (gnomAD, left) and Biobank Japan (BBJ, right) visualized using UMAP. UMAP illustrates the ancestral diversity of gnomAD, showing many of the relationships between populations on continental and subcontinental levels. For the relatively more homogeneous BBJ data, it splits data geographically into the large mainland cluster (consisting of Hokkaido, Tohoku, Kanto-Koshinetsu, Chubu-Hokuriku, Kinki, and Kyushu regions), and smaller non-mainland clusters. The gnomAD image is reproduced from [10], and the BBJ image is reproduced from [12]
Fig. 5
UMAP projection of the same genotype data from the 1000GP comparing parametrization with a small (top) and large (bottom) number of nearest neighbours. Left images are coloured by population; right images are the same points but with the simplicial complex drawn. When adding more neighbours, subclusters become less separated, as with the LWK population, for example. Looking at the connectivity maps, we see new connections between continental groups, such as the Central/South American clusters and East Asian clusters. Darker lines indicate that individuals are closer to each other in genotype space
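
The effect of the neighbour count described for Fig. 5 can be illustrated with a minimal sketch. This is not the authors' pipeline and it does not run UMAP itself: it only builds a plain k-nearest-neighbour graph (the kind of neighbourhood structure UMAP starts from) on hypothetical two-dimensional toy points, and counts how many edges cross between two well-separated groups. The function names `knn_edges` and `cross_group_edges`, the toy coordinates, and the values of k are all assumptions made for illustration.

```python
import math


def knn_edges(points, k):
    """Return undirected edges linking each point to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to point i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            # Store each edge once, as (smaller index, larger index).
            edges.add((min(i, j), max(i, j)))
    return edges


def cross_group_edges(edges, labels):
    """Keep only the edges whose endpoints carry different group labels."""
    return [(i, j) for i, j in sorted(edges) if labels[i] != labels[j]]


# Hypothetical toy data: two tight groups of four points, far apart.
group_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
group_b = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]
points = group_a + group_b
labels = ["A"] * 4 + ["B"] * 4

# With k = 3 every neighbour lies inside the point's own group.
small = cross_group_edges(knn_edges(points, 3), labels)
# With k = 5 each point must also reach into the other group.
large = cross_group_edges(knn_edges(points, 5), labels)

print("cross-group edges, k=3:", len(small))
print("cross-group edges, k=5:", len(large))
```

In this toy setting the small neighbourhood keeps the two groups disconnected, while the larger one forces links between them, which mirrors the new between-group connections noted in the caption when more neighbours are used.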

References

    1. McVean G. A genealogical interpretation of principal components analysis. PLoS Genet. 2009;5:e1000686. doi: 10.1371/journal.pgen.1000686.
    2. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190.
    3. Maaten Lvd, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–2605.
    4. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv 2018. http://arxiv.org/abs/1802.03426.
    5. Becht E, McInnes L, Healy J, Dutertre C, Kwok IWH, Newell EW, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37:38–44. doi: 10.1038/nbt.4314.
