Mol Biol Evol. 2020 Jan 1;37(1):18-30.
doi: 10.1093/molbev/msz176.

Discovery of Novel Sequences in 1,000 Swedish Genomes

Jesper Eisfeldt et al. Mol Biol Evol.

Abstract

Novel sequences (NSs), not present in the human reference genome, are abundant and remain largely unexplored. Here, we utilize de novo assembly to study NSs in 1,000 Swedish individuals first sequenced as part of the SweGen project, revealing a total of 46 Mb in 61,044 distinct contigs of sequence not present in GRCh38. The contigs were aligned to recently published catalogs of Icelandic and Pan-African NSs, as well as to the chimpanzee genome, revealing a great diversity of shared sequences. Analyzing the positioning of NSs across the chimpanzee genome, we find that 2,807 NSs align confidently within 143 chimpanzee orthologs of human genes. Aligning the whole-genome sequencing data to the chimpanzee genome, we discover ancestral NSs common throughout the Swedish population. The NSs were searched for repeats and repeat elements, revealing a majority of repetitive sequence (56%) and enrichment of simple repeats (28%) and satellites (15%). Lastly, we align the unmappable reads of a subset of the 1000 Genomes Project data to our collection of NSs, as well as to the previously published Pan-African NSs, revealing that both the Swedish and Pan-African NSs are widespread and that the Swedish NSs are largely a subset of the Pan-African NSs. Overall, these results highlight the importance of creating a more diverse reference genome and illustrate that significant amounts of the NSs may be of ancestral origin.

Keywords: ancestral deletion; de novo assembly; novel sequences; population genomics.

Figures

Fig. 1.
Frequencies of GRCh37UC and NS. Histograms displaying the fraction of GRCh37UC (A) and NS (B) present at various frequencies across the population.
Fig. 2.
Characteristics of the GRCh37UC and NS. (A) A size histogram of the quality-controlled GRCh37UC and NS. (B) A summary of the RepeatMasker analysis of the GRCh37UC, NS, and randomly selected regions of GRCh38; each bar displays the fraction of a specific RepeatMasker repeat class, and the “other” bar combines multiple classes, including RNA and low-complexity sequence.
Fig. 3.
Alignment of the GRCh37UC. (A) The number of GRCh37UC clusters mapping confidently to GRCh38, Pan_tro 4.0 (PT4), the Pan-African NS, and an Icelandic NS catalog. (B, C) The distributions of GRCh37UC across GRCh38 and PT4. The colors of the inner circle indicate the percentage density of contig clusters within 3-Mb bins (white = 0%, blue < 0.1%, green < 0.5%, yellow < 1%, orange < 5%, red < 10%, purple < 20%, and black ≥ 20%).
Fig. 4.
Summary of the top ten most NS enriched chimpanzee genes having a human ortholog. (A) A barplot showing the number of NS positioned within the top ten genes. (B) The size of the intergenic region (kb) spanned by NS.
Fig. 5.
The fraction of URs that align to the (A) Swedish NS clusters and (B) Pan-African catalog of NS (Sherman et al. 2019). The boxplots are grouped according to the ethnicity of the individuals: African Caribbean from Barbados (ACB), Utah residents with western or northern European ancestry (CEU), Finnish in Finland (FIN), and Yoruba in Ibadan (YRI). The lower and upper hinges correspond to the first and third quartiles of each population; the per-population median is indicated by the black horizontal line within the box. (C) Scatter plot of the fraction of reads aligning to the Swedish NS against the fraction of reads aligning to the Pan-African NS; each dot represents one of the 1KGP individuals. (D) Line plots showing the fraction of aligned URs remaining after removing Swedish NS clusters less frequent than the frequency thresholds (0.5%, 1%, 2.5%, 5%, 7.5%, and 10%); each line represents one of the four 1KGP populations, and the error bars indicate the standard deviation among the populations.
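
The color scale given in the Fig. 3 caption maps each 3-Mb bin's cluster density to one of eight colors. As a reading aid only, that mapping might be sketched as follows. The function name `density_colour` and the treatment of each bound as exclusive are assumptions made for illustration; this is not the authors' plotting code.

```python
from bisect import bisect_right

# Exclusive upper bounds (in percent) and colors, as listed in the Fig. 3 caption.
# Densities of exactly 0% are handled separately (white); 20% and above is black.
UPPER_BOUNDS = [0.1, 0.5, 1.0, 5.0, 10.0, 20.0]
COLOURS = ["blue", "green", "yellow", "orange", "red", "purple", "black"]


def density_colour(percent: float) -> str:
    """Return the caption's color name for a bin density given in percent.

    Hypothetical helper: assumes each "< x%" bound is exclusive, so a
    density of exactly 0.1% falls into the green class rather than blue.
    """
    if percent < 0:
        raise ValueError("density must be non-negative")
    if percent == 0:
        return "white"
    # bisect_right counts how many bounds are <= percent, which is the
    # index of the first class whose upper bound exceeds the density.
    return COLOURS[bisect_right(UPPER_BOUNDS, percent)]


if __name__ == "__main__":
    for value in (0, 0.05, 0.3, 4.2, 25):
        print(f"{value}% -> {density_colour(value)}")
```

A lookup over sorted bounds keeps the thresholds in one place, so a different binning only requires editing the two lists.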

Comment in

  • Improved Mapping of Swedish Genes.
    Caspermeyer J. Mol Biol Evol. 2020 Jan 1;37(1):306. doi: 10.1093/molbev/msz247. PMID: 31880781

References

    Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215(3):403–410.
    Ameur A, Che H, Martin M, Bunikis I, Dahlberg J, Höijer I, Häggqvist S, Vezzi F, Nordlund J, Olason P, et al. 2018. De novo assembly of two Swedish genomes reveals missing segments from the human GRCh38 reference and improves variant calling of population-scale sequencing data. Genes (Basel) 9(10):486.
    Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, Kähäri AK, Lundin P, Che H, et al. 2017. SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish population. Eur J Hum Genet. 25(11):1253–1260.
    Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, Sorensen M, Welch AME, Dougherty ML, Nelson BJ, Shah A, Dutcher SK, et al. 2019. Characterizing the major structural variant alleles of the human genome. Cell 176(3):663.
    Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB. 2008. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18(5):810–820.