Nat Commun. 2016 Jun 30:7:12065.

doi: 10.1038/ncomms12065.

Long-read sequencing and de novo assembly of a Chinese genome

Lingling Shi^{1,2,3}, Yunfei Guo⁴, Chengliang Dong⁴, John Huddleston⁵, Hui Yang⁴, Xiaolu Han⁶, Aisi Fu⁷, Quan Li⁴, Na Li¹, Siyi Gong¹, Katherine E Lintner⁸, Qiong Ding⁷, Zou Wang⁷, Jiang Hu⁹, Depeng Wang⁹, Feng Wang¹⁰, Lin Wang¹¹, Gholson J Lyon¹², Yongtao Guan¹³, Yufeng Shen¹⁴, Oleg V Evgrafov^{4,15}, James A Knowles^{4,15}, Francoise Thibaud-Nissen¹⁶, Valerie Schneider¹⁶, Chack-Yung Yu⁸, Libing Zhou^{1,2,3}, Evan E Eichler⁵, Kwok-Fai So^{1,2,3,17,18}, Kai Wang^{4,15}

Affiliations

¹ Guangdong-Hongkong-Macau Institute of CNS Regeneration, Jinan University, Guangzhou 510632, China.
² Ministry of Education Joint International Research Laboratory of CNS Regeneration, Jinan University, Guangzhou 510632, China.
³ Co-innovation Center of Neuroregeneration, Nantong University, Nantong 226001, China.
⁴ Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California 90089, USA.
⁵ Department of Genome Sciences, Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA.
⁶ Genetic, Molecular, and Cellular Biology Program, Keck School of Medicine, University of Southern California, Los Angeles, California 90089, USA.
⁷ Wuhan Institute of Biotechnology, Wuhan 430000, China.
⁸ Department of Pediatrics, The Ohio State University, and The Research Institute at Nationwide Children's Hospital, Columbus, Ohio 43205, USA.
⁹ Nextomics Biosciences, Wuhan 430000, China.
¹⁰ School of Chemical Engineering and Pharmacy, Wuhan Institute of Technology, Wuhan 430000, China.
¹¹ Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Huazhong University of Science and Technology, Wuhan 430022, China.
¹² Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, New York, New York 11797, USA.
¹³ USDA/ARS Children's Nutrition Research Center, Department of Pediatrics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA.
¹⁴ Departments of Systems Biology and Biomedical Informatics, Columbia University, New York, New York 10032, USA.
¹⁵ Department of Psychiatry & Behavioral Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California 90033, USA.
¹⁶ National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, Maryland 20894, USA.
¹⁷ Department of Ophthalmology, The University of Hong Kong, Hong Kong, China.
¹⁸ State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Hong Kong, China.

PMID: 27356984
PMCID: PMC4931320
DOI: 10.1038/ncomms12065

Lingling Shi et al. Nat Commun. 2016.

Abstract

Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
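The contig and scaffold N50 values quoted above follow the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below illustrates only that definition; the function name `n50` and the toy lengths are assumptions made for the example and are not taken from the HX1 assembly or the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length at which the running sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy contig lengths (arbitrary units), not HX1 data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70
```

Here the total is 300, and the two longest contigs (80 + 70) already reach 150, so the N50 is 70.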

Conflict of interest statement

J.Hu and D.W. are employees of Nextomics Biosciences. G.J.L. serves on the advisory boards of Omicia, Inc., GenePeeks, Inc. and Good Start Genetics, Inc. K.W. is a board member and shareholder of Tute Genomics, Inc. and Nextomics Biosciences. E.E.E. is on the scientific advisory board of DNAnexus, Inc. and is a consultant for Kunming University of Science and Technology (KUST) as part of the 1000 China Talent Program. The remaining authors declare no competing financial interests.

Figures

**Figure 1. Summary of gap filling in GRCh38.**
(a) Length distribution of all gaps (stretches of 'N' in genome sequence) in GRCh38. (b) Length distribution of all gaps that can be fully or partially closed. (c) Violin plots showing the distribution of LINE, SINE, LTR, simple repeat and satellite in closed gaps and in GRCh38. (d) A dotplot showing how a gap on 17p13.3 is closed by a contig in HX1. The plot shows comparison of two sequences and each dot indicates a region of close similarity between them. (e) Genome browser screenshot of the gap region that was closed. The gap is flanked by two contigs that are new in GRCh38 (not carried forward from GRCh37), yet an HX1-associated contig (000850F-001-01) can completely align to flanking regions, therefore filling this assembly gap and revising its length from 718 to 731 bp.
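Since a gap is defined here as a stretch of 'N' in the genome sequence, gap coordinates and lengths can be listed with a simple scan. The following is a minimal sketch under that definition only; the function name `find_n_gaps` and the toy sequence are hypothetical, and it does not reproduce how the authors catalogued or closed gaps in GRCh38.

```python
import re


def find_n_gaps(sequence):
    """Return (start, end, length) for each run of N in a sequence.

    Coordinates are 0-based and half-open, so length == end - start.
    Lower-case 'n' is treated as a gap base as well.
    """
    return [
        (m.start(), m.end(), m.end() - m.start())
        for m in re.finditer(r"[Nn]+", sequence)
    ]


# Hypothetical toy sequence, not taken from GRCh38.
toy = "ACGTNNNNACGTNNACG"
print(find_n_gaps(toy))  # prints [(4, 8, 4), (12, 14, 2)]
```

A length distribution such as the one in panel (a) would then be a histogram of the third element of each tuple.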

**Figure 2. Detection of structural variants by different technologies.**
(a) Chromosome ideogram showing large-scale (>1 kb) deletions (blue) and insertions (red) identified from long-read sequencing data. (b) Pie chart showing the distribution of different classes of structural variants identified from long-read sequencing data. (c) Venn diagram showing the overlap of structural variants between HX1, CHM1 and the 1000 Genomes Project for insertions and deletions, respectively. (d) Integrative Genomics Viewer screenshot of the long-read (upper panel) and short-read (lower panel) alignment around an ∼200-kb deletion. (e) Alignment of *de novo* assembled genome map (blue) to reference genome map (green) where the ∼200-kb deletion occurs. Black vertical lines represent labels for the enzyme recognition site. Contig 2 shows identical label patterns as reference, yet contig 1 contains the deletion. (f) Integrative Genomics Viewer screenshot of long-read (upper panel) and short-read (lower panel) alignment around a 132-bp deletion on KRTAP1-1. This deletion is visually discernible from long-read sequencing, because the coverage is reduced and half the reads contain the deletion in alignments. However, a read-depth-based method failed to detect this deletion with short-read data. (g) Genome browser screenshot of the region surrounding the 132-bp deletion on KRTAP1-1, demonstrating the presence of simple tandem repeats and the very high GC content of the deletion.

**Figure 3. Novel gene inferred from Iso-Seq long-read RNA sequencing.**
(a) Integrative Genomics Viewer on alignment files generated from Iso-Seq. Over 100 long reads can be mapped to this locus on chr20q13.12 in the GRCh38 assembly. (b) UCSC Genome Browser screenshot on the predicted transcript models. The transcripts are not detected in RNA-Seq data on nine cell lines in ENCODE. This gene is conserved in primates but not in other vertebrate species, and is not in segmental duplication regions or simple repeat regions. (c) PCR validation of the transcript TCONS_0035154 by a primer pair that targeted exons 1 and 5. Several PCR products with different sizes can be detected, representing different isoforms. MC239 is a Caucasian sample and MA296 is an East Asian sample. (d) Sanger sequencing confirmed the splicing events predicted by the Iso-Seq data.

**Figure 4. Functional annotation and analysis of the genomic variants in HX1.**
(a) Average coverage versus GC content for 100-bp windows in Illumina data and PacBio data, respectively. The mean and s.d. values are shown. (b) Distribution of PacBio coverage for regions that have ≤5× coverage in Illumina data. (c) Shared SNVs discovered in HX1, AK1, HuRef, NA12878 and YH. (d) Variant reduction pipeline to identify pathogenic variants; although 20 were annotated as 'pathogenic' in ClinVar, careful analysis failed to support any of them.
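Panel (a) relates coverage to GC content in 100-bp windows. As an illustration of the GC half of that comparison, the sketch below computes the GC fraction of consecutive non-overlapping windows. The function name `gc_by_window`, the choice to drop a trailing partial window, and the toy sequence are all assumptions; the caption does not state how the authors binned the genome or handled ambiguous bases.

```python
def gc_by_window(sequence, window=100):
    """Return (start, gc_fraction) for consecutive non-overlapping windows.

    A trailing window shorter than `window` is dropped (an assumption made
    for this sketch). Bases other than G and C, including N, count as non-GC.
    """
    seq = sequence.upper()
    result = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        result.append((start, gc / window))
    return result


# Hypothetical toy sequence: one all-GC window, one all-AT window,
# and a 40-bp remainder that is dropped.
toy = "GGCC" * 25 + "ATAT" * 25 + "ACGT" * 10
print(gc_by_window(toy))  # prints [(0, 1.0), (100, 0.0)]
```

Pairing each window's GC fraction with its mean read depth would give the kind of coverage-versus-GC summary the panel describes.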
