Am J Respir Cell Mol Biol. 2014 Jan;50(1):223-32.
doi: 10.1165/rcmb.2013-0235OC.

Genome reference and sequence variation in the large repetitive central exon of human MUC5AC

Xueliang Guo et al. Am J Respir Cell Mol Biol. 2014 Jan.

Abstract

Despite modern sequencing efforts, the difficulty in assembly of highly repetitive sequences has prevented resolution of human genome gaps, including some in the coding regions of genes with important biological functions. One such gene, MUC5AC, encodes a large, secreted mucin, which is one of the two major secreted mucins in human airways. The MUC5AC region contains a gap in the human genome reference (hg19) across the large, highly repetitive, and complex central exon. This exon is predicted to contain imperfect tandem repeat sequences and multiple conserved cysteine-rich (CysD) domains. To resolve the MUC5AC genomic gap, we used high-fidelity long PCR followed by single molecule real-time (SMRT) sequencing. This technology yielded long sequence reads and robust coverage that allowed for de novo sequence assembly spanning the entire repetitive region. Furthermore, we used SMRT sequencing of PCR amplicons covering the central exon to identify genetic variation in four individuals. The results demonstrated the presence of segmental duplications of CysD domains, insertions/deletions (indels) of tandem repeats, and single nucleotide variants. Additional studies demonstrated that one of the identified tandem repeat insertions is tagged by nonexonic single nucleotide polymorphisms. Taken together, these data illustrate the successful utility of SMRT sequencing long reads for de novo assembly of large repetitive sequences to fill the gaps in the human genome. Characterization of the MUC5AC gene and the sequence variation in the central exon will facilitate genetic and functional studies for this critical airway mucin.

Figures

Figure 1.
MUC5AC genomic region. Current status and research design to cover the MUC5AC genomic gap. (A) Annotation tracks for the MUC5AC region excerpted from the University of California Santa Cruz genome browser (http://genome.ucsc.edu) (GRCh37/hg19) with notes added to emphasize the gap. The current gap in the MUC5AC gene is situated between a set of exons (blue vertical bars along arrowed line) that are 5′ (the 5′ exons are incorrectly annotated to MUC5B) and a set of 3′ exons (that are correctly annotated to MUC5AC). The entire region is in general disarray. (B) Available sequences used to inform the selection of PCR primers to characterize the gene, with a focus on filling the gap. The PCR primer selection was based on the use of the human reference alternative assembly genomic scaffold sequence NW_001838016 and other existing cDNA and genomic sequences. The High Throughput Genomics (http://www.ncbi.nlm.nih.gov/genbank/htgs) working draft sequence FP326773 became available during the course of this work, and it differs from NW_001838016 in length in the regions indicated. Together, these two sequences provide information for the 5′ end of the gap and contribute to the gene model. From previous efforts, there is strong evidence that the 3′ end portion of the gap consists of the MUC5AC large central exon. The available sequences in the large central exon region that were used to inform the PCR strategy consisted of one partial mRNA (AJ298317) and two partial genomic PCR sequences (AJ298318 and AJ298319) (22). These sequences were used in conjunction with NW_001838016 and FP326773 to generate a gene model, which was consistent with previous efforts (17). (C) Schematic representation of the overlapping PCR products used in contig development for de novo assembly of the MUC5AC gene from the African American (AfrAm) subject and the region of focus for white subjects with cystic fibrosis (CauCF1–3). 
The AfrAm individual was sequenced in two phases (Phase 1 and 2) as described in Materials and Methods (further details are provided in Table E2).
Figure 2.
MUC5AC gene defined in an AfrAm subject. (A) Schematic representation of de novo assembled sequence contigs produced from Pacific Biosciences sequencing of pooled PCR product sequences from the AfrAm subject. The relative sizes of the contigs are shown roughly to proportion, and the predicted gap size for the reference genome is 22.1 kb based on the location of flanking GRCh37/hg19 sequences. (B) Schematic representation of MUC5AC gene showing exon locations. The large central exon is predicted to be exon 31, containing nine CysD domains and the tandem repeat (PTS-TR) sequences. (C) Schematic MUC5AC mRNA protein translation showing the major protein domains and their relationship to the entire gene and the central exon. The central exon consists of a 5′ region characterized by one Class I CysD domain, duplicated pairs of Class II and Class III CysD domains, and adjacent homologous sequence, which is rich in prolines, threonines, and serines (PTS region). The 3′ half of the central exon has a different structure, which is characterized by Class III CysD domains and adjacent unique sequences separated by PTS-TR units (TR1–4) of 24-bp imperfect repeats. Other protein features shown were previously defined (22). C domain = von Willebrand factor type C domain; CK domain = C-terminal cysteine knot domain; D domain = von Willebrand factor type D domain. The definition of the CysD domain classes is provided in the text.
Figure 3.
Genetic variants and organization of the MUC5AC central exon. Sequence schematics representing the MUC5AC central exon from the AfrAm subject and three white subjects with cystic fibrosis produced by de novo assembly were compared, as a group and individually, with the central exon model (22), and the results are shown. All four subjects in this study have larger PTS-TR1 (extra 1.9 kb, blue bar) and PTS-TR4 (extra 216 bp, green bar) regions than the draft genome model sequence (Figure 1B). The increase in the PTS-TR lengths, when compared with the previously known model, more specifically shown in the central exon model of this figure, effectively links the previously available genomic fragments (Figure 1) into one unit and completes the central exon sequence. Indels are shown as purple bars, pink bars, or black stars. The three classes of CysD domains are shown (colored ovals). The duplication of the CysD domains in CauCF3, as compared with other subjects, is indicated by CysD4a, CysD5a, and CysD6a. HinfI sites are shown by red arrows, and the small and large HinfI fragment lengths identified by Southern blots are shown in red text; these are very similar to the in silico sizes (blue text).
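For readers unfamiliar with how in silico fragment sizes such as those in Figure 3 can be derived, the following is a minimal sketch of a restriction digest on a linear sequence. It is not the authors' pipeline: the function name `hinfi_fragments` and the toy sequence are hypothetical, and the only enzyme fact assumed is the standard HinfI recognition site G^ANTC (cut after the G on the top strand).

```python
import re

# HinfI recognizes GANTC (N = any base) and cuts after the leading G.
# The lookahead keeps the scan position-by-position rather than consuming matches.
HINFI_SITE = re.compile(r"(?=GA[ACGT]TC)")


def hinfi_fragments(seq):
    """Return top-strand fragment lengths from a HinfI digest of a linear sequence."""
    seq = seq.upper()
    # Cut position = site start + 1 (immediately after the G).
    cuts = [m.start() + 1 for m in HINFI_SITE.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    toy = "AAAGAATCCCCGACTCTT"  # made-up sequence, not MUC5AC
    print(hinfi_fragments(toy))
```

Applied to an assembled central exon contig, the resulting lengths could then be compared with the fragment sizes seen on a Southern blot, which is the kind of comparison the caption describes.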
Figure 4.
Confirmation of structural variation in subject CauCF3 predicted from de novo assembly. (A) PCR primers located in CysD1 and PTS-TR1 were used to amplify genomic DNA. The expected increase in size (from 2.3 to 3.4 kb), consistent with the addition of a 5′ region conserved duplicon, was observed in CauCF3. (B) Long sequence reads from BbsI-enriched genomic DNA (see Materials and Methods) from CauCF3 were mapped to the de novo consensus contig 1 from CauCF3 using Burrows-Wheeler Aligner with custom parameters (49) and were visualized in the Integrative Genomics Viewer software (50, 51). The black, purple, and red bars (Class I, Class II, and Class III, respectively) indicate the locations of the 12 CysD domains on the de novo contig produced from the PCR amplification of the region from CauCF3. The gray horizontal bars show the individual BbsI-enriched genomic DNA sequence reads mapped to the contig. The black bars, within the gray sequence reads, indicate regions within the individual sequences that have a high indel content (a consequence of SMRT sequencing errors). Several individual sequence reads show the seven (CysD1–CysD5a) predicted CysD domains at the 5′ region. Several other individual sequence reads demonstrate the duplication of CysD6a in the 3′ region. (C) Although the coverage was not sufficient to produce a de novo assembly free of sequence errors across the entire region, six de novo assembled contigs were generated from the low-coverage genomic DNA reads that mapped to the previously determined MUC5AC gene model (not shown). BLAST of CysD domain sequences to one of these contigs demonstrates the arrangement of CysD domains expected from CysD4a-CysD5a duplication in CauCF3 contigs (purple: CysD2 aligning to Class II; red: CysD3 aligning to Class III). The dotted vertical lines mark the region in (B) and (C) for illustrative purposes.

References

    1. Rose MC, Voynow JA. Respiratory tract mucin genes and mucin glycoproteins in health and disease. Physiol Rev. 2006;86:245–278.
    2. Hansson GC. Role of mucus layers in gut infection and inflammation. Curr Opin Microbiol. 2012;15:57–62.
    3. Rodríguez-Piñeiro AM, Bergström JH, Ermund A, Gustafsson JK, Schuette A, Johansson ME, Hansson GC. Gastrointestinal mucus proteome reveals Muc2 and Muc5ac accompanied by a set of core proteins: 2. Studies of mucus in mouse stomach, small intestine, and colon. Am J Physiol Gastrointest Liver Physiol. 2013;305:G348–G356.
    4. Linden SK, Sutton P, Karlsson NG, Korolik V, McGuckin MA. Mucins in the mucosal barrier to infection. Mucosal Immunol. 2008;1:183–197.
    5. Stonebraker JR, Wagner D, Lefensty RW, Burns K, Gendler SJ, Bergelson JM, Boucher RC, O’Neal WK, Pickles RJ. Glycocalyx restricts adenoviral vector access to apical receptors expressed on respiratory epithelium in vitro and in vivo: role for tethered mucins as barriers to lumenal infection. J Virol. 2004;78:13755–13768.
