Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2023 Dec 7;186(25):5440-5456.e26.
doi: 10.1016/j.cell.2023.10.028.

SPLASH: A statistical, reference-free genomic algorithm unifies biological discovery

Affiliations

SPLASH: A statistical, reference-free genomic algorithm unifies biological discovery

Kaitlin Chaung et al. Cell. .

Abstract

Today's genomics workflows typically require alignment to a reference sequence, which limits discovery. We introduce a unifying paradigm, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), which directly analyzes raw sequencing data, using a statistical test to detect a signature of regulation: sample-specific sequence variation. SPLASH detects many types of variation and can be efficiently run at scale. We show that SPLASH identifies complex mutation patterns in SARS-CoV-2, discovers regulated RNA isoforms at the single-cell level, detects the vast sequence diversity of adaptive immune receptors, and uncovers biology in non-model organisms undocumented in their reference genomes: geographic and seasonal variation and diatom association in eelgrass, an oceanic plant impacted by climate change, and tissue-specific transcripts in octopus. SPLASH is a unifying approach to genomic analysis that enables expansive discovery without metadata or references.

Keywords: RNA-seq; computational biology; genetics; genomics; reference-free; single-cell RNA-seq; splicing; statistics.

PubMed Disclaimer

Conflict of interest statement

Declaration of interests K.C., T.Z.B., and J.S. are inventors on provisional patents related to this work.

Figures

Figure 1.
Figure 1.. Overview of SPLASH
A. An anchor is a sequence of length k (k-mer) in a read; its target is the k-mer that follows it after a fixed offset of length R. An anchor may occur with different targets, which can capture many types of variation; examples are depicted schematically, with the anchor as a blue box and the targets as orange or violet boxes. B. SPLASH compiles a table for each anchor, where the columns are samples, the rows are targets, and the entries are the respective occurrence counts. SPLASH tests multiple random splits of the samples, calculating a test statistic that measures the deviations between each sample’s target distribution and the average target distribution over all samples, searching for the most discriminating split. For the best split identified, SPLASH reports a p-value bound. C. Alignment-based pipelines are limited by the need for a reference genome, compute-intensive, and difficult to model statistically due to their complexity. SPLASH detects variation directly from raw reads with rigorous statistics, is compute-efficient, and detects many kinds of variation at once. D. Some considerations of when SPLASH may be most useful, which reflect its design characteristics.
Figure 2.
Figure 2.. SPLASH identifies strain-defining and other variation in SARS-Cov2
In A-C, sets of targets that distinguish SARS-CoV-2 strains are shown; all are in the spike protein (S) gene. Each heatmap has columns for different samples (patients) and rows for different targets; the coloring indicates the fraction of that target observed in that patient. Summary anchor counts are given for rows and columns. Also shown is a map of the categorical metadata of what strains (primary and secondary) were identified in each patient in the original study; this data was not used by SPLASH, but there is evident agreement between the heatmap and the metadata strain assignment. Binomial p-value bounds were computed per STAR Methods. A. Distinguishing at the major strain level: target 1 (no mutation) matches Delta; target 2 contains K417N, found in all Omicron (both BA.1 and BA.2 sub-strains); two patients co-infected with Delta and Omicron show both targets. (p=6.4E-07). B. Distinguishing at the sub-strain level: target 1 (no mutation) matches Delta; target 2 with V213G is specific for BA.2; target 3 with a deletion (NL211I) and insertion (R214REPE) is specific for BA.1. (p=1.0E-13) C. Distinguishing non-strain related mutations: target 1 has P681R, Delta specific; targets 2 and 3 encode Q677H (by different mutations) and P681R; target 4 has N679K and P681H, Omicron specific. (p=4.9E-12) D. Protein domain profiling in SARS-CoV-2. The top four and bottom four Pfam protein domain hits are shown. S1 Receptor binding domain (RBD) and S2 domain show high variation by SPLASH in both datasets. Other Pfam abbreviations: bCoV = beta-coronavirus; CoV = coronavirus, nucleocap = nucleocapsid N = N-terminal domain, SARS = Severe acute respiratory syndrome coronavirus, M, NS7A, NSP1, NSP8, 3b, NSP10 = viral proteins.
Figure 3.
Figure 3.. Cell-type specific expression of paralogs and HLA from single-cell data
Figures A-C show spread plots, each dot representing the relative isoform expression in a single cell; bar marks average relative expression across all cells. A. Human MYL12A and MYL12B lie adjacent on chromosome 18, a region syntenic in mammals, chickens, and reptiles. The sequence alignment shows the two genes are very similar in the coding region, and marks the anchor and two targets for individual P2 (P3 has a distinct anchor). Macrophages express relatively more MYL12A and capillary cells more MYL12B, consistent in two individuals. B. The HLA-DRB locus occurs as several different haplotypes, all containing DRB1 but differing in paralog (DRB3, DRB4, DRB5, or none; hg38 reference has DRB5). The anchor and its targets lie in the 3’ untranslated regions of DRB1 and DRB4; the two individuals have distinct alleles at both DRB1 and DRB4. Macrophages express mainly DRB1 and capillary cells mainly DRB4. C. HLA-DPA1 and HLA-DPB1 overlap in head-to-head arrangement as shown in the UCSC Genome Browser diagram, which also shows the BLAT mapping of the anchor consensuses for DPA1 and DPB1, which lie on opposite strands. This is also depicted in the alignment; the targets are best assigned to opposite strands. An anchor simultaneously reporting on DPA1 and DPB1 was only found for individual P3; its targets show that macrophages exclusively express DPB1, while capillary cells mainly express DPA1. D. The polymorphic HLA-B gene contains many SPLASH anchors from T cells, as depicted in the UCSC Browser diagram. The hg38 reference is the B*07:02 allele, whereas this individual matches best to B*08 and B*51 (consensus sequences 1 and 2). We investigated one HLA-B anchor, which lies in exon 2. In the alignment, differences from hg38 in the anchor and lookahead region are in lowercase. In the scatterplot, cells show a wide range of allele expression ratios, some expressing a single allele. Dashed lines mark a 98% confidence interval for the binomial distribution based on the population average expression (confidence depends on anchor counts); the observed data deviates significantly (binomial p=1.73E-25), and some cells express almost exclusively one allele.
Figure 4.
Figure 4.. B and T cell receptor diversity from human and lemur single cell data
A. The “transcript mapping” plots show the number of anchors that align to a given gene name, for SPLASH on y-axis and Controls on x-axis, with immune receptor genes highlighted in red. For B cells, Ig genes (kappa = IGK and lambda = IGL) predominate among SPLASH anchors, but are not found at all in Control anchors. For T cells, TCR genes (alpha = TRA, beta = TRB) predominate, and are not found in Controls. The inset histograms show that immunoglobulin-type “V-set” and “C1-set” are among the top protein domain annotations identified by Pfam on anchor consensuses (for B cells, the top four and bottom four domains are shown; for T cells, all domains are shown). Mobile element activity is suggested by Pfam domains Tnp_22_dsRBD (“L1 transposable element dsRBD-like domain”) in B cells and RVT_1 (“Reverse transcriptase”) in T cells. B. Targets associated with Ig/TCR anchors are clonotypically expressed, in both human and lemur: heatmaps show that most targets (rows) are expressed only in a single cell (columns). Target sequences are shown as bp color-maps (rows are targets, matching the heatmap; columns are bp positions, colored by base), for quick visualization of sequence diversity. Lemur NKT cells show shared TCR usage – see top two rows; the shared target sequence is different in the two individuals.
Figure 5.
Figure 5.. Discovery of regulated variation in non-model organisms: octopus and eelgrass
A. SPLASH identified alternative transcripts in the O. bimaculoides Myo-VIIa motor protein that are expressed mutually exclusively; the target 2 isoform is only found in statocyst. The transcripts have different first exons; the start codon lies in the shared second exon. The anchor and exon 2 are missing from the O. bimaculoides genome assembly, but are in the closely related O. sinensis genome. Targets 1 and 2 are in both genomes, but the statocyst-specific transcript is not annotated. The O. sinensis assembly has the Myo-VIIa gene in two inverted pieces (broken at the point marked by ‘??’ in protein domain schematic; Data S2). B. The top four and bottom four domains identified by protein domain profiling in eelgrass (Z. marina) are plotted. The SPLASH domains chlorophyll A-B binding protein (Chloroa_b-bind) and silicon transporter (Silic_transp) derive from diatoms, based on BLAST protein alignment (Figure 7; Table S6). The other two top SPLASH domains, actin and ubiquitin, derive neither from diatoms nor eelgrass, so may be from other epiphytic species. C. A Chloroa_b-bind anchor, identified by BLASTP as “fucoxanthin chlorophyll a/c protein” from diatoms (Figure 7C), has several differentially abundant targets. The most common target (top row) is mainly found in France/June samples; three targets that encode the same protein sequence (middle) are found in France/December samples; and one target (bottom row) is only in Norway/December samples. D. An anchor in the eelgrass photosynthetic gene NdhL has four major targets. Targets 1–3 are allelic coding variants. Target 4 represents intron retention and gives a shortened protein. The scatterplot shows that Norway samples of June vs. December (red vs. green) are perfectly segregated by the fraction of target 4 (intron retention). A similar but less marked trend is seen for France samples of June vs. December (yellow vs. blue) – at the right edge, fraction target 4 values are collapsed to one dimension, with averages marked by bars.
Figure 6.
Figure 6.. O. bimaculoides 3’ UTR anchors show tissue-specific expression, related to Figure 5.
In the heatmaps, the parenthetical numbers are summed anchor counts. A. Carboxypeptidase D (CPD). The anchor and targets align to the 3’ UTR of the O. sinensis CPD mRNA (XM_029795433.2), but are not in the O. bimaculoides genome assembly. The NCBI Browser screenshot at lower-right shows that the 3’ UTR of the O. bimaculoides CPD gene (LOC106880679, Ch.25) is entirely missing from the assembly: immediately after the coding region, a run of Ns begins (red box). Target 2 is identical to O. sinensis except for two mismatches; target 1 has a 12-nt deletion relative to target 2. Target 1 is only expressed in dissociated cells from sucker rims, and at a low level in one olfactory organ sample. All other tissues express only target 2. B. Upf2 (regulator of nonsense transcripts 2). The alignment of Upf2 mRNAs from O. bimaculoides (XM_014915650.2) and O. sinensis (XM_036513028.1) shows that they diverge just before the stop codon, with unrelated 3’ UTRs. Our O. bimaculoides anchor-targets map only to O. sinensis Upf2 but not to the O. bimaculoides genome. The alignment also shows the downstream portion of the O. sinensis 3’ UTR where the anchor-targets map. Target 1 and 2 have six and five tandem CAG repeats, respectively. Target 1 is expressed in dissociated cells from sucker rims, and in olfactory organ; the other tissues express target 2. C. Netrin receptor/DCC. Alignment of the O. bimaculoides genome (gene LOC106883766) and O. sinensis mRNA (XM_036508072.1) shows that the two diverge shortly after the stop codon. The O. bimaculoides gene ends in dinucleotide repeats just before the genome becomes a run of Ns. Our O. bimaculoides anchor-targets map to the O. sinensis netrin receptor 3’ UTRbut not to the O. bimaculoides genome. The targets differ at a single nucleotide: target 1 and 2 have G and A, respectively; O. sinensis has a G in this position. If the O. bimaculoides genome encodes A, then target 1 is consistent with A-to-I RNA editing (inosine read as G during reverse transcription). The majority of tissues express target 2 only, while target 1 is only expressed in dissociated cells of sucker rims.
Figure 7.
Figure 7.. Diatom anchors in eelgrass samples show variation with location/season or Day vs Night, related to Figure 5.
A. HMG (high mobility group) box domain. The two targets show several nucleotide differences that result in two coding differences. The translation of the consensus sequence has its best two BLASTP matches to HMG box proteins from diatom species, shown in the inset. Target 1 is found only in Norway/December samples, while target 2 is found only in France/June samples. B. Ferredoxin. The two targets show a silent single nucleotide polymorphism. The translation of the consensus sequence has its best BLASTP matches to ferredoxin from several diatom species, the top two are shown in the inset. Target 1 is found only in France/June samples, while target 2 is found only in Norway/December samples. C. Fucoxanthin chlorophyll a/c protein (FCP). This anchor and its targets are also presented in Figure 5C. At left, the translated consensus sequence has its best protein BLAST matches to several diatom species, two are shown in the inset. The amino acid identity for Phaeodactylum tricornutum is 42/44 (95%). The consensus also BLASTs to the P. tricornutum genome, nucleotide identity 107/132 (81%) (not shown). At right, histogram shows total anchor counts for Night are ~60% lower than for Day, indicating circadian regulation of this gene. All are samples from France in December (where this anchor had both Day and Night representation).

Update of

References

    1. Sherman RM, Forman J, Antonescu V, Puiu D, Daya M, Rafaels N, Boorgula MP, Chavan S, Vergara C, Ortega VE, et al. (2019). Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat. Genet 51, 30–35. 10.1038/s41588-018-0273-y. - DOI - PMC - PubMed
    1. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, et al. (2022). The complete sequence of a human genome. Science 376, 44–53. 10.1126/science.abj6987. - DOI - PMC - PubMed
    1. Domingo E, and Perales C. (2019). Viral quasispecies. PLoS Genet. 15, e1008271. 10.1371/journal.pgen.1008271. - DOI - PMC - PubMed
    1. Tettelin H, and Medini D. eds. (2020). The pangenome: diversity, dynamics and evolution of genomes (Springer; ) 10.1007/978-3-030-38281-0. - DOI - PubMed
    1. Castel SE, Levy-Moonshine A, Mohammadi P, Banks E, and Lappalainen T. (2015). Tools and best practices for data processing in allelic expression analysis. Genome Biol. 16, 195. 10.1186/s13059-015-0762-6. - DOI - PMC - PubMed

LinkOut - more resources