Nat Commun. 2024 Sep 5;15(1):7748.
doi: 10.1038/s41467-024-49983-7.

Exploring the structural landscape of DNA maintenance proteins

Kenneth Bødkter Schou et al. Nat Commun. 2024.

Abstract

Evolutionary annotation of genome maintenance (GM) proteins has conventionally been established by remote relationships within protein sequence databases. However, often no significant relationship can be established. Highly sensitive approaches to detect remote homologies based on iterative profile-to-profile methods have been developed. Still, these methods have not been systematically applied in the evolutionary annotation of GM proteins. Here, by applying profile-to-profile models, we systematically survey the repertoire of GM proteins from bacteria to man. We identify multiple GM protein candidates and annotate domains in numerous established GM proteins, among others PARP, OB-fold, Macro, TUDOR, SAP, BRCT, KU, MYB (SANT), and nuclease domains. We experimentally validate OB-fold and MIS18 (Yippee) domains in the SPIDR and FAM72 protein families, respectively. Our results indicate that, surprisingly, despite the immense interest and long-term research efforts, the repertoire of genome stability caretakers is still not fully appreciated.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of the computational survey and summary of results.
a Flow diagram summarizing data collection and subsequent sequence searches. In Step 1, unique GM proteins were compiled by three different approaches. GO terms for GM proteins were compiled in the Amigo database using four search terms across species, yielding a total of 28,663 GO terms. Of these, 3635 are unique GM proteins from the seven selected organisms, namely H. sapiens, D. melanogaster, C. elegans, A. thaliana, S. cerevisiae, S. pombe, and E. coli (K12). GM physical interactors were retrieved from the IID database, yielding a total of 975,877 interactors across species. Among these are 4618 interactors not previously implicated in GM. Of these, only 441 interactor pairs include one established DNA repair protein and one protein not previously linked to the DDR. Among these, 51 interactors were identified as recurrent contaminants in the CRAPome database, yielding a final list of 390 unique GM interactors not previously implicated in the DDR. Genes co-expressed with GM genes were retrieved using CEMiTool, identifying a total of 36,410 GM co-expressed genes. Among these, CEMiTool identifies 3523 overlapping co-expressed genes between two different tissues, which upon filtering for housekeeping genes and registered CRAPome entries are reduced to 2820 co-expressed gene pairs (of which one gene per pair is an established GM gene). In Step 2, the compiled list of 4395 unique GM proteins was used as search queries in profile-HMM searches. These searches yielded a total of 108 hitherto unknown human domains in established and candidate GM proteins. These were used as seeds for reciprocal profile-HMM searches, resulting in 108 validated candidate domains. Finally, the valid candidates were assessed by AlphaFold2 3D structural modeling to structurally validate the predicted evolutionary relationships across protein families. b Summary of identified classes of protein domains in the human proteome. c Validation of profile-HMM methods. Three methods were tested for their efficiency in detecting homologous protein domains in the Protein Data Bank (PDB). The three protein domains used as seeds were the human RPA1 OB_2, the human BARD1 BRCT, and the human MYB (SANT) domains. d, e Examples of predicted 3D structures of two identified candidate domains as judged by AlphaFold modeling. Predicted domains were superimposed with the closest paralog domains in PyMOL as indicated. f Probability plots of profile-HMM remote homology searches using either the predicted KU core domain of M1AP or the predicted BRCT domain of SMARCC1 as sequence queries. g–k Summary of examples of DNA repair candidates identified in the computational survey, shown in red. g Summarizes DNA repair protein domain classes. h–k Four examples of identified DNA repair candidates as judged by their predicted protein domains. Red nodes represent candidates (at the time of analysis). The length of nodes from the center corresponds to the sequence homology of the signature domain family profile. l–p Summary of mitotic candidates identified in the computational survey, shown in blue. l Summarizes mitotic protein domain classes. m–p Four examples of identified mitotic candidates as judged by their predicted protein domains. Blue nodes represent candidates (at the time of analysis). The length of nodes from the center corresponds to the sequence homology of the signature domain family profile. Source data are provided as a Source Data file.
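The contaminant-filtering step described in panel a of Fig. 1 amounts to a set difference: candidate interactors minus entries flagged as recurrent contaminants. The following is a minimal sketch of that idea only, not the authors' pipeline. The identifiers are hypothetical placeholders and the function name `filter_contaminants` is an assumption; only the counts 441, 51, and 390 come from the caption.

```python
def filter_contaminants(candidates, contaminants):
    """Return the candidates that are not flagged as contaminants, keeping order."""
    blocked = set(contaminants)
    return [c for c in candidates if c not in blocked]


# Hypothetical placeholder identifiers; real IID/CRAPome entries are not reproduced here.
candidates = [f"P{i:03d}" for i in range(441)]  # 441 candidate interactors
contaminants = candidates[:51]                  # pretend 51 of them are flagged

retained = filter_contaminants(candidates, contaminants)
print(len(retained))  # 441 - 51 = 390
```

Using a set for the contaminant list keeps each lookup constant-time, which matters little at this scale but helps when the same filter is applied to the much larger interactor tables mentioned in the caption.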
Fig. 2. Identified Poly-(ADP-Ribose) catalyzing protein families.
a Diagram of the active site residues of the PARP1-like catalytic domain. Similar to PARP13, TASOR and TASOR2 have lost essential residues required for catalytic activity. b MSA of selected PARP domain sequences including those of TEX15, TASOR, and TASOR2. Conserved residues are shown in orange as assessed by Clustal W with modifications. Predicted secondary structures are shown above the MSA. Boxes indicate alpha-helices, and arrows indicate beta-sheets. c Schematic domain architectures of selected human PARPs including the three PARP candidates TEX15, TASOR, and TASOR2. Phylogenetic trees were calculated from MSA average distances using approximately the maximum-likelihood method in the IQ-Tree v.2.050 program. d Predicted PARP domains of TEX15 and TASOR2 as assessed by AlphaFold. PARP domains of TEX15 and TASOR2 (orange) were superimposed with the PARP domain of PARP1 (white) in PyMOL. e Superimposed PARP domains of either TASOR2 and PARP1 (left) or TASOR2 and PARP13 (right). Conserved residues are indicated. f Probability plots of profile-HMM searches using the predicted TEX15 PARP domains (top) or the predicted AKAP11 Macro domain (bottom) as queries. g 3D structures of AKAP family member Macro domains as predicted by AlphaFold. AKAP Macro domains (yellow/orange shades) are superimposed with the Macro domain in human PARG (white). h PARG domain relationships to human AKAP family proteins. The C-termini of SPHKAP, AKAP3, AKAP4, and AKAP11 show remote homology to the C-terminal portion of PARG comprising the PAR-binding Macro domain. Source data are provided as a Source Data file.
Fig. 3. Identification of OB fold domains in SPIDR.
a MSA of selected OB fold domain sequences including the outermost C-terminal OB fold of SPIDR. Conserved residues shown in blue were calculated using the Clustal W algorithm with modifications. Predicted secondary structures are shown above the MSA. Arrows indicate beta-sheets, and boxes indicate alpha-helices. b Probability plots of profile-HMM remote homology searches using either the human RPA1_OB4 domain (forward) or the predicted SPIDR OB3 domain (reciprocal) as sequence queries. c The three predicted OB folds in the SPIDR C-terminus. Here, AlphaFold predicted models are superimposed with the solved structure of RPA1_4OB fold DBD-D (PDB: 4GOP), shown in white. Short unstructured coils have been stripped off the SPIDR OB folds for clarity. d The three tandem OB-fold domains in the SPIDR C-terminus resemble those of other OB fold-containing proteins. Searching the overall predicted structure comprising these three SPIDR OB folds against protein structure databases using Foldseek (https://github.com/steineggerlab/foldseek) identifies S. cerevisiae RPA1 as the closest significant match. e Schematic illustration of the bimodular family of IDP and OB-fold family DNA repair proteins. Two causative mutations identified in primary ovarian insufficiency (POI) patients are shown in SPIDR (red). f OB folds of SPIDR bind ssDNA. Cell extracts from HEK293T cells expressing the indicated fragments of FLAG-SPIDR were incubated with biotinylated ssDNA and subjected to pulldown using streptavidin resin. g Chromatin fractionation of HEK293T cells expressing either full-length FLAG-SPIDR or a truncated FLAG-SPIDR fragment corresponding to SPIDR containing the disease mutation W280*. h FLAG-SPIDR binds ssDNA but not dsDNA. Cell extracts from HEK293T cells expressing FLAG-SPIDR were incubated with either biotinylated ssDNA or dsDNA, and the biotinylated DNA was purified using a streptavidin (strep) resin. i Point mutations introduced into OB-fold domains of SPIDR. j Biotin-ssDNA pulldown analysis of cell extract from HEK293T cells expressing either GFP-SPIDR wildtype or a mutated version with the indicated amino acid substitutions. WCE = sample processing control. k AlphaFold3 modeling of the SPIDR OB3 domain in complex with either DNA or RNA. Immunoblots are representative results of two individual experiments (X = 2). Source data are provided as a Source Data file.
Fig. 4. FAM72 family proteins bind the RPA complex and are implicated in RPA activation in response to DNA damage.
a MSA of human MIS18 domain sequences. Conserved residues shown in brown were assessed using the Clustal W algorithm. Predicted secondary structures are shown above the MSA. Arrows indicate beta-sheets. b Probability plots of profile-HMM remote homology searches using either the MIS18 domain of MIS18a (forward search) or the predicted MIS18 domain of FAM72B (reciprocal search) as sequence queries. c Family of human MIS18 family proteins. Phylogenetic trees were calculated from MSA average distances using the percentage identity (PID) algorithm. d Tertiary structures of FAM72 family proteins as predicted by AlphaFold. Predicted domains were superimposed in PyMOL. e FAM72A complexes with either FAM72B, FAM72C, or FAM72D as predicted by ColabFold. f Gene co-expression GO enrichment analysis result of the co-expression signature profile shown in (g). g Combined FAM72A-D gene co-expression signature. The human FAM72 family co-expressed genes are ranked according to Pearson correlation coefficients (PCC) as shown. h Volcano plot showing top interactors of FLAG-FAM72B as assessed by mass spectrometry. i FLAG-FAM72B immunoprecipitation and subsequent immunoblot of eluted immunocomplexes. Proteins were probed with the indicated antibodies. WCE = sample processing control. j Immunoblot of chromatin fractions from FLAG-FAM72B-expressing U2OS cells after a thymidine block. Cells were either left untreated or treated for 24 hours with thymidine, followed by extensive washing, release in growth medium, and harvesting at the indicated time points. The resolved proteins were probed with the indicated antibodies. Sol = soluble fraction, Chromatin = chromatin-enriched fraction. k Immunoblot of chromatin fractions from FLAG-FAM72B-expressing cells after exposure to CPT for 1.5 and 3 h. Proteins were probed with the indicated antibodies. Immunoblots are representative results of two individual experiments (X = 2). Source data are provided as a Source Data file.
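Ranking co-expressed genes by Pearson correlation coefficient (PCC), as mentioned in the Fig. 4 caption, can be illustrated with a small sketch. This is an illustrative sketch under stated assumptions, not the authors' analysis: the gene names, the expression values, and the helper names `pearson` and `rank_by_pcc` are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_by_pcc(reference, profiles):
    """Sort genes by PCC against a reference expression profile, highest first."""
    scores = {gene: pearson(reference, values) for gene, values in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression values across four samples.
reference = [1.0, 2.0, 3.0, 4.0]
profiles = {
    "geneA": [2.0, 4.0, 6.0, 8.0],  # rises with the reference
    "geneB": [4.0, 3.0, 2.0, 1.0],  # falls as the reference rises
    "geneC": [1.0, 3.0, 2.0, 4.0],  # partially correlated
}

ranked = rank_by_pcc(reference, profiles)
for gene, pcc in ranked:
    print(gene, round(pcc, 3))
```

The sketch assumes every profile has non-zero variance; a constant profile would make the denominator zero and would need to be filtered out beforehand.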
Fig. 5. Prediction of the C1ORF146-SHOC1 complex and implications of C1ORF146 in the DDR.
a Family of human ERCC4/XLF nucleases. Phylogenetic trees were calculated from MSA average distances using the percentage identity (PID) algorithm. b ColabFold prediction of C1ORF146 in complex with SHOC1 and of the yeast SPO16-ZIP2 complex. c MSA of human ERCC4 domain sequences. Conserved residues shown in green were calculated using the Clustal W algorithm. Predicted secondary structures are shown above the MSA. Boxes indicate alpha-helices and arrows indicate beta-sheets. C1ORF146 residues predicted to make contacts with SHOC1 are highlighted in red. Many of these residues are conserved across ERCC4 nucleases (green). d Contact sites (red) between the outermost C-terminus of C1ORF146 (green) and SHOC1 (white). e Contacts (red) between the SPO16 outermost C-terminus (green) and ZIP2 (white). Chromatin retention of GFP-C1ORF146 in RPE1 cells. f The predicted alignment error (PAE) plot of the human C1ORF146-SHOC1 ERCC4 nuclease complex as shown in panel (b). g Immunoblot of chromatin fractions from GFP-C1ORF146-expressing U2OS cells either untreated or treated with DNA damaging agents. The resolved proteins were probed with the indicated antibodies. h Immunoblot of chromatin fractions from GFP-C1ORF146-expressing U2OS cells released from a thymidine block. Cells were either left untreated or treated for 24 hours with thymidine, followed by extensive washing, release in growth medium, and harvesting at the indicated time points. The resolved proteins were probed with the indicated antibodies. Immunoblots are representative results of three individual experiments (X = 3). i Immunofluorescence microscopy images of U2OS cells expressing GFP-C1ORF146 either untreated or treated with cisplatin for 6 hours. After fixation in PFA, cells were permeabilized, BSA-blocked, and stained with the indicated antibody. The micrographs are representative of two individual experiments (X = 2). Source data are provided as a Source Data file. Scale bar in i: 10 µm.
