BMC Bioinformatics. 2011 Oct 5;12 Suppl 9(Suppl 9):S9.

doi: 10.1186/1471-2105-12-S9-S9.

Detecting genomic regions associated with a disease using variability functions and Adjusted Rand Index

Dunarel Badescu¹, Alix Boc, Abdoulaye Baniré Diallo, Vladimir Makarenkov

PMID: 22151279
PMCID: PMC3271671
DOI: 10.1186/1471-2105-12-S9-S9

Dunarel Badescu et al. BMC Bioinformatics. 2011.

Affiliation

¹ Département d'Informatique, Université du Québec à Montréal, CP 8888, Succursale Centre-Ville, Montréal (Québec), H3C 3P8, Canada.


Abstract

Background: The identification of functional regions in a given multiple sequence alignment is one of the major challenges of comparative genomics. Several studies have focused on the identification of conserved regions and motifs. However, most existing methods ignore the relationship between the functional genomic regions and the external evidence associated with the considered group of species (e.g., carcinogenicity of human papilloma virus). In the past, we proposed a method that takes into account prior knowledge of such external evidence (e.g., carcinogenicity or invasivity of the considered organisms) and identifies genomic regions related to a specific disease.

Results and conclusion: We present a new algorithm for detecting genomic regions that may be associated with a disease. Two new variability functions and a bipartition optimization procedure are described. We validate and weigh our results using the Adjusted Rand Index (ARI), and thus assess to what extent the selected regions are related to carcinogenicity, invasivity, or any other species classification given as input. The predictive power of different hit region detection functions was assessed on synthetic and real data. Our simulation results suggest that no single function provides the best results in all practical situations (e.g., monophyletic or polyphyletic evolution, and positive or negative selection), and that at least three different functions might be useful. The proposed hit region identification functions that do not benefit from prior knowledge (i.e., carcinogenicity or invasivity of the involved organisms) can provide results equivalent to those of the existing functions that take advantage of such prior knowledge. Using the new algorithm, we examined the Neisseria meningitidis FrpB gene product for invasivity and immunologic activity, and the human papilloma virus (HPV) E6 oncoprotein for carcinogenicity, and confirmed some well-known molecular features, including surface-exposed loops for N. meningitidis and the PDZ domain for HPV.
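As a rough illustration of how agreement between a window-derived bipartition and a known species classification could be scored, the sketch below computes the Adjusted Rand Index in plain Python using the standard pair-counting formula of Hubert and Arabie. The function name `adjusted_rand_index` and the example labels are assumptions introduced here for illustration; this is not the authors' implementation, and the bipartition optimization procedure itself is not reproduced.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same items.

    Each argument is a sequence of cluster labels, one per item.
    Returns 1.0 for identical partitions and values near 0 (possibly
    negative) for agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree

    # Pairs of items placed together in both partitions (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition separately (row and column sums).
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial and identical
    return (together_both - expected) / (maximum - expected)


# Hypothetical example: "X" marks carcinogenic taxa, "Y" the others.
known_classes = ["X", "X", "Y", "Y", "Y"]
# Hypothetical bipartition inferred from one alignment window.
window_bipartition = [0, 0, 1, 1, 1]
score = adjusted_rand_index(known_classes, window_bipartition)
```

In this hypothetical case the two partitions coincide, so the index takes its maximum value; a window whose bipartition is unrelated to the classification would score close to zero.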


Figures

**Figure 1**
**Sliding window procedure** A sliding window of fixed width was used to scan the HPV E6 gene. The sequences in black belong to the set X (carcinogenic HPV; in this example, HPV 16 and 18); all other sequences belong to the set Y (non-carcinogenic HPV). The HPV type is indicated in the left column.

**Figure 2**
**p-values obtained for monophyletic evolution hit region detection** (a) Positive selection - Variable hit region inside conserved context. Quartile distribution of p-values obtained for the function . The abscissa represents the scaling factor of the conserved context in which the variable hit region resides. Values close to 0 represent conservation (maximum discrimination), while values close to 1 represent variability (identical to context). The variable hit region is always maintained at a scaling factor of 1. The ordinate represents p-values on a log scale. The horizontal dashed line represents the significance threshold of 0.05. (b) Lineage-specific selection - Heterogeneous hit region inside neutral context. Quartile distribution of p-values obtained for the function . The abscissa represents the difference in scaling factors between the two lineages present in the hit region. Values close to 0 represent homogeneous evolutionary speed (similar to the neutral context in which the hit region resides), while values close to 1 represent divergence between these lineages. The context is always maintained at a scaling factor of 0.5, simulating neutral evolution. The horizontal dashed line represents the significance threshold of 0.05. In the case of lineage-specific selection, the value of the Q′-type functions corresponding to 1 on the abscissa scale cannot be computed because it involves a sub-tree with edge lengths of 0.

**Figure 3**
**p-values obtained for polyphyletic evolution hit region detection** (a) Positive selection - Variable hit region inside conserved context. Quartile distribution of p-values obtained for the function . The variable hit region is always maintained at a scaling factor of 1. The abscissa represents the scaling factor of the conserved context in which the variable hit region resides. Values close to 0 represent conservation (maximum discrimination), while values close to 1 represent variability (identical to context). The ordinate represents p-values on a log scale. The horizontal dashed line represents the significance threshold of 0.05. (b) Lineage-specific selection - Heterogeneous hit region inside neutral context. Quartile distribution of p-values obtained for the function . The context is always maintained at a scaling factor of 0.5, simulating neutral evolution. The abscissa represents the difference in scaling factors between the two lineages present in the hit region. Values close to 0 represent homogeneous evolutionary speed (similar to the neutral context in which the hit region resides), while values close to 1 represent divergence between these lineages, and from the neutral context. The horizontal dashed line represents the significance threshold of 0.05.

**Figure 4**
**N. meningitidis FrpB protein variability zone detection** (a) Topology model of the FrpB protein of *N. meningitidis* strain H44/76, showing the topology of the β-barrel. Surface-exposed loops (L) and β-strands are numbered. Residues are framed according to their predicted secondary structure: amino acid residues in β-strands are depicted by diamonds, and amino acid residues in exposed loops and periplasmic turns are depicted by circles (reproduced from Kortekaas et al., 2007 [20]). (b)-(c) Variability zone detection by the hit region identification Q′-type functions, achieved *without prior knowledge* of invasive taxa (case b), and by the Q″-type functions, *using this prior knowledge* along with the ARI coefficient (case c). Functions and are depicted by a dashed line and functions and are depicted by a continuous line. A non-overlapping sliding window of 9 nucleotides was used during the scan of the FrpB gene MSA. The abscissa represents the window position along the nucleotide MSA. The 11 gray zones correspond to extracellular loops. Annotations start at the solid vertical line (near the 400 mark on the abscissa).

**Figure 5**
**Hit region identification functions for High-Risk HPV** (a) Functions obtained *using prior knowledge* of the taxa carcinogenicity. The hit region identification functions Q₄, depicted by a dashed line, Q₅, depicted by a continuous line, and Q₆, depicted by a dotted line, for the High-Risk HPV (HPV-16 and 18) [11,12], during the scan of the gene E6. (b) Functions computed *without prior knowledge* of the taxa carcinogenicity. The hit region identification functions Q′₄, depicted by a dashed line, Q′₅, depicted by a continuous line, and Q′₆, depicted by a dotted line, during the scan of the gene E6. The abscissa represents the window position along the nucleotide multiple sequence alignment. The *PDZ*-domain is highlighted in gray. Annotations for the N- and C-terminal arms and the E6N and E6C domains are given in HPV16 coordinates, from Nominé et al., 2006 [30]. Annotations of the Zn²⁺-ligating Cys residues are reproduced from Lipari et al., 2001 [28].

**Figure 6**
**Q″-type functions, depending on ARI** (a) Squam HPV dataset. (b) Adeno HPV dataset. Variation of the functions Q″₄, depicted by a dashed line, Q″₅, depicted by a continuous line, and Q″₆, depicted by a dotted line, obtained with a non-overlapping sliding window of width 20 nucleotides during the scan of the gene E6. The abscissa represents the window position along the nucleotide MSA. The *PDZ*-domain is highlighted in gray. Annotations for the N- and C-terminal arms and the E6N and E6C domains are given in HPV16 coordinates, from Nominé et al., 2006 [30]. Annotations of the Zn²⁺-ligating Cys residues are reproduced from Lipari et al., 2001 [28].
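The non-overlapping sliding-window scan referred to in the captions above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-window score used here, the mean pairwise Hamming distance divided by the window width, is a simple stand-in and is not one of the Q-type functions of the paper, whose definitions are not given in this abstract. The names `scan_windows`, `mean_pairwise_distance`, and the toy alignment are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_distance(seqs):
    """Average Hamming distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def scan_windows(alignment, width):
    """Scan an alignment with a non-overlapping window of fixed width.

    alignment: dict mapping a taxon name to its aligned sequence.
    Returns a list of (window_start, score) tuples, where the score is a
    stand-in variability measure (NOT the paper's Q functions).
    Trailing columns that do not fill a whole window are ignored.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()

    results = []
    for start in range(0, length - width + 1, width):
        window = [seq[start:start + width] for seq in alignment.values()]
        results.append((start, mean_pairwise_distance(window) / width))
    return results


# Toy alignment (hypothetical data): the second window is the variable one.
toy_alignment = {
    "taxon_a": "AAAAAAAA",
    "taxon_b": "AAAATTTT",
    "taxon_c": "AAAAAAAA",
}
profile = scan_windows(toy_alignment, width=4)
```

Plotting the score against the window start position would give a profile analogous in layout to the curves in Figures 4 to 6, with the window position on the abscissa.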


Cited by

A Generalized Bayesian Stochastic Block Model for Microbiome Community Detection.
Lutz KC, Neugent ML, Bedi T, De Nisco NJ, Li Q. Stat Med. 2025 Feb 10;44(3-4):e10291. doi: 10.1002/sim.10291. PMID: 39853798.

References

1. Posada D, Crandall K. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proceedings of the National Academy of Sciences of the United States of America. 2001;98(24):13757. doi: 10.1073/pnas.241370698.
1. Kimura M. The neutral theory of molecular evolution. Cambridge University Press; 1985.
1. Boc A, Philippe H, Makarenkov V. Inferring and validating horizontal gene transfer events using bipartition dissimilarity. Systematic Biology. 2010;59(2):195. doi: 10.1093/sysbio/syp103.
1. Moran P. The statistical processes of evolutionary theory. 1962.
1. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research. 2005;15(8):1034–50. doi: 10.1101/gr.3715005. http://www.ncbi.nlm.nih.gov/pubmed/16024819
