Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Jan 19;46(4):e20230048.
doi: 10.1590/1678-4685-GMB-2023-0048. eCollection 2024.

Position Weight Matrix or Acyclic Probabilistic Finite Automaton: Which model to use? A decision rule inferred for the prediction of transcription factor binding sites

Affiliations

Position Weight Matrix or Acyclic Probabilistic Finite Automaton: Which model to use? A decision rule inferred for the prediction of transcription factor binding sites

Guilherme Miura Lavezzo et al. Genet Mol Biol. .

Abstract

Prediction of transcription factor binding sites (TFBS) is an example of application of Bioinformatics where DNA molecules are represented as sequences of A, C, G and T symbols. The most used model in this problem is Position Weight Matrix (PWM). Notwithstanding the advantage of being simple, PWMs cannot capture dependency between nucleotide positions, which may affect prediction performance. Acyclic Probabilistic Finite Automata (APFA) is an alternative model able to accommodate position dependencies. However, APFA is a more complex model, which means more parameters have to be learned. In this paper, we propose an innovative method to identify when position dependencies influence preference for PWMs or APFAs. This implied using position dependency features extracted from 1106 sets of TFBS to infer a decision tree able to predict which is the best model - PWM or APFA - for a given set of TFBSs. According to our results, as few as three pinpointed features are able to choose the best model, providing a balance of performance (average precision) and model simplicity.

PubMed Disclaimer

Conflict of interest statement

Conflict of Interest: The authors declare that there is no conflict of interest that could be perceived as prejudicial to the impartiality of the reported research.

Figures

Figure 1 -
Figure 1 -. Example of models trained from a pseudo TFBS sample in order to show how APFA (left) and PWM (right) models estimate their probability parameters and how dependencies are, or not, considered. For the sake of simplicity and with no impact in the comparison, the illustrated APFA and PWM scores were calculated based on equation 1 with pseudocount = 0.1, ie, before the division by the null model probabilities and log calculation. In the APFA, circles represent analysis states that are followed left-to-right during the analysis of an input sequence. Each edge is labeled with a nucleotide or EOS (“end of sequence”), and a probability. The sum of the probabilities of all edges going out of each state is one. Here, the PWM matrix has at position (i, j) the probability of a nucleotide i is present at TFBS position j. The green box in the TFBS sample shows a dependency between nucleotide positions 2 and 3. Whereas APFA was able to represent such dependency, PWM was not able. For instance, the symbol “T” in the third position of the sequence “CAT” has a high probability (0.96) in the APFA once it appeared after “CA” (path shown in red), whereas “C” in the third position of the sequence “CAC” has a low probability (0.01) also due to their previous nucleotides (last part of the path shown in blue). However, PWM ascribes the same probability of 0.49 to each nucleotide “T” or “C” in the third position.
Figure 2 -
Figure 2 -. Summary of the employed workflow.
Figure 3 -
Figure 3 -. Nested 7-fold Cross-Validation. The whole TFBS sample is represented as a bar, and each fold is represented as a disjoint subset (rectangles). In A) the first step is illustrated where, in each iteration, train folds (in blue) are used to train the APFA and PWM models, and the calibration fold (in yellow) is used to calibrate the Learn-APFA hyperparameters and find the optimal classification thresholds for APFA and PWM. In this phase the test fold (in red) is not used. In B) the second step is illustrated, which estimates the performance of the APFA calibrated in the previous step and of the PWM, using the threshold values also calibrated in the previous step. In this phase, calibration fold is included in the training folds and the trained APFA and PWM models are applied to the test fold (positive sample in red, and additional negative samples). The final performance is the average of the performance values calculated in each iteration.
Figure 4 -
Figure 4 -. Principal Component Analysis using the extracted feature for model preference categories.
Figure 5 -
Figure 5 -. Decision tree to choose the best model. The leaves represent model preference based on Cohen’s D (D) measure calculated between APFA and PWM performance for each TFBS sample, where APFA is preferred when D ≥ 0.4 and PWM otherwise. The label “gini” refers to the gini impurity, “samples” is the total number of TFBS samples considered before a decision rule is made, “value” is the number of TFBS samples that reach that node. The final level of a Decision Tree is shown (pruned at level 3), where blue leaves represent the majority of TFBS samples classified as PWM-preferred and red leaves the APFA-preferred, showing overall good results.
Chart 1 -
Chart 1 -. Decision Rules to choose between PWM and APFA. The algorithm shows how to choose between models PWM and APFA based on three extracted features, which were calculated using the TFBS sample.

Similar articles

References

    1. Andersson R, Sandelin A. Determinants of enhancer and promoter activities of regulatory elements. Nat Rev Genet. 2019;21:71–87. - PubMed
    1. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Vedenko GMA, Chen X, Kuznetsov H, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324:1720–1723. - PMC - PubMed
    1. Bailey TL. STREME: Accurate and versatile sequence motif discovery. Bioinformatics. 2021;37:2834–2840. - PMC - PubMed
    1. Berger MF, Bulyk ML. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat Protoc. 2009;4:393–411. - PMC - PubMed
    1. Boeva V. Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells. Front Genet. 2016;7:24. - PMC - PubMed

Internet Resources

    1. [1 July 2020];ENCODE database. ENCODE database , https://www.encodeproject.org .
    1. [1 July 2020];JASPAR database. JASPAR database, https://jaspar.genereg.net/
    1. [27 July 2021];PBM dataset. PBM dataset, https://hugheslab.ccbr.utoronto.ca/supplementary-data/DREAM5/
    1. [10 June 2020];R package ENCODExplorer: A compilation of ENCODE metadata. R package ENCODExplorer: A compilation of ENCODE metadata, https://rdrr.io/bioc/ENCODExplorer/