PeerJ. 2023 Apr 3:11:e15145.

doi: 10.7717/peerj.15145. eCollection 2023.

Inference of differentially expressed genes using generalized linear mixed models in a pairwise fashion

Douglas Terra Machado¹, Otávio José Bernardes Brustolini¹, Yasmmin Côrtes Martins¹, Marco Antonio Grivet Mattoso Maia², Ana Tereza Ribeiro de Vasconcelos¹

Affiliations

¹ Laboratório de Bioinformática, Laboratório Nacional de Computação Científica, Petrópolis, Rio de Janeiro, Brazil.
² Centro de Estudo em Telecomunicações, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, Brazil.

PMID: 37033732
PMCID: PMC10078460
DOI: 10.7717/peerj.15145

Douglas Terra Machado et al. PeerJ. 2023.

Abstract

Background: Technological advances in RNA-Seq and bioinformatics make it possible to quantify the transcriptional levels of genes in cells, tissues, and cell lines, and thereby to identify differentially expressed genes (DEGs). DESeq2 and edgeR are well-established computational tools for this purpose; both are based on generalized linear models (GLMs), which consider only fixed effects. However, including random effects reduces the risk of missing potential DEGs that may be essential to the biological phenomenon under investigation. Generalized linear mixed models (GLMMs) can accommodate both fixed and random effects.

Methods: We present DEGRE (Differentially Expressed Genes with Random Effects), a user-friendly tool for inferring DEGs when the experimental design of an RNA-Seq study includes both fixed effects and random effects on individuals. DEGRE preprocesses the raw matrices and then fits GLMMs to the genes; the resulting regression coefficients are assessed with the Wald test. DEGRE offers the Benjamini-Hochberg or Bonferroni method for P-value adjustment.
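
As a rough illustration of the testing and adjustment steps, the sketch below computes a two-sided Wald P-value from a coefficient estimate and its standard error, then adjusts a list of P-values with either Bonferroni or Benjamini-Hochberg. It is plain Python for illustration only and is not DEGRE's code; the function names (`wald_p_value`, `bonferroni`, `benjamini_hochberg`) and the input numbers are hypothetical.

```python
import math

def wald_p_value(beta, se):
    """Two-sided Wald P-value for H0: beta = 0, using the normal approximation."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical (coefficient, standard error) pairs, one per gene.
fits = [(2.1, 0.4), (0.3, 0.5), (-1.5, 0.45), (0.05, 0.6)]
p_raw = [wald_p_value(b, s) for b, s in fits]
p_bh = benjamini_hochberg(p_raw)
p_bon = bonferroni(p_raw)
```

Bonferroni controls the family-wise error rate and is the more conservative choice; Benjamini-Hochberg controls the false discovery rate and never yields a larger adjusted value than Bonferroni for the same input.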

Results: DEGRE was assessed on simulated datasets in which the DEGs are known. These datasets contain fixed effects; random effects were estimated and inserted to measure the impact of experimental designs with high biological variability. The preprocessing effectively prepares the data for DEG inference and retains overdispersed genes. The biological coefficient of variation is inferred from the counting matrices to assess variability before and after preprocessing. DEGRE is computationally validated through its performance on simulated counting matrices whose biological variability is related to fixed and random effects, and it provides improved assessment measures for detecting DEGs in cases with higher biological variability. We show that the preprocessing established here effectively removes technical variation from those matrices. The tool also detects new potential candidate DEGs in transcriptome data from patients with bipolar disorder, which makes it promising for detecting additional relevant genes.
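
The distinction between equidispersed and overdispersed genes, and the biological coefficient of variation (BCV), can be illustrated with a simple method-of-moments calculation. The sketch assumes a negative binomial mean-variance relation, variance = mean + dispersion × mean², and takes the BCV as the square root of the dispersion. This is a common convention and not necessarily the estimator DEGRE uses; the function name and the counts are hypothetical.

```python
import math
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments NB dispersion: (var - mean) / mean^2, floored at 0."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (var - mu) / (mu * mu))

# Hypothetical normalized counts for two genes across six samples.
genes = {
    "gene_a": [10, 12, 9, 11, 8, 10],  # variance below the mean
    "gene_b": [5, 40, 12, 90, 3, 60],  # variance far above the mean
}

for name, counts in genes.items():
    phi = moment_dispersion(counts)
    # Simplification: no extra-Poisson variation is labeled equidispersed.
    label = "overdispersed" if phi > 0 else "equidispersed"
    print(f"{name}: dispersion={phi:.3f}, BCV={math.sqrt(phi):.3f}, {label}")
```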

Conclusions: DEGRE provides data preprocessing and applies GLMMs for DEG inference. The preprocessing efficiently removes genes that could impair the inference. The computational and biological validation of DEGRE has shown promise in identifying possible DEGs in experiments with complex designs. The tool may help handle random effects on individuals in DEG inference and has the potential to uncover new DEGs of interest for further biological investigation.

Keywords: DEGRE package; Differentially expressed genes; Gene dispersion; Generalized linear mixed model; Preprocessing; Random effects.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Figure 1. Preprocessing steps for RNA-Seq datasets before DEGs’ identification.**
The datasets are subjected to scale correction, followed by normalization. The resulting matrices are assessed based on gene expression. Genes with non-null counts are statistically evaluated for possible elimination. The resulting dataset is passed on to DEG inference.

**Figure 2. Results of the number of genes with features found in each data preprocessing step.**
Number of genes in matrices with fixed effects that were (A) expressed in all samples, (B) not expressed, (C) expressed in some samples, (D) equivalently expressed across samples, and (E) lowly expressed, together with (F) the loss of DEGs (%).

**Figure 3. Normal distributions associated with random effects.**

**Figure 4. Number of equidispersed (in red) and overdispersed (in blue) genes in the counting matrices containing (A) 10%, (B) 20%, and (C) 30% of DEGs before and after the preprocessing step.**
The reduction percentage for each result refers to the number of equidispersed and overdispersed genes before and after the preprocessing application in matrices with fixed effects.

**Figure 5. Number of equidispersed (in red) and overdispersed (in blue) genes in the counting matrices with 10% of DEGs before and after the preprocessing step.**
The SDNDs vary between (A) 100, (B) 300, (C) 600, (D) 900, (E) 1,200, (F) 2,000, and (G) 3,000. The percentage reduction for each result refers to the number of equidispersed and overdispersed genes before and after the preprocessing application in matrices with fixed and random effects.

**Figure 6. Values for the biological coefficient of variation in matrices with only fixed effects before (in yellow) and after (in red) the preprocessing step.**
(A), (B) and (C) respectively represent the coefficients for the cases of 10%, 20% and 30% of DEGs. The percentage of the coefficient reduction before and after the preprocessing step for each replicate is indicated by the black dashed line.

**Figure 7. Biological coefficient of variation for estimated random effects datasets with SDND 100 (A) and SDND 300 (B).**
The black dashed line indicates the percentage of BCV decrease before and after the preprocessing phase for each point on the graph.

**Figure 8. Evaluation of DEGs identification on simulated datasets using GLM models.**
(A), (B) and (C) correspond to the datasets with 10%, 20%, and 30% of DEGs, respectively. The solid line is the precision rate, and the dashed line is the recall rate. The mean and standard deviation of the accuracy are shown below each panel. The computational tools are DESeq2 and edgeR, and the models are the GLMs with the Bonferroni (BON) or Benjamini-Hochberg (BH) P-value correction.

**Figure 9. Identification of DEGs by the computational tools in datasets with random effects.**
(A), (B) and (C) are the datasets with 10%, 20%, and 30% of DEGs, respectively. The solid line is the precision rate, and the dashed line is the recall rate. Below each figure is the mean and standard deviation of the accuracy. The computational tools are DEGRE-BON, DEGRE-BH, DESeq2, and edgeR.

**Figure 10. Number of genes identified as differentially expressed by DEGRE.**
(A) 124 genes were identified as downregulated, and nine were identified as upregulated. (B) The intersection of the 133 genes identified by DEGRE with the findings of Pacifico & Davis (2017).
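
The precision, recall, and accuracy reported in Figures 8 and 9 can be computed whenever the true DEGs are known, as in simulated data. The following is a minimal sketch using the standard definitions of these measures; the function name `evaluate_calls` and the gene identifiers are hypothetical and are not taken from DEGRE.

```python
def evaluate_calls(predicted, truth, all_genes):
    """Precision, recall, and accuracy of predicted DEGs against known DEGs."""
    predicted, truth, all_genes = set(predicted), set(truth), set(all_genes)
    tp = len(predicted & truth)              # called DEG, truly DEG
    fp = len(predicted - truth)              # called DEG, not a DEG
    fn = len(truth - predicted)              # missed DEG
    tn = len(all_genes - predicted - truth)  # correctly left uncalled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(all_genes) if all_genes else 0.0
    return precision, recall, accuracy

# Hypothetical simulated experiment: 10 genes, 3 of them true DEGs.
all_genes = [f"g{i}" for i in range(10)]
truth = ["g0", "g1", "g2"]
predicted = ["g0", "g1", "g5"]
precision, recall, accuracy = evaluate_calls(predicted, truth, all_genes)
```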


