BMC Genomics. 2016 Mar 22;17:255.

doi: 10.1186/s12864-016-2584-7.

EPIG-Seq: extracting patterns and identifying co-expressed genes from RNA-Seq data

Jianying Li ¹ ² ³, Pierre R Bushel ⁴ ⁵

Affiliations

¹ Integrative Bioinformatics Group, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA.
² Microarray and Genome Informatics Group, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA.
³ Kelly Government Solutions, Research Triangle Park, NC, 27709, USA.
⁴ Microarray and Genome Informatics Group, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA. bushel@niehs.nih.gov.
⁵ Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T.W. Alexander Drive, P.O. Box 12233, Research Triangle Park, NC, 27709, USA. bushel@niehs.nih.gov.

PMID: 27004791
PMCID: PMC4804494
DOI: 10.1186/s12864-016-2584-7

Jianying Li et al. BMC Genomics. 2016.

Abstract

Background: RNA sequencing (RNA-Seq) measures genome-wide gene expression. RNA-Seq data are count-based, which makes normal-distribution models inappropriate for analysis. Normalizing RNA-Seq data to transform them has limitations that can adversely affect the analysis. Furthermore, the few count-based methods available for analyzing RNA-Seq data are designed essentially for pairwise comparison of treatment groups or for multiclass comparison; they are not pattern-based and do not identify co-expressed genes.

Results: We adapted our extracting patterns and identifying genes methodology to RNA-Seq count data (EPIG-Seq). The software uses a count-based correlation to measure similarity between genes, quasi-Poisson modelling to estimate dispersion in the data, and a location parameter to indicate the magnitude of differential expression. EPIG-Seq differs from any other software currently available for pattern analysis of RNA-Seq data in that it (1) uses count-level data and supports cases of inflated zeros, (2) identifies statistically significant clusters of genes that are co-expressed across experimental conditions, (3) takes into account dispersion in the replicate data and (4) provides reliable results even with small sample sizes. EPIG-Seq operates in two steps: (1) extract the pattern profiles from the data as seeds for clustering co-expressed genes and (2) cluster the genes to the pattern seeds and compute the statistical significance of each pattern of co-expressed genes. EPIG-Seq provides a table of the genes with bootstrapped p-values and profile plots of the patterns of co-expressed genes. In addition, EPIG-Seq provides a heat map and a principal component dimension-reduction plot of the clustered genes as visual aids. We demonstrate the utility of EPIG-Seq by analyzing toxicogenomics and cancer data sets to identify biologically relevant co-expressed genes. EPIG-Seq is available at: sourceforge.net/projects/epig-seq.
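As an illustration of the quasi-Poisson dispersion idea mentioned in the Results, the following is a minimal sketch rather than the EPIG-Seq implementation, whose exact estimator is not given in the abstract. It assumes the textbook Pearson estimator for a model with one mean per experimental group; the function name `quasi_poisson_dispersion` and the example counts are hypothetical.

```python
import numpy as np

def quasi_poisson_dispersion(counts, groups):
    """Pearson-based dispersion estimate for a single gene.

    counts: read counts, one per sample.
    groups: group label for each sample (same length as counts).
    Assumes a one-mean-per-group Poisson model, for which the fitted
    mean of every sample is simply its group average.
    """
    counts = np.asarray(counts, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)

    # Fitted mean for each sample = average count of its group.
    mu = np.empty_like(counts)
    for g in labels:
        mask = groups == g
        mu[mask] = counts[mask].mean()

    dof = counts.size - labels.size  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more samples than groups")

    # Groups whose counts are all zero have mu == 0 and add nothing.
    positive = mu > 0
    pearson = np.sum((counts[positive] - mu[positive]) ** 2 / mu[positive])
    return pearson / dof

# Hypothetical gene: three replicates in each of two groups.
phi = quasi_poisson_dispersion([8, 10, 12, 18, 20, 22], [0, 0, 0, 1, 1, 1])
print(phi)
```

A value near 1 is what a plain Poisson model would predict; values well above 1 indicate overdispersion among the replicates, and values below 1 (as in this tidy toy example) indicate less variation than Poisson.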

Conclusions: EPIG-Seq is unlike any other software currently available for pattern analysis of RNA-Seq count-level data across experimental groups. Using EPIG-Seq to analyze RNA-Seq count data across biological conditions makes it possible to extract biologically meaningful co-expressed genes associated with coordinated regulation.

Keywords: Cancer; Clustering; EPIG-Seq; Gene expression; Pattern analysis; RNA-Seq; Toxicogenomics.

Figures

**Fig. 1**
EPIG-Seq workflow. The workflow depicts the main steps of EPIG-Seq. The parameters are used in steps 1 and 2 to extract the patterns and cluster the genes respectively. The output is the statistically significant patterns with co-expressed genes
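The two-step workflow in Fig. 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the EPIG-Seq algorithm: plain Pearson correlation stands in for the software's count-based correlation, the thresholds `r_min` and `min_members` and the function names are hypothetical, and the bootstrapped significance test is omitted.

```python
import numpy as np

def extract_seeds(profiles, r_min=0.8, min_members=3):
    """Step 1 (sketch): choose representative gene profiles as pattern seeds."""
    corr = np.corrcoef(profiles)                  # gene-by-gene Pearson r
    neighbours = (corr >= r_min).sum(axis=1) - 1  # similar genes, excluding self
    seeds = []
    # Visit genes from most to fewest similar neighbours.
    for gene in np.argsort(-neighbours, kind="stable"):
        if neighbours[gene] < min_members:
            break
        # Skip a candidate that resembles an already chosen seed.
        if all(corr[gene, s] < r_min for s in seeds):
            seeds.append(int(gene))
    return seeds

def assign_to_seeds(profiles, seeds, r_min=0.8):
    """Step 2 (sketch): label each gene with its best-matching seed, or -1."""
    n_genes = len(profiles)
    if not seeds:
        return np.full(n_genes, -1)
    corr = np.corrcoef(profiles)[:, seeds]
    best = corr.argmax(axis=1)
    best_r = corr[np.arange(n_genes), best]
    return np.where(best_r >= r_min, np.asarray(seeds)[best], -1)

# Hypothetical profiles: rows are genes, columns are experimental conditions.
profiles = np.array([
    [0.0, 1.0, 2.0, 3.0],   # rising
    [0.0, 2.0, 4.0, 6.0],   # rising
    [1.0, 2.0, 3.0, 4.5],   # rising
    [3.0, 2.0, 1.0, 0.0],   # falling
    [6.0, 4.0, 2.0, 0.0],   # falling
    [4.0, 3.0, 2.0, 1.0],   # falling
    [1.0, 3.0, 0.0, 2.0],   # no clear pattern
])
seeds = extract_seeds(profiles, r_min=0.8, min_members=2)
labels = assign_to_seeds(profiles, seeds, r_min=0.8)
print(seeds, labels.tolist())
```

Separating seed extraction from assignment mirrors the caption's description that one set of parameters governs pattern extraction and another governs clustering; genes that match no seed are left unassigned rather than forced into a pattern.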

**Fig. 2**
EPIG-Seq GUI. The EPIG-Seq GUI contains a main panel which allows users to define parameters for steps 1 and 2 of the analysis process. A dialog box displays the processing status and a command window displays the dependent processes running in the background

**Fig. 3**
EPIG-Seq analysis of the toxicogenomics MOA data. (a) Thumbnail plots of the gene expression profiles that are the representatives (those with the highest PCS) of each of the extracted patterns from the toxicogenomics MOA data. The title of each thumbnail plot indicates the number of the pattern extracted and the gene symbol. MOA groups are color-coded as follows: Control (*green*), AhR 2 (*red*), CAR/PXR (*yellow*), Cytotox (*light blue*), DNA Damage (*blue*) and PPARA (*pink*), with 9 samples (groups of 3 biological replicates per chemical) in each. The y-axis is the log base 2 ratio of each sample's RPM relative to the average of the control. (b) Heat map of the genes clustered to the four extracted patterns from the EPIG-Seq analysis of the toxicogenomics MOA data. The gene symbols are shown to the left of the heat map, with the 4 colors indicating the pattern number to which each gene was assigned. The columns indicate the chemicals within each of the MOA groups. The color scale represents the log base 2 ratio of each sample's data relative to the average of the control. (c) PCA of the toxicogenomics MOA data using the CY_s correlation measures of the genes clustered to the patterns by EPIG-Seq. The groups are color-coded as denoted in the legend. The x-axis is PC1, the y-axis is PC2 and the z-axis is PC3

**Fig. 4**
PCNA expression. (a) Gene expression of PCNA in TCGA normal and breast cancer samples. The x-axis denotes the breast cancer tumor subtype. The y-axis is the average of the log base 2 ratio of PCNA in each tumor subtype relative to the average of the normal samples. Standard error bars are shown for each data point. (b) PCNA protein immunohistochemistry staining of normal breast tissue with benign adenomas from a female age 23 (ID: 2773), using the HPA030522 antibody. (c) PCNA protein immunohistochemistry staining of breast cancer tissue (ductal carcinoma) from a female age 55 (ID: 2773), using the HPA030522 antibody

Cited by

Patterns, Profiles, and Parsimony: Dissecting Transcriptional Signatures From Minimal Single-Cell RNA-Seq Output With SALSA.
Lozoya OA, McClelland KS, Papas BN, Li JL, Yao HH. Front Genet. 2020 Oct 9;11:511286. doi: 10.3389/fgene.2020.511286. PMID: 33193599.
A Leveraged Signal-to-Noise Ratio (LSTNR) Method to Extract Differentially Expressed Genes and Multivariate Patterns of Expression From Noisy and Low-Replication RNAseq Data.
Lozoya OA, Santos JH, Woychik RP. Front Genet. 2018 May 16;9:176. doi: 10.3389/fgene.2018.00176. PMID: 29868123.
Best practices on the differential expression analysis of multi-species RNA-seq.
Chung M, Bruno VM, Rasko DA, Cuomo CA, Muñoz JF, Livny J, Shetty AC, Mahurkar A, Dunning Hotopp JC. Genome Biol. 2021 Apr 29;22(1):121. doi: 10.1186/s13059-021-02337-8. PMID: 33926528. Review.
Differential expression analysis using a model-based gene clustering algorithm for RNA-seq data.
Osabe T, Shimizu K, Kadota K. BMC Bioinformatics. 2021 Oct 20;22(1):511. doi: 10.1186/s12859-021-04438-4. PMID: 34670485.
Temporal Dynamic Methods for Bulk RNA-Seq Time Series Data.
Oh VS, Li RW. Genes (Basel). 2021 Feb 27;12(3):352. doi: 10.3390/genes12030352. PMID: 33673721. Review.

Grants and funding

Intramural NIH HHS/United States