BMC Genomics. 2016 Mar 22;17:255.
doi: 10.1186/s12864-016-2584-7.

EPIG-Seq: extracting patterns and identifying co-expressed genes from RNA-Seq data

Jianying Li et al. BMC Genomics. 2016.

Abstract

Background: RNA sequencing (RNA-Seq) measures genome-wide gene expression. RNA-Seq data are count-based, which makes normal-distribution models inappropriate for their analysis. Normalizing RNA-Seq data to transform them has limitations that can adversely affect the analysis. Furthermore, the few count-based methods available for RNA-Seq analysis are designed essentially for pairwise or multiclass comparisons of treatment groups; they are not pattern-based and do not identify co-expressed genes.

Results: We adapted our extracting patterns and identifying genes methodology to RNA-Seq count data (EPIG-Seq). The software uses a count-based correlation to measure similarity between genes, quasi-Poisson modelling to estimate dispersion in the data, and a location parameter to indicate the magnitude of differential expression. EPIG-Seq differs from other software currently available for pattern analysis of RNA-Seq data in that it 1) uses count-level data and supports cases of inflated zeros, 2) identifies statistically significant clusters of genes that are co-expressed across experimental conditions, 3) takes into account dispersion in the replicate data and 4) provides reliable results even with small sample sizes. EPIG-Seq operates in two steps: 1) it extracts pattern profiles from the data as seeds for clustering co-expressed genes and 2) it clusters the genes to the pattern seeds and computes the statistical significance of each pattern of co-expressed genes. EPIG-Seq provides a table of the genes with bootstrapped p-values and profile plots of the patterns of co-expressed genes. In addition, EPIG-Seq provides a heat map and a principal component dimension-reduction plot of the clustered genes as visual aids. We demonstrate the utility of EPIG-Seq through the analysis of toxicogenomics and cancer data sets to identify biologically relevant co-expressed genes. EPIG-Seq is available at: sourceforge.net/projects/epig-seq.
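
As a rough illustration of two ideas named above, the following Python sketch estimates a quasi-Poisson dispersion for one gene and assigns gene profiles to pattern seeds. It is not the EPIG-Seq implementation. The function names, the `r_min` threshold and the use of Pearson correlation (swapped in here for the count-based correlation the abstract mentions) are assumptions made for illustration. The dispersion is the standard Pearson chi-square statistic divided by the residual degrees of freedom for a Poisson model with one group factor, where the fitted values are the group means.

```python
import numpy as np

def quasi_poisson_dispersion(counts, groups):
    """Pearson chi-square / residual df for one gene (one-way Poisson model).

    With a single group factor, the fitted values are the group means.
    Hypothetical helper, not part of EPIG-Seq.
    """
    counts = np.asarray(counts, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    fitted = np.empty_like(counts)
    for g in labels:
        mask = groups == g
        fitted[mask] = counts[mask].mean()
    resid_df = counts.size - labels.size
    ok = fitted > 0  # all-zero groups contribute nothing to the statistic
    chi2 = np.sum((counts[ok] - fitted[ok]) ** 2 / fitted[ok])
    return chi2 / resid_df

def assign_to_seeds(profiles, seed_idx, r_min=0.8):
    """Assign each gene profile to its best-correlated seed, or -1 if no
    seed reaches r_min. Pearson correlation is a stand-in (assumption)."""
    profiles = np.asarray(profiles, dtype=float)
    corr = np.corrcoef(profiles)[:, seed_idx]  # shape: genes x seeds
    best = corr.argmax(axis=1)
    best_r = corr.max(axis=1)
    return np.where(best_r >= r_min, best, -1)

# Toy data: six replicate counts in two groups, and four expression profiles.
disp = quasi_poisson_dispersion([10, 12, 14, 20, 20, 20], [0, 0, 0, 1, 1, 1])
labels = assign_to_seeds(
    [[1, 2, 3], [2, 4, 6.5], [3, 2, 1], [5, 3, 1.2]], seed_idx=[0, 2]
)
```

A dispersion near 1 is what a plain Poisson model would predict; values well above 1 indicate overdispersion among replicates. Profiles with zero variance would produce undefined correlations and would need to be filtered out before calling `assign_to_seeds`.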

Conclusions: EPIG-Seq is unlike any other software currently available for pattern analysis of RNA-Seq count-level data across experimental groups. Analyzing RNA-Seq count data across biological conditions with EPIG-Seq makes it possible to extract biologically meaningful co-expressed genes associated with coordinated regulation.

Keywords: Cancer; Clustering; EPIG-Seq; Gene expression; Pattern analysis; RNA-Seq; Toxicogenomics.

Figures

Fig. 1
EPIG-Seq workflow. The workflow depicts the main steps of EPIG-Seq. The parameters are used in steps 1 and 2 to extract the patterns and cluster the genes respectively. The output is the statistically significant patterns with co-expressed genes
Fig. 2
EPIG-Seq GUI. The EPIG-Seq GUI contains a main panel which allows users to define parameters for steps 1 and 2 of the analysis process. A dialog box displays the processing status and a command window displays the dependent processes running in the background
Fig. 3
EPIG-Seq analysis of the toxicogenomics MOA data. a Thumbnail plots of the gene expression profiles that are the representatives (those with the highest PCS) of each of the extracted patterns from the toxicogenomics MOA data. The title of each thumbnail plot indicates the number of the extracted pattern and the gene symbol. MOA groups are color-coded as follows: Control (green), AhR 2 (red), CAR/PXR (yellow), Cytotox (light blue), DNA Damage (blue) and PPARA (pink), with 9 samples (groups of 3 biological replicates per chemical) in each. The y-axis is the log base 2 ratio of each sample's RPM relative to the average of the control. b Heat map of the genes clustered to the four extracted patterns from the EPIG-Seq analysis of the toxicogenomics MOA data. The gene symbols are shown to the left of the heat map, with the 4 colors indicating the pattern to which each gene is assigned. The columns indicate the chemicals within each of the MOA groups. The color scale represents the log base 2 ratio of each sample's data relative to the average of the control. c PCA of the toxicogenomics MOA data using the CYs correlation measures of the genes clustered to the patterns by EPIG-Seq. The groups are color-coded as denoted in the legend. The x-axis is PC1, the y-axis is PC2 and the z-axis is PC3
Fig. 4
PCNA expression. a Gene expression of PCNA in TCGA normal and breast cancer samples. The x-axis denotes the breast cancer tumor subtype. The y-axis is the average of the log base 2 ratio of PCNA in each tumor subtype relative to the average of the normal samples. Standard error bars are shown for each data point. b PCNA protein immunohistochemistry staining of normal breast tissue with benign adenomas from a 23-year-old female (ID: 2773), using the HPA030522 antibody. c PCNA protein immunohistochemistry staining of breast cancer tissue (ductal carcinoma) from a 55-year-old female (ID: 2773), using the HPA030522 antibody
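
The figure axes above are described as the log base 2 ratio of each sample's RPM (reads per million) relative to the average of the control samples. A minimal Python sketch of that transformation follows. The function name and the optional `pseudo` pseudocount are assumptions for illustration and are not taken from the paper; with the default `pseudo=0.0`, zero counts yield infinite or undefined values.

```python
import numpy as np

def log2_ratio_to_control(counts, is_control, pseudo=0.0):
    """Log2 ratio of per-sample RPM to the mean RPM of the control samples.

    counts: genes x samples matrix of raw read counts.
    is_control: boolean flag per sample marking the control columns.
    pseudo: optional pseudocount (hypothetical parameter) to guard zeros.
    """
    counts = np.asarray(counts, dtype=float)
    is_control = np.asarray(is_control, dtype=bool)
    lib_size = counts.sum(axis=0)            # total reads per sample
    rpm = counts / lib_size * 1e6            # reads per million
    control_mean = rpm[:, is_control].mean(axis=1, keepdims=True)
    return np.log2((rpm + pseudo) / (control_mean + pseudo))

# Toy example: two genes, two control samples followed by one treated sample.
ratios = log2_ratio_to_control(
    [[100, 100, 400], [900, 900, 600]], is_control=[True, True, False]
)
```

In the toy example, the first gene is four times more abundant in the treated sample than in the controls, so its log base 2 ratio in that column is 2, while the control columns sit at 0.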
