Bioinformatics. 2011 Dec 15;27(24):3333-40.
doi: 10.1093/bioinformatics/btr570. Epub 2011 Oct 12.

Pyicos: a versatile toolkit for the analysis of high-throughput sequencing data

Sonja Althammer et al. Bioinformatics.

Abstract

Motivation: High-throughput sequencing (HTS) has revolutionized gene regulation studies and is now fundamental for the detection of protein-DNA and protein-RNA binding, as well as for measuring RNA expression. With increasing variety and sequencing depth of HTS datasets, the need for more flexible and memory-efficient tools to analyse them is growing.

Results: We describe Pyicos, a powerful toolkit for the analysis of mapped reads from diverse HTS experiments: ChIP-Seq (with either punctuated or broad signals), CLIP-Seq and RNA-Seq. We demonstrate the effectiveness of Pyicos in selecting significant signals and show that its accuracy is comparable, and sometimes superior, to that of methods specifically designed for each particular type of experiment. Pyicos facilitates the analysis of a variety of HTS data types through its flexibility and memory efficiency, providing a useful framework for data integration into models of regulatory genomics.

Availability: Open-source software, with tutorials and protocol files, is available at http://regulatorygenomics.upf.edu/pyicos or as a Galaxy server at http://regulatorygenomics.upf.edu/galaxy

Contact: eduardo.eyras@upf.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Properties of candidate peaks. Cumulative plots of the fraction of Pyicos peaks with a motif along the ranking selected by Poisson P-value cut-offs for peaks with and without subtraction for (a) CEBPA and (b) NRSF. Cumulative plot of the fraction of peaks with motifs along the ranking for the top 3000 peaks predicted by Pyicos, MACS, FindPeaks and USeq, for (c) PR and (d) CTCF. (e) Memory performance of the same four methods on the CEBPA ChIP-Seq data.
Fig. 2.
Prediction of DE genes. (a) ROC curves for the benchmarking against the microarray data (Marioni et al., 2008) for DESeq, Pyicos, edgeR and DEGseq using read counts and replicated data. (b) Precision–recall curves for the benchmarking against microarray data for the same four methods using read counts and replicated data. (c) Memory performance of the same four methods on the EA of CEBPA ChIP-Seq dataset on the promoter region (Section 2). DESeq and edgeR are run in combination with BEDTools. (d) ROC curves of the different normalization methods: read counts (Counts), TMM-normalized counts (TMM counts), RPKMs and TRPKs, for the microarray benchmarking. (e) Absolute differences of the medians from the length distributions of DE and non-DE genes calculated with Pyicos using counts, TMM-normalized counts, RPKMs, TRPKs, and the corresponding value from the microarray data.
Fig. 3.
Detecting significant clusters in CLIP-Seq. (a) Genes with at least one significant cluster using Pyicos CLIP-Seq protocol (red) and the results published in Xue et al. (2009) (blue). (b) Beanplots (Kampstra, 2008) showing the distribution of heights for three subsets of read clusters: the significant clusters exclusively detected in Xue et al. (2009) and not by Pyicos (Only Xue et al.), the significant clusters exclusively found by Pyicos (Only Pyicos) and all the significant clusters found by Pyicos (All Pyicos).
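
The cumulative plots described for Fig. 1 track, at each rank in a list of peaks ordered by significance, the fraction of peaks so far that contain a motif. A minimal sketch of that calculation follows, assuming each peak is given as a (P-value, has-motif) pair. The function name and the example values are hypothetical illustrations, not Pyicos code or data from the paper.

```python
from itertools import accumulate


def cumulative_motif_fraction(peaks):
    """Return the running fraction of peaks with a motif along the ranking.

    `peaks` is a list of (p_value, has_motif) tuples (an assumed input
    format). Peaks are ranked from most to least significant, i.e. by
    ascending P-value.
    """
    ranked = sorted(peaks, key=lambda peak: peak[0])
    # Running count of motif-containing peaks down the ranking.
    hits = accumulate(int(has_motif) for _, has_motif in ranked)
    # Divide each running count by the number of peaks seen so far.
    return [count / rank for rank, count in enumerate(hits, start=1)]


# Hypothetical toy input, not taken from the paper.
example_peaks = [(1e-9, True), (1e-3, False), (1e-7, True), (1e-5, False)]
fractions = cumulative_motif_fraction(example_peaks)
print(fractions)
```

A curve that stays high over the top ranks and declines further down suggests that the ranking places motif-containing peaks first, which is the property the figure compares across methods.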

References

    1. Anders S., Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.
    2. Bullard J.H., et al. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94.
    3. ENCODE Consortium. A User's Guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol. 2011;9:e1001046.
    4. Fejes A.P., et al. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics. 2008;24:1729–1730.
    5. Flicek P., et al. Ensembl 2011. Nucleic Acids Res. 2011;39:D800–D806.
