Nucleic Acids Res. 2010 Sep;38(17):e171.
doi: 10.1093/nar/gkq667. Epub 2010 Aug 3.

A practical, bioinformatic workflow system for large data sets generated by next-generation sequencing

Cinzia Cantacessi et al. Nucleic Acids Res. 2010 Sep.

Abstract

Transcriptomics (at the level of single cells, tissues and/or whole organisms) underpins many fields of biomedical science, from understanding basic cellular function in model organisms, to the elucidation of the biological events that govern the development and progression of human diseases, and the exploration of the mechanisms of survival, drug-resistance and virulence of pathogens. Next-generation sequencing (NGS) technologies are contributing to a massive expansion of transcriptomics in all fields and are reducing the cost, time and performance barriers presented by conventional approaches. However, bioinformatic tools for the analysis of the sequence data sets produced by these technologies can be daunting to researchers with limited or no expertise in bioinformatics. Here, we constructed a semi-automated, bioinformatic workflow system, and critically evaluated it for the analysis and annotation of large-scale sequence data sets generated by NGS. We demonstrated its utility for the exploration of differences in the transcriptomes among various stages and both sexes of an economically important parasitic worm (Oesophagostomum dentatum) as well as the prediction and prioritization of essential molecules (including GTPases, protein kinases and phosphatases) as novel drug target candidates. This workflow system provides a practical tool for the assembly, annotation and analysis of NGS data sets, including for researchers with limited bioinformatic expertise. The custom-written Perl, Python and Unix shell computer scripts used can be readily modified or adapted to suit many different applications. This system is now utilized routinely for the analysis of data sets from pathogens of major socio-economic importance and can, in principle, be applied to transcriptomics data sets from any organism.

Figures

Figure 1.
Bioinformatic analyses of the Oesophagostomum dentatum data sets. Stars indicate analyses performed using custom-written Perl, Python and/or Unix shell computer scripts, accessible via http://research.vet.unimelb.edu.au/gasserlab/index.html. [1] Individual and combined expressed sequence tags (EST) data sets are assembled using CAP3 (compiled Linux 64-bit executable) to generate consensus sequences. [2] Assembled contigs with high similarity (cut-off: <1E-15) to nucleotide sequences of the vertebrate host (Sus scrofa) are eliminated. [3] Database similarity searches (for individual or combined data sets) are carried out using BLASTn and BLASTx (compiled Linux 64-bit executable; 42), embedded in custom-built Unix shell scripts. [4] Sequences (from the individually and combined assembled data sets) are conceptually translated into peptide sequences using ESTScan (compiled Linux 64-bit executable with a Perl wrapper). [5] Domains/motifs within translated peptides are identified via InterProScan (Perl wrapper) and linked to biological pathways in C. elegans using KOBAS (stand-alone Python application; 44). Functional annotation of the predicted peptides is performed by gene ontology (Perl wrapper; 27). [6] The individually assembled data sets are subtracted from one another (in both directions) using a BLASTn algorithm (42) embedded in a custom-built Unix shell script; proteins inferred from subtracted transcripts are assigned parental (i.e. level 1) InterPro terms and subtracted from one another using a BLASTp algorithm, embedded in a custom-built Unix shell script. [7] Potential drug target candidates for each of the individually assembled and/or in silico subtracted data sets are predicted and ranked according to the ‘severity’ of the non-wild-type RNAi phenotypes observed for the corresponding C. elegans orthologues/homologues (custom-built Unix shell scripts). [8] Probabilistic interaction networks among C. elegans orthologues of subtracted molecules are predicted (command lines).
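
Step [2] of the caption (removal of contigs with high similarity to host sequences at a cut-off of <1E-15) can be illustrated with a minimal Python sketch. This is not the authors' script. It assumes that the BLASTn search against Sus scrofa sequences was written in the standard 12-column tabular format, in which the query identifier is the first column and the E-value is the eleventh. The function names `host_contig_ids` and `filter_contigs`, and the toy contig and subject identifiers, are hypothetical and chosen only for illustration.

```python
import csv
import io

# Cut-off quoted in the caption: hits with an E-value below this count as host-like.
EVALUE_CUTOFF = 1e-15


def host_contig_ids(blast_tab, cutoff=EVALUE_CUTOFF):
    """Return IDs of query contigs with at least one hit below the E-value cutoff.

    blast_tab is an iterable of lines in 12-column tabular BLAST format
    (assumed layout: query ID in column 1, E-value in column 11).
    """
    hits = set()
    for row in csv.reader(blast_tab, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        query_id, evalue = row[0], float(row[10])
        if evalue < cutoff:
            hits.add(query_id)
    return hits


def filter_contigs(contigs, host_ids):
    """Keep only contigs whose ID was not flagged as host-like."""
    return {cid: seq for cid, seq in contigs.items() if cid not in host_ids}


# Toy data (hypothetical): contig1 has a strong host hit, contig2 a weak one,
# contig3 has no hit at all.
example_blast = io.StringIO(
    "contig1\tSscrofa_seqA\t98.0\t200\t4\t0\t1\t200\t10\t209\t1e-50\t380\n"
    "contig2\tSscrofa_seqB\t80.0\t60\t12\t0\t1\t60\t5\t64\t1e-05\t50\n"
)
example_contigs = {"contig1": "ACGT", "contig2": "GGCC", "contig3": "TTAA"}

host_ids = host_contig_ids(example_blast)
kept = filter_contigs(example_contigs, host_ids)
print(sorted(kept))  # ['contig2', 'contig3']
```

Only contig1 is removed here, because its best hit (1e-50) falls below the cut-off, whereas the weak hit for contig2 (1e-05) does not. The same E-value filter could, under the same assumptions, serve as a building block for the reciprocal subtraction described in step [6].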
