Gene. 2010 Nov 1;467(1-2):41-51.
doi: 10.1016/j.gene.2010.07.009. Epub 2010 Aug 5.

Detecting novel genes with sparse arrays

Mikko Arvas et al. Gene.

Abstract

Species-specific genes play an important role in defining the phenotype of an organism. However, current gene prediction methods can only efficiently find genes that share features such as sequence similarity or general sequence characteristics with previously known genes. Novel sequencing methods and tiling arrays can be used to find genes without prior information, and they have demonstrated that novel genes can still be found in extensively studied model organisms. Unfortunately, these methods are expensive and thus are not easily applicable, e.g., to finding genes that are expressed only in very specific conditions. We demonstrate a method for finding novel genes with sparse arrays, applying it to the 33.9 Mb genome of the filamentous fungus Trichoderma reesei. Our computational method does not require normalisations between arrays and it accounts for the multiple-testing problem typical of microarray data analysis. In contrast to tiling arrays, which use overlapping probes, only one 25-mer microarray oligonucleotide probe was used for every 100 b. Thus, relatively little space on a microarray slide was required to cover the intergenic regions of a genome. The analysis was done as a by-product of a conventional microarray experiment at no additional cost. We found at least 23 good candidates for novel transcripts that could code for proteins, all of which were expressed at high levels. Candidate genes were found to neighbour ire1, cre1 and many other regulatory genes. Our simple, low-cost method can easily be applied to finding novel species-specific genes without prior knowledge of their sequence properties.
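The abstract does not define the scoring procedure, so the following is only a minimal sketch of one plausible construction, not the authors' method. It assumes a per-array percentile cutoff (which needs no normalisation between arrays), a binomial tail probability for the number of high-signal probes inside each candidate ORF, and a shuffled-signal baseline as a crude guard against multiple testing. All function names, the data layout and the choice of a binomial model are hypothetical.

```python
import math
import random


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def orf_scores(signals, orf_probes, q=75.0):
    """Score each ORF by how unlikely its count of high-signal probes is.

    signals:    dict probe_id -> signal, all intergenic probes of ONE array
    orf_probes: dict orf_id -> list of probe_ids located inside that ORF
    The cutoff is computed within the array, so no cross-array
    normalisation is needed (assumed design, not taken from the paper).
    """
    cutoff = percentile(list(signals.values()), q)
    p_high = sum(v > cutoff for v in signals.values()) / len(signals)
    scores = {}
    for orf, probes in orf_probes.items():
        k = sum(signals[p] > cutoff for p in probes)
        scores[orf] = binomial_tail(k, len(probes), p_high)
    return scores


def randomized_scores(signals, orf_probes, rounds=100, seed=0):
    """Null baseline: shuffle signals among probes and rescore each round."""
    rng = random.Random(seed)
    ids = list(signals)
    results = []
    for _ in range(rounds):
        vals = list(signals.values())
        rng.shuffle(vals)
        results.append(orf_scores(dict(zip(ids, vals)), orf_probes))
    return results


# Toy data: 100 probes with signals 0..99; one ORF covers the four
# strongest probes, another the four weakest.
signals = {i: float(i) for i in range(100)}
orf_probes = {"high_orf": [96, 97, 98, 99], "low_orf": [0, 1, 2, 3]}
scores = orf_scores(signals, orf_probes)
null = randomized_scores(signals, orf_probes, rounds=100)
```

Comparing the observed score distribution with the scores from the shuffled rounds gives a rough sense of how many low scores would be expected by chance alone.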

Figures

Fig. 1
33_421 with probe signal and EST data. On top, a ruler with genomic coordinates in scaffold 33. Below it, a panel shows the positions of old genes (green), the candidate gene (yellow) and ESTs (blue) as arrows. Arrows point in the direction of transcription. The bottom 6 panels show the average signals of the three repeats of each condition from individual probes as vertical bars, before and after GC% scaling. Bars are positioned at the genomic location of the probes as specified by the top ruler. Bars for probes whose signal value is above the 75th percentile of the signals of all intergenic probes are colored in red. The location of the candidate gene is based on a simple ORF prediction that takes into account neither splicing nor UTRs.
Fig. 2
Histograms of p-score distributions. The distribution of the p-scores for each ORF is shown in the original data, as well as their averages in 100 randomizations. The results are for the experimental condition HD03, computed according to the 75th percentile.
Fig. 3
Length distribution of verified UTRs and candidate genes as UTRs. The distribution of the length of each candidate gene plus its distance to the nearest stop or start codon of an old gene, whichever is closer (i.e. the length the candidate gene would have if it were a UTR), and the distribution of experimentally verified UTRs in fungi. Plus signs indicate the mid values of bins (500 b bins) and lines connect them. The counts of candidate genes are shown for each bin. Candidate genes that overlapped genes in version 2.0 have been excluded. The Y axis shows the percentage of values in a bin for the four categories.
Fig. 4
Scatterplot of gene expression signals. X axis shows the mean of log2 of gene expression signals for the three conditions and Y axis the respective standard deviation. Small black dots show the signals from old genes. Values for candidate genes are colored based on the evidence found for them: either no other evidence was found, ESTs were found, a homologue was found in another organism or the gene was successfully predicted in version 2.0 of the genome. White plus signs indicate the Top candidate genes which are particularly likely to be true novel genes based on analysis of UTR sequences.
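
The two axes described for Fig. 4 can be computed per gene as in the short sketch below. The function name is hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption, since the caption does not say which was used.

```python
import math
import statistics


def log2_mean_sd(condition_signals):
    """Return (mean, sample SD) of the log2 signals of one gene.

    condition_signals: positive expression signals, one per condition.
    """
    logs = [math.log2(v) for v in condition_signals]
    return statistics.mean(logs), statistics.stdev(logs)


# Toy signals for one gene across three conditions.
x, y = log2_mean_sd([2.0, 4.0, 8.0])
```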
