Sci Rep. 2023 Dec 8;13(1):21770.
doi: 10.1038/s41598-023-41085-6.

Ambiguous genes due to aligners and their impact on RNA-seq data analysis

Alicja Szabelska-Beresewicz et al. Sci Rep.

Abstract

This study focuses on ambiguous genes, i.e. genes whose expression is difficult to estimate from data produced by next-generation sequencing technologies. We considered RNA sequencing (RNA-seq) experiments performed on the Illumina platform. Identifying such genes and understanding why they are difficult is crucial, because they may be involved in disease: misleading expression estimates could contribute to a misunderstanding of the cause of a disease and, in turn, to inappropriate treatment. We hypothesized that ambiguous genes are difficult to map because of their complex structure. We therefore ran RNA-seq analyses with different mappers to find genes whose measurements differ between aligners. We identified such genes using a generalized linear model with two factors: the mapper and the experimental group. A large proportion of ambiguous genes are pseudogenes. The high sequence similarity of pseudogenes to functional genes may point to problems in the alignment procedures. In addition, a predictive analysis assessed the performance of difficult genes in classification: we compared how effectively samples were classified into their groups when the expression of difficult versus non-difficult genes was used as covariates. In almost all cases considered, ambiguous genes had less predictive power.
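
To make the two-factor model concrete, here is a minimal sketch for a single gene. The abstract states only that a generalized linear model with mapper and group factors was used; the Poisson family, the log link, the likelihood-ratio test for the mapper factor, the mapper names `m1` to `m3`, the group labels and all counts below are assumptions made for illustration. RNA-seq analyses commonly use negative binomial models instead, so this is not the authors' exact procedure.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_poisson(X, y, iters=50, tol=1e-10):
    """Fit a log-link Poisson GLM by IRLS; return (coefficients, deviance).

    Column 0 of X is assumed to be the intercept.
    """
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)
    for _ in range(iters):
        eta = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
        mu = [math.exp(e) for e in eta]
        # Working response; the IRLS weights for this model equal mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx = [[sum(w * r[i] * r[j] for r, w in zip(X, mu)) for j in range(p)]
                for i in range(p)]
        xtwz = [sum(w * r[i] * zi for r, w, zi in zip(X, mu, z)) for i in range(p)]
        new = solve(xtwx, xtwz)
        converged = max(abs(a - b) for a, b in zip(new, beta)) < tol
        beta = new
        if converged:
            break
    mu = [math.exp(sum(xj * bj for xj, bj in zip(row, beta))) for row in X]
    dev = 2 * sum((yi * math.log(yi / m) if yi > 0 else 0.0) - (yi - m)
                  for yi, m in zip(y, mu))
    return beta, dev

def mapper_lrt(samples, mappers=("m1", "m2", "m3")):
    """Likelihood-ratio test of the mapper factor for one gene.

    samples: list of (mapper, group, count); group is "control" or "case".
    The p-value formula exp(-x/2) is the chi-square survival function for
    2 degrees of freedom only, i.e. it assumes exactly three mappers.
    """
    y = [c for _, _, c in samples]
    full = [[1.0] + [float(m == k) for k in mappers[1:]] + [float(g == "case")]
            for m, g, _ in samples]
    reduced = [[1.0, float(g == "case")] for _, g, _ in samples]
    lr = fit_poisson(reduced, y)[1] - fit_poisson(full, y)[1]
    return lr, math.exp(-max(lr, 0.0) / 2)

def toy(counts_by_mapper):
    """Build (mapper, group, count) rows from made-up counts (2 control, 2 case)."""
    groups = ["control", "control", "case", "case"]
    return [(m, g, c) for m, cs in counts_by_mapper.items()
            for g, c in zip(groups, cs)]

# Hypothetical genes: one counted identically by all mappers, one not.
stable = toy({"m1": [100, 110, 200, 210], "m2": [100, 110, 200, 210],
              "m3": [100, 110, 200, 210]})
sensitive = toy({"m1": [100, 110, 200, 210], "m2": [98, 112, 205, 195],
                 "m3": [30, 35, 60, 70]})

for name, gene in (("stable", stable), ("sensitive", sensitive)):
    lr, p = mapper_lrt(gene)
    print(f"{name}: LR statistic = {lr:.3f}, p = {p:.3g}")
```

In this sketch, a gene whose counts depend strongly on the mapper produces a large drop in deviance when the mapper factor is added to the model, which is one way a gene could be flagged as mapper-sensitive.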

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Barplot of gene abundance, measured as the mean count, for all genes and with difficult genes (DGs) excluded, for each dataset and mapper. Colours represent different levels of gene count abundance.
Figure 2
Read counts of exemplary DGs for each dataset. For each exemplary gene and dataset, read counts are shown for each mapper. The genes shown have the most significant statistics for the mapper and group effects.
Figure 3
Coverage of two exemplary DGs from dataset GSE22260. Each line represents the average coverage across samples from the considered groups. Colours correspond to the mappers, and shaded areas represent 95% confidence intervals. Part (a) shows a pseudogene with only one transcript and one exon. Part (b) shows a protein-coding gene with a more complex structure.
Figure 4
Percentage of misclassified samples for each dataset. For each dataset and for a number of predictors equal to 1/3, 1/2 and 2/3 of the number of samples, violin plots are drawn for 10 simulations of the joint “ensemble” classifier. The base classifiers used in the ensemble were a support vector machine, a random forest, a neural network and rpart. Colour indicates the difficulty case: green is the “no” case, which uses DEGs that are not difficult; red is the “yes” case, which uses DEGs that are difficult.
Figure 5
Pipeline for the procedure of seeking DGs.
