Nucleic Acids Res. 1996 Jan 15;24(2):316-20.
doi: 10.1093/nar/24.2.316.

Cleaning the GenBank Arabidopsis thaliana data set

P G Korning et al. Nucleic Acids Res. 1996.

Abstract

Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, its possibilities are drastically impaired if the stored data are unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate: more than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns that did not stem from alternative splicing. In a few cases the errors are mere typographical misprints, which may be corrected by comparison with the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. We propose that the level of error correction be increased and that gene structure sanity checks be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A. thaliana is made available through anonymous FTP.
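
The abstract does not list the individual sanity checks, so the following Python sketch is only an illustration of what a gene structure check of this kind might look like, not the authors' procedure. It assumes GenBank-style 1-based inclusive exon coordinates on the forward strand, and it tests three things that are assumptions here: exons lie inside the sequence, exons are ordered and non-overlapping, and each implied intron starts with the canonical GT donor and ends with the canonical AG acceptor. The function name `check_gene_structure` and the `min_intron` threshold are hypothetical.

```python
from typing import List, Tuple


def check_gene_structure(
    seq: str,
    exons: List[Tuple[int, int]],
    min_intron: int = 4,  # hypothetical threshold: room for GT and AG without overlap
) -> List[str]:
    """Return a list of problems found; an empty list means no check failed.

    `exons` holds (start, end) pairs, 1-based and inclusive, forward strand only.
    """
    problems: List[str] = []
    seq = seq.upper()

    # Check 1: every exon must lie inside the sequence and have start <= end.
    for start, end in exons:
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {start}..{end} lies outside 1..{len(seq)}")
    if problems:
        return problems  # coordinates are unusable, so skip the intron checks

    # Checks 2 and 3: look at each pair of neighbouring exons and the intron between them.
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        if s2 <= e1:
            problems.append(f"exons {s1}..{e1} and {s2}..{e2} overlap or are out of order")
            continue
        # Intron covers 1-based positions e1+1 .. s2-1, i.e. the 0-based slice [e1, s2-1).
        intron = seq[e1:s2 - 1]
        if len(intron) < min_intron:
            problems.append(f"intron {e1 + 1}..{s2 - 1} is shorter than {min_intron} nt")
            continue
        if intron[:2] != "GT":
            problems.append(f"intron {e1 + 1}..{s2 - 1} does not start with GT")
        if intron[-2:] != "AG":
            problems.append(f"intron {e1 + 1}..{s2 - 1} does not end with AG")
    return problems


# Toy sequence (invented for illustration): exon, intron GT...AG, exon.
toy_seq = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
for annotation in ([(1, 6), (19, 24)], [(1, 7), (19, 24)]):
    print(annotation, check_gene_structure(toy_seq, annotation))
```

A check like this flags an entry for manual review rather than proving it wrong, since real genes also contain non-canonical splice sites; a shifted exon boundary, as in the second toy annotation, is the kind of splice site misassignment the abstract describes as the most common error.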
