Nucleic Acids Res. 2010 Jan;38(Database issue):D52-6.
doi: 10.1093/nar/gkp960. Epub 2009 Nov 1.

Non-redundant patent sequence databases with value-added annotations at two levels

Weizhong Li et al. Nucleic Acids Res. 2010 Jan.

Abstract

The European Bioinformatics Institute (EMBL-EBI) provides public access to patent data, including abstracts, chemical compounds and sequences. Sequences can appear multiple times because the same invention is filed with multiple patent offices, or because different inventors use the same sequence in different contexts. Information relating to the source invention may be incomplete, and biological information available in patent documents elsewhere may not be reflected in the annotation of the sequence. Search and analysis of these data have become increasingly challenging for both the scientific and intellectual-property communities. Here, we report a collection of non-redundant patent sequence databases, which cover the EMBL-Bank nucleotide patent class and the patent protein databases and contain value-added annotations from patent documents. The databases were created at two levels by the use of sequence MD5 checksums. Sequences within a level-1 cluster are 100% identical over their whole length. Level-2 clusters were defined by sub-grouping level-1 clusters based on patent family information. Value-added annotations, such as publication number corrections, earliest publication dates and feature collations, significantly enhance the quality of the data, allowing for better tracking and cross-referencing. The databases are available at http://www.ebi.ac.uk/patentdata/nr/.
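The two-level clustering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `PatentSequence` record, its field names, the sample accessions and the normalization step (stripping whitespace and upper-casing before hashing) are all assumptions made for illustration. It shows only the general idea of grouping by sequence MD5 checksum (level 1) and then sub-grouping by patent family (level 2).

```python
import hashlib
from collections import defaultdict
from typing import NamedTuple


class PatentSequence(NamedTuple):
    # All field names are hypothetical, chosen for this illustration.
    accession: str       # entry identifier
    sequence: str        # nucleotide or protein residues
    patent_family: str   # patent family identifier


def sequence_md5(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence.

    Assumption: whitespace is removed and case is ignored before hashing.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def cluster_two_levels(entries):
    """Group entries by checksum (level 1) and by checksum plus family (level 2)."""
    level1 = defaultdict(list)
    level2 = defaultdict(list)
    for entry in entries:
        checksum = sequence_md5(entry.sequence)
        level1[checksum].append(entry.accession)
        level2[(checksum, entry.patent_family)].append(entry.accession)
    return dict(level1), dict(level2)


# Made-up sample data: three entries share a sequence, two of them share a family.
entries = [
    PatentSequence("A1", "ACGTACGT", "FAM1"),
    PatentSequence("A2", "acgt acgt", "FAM1"),
    PatentSequence("A3", "ACGTACGT", "FAM2"),
    PatentSequence("A4", "TTTTGGGG", "FAM2"),
]
level1, level2 = cluster_two_levels(entries)
print(len(level1), len(level2))
```

With this sample input, the identical sequence filed under two families forms one level-1 cluster but two level-2 clusters, which mirrors the distinction between "same sequence" and "same sequence in the same invention".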

Figures

Figure 1.
Data growth of EMBL-Bank patent class. The curve indicates the number of entries in the EMBL-Bank patent class has increased dramatically during the past decade.
Figure 2.
Steps to create non-redundant patent sequence databases at two levels. Squares of the same colour represent level-1 sequences, 100% identical over the whole length. Squares of the same colour and pattern represent level-2 sequences, which are identical and belong to the same invention (i.e. patent family).
Figure 3.
Publication number error types detected in the sequence data set (both nucleotide and protein). ‘KC only’ represents the errors of incorrect KC only; ‘KC completeness only’ represents the errors of incomplete KC only; ‘KC + PN’ represents the errors of wrong PN and wrong KC; ‘KC completeness + PN’ represents the errors of incomplete KC and wrong PN; ‘PN only’ represents the errors of wrong PN only; ‘Publication Level only’ represents errors of publication level only; and ‘Pending’ represents publication numbers which currently cannot be resolved and are pending for corrections.
