Curr Protoc. 2021 Aug;1(8):e217.
doi: 10.1002/cpz1.217.

Exploring Chemical Information in PubChem

Sunghwan Kim. Curr Protoc. 2021 Aug.

Abstract

PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical database that serves scientific communities as well as the general public. This database collects chemical information from hundreds of data sources and organizes it into multiple data collections, including Substance, Compound, BioAssay, Protein, Gene, Pathway, and Patent. These collections are interlinked with each other, allowing users to discover related records in the various collections (e.g., drugs targeting a protein or genes modulated by a chemical). PubChem can be searched by keyword (e.g., a chemical, protein, or gene name) as well as by chemical structure. The input structure can be provided using popular line notations or drawn with the PubChem Sketcher. PubChem supports various types of structure searches, including identity search, 2-D and 3-D similarity searches, and substructure and superstructure searches. Results from multiple searches can be combined using Boolean operators (i.e., AND, OR, and NOT) to formulate complex queries. PubChem allows the user to quickly retrieve a list of records annotated with a particular classification or ontological term. This paper provides step-by-step instructions on how to explore PubChem data with examples of commonly requested tasks.

© 2021. This article is a U.S. Government work and is in the public domain in the USA. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1: Finding genes and proteins that interact with a given compound
Basic Protocol 2: Finding drug-like compounds similar to a query compound through a two-dimensional (2-D) similarity search
Basic Protocol 3: Finding compounds similar to a query compound through a three-dimensional (3-D) similarity search
Support Protocol: Computing similarity scores between compounds
Basic Protocol 4: Getting the bioactivity data for the hit compounds from substructure search
Basic Protocol 5: Finding drugs that target a particular gene
Basic Protocol 6: Getting bioactivity data of all chemicals tested against a protein
Basic Protocol 7: Finding compounds annotated with classifications or ontological terms
Basic Protocol 8: Finding stereoisomers and isotopomers of a compound through identity search

Keywords: PubChem; chemical structure search; cheminformatics; drug discovery; molecular similarity; public database.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Searching PubChem using a text query. When a text query is provided (1), PubChem searches multiple collections for relevant records, and the hits from each collection can be viewed by clicking the corresponding tab (indicated in the purple box). When possible, PubChem suggests the best hit at the top of the search results. For example, when the chemical name losartan is used as a query, PubChem suggests CID 3961 as the best hit. Clicking this record or one of the hits found in the Compound collection directs the user to its compound page (2).
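The name-to-record lookup described in this caption can also be approximated programmatically. The sketch below only builds request URLs; it assumes PubChem's PUG REST interface (not mentioned in the text above) for resolving a name to CIDs, and reuses the compound page URL pattern that appears in the Figure 2 caption. The function names are hypothetical, and no network request is made.

```python
from urllib.parse import quote

# Assumption: base URL of PubChem's PUG REST service (not part of the original text).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name: str) -> str:
    """Build a URL that asks PUG REST for the CIDs matching a chemical name (JSON)."""
    # Percent-encode the name so spaces and slashes are safe inside a URL path.
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


def compound_page_url(cid: int) -> str:
    """Build the Compound Summary page URL for a CID."""
    return f"https://pubchem.ncbi.nlm.nih.gov/compound/{cid}"


if __name__ == "__main__":
    print(name_to_cids_url("losartan"))
    print(compound_page_url(3961))
```

Fetching the first URL (for example with `urllib.request.urlopen`) should return a JSON document listing matching CIDs; which CID is reported as the best hit depends on the live database.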
Figure 2
Navigating the Compound Summary page of losartan (CID 3961) (https://pubchem.ncbi.nlm.nih.gov/compound/3961). The user can navigate the Compound Summary page using the Table of Contents (1), available in the right column. To find the macromolecules that losartan interacts with, click “DrugBank Interactions” (2) in the Table of Contents. The information presented in each section can be downloaded by clicking the “Download” button (3). When a section has too much information to display on the Summary page, only the first few items are shown. To view all information available for the section, the user should click the full‐screen view button (4). All information presented on the Compound Summary page can be downloaded through the “Download” button available at the top‐right corner of the page (indicated in the purple box).
Figure 3
Performing a similarity search using a hit compound returned from a previous search. Each hit compound is presented with links that allow the user to access commonly requested data or services relevant to the compound. Among them is the “Similar Structures Search” link (1). Clicking this link will invoke multiple structure searches [including 2‐D similarity search (2)] using the compound as a query and present the search results. The user can rerun the 2‐D similarity search with a different similarity threshold (3) and apply filters (4) to refine the hit compounds based on several molecular properties. The hit compound list can be downloaded using the “Download” button (5). The result for the 3‐D similarity search can be viewed by clicking the “3D similarity” tab (indicated in the purple box).
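To make the role of the similarity threshold concrete, here is a minimal sketch of threshold filtering. The caption does not name the similarity measure; the sketch assumes the Tanimoto coefficient over binary fingerprints, which is the measure commonly used for 2-D similarity. The fingerprints and record labels are toy values, not real PubChem data, and the function names are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def hits_above(query: set, library: dict, threshold: float) -> list:
    """Return the labels of records whose similarity to the query meets the threshold."""
    return sorted(k for k, fp in library.items() if tanimoto(query, fp) >= threshold)


# Toy fingerprints (hypothetical; real fingerprints have hundreds of bits).
QUERY = {1, 2, 3, 4}
LIBRARY = {
    "A": {1, 2, 3, 4},  # identical to the query
    "B": {1, 2, 3, 5},  # shares 3 of 5 distinct bits with the query
    "C": {7, 8},        # shares nothing with the query
}

if __name__ == "__main__":
    for cutoff in (0.9, 0.5):
        print(cutoff, hits_above(QUERY, LIBRARY, cutoff))
```

Lowering the threshold admits more distant structures, which mirrors rerunning the search with a different threshold as described in the caption.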
Figure 4
The Settings button available for the 3‐D similarity search and the download button for compound records. The Settings button (1) allows users to select the compound tiers against which the 3‐D similarity search is performed (see the main text for the three‐tiered 3‐D structure search). The download button (2) allows for downloading compound records in various file formats. To download up to 10 conformers per compound in a compressed structure‐data file (SDF) format, select “3D” for coordinate type (3), “10” for the number of conformers (4), and “gzip” for compression (5), and click the “SDF” button (6).
Figure 5
Computing similarity scores between compounds, using the PubChem Score Matrix Service (https://pubchem.ncbi.nlm.nih.gov/score_matrix/). One of three score types (2‐D similarity as well as shape‐ and feature‐optimized 3‐D similarities) can be selected through a dropdown menu (1). Additional options (2) are available for 3‐D similarity score computation. The list(s) of CIDs for similarity score computation can be provided in a text box or uploaded in a file (3). The output format (4) and compression method (5) can be selected through dropdown menus. Clicking the “Submit Job” button starts the similarity score computation.
Figure 6
The concept of the substructure and superstructure. The structure of CID 15207492 (substructure) appears as a part of CID 3961 (superstructure).
Figure 7
Using the PubChem Sketcher to provide a query structure for chemical structure searches. The PubChem Sketcher can be accessed from the PubChem homepage through the “Draw Structure” button (1). The query structure can be drawn manually or converted from a line notation such as a SMILES or InChI string (2). Clicking the “Search for This Structure” button (3) initiates the structure searches.
Figure 8
Retrieving bioactivity data for the compounds returned from substructure search. When the input structure is provided, PubChem performs multiple types of structure search. The results of the substructure search can be viewed by clicking the “Substructure” tab (1). By default, structure search stops when it finds 1000 hit compounds. If the user wants to find more than 1000 hit compounds, it is necessary to check the “Search All” box (2). The bioactivity data for the hit compounds can be retrieved by clicking the “Linked Data Sets” button available on the right column (3) and then selecting “Bioactivities” from the popup menu (4). The bioactivity data can be downloaded through the “Download” button (5).
Figure 9
Search by gene/protein name using “type 1 angiotensin II receptor” as an example. When a gene/protein name is used as a query (1), multiple collections are searched. Clicking the “Genes” tab shows the gene records returned from the search (2). To view hit protein records, click the “Proteins” tab (indicated in the purple box). The filter (3) allows for selecting only the human gene records (4). Clicking the human AGTR1 gene (5) directs the user to its Summary page. Note that gene records may have associated bioassay and/or pathway records in PubChem (as indicated in the blue box).
Figure 10
Using the Gene Summary page for the human type‐1 angiotensin II receptor (https://pubchem.ncbi.nlm.nih.gov/gene/185) to find drugs targeting the gene (or the proteins that it encodes). The Table of Contents on the right column (1) can be used to navigate the Gene Summary page. Clicking the “DrugBank Drugs” (2) directs the user to the section that contains information on drugs targeting the gene, curated by DrugBank. The information presented in this section can be downloaded (3). The Full‐screen view button (4) presents additional information in a full‐screen view mode. For each drug, there are links to the corresponding records in the PubChem Compound and DrugBank (indicated in the yellow and blue boxes, respectively) as well as links to the PubMed records that provide the evidence of the drug‐target information (indicated in the purple box).
Figure 11
Using the Protein Summary page for the human type‐1 angiotensin II receptor (https://pubchem.ncbi.nlm.nih.gov/protein/P30556) to find compounds tested against the protein and its rat orthologs. This page can be navigated using the Table of Contents on the right column (1). Clicking the “Tested Compounds” (2) directs the user to the “tested compound” section. The bioactivity data for these compounds against the target protein can be downloaded through the “Download” button (3), and additional information can be viewed by clicking the “Full‐screen view” button (4). A list of the orthologs of the protein can be accessed by clicking the “Orthologous Proteins” section (5). Clicking “P29089 (Norway rat)” in this section (6) leads to its Protein Summary page, where information on tested compounds against the rat orthologs can be found.
Figure 12
Finding records annotated with classification and ontological terms, using the PubChem Classification Browser (https://pubchem.ncbi.nlm.nih.gov/classification/). The classification browser can also be accessed by clicking the “Browse Data” button (1), available on the PubChem homepage. To find compounds annotated with the Medical Subject Headings (MeSH) term “Antihypertensive Agents”, select “MeSH” for classification (2) and “Compound” for data type counts to display (3), then type “Antihypertensive Agent” in the search box (4). Clicking the compound record count (5) for the MeSH term will show the relevant records (see Fig. 13). Note that MeSH terms are organized in a hierarchical (tree) structure (as indicated in the blue box). The view type menu (indicated in the purple box) lets the user view the returned MeSH terms as either a list or a tree.
Figure 13
Saving a search result for later use. A search can be saved by clicking the “Save for Later” button (1) and giving it an alias (2). Once the search is saved successfully, the “Saved Search” button appears above the search box.
Figure 14
Combining saved searches to perform a complex search. Clicking the “Saved Searches” button (1) presents a dialog box in which saved searches can be combined using Boolean operators (AND, OR, and NOT). In this screenshot, two saved searches “MySearch1” and “MySearch2” are combined with the AND operator (2) and added to the list of saved searches. The resulting hits can be viewed by clicking the “View Results” button (3).
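The Boolean combination of saved searches corresponds to ordinary set operations on the hit lists. The sketch below assumes each saved search has already been reduced to a collection of record identifiers; the `combine` function and the toy identifiers are hypothetical and are not part of PubChem.

```python
def combine(left, right, op: str) -> set:
    """Combine two hit lists with a Boolean operator (AND, OR, or NOT)."""
    left, right = set(left), set(right)
    if op == "AND":
        return left & right   # records found by both searches
    if op == "OR":
        return left | right   # records found by either search
    if op == "NOT":
        return left - right   # records in the first search but not the second
    raise ValueError(f"unsupported operator: {op!r}")


# Toy identifiers standing in for the hits of two saved searches.
my_search_1 = [1, 2, 3]
my_search_2 = [2, 3, 4]

if __name__ == "__main__":
    for op in ("AND", "OR", "NOT"):
        print(op, sorted(combine(my_search_1, my_search_2, op)))
```

Note that NOT is not symmetric: swapping the two searches changes the result, so the order in which saved searches are combined matters.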
Figure 15
Performing an identity search. The query “CID 60846 structure” (1) initiates various types of structure searches using the structure of CID 60846 as a query. The result of the identity search can be viewed by clicking the “Identity” tab (2). The Settings button allows users to select one of several definitions of chemical identity (3).
Figure 16
Compounds with conflicting and nonconflicting stereocenters.
