Mol Inform. 2019 Jun;38(6):e1800082.
doi: 10.1002/minf.201800082. Epub 2019 Mar 7.

PySpark and RDKit: Moving towards Big Data in Cheminformatics

Mario Lovrić et al. Mol Inform. 2019 Jun.

Abstract

The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, accessed through PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations. This requires only basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark-RDKit implementation is provided. The use cases showed that Spark provides reasonable scalability, depending on the use case, and can be a suitable choice for datasets too big to be processed with current low-end workstations.
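To illustrate the fingerprint similarity use case, the following is a minimal sketch of how an RDKit calculation might be mapped over an RDD with PySpark. It is not the authors' published source code. The function names (`tanimoto`, `fingerprint_partition`, `similar_molecules`), the choice of the RDKit topological fingerprint, the 0.7 threshold, and the representation of fingerprints as sets of on-bit indices are all assumptions made for this example.

```python
from typing import FrozenSet, Iterable, Iterator, List, Tuple


def tanimoto(bits_a: FrozenSet[int], bits_b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def fingerprint_partition(
    smiles_iter: Iterable[str],
) -> Iterator[Tuple[str, FrozenSet[int]]]:
    """Turn one partition of SMILES strings into (SMILES, on-bit set) pairs."""
    # Imported lazily so that RDKit is loaded on each executor, not only on the driver.
    from rdkit import Chem

    for smiles in smiles_iter:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip records RDKit cannot parse
            continue
        # Assumption: RDKit topological fingerprint; other fingerprints would also work.
        yield smiles, frozenset(Chem.RDKFingerprint(mol).GetOnBits())


def similar_molecules(
    smiles: List[str], query_smiles: str, threshold: float = 0.7
) -> List[Tuple[str, float]]:
    """Return (SMILES, similarity) pairs at or above the threshold (hypothetical helper)."""
    from pyspark.sql import SparkSession

    query = next(fingerprint_partition([query_smiles]), None)
    if query is None:
        raise ValueError("query SMILES could not be parsed")
    query_bits = query[1]

    spark = SparkSession.builder.appName("rdkit-similarity-sketch").getOrCreate()
    rdd = spark.sparkContext.parallelize(smiles)
    return (
        rdd.mapPartitions(fingerprint_partition)  # fingerprints computed per partition
        .map(lambda pair: (pair[0], tanimoto(pair[1], query_bits)))
        .filter(lambda pair: pair[1] >= threshold)
        .collect()
    )
```

With PySpark and RDKit installed, a call such as `similar_molecules(["CCO", "CCN", "c1ccccc1"], "CCO")` would run the comparison in local or cluster mode, depending on the Spark configuration. Storing fingerprints as plain Python sets keeps the RDD records trivially serializable, at the cost of speed compared with RDKit's native bit vector operations.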

Keywords: Apache Spark; Hadoop; Python; QSAR; pandas.
