Shuan Chen

PhD Student in KAIST CBE

RDKit - the Python package for in-silico molecule programming

RDKit (https://www.rdkit.org/) is an important Python package for anyone who wants to apply machine learning to molecules or simply parse molecules in silico.
This tutorial shows how to install RDKit on your computer and walks through some basic usage.

Installation

You can install RDKit by simply running

conda install -c conda-forge rdkit

Or create a new environment for this package and activate it whenever you need to use RDKit:

conda create -c conda-forge -n rdenv rdkit
conda activate rdenv
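
To check that the installation worked, one option (a small sketch, not part of the official install instructions) is to import the package in Python and print its version. If the import fails, the environment where RDKit was installed is probably not the active one.

```python
# Quick sanity check: the import only succeeds if RDKit is installed
# in the currently active environment.
import rdkit

print(rdkit.__version__)  # prints the installed RDKit version string
```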

Molecule representation in text (string)

SMILES is one of the most common ways to represent a molecule as a string.
For instance, ethanol (C2H5OH) can be written as OCC in SMILES.
You can also draw molecules on ZINC and look at the resulting SMILES to get a better feel for this representation.

To let the computer understand what OCC really means (on its own, the computer has no idea that OCC stands for ethanol or C2H5OH), we need to use RDKit to convert the SMILES into a MOL object.

Basic RDKit tutorial

From SMILES to MOL

In RDKit, it is very easy to convert SMILES to MOL with a single function, Chem.MolFromSmiles().
For instance, you can write

from rdkit import Chem
smiles = 'OCC'
mol = Chem.MolFromSmiles(smiles)
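
One thing to watch out for: when RDKit cannot parse a SMILES string, Chem.MolFromSmiles() returns None instead of raising an exception. The sketch below wraps the call in a small helper that fails loudly. The helper name smiles_to_mol is made up for this example and is not part of RDKit.

```python
from rdkit import Chem


def smiles_to_mol(smiles):
    """Parse a SMILES string and raise ValueError if RDKit cannot parse it.

    Hypothetical helper for illustration; not an RDKit function.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit signals a parse failure by returning None
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return mol


ethanol = smiles_to_mol('OCC')
# Hydrogens are implicit by default, so only the heavy atoms (O, C, C) count.
print(ethanol.GetNumAtoms())  # 3
```

Checking for None right after parsing is worthwhile when you read SMILES from a large dataset, since a single malformed entry would otherwise cause a confusing error much later.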

If you are using a Jupyter notebook, you can view the molecule by simply typing mol in a cell, which displays a drawing of its structure.

You can get the SMILES back from mol with the function Chem.MolToSmiles()

from rdkit import Chem
smiles = Chem.MolToSmiles(mol)  # 'CCO', the canonical SMILES of ethanol
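
Note that Chem.MolToSmiles() returns a canonical SMILES by default, so the string you get back may list the atoms in a different order from the one you typed. A short sketch of why this is useful: several equivalent ways of writing ethanol all collapse to one canonical string, which makes it easy to detect duplicates.

```python
from rdkit import Chem

# Three different but equivalent ways of writing ethanol.
variants = ['OCC', 'CCO', 'C(O)C']

# Round-trip each string through a MOL object to get its canonical form.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}

print(canonical)       # a set with a single canonical SMILES
print(len(canonical))  # 1
```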

Extended Connectivity Fingerprint (ECFP)

In machine learning, we usually need a feature vector as input to train a model.
ECFP is a simple and efficient feature vector (especially popular in biochemistry) that encodes substructure information to represent the molecule. See here for detailed information about ECFP.
The ECFP of a molecule can be obtained with AllChem.GetMorganFingerprintAsBitVect

from rdkit.Chem import AllChem
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius = 2, nBits = 1024)

Here, radius is the size of the atom neighborhood considered by the Morgan fingerprint, and nBits is the length of the resulting bit vector, which you can choose yourself. People usually use radius = 2 (which corresponds to ECFP4) and nBits = 1024.
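
To get a feel for what the fingerprint contains, you can inspect the bit vector directly. The following is a minimal sketch using ethanol as the example molecule; it only prints the vector length and the positions of the bits that are switched on, and makes no claim about which substructure each bit stands for.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('OCC')
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)

print(fp.GetNumBits())       # 1024, the length set by nBits
print(fp.GetNumOnBits())     # how many bits are set to 1
print(list(fp.GetOnBits()))  # indices of the bits that are set
```

For a small molecule like ethanol only a handful of the 1024 bits are on, because there are only a few distinct atom environments to encode.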

You can convert the fingerprint to a NumPy array with

import numpy as np
from rdkit import DataStructs
arr = np.zeros((1,))  # placeholder; it is resized to nBits by the call below
DataStructs.ConvertToNumpyArray(fp, arr)  # fills arr in place with 0/1 values
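
Putting the steps together, here is a hedged sketch of how a list of SMILES could be turned into a feature matrix for a machine learning model. The function name featurize, its parameter n_bits, and the three example molecules are assumptions made for this illustration; adapt them to your own data.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def featurize(smiles_list, radius=2, n_bits=1024):
    """Return an array of shape (len(smiles_list), n_bits) of ECFP bits.

    Hypothetical helper for illustration; not an RDKit function.
    """
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # unparsable SMILES
            raise ValueError(f"Could not parse SMILES: {smiles!r}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((1,))  # resized to n_bits by ConvertToNumpyArray
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.stack(rows)  # one row per molecule


# Ethanol, acetic acid, and benzene as example inputs.
X = featurize(['OCC', 'CC(=O)O', 'c1ccccc1'])
print(X.shape)  # (3, 1024)
```

Each row of X is the fingerprint of one molecule, so X can be passed as the input matrix to most standard machine learning libraries.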