RDKit (https://www.rdkit.org/) is an important Python package for anyone who wants to apply machine learning to molecules or simply parse molecules in silico.
This tutorial shows how to install RDKit on your computer and walks through a few basic operations.
Installation
You can install RDKit by simply running:
conda install -c conda-forge rdkit
Or create a new environment for this package and activate it whenever you need to use RDKit:
conda create -c conda-forge -n rdenv rdkit
conda activate rdenv
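To confirm that the installation worked, a quick check like the following can be run inside the activated environment. This is only a minimal sketch: it imports the package, prints its version, and parses a trivial molecule (methane, chosen here purely for illustration) to make sure the compiled chemistry backend loads.

```python
# Quick sanity check that RDKit is importable in the active environment.
import rdkit
from rdkit import Chem

print(rdkit.__version__)  # version string of the installed RDKit

# Parsing a trivial SMILES confirms the compiled backend loads as well.
mol = Chem.MolFromSmiles("C")  # methane
n_atoms = mol.GetNumAtoms()    # counts heavy atoms; hydrogens are implicit
```

If the import fails, the most likely cause is that the wrong conda environment is active.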
Molecule representation in text (string)
SMILES is one of the most common ways to represent a molecule as a string.
For instance, ethanol (C2H5OH) can be written as OCC in SMILES.
You can also draw molecules on ZINC and look at the resulting SMILES to get a better feel for this representation.
To let the computer understand what OCC really means (on its own, the computer has no idea what OCC, ethanol, or C2H5OH is), we need to use RDKit to transform the SMILES string into a MOL object.
Basic RDKit tutorial
From SMILES to MOL
In RDKit, it is very easy to transform SMILES to MOL with a single function, Chem.MolFromSmiles().
For instance, you can write:
from rdkit import Chem
smiles = 'OCC'
mol = Chem.MolFromSmiles(smiles)
If you are using a Jupyter notebook, you can view the molecule by simply typing mol in a cell, which renders a drawing of its structure.
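One detail worth knowing when parsing SMILES: Chem.MolFromSmiles() does not raise an exception on bad input. It returns None (and logs an error) instead. The sketch below wraps the call in a small helper so that failures surface immediately; the name parse_smiles is hypothetical and not part of RDKit.

```python
from rdkit import Chem

def parse_smiles(smiles):
    """Hypothetical helper: return an RDKit Mol or raise ValueError."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit signals a parse failure by returning None
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return mol

ethanol = parse_smiles("OCC")
num_heavy_atoms = ethanol.GetNumAtoms()  # hydrogens are implicit by default

try:
    parse_smiles("C1CC")  # ring opened with '1' but never closed
except ValueError as err:
    print(err)
```

Checking for None right after parsing avoids confusing errors later, when a None value would otherwise be passed on to other RDKit functions.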
You can get the SMILES back from mol with the function Chem.MolToSmiles():
from rdkit import Chem
smiles = Chem.MolToSmiles(mol)  # 'CCO', the canonical form of 'OCC'
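Note that the same molecule can be written as several different SMILES strings. By default, Chem.MolToSmiles() returns a canonical SMILES, so the string you get back may be written differently from the one you started with. The following sketch, using three equivalent spellings of ethanol chosen for illustration, shows that they all collapse to a single canonical string.

```python
from rdkit import Chem

# Three different but equivalent ways of writing ethanol.
variants = ["OCC", "CCO", "C(O)C"]

# MolToSmiles produces a canonical SMILES by default, so every variant
# should map to the same output string.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)
```

This makes canonical SMILES a convenient key for detecting duplicate molecules in a dataset.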
Extended Connectivity Fingerprint (ECFP)
In machine learning, we usually need a feature vector as the input for training a model.
ECFP is a simple and efficient feature vector (especially popular in biochemistry) that encodes which substructures are present in the molecule. See here for detailed information about ECFP.
The ECFP of a molecule can be obtained with AllChem.GetMorganFingerprintAsBitVect():
from rdkit.Chem import AllChem
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
Here, radius is the neighborhood radius (in bonds) around each atom used by the Morgan fingerprint, and nBits is the length of the resulting bit vector, which you can choose freely. People commonly use radius=2 (which corresponds to ECFP4) and nBits=1024.
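To get a feel for what the fingerprint object contains, it can be inspected directly. The sketch below recomputes the ethanol fingerprint and queries its length and its set bits. Depending on the installed version, RDKit may print a deprecation notice for this function; the calls themselves are standard methods of the returned bit vector.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("OCC")  # ethanol
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)

length = fp.GetNumBits()        # total vector length, equal to nBits
num_on = fp.GetNumOnBits()      # how many bits are set to 1
on_bits = list(fp.GetOnBits())  # indices of the bits that are set
print(length, num_on, on_bits)
```

For a small molecule like ethanol only a handful of bits are set, so the vector is very sparse.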
You can convert the fingerprint to a NumPy array with:
import numpy as np
from rdkit import DataStructs
# ConvertToNumpyArray resizes arr to the fingerprint length and fills it in place
arr = np.zeros((1,))
DataStructs.ConvertToNumpyArray(fp, arr)
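Putting the steps together, a feature matrix for several molecules could be built roughly as follows. This is a minimal sketch rather than a definitive recipe: the helper smiles_to_ecfp and the three example molecules are assumptions made for illustration, and the defaults simply mirror the radius and nBits values used above.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def smiles_to_ecfp(smiles, radius=2, n_bits=1024):
    """Hypothetical helper: SMILES string -> 1-D NumPy array of 0/1 values."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # parsing failed
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)  # fills arr in place
    return arr

smiles_list = ["OCC", "CC(=O)O", "c1ccccc1"]  # ethanol, acetic acid, benzene
X = np.stack([smiles_to_ecfp(s) for s in smiles_list])
print(X.shape)  # one row per molecule, one column per fingerprint bit
```

The resulting array X can be passed directly to most machine learning libraries as the input feature matrix.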