Shuan Chen

PhD Student in KAIST CBE

Machine Learning with PyTorch for chemistry

Machine learning

In most cases, a machine learning project involves three steps:

  1. Load or Collect Data
  2. Data Preprocessing
  3. Train, Validate and Test your model

For machine learning in chemistry, data loading and preprocessing (steps 1 and 2) are usually the most difficult part because they require substantial domain knowledge.
Once your data is prepared, you can build your own model simply by using an established Python framework (PyTorch or TensorFlow).

Framework

Install PyTorch

PyTorch is an open source machine learning library based on the Torch library, primarily developed by Facebook's AI Research lab (FAIR).
Compared to Keras and TensorFlow, I find PyTorch much easier and more flexible to use, so I now use PyTorch for most of my research.

I strongly recommend installing conda before installing PyTorch!
According to the official PyTorch page, you can easily install it with:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

Remember to install the CUDA toolkit version that matches your GPU and driver; otherwise the GPU build may not work.
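
After installing, it is worth running a quick sanity check (a minimal sketch I added, not from the original post) to confirm that PyTorch imports correctly and can see your GPU:

import torch

print(torch.__version__)              # installed PyTorch version
print(torch.cuda.is_available())      # True if the CUDA build can use your GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU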

Load data

For this practice, we are going to build a machine learning model to predict whether a compound inhibits CYP450 (in particular, the CYP2C19 enzyme) or not.
We will use RDKit to handle the molecules, so please have a look at it if you are not familiar with it.
This dataset is curated from PubChem BioAssay and saved in my repository.

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset
from sklearn.model_selection import train_test_split

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

target_enzyme = '2c19' # total five enzymes in my repo [1a2, 2c19, 2c9, 2d6, 3a4]
data_path = 'https://raw.githubusercontent.com/shuan4638/MCdropout-CYP450Classifcation/master/data/CYP_%s.csv' % target_enzyme
data = pd.read_csv(data_path, header = None)
data.columns = ['SMILES', 'label']
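
Before preprocessing, it does not hurt to take a quick look at the loaded table, for example the number of compounds and the class balance (a small optional check I added; the exact counts depend on the CSV you load):

print(data.shape)                 # (number of compounds, 2 columns)
print(data.head())                # first few SMILES/label pairs
print(data.label.value_counts())  # how many inactive (0) vs active (1) compounds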

Data Preprocessing

After loading the compounds (SMILES) and their labels (1 for active, 0 for inactive), we transform each SMILES string into a Morgan fingerprint (ECFP4, 1024 bits).
We then split the data into training/validation/test sets with an 80%/10%/10% partition.

def smiles2fp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)  # radius-2 Morgan fingerprint (ECFP4)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)  # copy the bit vector into a numpy array
    return arr

data["FPs"] = data.SMILES.apply(smiles2fp)
X = np.stack(data.FPs.values)
y = np.stack(data.label.values)
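
One caveat: Chem.MolFromSmiles returns None for SMILES that RDKit cannot parse, and smiles2fp would then raise an error. The CSV used here should be clean, but if you apply this to your own data, an optional guard like the following (my addition, not part of the original script) filters out unparsable entries before fingerprinting:

# Optional: drop rows whose SMILES cannot be parsed by RDKit
valid = data.SMILES.apply(lambda s: Chem.MolFromSmiles(s) is not None)
data = data[valid].reset_index(drop=True)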

Then we wrap these arrays in a DataLoader so that the PyTorch network can load mini-batches for training and testing.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
batch_size = 128

def numpy2tensor(array, device):
    return torch.tensor(array, device=device).float()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.11)  # 11% of the remaining 90% ~ 10% of the full data

train_dataset = TensorDataset(numpy2tensor(X_train, device), numpy2tensor(y_train, device))
val_dataset = TensorDataset(numpy2tensor(X_val, device), numpy2tensor(y_val, device))
test_dataset = TensorDataset(numpy2tensor(X_test, device), numpy2tensor(y_test, device))

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
                                         batch_size=batch_size,
                                         shuffle=False)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)
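
To check that the loaders behave as expected, you can pull one mini-batch and look at its shape (a quick sanity check I added; the second dimension should match the 1024-bit fingerprint length):

fps, labels = next(iter(train_loader))
print(fps.shape, labels.shape)  # e.g. torch.Size([128, 1024]) torch.Size([128])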

Train, validate and test

Now, we define a network with a very simple architecture (input-hidden-output) to fit the data.
In a PyTorch module, you define all the layers you want to use in the __init__ method and call these layers in the forward method to process your data.
It might be hard to understand at the beginning, but you will realize it is a very smart design after doing more projects.

class Net(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(hidden_size, 2)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc2(out)
        return out

input_size = X_train.shape[-1]
hidden_size = 512
model = Net(input_size, hidden_size)
model = model.to(device)
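
Printing the model and counting its trainable parameters is a quick way to confirm the architecture (a small check I added; the count follows from input_size = 1024 and hidden_size = 512):

print(model)  # shows the fc1 -> ReLU -> Dropout -> fc2 structure
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 1024*512 + 512 + 512*2 + 2 = 525,826 trainable parameters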

Define the two functions used to train and evaluate/test the model:

def train_an_epoch(epoch, model, data_loader, loss_criterion, optimizer):
    model.train()
    total_loss = 0
    total_correct = 0
    for batch_id, batch_data in enumerate(data_loader):
        fps, labels = batch_data
        optimizer.zero_grad()  # Clean the gradient
        output = model(fps)  # Forward pass of the mini-batch

        _, preds = torch.max(output.data, 1)  # Get prediction result
        correct = torch.sum(preds == labels.data.long())  # Count correct predictions
        total_correct += correct.data

        loss = loss_criterion(output, labels.long())  # Compute the loss
        loss.backward()  # Calculate the gradient by backpropagation
        optimizer.step()  # Update the weights according to the calculated gradient
        total_loss += loss.item()

        if batch_id % 20 == 0:
            print('\repoch %d/%d, batch %d/%d, loss %.3f, accuracy %.3f' % (epoch[0] + 1, epoch[1], batch_id + 1, len(data_loader), loss.item(), correct.cpu().numpy()/len(labels)), end='', flush=True)
    print('\nepoch %d/%d, Training loss %.3f, accuracy: %.3f' % (epoch[0] + 1, epoch[1], total_loss, total_correct.cpu().numpy()/len(data_loader.dataset)))
    return

def test_an_epoch(epoch, model, data_loader, loss_criterion):
    model.eval()
    total_loss = 0
    total_correct = 0
    with torch.no_grad():  # No need to calculate gradients during evaluation
        for batch_id, batch_data in enumerate(data_loader):
            fps, labels = batch_data
            output = model(fps)

            _, preds = torch.max(output.data, 1)
            correct = torch.sum(preds == labels.data.long())
            total_correct += correct.data

            loss = loss_criterion(output, labels.long())
            total_loss += loss.item()
    print('epoch %d/%d, Evaluation loss %.3f, accuracy %.3f\n' % (epoch[0] + 1, epoch[1], total_loss, total_correct.cpu().numpy()/len(data_loader.dataset)))
    return

Lastly, we set the hyperparameters, train the model using the functions defined above, and see how the training goes:

num_epochs = 10
learning_rate = 0.0001

loss_criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

print('Training...')
for epoch in range(num_epochs):
    epoch_set = (epoch, num_epochs)
    train_an_epoch(epoch_set, model, train_loader, loss_criterion, optimizer)
    test_an_epoch(epoch_set, model, val_loader, loss_criterion)

print('Testing...')
test_an_epoch(epoch_set, model, test_loader, loss_criterion)

Here is my result:

Training...
epoch 1/10, batch 61/71, loss 0.614, accuracy 0.680
epoch 1/10, Training loss 46.415, accuracy: 0.665
epoch 1/10, Evaluation loss 5.398, accuracy 0.741

epoch 2/10, batch 61/71, loss 0.553, accuracy 0.695
epoch 2/10, Training loss 39.931, accuracy: 0.743
epoch 2/10, Evaluation loss 4.650, accuracy 0.772

epoch 3/10, batch 61/71, loss 0.444, accuracy 0.805
epoch 3/10, Training loss 35.514, accuracy: 0.770
epoch 3/10, Evaluation loss 4.259, accuracy 0.784

epoch 4/10, batch 61/71, loss 0.457, accuracy 0.789
epoch 4/10, Training loss 32.929, accuracy: 0.790
epoch 4/10, Evaluation loss 4.062, accuracy 0.804

epoch 5/10, batch 61/71, loss 0.416, accuracy 0.828
epoch 5/10, Training loss 31.190, accuracy: 0.806
epoch 5/10, Evaluation loss 3.999, accuracy 0.795

epoch 6/10, batch 61/71, loss 0.377, accuracy 0.844
epoch 6/10, Training loss 30.048, accuracy: 0.814
epoch 6/10, Evaluation loss 3.901, accuracy 0.811

epoch 7/10, batch 61/71, loss 0.401, accuracy 0.836
epoch 7/10, Training loss 29.080, accuracy: 0.824
epoch 7/10, Evaluation loss 3.882, accuracy 0.811

epoch 8/10, batch 61/71, loss 0.450, accuracy 0.805
epoch 8/10, Training loss 28.316, accuracy: 0.830
epoch 8/10, Evaluation loss 3.856, accuracy 0.813

epoch 9/10, batch 61/71, loss 0.403, accuracy 0.812
epoch 9/10, Training loss 27.545, accuracy: 0.835
epoch 9/10, Evaluation loss 3.889, accuracy 0.809

epoch 10/10, batch 61/71, loss 0.383, accuracy 0.828
epoch 10/10, Training loss 26.917, accuracy: 0.839
epoch 10/10, Evaluation loss 3.853, accuracy 0.819

Testing...
epoch 10/10, Evaluation loss 3.908, accuracy 0.815

Improve the model

As we can see from the result, the accuracy on the test set is around 81.5%, which is not satisfactory (usually we consider a model accurate if its accuracy reaches 90% or even 95%).
There are several things you can do to improve the result:

  1. Change the hyperparameters, such as the batch size or the number of epochs (obviously we are training for too few epochs).
  2. Change the model architecture. You can add more layers or add some tricks such as a regularizer or MC-dropout (like what I did in my repo); a sketch of points 1 and 2 follows after this list.
  3. You can even try other network types such as graph neural networks (GNNs); however, I found that Morgan fingerprints work no worse than GNNs when predicting biochemical properties that depend strongly on functional groups.
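
As a concrete example of points 1 and 2, here is one possible set of tweaks: train for more epochs, add one extra hidden layer, and apply L2 regularization through Adam's weight_decay argument. This is only an illustrative sketch with untuned values, not the configuration used above:

num_epochs = 50  # point 1: train for more epochs (value is illustrative)

class DeeperNet(nn.Module):  # point 2: one extra hidden layer
    def __init__(self, input_size, hidden_size):
        super(DeeperNet, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden_size, 2))

    def forward(self, x):
        return self.layers(x)

model = DeeperNet(input_size, hidden_size).to(device)
# weight_decay adds L2 regularization to the Adam update
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)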

Test your own data

If you find your network accurate and want to test it on your own data, here is a function you can use.
In this case I predict the activity of phenol, and the model suggests that it is not active toward the CYP2C19 enzyme.

def predict_activity(smiles, model, device):
    model.eval()  # make sure dropout is disabled during inference
    fp = smiles2fp(smiles).reshape(1, -1)
    fp_tensor = torch.tensor(fp, device=device).float()
    prediction = nn.Softmax(dim=1)(model(fp_tensor))
    return prediction.data.cpu().numpy()[0][1]  # probability of the "active" class

target_smiles = 'Oc1ccccc1' # Phenol
predicted_activity = predict_activity(target_smiles, model, device)
print(predicted_activity) # 0.3101124
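
Finally, if you want to reuse the trained model later without retraining, the standard PyTorch way is to save and reload the state dict (the file name below is just an example):

# Save the trained weights
torch.save(model.state_dict(), 'cyp2c19_model.pt')

# Later: rebuild the same architecture and load the weights back
model = Net(input_size, hidden_size)
model.load_state_dict(torch.load('cyp2c19_model.pt', map_location=device))
model = model.to(device)
model.eval()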