Toy experiment: Molecular mechanics (MM) fitting on subsampled PhAlkEthOH dataset.
Open in Google Colab: http://data.wangyq.net/esp_notebooks/phalkethoh_mm_small.ipynb
This notebook is intended to reproduce the MM fitting experiment described in https://arxiv.org/abs/2010.01196
To assess how well Espaloma can learn to reproduce an MM force field from a limited amount of data, we selected a chemical dataset of limited complexity—PhAlkEthOH—which consists of linear and cyclic molecules containing phenyl rings, small alkanes, ethers, and alcohols, composed of only the elements carbon, oxygen, and hydrogen. We generated a set of conformational snapshots for each molecule using short high-temperature molecular dynamics simulations at 300 K, initiated from multiple conformations to ensure adequate sampling of conformers. The PhAlkEthOH dataset was randomly partitioned by molecule into 80% training, 10% validation, and 10% test sets, with 100 snapshots per molecule, and an Espaloma model was trained with early stopping triggered when performance on the validation set began to degrade.
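To make the molecule-level partition concrete, here is a minimal NumPy sketch (with a hypothetical molecules list; the notebook itself uses espaloma's built-in split utility below). Splitting by molecule, rather than by snapshot, guarantees that all conformers of a molecule land in exactly one split.
import numpy as np

# hypothetical list of molecule identifiers; the real dataset is loaded below
molecules = ["mol_%d" % i for i in range(100)]

# shuffle molecules (not snapshots), then cut 80%/10%/10%
rng = np.random.default_rng(seed=2666)
shuffled = rng.permutation(molecules)
n_tr = int(0.8 * len(shuffled))
n_vl = int(0.1 * len(shuffled))
train = shuffled[:n_tr]
valid = shuffled[n_tr:n_tr + n_vl]
test = shuffled[n_tr + n_vl:]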
Installation and imports
# install conda
! pip install -q condacolab
import condacolab
condacolab.install()
⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:34
🔁 Restarting kernel...
%%capture
! mamba install --yes --strict-channel-priority --channel jaimergp/label/unsupported-cudatoolkit-shim --channel omnia --channel omnia/label/cuda100 --channel dglteam numpy openmm openmmtools openmmforcefields rdkit openff-toolkit dgl-cuda10.0 qcportal
! git clone https://github.com/choderalab/espaloma.git
Cloning into 'espaloma'...
remote: Enumerating objects: 7812, done.
remote: Counting objects: 100% (3634/3634), done.
remote: Compressing objects: 100% (1649/1649), done.
remote: Total 7812 (delta 2714), reused 2639 (delta 1900), pack-reused 4178
Receiving objects: 100% (7812/7812), 13.50 MiB | 11.77 MiB/s, done.
Resolving deltas: 100% (5538/5538), done.
import torch
import sys
sys.path.append("/content/espaloma")
import espaloma as esp
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
Warning: Unable to load toolkit 'OpenEye Toolkit'. The Open Force Field Toolkit does not require the OpenEye Toolkits, and can use RDKit/AmberTools instead. However, if you have a valid license for the OpenEye Toolkits, consider installing them for faster performance and additional file format support: https://docs.eyesopen.com/toolkits/python/quickstart-python/linuxosx.html OpenEye offers free Toolkit licenses for academics: https://www.eyesopen.com/academic-licensing
Load dataset
Here we load the PhAlkEthOH dataset and shuffle it before splitting into training, validation, and test sets (80%:10%:10%).
%%capture
! wget http://data.wangyq.net/esp_dataset/phalkethoh_mm_small.zip
! unzip phalkethoh_mm_small.zip
ds = esp.data.dataset.GraphDataset.load("phalkethoh")
ds.shuffle(seed=2666)
ds_tr, ds_vl, ds_te = ds.split([8, 1, 1])
DGL backend not selected or invalid. Assuming PyTorch for now.
Using backend: pytorch
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable. Valid options are: pytorch, mxnet, tensorflow (all lowercase)
A training dataloader is constructed with batch_size=100; full-batch views of the training and validation sets are also built for the inspection step later.
ds_tr_loader = ds_tr.view(batch_size=100, shuffle=True)
g_tr = next(iter(ds_tr.view(batch_size=len(ds_tr))))
g_vl = next(iter(ds_vl.view(batch_size=len(ds_vl))))
/usr/local/lib/python3.7/site-packages/dgl/base.py:45: DGLWarning: From v0.5, DGLHeteroGraph is merged into DGLGraph. You can safely replace dgl.batch_hetero with dgl.batch
return warnings.warn(message, category=category, stacklevel=1)
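A quick sanity check can be run on one batch, which is a DGL heterograph. Espaloma indexes node types by the number of atoms involved (matching the readout configuration below): 'n1' for atoms, 'n2' for bonds, 'n3' for angles, and 'n4' for torsions. A sketch:
g_batch = next(iter(ds_tr_loader))
print(g_batch.ntypes)                 # node types in the heterograph
print(g_batch.number_of_nodes("n1"))  # total number of atoms in the batch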
Define model
Define Espaloma stage I: graph -> atom latent representation
representation = esp.nn.Sequential(
    layer=esp.nn.layers.dgl_legacy.gn("SAGEConv"),   # use SAGEConv implementation in DGL
    config=[128, "relu", 128, "relu", 128, "relu"],  # 3 layers, 128 units, ReLU activation
)
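For intuition, here is a minimal sketch, in plain DGL and independent of espaloma's wrapper, of what a single SAGEConv layer computes: each node aggregates its neighbors' features and combines them with its own to produce a new latent vector.
import dgl
import torch
from dgl.nn import SAGEConv

g_toy = dgl.graph(([0, 1, 2], [1, 2, 0]))  # toy 3-node cycle
h = torch.randn(3, 16)                     # toy 16-dimensional node features
conv = SAGEConv(16, 128, aggregator_type="mean")
print(conv(g_toy, h).shape)                # torch.Size([3, 128])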
Define Espaloma stages II and III: map the atom latent representation to bond, angle, and torsion representations and their parameters, then compose all three Espaloma stages into an end-to-end model.
readout = esp.nn.readout.janossy.JanossyPooling(
    in_features=128,
    config=[128, "relu", 128, "relu", 128, "relu"],
    out_features={                   # define modular MM parameters Espaloma will assign
        1: {"e": 1, "s": 1},         # atom electronegativity and hardness
        2: {"log_coefficients": 2},  # bond linear combination, enforce positive
        3: {"log_coefficients": 2},  # angle linear combination, enforce positive
        4: {"k": 6},                 # torsion barrier heights (can be positive or negative)
    },
)
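The bond and angle coefficients are predicted in log space so that the ExpCoefficients module (composed into the model below) can exponentiate them. A minimal sketch of why this guarantees positivity:
import torch

# any real-valued network output maps to a strictly positive coefficient,
# as required for harmonic bond and angle force constants
log_coefficients = torch.tensor([-1.0, 0.0, 2.0])
coefficients = torch.exp(log_coefficients)
print(coefficients)  # tensor([0.3679, 1.0000, 7.3891])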
espaloma_model = torch.nn.Sequential(
    representation,
    readout,
    esp.nn.readout.janossy.ExpCoefficients(),
    esp.mm.geometry.GeometryInGraph(),
    esp.mm.energy.EnergyInGraph(),
    esp.mm.energy.EnergyInGraph(suffix="_ref"),
    esp.nn.readout.charge_equilibrium.ChargeEquilibrium(),
)

if torch.cuda.is_available():
    espaloma_model = espaloma_model.cuda()
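A single forward pass writes the outputs back into the graph; consistent with the loss definition below, the predicted energy 'u' and the reference energy 'u_ref' are stored on the graph-level node type 'g'. A sketch:
g_batch = next(iter(ds_tr_loader))
if torch.cuda.is_available():
    g_batch = g_batch.to("cuda:0")
g_batch = espaloma_model(g_batch)
print(g_batch.nodes["g"].data["u"].shape)      # predicted MM energies
print(g_batch.nodes["g"].data["u_ref"].shape)  # reference MM energies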
The loss function is specified as the MSE between predicted and reference energies.
loss_fn = esp.metrics.GraphMetric(
    base_metric=torch.nn.MSELoss(),  # use mean-squared error loss
    between=["u", "u_ref"],          # between predicted and reference energies
    level="g",                       # compare on graph level
)
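Conceptually, this metric reduces to an ordinary mean-squared error over graph-level energies. A toy plain-PyTorch sketch of that quantity:
import torch

u = torch.randn(10)      # hypothetical predicted energies
u_ref = torch.randn(10)  # hypothetical reference energies
print(torch.nn.MSELoss()(u, u_ref))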
Define optimizer
optimizer = torch.optim.Adam(espaloma_model.parameters(), lr=1e-4)
Train it!
for idx_epoch in range(10000):
    for g in ds_tr_loader:
        optimizer.zero_grad()
        if torch.cuda.is_available():
            g = g.to("cuda:0")
        g = espaloma_model(g)
        loss = loss_fn(g)
        loss.backward()
        optimizer.step()
    # save a checkpoint every epoch so the best epoch can be selected post hoc
    torch.save(espaloma_model.state_dict(), "%s.th" % idx_epoch)
/usr/local/lib/python3.7/site-packages/dgl/base.py:45: DGLWarning: From v0.5, DGLHeteroGraph is merged into DGLGraph. You can safely replace dgl.batch_hetero with dgl.batch
return warnings.warn(message, category=category, stacklevel=1)
/usr/local/lib/python3.7/site-packages/dgl/base.py:45: DGLWarning: dgl.to_homo is deprecated. Please use dgl.to_homogeneous
return warnings.warn(message, category=category, stacklevel=1)
Inspect
inspect_metric = esp.metrics.GraphMetric(
    base_metric=torch.nn.L1Loss(),   # use mean absolute error (L1) loss
    between=["u", "u_ref"],          # between predicted and reference energies
    level="g",                       # compare on graph level
)
if torch.cuda.is_available():
    g_vl = g_vl.to("cuda:0")
    g_tr = g_tr.to("cuda:0")

loss_tr = []
loss_vl = []
for idx_epoch in range(10000):
    espaloma_model.load_state_dict(torch.load("%s.th" % idx_epoch))
    espaloma_model(g_tr)
    loss_tr.append(inspect_metric(g_tr).item())
    espaloma_model(g_vl)
    loss_vl.append(inspect_metric(g_vl).item())
/usr/local/lib/python3.7/site-packages/dgl/base.py:45: DGLWarning: dgl.to_homo is deprecated. Please use dgl.to_homogeneous
return warnings.warn(message, category=category, stacklevel=1)
import numpy as np

# convert energies from hartree to kcal/mol (1 hartree ≈ 627.5 kcal/mol)
loss_tr = np.array(loss_tr) * 627.5
loss_vl = np.array(loss_vl) * 627.5
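Because a checkpoint was saved at every epoch, early stopping can be applied post hoc by reloading the checkpoint with the lowest validation error:
best_epoch = int(np.argmin(loss_vl))  # epoch minimizing validation MAE
print(best_epoch, loss_vl[best_epoch])
espaloma_model.load_state_dict(torch.load("%s.th" % best_epoch))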
from matplotlib import pyplot as plt

plt.plot(loss_tr, label="train")
plt.plot(loss_vl, label="valid")
plt.yscale("log")
plt.xlabel("epoch")
plt.ylabel("energy MAE (kcal/mol)")
plt.legend()
<matplotlib.legend.Legend at 0x7fd8f0eebd90>