Note
Go to the end to download the full example code.
Long-distance Equivariants: a tutorial¶
- Authors:
Philip Loche @PicoCentauri, Kevin Huguenin-Dumittan @kvhuguenin
This tutorial explains how Long range equivariant descriptors can be constructed using rascaline and the resulting descriptors be used to construct a linear model with equisolve
First, import all the necessary packages
import ase.io
import matplotlib.pyplot as plt
import metatensor
import numpy as np
from equisolve.numpy.models.linear_model import Ridge
from equisolve.utils.convert import ase_to_tensormap
from rascaline import AtomicComposition, LodeSphericalExpansion, SphericalExpansion
from rascaline.utils import PowerSpectrum
Step 0: Prepare Data Set¶
Get structures¶
We take a small subset of the dimer dataset from A. Grisafi et al., 2021 for which we additionally calculated the forces. Each structure in the dataset contains two small organic molecules which are extended along a certain direction in the subsequent structures.
For speeding up the calculations we already selected the first 130
structures
of the charge-charge molecule
pairs.
frames = ase.io.read("charge-charge.xyz", ":")
Convert target properties to metatensor format¶
If we want to train models using the equisolve package, we need to convert the target properties (in this case, the energies and forces) into the appropriate format #justequistorethings
y = ase_to_tensormap(frames, energy="energy", forces="forces")
Step 1: Compute short-range and LODE features¶
Define hypers and get the expansion coefficients \(\langle anlm | \rho_i \rangle\) and \(\langle anlm | V_i \rangle\)
The short-range and long-range descriptors have very similar hyperparameters. We highlight the differences below.
We first define the hyperparameters for the short-range (SR) part. These will be used to create SOAP features.
SR_HYPERS = {
"cutoff": 3.0,
"max_radial": 6,
"max_angular": 2,
"atomic_gaussian_width": 0.3,
"center_atom_weight": 1.0,
"radial_basis": {"Gto": {}},
"cutoff_function": {"ShiftedCosine": {"width": 0.5}},
}
And next the hyperparaters for the LODE / long-range (LR) part
LR_HYPERS = {
# Cutoff on which to project potential density
"cutoff": 3.0,
# keep max_radial slightly smaller than for SR part
"max_radial": 3,
# max_angular should be <= 4, more precisely, max_angular + potential_exponent < 10
"max_angular": 2,
# keep at >=1, WARNING: CUBIC SCALING, do not use values <0.5
"atomic_gaussian_width": 3.0,
"center_atom_weight": 1.0,
"radial_basis": {"Gto": {}},
# the exponent p that determines the 1/r^p potential
"potential_exponent": 1,
}
We then use the above defined hyperparaters to define the per atom short range (sr) and long range (sr) descriptors.
calculator_sr = SphericalExpansion(**SR_HYPERS)
calculator_lr = LodeSphericalExpansion(**LR_HYPERS)
Note that LODE requires periodic systems. Therefore, if the data set does not come with periodic boundary conditions by default you can not use the data set and you will face an error if you try to compute the features.
As you notices the calculation of the long range features takes significant more time compared to the sr features.
Taking a look at the output we find that the resulting
metatensor.TensorMap
are quite similar in their structure. The short range
metatensor.TensorMap
contains more blocks due to the higher
max_angular
paramater we choosed above.
Generate the rotational invariants (power spectra)¶
Rotationally invariant features can be obtained by taking two of the calculators that were defines above.
For the short-range part, we use the SOAP vector which is obtained by computing the invariant combinations of the form \(\rho \otimes \rho\).
ps_calculator_sr = PowerSpectrum(calculator_sr, calculator_sr)
ps_sr = ps_calculator_sr.compute(frames, gradients=["positions"])
We calculate gradients with respect to pistions by providing the
gradients=["positions"]
option to the
rascaline.calculators.CalculatorBase.compute()
method.
For the long-range part, we combine the long-range descriptor \(V\) with one a short-range density \(\rho\) to get \(\rho \otimes V\) features.
ps_calculator_lr = PowerSpectrum(calculator_sr, calculator_lr)
ps_lr = ps_calculator_lr.compute(systems=frames, gradients=["positions"])
Step 2: Building a Simple Linear SR + LR Model with energy baselining¶
Preprocessing (model dependent)¶
For our current model, we do not wish to treat the individual center and
neighbor species separately. Thus, we move the "species_center"
key
into the sample
direction, over which we will later sum over.
ps_sr = ps_sr.keys_to_samples("species_center")
ps_lr = ps_lr.keys_to_samples("species_center")
For linear models only: Sum features up over atoms (samples
) in the same
structure.
sample_names_to_sum = ["center", "species_center"]
ps_sr = metatensor.sum_over_samples(ps_sr, sample_names=sample_names_to_sum)
ps_lr = metatensor.sum_over_samples(ps_lr, sample_names=sample_names_to_sum)
Initialize tensormaps for energy baselining¶
We add a simple extra descriptor rascaline.AtomicComposition
that stores
how many atoms of each chemical species are contained in the structures. This is used
for energy baselining.
calculator_co = AtomicComposition(per_structure=False)
descriptor_co = calculator_co.compute(frames, gradients=["positions"])
co = descriptor_co.keys_to_properties(["species_center"])
co = metatensor.sum_over_samples(co, sample_names=["center"])
The rascaline.AtomicComposition
calculator also allows to directly perform
the the sum over center atoms by using the following lines.
descriptor_co = AtomicComposition(per_structure=True).compute(**compute_args)
co = descriptor_co.keys_to_properties(["species_center"])
Stack all the features together for linear model¶
A linear model on SR + LR features can be thought of as a linear model built on a feature vector that is simply the concatenation of the SR and LR features.
Furthermore, energy baselining can be performed by concatenating the information about
chemical species as well. There is an metatensor function called
metatensor.join()
for this purpose. Formally, we can write for the SR
model.
X_sr: \(1 \oplus \left(\rho \otimes \rho\right)\)
X_sr = metatensor.join([co, ps_sr], axis="properties")
We used the axis="properties"
parameter since we want to concatenate along the
features/properties dimensions.
For the long range model we can formerly write
X_lr: \(1 \oplus \left(\rho \otimes \rho\right) \oplus \left(\rho \otimes V\right)\)
X_lr = metatensor.join([co, ps_sr, ps_lr], axis="properties")
The features are now ready! Let us now perform some actual learning. Below we
initialize two instances of the equisolve.numpy.models.linear_model.Ridge
class. equisolve.numpy.models.linear_model.Ridge
will perform a regression
with respect to "values"
(energies) and "positions"
gradients (forces).
If you only want a fit with respect to energies you can remove the gradients with
metatensor.remove_gradients()
clf_sr = Ridge()
clf_lr = Ridge()
Split training and target data into train and test dat¶
Split the training and the test data by the distance \(r_{\rm train}=6\,\mathrm{Å}\) between the center of mass of the two molecules. A structure with a \(r_{\rm train}<6 {\rm Å}\) is used for training.
r_cut = 6.0
We calculate the indices from the dataset by list comprehension. The center of mass
distance is stored in the "distance""
attribute.
idx_train = [i for i, f in enumerate(frames) if f.info["distance"] < r_cut]
idx_test = [i for i, f in enumerate(frames) if f.info["distance"] >= r_cut]
For doing the split we define two Labels
instances and combine them in a
List
.
samples_train = metatensor.Labels(["structure"], np.reshape(idx_train, (-1, 1)))
samples_test = metatensor.Labels(["structure"], np.reshape(idx_test, (-1, 1)))
grouped_labels = [samples_train, samples_test]
That we use as input to the metatensor.split()
function
X_sr_train, X_sr_test = metatensor.split(
X_sr, axis="samples", selections=grouped_labels
)
X_lr_train, X_lr_test = metatensor.split(
X_lr, axis="samples", selections=grouped_labels
)
y_train, y_test = metatensor.split(y, axis="samples", selections=grouped_labels)
Fit the model¶
For this model, we use a very simple regularization scheme where all features are
regularized in the same way (the amount being controlled by the parameter alpha
).
For more advanced regularization schemes (regularizing energies and forces differently
and/or the SR and LR parts differently), see further down.
clf_sr.fit(X_sr_train, y_train, alpha=1e-6)
clf_lr.fit(X_lr_train, y_train, alpha=1e-6)
<equisolve.numpy.models.linear_model.NumpyRidge object at 0x7f01cbcbdc10>
Evaluation¶
For evaluating the model we calculate the RMSEs using the score()
method. With the
parameter_key
parameter we select which RMSE should be calculated.
print(
"SR: RMSE energies = "
f"{clf_sr.score(X_sr_test, y_test, parameter_key='values')[0]:.3f} eV"
)
print(
"SR: RMSE forces = "
f"{clf_sr.score(X_sr_test, y_test, parameter_key='positions')[0]:.3f} eV/Å"
)
print(
"LR: RMSE energies = "
f"{clf_lr.score(X_lr_test, y_test, parameter_key='values')[0]:.3f} eV"
)
print(
"LR: RMSE forces = "
f"{clf_lr.score(X_lr_test, y_test, parameter_key='positions')[0]:.3f} eV/Å"
)
SR: RMSE energies = 0.557 eV
SR: RMSE forces = 0.188 eV/Å
LR: RMSE energies = 0.158 eV
LR: RMSE forces = 0.178 eV/Å
We find that the RMSE of the energy and the force of the LR model is smaller compared to the SR model. From this we conclude that the LR model performs better for the selection of the dataset.
We additionally, can plot of the binding energy as a function of the distance. For the plot we select some properties from the dataset
dist = np.array([f.info["distance"] for f in frames])
energies = np.array([f.info["energy"] for f in frames])
monomer_energies = np.array([f.info["energyA"] + f.info["energyB"] for f in frames])
and select only the indices corresponding to our test set.
Next we calculate the predicted SR and LR TensorMaps
.
y_sr_pred = clf_sr.predict(X_sr)
y_lr_pred = clf_lr.predict(X_lr)
And, finally perform the plot.
plt.scatter(
dist, y.block().values[:, 0] - monomer_energies, label="target data", color="black"
)
plt.scatter(
dist,
y_sr_pred.block().values[:, 0] - monomer_energies,
label="short range model",
marker="x",
)
plt.scatter(
dist,
y_lr_pred.block().values[:, 0] - monomer_energies,
label="long range model",
marker="s",
facecolor="None",
edgecolor="orange",
)
plt.xlabel("center of mass distance in Å")
plt.ylabel(r"$E - E_\mathrm{monomer}$ in eV")
plt.axvline(r_cut, c="red", label=r"$r_\mathrm{train}$")
plt.legend()
plt.tight_layout()
plt.show()
Total running time of the script: (0 minutes 9.475 seconds)