Periodic Hamiltonian learning

Authors:

Paolo Pegolo @ppegolo, Jigyasa Nigam @curiosity54

This tutorial explains how to train a machine learning model for the electronic Hamiltonian of a periodic system. Even though we focus on periodic systems, the code and techniques presented here can be directly transferred to molecules.

First, import the necessary packages

import os
import warnings
import zipfile

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import requests
import torch
from matplotlib.animation import FuncAnimation
from mlelec.data.derived_properties import compute_eigenvalues
from mlelec.data.mldataset import MLDataset
from mlelec.data.qmdataset import QMDataset
from mlelec.models.equivariant_lightning import LitEquivariantModel, MSELoss
from mlelec.utils.pbc_utils import blocks_to_matrix
from mlelec.utils.plot_utils import plot_bands_frame


warnings.filterwarnings("ignore")
torch.set_default_dtype(torch.float64)

# sphinx_gallery_thumbnail_number = 3
/home/runner/work/atomistic-cookbook/atomistic-cookbook/.nox/periodic-hamiltonian/lib/python3.11/site-packages/pyscf/dft/libxc.py:772: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, the same to the B3LYP functional in Gaussian and ORCA (issue 1480). To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
  warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '

Get Data and Prepare Data Set

The data set contains 35 distorted graphene unit cells containing 2 atoms. The reference density functional theory (DFT) calculations are performed with CP2K using a minimal STO-3G basis and the PBE functional. The Kohn-Sham equations are solved on a Monkhorst-Pack grid of \(15\times 15\times 1\) points in the Brillouin zone of the crystal.

Obtain structures and DFT data

Generating training structures requires running a suitable DFT code, and converting the output data in a format that can be processed by the ML library mlelec. Given that it takes some time to run even these small calculations, we provide pre-computed data, but you can also find instructions on how to generate data from scratch.

Run your own cp2k calculations

If you have computational resources, you can run the DFT calculations needed to produce the data set. This other tutorial in the atomistic cookbook can help you set up the CP2K calculations for this data set, using the reftraj_hamiltonian.cp2k file provided in data/. To do the same for another data set, adapt the reftraj file. We will provide here some of the functions in the batch-cp2k tutorial that need to be adapted to the current data set. Note however you will have to modify these and combine them with other tutorials to actually generate the data.

Start by importing all the modules from the batch-cp2k tutorial and run the cell to install CP2K. Run also the cells up to the one defining write_cp2k_in. The following code snippet defines a slighly modified version of that function, allowing for non-orthorombic supercell, and accounting for the reftraj file name change.

def write_cp2k_in(
    fname: str,
    project_name: str,
    last_snapshot: int,
    cell_a: List[float],
    cell_b: List[float],
    cell_c: List[float],
) -> None:
    """Writes a cp2k input file from a template.

    Importantly, it writes the location of the basis set definitions,
    determined from the path of the system CP2K install to the input file.
    """

    cp2k_in = open("reftraj_hamiltonian.cp2k", "r").read()

    cp2k_in = cp2k_in.replace("//PROJECT//", project_name)
    cp2k_in = cp2k_in.replace("//LAST_SNAPSHOT//", str(last_snapshot))
    cp2k_in = cp2k_in.replace("//CELL_A//", " ".join([f"{c:.6f}" for c in cell_a]))
    cp2k_in = cp2k_in.replace("//CELL_B//", " ".join([f"{c:.6f}" for c in cell_b]))
    cp2k_in = cp2k_in.replace("//CELL_C//", " ".join([f"{c:.6f}" for c in cell_c]))

    with open(fname, "w") as f:
        f.write(cp2k_in)

Unlike the batch-cp2k tutorial, the current data set includes a single stoichiometry, \(\mathrm{C_2}\). Therefore, you can run this cell to set the calculation scripts up.

project_name = 'graphene'
frames = ase_read('C2.xyz', index=':')
os.makedirs(project_name, exist_ok=True)
os.makedirs(f"{project_name}/FOCK", exist_ok=True)
os.makedirs(f"{project_name}/OVER", exist_ok=True)

write_cp2k_in(
        f"{project_name}/in.cp2k",
        project_name=project_name,
        last_snapshot=len(frames),
        cell_a=frames[0].cell.array[0],
        cell_b=frames[0].cell.array[1],
        cell_c=frames[0].cell.array[2],
    )

ase_write(f"{project_name}/init.xyz", frames[0])
write_reftraj(f"{project_name}/reftraj.xyz", frames)
write_cellfile(f"{project_name}/reftraj.cell", frames)

The CP2K calculations can be simply run using:

subprocess.run((
    f"cp2k.ssmp -i {project_name}/in.cp2k "
    "> {project_name}/out.cp2k"
    ),
    shell=True)

Once the calculations are done, we can parse the results with:

from scipy.sparse import csr_matrix

nao = 10
ifr = 1
fock = []
over = []
with open(f"{project_name}/out.cp2k", "r") as outfile:
    T_lists = []  # List to hold all T_list instances
    while True:
        line = outfile.readline()
        if not line:
            break
        if line.strip().split()[:3] != ["KS", "CSR", "write|"]:
            continue
        else:
            nT = int(line.strip().split()[3])
            outfile.readline()  # Skip the next line if necessary
            T_list = []  # Initialize a new T_list for this block
            for _ in range(nT):
                line = outfile.readline()
                if not line:
                    break
                T_list.append([np.int32(j) for j in line.strip().split()[1:4]])
            T_list = np.array(T_list)
            T_lists.append(T_list)  # Append the T_list to T_lists
            fock_ = {}
            over_ = {}
            for iT, T in enumerate(
                T_list
            ):  # Loop through the translations and load matrices
                T = T.tolist()
                r, c, data = np.loadtxt(
                    (
                        f"{project_name}/FOCK/{project_name}"
                        f"-KS_SPIN_1_R_{iT+1}-1_{ifr}.csr"
                    ),
                    unpack=True,
                )
                r = np.int32(r - 1)
                c = np.int32(c - 1)
                fock_[tuple(T)] = csr_matrix(
                    (data, (r, c)), shape=(nao, nao)
                ).toarray()

                r, c, data = np.loadtxt(
                    (
                        f"{project_name}/OVER/{project_name}"
                        f"-S_SPIN_1_R_{iT+1}-1_{ifr}.csr"
                    ),
                    unpack=True,
                )
                r = np.int32(r - 1)
                c = np.int32(c - 1)
                over_[tuple(T)] = csr_matrix(
                    (data, (r, c)), shape=(nao, nao)
                ).toarray()
            fock.append(fock_)
            over.append(over_)
            ifr += 1

You can now save the matrices to .npy files, and a file with the k-grids used in the calculations.

os.makedirs("data", exist_ok=True)
# Save the Hamiltonians
np.save("data/graphene_fock.npy", fock)
# Save the overlaps
np.save("data/graphene_ovlp.npy", over)
# Write a file with the k-grids, one line per structure
np.savetxt('data/kmesh.dat', [[15,15,1]]*len(frames), fmt='%d')

Download precomputed data

For the sake of simplicity, you can also download precomputed data and run just the machine learning part of the notebook using these data.

filename = "precomputed.zip"
if not os.path.exists(filename):
    url = (
        "https://github.com/curiosity54/mlelec/raw/"
        "tutorial_periodic/examples/periodic_tutorial/precomputed.zip"
    )
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)

with zipfile.ZipFile(filename, "r") as zip_ref:
    zip_ref.extractall("./")

Periodic Hamiltonians in real and reciprocal space

The DFT calculations for the dataset above were performed using a minimal STO-3G basis. The basis set is specified for each species using three quantum numbers, \(n\), \(l\), \(m\). \(n\) is usually a natural number relating to the radial extent or resolution whereas \(l\) and \(m\) specify the angular components determining the shape of the orbital and its orientation in space. For example, \(1s\) orbitals correspond to \(n=1\), \(l=0\) and \(m=0\), while a \(2p_z\) orbital corresponds to \(n=2\), \(l=1\) and \(m=0\). For the STO-3G basis-set, these quantum numbers for Carbon (identified by its atomic number) are given as follows.

basis = "sto-3g"
orbitals = {
    "sto-3g": {6: [[1, 0, 0], [2, 0, 0], [2, 1, -1], [2, 1, 0], [2, 1, 1]]},
}

For each frame which of either train and test structures, the QM data comprises the configuration, along with the corresponding overlap and Hamiltonian (used interchangeably with Fock) matrices in the basis specified above, as well as the \(k\)-point grid that was used for the calculation.

Note that we are currently specifying these matrices in real-space, \(\mathbf{H}(\mathbf{t})\) , such that the element \(\langle \mathbf{0} i nlm| \hat{H}| \mathbf{t} i' n'l'm'\rangle\) indicates the interaction between orbital \(nlm\) on atom \(i\) in the undisplaced cell (denoted by the null lattice translation, \(\mathbf{t}=\mathbf{0}\)) and orbital \(n'l'm'\) on atom \(i'\) in a periodic copy of the unit cell translated by \(\mathbf{t}\). A short-hand notation for \(\langle \mathbf{0} i nlm| \hat{H}| \mathbf{t} i' n'l'm'\rangle\) is \(H_{\small\substack{i,nlm\\i',n'l'm'}}(\mathbf{t})\)

Representation of a graphene unit cell and some replicas.

Representation of a graphene unit cell and its \(3 \times 3 \times 1\) replicas in real space. The central cell is denoted by \(\mathbf{t}=(0,0,0)\), while the cells translated by a single lattice vector along directions 1 and 2 are denoted by \(\mathbf{t}=(1,0,0)\) and \(\mathbf{t}=(0,1,0)\), respectively. The Hamiltonian matrix element between the \(1s\) orbital on atom \(i\) in the central unit cell and the \(2p_z\) orbital on atom \(i'\) in the \(\mathbf{t}=(1,0,0)\) cell is schematically represented.

Alternatively, we can provide the matrices in reciprocal (or Fourier, \(k\)) space. These are related to the real-space matrices by a Bloch sum,

\[\mathbf{H}(\mathbf{k})=\sum_{\mathbf{t}}\ e^{i\mathbf{k}\cdot\mathbf{t}} \mathbf{H}(\mathbf{t}).\]

In the case the input matrices are in reciprocal space, there should be one matrix per \(k\)-point in the grid.

A QMDataset to store the DFT data

The QMDataset class holds all the relevant data obtained from a quantum-mechanical (in this case, DFT) calculation, combining information from the files containing structures, Hamiltonians and overlap matrices, and \(k\)-point mesh.

qmdata = QMDataset.from_file(
    # File containing the atomistic structures
    frames_path="data/C2.xyz",
    # File containing the Hamiltonian (of Fock) matrices
    fock_realspace_path="graphene_fock.npy",
    # File containing the overlap matrices
    overlap_realspace_path="graphene_ovlp.npy",
    # File containing the k-point grids used for the DFT calculations
    kmesh_path="kmesh.dat",
    # Physical dimensionality of the system. Graphene is a 2D material
    dimension=2,
    # Device where to run the calculations
    # (can be 'cpu' or 'cuda', if GPUs are available)
    device="cpu",
    # Name of the basis set used for the calculations
    orbs_name=basis,
    # List of quantum numbers associated with the basis set orbitals
    orbs=orbitals[basis],
)

Quantities stored in QMDataset can be accessed as attributes, e.g. qmdata.fock_realspace is a list (one element per structure) of dictionaries labeled by the indices of the unit cell real-space translations containing torch.Tensor.

structure_idx = 0
realspace_translation = 0, 0, 0
print(f"The real-space Hamiltonian matrix for structure {structure_idx} labeled by")
print(f"translation T={realspace_translation} is:")
print(f"{qmdata.fock_realspace[structure_idx][realspace_translation]}")
The real-space Hamiltonian matrix for structure 0 labeled by
translation T=(0, 0, 0) is:
tensor([[-9.7210e+00, -2.6676e+00,  2.9317e-04,  3.2255e-04,  5.7009e-04,
         -8.1646e-05, -4.6611e-01, -3.6897e-01, -1.2792e-02,  6.7660e-01],
        [-2.6676e+00, -1.3880e+00,  3.2649e-03,  7.2985e-03,  1.0359e-02,
         -4.6611e-01, -5.9615e-01, -2.5023e-01, -8.9866e-03,  4.5874e-01],
        [ 2.9317e-04,  3.2649e-03, -4.2835e-01, -9.5326e-05, -1.2642e-02,
          3.6897e-01,  2.5023e-01, -1.0482e-01,  3.3322e-03, -1.6016e-01],
        [ 3.2255e-04,  7.2985e-03, -9.5326e-05, -1.4549e-01, -2.8526e-04,
          1.2792e-02,  8.9865e-03,  3.3322e-03, -1.4956e-01, -6.0943e-03],
        [ 5.7009e-04,  1.0359e-02, -1.2642e-02, -2.8526e-04, -4.4633e-01,
         -6.7660e-01, -4.5874e-01, -1.6016e-01, -6.0944e-03,  1.0121e-01],
        [-8.1646e-05, -4.6611e-01,  3.6897e-01,  1.2792e-02, -6.7660e-01,
         -9.7210e+00, -2.6676e+00, -2.9113e-04, -3.2414e-04, -5.7176e-04],
        [-4.6611e-01, -5.9615e-01,  2.5023e-01,  8.9865e-03, -4.5874e-01,
         -2.6676e+00, -1.3880e+00, -3.2673e-03, -7.2983e-03, -1.0360e-02],
        [-3.6897e-01, -2.5023e-01, -1.0482e-01,  3.3322e-03, -1.6016e-01,
         -2.9113e-04, -3.2673e-03, -4.2835e-01, -9.5331e-05, -1.2644e-02],
        [-1.2792e-02, -8.9866e-03,  3.3322e-03, -1.4956e-01, -6.0944e-03,
         -3.2414e-04, -7.2983e-03, -9.5331e-05, -1.4549e-01, -2.8525e-04],
        [ 6.7660e-01,  4.5874e-01, -1.6016e-01, -6.0943e-03,  1.0121e-01,
         -5.7176e-04, -1.0360e-02, -1.2644e-02, -2.8525e-04, -4.4633e-01]])

Machine learning data set

Symmetries of the Hamiltonian matrix

The data stored in QMDataset can be transformed into a format that is optimal for machine learning modeling by leveraging the underlying physical symmetries that characterize the atomistic structure, the basis set, and their associated matrices.

The Hamiltonian matrix is a complex learning target, indexed by two atoms and the orbitals centered on them. Each \(\mathbf{H}(\mathbf{k})\) is a Hermitian matrix, while in real space, periodicity introduces a symmetry over translation pairs such that \(\mathbf{H}(-\mathbf{t}) = \mathbf{H}(\mathbf{t})^\dagger\), where the dagger, \(\dagger\), denotes Hermitian conjugation.

To address the symmetries associated with swapping atomic indices or orbital labels, we divide the matrix into blocks labeled by pairs of atom types.

  • block_type = 0, or on-site blocks, consist of elements corresponding to the interaction of orbitals on the same atom, \(i = i'\).

  • block_type = 2, or cross-species blocks, consist of elements corresponding to orbitals centered on atoms of distinct species. Since the two atoms can be distinguished, they can be consistently arranged in a predetermined order.

  • block_type = 1, -1, or same-species blocks, consist of elements corresponding to orbitals centered on distinct atoms of the same species. As these atoms are indistinguishable and cannot be ordered definitively, the pair must be symmetrized for permutations. We construct symmetric and antisymmetric combinations \((\mathbf{H}_{\small\substack{i,nlm\\i',n'l'm'}}(\mathbf{t})\pm\ \mathbf{H}_{\small\substack{i',nlm\\i,n'l'm'}}(\mathbf{-t}))\) that correspond to block_type \(+1\) and \(-1\), respectively.

Equivariant structure of the Hamiltonians

Even though the Hamiltonian operator under consideration is invariant, its representation transforms under the action of structural rotations and inversions due to the choice of the basis functions. Each of the blocks has elements of the form \(\langle\mathbf{0}inlm|\hat{H}|\mathbf{t}i'n'l'm'\rangle\), which are in an uncoupled representation and transform as a product of (real) spherical harmonics, \(Y_l^m \otimes Y_{l'}^{m'}\).

This product can be decomposed into a direct sum of irreducible representations (irreps) of \(\mathrm{SO(3)}\),

\[\lambda \mu:\lambda \in [|l_1-l_2|,l_1+l_2],\mu \in [-\lambda,\lambda],\]

which express the Hamiltonian blocks in terms of contributions that rotate independently and can be modeled using a feature that geometrically describes the pair of atoms under consideration and shares the same symmetry. We use the notation \(H_{ii';nn'll'}^{\lambda\mu}\) to indicate the elements of the Hamiltonian in this coupled basis.

The resulting irreps form a coupled representation, each of which transforms as a spherical harmonic \(Y^\mu_\lambda\) under \(\mathrm{SO(3)}\) rotations, but may exhibit more complex behavior under inversions. For example, spherical harmonics transform under inversion, \(\hat{i}\), as polar tensors:

\[\hat{i}Y^\mu_\lambda = (-1)^\lambda Y^\mu_\lambda.\]

Some of the coupled basis terms transform like \(Y^\mu_\lambda\), while others instead transform as pseudotensors,

\[\hat{i}H^{\lambda\mu}=(-1)^{\lambda+1}H^{\lambda\mu}\]

where we omit for simplicity the indices that are not directly associated with inversion and rotation symmetry. For more details about the block decomposition, please refer to Nigam et al., J. Chem. Phys. 156, 014115 (2022).

The following is an animation of a trajectory along a Lissajous curve in 3D space, alongside a colormap representing the values of the real-space Hamiltonian matrix elements of the graphene unit cell in a minimal STO-3G basis. From the animation, it is evident how invariant elements, such as those associated with interactions between \(s\) orbitals, do not change under structural rotations. On the other hand, interactions allowing for equivariant components, such as the \(s\)-\(p\) block, change under rotations.

image_files = sorted(
    [
        f"frames/{f}"
        for f in os.listdir("./frames")
        if f.startswith("rot_") and f.endswith(".png")
    ]
)
images = [mpimg.imread(img) for img in image_files]
fig, ax = plt.subplots()
img_display = ax.imshow(images[0])
ax.axis("off")


def update(frame):
    img_display.set_data(images[frame])
    return [img_display]


ani = FuncAnimation(fig, update, frames=len(images), interval=20, blit=True)