.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "examples/sample-selection/sample-selection.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_examples_sample-selection_sample-selection.py: Sample and Feature Selection with FPS and CUR ============================================= :Authors: Davide Tisi `@DavideTisi `_ In this tutorial we generate descriptors using rascaline, then select a subset of structures using both the farthest-point sampling (FPS) and CUR algorithms implemented in scikit-matter. Finally, we also generate a selection of the most important features using the same techniques. First, import all the necessary packages .. GENERATED FROM PYTHON SOURCE LINES 16-28 .. code-block:: Python import ase.io import chemiscope import metatensor import numpy as np from equisolve.numpy import feature_selection, sample_selection from matplotlib import pyplot as plt from rascaline import SoapPowerSpectrum from sklearn.decomposition import PCA from skmatter import feature_selection as skfeat_selection .. GENERATED FROM PYTHON SOURCE LINES 29-34 Load molecular data ------------------- Load 500 example BTO structures from file, reading them using `ASE `_. .. GENERATED FROM PYTHON SOURCE LINES 34-39 .. code-block:: Python # Load a subset of :download:`structures ` of the example dataset n_frames = 500 frames = ase.io.read("input-fps.xyz", f":{n_frames}", format="extxyz") .. GENERATED FROM PYTHON SOURCE LINES 40-44 Compute SOAP descriptors using rascaline ---------------------------------------- First, define the rascaline hyperparameters used to compute SOAP. .. GENERATED FROM PYTHON SOURCE LINES 44-82 .. code-block:: Python # rascaline hyperparameters hypers = { "cutoff": 6.0, "max_radial": 8, "max_angular": 6, "atomic_gaussian_width": 0.3, "cutoff_function": {"ShiftedCosine": {"width": 0.5}}, "radial_basis": {"Gto": {"accuracy": 1e-6}}, "radial_scaling": {"Willatt2018": {"exponent": 4, "rate": 1, "scale": 3.5}}, "center_atom_weight": 1.0, } # Generate a SOAP power spectrum calculator = SoapPowerSpectrum(**hypers) rho2i = calculator.compute(frames) # Makes a dense block atom_soap = rho2i.keys_to_properties(["species_neighbor_1", "species_neighbor_2"]) atom_soap_single_block = atom_soap.keys_to_samples(keys_to_move=["species_center"]) # Sum over atomic centers to compute structure features struct_soap = metatensor.sum_over_samples( atom_soap_single_block, sample_names=["center", "species_center"] ) print("atom feature descriptor shape:", atom_soap.block(0).values.shape) print( "atom feature descriptor (all in one block) shape:", atom_soap_single_block.block(0).values.shape, ) print("structure feature descriptor shape:", struct_soap.block(0).values.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none atom feature descriptor shape: (12000, 2688) atom feature descriptor (all in one block) shape: (20000, 2688) structure feature descriptor shape: (500, 2688) .. GENERATED FROM PYTHON SOURCE LINES 83-99 Perform atomic environment (i.e. sample) selection --------------------------------------------------- Using FPS and CUR algorithms, we can perform selection of atomic environments. These are implemented in equisolve, which provides a wrapper around scikit-matter to allow for interfacing with data stored in the metatensor format. Suppose we want to select the 10 most diverse environments for each chemical species. First, we can use the `keys_to_properties` operation in metatensor to move the neighbour species indices to the properties of the TensorBlocks. The resulting descriptor will be a TensorMap comprised of three blocks, one for each chemical species, where the chemical species indices are solely present in the keys. .. GENERATED FROM PYTHON SOURCE LINES 99-108 .. code-block:: Python print("----Atomic environment selection-----") # Define the number of structures to select using FPS/CUR n_envs = 25 print(atom_soap) print(atom_soap.block(0)) .. rst-class:: sphx-glr-script-out .. code-block:: none ----Atomic environment selection----- TensorMap with 3 blocks keys: species_center 8 22 56 TensorBlock samples (12000): ['structure', 'center'] components (): [] properties (2688): ['species_neighbor_1', 'species_neighbor_2', 'l', 'n1', 'n2'] gradients: None .. GENERATED FROM PYTHON SOURCE LINES 109-110 select 10 atomic environments for each chemical species. .. GENERATED FROM PYTHON SOURCE LINES 110-131 .. code-block:: Python # Define the number of structures *per block* to select using FPS n_envs = 10 # FPS sample selection selector_atomic_fps = sample_selection.FPS(n_to_select=n_envs, initialize="random").fit( atom_soap ) # Print the selected envs for each block print("atomic envs selected with FPS:\n") for key, block in selector_atomic_fps.support.items(): print("species_center:", key, "\n(struct_idx, atom_idx)\n", block.samples.values) selector_atomic_cur = sample_selection.CUR(n_to_select=n_envs).fit(atom_soap) # Print the selected envs for each block print("atomic envs selected with CUR:\n") for key, block in selector_atomic_cur.support.items(): print("species_center:", key, "\n(struct_idx, atom_idx)\n", block.samples.values) .. rst-class:: sphx-glr-script-out .. code-block:: none atomic envs selected with FPS: species_center: LabelsEntry(species_center=8) (struct_idx, atom_idx) [[ 68 18] [113 36] [140 21] [285 20] [339 16] [339 23] [339 24] [341 17] [347 37] [436 22]] species_center: LabelsEntry(species_center=22) (struct_idx, atom_idx) [[ 19 11] [ 55 13] [166 15] [198 12] [216 8] [285 9] [324 8] [341 12] [433 13] [466 9]] species_center: LabelsEntry(species_center=56) (struct_idx, atom_idx) [[ 40 7] [140 2] [238 3] [289 6] [339 3] [341 4] [407 0] [407 7] [436 6] [451 7]] atomic envs selected with CUR: species_center: LabelsEntry(species_center=8) (struct_idx, atom_idx) [[ 55 21] [ 68 20] [ 77 30] [198 36] [267 32] [336 33] [339 24] [339 36] [341 17] [436 19]] species_center: LabelsEntry(species_center=22) (struct_idx, atom_idx) [[ 10 39] [ 40 10] [ 70 10] [130 10] [166 15] [170 14] [216 8] [285 9] [326 10] [466 10]] species_center: LabelsEntry(species_center=56) (struct_idx, atom_idx) [[ 40 7] [ 77 3] [172 3] [219 7] [289 6] [296 2] [339 5] [339 6] [407 0] [436 2]] .. GENERATED FROM PYTHON SOURCE LINES 132-141 Selecting from a combined pool of atomic environments ----------------------------------------------------- One can also select from a combined pool of atomic environments and structures, instead of selecting an equal number of atomic environments for each chemical species. In this case, we can move the 'species_center' key to samples such that our descriptor is a TensorMap consisting of a single block. Upon sample selection, the most diverse atomic environments will be selected, regardless of their chemical species. .. GENERATED FROM PYTHON SOURCE LINES 141-163 .. code-block:: Python print("----All atomic environment selection-----") print("keys", atom_soap.keys) print("blocks", atom_soap[0]) print("samples in first block", atom_soap[0].samples) # Using the original SOAP descriptor, move all keys to properties. # Define the number of structures to select using FPS n_envs = 10 # FPS sample selection selector_atomic_fps = sample_selection.FPS(n_to_select=n_envs, initialize="random").fit( atom_soap_single_block ) print( "atomic envs selected with FPS: \n (struct_idx, atom_idx, species_center) \n", selector_atomic_fps.support.block(0).samples.values, ) .. rst-class:: sphx-glr-script-out .. code-block:: none ----All atomic environment selection----- keys Labels( species_center 8 22 56 ) blocks TensorBlock samples (12000): ['structure', 'center'] components (): [] properties (2688): ['species_neighbor_1', 'species_neighbor_2', 'l', 'n1', 'n2'] gradients: None samples in first block Labels( structure center 0 16 0 17 ... 499 30 499 31 ) atomic envs selected with FPS: (struct_idx, atom_idx, species_center) [[ 68 12 22] [ 77 30 8] [140 21 8] [166 15 22] [216 8 22] [289 6 56] [339 18 8] [407 0 56] [460 4 56] [466 34 8]] .. GENERATED FROM PYTHON SOURCE LINES 164-175 Perform structure (i.e. sample) selection with FPS/CUR --------------------------------------------------------- Instead of atomic environments, one can also select diverse structures. We can use the `sum_over_samples` operation in metatensor to define features in the structural basis instead of the atomic basis. This is done by summing over the atomic environments, labeled by the 'center' index in the samples of the TensorMap. Alternatively, one could use the `mean_over_samples` operation, depending on the specific inhomogeneity of the size of the structures in the training set. .. GENERATED FROM PYTHON SOURCE LINES 175-205 .. code-block:: Python print("----Structure selection-----") # Define the number of structures to select *per block* using FPS n_structures = 10 # FPS structure selection selector_struct_fps = sample_selection.FPS( n_to_select=n_structures, initialize="random" ).fit(struct_soap) struct_fps_idxs = selector_struct_fps.support.block(0).samples.values.flatten() print("structures selected with FPS:\n", struct_fps_idxs) # CUR structure selection selector_struct_cur = sample_selection.CUR(n_to_select=n_structures).fit(struct_soap) struct_cur_idxs = selector_struct_cur.support.block(0).samples.values.flatten() print("structures selected with CUR:\n", struct_cur_idxs) # Slice structure descriptor along axis 0 to contain only the selected structures struct_soap_fps = struct_soap.block(0).values[struct_fps_idxs, :] struct_soap_cur = struct_soap.block(0).values[struct_cur_idxs, :] assert struct_soap_fps.shape == struct_soap_cur.shape print("Structure descriptor shape before selection ", struct_soap.block(0).values.shape) print("Structure descriptor shape after selection (FPS)", struct_soap_fps.shape) print("Structure descriptor shape after selection (CUR)", struct_soap_cur.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none ----Structure selection----- structures selected with FPS: [ 15 71 140 167 172 257 317 326 356 496] structures selected with CUR: [ 39 40 68 110 140 289 326 386 398 438] Structure descriptor shape before selection (500, 2688) Structure descriptor shape after selection (FPS) (10, 2688) Structure descriptor shape after selection (CUR) (10, 2688) .. GENERATED FROM PYTHON SOURCE LINES 206-213 Visualize selected structures ----------------------------- sklearn can be used to perform PCA dimensionality reduction on the SOAP descriptors. The resulting PC coordinates can be used to visualize the the data alongside their structures in a chemiscope widget. .. GENERATED FROM PYTHON SOURCE LINES 213-219 .. code-block:: Python # Generate a structure PCA struct_soap_pca = PCA(n_components=2).fit_transform(struct_soap.block(0).values) assert struct_soap_pca.shape == (n_frames, 2) .. GENERATED FROM PYTHON SOURCE LINES 220-225 Plot the PCA map ~~~~~~~~~~~~~~~~ Notice how the selected points avoid the densely-sampled area, and cover the periphery of the dataset .. GENERATED FROM PYTHON SOURCE LINES 225-242 .. code-block:: Python # Matplotlib plot fig, ax = plt.subplots(1, 1, figsize=(6, 4)) scatter = ax.scatter(struct_soap_pca[:, 0], struct_soap_pca[:, 1], c="red") ax.plot( struct_soap_pca[struct_cur_idxs, 0], struct_soap_pca[struct_cur_idxs, 1], "ko", fillstyle="none", label="FPS selection", ) ax.set_xlabel("PCA[1]") ax.set_ylabel("PCA[2]") ax.legend() fig.show() .. image-sg:: /examples/sample-selection/images/sphx_glr_sample-selection_001.png :alt: sample selection :srcset: /examples/sample-selection/images/sphx_glr_sample-selection_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 243-246 Creates a chemiscope viewer ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 246-294 .. code-block:: Python # Selected level selection_levels = [] for i in range(len(frames)): level = 0 if i in struct_cur_idxs: level += 1 if i in struct_fps_idxs: level += 2 if level == 0: level = "Not selected" elif level == 1: level = "CUR" elif level == 2: level = "FPS" else: level = "FPS+CUR" selection_levels.append(level) properties = chemiscope.extract_properties(frames) properties.update( { "PC1": struct_soap_pca[:, 0], "PC2": struct_soap_pca[:, 1], "selection": np.array(selection_levels), } ) widget = chemiscope.show( frames, properties=properties, settings={ "map": { "x": {"property": "PC1"}, "y": {"property": "PC2"}, "color": {"property": "energy"}, "symbol": "selection", "size": {"factor": 50}, }, "structure": [{"unitCell": True}], }, ) widget.save("sample-selection.json.gz") # display, if in notebook or sphinx widget .. chemiscope:: _datasets/fig_sample-selection_001.json.gz :mode: default .. rst-class:: sphx-glr-script-out .. code-block:: none /home/runner/work/atomistic-cookbook/atomistic-cookbook/.nox/sample-selection/lib/python3.11/site-packages/chemiscope/structures/_ase.py:131: UserWarning: the following structure properties are only defined for a subset of frames: ['stress']; they will be ignored all_names = _ase_list_structure_properties(frames) /home/runner/work/atomistic-cookbook/atomistic-cookbook/.nox/sample-selection/lib/python3.11/site-packages/chemiscope/structures/__init__.py:78: UserWarning: the following structure properties are only defined for a subset of frames: ['stress']; they will be ignored return _ase_list_structure_properties(frames) .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 295-301 Perform feature selection ------------------------- Now perform feature selection. In this example we will go back to using the descriptor decomposed into atomic environments, as opposed to the one decomposed into structure environments, but only use FPS for brevity. .. GENERATED FROM PYTHON SOURCE LINES 301-328 .. code-block:: Python print("----Feature selection-----") # Define the number of features to select n_features = 200 # FPS feature selection feat_fps = feature_selection.FPS(n_to_select=n_features, initialize="random").fit( atom_soap_single_block ) # Slice atomic descriptor along axis 1 to contain only the selected features # atom_soap_single_block_fps = atom_soap_single_block.block(0).values[:, feat_fps_idxs] atom_soap_single_block_fps = metatensor.slice( atom_soap_single_block, axis="properties", labels=feat_fps.support.block(0).properties, ) print( "atomic descriptor shape before selection ", atom_soap_single_block.block(0).values.shape, ) print( "atomic descriptor shape after selection ", atom_soap_single_block_fps.block(0).values.shape, ) .. rst-class:: sphx-glr-script-out .. code-block:: none ----Feature selection----- atomic descriptor shape before selection (20000, 2688) atomic descriptor shape after selection (20000, 200) .. GENERATED FROM PYTHON SOURCE LINES 331-337 Perform feature selection (skmatter) ------------------------------------ Now perform feature selection. In this example we will go back to using the descriptor decomposed into atomic environments, as opposed to the one decomposed into structure environments, but only use FPS for brevity. .. GENERATED FROM PYTHON SOURCE LINES 337-360 .. code-block:: Python print("----Feature selection (skmatter)-----") # Define the number of features to select n_features = 200 # FPS feature selection feat_fps = skfeat_selection.FPS(n_to_select=n_features, initialize="random").fit( atom_soap_single_block.block(0).values ) feat_fps_idxs = feat_fps.selected_idx_ print("Feature indices obtained with FPS ", feat_fps_idxs) # Slice atomic descriptor along axis 1 to contain only the selected features atom_dscrptr_fps = atom_soap_single_block.block(0).values[:, feat_fps_idxs] print( "atomic descriptor shape before selection ", atom_soap_single_block.block(0).values.shape, ) print("atomic descriptor shape after selection ", atom_dscrptr_fps.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none ----Feature selection (skmatter)----- Feature indices obtained with FPS [2607 0 920 464 3 411 1 488 1824 921 5 465 29 1179 27 2240 468 1344 4 2244 602 923 489 274 1825 45 916 1180 28 1308 475 470 731 924 46 283 1828 474 450 30 491 2241 730 412 2084 12 477 1835 1636 859 476 1827 1796 496 413 931 936 21 453 1345 899 466 939 940 11 1816 1324 934 900 948 861 928 1490 19 860 925 2 1170 2276 456 1347 1819 1181 1310 13 858 1380 36 210 1372 1812 926 929 2195 947 458 1196 451 1307 492 2267 472 2212 915 462 449 733 725 922 18 37 2523 2532 1365 932 1746 347 473 868 912 502 275 927 1362 2246 1172 485 54 504 484 285 866 395 31 499 1348 480 1316 2092 38 1805 901 913 1941 723 944 2213 898 1187 1939 941 402 907 1364 6 1947 469 1309 467 490 1940 10 905 498 2242 1637 908 904 452 2083 942 1820 937 724 1171 500 933 338 420 2524 1682 2254 483 2221 22 1792 1764 494 7 738 949 9 2268 1832 1618 1323 478 482 950 461] atomic descriptor shape before selection (20000, 2688) atomic descriptor shape after selection (20000, 200) .. _sphx_glr_download_examples_sample-selection_sample-selection.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download :download:`Download Conda environment file: environment.yml ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: sample-selection.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: sample-selection.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: sample-selection.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_