As we continue to explore the best ways to generate data for future iterations of SPICE, it would be useful to apply a variety of dataset generation strategies to a simplified chemical space to enable experiments that can help identify the most useful strategies.
OpenFF has made extensive use of the AlkEthOH dataset, which contains only three elements (C, H, O), making it feasible to explore the relevant chemical space relatively exhaustively. The name "AlkEthOH" refers to "alkanes, ethers, and alcohols (OH)".
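As a rough illustration of the three-element restriction, the sketch below checks whether a SMILES string contains only C, H, and O. The function names `elements_in_smiles` and `is_alkethoh_like` are hypothetical, and the regular expressions cover only common SMILES syntax; a real workflow would presumably use a cheminformatics toolkit for this check.

```python
import re

ALLOWED_ELEMENTS = {"C", "H", "O"}

# Bracket atoms such as [OH-], [13CH4], or [nH]; group 1 is the element symbol.
_BRACKET_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|se|as|[bcnops])[^\]]*\]")
# Organic-subset atoms written outside brackets (aromatic atoms are lowercase).
_ORGANIC_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def elements_in_smiles(smiles):
    """Return the set of element symbols written explicitly in a SMILES string."""
    elements = set()
    for match in _BRACKET_ATOM.finditer(smiles):
        elements.add(match.group(1).capitalize())
    # Remove bracket atoms so their contents are not scanned a second time.
    remainder = _BRACKET_ATOM.sub("", smiles)
    for match in _ORGANIC_ATOM.finditer(remainder):
        elements.add(match.group(0).capitalize())
    return elements


def is_alkethoh_like(smiles):
    """True if every explicitly written atom is C, H, or O (implicit H is allowed)."""
    return elements_in_smiles(smiles) <= ALLOWED_ELEMENTS


if __name__ == "__main__":
    for smi in ["CCO", "C1CCOC1", "c1ccccc1O", "CCN", "CCCl"]:
        print(smi, is_alkethoh_like(smi))
```

Note that this only tests elemental composition; it does not verify that a molecule is actually an alkane, ether, or alcohol (for example, a carbonyl compound would also pass).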
A few subsets have already been generated at the OpenFF default level of theory by QCFractal here:
- AlkEthOH chain molecules : 1303 molecules
- AlkEthOH ring-containing molecules : 1156 molecules
- PhEthOH (AlkEthOH with phenyl substituents) : 5082 molecules
Examples are below:
AlkEthOH chain molecules
AlkEthOH_chain.pdf

AlkEthOH with rings
AlkEthOH_rings.pdf

PhAlkEthOH
PhEthOH.pdf

We could generate several kinds of datasets:
- MD snapshots generated with an MM force field (e.g. OpenFF or GAFF)
- MD snapshots generated with GFN2-xTB
- An OptimizationDataset from RDKit-enumerated conformers
- MD snapshots generated with an MM force field but used in an OptimizationDataset with the number of minimization steps limited to 3-4 steps (requires an argument to geomeTRIC to register success even if the convergence tolerance is not met)
- A TorsionDriveDataset
- etc.
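To make the truncated-optimization idea concrete, here is a toy sketch of a minimizer that stops after a fixed number of steps and still reports success when the convergence tolerance has not been met. It uses plain steepest descent on a one-dimensional harmonic well, not geomeTRIC, and every name in it (`truncated_relax`, `accept_unconverged`, `RelaxResult`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RelaxResult:
    x: float          # final coordinate
    steps: int        # number of minimization steps actually taken
    converged: bool   # whether the gradient tolerance was met
    success: bool     # whether the result should be accepted downstream


def truncated_relax(x0, grad, max_steps=4, step_size=0.1, tol=1e-6,
                    accept_unconverged=True):
    """Steepest descent capped at max_steps.

    If accept_unconverged is True, hitting the step cap is still reported
    as success, so partially relaxed geometries are kept.
    """
    x = x0
    steps = 0
    converged = abs(grad(x)) < tol
    while not converged and steps < max_steps:
        x -= step_size * grad(x)
        steps += 1
        converged = abs(grad(x)) < tol
    return RelaxResult(x, steps, converged, converged or accept_unconverged)


def harmonic_grad(x, k=2.0, x_eq=1.0):
    """Gradient of the toy energy E(x) = 0.5 * k * (x - x_eq)**2."""
    return k * (x - x_eq)


if __name__ == "__main__":
    result = truncated_relax(1.5, harmonic_grad)
    print(result)
```

Starting from an MD-like displaced coordinate, the run ends after four steps partway down the well: `converged` is false but `success` is true, which is the behavior the strategy needs in order to retain slightly relaxed, off-minimum structures.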