
Explore various dataset generation strategies on simplified chemical space #89

@jchodera

As we continue to explore the best ways to generate data for future iterations of SPICE, it would be useful to apply a variety of dataset generation strategies to a simplified chemical space, so that we can run experiments to identify which strategies are most useful.

OpenFF has made extensive use of the AlkEthOH dataset, which contains only three elements (C, H, O), making it feasible to explore the relevant chemical space nearly exhaustively. The name "AlkEthOH" refers to "alkanes, ethers, and alcohols (OH)".
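To make the "only three elements" restriction concrete, here is a minimal sketch of a C/H/O filter over SMILES strings. The function names (`elements_in_smiles`, `is_cho_only`) are hypothetical, and the regex tokenizer is a deliberate simplification rather than a full SMILES parser; a real workflow would presumably use a cheminformatics toolkit for this check.

```python
import re

# Elements allowed in the AlkEthOH-style simplified chemical space.
ALLOWED = {"C", "H", "O"}

# Bracket atom: "[", optional isotope digits, then the element symbol
# (standard capitalized symbol, or a lowercase aromatic symbol).
_BRACKET_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")
# Whole bracket expression, used to strip bracket atoms before the bare scan.
_BRACKET_BLOCK = re.compile(r"\[[^\]]*\]")
# Organic-subset atoms written outside brackets; two-letter halogens come first.
_BARE_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def elements_in_smiles(smiles):
    """Return the set of heavy-atom element symbols found in a SMILES string.

    Simplified sketch: implicit hydrogens are not reported, which is fine
    here because hydrogen is always allowed.
    """
    elements = {m.group(1).capitalize() for m in _BRACKET_ATOM.finditer(smiles)}
    bare = _BRACKET_BLOCK.sub("", smiles)
    elements.update(m.group(0).capitalize() for m in _BARE_ATOM.finditer(bare))
    return elements


def is_cho_only(smiles):
    """True if every element found in the SMILES is C, H, or O."""
    return elements_in_smiles(smiles) <= ALLOWED
```

For example, `is_cho_only("CCOC")` accepts diethyl-ether-like strings, while anything containing nitrogen or a halogen is rejected.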

A few subsets have already been generated at the OpenFF default level of theory by QCFractal here:

  • AlkEthOH chain molecules: 1303 molecules
  • AlkEthOH ring-containing molecules: 1156 molecules
  • PhEthOH (AlkEthOH with phenyl substituents): 5082 molecules

Examples are below:

AlkEthOH chain molecules
AlkEthOH_chain.pdf

AlkEthOH with rings
AlkEthOH_rings.pdf

PhEthOH (AlkEthOH with phenyl substituents)
PhEthOH.pdf

We could generate several kinds of datasets:

  • MD snapshots generated with an MM force field (e.g. OpenFF or GAFF)
  • MD snapshots generated with GFN2-xTB
  • An OptimizationDataset from RDKit-enumerated conformers
  • An OptimizationDataset seeded with MD snapshots from an MM force field, with minimization limited to 3-4 steps (requires an argument to geomeTRIC to register success even if the convergence tolerance is not met)
  • A TorsionDriveDataset
  • etc.
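As a rough way to size the experiment, the subsets and strategies listed above can be crossed into a simple plan. This is only a bookkeeping sketch: the molecule counts come from the subsets listed in this issue, while the strategy labels and the `plan` structure are hypothetical names chosen for illustration.

```python
from itertools import product

# Molecule counts for the subsets already generated (from this issue).
SUBSETS = {
    "AlkEthOH chain": 1303,
    "AlkEthOH rings": 1156,
    "PhEthOH": 5082,
}

# Hypothetical short labels for the proposed generation strategies.
STRATEGIES = [
    "mm_md_snapshots",          # MD snapshots from an MM force field
    "xtb_md_snapshots",         # MD snapshots from GFN2-xTB
    "rdkit_conformer_opt",      # OptimizationDataset from RDKit conformers
    "mm_md_truncated_opt",      # MM snapshots, 3-4 minimization steps
    "torsiondrive",             # TorsionDriveDataset
]

# One entry per (subset, strategy) combination, with the molecule count.
plan = [
    {"subset": name, "strategy": strategy, "n_molecules": count}
    for (name, count), strategy in product(SUBSETS.items(), STRATEGIES)
]

total_molecules = sum(SUBSETS.values())
print(f"{len(plan)} subset/strategy combinations over {total_molecules} molecules")
```

The number of conformers or snapshots per molecule is left out on purpose, since the issue does not specify it.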
