This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
A development workspace for two sibling Agent Skills:
- ase-chemist: atomistic simulation orchestration on top of ASE / tblite-xTB / EMT / TIP3P, plus MACE foundation models in v1.2+, a small Amber GAFF2 small-molecule MD carve-out in v1.3+ (plain min → heat → density → prod NPT only), and Gaussian DFT (SP / Opt / Freq) in v1.4+.
- amber-chemist: Amber-native sibling skill, MD-first. Ships single-replica MD (configurable stages, restart, extend, restraints, barostat options, explicit or implicit GB solvent), T-REMD as a v1.0 first-class capability, plus add-ons for cpptraj-driven analysis (amber_analyze.py), single-point energies (amber_sp.py), and MMPBSA endpoint scoring (amber_score.py). ff19SB/OL21 biopolymer prep is pre-wired and lands in v1.1.
The repo contains both skills, a Layer B test harness that runs
trigger / no-trigger prompts through claude -p in fresh sessions,
and the input fixtures those prompts reference.
This is not an application — there is no library to import and no service
to run. Work here means editing the skills (SKILL.md, scripts, references) and
re-running the trigger tests to check whether changes regress activation or
method-selection behavior. v2 phase sequencing for ase-chemist is tracked
in PLAN.md.
chemist/
├── ase-chemist/ # development source (edit here)
│ ├── SKILL.md, README.md
│ ├── scripts/
│ │ ├── _calc.py # shared: build_calculator() factory used by optimize/run_md/single_point
│ │ ├── check_env.py # backends + CUDA + Amber detection, capability summary
│ │ ├── optimize.py # BFGS/FIRE/LBFGS; emt/lj/tip3p/xtb/mace (delegates to _calc)
│ │ ├── run_md.py # NVE/NVT; auto cross-validation w/ MACE (uses Validator from validate_ml_md)
│ │ ├── single_point.py # E + dipole/charges/HOMO-LUMO via tblite (delegates to _calc)
│ │ ├── analyze_traj.py # RMSD/RMSF/energy drift/RDF
│ │ ├── ml_calculator.py # v1.2 — MACE factory (element-set routing, GPU detect)
│ │ ├── validate_ml_md.py # v1.2 — Validator class (persistent ref_atoms; SCF restart between calls); post-hoc cross-validation vs GFN2-xTB
│ │ ├── parameterize_gaff2.py # v1.3 FROZEN SUBSET — antechamber AM1-BCC -> parmchk2 -> tleap (canonical at amber-chemist/scripts/amber_prep.py)
│ │ ├── run_amber.py # v1.3 FROZEN SUBSET — min/heat/density/prod via pmemd.cuda/pmemd/sander (canonical at amber-chemist/scripts/amber_md.py; amber_run.py is the easy-mode pipeline wrapper that also includes prep)
│ │ ├── gaussian_sp.py # v1.4 — DFT SP via ase.calculators.gaussian.Gaussian; also hosts shared helpers (add_common_gaussian_args, detect_gaussian_binary, scrf_kwarg) imported by gaussian_opt/freq
│ │ ├── gaussian_opt.py # v1.4 — DFT Opt via GaussianOptimizer (L103)
│ │ ├── gaussian_freq.py # v1.4 — DFT Freq + thermochem via in-house log parser
│ │ └── _gaussian_log.py # v1.4 — regex helper for Gaussian .log fields ASE doesn't cover. Parsers take pre-read text; use load_log() to read once and pass through
│ ├── references/
│ │ ├── ase_core.md, xtb.md, analysis.md # v1 references
│ │ ├── ml_potentials.md # v1.2 reference — MACE method-selection + cross-validation contract
│ │ ├── amber.md # v1.3 reference — GAFF2 small-mol; protein/NA -> v2.3
│ │ └── gaussian.md # v1.4 reference — DFT SP/Opt/Freq; TS/IRC/NBO/post-HF -> v3+
│ └── evals/evals.json # 5 prompts; v1 has NO programmatic assertions
├── amber-chemist/ # development source for the second skill
│ ├── SKILL.md, README.md
│ ├── scripts/
│ │ ├── _amber.py # shared: pick_engine, mdin renderers, tleap deck factory, mdout/rem.log parsers, groupfile builder, GB_MAP, require_binaries, infer_input_format. mdout_succeeded tail-reads (4 KB) for large mdouts
│ │ ├── check_env.py # AmberTools / MPI / cpptraj / MMPBSA / ParmEd detection
│ │ ├── amber_run.py # easy mode: --mode {standard, remd, implicit}
│ │ ├── amber_prep.py # GAFF2 prep (ff19SB/OL21 raise NotImplementedError in v1.0)
│ │ ├── amber_md.py # MD core: stages, --restart, --extend, --implicit-solvent
│ │ ├── amber_remd.py # T-REMD with auto temperature ladder + exchange-rate report
│ │ ├── amber_sp.py # add-on: SP (snapshot or trajectory)
│ │ ├── amber_analyze.py # add-on: cpptraj-driven RMSD/RMSF/RDF/hbond/radgyr; --demux-remd
│ │ └── amber_score.py # add-on: MMPBSA / MMGBSA wrapper
│ ├── references/ # 14 topic-scoped files + index README (md_core, remd, force_fields, ...)
│ └── evals/evals.json # 5 prompts, no programmatic assertions in v1.0
├── .claude/skills/ase-chemist/ # project-scoped copy of ase-chemist
├── .claude/skills/amber-chemist/ # project-scoped copy of amber-chemist
├── PLAN.md # v2 phase sequencing for ase-chemist
├── test-inputs/ # fixtures generated by generate_test.py
├── results/ # per-run logs + .status from run_tests.sh
├── generate_test.py # one-shot fixture generator
└── run_tests.sh # 43-prompt test harness (ase + amber + research)
There are three copies of each skill on this machine. For
ase-chemist:
1. ase-chemist/ (this repo, dev source)
2. .claude/skills/ase-chemist/ (project skill; Claude Code loads this when invoked from this repo)
3. ~/.claude/skills/ase-chemist/ (user skill; currently identical to #2)
For amber-chemist:
1. amber-chemist/ (this repo, dev source)
2. .claude/skills/amber-chemist/ (project skill)
3. ~/.claude/skills/amber-chemist/ (user skill)
#1 is the source of truth and may drift ahead of #2/#3 between
syncs. Tests in run_tests.sh invoke claude -p, which loads #2 or
#3 — not #1. After editing the dev source, sync both loaded
copies before re-running tests:
# ase-chemist
rsync -a --delete ase-chemist/ .claude/skills/ase-chemist/
rsync -a --delete ase-chemist/ ~/.claude/skills/ase-chemist/
diff -rq ase-chemist .claude/skills/ase-chemist # confirm parity
diff -rq ase-chemist ~/.claude/skills/ase-chemist
# amber-chemist
rsync -a --delete amber-chemist/ .claude/skills/amber-chemist/
rsync -a --delete amber-chemist/ ~/.claude/skills/amber-chemist/
diff -rq amber-chemist .claude/skills/amber-chemist
diff -rq amber-chemist ~/.claude/skills/amber-chemist

PLAN.md §"Sequencing rules" says to wait until trigger tests pass against
dev before syncing. Because the harness only sees the loaded copies, the
practical order is "edit dev → sync → test → fix in dev → sync → test",
repeated until clean.
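The parity check that the diff -rq lines perform can also be scripted. The following is a minimal sketch using Python's standard filecmp module; the helper name trees_differ is hypothetical and is not part of this repo. Note that filecmp compares stat signatures first and only reads file contents when they differ, so it is a convenience check, not a replacement for diff -rq.

```python
import filecmp
from pathlib import Path


def trees_differ(src, dst):
    """Return a sorted list of differences between two directory trees.

    Hypothetical helper (not in the repo). An empty list means the
    loaded copy matches the dev source as far as filecmp can tell.
    """
    problems = []

    def walk(cmp, rel):
        for name in cmp.left_only:
            problems.append(f"only in source: {(rel / name).as_posix()}")
        for name in cmp.right_only:
            problems.append(f"only in copy: {(rel / name).as_posix()}")
        for name in cmp.diff_files + cmp.funny_files + cmp.common_funny:
            problems.append(f"differs: {(rel / name).as_posix()}")
        # Recurse into directories present on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, rel / name)

    walk(filecmp.dircmp(str(src), str(dst)), Path("."))
    return sorted(problems)


if __name__ == "__main__":
    # Paths taken from the sync commands above; missing directories are skipped.
    for skill in ("ase-chemist", "amber-chemist"):
        copies = (Path(".claude/skills") / skill,
                  Path.home() / ".claude/skills" / skill)
        for copy in copies:
            if not (Path(skill).is_dir() and copy.is_dir()):
                print(f"skip {copy}: directory not found")
                continue
            issues = trees_differ(skill, copy)
            print(f"{copy}: {'in sync' if not issues else issues}")
```

Run from the repo root, it prints one line per loaded copy. Any non-empty list means the harness would be testing stale files.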
# Regenerate input fixtures (test-inputs/caffeine.xyz, cluster.xyz, ar108.xyz, md.traj)
python generate_test.py
# Sanity-check the simulation environment (calculators present, tblite loadable)
python ase-chemist/scripts/check_env.py
# Run the full trigger-test suite (writes results/<id>.{log,status})
bash run_tests.sh
# Run with a longer per-prompt timeout (default 180s)
TIMEOUT_SECS=300 bash run_tests.sh
# Skim outcomes
for f in results/*.status; do printf "%-22s %s\n" "$(basename "$f" .status)" "$(cat "$f")"; done
# Re-run a single prompt manually (mirrors what run_tests.sh does)
claude -p "Optimize test-inputs/caffeine.xyz with GFN2-xTB ..." --verbose --output-format stream-json

The harness exits non-zero if any run timed out or errored; status values are
ok, timeout, or error:<rc> (124/137 are GNU timeout codes).
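For more than a skim, the status files can be tallied programmatically. This is a minimal sketch that assumes each results/<id>.status file holds a single status string of the forms listed above; the function name summarize_status is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_status(results_dir="results"):
    """Tally *.status files and collect the non-ok runs.

    Assumes one status string per file: ok, timeout, or error:<rc>.
    Returns (counts by kind, {prompt id: raw status} for failures).
    """
    counts = Counter()
    failures = {}
    for path in sorted(Path(results_dir).glob("*.status")):
        status = path.read_text().strip()
        # Collapse error:<rc> variants into a single "error" bucket.
        kind = "error" if status.startswith("error:") else status
        counts[kind] += 1
        if status != "ok":
            failures[path.stem] = status
    return counts, failures
```

Calling summarize_status() from the repo root returns the counts plus a mapping of failing prompt ids to their raw status, which is usually enough to decide which logs to open.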
run_tests.sh runs 43 prompts, each in a fresh claude -p
session with a 180 s wall-clock cap. Prompts are tagged with one of three
expected behaviors:
- trigger: the skill should activate and produce a correct script (ASE for ase-chemist, Amber for amber-chemist).
- no_trigger: generic prompts (boiling point of water, Python utility) that should not invoke the skill.
- borderline: definitional questions ("explain NVT vs NPT") or graceful-deferral cases (protein MD on a v1.3 box that only ships GAFF2) where either response is defensible; logs are for human review.
Coverage groups:
ase-chemist (p*, 17 prompts):
- p1–p5: v1.0/v1.1 baseline (xTB / EMT / LJ / TIP3P, build, analyze).
- p10_mace_named, p11_size_cliff: v1.2 MACE foundation-model trigger phrases and the size-cliff method-selection rule.
- p12_gaff2_named, p13_antechamber: v1.3 GAFF2 / antechamber trigger phrases and the "production explicit-solvent MD" task entry.
- p14_protein_md: v1.3 borderline; checks that the skill triggers on protein MD but explains the v2.3 deferral honestly.
- p15–p17: v1.4 Gaussian DFT: single-point (p15), the opt→freq thermochem pipeline (p16), and the p17 no-method-given borderline that must refuse silent method/basis defaults.
- p6–p9: original no_trigger and borderline cases.
amber-chemist (a1–a14): single-replica MD (a1), restart/extend
(a2), T-REMD and ladder tuning (a3, a11), implicit GB (a4),
demux (a5), MMPBSA / alanine scan (a6, a7), cpptraj and esander
analysis (a8, a9), the ff19SB biopolymer deferral (a10), a
trigger-collision case shared with p12 (a12), and the aMD /
no-MPI-REMD borderlines (a13, a14).
research / cross-skill (r1–r10, rX1–rX2): task-framed prompts
that name no method, so the skill must pick one. r1–r5 route to
ase-chemist (conformer DFT ranking, IR / solvation-free-energy DFT,
CO/Pt(111) adsorption, QSAR descriptors); r6–r10 route to
amber-chemist (MMGBSA rescoring, hbond lifetime, ligand stability,
per-residue decomposition, radius of gyration); rX1/rX2 are
borderlines (tool choice, FEP deferral).
The 180 s budget is intentionally too short to actually run a simulation — the test asks "did Claude write the right code?", not "did the code finish?". All prompts include "don't execute it" for that reason. If you change a prompt, preserve that instruction.
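The per-prompt cap and the "don't execute it" rule can be pictured with a small Python sketch. This is not how run_tests.sh is implemented (the harness is a bash script, and its exact handling of return codes 124/137 is not reproduced here); the function names and the mapping of outcomes to status strings are assumptions for illustration. Only the claude flags are taken from the manual re-run command shown earlier.

```python
import subprocess
import sys


def claude_argv(prompt):
    """Build the argv for one fresh `claude -p` session.

    Refuses prompts that lack the "don't execute it" instruction,
    since the time budget is too short to run a simulation.
    """
    normalized = prompt.replace("\u2019", "'").lower()
    if "don't execute it" not in normalized:
        raise ValueError("prompt must tell Claude not to execute the script")
    return ["claude", "-p", prompt, "--verbose", "--output-format", "stream-json"]


def run_with_cap(argv, timeout_secs=180.0):
    """Run a command under a wall-clock cap and return a status string.

    Assumed mapping: "ok" on exit code 0, "timeout" when the cap is hit,
    otherwise "error:<rc>".
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else f"error:{proc.returncode}"


if __name__ == "__main__":
    # Demonstrate the cap with a harmless child process instead of claude.
    print(run_with_cap([sys.executable, "-c", "print('hello')"], timeout_secs=30))
```

The guard in claude_argv is the useful part when editing prompts: a prompt that drops the instruction fails fast instead of burning the whole budget on a simulation.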
evals/evals.json is a separate, richer set of five evaluation prompts with
free-form expected outputs for manual review. There are deliberately no
programmatic assertions in v1 — adding stable ones (file presence, energy
ranges, drift sign) is iteration 2's job.
These are the load-bearing decisions baked into SKILL.md. Touching them will
move the eval results, so know what you're changing:
- Always run check_env.py first on non-trivial tasks, and recommend a method the environment actually supports rather than asking the user to install xTB when EMT or LJ would do.
- Method selection is a 3-step walk (task → calculator → install check), with explicit "why" reasons per rule. EMT for EMT-supported metals, TIP3P for pure water, GFN2-xTB for organics / main-group, GFN1-xTB as the d-block fallback, MACE past the xTB size cliff (~1k atoms), and GAFF2 + AM1-BCC for production-length explicit-solvent MD on a single small organic (v1.3+). Never silently substitute a wrong-physics calculator (the classic failure mode is EMT on an organic: it returns plausible nonsense).
- tblite is the supported xTB path; xtb-python is deprecated. The pip wheel is libgfortran-fragile on HPC; check_env.py reports [BROKEN] in that case and the install hint is conda install -c conda-forge tblite-python.
- The MACE cross-validation contract is non-negotiable (v1.2+). run_md.py --calculator mace validates against GFN2-xTB every 1 ps by default and aborts at MAE_F > 100 meV/Å. This is the basis on which the skill recommends MACE at all: ML potentials produce plausible-but-wrong PESs that users cannot spot. --no-validate is a per-run choice, never a default. If you change the threshold or cadence, document why in references/ml_potentials.md first.
- Amber uses shell-out, not ASE-Calculator (v1.3+), but this is a performance choice, not forced. ASE ships two Amber calculators: (1) ase.calculators.amber.Amber (FileIOCalculator, subprocess per call; unusable for MD), and (2) ase.calculators.amber.SANDER (pysander in-process bindings; works fine for ASE-driven MD via Langevin/VelocityVerlet). v1.3 declined the SANDER path because pysander binds only to CPU sander (no pmemd.cuda), giving up ~10–50× throughput on production-sized systems. A fourth option exists and is on the table: build a proper ASE Calculator around pmemd / pmemd.cuda (long-lived subprocess wrapper, or contribute pmemd Python bindings upstream). That gets both ASE-coherence and pmemd.cuda speed, at the cost of real engineering work. If you're tempted to "fix" the carve-out by switching to SANDER, read references/amber.md §1 first: the trade is documented and the decision is reviewable in PLAN.md §Phase 3, not load-bearing.
- The SKILL.md description field is the trigger contract. It deliberately enumerates user phrases ("relax this molecule", "thermalize at 300 K", "build a Pt(111) slab", "use MACE", "GAFF2 in water", "antechamber", etc.). If trigger reliability regresses, optimize that field first.
- Inline ASE code is fine when it's more honest than a script. Don't add scripts for one-shot tasks (e.g., a 5-line single-point or a single ase.build call).
- amber.md, ml_potentials.md, and gaussian.md describe shipping implementations (v1.3, v1.2, v1.4 respectively), not stubs. ml_potentials.md and gaussian.md are indexes that link to topic-scoped siblings (failure modes, method selection, force-field choices, etc.); amber.md is a single self-contained file (§1–§5) because the v1.3 carve-out is small enough not to need splitting, and the deep Amber surface lives in the sibling amber-chemist skill. Do not revert any of them to stub framing or flesh them out into workflow recipes beyond what the implementation actually supports. Future scope (Gaussian TS/IRC/NBO/post-HF, biopolymer Amber, ML potentials beyond MACE) is gated on PLAN.md decisions, not single prompts.
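To make the MACE cross-validation threshold concrete, here is a minimal sketch of a force-MAE gate in plain Python. It is not the Validator class from validate_ml_md.py. The function names are hypothetical, and the definition of MAE_F as the mean absolute error over all Cartesian force components is an assumption; the shipped definition lives in references/ml_potentials.md. Forces are taken in eV/Å, the ASE convention.

```python
ABORT_THRESHOLD_MEV_PER_ANG = 100.0  # from the contract: abort at MAE_F > 100 meV/Å


def force_mae_mev_per_ang(forces_ml, forces_ref):
    """Mean absolute force error in meV/Å.

    Both inputs are sequences of (fx, fy, fz) tuples in eV/Å, one per atom.
    Assumption: the mean runs over every Cartesian component.
    """
    if not forces_ml or len(forces_ml) != len(forces_ref):
        raise ValueError("force arrays must be non-empty and the same length")
    total = 0.0
    count = 0
    for f_ml, f_ref in zip(forces_ml, forces_ref):
        for a, b in zip(f_ml, f_ref):
            total += abs(a - b)
            count += 1
    return 1000.0 * total / count  # eV/Å -> meV/Å


def check_frame(forces_ml, forces_ref, threshold=ABORT_THRESHOLD_MEV_PER_ANG):
    """Return the MAE for one validated frame, or raise if it exceeds the threshold."""
    mae = force_mae_mev_per_ang(forces_ml, forces_ref)
    if mae > threshold:
        raise RuntimeError(
            f"MAE_F = {mae:.1f} meV/Å exceeds {threshold:.0f} meV/Å; aborting MD"
        )
    return mae
```

In this sketch a single 0.1 eV/Å error on one component of a two-atom frame averages to about 16.7 meV/Å and passes, while a uniform 0.2 eV/Å error on every component gives 200 meV/Å and aborts.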
What v1.x ships today:
- v1.0/v1.1: ASE built-ins (EMT, LJ, TIP3P), tblite (GFN1/GFN2-xTB), optimization / MD / single-point / trajectory analysis, structure building. scripts/single_point.py, scripts/optimize.py, scripts/run_md.py, scripts/analyze_traj.py.
- v1.2: MACE-MP-0 (89-element materials) and MACE-OFF (10-element organics) foundation models with element-set auto-routing and the cross-validation contract (above). scripts/ml_calculator.py, scripts/validate_ml_md.py.
- v1.3: Amber GAFF2 small-molecule MD via the antechamber AM1-BCC → parmchk2 → tleap → pmemd pipeline. scripts/parameterize_gaff2.py, scripts/run_amber.py.
- v1.4: Gaussian DFT SP / Opt / Freq through ase.calculators.gaussian.Gaussian. No method/basis defaults; SMD as the documented water-solvent default; thermochem parsing via the in-house scripts/_gaussian_log.py helper (no third-party parser). scripts/gaussian_sp.py, scripts/gaussian_opt.py, scripts/gaussian_freq.py. Goes through ASE: no carve-out, no cclib.
What's deferred (do not add without raising scope first):
- v2.2+: CHGNet (charge-aware materials), Orb-v3 (built-in confidence head), committee-uncertainty heads on a frozen MACE backbone. Documented in references/ml_potentials.md §6.
- v2.3: Amber for biopolymers: ff19SB+OPC (proteins), OL21 (nucleic acids), full tleap-from-PDB system prep with pdb4amber and disulfide handling. v1.3 ships only GAFF2; do not adapt the v1.3 mdin defaults for proteins.
- v3+: Gaussian transition-state (Opt=TS, QST2/QST3, IRC), anharmonic frequencies, NBO/NPA (NBO output has its own format that needs its own parser), post-HF (CCSD/MP2/CASSCF), excited states (TDDFT/CIS/EOM-CCSD). v1.4 ships SP/Opt/Freq/SMD; the rest is out of scope per references/gaussian.md §7.
- No v2 plan: VASP, Quantum ESPRESSO (community CP2K / FHI-aims bridges may land in v3), free-energy methods (TI / FEP / MBAR), enhanced sampling (REMD, metadynamics, umbrella sampling), QM/MM, constant-pH MD, RESP charges via Gaussian, SLURM/HPC submission templates, web GUIs. These are listed in ase-chemist/README.md and PLAN.md Phase 3.
scripts/check_env.py reports detection status for v2.2+ ML potentials
(CHGNet, M3GNet, SevenNet, Orb) in a [v2 preview] block so users see
what is on their box without the skill pretending to drive it. Items
previously in [v2 preview] (MACE, Amber, Gaussian) have been promoted
to supported [OK]/[MISSING] lines.
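The three-state reporting ([OK], [MISSING], [BROKEN]) can be illustrated with a standard-library probe. This is a sketch under assumptions, not the logic in check_env.py: the function names are hypothetical, and the idea that "broken" means "found on the path but fails to import" is inferred from the libgfortran case described earlier.

```python
import importlib
import importlib.util


def probe(module_name):
    """Classify a top-level module as [OK], [MISSING], or [BROKEN].

    [MISSING]: no import spec found.
    [BROKEN]:  found, but importing raises (e.g. a shared library fails to load).
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return "[MISSING]"
    try:
        importlib.import_module(module_name)
    except Exception:
        return "[BROKEN]"
    return "[OK]"


def report(module_names):
    """Return one aligned status line per module."""
    return [f"{probe(name):<10} {name}" for name in module_names]
```

For example, print("\n".join(report(["ase", "tblite", "mace"]))) lists the core backends, assuming those are the import names on your box. A real import is used instead of find_spec alone because a package can be installed and still fail at load time.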