You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Watch MD.hdf5 download. Run ls -lh data/misato/MD.hdf5 every ~10 min. DO NOT keep retrying dry-run — h5py refuses truncated files.
B
Verify label distribution: python -c "import json; d=json.load(open('data/targets.json')); vals=sorted(v['affinity_kcal_mol'] for v in d.values()); print(f'min={vals[0]:.2f} p10={vals[len(vals)//10]:.2f} median={vals[len(vals)//2]:.2f} p90={vals[len(vals)*9//10]:.2f} max={vals[-1]:.2f}')". Median should be ~-6 to -8 kcal/mol.
C
git clone repo locally + pip install streamlit torch numpy. Run streamlit run demo/app.py — mock data loads. Iterate on layout/visual polish. Push improvements.
D
Open AWS Workshop Studio. Enable Claude Haiku 4.5 in Bedrock console (us-east-1). Draft pitch slides 1-3: Problem, Architecture, Agent loop.
When MD.hdf5 = 132G: run python scripts/dry_run.py --misato-h5 data/misato/MD.hdf5 --pdb-id 5WIJ. Report PASS / FAIL with output. If PASS: kick off preprocessing in background: nohup python scripts/preprocess_features.py --misato-h5 data/misato/MD.hdf5 --limit 2000 > preprocess.log 2>&1 &
B
Once featurized.h5 exists (~20-30 min after preprocess starts): python scripts/train_gbm_baseline.py. This is the R1 gate. Report Pearson r.
C
Continue demo polish. Add a "loading…" state so the UI works during model inference.
D
Pitch slides 4-5: Why OpenTSLM (judge's own model) + Why Chronos (the encoder the OpenTSLM team recommended).
Hour 2 (T+1h30min → T+3h): wiring stage
Person
Task
A
Test loading juncliu checkpoint into the freshly-instantiated model: python -c "import torch; from opentslm.model.llm.OpenTSLMFlamingo import OpenTSLMFlamingo; m = OpenTSLMFlamingo(device='cuda', llm_id='meta-llama/Llama-3.2-1B', encoder_type='chronos2'); s = torch.load('/home/user/dmonopoli/.cache/huggingface/hub/models--juncliu--llama-3.2-1b-ecg-flamingo-epoch-35/snapshots/cfcdf8f7141b729ae50da4e1ef4e3bdc2b638674/best_model.pt', map_location='cpu', weights_only=False); m.model.load_state_dict(s, strict=False); print('loaded')"
B
If R1 PASS (r ≥ 0.3): green-light A to start training. If r < 0.1: flag for adding more channels.
C
Wire a stub agent.run_agent() call into Streamlit (replace mock_report()). Will be slow without trained checkpoint but the path is exercised.
D
Pitch slide 6: Eval methodology (Pearson r + abstention precision/recall).
Critical: Overfit a single batch to near-zero loss. Use tslm_md/train_stage6.py config with max_steps: 200. If loss < 0.05 at step 200 → architecture is sound. If not → debug 1-2 h max, then fall back to --encoder-type cnn.
B
Prepare configs/stage6_md_cot.yaml for the real training run: batch_size: 1, grad_accum_steps: 16, max_steps: 3000, ckpt_every_steps: 500.
C
Test demo with the half-overfit checkpoint — should produce semi-sensible outputs.
D
Pitch slides 7-8: Business case (CHF 2-3B/drug, MD signal discarded) + AWS architecture diagram.
Hour 4-12 (T+5h → T+13h): TRAINING (8 hours)
Person
Task
A
Launch real training: python tslm_md/train_stage6.py --config configs/stage6_md_cot.yaml --starting-checkpoint <juncliu_path>/best_model.pt. Watch the loss curve. Save ckpt every 500 steps. Sleep / stay productive — don't babysit.
B
Build the evaluation script. Test it against random predictions first. Confirm Pearson computation is correct.
C
Wire the latest checkpoint into the demo every 30 min: pull, restart Streamlit, verify outputs improve.
D
Pitch slides 9-10: Future work + closing. Run through full deck twice for timing (target 4-5 min).
Hour 12 (T+13h): CONVERGENCE GATE
If val loss decreasing AND val Pearson > 0.15
✅ Continue training another 2 h, then stop and evaluate. Proceed to hour 14 plan.
Loss flat or val Pearson ≈ 0
❌ STOP. Switch demo to use the pretrained juncliu checkpoint with our prompts (no fine-tune). Pitch the architecture, not the convergence.
NaN / diverging
❌ Reduce LR 5×, restart from juncliu checkpoint, lose ~30 min. If still diverging by hour 13 → fall back as above.
Hour 14-17 (T+15h → T+18h): final eval + polish
Person
Task
A
Run final eval on held-out test set. Save numbers: Pearson r, MAE, abstention rate, abstention precision/recall.
B
Pick the CONFIRMED case study (low disagreement, accurate prediction) and the INCONCLUSIVE case study (high disagreement, ambiguous binding) — these are the demo highlights.
C
Final demo testing on 5 unseen PDB ids. Fix any UI bugs. Confirm the abstention badge shows correctly.
D
Record a backup demo video (~3 min) in case live demo breaks. Insert eval numbers + case studies into slides.
Hour 17-19 (T+18h → T+19h): rehearsal
All four: rehearse the pitch end-to-end ≥ 2 times. Aim for 4:30 spoken (so you have 30 sec slack on a 5-min pitch).
Run order: Problem (D) → Architecture (D) → Live demo (C) → Results (D) → AWS productionization (D) → Vision (D)
One person mans the demo laptop, one person speaks, two people answer Q&A.
Critical decision points
Hour
Gate
Pass criterion
Fail action
2
Dry-run + R1 baseline
dry-run ✅ + GBM Pearson r ≥ 0.3
r < 0.1: add ch7=H-bonds, ch8=contact entropy; lose 1 h
4
Wiring
Single batch overfits to loss < 0.05
Try --encoder-type cnn; if still broken: pitch C-MAPSS instead
12
Convergence
Val Pearson > 0.15 AND loss decreasing
Stop, use juncliu pretrained ckpt + our prompts
16
Backup video
Live demo works on 3 unseen ids
Record video, plan to play it if live fails
What the demo MUST do (the moment of truth)
Story arc:"A computational chemist pastes a PDB id and gets back a verified affinity prediction in seconds."
Sidebar: PDB id input + Analyse button (already in demo/app.py)
Top row: 6 sparkline charts of the feature trajectories
One CONFIRMED demo and one INCONCLUSIVE demo are the highlight moments
Pitch outline (5 slides, ~5 minutes)
Slide
Owner
Content
1. Hook + Problem
D
"Binding affinity is the drug-discovery bottleneck. Current ML throws away the MD trajectory — the very signal that distinguishes a stable bound pose from an unbinding event."
2. Architecture
D
One diagram: PDB id → trajectory → 6-channel time-series → Chronos → Llama → answer + rationale + verifier → CONFIRMED/INCONCLUSIVE. "First TSLM applied to molecular dynamics."
3. Live demo
C
Paste PDB id #1 → CONFIRMED. Paste PDB id #2 → INCONCLUSIVE with reason. "It knows when not to trust itself."
4. Results + judge alignment
D
Pearson r table + abstention precision/recall. "We extended your own OpenTSLM with stage 6, used your team's recommended Chronos checkpoint, mirrored the SOC-agent precedent."
5. AWS productionization + vision
D
"Compute on vast.ai for the sprint; production runs on SageMaker training jobs + SageMaker Endpoint + Bedrock for second-opinion summarization. Scales linearly per complex."
What we will say IF the live demo breaks
Have the backup video ready. Don't apologize twice. One smooth pivot:
"Live network is being a network — here's our recorded run on three judge-selected PDB ids from yesterday."
Play 30 seconds of video, return to slides.
What NOT to do in the next 19 hours
❌ Don't try to use the full 16k complex training set. 2000 is enough.
❌ Don't tune hyperparameters. Start with our config; only change if loss diverges.
❌ Don't add features beyond the spec. No bonus risk.
❌ Don't pitch the Pearson r number first — lead with the agent + abstention behavior.
❌ Don't reinstall anything that already works (Llama, Chronos, juncliu, OpenTSLM).
❌ Don't burn time getting AWS perfect — Bedrock summarizer is the only AWS-critical wiring.
Tracking artifact
This file lives at docs/hackathon-plan-19h.md. Update the "status" column below as you hit gates.