This is an autonomous research experiment for optimizing waste classification in Nigeria.
You are helping optimize an AI model that classifies waste into 3 categories:
- E-waste (class 0): batteries, phones, electronics, laptops
- Recyclable (class 1): plastic bottles, containers, wrappers, glass, metal, paper
- Organic (class 2): food waste, compost
The goal is to maximize the combined score:
combined_score = classification_accuracy - 0.1 * yield_prediction_mse
Higher classification accuracy is better. Lower yield prediction MSE is better.
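The scoring rule above can be sketched as a small function. This is a minimal illustration, not the evaluation code from prepare.py. It assumes accuracy is passed as a fraction in [0, 1]; the function name and that unit convention are assumptions.

```python
def combined_score(classification_accuracy: float, yield_prediction_mse: float) -> float:
    """Reward classification accuracy and penalize yield-prediction error."""
    return classification_accuracy - 0.1 * yield_prediction_mse


# Assumption: accuracy is a fraction in [0, 1]; multiplying by 100 is for display only.
score = combined_score(0.855, 0.0234)
print(f"combined_score: {100 * score:.2f}")
```

Because the MSE term is weighted by 0.1, a change in accuracy moves the score ten times as much as an equal change in MSE.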
- Agree on a run tag: Propose a tag based on today's date (for example apr01) and create branch autoresearch/<tag>.
- Read train.py: This is the only file you modify during experimentation. It is self-contained and declares its dependencies inline for uv.
- Read prepare.py: This is the fixed evaluation reference. DO NOT MODIFY.
- Verify Modal is configured: Run modal token new if needed, so that modal run works.
- Test a single run: Confirm the stack works with modal run modal_app.py.
- Initialize results.tsv: Create it with the header row only.
- Confirm and go: Once the setup is valid, begin the experiment loop.
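The results.tsv initialization step could look like the sketch below. The column names are the ones this document uses for results.tsv. The helper name init_results and the choice to leave an existing file untouched are assumptions.

```python
import csv
from pathlib import Path

# Column names as listed for results.tsv in this document.
HEADER = ["commit", "val_accuracy", "yield_mse", "combined_score",
          "memory_gb", "status", "paper", "description"]


def init_results(path: str = "results.tsv") -> None:
    """Create the results file with only the header row (hypothetical helper)."""
    target = Path(path)
    if target.exists():
        return  # assumption: never overwrite an existing experiment log
    with target.open("w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerow(HEADER)


if __name__ == "__main__":
    init_results()
```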
Each experiment runs on a T4 GPU via Modal. Launch training with:
modal run modal_app.py 2>&1 | tee run.log
- modal_app.py automatically uploads the current train.py to the Modal container
- Training runs on a T4 GPU with a 10-minute timeout
- Downloaded datasets are cached on a persistent Modal volume between runs
- Kaggle credentials are injected automatically if you've set them up with modal secret create kaggle-credentials
Each experiment should fit within the Modal timeout budget.
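One way to stay inside the timeout is to stop starting new epochs once another one would likely overrun. The sketch below is illustrative only and is not taken from train.py. The names train_with_budget, run_epoch, and safety_margin are hypothetical.

```python
import time


def train_with_budget(run_epoch, max_epochs: int, budget_seconds: float,
                      safety_margin: float = 60.0) -> int:
    """Run epochs until max_epochs is reached or the next epoch might exceed the budget.

    run_epoch is a hypothetical callable that trains for one epoch.
    Returns the number of epochs that completed.
    """
    start = time.monotonic()
    longest_epoch = 0.0
    completed = 0
    for epoch in range(max_epochs):
        elapsed = time.monotonic() - start
        # Leave room for one more epoch plus evaluation and metric printing.
        if elapsed + longest_epoch + safety_margin > budget_seconds:
            break
        epoch_start = time.monotonic()
        run_epoch(epoch)
        longest_epoch = max(longest_epoch, time.monotonic() - epoch_start)
        completed += 1
    return completed
```

With the 10-minute timeout mentioned above, budget_seconds would be 600. The margin is a guess and should cover whatever work happens after the last epoch.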
What you CAN do:
- Modify train.py
- Change the model architecture
- Change hyperparameters such as learning rate, batch size, weight decay, epochs
- Change image size and data augmentation
- Add or remove components such as the yield prediction head or auxiliary losses
- Try different pretrained backbones
- Modify the training loop
What you CANNOT do:
- Modify prepare.py
- Modify the evaluation harness
- Change the definition of the reported metrics
The goal: Maximize the combined score:
combined_score = classification_accuracy - 0.1 * yield_prediction_mse
VRAM is a soft constraint. Some increase is acceptable for meaningful gains, but avoid wasteful blowups.
Simplicity criterion: All else being equal, simpler is better. A tiny gain that adds ugly complexity is usually not worth it. If you can remove code and keep or improve performance, that is a good outcome.
Before each experiment, use hf papers to find ideas from recent research.
# Search for relevant techniques
hf papers search "waste classification deep learning"
hf papers search "image classification transfer learning"
hf papers search "data augmentation computer vision"
hf papers search "efficient resnet training"
# Read a promising paper
hf papers read <paper_id>
Use papers as inspiration, not as scripts to copy mechanically. Prefer ideas that are simple to implement inside train.py and plausible under the runtime budget.
When training finishes, it prints:
val_accuracy: 85.50
yield_mse: 0.0234
combined_score: 85.27
training_seconds: 298.5
peak_vram_mb: 4120.5
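The printed summary can also be parsed programmatically. The sketch below assumes each metric sits on its own line in exactly the key: value form shown above. The helper name parse_metrics is hypothetical.

```python
import re

METRIC_KEYS = ("val_accuracy", "yield_mse", "combined_score",
               "training_seconds", "peak_vram_mb")
_METRIC_LINE = re.compile(r"^(%s):\s*(\S+)\s*$" % "|".join(METRIC_KEYS))


def parse_metrics(log_text: str) -> dict:
    """Return the metrics found in a run log; an empty dict suggests a crash."""
    metrics = {}
    for line in log_text.splitlines():
        match = _METRIC_LINE.match(line)
        if not match:
            continue
        try:
            metrics[match.group(1)] = float(match.group(2))
        except ValueError:
            pass  # ignore lines whose value is not a number
    return metrics
```

Usage: parse_metrics(open("run.log").read()) returns a dictionary keyed by metric name.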
Extract from log:
grep "^val_accuracy:\|^yield_mse:\|^combined_score:\|^training_seconds:\|^peak_vram_mb:" run.log
Log to results.tsv (tab-separated):
commit val_accuracy yield_mse combined_score memory_gb status paper description
Example:
a1b2c3d 85.50 0.0234 85.27 4.0 keep - baseline resnet18
b2c3d4e 87.20 0.0210 87.01 4.2 keep 2503.08234 efficientnet_b0 from paper
c3d4e5f 86.80 0.0195 86.60 5.1 keep - stronger augmentation
d4e5f6g 0.00 0.0000 0.00 0.0 crash - vit OOM
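Appending a row in the format above might look like the following sketch. It assumes memory_gb is peak_vram_mb divided by 1024 and that missing metrics are logged as zeros, as in the crash row. The helper name append_result is hypothetical.

```python
import csv


def append_result(path: str, commit: str, metrics: dict, status: str,
                  paper: str = "-", description: str = "") -> list:
    """Append one tab-separated row to the results file and return the row."""
    # Assumption: memory_gb is derived from peak_vram_mb using 1024 MB per GB.
    memory_gb = metrics.get("peak_vram_mb", 0.0) / 1024
    row = [
        commit,
        f"{metrics.get('val_accuracy', 0.0):.2f}",
        f"{metrics.get('yield_mse', 0.0):.4f}",
        f"{metrics.get('combined_score', 0.0):.2f}",
        f"{memory_gb:.1f}",
        status,
        paper,
        description,
    ]
    with open(path, "a", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerow(row)
    return row
```

Passing an empty metrics dictionary with status "crash" produces an all-zero row.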
Repeat forever:
- Research with hf papers search and identify one promising idea
- Implement the idea by modifying train.py
- Commit the change with git commit
- Run the experiment:
modal run modal_app.py 2>&1 | tee run.log
- Evaluate the run by reading the reported metrics
- Log the result to results.tsv
- If combined_score improved, keep the commit
- If the score is worse or equal, revert with git reset --hard HEAD^
- Continue to the next experiment immediately
If the metric grep output is empty, the run crashed. Inspect the log, decide whether the failure is fixable, and either retry with a minimal fix or log a crash and move on.
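The keep, revert, or crash decision can be written as a small pure function. In this sketch "keep" and "crash" match the status values shown in results.tsv. The "discard" label for reverted runs and the function name decide are assumptions.

```python
from typing import Optional


def decide(metrics: dict, best_score: Optional[float]) -> str:
    """Classify a finished run as 'keep', 'discard', or 'crash'."""
    if "combined_score" not in metrics:
        return "crash"  # no metrics were printed, so the run did not finish
    if best_score is None or metrics["combined_score"] > best_score:
        return "keep"  # strictly better than the best score so far
    return "discard"  # worse or equal: revert with git reset --hard HEAD^
```

A tie counts as a discard, which follows the rule that a worse or equal score is reverted.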
NEVER STOP: Once the loop begins, do not ask the human whether to continue. Keep researching, implementing, running, evaluating, and iterating until interrupted.
The training script loads data via the HuggingFace datasets library. No manual setup needed.
HuggingFace (always available):
- omasteam/waste-garbage-management-dataset: 10-class waste images (split: train)
- huaweilin/waste-classification: hierarchical waste labels (split: cleaned, label field: subclass)
- NeoAivara/Waste_Classification_data: 12-class waste images (split: train)
Kaggle (optional, requires credentials in environment):
- asdasdasasdas/garbage-classification: 6-class garbage images
- isaacritharson/metal-glassgarbage-classification-data: glass, metals, cardboard
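Skipping the optional Kaggle sources could be handled as in the sketch below. KAGGLE_USERNAME and KAGGLE_KEY are the environment variables read by the official Kaggle API client. Whether the Modal secret exposes exactly those names is an assumption, and the helper names are hypothetical.

```python
import os


def kaggle_available(env=os.environ) -> bool:
    """True when both Kaggle API environment variables are set and non-empty."""
    return bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))


def select_sources(hf_sources, kaggle_sources, env=os.environ) -> list:
    """Always use the HuggingFace sources; add Kaggle ones only with credentials."""
    sources = list(hf_sources)
    if kaggle_available(env):
        sources += list(kaggle_sources)
    else:
        print("Kaggle credentials not found; skipping Kaggle datasets.")
    return sources
```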
All labels are mapped to 3 categories: e-waste (0), recyclable (1), organic (2).
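A keyword-based version of this mapping is sketched below as an illustration. The actual mapping in train.py may differ. The keywords come from the category descriptions in this document, and the source label strings are assumptions.

```python
E_WASTE, RECYCLABLE, ORGANIC = 0, 1, 2

# Keywords taken from the category descriptions; checked in insertion order.
KEYWORDS = {
    E_WASTE: ("battery", "batteries", "phone", "electronic", "laptop", "e-waste"),
    ORGANIC: ("food", "compost", "organic"),
    RECYCLABLE: ("plastic", "bottle", "container", "wrapper",
                 "glass", "metal", "paper"),
}


def map_label(name: str):
    """Map a source label name to a class id, or None if nothing matches."""
    text = name.lower()
    for target, words in KEYWORDS.items():
        if any(word in text for word in words):
            return target
    return None  # the caller decides whether to drop or hand-map the sample
```

Substring matching is crude: a label such as "food wrappers" would land in organic. An explicit per-dataset lookup table is more reliable when the label sets are known.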
- The model classifies images, not text
- train.py already handles HuggingFace datasets directly at runtime
- Kaggle data is optional and should be skipped gracefully if credentials are unavailable
- ImageNet-pretrained backbones are good defaults
- Start from a strong baseline, then iterate
- Cost: a T4 GPU costs about $0.005/min, so a 5-minute experiment costs about $0.025 and $30 of credit covers roughly 1,200 experiments
- If you want faster runs, change GPU_TYPE in modal_app.py to "a100" (more expensive)
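The budget arithmetic in the cost note can be checked with a short calculation. The rate and run length are the approximate figures quoted above. The function name is hypothetical.

```python
import math


def experiments_within_budget(credit_usd: float, usd_per_minute: float,
                              minutes_per_run: float) -> int:
    """Number of whole runs that a given credit covers at a flat per-minute rate."""
    cost_per_run = usd_per_minute * minutes_per_run
    # The small epsilon guards against floating-point error just below a whole number.
    return math.floor(credit_usd / cost_per_run + 1e-9)


runs = experiments_within_budget(credit_usd=30.0, usd_per_minute=0.005, minutes_per_run=5.0)
print(runs)
```

Runs that use the full 10-minute timeout cost twice as much each, which halves the count.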