UnslothPuzzles

Name: Rohit Nagraj

GPU used for testing: NVIDIA T4 (plus an RTX 3090 for BF16 correctness checks in Challenge A)

Solutions

Challenge A

✅ Single Triton Kernel
✅ Speedup >= 1.15x (1.40x achieved)
✅ Kernel works in torch.compile
✅ Custom asm works (I did not need it, though it could have improved the speedup further)
✅ Uses cache eviction
✅ Tested in FP16 and BF16 (BF16 on personal RTX 3090)

Challenge E

✅ 50% VRAM reduction (a smaller chunk_size reduces it further)
✅ Show cross-entropy loss works
✅ Show other loss functions work
✅ Allow dynamic chunk sizes
✅ Llama 3.2 1B training loss values match across all steps
❌ Works with GRPO Loss kernel (not tested)
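
As a rough illustration of why chunk_size trades memory for nothing in accuracy, here is a minimal pure-Python sketch of a chunked cross-entropy forward pass. It is not the code in this repository: the function names (`project`, `cross_entropy_sum`, `chunked_cross_entropy`) and the list-of-lists data layout are assumptions made for the example, and it omits the backward pass that a real training implementation would need. The point it shows is that only one chunk of logits exists at a time, and that the mean loss is the same for any chunk size, including sizes that do not divide the number of rows.

```python
import math

def project(rows, weight):
    """Logits for a block of hidden rows: one dot product per vocabulary row."""
    return [[sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
            for row in rows]

def cross_entropy_sum(logits, targets):
    """Summed (not averaged) cross-entropy, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]
    return total

def chunked_cross_entropy(hidden, weight, targets, chunk_size):
    """Mean cross-entropy, materializing logits for one chunk at a time."""
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        # Only a (chunk_size x vocab) block of logits is alive here.
        chunk_logits = project(hidden[start:stop], weight)
        total += cross_entropy_sum(chunk_logits, targets[start:stop])
    # Divide by the full row count so the result is independent of chunk_size.
    return total / len(hidden)

# Hypothetical toy data: 5 rows, hidden size 3, vocabulary of 4.
hidden = [[0.1 * (i + j) for j in range(3)] for i in range(5)]
weight = [[0.2 * (v - j) for j in range(3)] for v in range(4)]
targets = [0, 1, 2, 3, 1]

full = chunked_cross_entropy(hidden, weight, targets, chunk_size=5)
chunked = chunked_cross_entropy(hidden, weight, targets, chunk_size=2)
```

Accumulating the summed loss per chunk and dividing once at the end is what keeps the result identical across chunk sizes; averaging inside each chunk would weight a short final chunk incorrectly.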

Challenge B

❌ Unsuccessful Attempt

Challenge C & D

❌ No Attempt

