- Professional CUDA C Programming - John Cheng, Max Grossman, Ty McKercher
- CUDA by Example - Jason Sanders, Edward Kandrot
- Programming Massively Parallel Processors - David Kirk, Wen-mei Hwu
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/NewOptimization`)
- Commit your changes (`git commit -m 'Add new optimization technique'`)
- Push to the branch (`git push origin feature/NewOptimization`)
- Open a Pull Request with a clear description
- Follow CUDA coding best practices
- Include benchmarking results for optimizations
- Add comments explaining the technique
- Verify correctness before submitting
This project is licensed under the MIT License. See the LICENSE file for details.
- GitHub: @bnvai
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Beginner:
- Start with `01_vec_add.py` - Understand basic CUDA workflow
- Study `02_softmax.cu` - Learn kernel structure and memory management
- Explore `atomicAdd.cu` - Understand thread synchronization
Intermediate:
- Analyze `naive_matmul.cu` - Basic matrix operations
- Study `tiled_matmul.cu` - Shared memory optimization
- Benchmark `unrolling_example.cu` - Understand loop optimization
Advanced:
- Profile with NVTX in `nvtx_matmul.cu`
- Implement `stream_advanced.cu` - Asynchronous execution
- Create custom kernels for your use cases
Made with ❤️ for GPU Computing Enthusiasts
Happy CUDA Learning! 🚀🎓