Enable ROCm GPU acceleration for llama.cpp GGUF export #512
GoldenGrapeGentleman wants to merge 2 commits into unslothai:main
Conversation
Summary of Changes

Hello @GoldenGrapeGentleman, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the llama.cpp GGUF export functionality by integrating automatic ROCm GPU detection and compilation. The change lets AMD GPU users leverage hardware acceleration, substantially improving inference speeds for models converted to GGUF format, and addresses the previous limitation of CPU-only processing on ROCm systems.
Code Review
The pull request introduces automatic ROCm GPU detection and compilation support for llama.cpp during GGUF export, which is a valuable addition for AMD GPU users. The changes aim to enable hardware-accelerated inference, significantly improving performance. The implementation correctly identifies ROCm and CUDA backends and generates appropriate CMake flags. However, there is a critical issue with the handling of the gpu_support variable that needs to be addressed to ensure the feature works as intended.
unsloth_zoo/llama_cpp.py (outdated)

```python
# Detect GPU backend for CMake
gpu_cmake_flags = ""
if gpu_support: # Accept both True and "ON"
```
Following up on the previous comment, since gpu_support is converted to a string ("ON" or "OFF") at line 382, this if gpu_support: check will always evaluate to True. This means the GPU detection logic will run even if the user intended to disable GPU support. Please ensure gpu_support remains a boolean until it's used in the cmake command.
```diff
- if gpu_support: # Accept both True and "ON"
+ if gpu_support:
```
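To make the failure mode concrete, here is a small standalone sketch (hypothetical function names, not the actual unsloth_zoo code) of why the truthy check misfires after the boolean-to-string conversion:

```python
def detection_runs(gpu_support):
    # install_llama_cpp() converts the boolean to "ON"/"OFF" first.
    gpu_support = "ON" if gpu_support else "OFF"
    # Buggy guard: "OFF" is a non-empty string, hence truthy.
    return bool(gpu_support)

def detection_runs_fixed(gpu_support):
    gpu_support = "ON" if gpu_support else "OFF"
    # Fixed guard: compare against the converted string explicitly.
    return gpu_support == "ON"

print(detection_runs(False))        # True  (bug: detection runs anyway)
print(detection_runs_fixed(False))  # False
```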
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fcc53bd30a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
unsloth_zoo/llama_cpp.py (outdated)

```python
# Detect GPU backend for CMake
gpu_cmake_flags = ""
if gpu_support: # Accept both True and "ON"
```
**Respect gpu_support=OFF before auto-enabling CUDA/HIP**

`gpu_support` is converted to the strings `"ON"`/`"OFF"` earlier in this function, so the new `if gpu_support:` guard is always truthy and runs backend auto-detection even when callers explicitly disable GPU support. In CUDA environments this path sets `gpu_cmake_flags` to `-DGGML_CUDA=ON`, which overrides an explicit OFF request and can make CPU-only builds fail on systems without a full CUDA toolchain; previously the CMake invocation preserved `-DGGML_CUDA=OFF` in this case.
danielhanchen left a comment
Code review -- found a bug in the gpu_support guard logic.
Bug: gpu_support is converted from bool to string ("ON"/"OFF") at line 427 before the new detection block. The check if gpu_support: on line ~430 evaluates "OFF" as truthy (non-empty string), so the detection block always runs regardless of the original gpu_support parameter.
When gpu_support=False is passed (e.g. from unsloth/save.py lines 1212-1218, where install_llama_cpp is called with gpu_support=False and the comment says "GGUF conversion doesn't need CUDA"):
- Original behavior: `-DGGML_CUDA=OFF` is set
- With this PR: the detection block runs, `torch.cuda.is_available()` returns True on GPU machines, and `-DGGML_CUDA=ON` is set instead of `-DGGML_CUDA=OFF`
- This means GGUF builds attempt CUDA compilation even when explicitly disabled

Fix: change `if gpu_support:` to `if gpu_support == "ON":` to match the already-converted string type.
The ROCm detection logic itself looks correct and well-structured. Just needs the guard fix.
Hi @danielhanchen, thanks for the review! Fixed the guard to `if gpu_support == "ON":` as suggested. Please take another look when you get a chance. 🙏
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e640bdfdb3
```python
needs_build = False

# Ensure ~/.unsloth/ exists before we try to use it
os.makedirs(UNSLOTH_HOME, exist_ok=True)
```
**Gate UNSLOTH_HOME creation behind default install path**

`install_llama_cpp()` now unconditionally executes `os.makedirs(UNSLOTH_HOME, exist_ok=True)` before doing anything else. This breaks callers that intentionally pass a different `llama_cpp_folder` in environments where `$HOME` is read-only (common in managed containers), because the function raises `PermissionError` even though the requested install directory is writable. Previously those callers could proceed without touching `~/.unsloth`.
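A minimal sketch of the suggested gating, using the assumed names from the diff (`UNSLOTH_HOME`, `install_llama_cpp`, `llama_cpp_folder`); only the default path touches the home directory:

```python
import os

UNSLOTH_HOME = os.path.expanduser("~/.unsloth")

def install_llama_cpp(llama_cpp_folder=None):
    if llama_cpp_folder is None:
        # Default install: create ~/.unsloth only when we actually use it.
        os.makedirs(UNSLOTH_HOME, exist_ok=True)
        llama_cpp_folder = os.path.join(UNSLOTH_HOME, "llama.cpp")
    # Custom installs never touch $HOME, so read-only home dirs still work.
    return llama_cpp_folder
```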
```python
    "-DBUILD_SHARED_LIBS=OFF",
    gpu_cmake_flags,
]
```
**Pass ROCm CMake defines as separate argv entries**

In the Windows CMake configure path, `gpu_cmake_flags` is appended as a single list element even when it contains multiple ROCm `-D...` options separated by spaces. Because `subprocess.run(..., shell=False)` does not split that string, CMake receives one malformed argument instead of distinct defines, so HIP compiler/architecture flags are not parsed correctly and ROCm configuration can fail.
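One possible fix, sketched here with `shlex.split` (an assumption, not necessarily the PR's eventual change): split the flag string so each define becomes its own argv entry when `shell=False`. The flag values below are illustrative.

```python
import shlex

# Example multi-flag string as built on the ROCm path (illustrative values).
gpu_cmake_flags = "-DGGML_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100"

# Broken: CMake receives one malformed argument containing a space.
args_bad = ["cmake", "-B", "build", gpu_cmake_flags]

# Fixed: shlex.split() yields one argv entry per define, honoring quoting.
args_good = ["cmake", "-B", "build", *shlex.split(gpu_cmake_flags)]

print(args_good[-2:])  # ['-DGGML_HIP=ON', '-DCMAKE_HIP_ARCHITECTURES=gfx1100']
```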
Force-pushed from e640bdf to 378a905.
Adds automatic ROCm/HIP detection in `install_llama_cpp()`:
- Detects ROCm via `torch.version.hip` on AMD systems
- Generates HIP-specific CMake flags with GPU arch detection (e.g. gfx1100)
- Falls back to CUDA flags on NVIDIA systems
- `gpu_support=False` (default) skips GPU backend entirely (no behavior change)
- Fixed guard: uses `if gpu_support == "ON":`, not a truthy string check

Tested on AMD Radeon PRO W7900 (gfx1100, ROCm 7.1): GPU inference at 312.4 t/s (5.9x faster than CPU-only).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2428481b11
```python
if gpu_support == "ON":
    try:
        import torch
        if hasattr(torch.version, 'hip') and torch.version.hip is not None and torch.cuda.is_available():
```
**Detect HIP backend without requiring a visible GPU**

The new ROCm branch only runs when `torch.cuda.is_available()` is true, so ROCm environments with a HIP build of PyTorch but no currently visible device (for example CI/headless containers or `HIP_VISIBLE_DEVICES`-restricted jobs) fall through to `-DGGML_CUDA=ON`. That makes CMake try CUDA instead of HIP and can fail the build even though the machine is configured for ROCm. Using `torch.version.hip` alone for backend selection (or at least not forcing CUDA in this case) avoids this regression.
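A sketch of selection logic that keys the backend off the PyTorch build rather than device visibility; the parameters mirror `torch.version.hip` and `torch.cuda.is_available()` so the decision is testable without a GPU. This is an illustrative restructuring, not the PR's diff:

```python
def select_backend(hip_version, cuda_available):
    if hip_version is not None:
        # HIP build of PyTorch: target ROCm even with no visible device
        # (headless CI, HIP_VISIBLE_DEVICES-restricted jobs).
        return "HIP"
    if cuda_available:
        return "CUDA"
    # No usable backend detected: build CPU-only rather than guessing.
    return "CPU"

print(select_backend("7.1.0", False))  # HIP, not CUDA
print(select_backend(None, True))      # CUDA
print(select_backend(None, False))     # CPU
```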
- Merge the print + ROCm detection into a single if/else block
- Remove redundant pre-assignment of `gpu_cmake_flags` outside the block
- ON path: detect ROCm (HIP flags) or fall back to CUDA=ON
- else path: `gpu_cmake_flags = "-DGGML_CUDA=OFF"`
- No logic change, cleaner control flow
**Verification + response to unslothai/unsloth#4103 closure**

Tested end-to-end on AMD MI355X (gfx950, ROCm 7.1): installed this branch directly and called …. All binaries (…) ….

Re @danielhanchen's close reason on #4103:
- On conversion: agreed, …
- On compile time: first-time HIP build takes ~7 min on MI355X (…)
- On scope: …
Enable ROCm GPU acceleration for llama.cpp GGUF export

**Summary**

Adds automatic ROCm/HIP detection in `install_llama_cpp()` so AMD GPU users get hardware-accelerated GGUF inference.

**Problem**

On AMD ROCm systems, llama.cpp was compiled without GPU support, resulting in CPU-only inference that is 5–9x slower than GPU-accelerated inference.

**Solution**

After `gpu_support = "ON"` is confirmed, detect the active GPU backend:
- ROCm: detects via `torch.version.hip`, queries `gcnArchName` for the target arch, sets `-DGGML_HIP=ON` with clang/clang++ compilers and `CMAKE_HIP_ARCHITECTURES`
- NVIDIA: falls back to `-DGGML_CUDA=ON` flags (no behavior change)
- `gpu_support=False` (default): entire block is skipped (no behavior change for existing users)

**Changes**

`unsloth_zoo/llama_cpp.py` (+33/-2):
- GPU backend detection after the `gpu_support == "ON"` confirmation
- Hardcoded `-DGGML_CUDA={gpu_support}` replaced with dynamic `gpu_cmake_flags` in both the Linux and Windows cmake paths
- Guard fixed to `if gpu_support == "ON":` (was `if gpu_support:`, which treats `"OFF"` as truthy)
**Testing**

Tested on AMD Radeon PRO W7900 (gfx1100, ROCm 7.1): GPU inference at 312.4 t/s, 5.9x faster than CPU-only.

**Notes**

The upstream issue (unslothai/unsloth#4103, requesting `gpu_support=True`) was closed by @danielhanchen pending investigation of compile time. This PR keeps the detection logic ready for when that decision is revisited.