Enable ROCm GPU acceleration for llama.cpp GGUF export #512
GoldenGrapeGentleman wants to merge 2 commits into unslothai:main
Conversation
Summary of Changes

Hello @GoldenGrapeGentleman, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the llama.cpp GGUF export functionality by integrating automatic ROCm GPU detection and compilation. The change lets AMD GPU users leverage hardware acceleration, substantially improving inference speeds for models converted to GGUF format, and addresses the previous limitation of CPU-only processing on ROCm systems.
Code Review
The pull request introduces automatic ROCm GPU detection and compilation support for llama.cpp during GGUF export, which is a valuable addition for AMD GPU users. The changes aim to enable hardware-accelerated inference, significantly improving performance. The implementation correctly identifies ROCm and CUDA backends and generates appropriate CMake flags. However, there is a critical issue with the handling of the gpu_support variable that needs to be addressed to ensure the feature works as intended.
unsloth_zoo/llama_cpp.py (outdated)

```python
# Detect GPU backend for CMake
gpu_cmake_flags = ""
if gpu_support: # Accept both True and "ON"
```
Following up on the previous comment, since gpu_support is converted to a string ("ON" or "OFF") at line 382, this if gpu_support: check will always evaluate to True. This means the GPU detection logic will run even if the user intended to disable GPU support. Please ensure gpu_support remains a boolean until it's used in the cmake command.
```diff
- if gpu_support: # Accept both True and "ON"
+ if gpu_support:
```
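To make the failure mode concrete, here is a small standalone sketch (hypothetical function names, not the actual unsloth_zoo code) of why the truthy check misfires after the boolean-to-string conversion:

```python
def detection_runs(gpu_support):
    # install_llama_cpp() converts the boolean to "ON"/"OFF" first.
    gpu_support = "ON" if gpu_support else "OFF"
    # Buggy guard: "OFF" is a non-empty string, hence truthy.
    return bool(gpu_support)

def detection_runs_fixed(gpu_support):
    gpu_support = "ON" if gpu_support else "OFF"
    # Fixed guard: compare against the converted string explicitly.
    return gpu_support == "ON"

print(detection_runs(False))        # True  (bug: detection runs anyway)
print(detection_runs_fixed(False))  # False
```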
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fcc53bd30a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
unsloth_zoo/llama_cpp.py (outdated)

```python
# Detect GPU backend for CMake
gpu_cmake_flags = ""
if gpu_support: # Accept both True and "ON"
```
**Respect gpu_support=OFF before auto-enabling CUDA/HIP**

`gpu_support` is converted to the strings `"ON"`/`"OFF"` earlier in this function, so the new `if gpu_support:` guard is always truthy and runs backend auto-detection even when callers explicitly disable GPU support. In CUDA environments this path sets `gpu_cmake_flags` to `-DGGML_CUDA=ON`, which overrides an explicit OFF request and can make CPU-only builds fail on systems without a full CUDA toolchain; previously the CMake invocation preserved `-DGGML_CUDA=OFF` in this case.
danielhanchen left a comment
Code review -- found a bug in the gpu_support guard logic.
Bug: gpu_support is converted from bool to string ("ON"/"OFF") at line 427 before the new detection block. The check if gpu_support: on line ~430 evaluates "OFF" as truthy (non-empty string), so the detection block always runs regardless of the original gpu_support parameter.
When gpu_support=False is passed (e.g. from unsloth/save.py lines 1212-1218, where install_llama_cpp is called with gpu_support=False and the comment says "GGUF conversion doesn't need CUDA"):
- Original behavior: `-DGGML_CUDA=OFF` is set
- With this PR: the detection block runs, `torch.cuda.is_available()` returns True on GPU machines, and `-DGGML_CUDA=ON` is set instead of `-DGGML_CUDA=OFF`
- This means GGUF builds attempt CUDA compilation even when explicitly disabled

Fix: change `if gpu_support:` to `if gpu_support == "ON":` to match the already-converted string type.
The ROCm detection logic itself looks correct and well-structured. Just needs the guard fix.
Hi @danielhanchen, thanks for the review! Fixed the guard to `if gpu_support == "ON":` as suggested. Please take another look when you get a chance. 🙏
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e640bdfdb3
```python
needs_build = False

# Ensure ~/.unsloth/ exists before we try to use it
os.makedirs(UNSLOTH_HOME, exist_ok=True)
```
**Gate UNSLOTH_HOME creation behind default install path**

`install_llama_cpp()` now unconditionally executes `os.makedirs(UNSLOTH_HOME, exist_ok=True)` before doing anything else. This breaks callers that intentionally pass a different `llama_cpp_folder` in environments where `$HOME` is read-only (common in managed containers), because the function raises `PermissionError` even though the requested install directory is writable. Previously those callers could proceed without touching `~/.unsloth`.
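A minimal sketch of the suggested gating, using the assumed names from the diff (`UNSLOTH_HOME`, `install_llama_cpp`, `llama_cpp_folder`); only the default path touches the home directory:

```python
import os

UNSLOTH_HOME = os.path.expanduser("~/.unsloth")

def install_llama_cpp(llama_cpp_folder=None):
    if llama_cpp_folder is None:
        # Default install: create ~/.unsloth only when we actually use it.
        os.makedirs(UNSLOTH_HOME, exist_ok=True)
        llama_cpp_folder = os.path.join(UNSLOTH_HOME, "llama.cpp")
    # Custom installs never touch $HOME, so read-only home dirs still work.
    return llama_cpp_folder
```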
```python
    "-DBUILD_SHARED_LIBS=OFF",
    gpu_cmake_flags,
]
```
**Pass ROCm CMake defines as separate argv entries**

In the Windows CMake configure path, `gpu_cmake_flags` is appended as a single list element even when it contains multiple ROCm `-D...` options separated by spaces. Because `subprocess.run(..., shell=False)` does not split that string, CMake receives one malformed argument instead of distinct defines, so HIP compiler/architecture flags are not parsed correctly and ROCm configuration can fail.
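One possible fix, sketched here with `shlex.split` (an assumption, not necessarily the PR's eventual change): split the flag string so each define becomes its own argv entry when `shell=False`. The flag values below are illustrative.

```python
import shlex

# Example multi-flag string as built on the ROCm path (illustrative values).
gpu_cmake_flags = "-DGGML_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100"

# Broken: CMake receives one malformed argument containing a space.
args_bad = ["cmake", "-B", "build", gpu_cmake_flags]

# Fixed: shlex.split() yields one argv entry per define, honoring quoting.
args_good = ["cmake", "-B", "build", *shlex.split(gpu_cmake_flags)]

print(args_good[-2:])  # ['-DGGML_HIP=ON', '-DCMAKE_HIP_ARCHITECTURES=gfx1100']
```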
Force-pushed from e640bdf to 378a905.
Adds automatic ROCm/HIP detection in `install_llama_cpp()`:
- Detects ROCm via `torch.version.hip` on AMD systems
- Generates HIP-specific CMake flags with GPU arch detection (e.g. gfx1100)
- Falls back to CUDA flags on NVIDIA systems
- `gpu_support=False` (default) skips GPU backend entirely (no behavior change)
- Fixed guard: uses `if gpu_support == "ON":`, not a truthy string check

Tested on AMD Radeon PRO W7900 (gfx1100, ROCm 7.1): GPU inference at 312.4 t/s (5.9x faster than CPU-only).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2428481b11
```python
if gpu_support == "ON":
    try:
        import torch
        if hasattr(torch.version, 'hip') and torch.version.hip is not None and torch.cuda.is_available():
```
**Detect HIP backend without requiring a visible GPU**

The new ROCm branch only runs when `torch.cuda.is_available()` is true, so ROCm environments with a HIP build of PyTorch but no currently visible device (for example CI/headless containers or `HIP_VISIBLE_DEVICES`-restricted jobs) fall through to `-DGGML_CUDA=ON`. That makes CMake try CUDA instead of HIP and can fail the build even though the machine is configured for ROCm. Using `torch.version.hip` alone for backend selection (or at least not forcing CUDA in this case) avoids this regression.
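A sketch of selection logic that keys the backend off the PyTorch build rather than device visibility; the parameters mirror `torch.version.hip` and `torch.cuda.is_available()` so the decision is testable without a GPU. This is an illustrative restructuring, not the PR's diff:

```python
def select_backend(hip_version, cuda_available):
    if hip_version is not None:
        # HIP build of PyTorch: target ROCm even with no visible device
        # (headless CI, HIP_VISIBLE_DEVICES-restricted jobs).
        return "HIP"
    if cuda_available:
        return "CUDA"
    # No usable backend detected: build CPU-only rather than guessing.
    return "CPU"

print(select_backend("7.1.0", False))  # HIP, not CUDA
print(select_backend(None, True))      # CUDA
print(select_backend(None, False))     # CPU
```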
- Merge the print + ROCm detection into a single if/else block
- Remove redundant pre-assignment of `gpu_cmake_flags` outside the block
- ON path: detect ROCm (HIP flags) or fall back to CUDA=ON
- else path: `gpu_cmake_flags = "-DGGML_CUDA=OFF"`
- No logic change, cleaner control flow
**Verification + response to unslothai/unsloth#4103 closure**

Tested end-to-end on AMD MI355X (gfx950, ROCm 7.1): installed this branch directly and called …. All binaries (…) ….

Re @danielhanchen's close reason on #4103:
- On conversion: agreed, …
- On compile time: first-time HIP build takes ~7 min on MI355X (…)
- On scope: …
Enable ROCm GPU acceleration for llama.cpp GGUF export

**Summary**

Adds automatic ROCm/HIP detection in `install_llama_cpp()` so AMD GPU users get hardware-accelerated GGUF inference.

**Problem**

On AMD ROCm systems, llama.cpp was compiled without GPU support, resulting in CPU-only inference that is 5–9x slower than GPU-accelerated inference.

**Solution**

After `gpu_support = "ON"` is confirmed, detect the active GPU backend:
- ROCm: detects via `torch.version.hip`, queries `gcnArchName` for the target arch, sets `-DGGML_HIP=ON` with clang/clang++ compilers and `CMAKE_HIP_ARCHITECTURES`
- NVIDIA: falls back to `-DGGML_CUDA=ON` flags (no behavior change)
- `gpu_support=False` (default): entire block is skipped (no behavior change for existing users)

**Changes**

`unsloth_zoo/llama_cpp.py` (+33/-2):
- GPU backend detection after the `gpu_support == "ON"` confirmation
- Hardcoded `-DGGML_CUDA={gpu_support}` replaced with dynamic `gpu_cmake_flags` in both the Linux and Windows cmake paths
- Guard fixed to `if gpu_support == "ON":` (was `if gpu_support:`, which treats `"OFF"` as truthy)
**Testing**

Tested on AMD Radeon PRO W7900 (gfx1100, ROCm 7.1): GPU inference at 312.4 t/s, 5.9x faster than CPU-only.

**Notes**

The upstream issue (unslothai/unsloth#4103, requesting `gpu_support=True`) was closed by @danielhanchen pending investigation of compile time. This PR keeps the detection logic ready for when that decision is revisited.