
Enable ROCm GPU acceleration for llama.cpp GGUF export#512

Open
GoldenGrapeGentleman wants to merge 2 commits into unslothai:main from GoldenGrapeGentleman:enable-rocm-gpu-support

Conversation

@GoldenGrapeGentleman (Contributor) commented Feb 24, 2026

Enable ROCm GPU acceleration for llama.cpp GGUF export

Summary

Adds automatic ROCm/HIP detection in install_llama_cpp() so AMD GPU users get hardware-accelerated GGUF inference.

Problem

On AMD ROCm systems, llama.cpp was compiled without GPU support, resulting in CPU-only inference that is 5–9x slower than GPU-accelerated inference.

Solution

After gpu_support = "ON" is confirmed, detect the active GPU backend:

  • ROCm systems: reads torch.version.hip, queries gcnArchName for the target arch, sets -DGGML_HIP=ON with clang/clang++ compilers and CMAKE_HIP_ARCHITECTURES
  • CUDA systems: falls through to existing -DGGML_CUDA=ON flags (no behavior change)
  • gpu_support=False (default): entire block is skipped (no behavior change for existing users)
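The detection described above can be sketched as follows. This is an illustrative sketch, not the exact unsloth_zoo code: the function and parameter names are hypothetical, and in the real implementation `hip_version`, `gpu_visible`, and `arch` come from `torch.version.hip`, `torch.cuda.is_available()`, and `torch.cuda.get_device_properties(0).gcnArchName`.

```python
import os

def gpu_cmake_flags(hip_version, gpu_visible, arch="gfx1100", rocm_path=None):
    """Illustrative sketch of the ROCm/CUDA backend detection.

    hip_version -- torch.version.hip (None on CUDA builds of PyTorch)
    gpu_visible -- torch.cuda.is_available()
    arch        -- target arch, e.g. from gcnArchName ("gfx1100")
    """
    if hip_version is not None and gpu_visible:
        # ROCm system: build with HIP, clang/clang++, and the detected arch
        root = rocm_path or os.environ.get("ROCM_PATH", "/opt/rocm")
        return (
            f"-DGGML_HIP=ON "
            f"-DCMAKE_C_COMPILER={root}/llvm/bin/clang "
            f"-DCMAKE_CXX_COMPILER={root}/llvm/bin/clang++ "
            f"-DCMAKE_HIP_ARCHITECTURES={arch}"
        )
    # NVIDIA (or no HIP build): fall through to the existing CUDA flag
    return "-DGGML_CUDA=ON"
```

On a CUDA system the function returns the unchanged `-DGGML_CUDA=ON`, matching the "no behavior change" bullet above.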

Changes

unsloth_zoo/llama_cpp.py (+33/-2):

  • ROCm detection block inserted after gpu_support == "ON" confirmation
  • Hardcoded -DGGML_CUDA={gpu_support} replaced with dynamic gpu_cmake_flags in both Linux and Windows cmake paths
  • Fixed guard bug from review: if gpu_support == "ON": (was if gpu_support: which treats "OFF" as truthy)

Testing

Tested on AMD Radeon PRO W7900 (gfx1100, ROCm 7.1):

  • GPU inference: 312.4 t/s (5.9x faster than CPU-only)
  • ROCm libraries correctly linked via HIP flags
  • CUDA systems: no regression (guard prevents ROCm block from running)

Notes

  • Rebased cleanly onto current upstream main (1 commit ahead, 0 behind)
  • Resolves conflict with upstream Windows cmake refactor
  • Companion PR Enable GPU support for llama.cpp GGUF export unsloth#4103 (save.py gpu_support=True) was closed by @danielhanchen pending investigation of compile time. This PR keeps the detection logic ready for when that decision is revisited.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @GoldenGrapeGentleman, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances llama.cpp GGUF export functionality by integrating automatic ROCm GPU detection and compilation. This change allows AMD GPU users to leverage hardware acceleration, drastically improving inference speeds for models converted to GGUF format, addressing the previous limitation of CPU-only processing on ROCm systems.

Highlights

  • ROCm GPU Acceleration: Implemented automatic ROCm GPU detection and compilation support for llama.cpp during GGUF export, enabling hardware-accelerated inference for AMD GPU users.
  • Dynamic CMake Flags: The install_llama_cpp() function now dynamically generates appropriate CMake compilation flags, configuring HIP compilation for ROCm systems with GPU architecture detection, while maintaining existing CUDA flags for NVIDIA systems.
  • ROCm Path Customization: Added support for custom ROCm installations by respecting the ROCM_PATH environment variable.
  • GPU Availability Validation: Included validation for GPU availability before attempting to access device properties, ensuring graceful fallback.


Changelog
  • unsloth_zoo/llama_cpp.py
    • Auto-detected ROCm vs CUDA GPU backends.
    • Generated HIP-specific CMake flags for ROCm.
    • Supported custom ROCm installations via ROCM_PATH environment variable.
    • Validated GPU availability before accessing device properties.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

The pull request introduces automatic ROCm GPU detection and compilation support for llama.cpp during GGUF export, which is a valuable addition for AMD GPU users. The changes aim to enable hardware-accelerated inference, significantly improving performance. The implementation correctly identifies ROCm and CUDA backends and generates appropriate CMake flags. However, there is a critical issue with the handling of the gpu_support variable that needs to be addressed to ensure the feature works as intended.


# Detect GPU backend for CMake
gpu_cmake_flags = ""
if gpu_support: # Accept both True and "ON"

critical

Following up on the previous comment, since gpu_support is converted to a string ("ON" or "OFF") at line 382, this if gpu_support: check will always evaluate to True. This means the GPU detection logic will run even if the user intended to disable GPU support. Please ensure gpu_support remains a boolean until it's used in the cmake command.

Suggested change:
- if gpu_support: # Accept both True and "ON"
+ if gpu_support:

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fcc53bd30a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


# Detect GPU backend for CMake
gpu_cmake_flags = ""
if gpu_support: # Accept both True and "ON"

P1: Respect gpu_support=OFF before auto-enabling CUDA/HIP

gpu_support is converted to the strings "ON"/"OFF" earlier in this function, so the new if gpu_support: guard is always truthy and runs backend auto-detection even when callers explicitly disable GPU support. In CUDA environments this path sets gpu_cmake_flags to -DGGML_CUDA=ON, which overrides an explicit OFF request and can make CPU-only builds fail on systems without a full CUDA toolchain; previously the CMake invocation preserved -DGGML_CUDA=OFF in this case.


@danielhanchen (Contributor) left a comment

Code review -- found a bug in the gpu_support guard logic.

Bug: gpu_support is converted from bool to string ("ON"/"OFF") at line 427 before the new detection block. The check if gpu_support: on line ~430 evaluates "OFF" as truthy (non-empty string), so the detection block always runs regardless of the original gpu_support parameter.

When gpu_support=False is passed (e.g. from unsloth/save.py lines 1212-1218 where install_llama_cpp is called with gpu_support=False and the comment says "GGUF conversion doesn't need CUDA"):

  • Original behavior: -DGGML_CUDA=OFF is set
  • With this PR: detection block runs, torch.cuda.is_available() returns True on GPU machines, sets -DGGML_CUDA=ON instead of -DGGML_CUDA=OFF
  • This means GGUF builds attempt CUDA compilation even when explicitly disabled

Fix: Change if gpu_support: to if gpu_support == "ON": to match the already-converted string type.

The ROCm detection logic itself looks correct and well-structured. Just needs the guard fix.

@GoldenGrapeGentleman (Contributor, Author)

Hi @danielhanchen, thanks for the review!

Fixed the gpu_support guard bug — changed if gpu_support: to if gpu_support == "ON": so it correctly respects gpu_support=False (which becomes "OFF" string earlier in the function). Previously "OFF" was truthy and the GPU detection block always ran, potentially overriding an explicit OFF request.

Please take another look when you get a chance. 🙏

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e640bdfdb3


needs_build = False

# Ensure ~/.unsloth/ exists before we try to use it
os.makedirs(UNSLOTH_HOME, exist_ok=True)

P2: Gate UNSLOTH_HOME creation behind default install path

install_llama_cpp() now unconditionally executes os.makedirs(UNSLOTH_HOME, exist_ok=True) before doing anything else. This breaks callers that intentionally pass a different llama_cpp_folder in environments where $HOME is read-only (common in managed containers), because the function raises PermissionError even though the requested install directory is writable. Previously those callers could proceed without touching ~/.unsloth.
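A minimal way to gate that directory creation, sketched here with hypothetical names (`resolve_install_dir` is illustrative; `llama_cpp_folder` and the `~/.unsloth` home mirror the ones discussed above):

```python
import os

def resolve_install_dir(llama_cpp_folder, unsloth_home):
    """Only create the default home when no explicit folder was requested.

    Callers that pass their own llama_cpp_folder (e.g. containers where
    $HOME is read-only) never touch unsloth_home, so no PermissionError.
    """
    if llama_cpp_folder is None:
        os.makedirs(unsloth_home, exist_ok=True)
        return os.path.join(unsloth_home, "llama.cpp")
    return llama_cpp_folder
```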


Comment on lines +741 to +743
"-DBUILD_SHARED_LIBS=OFF",
gpu_cmake_flags,
]

P2: Pass ROCm CMake defines as separate argv entries

In the Windows CMake configure path, gpu_cmake_flags is appended as a single list element even when it contains multiple ROCm -D... options separated by spaces. Because subprocess.run(..., shell=False) does not split that string, CMake receives one malformed argument instead of distinct defines, so HIP compiler/architecture flags are not parsed correctly and ROCm configuration can fail.
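One standard-library way to address this (a sketch, not necessarily the PR's eventual fix) is to split the flag string into separate argv entries before handing the list to subprocess.run:

```python
import shlex

# Space-separated defines as produced by the ROCm detection (example values)
gpu_cmake_flags = "-DGGML_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100"

cmd = [
    "cmake", "-B", "build",
    "-DBUILD_SHARED_LIBS=OFF",
    # shlex.split yields one argv entry per -D define, so CMake sees
    # distinct arguments instead of one malformed space-joined string
    *shlex.split(gpu_cmake_flags),
]
```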


Adds automatic ROCm/HIP detection in install_llama_cpp():
- Detects ROCm via torch.version.hip on AMD systems
- Generates HIP-specific CMake flags with GPU arch detection (e.g. gfx1100)
- Falls back to CUDA flags on NVIDIA systems
- gpu_support=False (default) skips GPU backend entirely (no behavior change)
- Fixed guard: uses 'if gpu_support == "ON":' not truthy string check

Tested on AMD Radeon PRO W7900 (gfx1100, ROCm 7.1):
  GPU inference: 312.4 t/s (5.9x faster than CPU-only)
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2428481b11


if gpu_support == "ON":
    try:
        import torch
        if hasattr(torch.version, 'hip') and torch.version.hip is not None and torch.cuda.is_available():

P2: Detect HIP backend without requiring visible GPU

The new ROCm branch only runs when torch.cuda.is_available() is true, so ROCm environments with a HIP build of PyTorch but no currently visible device (for example CI/headless containers or HIP_VISIBLE_DEVICES-restricted jobs) fall through to -DGGML_CUDA=ON. That makes CMake try CUDA instead of HIP and can fail the build even though the machine is configured for ROCm. Using torch.version.hip alone for backend selection (or at least not forcing CUDA in this case) avoids this regression.
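A sketch of the suggested selection logic (names are illustrative; in practice the inputs would be `torch.version.hip` and `torch.cuda.is_available()`):

```python
def select_gpu_backend(hip_version, gpu_visible):
    """Pick HIP vs CUDA per the suggestion above.

    A HIP build of PyTorch selects ROCm even when no device is currently
    visible (headless CI, HIP_VISIBLE_DEVICES-restricted jobs); device
    visibility is only needed later for arch detection, not for choosing
    the backend, so gpu_visible is deliberately not consulted here.
    """
    if hip_version is not None:
        return "hip"
    return "cuda"
```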


- Merge the print + ROCm detection into a single if/else block
- Remove redundant pre-assignment of gpu_cmake_flags outside the block
- ON path: detect ROCm (HIP flags) or fall back to CUDA=ON
- else path: gpu_cmake_flags = -DGGML_CUDA=OFF
- No logic change, cleaner control flow
@GoldenGrapeGentleman (Contributor, Author)

Verification + response to unslothai/unsloth#4103 closure

Tested end-to-end on AMD MI355X (gfx950, ROCm 7.1) — installed this branch directly and called install_llama_cpp(gpu_support=True):

Unsloth: Detected ROCm GPU (gfx950) -- building with HIP support
Unsloth: Successfully installed llama.cpp!

All binaries (llama-quantize, llama-cli, llama-server) link libamdhip64.so.7 and libhipblas.so.3 correctly. ✅


Re @danielhanchen's close reason on #4103:

"gguf conversion is only CPU based, since GPU based compilation might take way too long"

On conversion: agreed — llama-quantize runs on CPU regardless. The HIP build benefits llama-server/llama-cli inference (~6x speedup on W7900).

On compile time: first-time HIP build takes ~7 min on MI355X (cmake + make -j8). It is a one-time cost — the needs_build guard skips recompilation on subsequent calls.

On scope: gpu_support=False default is unchanged. This PR only fixes the path for users who explicitly pass gpu_support=True on AMD systems — previously they got a broken CUDA-only build.

cc @danielhanchen
