An Unreal plugin for llama.cpp to support embedding local LLMs in your projects.
This fork is a modern re-write of the upstream plugin to support the latest API, including: GPU layer offloading, advanced sampling (MinP, Mirostat, etc.), Jinja templates, chat history, partial rollback & context reset, regeneration, and more. It defaults to a Vulkan build on Windows for wider hardware support while retaining very similar performance to the CUDA backend (~3% difference) in both prompt processing and token generation speed (as tested on b7285 benchmarks).
- Download the latest release. Ensure you use the `Llama-Unreal-UEx.x-vx.x.x.7z` link, which contains compiled binaries, not the Source Code (zip) link.
- Create a new Unreal project or choose an existing one.
- Browse to your project folder (project root).
- Copy the `Plugins` folder from the .7z release into your project root.
- The plugin should now be ready to use.
- If your platform doesn't have a supported release, build llama.cpp for your platform (see below).
- Ensure the project is mixed type (has both C++ and Blueprints), then compile the project (the plugin gets compiled with it).
Everything is wrapped inside a `ULlamaComponent` or `ULlamaSubsystem`, which interfaces in a thread-safe manner with the llama.cpp code internally via `FLlamaNative`. All core functionality is available both in C++ and in Blueprint.
- In your component or subsystem, adjust your `ModelParams` of type `FLLMModelParams`. The most important settings are:
  - `PathToModel` - where your *.gguf is placed. If the path begins with a `.` it's considered relative to the `Saved/Models` path; otherwise it's an absolute path.
  - `SystemPrompt` - this will be auto-inserted on load by default.
  - `MaxContextLength` - this should match your model; the default is 4096.
  - `GPULayers` - how many layers to offload to the GPU. Specifying more layers than the model has works fine, e.g. use 99 if you want all of them offloaded for most practical model sizes. NB: an 8B model typically has about 33 layers. Loading more layers uses more VRAM; fitting the entire model inside your target GPU will greatly increase generation speed.
- Call `LoadModel`. Consider listening to the `OnModelLoaded` callback to handle post-loading operations.
- Call `InsertTemplatedPrompt` with your message and role (typically User), along with whether you want the prompt to generate a response. Optionally use `InsertRawPrompt` if you're doing raw-input style without chat formatting. Note that you can safely chain requests; they will queue up one after another and responses will return in order.
- You should receive replies via `OnResponseGenerated` when the full response has been generated. If you need streaming information, listen to `OnNewTokenGenerated` and optionally `OnPartialGenerated`, which provide token-level and sentence-level streams respectively.
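The `PathToModel` convention above (a leading `.` resolves against `Saved/Models`, anything else is absolute) can be sketched in plain C++. `ResolveModelPath` is a hypothetical helper written for illustration, not part of the plugin API:

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration of the documented path rule: a path starting
// with '.' is treated as relative to the project's Saved/Models directory;
// anything else is used verbatim as an absolute path.
std::string ResolveModelPath(const std::string& PathToModel,
                             const std::string& SavedModelsDir)
{
    if (!PathToModel.empty() && PathToModel[0] == '.')
    {
        // Strip the leading "." and any following "/", then prepend Saved/Models.
        std::string Relative = PathToModel.substr(1);
        if (!Relative.empty() && Relative[0] == '/')
        {
            Relative = Relative.substr(1);
        }
        return SavedModelsDir + "/" + Relative;
    }
    return PathToModel; // absolute path, used as-is
}
```

So `"./model.gguf"` would resolve under `Saved/Models`, while `"D:/models/model.gguf"` would be used unchanged.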
Explore `LlamaComponent.h` for the detailed API. If you need to modify sampling properties, you'll find them in `FLLMModelAdvancedParams`.
The plugin supports multimodal models - LLMs that can process images and/or audio alongside text - using the mtmd library bundled with llama.cpp.
Any vision or audio model available in GGUF format that ships a separate multimodal projector file (mmproj). Tested with:
- Qwen2.5-Omni (vision + audio)
Models and projectors are available on Hugging Face in their respective GGUF repositories.
You need two GGUF files per multimodal model:
| File | Purpose |
|---|---|
| `model.gguf` | The base language model - same as any text-only LLM |
| `mmproj-model-f16.gguf` (or similar) | The multimodal projector that encodes images/audio into token embeddings |
Place both in your Saved/Models folder (or any absolute path).
Set `MmprojPath` in `FLLMModelParams` before calling `LoadModel`. Paths beginning with `.` are relative to `Saved/Models`:
ModelParams.PathToModel = "./Qwen2.5-Omni-7B-Q4_K_M.gguf";
ModelParams.MmprojPath = "./mmproj-Qwen2.5-Omni-7B-Q8_0.gguf";
If MmprojPath is empty, multimodal is disabled and the model runs as text-only.
If building llama.cpp from source you must also build and include the mtmd target:
cmake --build . --config Release --target mtmd -j
Then copy alongside the other libs/dlls:
- `{build root}/tools/mtmd/Release/mtmd.lib` → `ThirdParty/LlamaCpp/Lib/Win64/`
- `{build root}/bin/Release/mtmd.dll` → `ThirdParty/LlamaCpp/Binaries/Win64/`

And the headers:

- `{llama.cpp root}/tools/mtmd/mtmd.h`
- `{llama.cpp root}/tools/mtmd/mtmd-helper.h`
- → `ThirdParty/LlamaCpp/Include/mtmd/`
Before making multimodal calls, verify the projector loaded and the model supports the desired modality:
IsMultimodalLoaded() // projector loaded successfully
SupportsVision() // model can process images
SupportsAudio() // model can process audio
GetAudioSampleRate() // expected PCM sample rate (typically 16000 Hz)
Calling a multimodal function without a loaded projector fires OnError with code 50 - no crash.
From a UTexture2D (e.g. a render target or imported asset, must be PF_B8G8R8A8 format):
InsertTemplateImagePrompt(MyTexture, "What is in this image?")
From a file path on disk (more efficient - avoids GPU readback):
InsertTemplateImagePromptFromFile("C:/Images/photo.jpg", "Describe this scene.")
Both functions accept Role, bAddAssistantBOS, and bGenerateReply parameters matching the text API.
Audio must be provided as mono float PCM at the model's expected sample rate (use GetAudioSampleRate() to check - typically 16 kHz). Use ULlamaAudioUtils to convert Unreal USoundWave assets:
// One-shot convenience: converts SoundWave → 16 kHz mono float PCM
TArray<float> PCM;
ULlamaAudioUtils::SoundWaveToLLMAudio(MySoundWave, PCM, GetAudioSampleRate());
// Then pass to the component
InsertTemplateAudioPrompt(PCM, "Transcribe this audio.")
ULlamaAudioUtils also exposes lower-level steps if you need finer control:
- `SoundWaveToPCMFloat` - raw PCM decode (returns source sample rate and channel count)
- `PCMFloatToMono` - stereo/multichannel → mono downmix
- `ResamplePCMFloat` - arbitrary sample rate conversion
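The two lower-level transforms can be illustrated with a standalone sketch (a hypothetical re-implementation for illustration, not the plugin's code): averaging interleaved channels down to mono, and naive linear-interpolation resampling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Downmix interleaved multichannel float PCM to mono by averaging channels.
std::vector<float> ToMono(const std::vector<float>& Interleaved, int NumChannels)
{
    std::vector<float> Mono(Interleaved.size() / NumChannels);
    for (size_t Frame = 0; Frame < Mono.size(); ++Frame)
    {
        float Sum = 0.f;
        for (int Ch = 0; Ch < NumChannels; ++Ch)
        {
            Sum += Interleaved[Frame * NumChannels + Ch];
        }
        Mono[Frame] = Sum / NumChannels;
    }
    return Mono;
}

// Naive linear-interpolation resampler from SrcRate to DstRate.
std::vector<float> Resample(const std::vector<float>& In, int SrcRate, int DstRate)
{
    if (In.empty() || SrcRate == DstRate) return In;
    const size_t OutLen =
        static_cast<size_t>(In.size() * static_cast<double>(DstRate) / SrcRate);
    std::vector<float> Out(OutLen);
    for (size_t i = 0; i < OutLen; ++i)
    {
        // Map each output sample back to a (fractional) source position.
        const double SrcPos = i * static_cast<double>(SrcRate) / DstRate;
        const size_t I0 = static_cast<size_t>(SrcPos);
        const size_t I1 = (I0 + 1 < In.size()) ? I0 + 1 : I0;
        const double Frac = SrcPos - I0;
        Out[i] = static_cast<float>(In[I0] * (1.0 - Frac) + In[I1] * Frac);
    }
    return Out;
}
```

For example, 48 kHz input resampled to 16 kHz produces one third as many samples, which is the kind of conversion needed before feeding 16 kHz audio models.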
Use InsertMultimodalPrompt with an FLlamaMultimodalPrompt struct to place multiple images or audio clips in a single message. Each <__media__> marker in the prompt text corresponds to one FLlamaMediaEntry in MediaEntries (matched in order):
FLlamaMultimodalPrompt P;
P.Prompt = "Image A: <__media__>\nImage B: <__media__>\nCompare these two images.";
P.MediaEntries = { EntryA, EntryB };
P.bGenerateReply = true;
InsertMultimodalPrompt(P);
If the prompt contains no <__media__> markers and MediaEntries has exactly one entry, the marker is auto-prepended.
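The marker-matching rule can be sketched standalone (a hypothetical helper mirroring the documented behavior, not the plugin's internals): count `<__media__>` markers, auto-prepend one when the prompt has none and there is exactly one media entry, and flag any other mismatch (which the plugin reports as error code 51).

```cpp
#include <cassert>
#include <string>

// Count occurrences of the <__media__> marker in a prompt.
int CountMediaMarkers(const std::string& Prompt)
{
    const std::string Marker = "<__media__>";
    int Count = 0;
    size_t Pos = 0;
    while ((Pos = Prompt.find(Marker, Pos)) != std::string::npos)
    {
        ++Count;
        Pos += Marker.size();
    }
    return Count;
}

// Mirror the documented rule: zero markers plus exactly one media entry
// auto-prepends a marker; otherwise the marker count must equal the media
// entry count. Returns false on a mismatch (reported as OnError code 51).
bool PrepareMultimodalPrompt(std::string& Prompt, int NumMediaEntries)
{
    int Markers = CountMediaMarkers(Prompt);
    if (Markers == 0 && NumMediaEntries == 1)
    {
        Prompt = "<__media__>" + Prompt;
        Markers = 1;
    }
    return Markers == NumMediaEntries;
}
```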
Build up context across multiple calls using bGenerateReply = false, then trigger generation on the final call - works the same as the text-only API:
// Insert image without generating
InsertTemplateImagePrompt(ImageA, "First image:", User, false, false)
InsertTemplateImagePrompt(ImageB, "Second image:", User, false, false)
// Generate on final text-only message
InsertTemplatedPrompt("Now compare those two images.", User)
Multimodal errors are delivered through the existing OnError delegate:
| Code | Condition |
|---|---|
| 50 | Multimodal projector not loaded (MmprojPath empty or init failed) |
| 51 | <__media__> marker count doesn't match MediaEntries count |
| 52 | Invalid bitmap (null texture, unsupported pixel format, failed file load) |
| 53 | mtmd_tokenize failed |
| 54 | mtmd_helper_eval_chunks failed (eval error during image/audio ingestion) |
| 55 | Vision not supported by the loaded mmproj |
| 56 | Audio not supported by the loaded mmproj |
- Context rollback: `RollbackContextHistoryByMessages` does not correctly account for the variable token count of multimodal messages (image token count depends on resolution). Use `ResetContextHistory` to clear context after multimodal sessions instead.
- Texture format: `InsertTemplateImagePrompt` only supports `PF_B8G8R8A8` textures. Use `InsertTemplateImagePromptFromFile` to load other formats directly via the mtmd file decoder.
- Audio sample rate: The caller is responsible for providing PCM at the model's expected rate. Use `GetAudioSampleRate()` and `ULlamaAudioUtils::ResamplePCMFloat` to convert if needed.
Whisper.cpp is embedded into the plugin, using the same ggml backend. It is exposed via `UWhisperComponent`, which wraps `FWhisperNative` (this can optionally be embedded in your own class instead). The basic API is the following:
- Add `UWhisperComponent` to your actor of choice. The model defined in `ModelParams` will load on startup; a `.` at the start of any path denotes relative to `Saved/Models`. Grab e.g. `ggml-small.en.bin` from https://huggingface.co/ggerganov/whisper.cpp/tree/main
- The model will load on begin play; disable `bAutoLoadModelOnStartup` on the component if you wish to load manually.
- Choose a VAD mode via `StreamParams.VADMode`:
  - Disabled - no VAD; audio buffers from `StartMicrophoneCapture` to `StopMicrophoneCapture` and is dispatched as one chunk. If audio exceeds `MaxSpeechSegmentSec` (default 15s) it is auto-chunked with `NonVADOverlapSec` overlap (default 0.5s); you may need to de-duplicate words at chunk boundaries manually.
  - Energy-Based (RMS) (default) - lightweight onset/offset detection using an RMS energy threshold. Configurable via `VADThreshold`, `VADHoldTimeSec`, and `VADPreRollSec`. Fast, zero extra model files, works best in quiet environments.
  - Silero Neural VAD - neural VAD using a ggml-converted Silero model. More robust in noisy environments. Requires a separate model file pointed to by `StreamParams.PathToVADModel` (default `./ggml-silero-v6.2.0.bin`). The model loads automatically after the whisper model loads. Silero-specific stream params:
    - `SileroThreshold` (default 0.5) - speech probability threshold per window. Lower values are more sensitive; raise to reduce false positives in noisy environments.
    - `SileroHoldTimeSec` (default 0.2s) - silence duration before speech offset. Shorter than the Energy-Based default (0.8s) because Silero's neural detection is more precise.

  Download Silero VAD models from:

  Place the file in your project's `Saved/Models/` folder and set the path with a leading `.` (e.g. `./ggml-silero-v6.2.0.bin`). Bind `OnVADModelLoaded` to react when the Silero model is ready.

  In all VAD modes, start the microphone with `StartMicrophoneCapture` and stop with `StopMicrophoneCapture`. Any in-progress speech at stop time is always flushed and dispatched.
- Listen to `OnTranscriptionResult` for transcriptions. Bind `OnVADStateChanged` for speech onset/offset events.
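The energy-based mode described above can be illustrated with a minimal standalone RMS gate. This is an illustrative sketch, not the plugin's actual implementation; `Threshold` and `HoldFrames` stand in for `VADThreshold` and `VADHoldTimeSec`:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal RMS-based speech onset/offset detector. Processes fixed-size
// frames of mono float PCM; speech starts when frame RMS crosses Threshold
// and ends after HoldFrames consecutive frames below it (the "hold time").
class EnergyVAD
{
public:
    EnergyVAD(float InThreshold, int InHoldFrames)
        : Threshold(InThreshold), HoldFrames(InHoldFrames) {}

    // Feed one frame; returns true while speech is considered active.
    bool ProcessFrame(const std::vector<float>& Frame)
    {
        double SumSq = 0.0;
        for (float Sample : Frame)
        {
            SumSq += static_cast<double>(Sample) * Sample;
        }
        const float Rms = Frame.empty()
            ? 0.f
            : static_cast<float>(std::sqrt(SumSq / Frame.size()));

        if (Rms >= Threshold)
        {
            bSpeechActive = true; // onset (or continued speech)
            QuietFrames = 0;
        }
        else if (bSpeechActive && ++QuietFrames >= HoldFrames)
        {
            bSpeechActive = false; // offset after sustained silence
        }
        return bSpeechActive;
    }

private:
    float Threshold;
    int HoldFrames;
    bool bSpeechActive = false;
    int QuietFrames = 0;
};
```

The hold-frame counter is what keeps short pauses between words from splitting one utterance into several segments, the same role `VADHoldTimeSec` plays in the component.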
If you're running inference in a high-spec game on the same fully loaded GPU that renders the game, expect about 1/3-1/2 of standalone performance due to resource contention; e.g. an 8B model running at ~90 TPS might drop to ~40 TPS in game. You may want to use a smaller model or apply pressure-easing strategies to maintain stable framerates.
To build custom backends or support platforms not currently supported, you can follow these build instructions. Note that they should be run from the cloned llama.cpp root directory, not the plugin root.

NB: for curl issues see ggml-org/llama.cpp#9937
- Clone llama.cpp
- Build using the commands given below, e.g. for Vulkan:
mkdir build
cd build/
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
Also, in newer builds consider
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=OFF -DLLAMA_CURL=OFF -DCMAKE_CXX_FLAGS_RELEASE="/Zi"
to work around CURL issues and generate .pdb files for debugging.
- Include: After build,
  - Copy `{llama.cpp root}/include`
  - Copy `{llama.cpp root}/ggml/include`
  - into `{plugin root}/ThirdParty/LlamaCpp/Include`
  - Copy `{llama.cpp root}/common/common.h` and `sampling.h`
  - into `{plugin root}/ThirdParty/LlamaCpp/Include/common`
  - (Multimodal) Copy `{llama.cpp root}/tools/mtmd/mtmd.h` and `mtmd-helper.h`
  - (Multimodal) into `{plugin root}/ThirdParty/LlamaCpp/Include/mtmd`
- Libs: Assuming `{llama.cpp root}/build` as `{build root}`:
  - Copy `{build root}/src/Release/llama.lib`
  - Copy `{build root}/common/Release/common.lib`
  - Copy `{build root}/ggml/src/Release/ggml.lib`, `ggml-base.lib`, & `ggml-cpu.lib`
  - Copy `{build root}/ggml/src/Release/ggml-vulkan/Release/ggml-vulkan.lib`
  - (Multimodal) Copy `{build root}/tools/mtmd/Release/mtmd.lib`
  - into `{plugin root}/ThirdParty/LlamaCpp/Lib/Win64`
- Dlls:
  - Copy `{build root}/bin/Release/ggml.dll`, `ggml-base.dll`, `ggml-cpu.dll`, `ggml-vulkan.dll`, & `llama.dll`
  - (Multimodal) Copy `{build root}/bin/Release/mtmd.dll`
  - into `{plugin root}/ThirdParty/LlamaCpp/Binaries/Win64`
- Build plugin
The current plugin's llama.cpp was built from git hash/tag `b8586` with the following build commands for Windows.

NB: use `-DGGML_NATIVE=OFF` to ensure wider portability.
mkdir build
cd build/
cmake .. -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
See https://github.com/ggml-org/llama.cpp/blob/b4762/docs/build.md#git-bash-mingw64. E.g. once the Vulkan SDK has been installed, run:
mkdir build
cd build/
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
At the moment, the CUDA 12.4 runtime is recommended.
- Ensure `bTryToUseCuda = true;` is set in `LlamaCore.build.cs` to add CUDA libs to the build (untested in the v0.9 update)
mkdir build
cd build
cmake .. -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
mkdir build
cd build/
cmake .. -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j --verbose
For Android build see: https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md#cross-compile-using-android-ndk
mkdir build-android
cd build-android
export NDK=<your_ndk_directory>
cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
make
Then the .so or .lib file was copied into e.g. the `ThirdParty/LlamaCpp/Win64/cpu` directory, and all the .h files were copied to the Includes directory.