This repository contains a serverless inference worker for running llama.cpp models on RunPod. It uses the llama-server image to provide an API for interacting with the models.
The following OpenAI API endpoints are supported:
- `v1/models`
- `v1/chat/completions`
- `v1/completions`
Streaming responses are also supported.
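As a quick illustration of the OpenAI-compatible API above, the sketch below builds a streaming request body for `v1/chat/completions` and parses one server-sent-events line of the kind an OpenAI-style streaming response emits. The model name and prompt are placeholders, and the helper functions are illustrative, not part of this worker's code.

```python
import json

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-style request body for v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def parse_sse_line(line: str):
    """Extract the delta text from one streamed 'data: ...' SSE line.

    Returns None for non-data lines and for the terminal 'data: [DONE]'.
    """
    if not line.startswith("data: ") or line == "data: [DONE]":
        return None
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content")

body = build_chat_request("my-model", "Hello!", stream=True)
print(body["stream"])  # → True

sample = 'data: {"choices": [{"delta": {"content": "Hi"}}]}'
print(parse_sse_line(sample))  # → Hi
```

The same request body works for non-streaming calls by setting `stream=False`, in which case the response arrives as a single JSON object instead of SSE lines.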
Important! This project is still relatively new. If you encounter any problems, please open a new issue to get help.
This is a fork of SvenBrnn's runpod-worker-ollama.
To get the best performance out of this worker, it is recommended to use cached models. See the cached models documentation for details; caching is highly recommended and saves significant resources.
The worker can be configured via environment variables set in the RunPod hub configuration:
- `LLAMA_SERVER_CMD_ARGS`: Command-line arguments (argv) for the `llama-server` binary. Example: `-hf /path/to/model.gguf:Q4_K_M --ctx-size 4096`. IMPORTANT: Do not set the port argument here, as the worker will always use port `3098` automatically.
- `MAX_CONCURRENCY`: Maximum number of concurrent requests the worker can handle. Default is `8`.
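To make the port rule concrete, here is a hypothetical sketch of how a worker like this one could assemble the `llama-server` argv: the user-supplied `LLAMA_SERVER_CMD_ARGS` string is split into tokens and the fixed port `3098` is appended by the worker itself. The `build_server_argv` helper is illustrative only, not this project's actual startup code.

```python
import shlex

LLAMA_SERVER_PORT = 3098  # fixed port the worker always binds to

def build_server_argv(env: dict) -> list:
    """Compose an illustrative llama-server argv from worker env vars.

    User args come from LLAMA_SERVER_CMD_ARGS; the port flag is appended
    by the worker, which is why users must not set it themselves.
    """
    user_args = shlex.split(env.get("LLAMA_SERVER_CMD_ARGS", ""))
    return ["llama-server", *user_args, "--port", str(LLAMA_SERVER_PORT)]

argv = build_server_argv({
    "LLAMA_SERVER_CMD_ARGS": "-hf /path/to/model.gguf:Q4_K_M --ctx-size 4096"
})
print(argv[-2:])  # → ['--port', '3098']
```

Because the worker appends `--port 3098` last, a user-supplied port argument would conflict with it, which is why the configuration above forbids setting one.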
Please see the LICENSE file for more information.