Jacob-ML/inference-worker
Serverless llama.cpp inference worker for RunPod

This repository contains a serverless inference worker for running llama.cpp models on RunPod. It uses the llama-server image to provide an API for interacting with the models. The following OpenAI API endpoints are supported:

  • /v1/models
  • /v1/chat/completions
  • /v1/completions

Streaming responses are also supported.
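As a minimal sketch, a chat completion request can be built and sent with nothing but the standard library. The base URL and API key are placeholders here — substitute the values for your own RunPod endpoint:

```python
# Sketch of calling the worker's OpenAI-compatible chat endpoint.
# build_chat_request is a hypothetical helper, not part of this repo.
import json
import urllib.request

def build_chat_request(model, messages, stream=False):
    """Build the path and JSON payload for a /v1/chat/completions call."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return "/v1/chat/completions", payload

def send(base_url, api_key, path, payload):
    """POST the payload to the endpoint and return the decoded response."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the endpoints follow the OpenAI API shape, existing OpenAI client libraries should also work once pointed at the worker's base URL.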

Important: this project is still relatively new. If you run into any problems, please open an issue so we can help.

This is a fork of SvenBrnn's runpod-worker-ollama.

Setup

For the best performance, use cached models. See the cached models documentation for details; caching is highly recommended and saves significant resources.

Configuration

The worker can be configured via environment variables set in the RunPod hub configuration:

  • LLAMA_SERVER_CMD_ARGS: Command-line arguments (argv) passed to the llama-server binary. Example: -hf /path/to/model.gguf:Q4_K_M --ctx-size 4096. IMPORTANT: do not set the port argument here; the worker always uses port 3098.
  • MAX_CONCURRENCY: Maximum number of concurrent requests the worker can handle. Default is 8.
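To illustrate why the port argument must be left out, here is a sketch (not the worker's actual code) of how the argument string could be parsed while forcing the fixed port. Any user-supplied --port is dropped before port 3098 is appended:

```python
# Hypothetical sketch of composing the llama-server argv from the
# LLAMA_SERVER_CMD_ARGS environment variable; the real worker's
# implementation may differ.
import os
import shlex

PORT = 3098  # the worker always binds llama-server to this port

def build_server_argv(env=os.environ):
    """Split LLAMA_SERVER_CMD_ARGS shell-style, strip any --port
    the user set, and append the fixed port."""
    args = shlex.split(env.get("LLAMA_SERVER_CMD_ARGS", ""))
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:          # this is the value of a dropped --port
            skip_next = False
            continue
        if arg == "--port":    # drop flag and its following value
            skip_next = True
            continue
        if arg.startswith("--port="):  # drop the combined form too
            continue
        cleaned.append(arg)
    return ["llama-server", *cleaned, "--port", str(PORT)]
```

For example, setting LLAMA_SERVER_CMD_ARGS to "--ctx-size 4096 --port 8080" would still yield an argv ending in --port 3098; the conflicting 8080 is discarded.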

License

Please see the LICENSE file for more information.
