vLLM support & NVFP4 quantized models. Laptop GPUs ? #43

koushik03 · 2026-05-17T14:46:35Z

koushik03
May 17, 2026

Hi, this is an amazing tool to use, I have been going through so many documentations, blogs, resources and running manual tests etc.. for understanding what the best models for agentic coding and openclaw usage are, based on my Hardware constraints. I have an RTX 5090 Laptop (24GB VRAM), so I can make use of NVFP4 quantized models, which is a big advantage for me, but considering other formats, other engines and all the benchmarks out there, along with the degradation of scores because of the quantizations of model weights and KV cache is something I have been looking up so that I can get the best models for various use cases like opencode agentic coding via local models, but still have good accuracies, max possible context lengths and nice speed, and in some cases energy efficiency as well. Or having various serving or inference scripts based on needs. But, the major hurdle has been comparing so many different models, quants, use cases and real benchmarks, cause the actual local usage on consumer hardware is something that isn't benchmarked anywhere.

This is really cool, but does it support vLLM engine, NVFP4 quants and need specific model evaluations as well ? Is there a way that I can contribute to this so that my particular difficulties are not faced by the community when trying to use open-source models, based on their needs and hardware.

Andyyyy64 · 2026-05-17T17:11:35Z

Andyyyy64
May 17, 2026
Maintainer

Thanks for the detailed write-up. This is very much the direction I want whichllm to grow toward.

Short answer:

vLLM is not supported yet.
NVFP4 is not supported yet.
Task profiles exist today, but they are still fairly coarse.

whichllm run supports GGUF through llama-cpp-python, and non-GGUF models through Transformers. It does not currently generate or run vLLM serving commands, and the ranking engine does not yet model vLLM-specific behavior like paged attention, batching, KV cache choices, or engine-specific throughput.

For task profiles, there is already --profile coding, --profile vision, --profile math, etc., and coding does use some coding-relevant benchmark data. But it is not yet a full agentic-coding evaluation layer. It does not currently model things like opencode-style workloads, tool-call-heavy coding agents, long-context repo editing, or local consumer-GPU measurements per engine.

NVFP4 / MXFP4 support is also on the roadmap. I have an issue for that here: #27. That work should include parsing those quant types, adding reasonable memory/speed/quality assumptions, and making sure --quant NVFP4 can return useful candidates.

Contributions would be very welcome. The most useful things would be:

RTX 5090 Laptop measurements for real models, especially NVFP4 models.
vLLM command examples that work well on 24GB laptop GPUs.
Data for model, quant, engine, context length, KV cache settings, peak VRAM, tokens/sec, and whether the result was actually usable.
Suggestions for agentic coding evals that are practical to run locally.

I think this probably breaks down into a few focused pieces:

Add NVFP4 / MXFP4 quant support.
Add vLLM as a first-class runtime/backend in recommendations.
Improve coding-agent profiles beyond generic coding benchmarks.
Document reproducible local benchmark formats so users can contribute real hardware results.

If you are interested, your RTX 5090 Laptop data would be especially useful because consumer laptop GPUs are exactly the kind of hardware that is underrepresented in public benchmarks.

1 reply

koushik03 May 17, 2026
Author

I am interested to share that data and make some of the suggested contributions, although it might take some time, cause I am actually working on running around 15 models (most of them NVFP4, one fp8) on my hardware with those specific settings you mentioned. For maximum possible context length, I should be offloading KV cache and for maximum possible speed, I should keep everything on the GPU VRAM, so right now I am trying to create multiple scripts for inference via vllm for various use cases or scenarios.

Another thing I wish to ask is that can't we have the benchmarks running online rather than locally, so that we can just provide an API Endpoint and let online benchmarks run as well, so that all the resources are not used up on the device being tested and get accurate results, this can go in hand with some local and some custom benchmarks.

Once I finish setting up the model serving scripts and testing them on some benchmarks I will share the data, cause this tool is really very useful and could help a lot of people save quite a lot of time.

I am still learning and am relatively new to the field, so I still need to know quite a lot, help would be much appreciated in terms of figuring out benchmarking stuff (If I might have made a mistake with the comments above).

Is there a way to communicate with you more informally rather than here on the discussions page, so that I can follow up and share the data when I get it and also get some guidance.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vLLM support & NVFP4 quantized models. Laptop GPUs ? #43

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

vLLM support & NVFP4 quantized models. Laptop GPUs ? #43

Uh oh!

koushik03 May 17, 2026

Replies: 1 comment · 1 reply

Uh oh!

Andyyyy64 May 17, 2026 Maintainer

Uh oh!

koushik03 May 17, 2026 Author

koushik03
May 17, 2026

Replies: 1 comment 1 reply

Andyyyy64
May 17, 2026
Maintainer

koushik03 May 17, 2026
Author