I tried "LLM Batch Inference" on default workspace setup (g6.2xlarge, L4 GPU) on multiple ray-llm releases.
2.44.1
(_MapWorker pid=7383) INFO 07-27 09:53:46 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='unsloth/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='unsloth/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
...
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) [vLLM] Elapsed time for batch 760b8d0bca854be0a45631b59d55f365 with size 16: 70.98764790899997
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) [vLLM] Elapsed time for batch b9ec679e91404b0292f932b40c93670d with size 16: 73.330533511
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) INFO 07-27 09:56:51 metrics.py:455] Avg prompt throughput: 3301.4 tokens/s, Avg generation throughput: 59.2 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 109 reqs, GPU KV cache usage: 94.7%, CPU KV cache usage: 0.0%.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) INFO 07-27 09:56:56 metrics.py:455] Avg prompt throughput: 611.9 tokens/s, Avg generation throughput: 298.3 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 109 reqs, GPU KV cache usage: 99.8%, CPU KV cache usage: 0.0%.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) INFO 07-27 09:57:01 metrics.py:455] Avg prompt throughput: 151.5 tokens/s, Avg generation throughput: 332.6 tokens/s, Running: 30 reqs, Swapped: 0 reqs, Pending: 111 reqs, GPU KV cache usage: 99.3%, CPU KV cache usage: 0.0%.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) [vLLM] Elapsed time for batch edcd4b19f00348a2a24c3a3212226ffe with size 16: 89.20994047800014
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) INFO 07-27 09:57:06 metrics.py:455] Avg prompt throughput: 1474.1 tokens/s, Avg generation throughput: 207.2 tokens/s, Running: 25 reqs, Swapped: 0 reqs, Pending: 112 reqs, GPU KV cache usage: 85.9%, CPU KV cache usage: 0.0%.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) [vLLM] Elapsed time for batch 2e1c1e82eff04b4593a6c33bd3c62587 with size 16: 92.64492836099998
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) INFO 07-27 09:57:12 metrics.py:455] Avg prompt throughput: 3249.9 tokens/s, Avg generation throughput: 60.5 tokens/s, Running: 40 reqs, Swapped: 0 reqs, Pending: 101 reqs, GPU KV cache usage: 98.7%, CPU KV cache usage: 0.0%.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) INFO 07-27 09:57:17 metrics.py:455] Avg prompt throughput: 401.4 tokens/s, Avg generation throughput: 389.7 tokens/s, Running: 37 reqs, Swapped: 0 reqs, Pending: 104 reqs, GPU KV cache usage: 98.2%, CPU KV cache usage: 0.0%.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) INFO 07-27 09:57:22 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 402.4 tokens/s, Running: 34 reqs, Swapped: 0 reqs, Pending: 107 reqs, GPU KV cache usage: 98.5%, CPU KV cache usage: 0.0%.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=7383) [vLLM] Elapsed time for batch 1a325fc4bc324e36815261c1231ba0c9 with size 16: 110.31428994299995
The elapsed time is measured per batch, and it keeps increasing. The throughput reported by vLLM is also unstable, ranging from ~60 to ~3300 tokens/s.
According to the "Ray Workload" dashboard, the processing rate is ~5 rows/s, while the template processes 1.44M rows. That translates to roughly 80 hours.
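The 80-hour estimate is just the dashboard numbers divided out:

```python
# Back-of-the-envelope ETA from the dashboard figures above.
total_rows = 1_440_000   # rows in the template's input dataset
rows_per_sec = 5         # observed processing rate on 2.44.1
eta_hours = total_rows / rows_per_sec / 3600
print(f"~{eta_hours:.0f} hours")  # ~80 hours
```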
2.47.1
Similar to 2.44.1, the elapsed time keeps growing:
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=198382) [vLLM] Elapsed time for batch ec7f34b375ae414aa11086497f272877 with size 16: 1559.1772196590027
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=198382) [vLLM] Elapsed time for batch ea703dccf3a84965b005ae6d689a3d38 with size 16: 1575.1775471759975
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=198382) [vLLM] Elapsed time for batch 63fb6941595e49d1b5f8feefdedb0c3d with size 16: 1578.9769548449985
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=198382) [vLLM] Elapsed time for batch 75f83aec4e6849c4a12b53e169ffde1c with size 16: 1582.1405324659936
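The growth trend is easy to confirm by pulling the per-batch elapsed times out of a captured driver log; a minimal sketch (the regex matches the "[vLLM] Elapsed time for batch ..." format shown above):

```python
import re

# Extract per-batch elapsed times from captured driver-log text and
# check whether they are monotonically increasing.
PATTERN = re.compile(r"\[vLLM\] Elapsed time for batch \w+ with size \d+: ([\d.]+)")

def elapsed_times(log_text: str) -> list[float]:
    return [float(m.group(1)) for m in PATTERN.finditer(log_text)]

sample = """
[vLLM] Elapsed time for batch ec7f34b375ae414aa11086497f272877 with size 16: 1559.1772196590027
[vLLM] Elapsed time for batch ea703dccf3a84965b005ae6d689a3d38 with size 16: 1575.1775471759975
"""
times = elapsed_times(sample)
print(times)                                            # [1559.1772196590027, 1575.1775471759975]
print(all(a <= b for a, b in zip(times, times[1:])))    # True
```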
2.48.0
I kept getting a Hugging Face rate-limit error (HTTP 429):
(_MapWorker pid=8700) INFO 07-27 21:31:40 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='unsloth/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='unsloth/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=unsloth/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
...
(_MapWorker pid=94274) (VllmWorker rank=0 pid=101836) ERROR 07-27 19:24:13 [multiproc_executor.py:487] hf_raise_for_status(r)
(_MapWorker pid=94274) (VllmWorker rank=0 pid=101836) ERROR 07-27 19:24:13 [multiproc_executor.py:487] File "/home/ray/anaconda3/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 482, in hf_raise_for_status
(_MapWorker pid=94274) (VllmWorker rank=0 pid=101836) ERROR 07-27 19:24:13 [multiproc_executor.py:487] raise _format(HfHubHTTPError, str(e), response) from e
(_MapWorker pid=94274) (VllmWorker rank=0 pid=101836) ERROR 07-27 19:24:13 [multiproc_executor.py:487] huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/models/unsloth/Llama-3.1-8B-Instruct
The exception didn't go away even after running huggingface-cli download unsloth/Llama-3.1-8B-Instruct. Restarting the workspace bypassed the limit.
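One workaround worth trying (assumption: the 429 comes from repeated model-metadata lookups against the Hub API, not from the weight download itself) is to force huggingface_hub offline mode after the model is cached, so vLLM resolves everything locally instead of hitting the Hub:

```shell
# Assumption: weights are already cached via
#   huggingface-cli download unsloth/Llama-3.1-8B-Instruct
# HF_HUB_OFFLINE=1 makes huggingface_hub skip all Hub API calls and
# resolve the model purely from the local cache.
export HF_HUB_OFFLINE=1
```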
(_MapWorker pid=5230) INFO 07-27 21:42:17 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='unsloth/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='unsloth/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=unsloth/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
...
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=5230) [vLLM] Elapsed time for batch 89eb4185cc504d6b94875ae272e84f88 with size 16: 238.47156920100008
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=5230) [vLLM] Elapsed time for batch 268e4d28afe74c17853b7d1b743258b7 with size 16: 241.686682776
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=5230) [vLLM] Elapsed time for batch 7a4c3d78cb0f472cb66cae386fc2431a with size 16: 258.86174575000007
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=5230) [vLLM] Elapsed time for batch 3b5f2f097b0b45c6a878a7ad3657e425 with size 16: 262.36196482
...
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=5230) [vLLM] Elapsed time for batch ed6b1491cfb8425da2d6f39d85cd9741 with size 16: 4563.3658349
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=5230) [vLLM] Elapsed time for batch bfcf3fd344d94d02b80fd1cbe88082ca with size 16: 4569.885854316
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=5230) [vLLM] Elapsed time for batch 6147ac80a0614fbbb15f2296d16948e8 with size 16: 4586.141410784
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=5230) [vLLM] Elapsed time for batch 0a509403d81b49da812567772f70af35 with size 16: 4590.619594891
The "Ray Workloads" dashboard shows ~1.8 rows/s throughput.
nightly
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) INFO 07-27 22:30:35 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 374.7 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 50.3%, Prefix cache hit rate: 21.8%
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) [vLLM] Elapsed time for batch 794e7d2d761243c3a9a57515954f95e6 with size 16: 20.362185237000176
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) INFO 07-27 22:30:38 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 188.2 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.9%, Prefix cache hit rate: 21.8%
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) [vLLM] Elapsed time for batch 9bc4df89ffa841c2be462e4f927ea90a with size 10: 13.92767948599976
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) INFO 07-27 22:30:39 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 83.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 21.8%
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) [vLLM] Elapsed time for batch e39ca435c1a948f6a4de6ee39117486d with size 16: 27.59387874899994
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) INFO 07-27 22:31:07 [loggers.py:118] Engine 000: Avg prompt throughput: 1788.3 tokens/s, Avg generation throughput: 226.1 tokens/s, Running: 36 reqs, Waiting: 101 reqs, GPU KV cache usage: 92.0%, Prefix cache hit rate: 20.1%
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) [vLLM] Elapsed time for batch 2b985ead54624f659721f874862c4562 with size 16: 29.679831919000208
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) INFO 07-27 22:31:09 [loggers.py:118] Engine 000: Avg prompt throughput: 3463.0 tokens/s, Avg generation throughput: 91.1 tokens/s, Running: 36 reqs, Waiting: 102 reqs, GPU KV cache usage: 93.3%, Prefix cache hit rate: 20.2%
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=20874) [vLLM] Elapsed time for batch 18300f436ec24ac2a4028d6d589ef04c with size 16: 43.51748637699984
The elapsed time doesn't grow as aggressively as on 2.47.1.
Overall
During the experiment, I occasionally hit an "Engine core failed to start" exception and noticed "pthread_kill" in the log, but the exception went away when I retried the cell.
Actions to take
- Update the template to launch with 2.48.0 instead of 2.44.1, which uses vLLM V1 with continuous batching enabled by default.
- Limit the input data size (e.g. 10k rows instead of 1.44M) so a run finishes within an hour.
- Tune the engine arguments to better utilize the GPU and get higher throughput.
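For the last item, a hedged sketch of the knobs I would try first, assuming the template uses Ray Data LLM's vLLMEngineProcessorConfig; the field names follow ray.data.llm around 2.48, and the values are guesses to be tuned against the L4's 24 GB (the logs above show GPU KV cache usage pinned near 100%, so the goal is more KV headroom):

```python
# Sketch only -- parameter names assume ray.data.llm as of Ray 2.48;
# all values are starting points, not validated settings.
from ray.data.llm import vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "max_model_len": 8192,           # 16384 leaves little KV-cache headroom on an L4
        "gpu_memory_utilization": 0.95,  # up from the 0.90 default; more room for KV cache
        "max_num_batched_tokens": 4096,  # chunked-prefill token budget per step
    },
    batch_size=64,
    concurrency=1,
)
```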