Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding | FlashInfer
Many LLM inference tasks involves multiple independent text generation from a shared prefix (prompt), e.g. Self-Consistency, Tree of Thoughts and Skeleton-of-thought. Serving LLMs with common prefix could be memory and time-consuming, especially when common prefix is long and the number of requests is large: a possible use case is long document QA (Figure 1), multiple users interacts with ChatBot with the same document as prompt. While vLLM alleviate the memory issue by only storing one copy of the common prefix. However, it still suffers from the low-efficiency because the default PageAttention implementation do not optimize KV-Cache access to the shared prompt.
https://flashinfer.ai/2024/02/02/cascade-inference.html
Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding | FlashInfer
Many LLM inference tasks involves multiple independent text generation from a shared prefix (prompt), e.g. Self-Consistency, Tree of Thoughts and Skeleton-of-thought. Serving LLMs with common prefix could be memory and time-consuming, especially when common prefix is long and the number of requests is large: a possible use case is long document QA (Figure 1), multiple users interacts with ChatBot with the same document as prompt. While vLLM alleviate the memory issue by only storing one copy of the common prefix. However, it still suffers from the low-efficiency because the default PageAttention implementation do not optimize KV-Cache access to the shared prompt.
https://flashinfer.ai/2024/02/02/cascade-inference.html