2024/02/02/cascade-inference

# Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding | FlashInfer

Many LLM inference tasks involve multiple independent text generations from a shared prefix (prompt), e.g. Self-Consistency, Tree of Thoughts, and Skeleton-of-Thought. Serving LLMs with a common prefix can be memory- and time-consuming, especially when the common prefix is long and the number of requests is large. A possible use case is long-document QA (Figure 1), where multiple users interact with a chatbot using the same document as the prompt. vLLM alleviates the memory issue by storing only one copy of the common prefix. However, it still suffers from low efficiency because the default PagedAttention implementation does not optimize KV-Cache access to the shared prompt.
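To make the bandwidth concern concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the post: the model shape, batch size, and sequence lengths are hypothetical, and the "shared" case assumes an idealized kernel that reads the prefix KV-Cache from GPU memory only once per decode step for the whole batch, ignoring hardware caches and other effects.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV-Cache stored per token (keys + values across all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def decode_step_kv_bytes(batch_size, prefix_len, suffix_len,
                         bytes_per_token, read_prefix_once):
    """Estimated KV-Cache bytes read from memory in one decode step.

    read_prefix_once=False: every request re-reads the shared prefix.
    read_prefix_once=True:  idealized assumption that the prefix is read
                            a single time and reused by all requests.
    """
    prefix_reads = 1 if read_prefix_once else batch_size
    prefix_bytes = prefix_reads * prefix_len * bytes_per_token
    # Each request always reads its own unique suffix.
    suffix_bytes = batch_size * suffix_len * bytes_per_token
    return prefix_bytes + suffix_bytes


# Hypothetical configuration: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Hypothetical workload: 64 requests sharing an 8192-token document,
# each with 128 tokens of its own.
naive = decode_step_kv_bytes(64, 8192, 128, per_token, read_prefix_once=False)
shared = decode_step_kv_bytes(64, 8192, 128, per_token, read_prefix_once=True)

print(f"per-request prefix reads: {naive / 2**30:.1f} GiB per step")
print(f"single prefix read:       {shared / 2**30:.1f} GiB per step")
print(f"traffic ratio:            {naive / shared:.1f}x")
```

Under these assumptions, KV-Cache traffic is dominated by repeated reads of the shared prefix, which is why storing a single copy addresses memory capacity but not memory bandwidth.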

[https://flashinfer.ai/2024/02/02/cascade-inference.html](https://flashinfer.ai/2024/02/02/cascade-inference.html)

2024/02/02/cascade-inference #9