Minimal Embedding Server (MES) - High-Performance Multi-Process Inference Framework

A high-performance embedding server built on a multi-process architecture, designed to remove the CPU tokenizer bottleneck and maximize GPU utilization.

Core features:

  • Flash Attention and FlashInfer backends for accelerated attention computation
  • Tensor parallelism
  • A multi-process architecture that sidesteps the Python GIL
  • A lightweight inference engine optimized for embedding workloads
  • Dynamic batch aggregation to maximize GPU throughput
  • Automatic model architecture detection, with support for the Qwen2 and Qwen3 model families

Supported model architectures:

  • Qwen2ForCausalLM: Qwen2 family models (e.g. Qwen/Qwen2-0.5B, Qwen/Qwen2-1.5B)
  • Qwen3ForCausalLM: Qwen3 family models (e.g. Qwen/Qwen3-Embedding-0.6B)

The server selects the matching model implementation automatically from the architectures field of the model configuration.
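One way such architecture-based selection could work is a small registry keyed by the architecture name. The sketch below is illustrative only: the placeholder classes, MODEL_REGISTRY, and resolve_model_class are hypothetical names, not taken from the MES source. It relies only on the fact that a Hugging Face config.json lists architectures as an array of strings.

```python
import json


# Hypothetical placeholder classes standing in for the real model implementations.
class Qwen2Model: ...
class Qwen3Model: ...


# Hypothetical registry: architecture name -> implementation class.
MODEL_REGISTRY = {
    "Qwen2ForCausalLM": Qwen2Model,
    "Qwen3ForCausalLM": Qwen3Model,
}


def resolve_model_class(config: dict):
    """Return the first registered implementation named in config["architectures"]."""
    for arch in config.get("architectures", []):
        if arch in MODEL_REGISTRY:
            return MODEL_REGISTRY[arch]
    raise ValueError(f"Unsupported architectures: {config.get('architectures')}")


# A config.json fragment as it might appear for a Qwen3 embedding model.
config = json.loads('{"architectures": ["Qwen3ForCausalLM"]}')
print(resolve_model_class(config).__name__)
```

Failing loudly on an unknown architecture, rather than falling back to a default, keeps an unsupported model from being served with the wrong implementation.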

Quick Start

Installation

# Clone the repository
git clone https://github.com/woodx9/minimal-embedding-server.git
cd minimal-embedding-server

pip install -e .

Note: installation automatically downloads and installs:

  • PyTorch 2.9.1 (CUDA 12.8)
  • SGL-Kernel 0.3.21
  • FlashInfer 0.6.2
  • Other dependencies

Usage

Option 1: Start from the command line (recommended)

# Serve a Qwen3 embedding model
mes --model "Qwen/Qwen3-Embedding-0.6B"

# Serve a Qwen2 model
mes --model "Qwen/Qwen2-0.5B"

# Set the port and the attention backend
mes --model "Qwen/Qwen3-Embedding-0.6B" --port 8000 --attn-backend flash_attn

# Use a different data type
mes --model "Qwen/Qwen2-1.5B" --dtype bfloat16

# Multi-GPU tensor-parallel inference
mes --model "Qwen/Qwen3-Embedding-0.6B" --tensor_parallel_size 2 --dtype auto

# Show all options
mes --help

Command-line arguments:

Argument                Default     Description
--model                 (required)  Model name or path
--host                  0.0.0.0     Address the server listens on
--port                  8000        Port the server listens on
--attn-backend          flash_attn  Attention backend (flash_attn/flash_infer)
--tensor_parallel_size  1           Tensor parallel size
--dtype                 auto        Model data type (auto/float32/float16/bfloat16)
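For readers who want to see how the flags and defaults above fit together, here is a minimal argparse sketch that mirrors the table. It is an assumption about how such a parser might look, not the actual mes entry point; build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="mes")
    parser.add_argument("--model", required=True, help="Model name or path")
    parser.add_argument("--host", default="0.0.0.0", help="Listen address")
    parser.add_argument("--port", type=int, default=8000, help="Listen port")
    parser.add_argument("--attn-backend", default="flash_attn",
                        choices=["flash_attn", "flash_infer"])
    parser.add_argument("--tensor_parallel_size", type=int, default=1)
    parser.add_argument("--dtype", default="auto",
                        choices=["auto", "float32", "float16", "bfloat16"])
    return parser


# argparse maps "--attn-backend" to the attribute name "attn_backend".
args = build_parser().parse_args(["--model", "Qwen/Qwen3-Embedding-0.6B", "--port", "9000"])
print(args.model, args.host, args.port, args.attn_backend, args.tensor_parallel_size, args.dtype)
```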

Testing the API

# Health check
curl http://localhost:8000/health

# Request embeddings
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-0.6B",
    "input": ["你好，世界！", "Hello, world!"]
  }'
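The same request can be issued from Python. The sketch below uses only the standard library and assumes the response has the {"data": [{"embedding": [...], "index": 0}, ...]} shape shown later in this README; the helper names are hypothetical. Only fetch_embeddings touches the network, so the demo at the bottom runs without a server.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts):
    """Build the POST request for /v1/embeddings without sending it."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(body):
    """Return the embedding vectors ordered by their "index" field."""
    data = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


def fetch_embeddings(base_url, model, texts, timeout=30.0):
    """Send the request to a running server and return the parsed vectors."""
    request = build_embeddings_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(response.read())


# Offline demo on a canned response body (values are made up for illustration).
sample = '{"data": [{"embedding": [0.1, 0.2], "index": 1}, {"embedding": [0.3, 0.4], "index": 0}]}'
print(parse_embeddings(sample))
```

With a server running locally, fetch_embeddings("http://localhost:8000", "Qwen/Qwen3-Embedding-0.6B", ["Hello, world!"]) would return one vector per input text.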

Performance

Stress test comparison (10 concurrent clients, 20 texts per batch, 1000 tokens per text)

Framework       QPS   Relative performance
vLLM            1.04  Baseline
This framework  1.10  5.8% faster

Test command:

python3 benchmark/stress_test.py \
    --concurrent-clients 10 \
    --batch-size 20 \
    --token-length 1000 \
    --base-url http://localhost:8000 \
    --model Qwen/Qwen3-Embedding-0.6B

The test and deployment scripts live in the benchmark/ directory:

  • stress_test.py - stress test script
  • vllm.sh - vLLM deployment script
  • compare_transformers.py - speed comparison against transformers

Why is it faster?

This framework is designed specifically for embedding workloads, which keeps it lean and efficient:

  • It drops the general-purpose LLM inference logic that vLLM carries (sampling, decoding, KV cache, and so on)
  • Its lightweight architecture is optimized for embedding tasks
  • Process isolation lets CPU tokenization and GPU inference run fully in parallel
  • Dynamic batch aggregation maximizes GPU throughput
  • Vectorized post-processing replaces many GPU synchronizations with a single one

Core Design Goals

Traditional single-process inference services often run into the following problems:

  • CPU utilization spikes to 400% (multi-threaded tokenization is constrained by the GIL)
  • GPU utilization drops (a blocked tokenizer starves the GPU)
  • Inference latency rises (the CPU and GPU cannot work in parallel)

The multi-process architecture addresses these problems by letting the CPU and GPU work fully in parallel.


Architecture

Overall architecture diagram

┌──────────────────────────────────────────────────────────────────┐
│                          FastAPI Server                          │
│                       (uvicorn + asyncio)                        │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                    Engine (main coordinator)                     │
│  - Creates MPQueue instances for inter-process communication     │
│  - Starts the Tokenizer Manager process                          │
│  - Starts the GPU Worker process                                 │
│  - Runs the Result Dispatcher thread                             │
└──────────────┬───────────────────────────────────┬───────────────┘
               │                                   │
               ▼                                   ▼
┌──────────────────────────────┐    ┌──────────────────────────────┐
│  Tokenizer Manager process   │    │  GPU Worker process          │
│  (CPU-bound)                 │    │  (GPU-bound)                 │
├──────────────────────────────┤    ├──────────────────────────────┤
│ • 10 tokenizer threads       │    │ • Model loaded onto the GPU  │
│ • 1 batch-prepare thread     │───▶│ • 1 inference thread         │
│ • All tokenization on CPU    │    │ • 4 callback threads         │
│ • Dynamic batch aggregation  │    │ • Vectorized post-processing │
│ • NumPy-serialized transfer  │    │ • Batched normalization      │
└──────────────────────────────┘    └──────────────────────────────┘

Three Core Optimizations

1. Process isolation: breaking free of the GIL

Problem: the Python GIL prevents multi-threaded tokenization from running truly in parallel, so CPU usage climbs while efficiency stays low.

Solution:

from multiprocessing import Process

# Tokenizer Manager runs in its own process
_prepare_process = Process(target=tokenizer_manager_main)

# GPU Worker runs in its own process
_inference_process = Process(target=gpu_worker_main)

Result:

  • Tokenization and GPU inference run in separate processes
  • The GIL no longer serializes the two stages
  • The CPU and GPU genuinely work in parallel
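The two-process layout can be sketched end to end with the standard library. The example below is a toy stand-in, not the MES implementation: whitespace splitting replaces the tokenizer, a token count replaces the embedding, and run_pipeline is a hypothetical helper. It also assumes a POSIX system, because it uses the "fork" start method to stay self-contained; a worker that initializes CUDA would typically need "spawn" instead.

```python
import multiprocessing as mp


def tokenizer_manager_main(raw_queue, ready_queue):
    """CPU stage: whitespace splitting stands in for a real tokenizer."""
    while True:
        item = raw_queue.get()
        if item is None:              # sentinel: pass shutdown downstream
            ready_queue.put(None)
            break
        request_id, text = item
        ready_queue.put((request_id, text.split()))


def gpu_worker_main(ready_queue, result_queue):
    """'GPU' stage: returns a token count in place of an embedding."""
    while True:
        item = ready_queue.get()
        if item is None:
            result_queue.put(None)
            break
        request_id, tokens = item
        result_queue.put((request_id, len(tokens)))


def run_pipeline(texts):
    """Push texts through both processes and collect results by request id."""
    ctx = mp.get_context("fork")      # assumption: POSIX; see the note above
    raw_queue, ready_queue, result_queue = ctx.Queue(), ctx.Queue(), ctx.Queue()
    workers = [
        ctx.Process(target=tokenizer_manager_main, args=(raw_queue, ready_queue)),
        ctx.Process(target=gpu_worker_main, args=(ready_queue, result_queue)),
    ]
    for proc in workers:
        proc.start()
    for request_id, text in enumerate(texts):
        raw_queue.put((request_id, text))
    raw_queue.put(None)
    results = {}
    # Drain the result queue before joining so no child blocks on a full pipe.
    while (item := result_queue.get()) is not None:
        request_id, value = item
        results[request_id] = value
    for proc in workers:
        proc.join()
    return results


print(run_pipeline(["hello world", "a b c"]))
```

Because each stage has its own interpreter, the tokenizer stage can keep a core busy while the worker stage waits on the accelerator.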

2. Dynamic batch aggregation: maximizing GPU throughput

Core strategy:

# Dynamic wait strategy: wait briefly when few batches are queued for the GPU, longer when it is busy
max_wait_rounds = 1 if ready_queue.qsize() < 3 else 10

while total_tokens < max_tokens_per_batch:
    # 1. Quickly drain every request already waiting in the queue
    while not tokenized_queue.empty():
        batch.append(tokenized_queue.get_nowait())

    # 2. Adjust the wait time to the GPU load:
    #    send quickly when the GPU is idle, aggregate aggressively when it is busy

Advantages:

  • GPU idle: small batches are sent immediately, which lowers latency
  • GPU busy: the batcher waits for more requests and merges them into larger batches
  • Token cap: max_tokens_per_batch = 120,000, to make full use of GPU memory
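The strategy above can be sketched as a single collection function. This is a simplified illustration under stated assumptions, not the MES code: each queue item is assumed to be a list of token IDs, collect_batch and wait_timeout are hypothetical names, and queue.Queue stands in for the cross-process queue.

```python
import queue

MAX_TOKENS_PER_BATCH = 120_000


def collect_batch(tokenized_queue, ready_queue, max_tokens=MAX_TOKENS_PER_BATCH,
                  wait_timeout=0.002):
    """Collect tokenized requests until the token cap or the wait budget is hit."""
    # Few batches queued for the GPU means it is close to idle: wait only one round.
    max_wait_rounds = 1 if ready_queue.qsize() < 3 else 10
    batch, total_tokens, wait_rounds = [], 0, 0
    while total_tokens < max_tokens and wait_rounds < max_wait_rounds:
        try:
            # Block briefly for the next request; a timeout uses up one wait round.
            item = tokenized_queue.get(timeout=wait_timeout)
        except queue.Empty:
            wait_rounds += 1
            continue
        batch.append(item)
        total_tokens += len(item)
    return batch, total_tokens


tokenized = queue.Queue()
for token_ids in ([1, 2, 3], [4], [5, 6]):
    tokenized.put(token_ids)
batch, total = collect_batch(tokenized, ready_queue=queue.Queue())
print(len(batch), total)
```

The cap is checked before each request is added, so a batch can overshoot max_tokens by at most one request. A stricter variant would peek at the next request's length first.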

3. Vectorized post-processing: eliminating GPU synchronization overhead

The problem with the traditional approach:

# Old code: one GPU synchronization per sequence, which is slow
for seq_len in seq_lengths:
    embedding = outputs[start:start+seq_len][-1]  # GPU op
    embedding = F.normalize(embedding, dim=0)     # GPU op (1-D vector, so normalize along dim 0)
    embeddings.append(embedding.cpu())            # GPU→CPU sync
    start += seq_len

The optimized, vectorized version:

# New code: a single GPU synchronization, 10x faster
# 1. Precompute every last-token index (done on the CPU)
last_token_indices = [idx + seq_len - 1 for idx, seq_len in ...]

# 2. Gather all embeddings at once (vectorized GPU op)
last_token_indices_tensor = torch.tensor(last_token_indices, device='cuda')
all_embeddings = outputs[last_token_indices_tensor]  # [N, hidden_dim]

# 3. Normalize the whole batch (vectorized GPU op)
all_embeddings = F.normalize(all_embeddings, p=2, dim=1)

# 4. Move to the CPU once (the only GPU synchronization)
all_embeddings_cpu = all_embeddings.cpu()
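The index arithmetic in step 1 is easy to check by hand. The sketch below assumes the sequences are packed back to back into one flat batch, so each sequence's last token sits one position before its cumulative end offset; the function name is hypothetical and the lengths are assumed to be positive.

```python
from itertools import accumulate


def last_token_indices(seq_lengths):
    """Index of each sequence's final token in a packed (concatenated) batch."""
    # Running totals give each sequence's end offset; the last token is one before it.
    return [end - 1 for end in accumulate(seq_lengths)]


# Lengths 3, 2, 4 start at offsets 0, 3, 5, so the last tokens are at 2, 4, 8.
print(last_token_indices([3, 2, 4]))  # [2, 4, 8]
```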

Data Flow in Detail

End-to-end request processing

1. User request
   POST /v1/embeddings {"input": ["text1", "text2"]}
                │
                ▼
2. Engine.v1_embeddings (main process)
   - Generates a UUID as the future_id
   - Stores it in _future_map: {uuid: (future, num_texts)}
   - Sends (texts, future_id) to raw_request_queue
                │
                ▼
3. Tokenizer Manager process
   ┌─────────────────────────────────────┐
   │ Tokenizer thread pool (10 threads)  │
   │  - Tokenizes requests in parallel   │
   │  - CPU-bound work, fully parallel   │
   │  → tokenized_queue                  │
   └──────────────┬──────────────────────┘
                  │
   ┌──────────────▼──────────────────────┐
   │ Batch-prepare thread (1 thread)     │
   │  - Aggregates multiple requests     │
   │  - Dynamic wait strategy            │
   │  - Precomputes last_token_indices   │
   │  - Converts to NumPy for IPC        │
   │  → ready_inference_queue            │
   └──────────────┬──────────────────────┘
                  │
4. GPU Worker process
   ┌──────────────▼──────────────────────┐
   │ Inference thread (1 thread)         │
   │  - numpy → tensor → GPU             │
   │  - Model inference                  │
   │  - Vectorized post-processing       │
   │    (single GPU sync)                │
   │  → callback_queue                   │
   └──────────────┬──────────────────────┘
                  │
   ┌──────────────▼──────────────────────┐
   │ Callback thread pool (4 threads)    │
   │  - Sends results asynchronously     │
   │  → result_queue (cross-process)     │
   └──────────────┬──────────────────────┘
                  │
5. Engine result dispatcher thread (main process)
   - Receives results from result_queue
   - Splits the embeddings correctly according to num_texts
   - Returns them to the matching request via future.set_result()
                │
                ▼
6. Response to the user
   {"data": [{"embedding": [...], "index": 0}, ...]}
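Step 5 splits a merged batch of embeddings back into per-request groups. How MES lays out the merged result is not shown here, so the sketch below assumes a flat list of embeddings in request order plus a list of per-request text counts; split_by_counts and to_response are hypothetical helpers.

```python
def split_by_counts(embeddings, counts):
    """Slice a flat list of embeddings into per-request chunks of the given sizes."""
    if sum(counts) != len(embeddings):
        raise ValueError("counts do not add up to the number of embeddings")
    chunks, start = [], 0
    for count in counts:
        chunks.append(embeddings[start:start + count])
        start += count
    return chunks


def to_response(embeddings):
    """Format one request's embeddings in the response shape shown in step 6."""
    return {"data": [{"embedding": e, "index": i} for i, e in enumerate(embeddings)]}


# Two requests merged into one batch: the first sent 2 texts, the second sent 1.
merged = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
per_request = split_by_counts(merged, counts=[2, 1])
print(to_response(per_request[0]))
```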

Key Technical Details

1. Mostly lock-free concurrency design

# UUIDs are unique, so generating IDs needs no lock
future_id = str(uuid.uuid4())

# In CPython, the GIL makes a single dict assignment atomic
self._future_map[future_id] = (future, len(input))

# Lock only around the compound operation (check + read + delete)
with self._future_lock:
    if future_id in self._future_map:
        future, num_texts = self._future_map[future_id]
        del self._future_map[future_id]
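The pattern above can be packaged as a small registry. The sketch below is an illustration under assumptions, not the MES Engine: FutureRegistry is a hypothetical class, and it uses concurrent.futures.Future because that type can be completed safely from another thread (an asyncio future would have to be completed through its event loop instead).

```python
import threading
import uuid
from concurrent.futures import Future


class FutureRegistry:
    """Maps request IDs to pending futures; locks only the compound lookup-and-remove."""

    def __init__(self):
        self._future_map = {}
        self._future_lock = threading.Lock()

    def register(self, num_texts):
        future_id = str(uuid.uuid4())                        # unique without coordination
        future = Future()
        self._future_map[future_id] = (future, num_texts)    # single dict store
        return future_id, future

    def resolve(self, future_id, embeddings):
        """Complete the future for future_id; return False if the ID is unknown."""
        with self._future_lock:
            entry = self._future_map.pop(future_id, None)
        if entry is None:
            return False
        future, num_texts = entry
        if len(embeddings) != num_texts:
            future.set_exception(ValueError("embedding count does not match request"))
        else:
            future.set_result(embeddings)
        return True


registry = FutureRegistry()
future_id, future = registry.register(num_texts=2)
# A dispatcher thread completes the future while the request handler waits on it.
dispatcher = threading.Thread(target=registry.resolve,
                              args=(future_id, [[0.1], [0.2]]))
dispatcher.start()
print(future.result(timeout=1.0))
dispatcher.join()
```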

2. Inter-process communication optimization

# MPQueue is assumed to be an alias for multiprocessing.Queue
from multiprocessing import Queue as MPQueue

# multiprocessing.Queue carries data between processes
raw_request_queue = MPQueue(maxsize=1000)
ready_inference_queue = MPQueue(maxsize=100)
result_queue = MPQueue(maxsize=1000)

# Tensors are converted to NumPy arrays so they serialize easily
merged_input_ids.numpy()  # in the Tokenizer Manager
torch.from_numpy(input_ids_np).to('cuda')  # in the GPU Worker

3. Dynamic batch aggregation algorithm

# Choose the wait strategy from the depth of the GPU-bound queue
if ready_queue.qsize() < 3:
    max_wait_rounds = 1  # GPU idle: send quickly
else:
    max_wait_rounds = 10  # GPU busy: aggregate aggressively

# Keep collecting until the token cap or the wait budget is reached
while total_tokens < 120000 and wait_rounds < max_wait_rounds:
    # Drain quickly, then wait with a timeout
    ...

License

MIT License


Acknowledgements

This project was designed to solve GPU utilization problems in real production environments and adopts a number of industry best practices.
