GLM-5.2-W4AFP8-SGL-TP8: cap waiting queue at 16 by Evrard-Nil · Pull Request #120 · nearai/cvm-compose-files

Evrard-Nil · 2026-06-25T08:39:36Z

Sets a bounded admission queue on the GLM-5.2 W4AFP8 TP8 config.

Change

Add --max-queued-requests 16 to the shared sglang command block. Once 16 requests are waiting behind the running batch, sglang sheds new ones (429) rather than letting the queue grow unbounded.
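For reviewers who want to reason about the numbers, here is a minimal toy model of the admission behaviour described above. It is not sglang's scheduler; `ToyScheduler`, `submit`, and `simulate` are hypothetical names used only for illustration, and the only values taken from this PR are the 256 running slots and the queue cap of 16.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy model of bounded admission. NOT sglang's actual implementation."""
    max_running: int = 256   # mirrors --max-running-requests 256
    max_queued: int = 16     # mirrors --max-queued-requests 16
    running: int = 0
    waiting: deque = field(default_factory=deque)

    def submit(self, request_id: str) -> str:
        """Admit a request: run it, queue it, or shed it (HTTP 429 in the PR)."""
        if self.running < self.max_running:
            self.running += 1
            return "running"
        if len(self.waiting) < self.max_queued:
            self.waiting.append(request_id)
            return "queued"
        return "rejected_429"

    def finish_one(self) -> None:
        """One running request completes; promote the oldest waiter, if any."""
        if self.running == 0:
            return
        self.running -= 1
        if self.waiting:
            self.waiting.popleft()
            self.running += 1


def simulate(n_requests: int) -> list[str]:
    """Submit n_requests back to back with nothing finishing in between."""
    scheduler = ToyScheduler()
    return [scheduler.submit(f"req-{i}") for i in range(n_requests)]


if __name__ == "__main__":
    outcomes = simulate(300)
    for label in ("running", "queued", "rejected_429"):
        print(label, outcomes.count(label))
```

Under this model, a burst larger than 256 + 16 = 272 simultaneous requests has everything past the 272nd rejected until a running slot frees up.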

Context length

No change needed: the config already serves the full 1M context (--context-length 1048576). The header documents that W4AFP8's halved weight footprint is what frees the VRAM to run 1M at TP8.

--max-running-requests 256 is unchanged.

🤖 Generated with Claude Code

Add --max-queued-requests 16 so the sglang scheduler rejects (429) once 16 requests are waiting behind the running batch instead of letting the queue grow unbounded. Context length already serves the full 1M tokens (--context-length 1048576), so no change there.

github-actions · 2026-06-25T08:40:06Z

✅ OpenCodeReview: No comments generated. Looks good to me.

PierreLeGuen

Single-line config change adding --max-queued-requests 16 to the shared x-glm52-cmd sglang args block in prod/GLM-5.2-W4AFP8-SGL-TP8.yaml. Caps the admission queue so excess requests get rejected rather than silently buffered behind the 256 running slots; no other files touched and behaves as described.

prod/GLM-5.2-W4AFP8-SGL-TP8.yaml — flag folds cleanly into the folded-scalar command, inserted between --max-running-requests 256 and --cuda-graph-max-bs 128, no duplicate. Consistent with the same flag/value already used in prod/GLM-5.1-SGL-AWQ-TP4.yaml.

Local checks: docker compose config renders the command with --max-queued-requests 16; YAML parses cleanly; validate_otel_labels.rb, validate_proxy_dependencies.rb, and validate_proxy_environment.rb all passed.
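The "renders the flag, no duplicate" check could be scripted roughly as follows. This is a sketch under assumptions: `flag_values` is a hypothetical helper, and `rendered` is a hand-written excerpt based on the flag order described in this review, not the full command that `docker compose config` prints.

```python
import shlex


def flag_values(command: str, flag: str) -> list[str]:
    """Return the value following every occurrence of `flag` in a command string.

    Handles both `--flag value` and `--flag=value` spellings.
    """
    tokens = shlex.split(command)
    values = []
    for i, token in enumerate(tokens):
        if token == flag:
            # Value is the next token; empty string if the flag is last.
            values.append(tokens[i + 1] if i + 1 < len(tokens) else "")
        elif token.startswith(flag + "="):
            values.append(token.split("=", 1)[1])
    return values


# Illustrative excerpt only (assumption): the three adjacent flags named in the review.
rendered = "--max-running-requests 256 --max-queued-requests 16 --cuda-graph-max-bs 128"

if __name__ == "__main__":
    found = flag_values(rendered, "--max-queued-requests")
    # Exactly one occurrence, with the expected value.
    assert found == ["16"], found
    print("ok:", found)
```

A list with more than one entry would indicate a duplicated flag, and an empty list would mean the flag did not make it into the rendered command.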

Evrard-Nil requested review from PierreLeGuen and lloydmak99 June 25, 2026 08:39

PierreLeGuen approved these changes Jun 25, 2026

Evrard-Nil wants to merge 1 commit into main from feat/glm52-w4afp8-queue16
