GLM-5.2-W4AFP8-SGL-TP8: cap waiting queue at 16 #120

Open

Evrard-Nil wants to merge 1 commit into main from feat/glm52-w4afp8-queue16

@Evrard-Nil

Contributor

Sets a bounded admission queue on the GLM-5.2 W4AFP8 TP8 config.

Change

  • Add --max-queued-requests 16 to the shared sglang command block. Once 16 requests are waiting behind the running batch, sglang rejects new ones with a 429 instead of letting the queue grow unbounded.

Context length

  • No change needed: the config already serves the full 1M context (--context-length 1048576). The header documents that W4AFP8's halved weight footprint is what frees the VRAM to run 1M at TP8.

--max-running-requests 256 is unchanged.
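
The admission behaviour described under Change can be illustrated with a toy model. This is only a sketch, not sglang's scheduler: the class and function names are hypothetical, and it assumes every one of the 256 running slots can fill before anything queues, which a real server may not do when memory is the tighter limit.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy admission model (hypothetical; NOT sglang's implementation)."""

    max_running: int = 256  # mirrors --max-running-requests 256
    max_queued: int = 16    # mirrors --max-queued-requests 16
    running: set = field(default_factory=set)
    waiting: deque = field(default_factory=deque)

    def submit(self, request_id: int) -> str:
        """Classify a new request as 'running', 'queued', or 'rejected'."""
        if len(self.running) < self.max_running:
            self.running.add(request_id)
            return "running"
        if len(self.waiting) < self.max_queued:
            self.waiting.append(request_id)
            return "queued"
        # Queue is full: the request is shed (the 429 case in this PR).
        return "rejected"

    def finish(self, request_id: int) -> None:
        """Complete a running request and promote the oldest waiting one."""
        self.running.discard(request_id)
        if self.waiting and len(self.running) < self.max_running:
            self.running.add(self.waiting.popleft())


def simulate_burst(n: int) -> list[str]:
    """Submit n requests at once to a fresh scheduler and return their outcomes."""
    sched = ToyScheduler()
    return [sched.submit(i) for i in range(n)]
```

In this model a burst of 300 simultaneous requests fills the 256 running slots, queues the next 16, and rejects the remaining 28, so the backlog is bounded by the queue cap rather than by the burst size.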

🤖 Generated with Claude Code

Add --max-queued-requests 16 so the sglang scheduler rejects (429) once
16 requests are waiting behind the running batch instead of letting the
queue grow unbounded. Context length already serves the full 1M tokens
(--context-length 1048576), so no change there.

@github-actions

OpenCodeReview: No comments generated. Looks good to me.

@PierreLeGuen left a comment

Contributor

Single-line config change that adds --max-queued-requests 16 to the shared x-glm52-cmd sglang args block in prod/GLM-5.2-W4AFP8-SGL-TP8.yaml. It caps the admission queue so that excess requests are rejected rather than silently buffered behind the 256 running slots. No other files are touched, and the change behaves as described.

  • prod/GLM-5.2-W4AFP8-SGL-TP8.yaml: the flag fits cleanly into the folded-scalar command, inserted between --max-running-requests 256 and --cuda-graph-max-bs 128, with no duplicate. This is consistent with the same flag and value already used in prod/GLM-5.1-SGL-AWQ-TP4.yaml.

Local checks: docker compose config renders the command with --max-queued-requests 16; YAML parses cleanly; validate_otel_labels.rb, validate_proxy_dependencies.rb, and validate_proxy_environment.rb all passed.
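
The "no duplicate" check on the rendered command could be automated roughly as follows. This is a minimal sketch: the helper name is hypothetical, the command string contains only the flags named in this PR rather than the full rendered command, and it handles the space-separated `--flag value` form only, not `--flag=value`.

```python
import shlex


def flag_values(command: str, flag: str) -> list[str]:
    """Return every value that directly follows `flag` in a shell-style command."""
    tokens = shlex.split(command)
    return [tokens[i + 1] for i, tok in enumerate(tokens[:-1]) if tok == flag]


# Illustrative stand-in for the rendered command: only flags mentioned in this PR.
rendered = (
    "--context-length 1048576 "
    "--max-running-requests 256 "
    "--max-queued-requests 16 "
    "--cuda-graph-max-bs 128"
)

# Exactly one occurrence, with the expected value.
assert flag_values(rendered, "--max-queued-requests") == ["16"]
```

A duplicated flag would show up as a list with more than one entry, and a missing flag as an empty list, so the same helper covers both failure modes.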
