Auto integration #286

Closed
easel wants to merge 386 commits into Luce-Org:main from easel:auto-integration


@easel easel commented May 27, 2026

Collaborator

No description provided.

Hermes PR Integrator and others added 30 commits May 27, 2026 09:34
Update the integration manifest after merging the latest PR Luce-Org#274 head (adaptive anchor radius and PFLASH_COMPRESS env rename). Record a fresh PR Luce-Org#266 worktree conflict attempt and current blocked classifications.
Merge latest origin/main (d947c70) into the integration stack and record the current PR classification. PR Luce-Org#266 was attempted again in an isolated worktree and remains blocked pending selective harness/server porting; Codex/Claude delegated resolution is unavailable due to auth.
…nch in-tree

Collapses 134 commits of `integration/props-uv-squared-clean` onto current
main as one reviewable change. Most of the underlying server-side work
already landed via separate PRs: thinking-budget v2 + multi-dialect
reasoning + sidecars (Luce-Org#269), dflash→server rename + optimizations/ grouping
(Luce-Org#281), qwen35moe hybrid CPU/CUDA expert split (Luce-Org#262), and a stream of
smaller fixes from bragi over the last week. What remained in integration
is everything *above* the server: the host-side runner, the container
image, the benchmark/profile evidence pipeline, the harness for driving
real clients, and the luce-bench framework itself.

## What changed

### Docker + host wrapper
- Dockerfile (CUDA 12.8 base; copies server/, lucebox/, harness/, luce-bench/
  into one image; wires `python -m lucebench.cli` as the `benchmark`
  entrypoint subcommand).
- `lucebox.sh` (~470 lines of host bash, zero deps beyond docker + nvidia-smi):
  `check`, `configure`, `pull`, `download-models`, `serve`, `install`/`start`/
  `status`/`logs` (user-systemd), `print-run`, `benchmark`, `profile`.
- `.github/workflows/docker.yml` builds + pushes `ghcr.io/luce-org/lucebox-hub`
  tags (`:cuda12`, `:vX.Y.Z-cuda12`, `:X.Y-cuda12`, `:sha-<short>-cuda12`).
- `server/scripts/entrypoint.sh` resolves draft GGUF by target architecture
  (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6); warns when multiple
  targets are present in models/.
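
A minimal sketch of the draft-resolution rule described for `entrypoint.sh`, written in Python for illustration only (the real script is shell). The `pick_draft` helper, the `gemma-drafter` label, and the "use the first target" tie-break are assumptions; only the two architecture pairings come from the text above.

```python
# Hypothetical mapping from target architecture to draft model label.
# Only the two pairings named in the commit message are listed.
DRAFT_FOR_ARCH = {
    "gemma4": "gemma-drafter",       # placeholder label for "gemma drafter"
    "qwen3.6": "dflash-draft-3.6",
}


def pick_draft(target_archs):
    """Return (draft_label, warnings) for the detected target architectures."""
    warnings = []
    if not target_archs:
        return None, ["no target model found"]
    if len(target_archs) > 1:
        # Assumption: warn and fall back to the first target found.
        warnings.append(
            "multiple targets present (" + ", ".join(target_archs)
            + "); using " + target_archs[0]
        )
    draft = DRAFT_FOR_ARCH.get(target_archs[0])
    if draft is None:
        warnings.append("no known drafter for " + target_archs[0])
    return draft, warnings


draft, notes = pick_draft(["gemma4", "qwen3.6"])
print(draft, notes)
```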

### lucebox Python package (in-container CLI)
- `lucebox/` workspace member: `cli.py`, `autotune.py` (VRAM-based tier
  selection + per-host config writeback), `config.py` (typed TOML),
  `download.py`, `docker_run.py`, `host_check.py`, `host_facts.py`,
  `profile.py` (profile sweep across DFLASH_MAX_CTX × DFLASH_BUDGET, KV
  cache types, pFlash modes, lazy-draft, prefix-cache slots), `smoke.py`,
  `types.py`.
- `lucebox/tests/` for the typed surfaces.
- Level1/Level2/Level3 profile gates; sweep results merged back into
  `~/.lucebox/config.toml` only after capability + ds4-eval/agentic-tools/
  agentic-session validation gates pass.
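
The sweep-then-gate flow above could look roughly like this. It is a sketch under stated assumptions: `sweep_cells`, `should_write_back`, and the sample values are hypothetical, and only the two `DFLASH_*` variable names and the gate names come from the list.

```python
from itertools import product


def sweep_cells(max_ctx_values, budget_values):
    """Yield one env-override dict per (DFLASH_MAX_CTX, DFLASH_BUDGET) cell."""
    for max_ctx, budget in product(max_ctx_values, budget_values):
        yield {"DFLASH_MAX_CTX": str(max_ctx), "DFLASH_BUDGET": str(budget)}


def should_write_back(gates):
    """Merge results into config.toml only when every validation gate passed."""
    return bool(gates) and all(gates.values())


# Sample grid values are illustrative, not the project's real sweep ranges.
cells = list(sweep_cells([8192, 16384], [16, 32]))
gates = {"capability": True, "ds4-eval": True, "agentic-tools": False}
print(len(cells), should_write_back(gates))
```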

### luce-bench in monorepo
- `luce-bench/` workspace member at v0.2.4 — the standalone bench framework
  (areas: ds4-eval, code, longctx, agent, forge; sweep + per-host snapshot
  output; v0.2.4 includes the forge area's EvalConfig + run_scenario
  signature realignment).
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior
  git tag pin.
- `.github/workflows/release-luce-bench.yml` publishes to PyPI from the
  monorepo on `luce-bench-v*` tags (trusted publisher, `pypi` environment).

### harness workspace
- `harness/` workspace member: client adapters (`claude_code`, `codex`,
  `opencode`, `hermes`, `pi`, `openclaw`), `client_test_runner.py`,
  `benchmarks/run_lucebox_vs_llamacpp.sh`, prompts. `lucebox profile`
  delegates the actual bench runs to harness.

### Bench + profile evidence
- `server/docs/BENCHMARK_SNAPSHOT_SPEC.md` — schema for tuning/profile
  artifacts. Snapshots themselves live in the standalone
  `luce-bench-baselines` repo (out of this tree).

### Misc
- Updated CI workflow path filters for `server/` (post-rename).
- README's "Quick start" section, hardware coverage table, env var
  reference table; minor edits to optimizations READMEs.
- Model card sidecar updates landed alongside Luce-Org#269 but kept here at
  current values (qwen3.6, gemma-4-26b-a4b, gemma-4-31b, laguna-xs.2,
  `_schema.json`).

## Out of scope / follow-ups
- 31b backend wiring beyond what `share/model_cards/gemma-4-31b-it.json`
  shipped (working empirically @ 24GB on sindri AR-only; 26b spec-decode
  path already proven).
- gemma4 MoE expert split (howard0su's PR Luce-Org#262 territory; merged but not
  applied to gemma4 yet).
- Multi-Token Prediction (upstream PR #23398, draft).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Integrates PR Luce-Org#266 into the auto-integration stack over easel/auto-integration. Resolves server/ layout conflicts by keeping the current server tree, retaining existing harness adapters, and packaging the PR's metrics parser/session proxy/tests under harness/src/harness for uv workspace imports.

Verification:
- python3 -m py_compile harness/client_test_runner.py harness/clients/session_inject_proxy.py harness/src/harness/metrics_parser.py harness/src/harness/tests/*.py
- uv run --extra dev --package harness pytest harness/src/harness/tests -q (56 passed)
- git diff --check
- conflict marker scan (no conflict markers)
- C++ target build skipped: server/build missing in worktree
…d dep (v0.2.6)

UX fixes to `lucebench`:

* Preflight (liveness → /v1/models → /props) runs before any cases.
  Aborts with a clean one-line error if the server is unreachable —
  no more 92 timeouts on a typo'd URL.
* New `smoke` area: three tiny prompts that complete in seconds.
* `--areas` replaces `--area` / `--sweep`. Default: `smoke`.
  `--areas all` runs every stdlib area (old `--sweep`).
  `--areas ds4-eval,forge` runs a subset.
  Old flags still accepted with deprecation notice.
* `anthropic` SDK promoted from `[forge]` extra to a regular runtime
  dep. The forge area is no longer a special install path; future
  Anthropic-shape tests (smoke / longctx hitting `/v1/messages`) need
  the SDK too. `[forge]` extra preserved as an empty no-op alias for
  back-compat.

Bumps luce-bench to v0.2.6. Drops `luce-bench[forge]` suffix from
lucebox-hub root + harness deps — plain `luce-bench` resolves the SDK
now.
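
For illustration, the `--areas` resolution described above might be sketched as follows. `resolve_areas` and its parameter names are hypothetical, and whether `all` includes `smoke` is an assumption; the area names and the `smoke` default come from this PR.

```python
# Assumption: "all" expands to every area named in this PR, smoke included.
STDLIB_AREAS = ["smoke", "ds4-eval", "code", "longctx", "agent", "forge"]


def resolve_areas(areas=None, legacy_area=None, legacy_sweep=False):
    """Return (area_list, deprecation_notices) from new and legacy flags."""
    notices = []
    if areas is None and legacy_sweep:
        notices.append("--sweep is deprecated; use --areas all")
        areas = "all"
    elif areas is None and legacy_area is not None:
        notices.append("--area is deprecated; use --areas")
        areas = legacy_area
    if areas is None:
        return ["smoke"], notices            # default run is the smoke area
    if areas == "all":
        return list(STDLIB_AREAS), notices
    chosen = [name.strip() for name in areas.split(",") if name.strip()]
    unknown = [name for name in chosen if name not in STDLIB_AREAS]
    if unknown:
        raise ValueError("unknown area(s): " + ", ".join(unknown))
    return chosen, notices


print(resolve_areas("ds4-eval,forge"))
```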

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before this, the version (`[lucebench] v0.2.6 ...`) was only on the
area-header line which appears AFTER preflight + model resolution.
That made stale-cache debugging painful — by the time the user sees
which version is actually running, they've already scanned past four
lines of output.

New first line: `[lucebench] v0.2.6`. Stale uvx caches surface at a
glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port PR Luce-Org#62 onto the server/ layout in auto-integration.

The original PR split target/draft transient StepGraph cleanup to avoid gallocr reallocation churn after daemon resets. Current auto-integration already has a separate draft_sg, so apply the reset/migration cleanup and add the regression test under server/test.
…ea headers

The new first-line banner (b46ed55) already shows `[lucebench] v0.2.6`
at the top of every run. The sweep header and area header were each
prefixing their lines with `v{__version__} ...` too — two copies of
the version in any single-area run, three in a sweep.

Banner is the single source. Sweep / area headers now lead with
their actual context (`sweep name=...`, `area=...`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace hardcoded `version = "X.Y.Z"` strings with a `dynamic = ["version"]`
declaration backed by hatch-vcs. The release version is now computed from
the matching git tag (`luce-bench-v*` or `lucebox-v*`) at build time, with
a build hook writing `src/<pkg>/_version.py` so `__init__.py` can re-export
`__version__`. Untagged checkouts resolve to a `*.dev0` fallback, and
commits past a tag get a `.devN+g<sha>` suffix so dev installs are visibly
distinct from releases.

Single source of truth for each package's version is now the git tag —
drops the tag-vs-pyproject assertion step from the luce-bench release
workflow (now redundant: hatch-vcs derives the version from the tag itself,
so they can't disagree).
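
A rough sketch of how a tag-derived version string takes shape, assuming input in the form printed by `git describe --tags --long`. The `version_from_describe` helper and the `0.0.0.dev0` fallback value are hypothetical, and this is not hatch-vcs's actual code: real version schemes typically bump the base version before appending `.devN`, a step omitted here.

```python
import re

# <prefix>-v<X.Y.Z>-<commits since tag>-g<short sha>
_DESCRIBE = re.compile(
    r"^(?P<prefix>[a-z-]+)-v(?P<version>\d+\.\d+\.\d+)"
    r"-(?P<distance>\d+)-g(?P<sha>[0-9a-f]+)$"
)


def version_from_describe(describe):
    """Map a `git describe --long` string to a version string."""
    if not describe:
        return "0.0.0.dev0"          # untagged checkout: placeholder fallback
    match = _DESCRIBE.match(describe.strip())
    if match is None:
        raise ValueError("unrecognised describe output: " + describe)
    version = match.group("version")
    distance = int(match.group("distance"))
    if distance == 0:
        return version               # exactly on a release tag
    # Commits past the tag get a visible dev suffix.
    return f"{version}.dev{distance}+g{match.group('sha')}"


print(version_from_describe("luce-bench-v0.2.6-3-gabc1234"))
```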

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Quick start leads with the smoke default (no Status section gating it)
- Document the `smoke` area (3-case sanity check, runs by default)
- Add OpenRouter smoke recipe
- Add uvx-from-branch recipe for pre-release validation
- Note the `[lucebench] vX.Y.Z` version banner
- Replace deprecated `--sweep` with `--areas all` throughout
- Drop the stale "Sister of luce-dflash" framing; point Contributing
  at the lucebox-hub monorepo instead of the soon-to-be-archived
  easel/luce-bench repo

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The preflight grid now renders the full model list on the `/v1/models`
line, with a `*` prefix on the model that will actually be sent.
Default selection is the first model; an explicit `--model X` that
matches a listed id moves the `*`.

When the list overflows the 80-char preflight width, layout is: the
first model, then the selected model (if different), then sequential
fillers until the budget is hit, ending in `… (+N more)`. Long lists
(e.g. OpenRouter) stay readable instead of dumping hundreds of ids.

Examples:
  /v1/models   ✓  *deepseek-v4-flash, deepseek-v4-pro
  /v1/models   ✓  qwen/qwen3.6-27b, *anthropic/claude-opus-4-7, … (+2 more)

Collapses the now-redundant post-preflight resolution print loop into
a single `--model default → 'X'` line; the preflight grid already
shows what was on offer and which was picked.
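
The overflow layout can be sketched as below. `render_models` and its `width` parameter are hypothetical names, and always showing the first and selected models even when they exceed the budget is an assumption; the ordering rule and the `… (+N more)` tail follow the description above.

```python
def render_models(models, selected=None, width=80):
    """Render model ids on one line, starring the one that will be sent."""
    if not models:
        return ""
    if selected not in models:
        selected = models[0]         # default selection is the first model

    def mark(model_id):
        return "*" + model_id if model_id == selected else model_id

    full = ", ".join(mark(m) for m in models)
    if len(full) <= width:
        return full

    # Overflow order: first model, then the selected one, then list order.
    order = [models[0]]
    if selected != models[0]:
        order.append(selected)
    mandatory = len(order)
    order += [m for m in models if m not in order]

    shown = []
    for model_id in order:
        candidate = shown + [mark(model_id)]
        remaining = len(models) - len(candidate)
        tail = f", … (+{remaining} more)" if remaining else ""
        if len(shown) >= mandatory and len(", ".join(candidate) + tail) > width:
            break
        shown = candidate

    remaining = len(models) - len(shown)
    line = ", ".join(shown)
    return f"{line}, … (+{remaining} more)" if remaining else line


ids = ["qwen/qwen3.6-27b", "deepseek-v4-flash", "deepseek-v4-pro",
       "anthropic/claude-opus-4-7"]
print(render_models(ids, selected="anthropic/claude-opus-4-7", width=60))
```

With a 60-character budget, the call above produces the same shape as the second example line in this commit message.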

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-facing preflight line and surrounding docstrings/comments now
call the server 'lucebox' instead of 'dflash'. Examples:

  /props       ✓  absent (non-lucebox server) — skipped

Scope is intentionally narrow (CLI module only): DFLASH_* env vars,
the DflashRuntime class + cfg.dflash TOML key, and binary names stay
as-is — those are interface renames for a separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…decode split, token-count breakdown)

When the server doesn't surface `usage.timings.decode_tokens_per_sec`
(e.g. OpenRouter, vLLM), fall back to `completion_tokens / wall_seconds`
and mark it with a trailing `*` so the "prefill rolled in" caveat stays
visible. When lucebox-server provides `prefill_ms` + `decode_ms`, show
them as `prefill=210ms decode=3.5s` instead of just the wall time.
Capture `usage.completion_tokens_details.reasoning_tokens` (OpenAI/OR
shape) plus the deprecated top-level `usage.reasoning_tokens` so the
per-case line can render `in=42 think=120 out=8` — no tokenizer
dependency, just what the server reports.
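
A sketch of the fallback and token breakdown, assuming an OpenAI-shaped `usage` dict. The function names are hypothetical, and computing `out` as `completion_tokens` minus reasoning tokens is an assumption chosen to match the `in=42 think=120 out=8` example.

```python
def decode_rate(usage, wall_seconds):
    """Prefer the server-reported decode rate; else estimate and star it."""
    timings = usage.get("timings") or {}
    rate = timings.get("decode_tokens_per_sec")
    if rate is not None:
        return f"{rate:.1f} tok/s"
    if wall_seconds <= 0:
        return "n/a"
    # Fallback includes prefill time, hence the trailing "*".
    return f"{usage.get('completion_tokens', 0) / wall_seconds:.1f} tok/s*"


def reasoning_tokens(usage):
    """Read the nested field first, then the deprecated top-level one."""
    details = usage.get("completion_tokens_details") or {}
    value = details.get("reasoning_tokens")
    if value is None:
        value = usage.get("reasoning_tokens")
    return value or 0


def token_breakdown(usage):
    think = reasoning_tokens(usage)
    out = max(0, usage.get("completion_tokens", 0) - think)
    return f"in={usage.get('prompt_tokens', 0)} think={think} out={out}"


sample = {"prompt_tokens": 42, "completion_tokens": 128,
          "completion_tokens_details": {"reasoning_tokens": 120}}
print(decode_rate(sample, 4.0), token_breakdown(sample))
```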

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Docker image versioning now mirrors the Python packages' hatch-vcs
scheme: `lucebox-v<X.Y.Z>` git tags are the single source of truth.

docker-bake.hcl
  Accept a `VERSION` env var. When set, `cuda12-local` emits both the
  moving `lucebox-hub:cuda12` and the pinned `lucebox-hub:<version>-cuda12`.
  Unset → just the moving tag (zero-config local builds still work).
  Legacy `TAG=` override still honored for back-compat.

scripts/build_image.sh
  Thin wrapper that runs `git describe --tags --match 'lucebox-v*'`
  (the same glob pattern hatch-vcs uses), strips the `lucebox-v` prefix, exports
  VERSION, and delegates to `docker buildx bake cuda12-local`. Handles
  dirty trees + untagged checkouts gracefully.
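
As a sketch of the tag logic shared by `docker-bake.hcl` and `scripts/build_image.sh`, here in Python rather than HCL or bash. `strip_prefix` and `local_tags` are hypothetical helpers; the tag shapes `lucebox-hub:cuda12` and `lucebox-hub:<version>-cuda12` come from the description above.

```python
def strip_prefix(describe, prefix="lucebox-v"):
    """Drop the tag prefix from `git describe` output; None if it is absent."""
    if not describe or not describe.startswith(prefix):
        return None
    return describe[len(prefix):]


def local_tags(version=None, image="lucebox-hub"):
    """Always emit the moving tag; add the pinned tag when VERSION is set."""
    tags = [f"{image}:cuda12"]
    if version:
        tags.append(f"{image}:{version}-cuda12")
    return tags


print(local_tags(strip_prefix("lucebox-v1.2.3")))
print(local_tags(None))
```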

.github/workflows/docker.yml
  - Bare `cuda12` tag (was release-only) now also fires on pushes to
    main and on workflow_dispatch with push:true. Branch builds + PRs
    still get only branch-suffixed tags so they can't clobber `:cuda12`.
  - New `type=match,pattern=lucebox-v(\d+\.\d+\.\d+)` rule auto-emits
    `<version>-cuda12` when a `lucebox-v*` tag is pushed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge latest origin/main README/GPU asset updates into the integration stack and record this run's PR 237 conflict-resolution attempts.

Verification: git diff --check; PR 237 merge attempted in isolated worktree with Claude Code and Codex assistance, conflicts retained for manual inspection.
The previous `$(git rev-parse --show-toplevel)/..` resolved to the
parent directory of the repo, so `docker buildx bake` failed with
"couldn't find a bake definition". Strip the stray `/..` and add a
proper fallback for when the script is run outside a git checkout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ble-fire

`a || b && c` left-associates: when `a` (git rev-parse) succeeds, the
short-circuit skips `b` but `&& c` still runs, printing pwd a second
time and corrupting the captured path. Wrap the fallback in a subshell
so `&& pwd` only fires when git can't resolve the repo root.
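
The precedence issue is easy to reproduce with stand-in commands. This sketch drives `sh` from Python and assumes a POSIX shell is on `PATH`; `true` and `echo` stand in for `git rev-parse` and the fallback, so it is not the script's real command line.

```python
import subprocess


def run_sh(snippet):
    """Run a snippet under sh and return whatever it printed to stdout."""
    result = subprocess.run(
        ["sh", "-c", snippet], capture_output=True, text=True, check=False
    )
    return result.stdout


# Parsed as (true || echo fallback) && echo tail, so "tail" still prints.
ungrouped = run_sh("true || echo fallback && echo tail")

# The subshell keeps "&& echo tail" inside the fallback branch only.
grouped = run_sh("true || (echo fallback && echo tail)")

print(repr(ungrouped), repr(grouped))
```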

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hermes PR Integrator and others added 28 commits May 28, 2026 19:56
Record the 2026-05-28 20:23 cron revalidation: upstream and carried PR heads remain current, draft Luce-Org#304 is excluded, fresh conflicted-PR probes were retained, and a tmux-driven Codex inspection keeps Luce-Org#135 as a designed current-layout port instead of a mechanical merge.
Revalidate open non-draft PR classification against origin/main 8782d07 and easel/auto-integration fbea938. Fresh isolated probes confirm unchanged conflict classes for the remaining selective-port/superseded PRs.
Revalidate open non-draft PR classifications against origin/main 8782d07 and easel/auto-integration 9117fc1. Record fresh isolated conflict probes for the remaining selective-port/superseded PRs.
Revalidate open non-draft contributor PR classification against origin/main 8782d07 and easel/auto-integration 7a869df. Record fresh isolated conflict probes for the unchanged blocked/superseded PR set.
Record the 2026-05-28 21:23 unattended refresh, including fresh open-PR fetches, ancestor checks, and conflict-probe worktrees.
Revalidates open PR classification after fetching origin/easel, records fresh isolated conflict probes for the unchanged blocked PR set, and updates validation details for the current unattended run.
Revalidate open Lucebox PR classification against origin/main 8782d07 and easel/auto-integration 984a022. Repeat direct merge probes for still-conflicted non-draft PRs in isolated worktrees; source stack remains unchanged.
Record the latest open PR classification, note that Luce-Org#297 is non-draft and already carried, and capture refreshed conflict probes plus Luce-Org#237 delegated feasibility results.
The post-push PR recheck showed Luce-Org#297 is draft again, so keep it as an already-carried draft dependency rather than a current non-draft target.
Revalidate open PR classifications and retained probe worktrees for the 2026-05-28 22:30 cron run. No source stack rewrite was needed.
Revalidate open PR heads against origin/main and easel/auto-integration. Record repeated conflict probes for remaining non-draft PRs and the tmux-driven Luce-Org#237 Codex salvage-port report.
Record the 2026-05-28 23:14 cron pass, repeated conflict probes, and the Codex salvage assessment for PR Luce-Org#221. No source stack rewrite was needed because origin/main and carried non-draft PR heads were already current.
Record the 2026-05-28 23:28 cron pass, repeated conflict probes, and the Codex salvage assessment for PR Luce-Org#135. No source stack rewrite was needed because origin/main and carried non-draft PR heads were already current.
Record the 2026-05-29 cron preflight, open PR classification, repeated conflicted merge probes, and retained worktree paths. No source stack rewrite was needed because origin/main and all carried mergeable non-draft PR heads were already included.
Note the overlapping manifest refresh observed during the cron run and correct the recorded parent commit for this manifest-only update.
Record the 2026-05-29 cron preflight, current PR classification, repeated conflicted merge probes, and fresh delegated feasibility results for the remaining selective-port candidates.
Record the 2026-05-29 00:41 cron preflight, open PR classification, repeat worktree merge probes, and fresh delegated PR Luce-Org#135 attempts. No source stack rewrite was needed because origin/main and carried mergeable PR heads were already included.
Record the 2026-05-29 01:00 EDT unattended integration run, fresh merge probes, and Codex feasibility output for PR Luce-Org#135.
…nch in-tree

Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main.
Net: 189 files changed.

Major workstreams folded in:

* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage
  Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache,
  build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID
  guard, container-state preflight), cmd_systemctl_passthrough (already-
  active short-circuit, restart-loop detection), cmd_update (bootstrap-
  installer pattern), cmd_completion (bash/zsh/fish), config.toml reader
  (env > toml > default precedence), shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the
  installed copy so lucebox update keeps tracking the channel; refuses
  SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence,
  config get/set/unset sub-app, models list/download sub-app (replaces
  download-models), autotune with --apply / --json / --sweep, profile
  collapsed onto luce-bench snapshot (1701 → 183 lines).
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 +
  levels (level0/1/2/3) + report subcommand + submit-baseline + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info read at
  startup, /props.build identity, GGUF metadata + sha256 sidecars,
  model card sidecars.
* Harness: client implementations for claude/codex/opencode/hermes/pi.
* Strict 11-field config.toml allowlist for dflash.* runtime tunables.

Deleted (rolled into new structure):
* server/scripts/bench_agent.py, bench_he.py, bench_llm.py — replaced by
  luce-bench snapshot + areas.
* lucebox configure, lucebox download-models, lucebox benchmark — replaced
  by config sub-app, models sub-app, autotune --sweep.
* luce-bench --sweep flag — moved to argv-sniff subcommand dispatch.

Conflict resolution:
* server/scripts/bench_{agent,he,llm}.py — modify/delete kept the deletion
  (feat/lucebox-docker moved bench machinery into luce-bench).
* README.md — took feat-branch version. origin/main had 19 commits worth
  of minor README tweaks since the branch base; those need to be folded
  back in as a follow-up PR.
* docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md — took
  feat-branch version. origin/main had 1 link-fix commit; feat-branch
  has the schema-4 + host-block additions that strictly supersede.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_load_or_build()` returned `config_mod.load()`'s result verbatim when
config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That
contradicted the precedence lucebox.sh documents (env > toml > default)
and bit sindri in production: its config.toml had `[image]` without a
`registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub`
beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`.
Symptom: `lucebox start` brought up the wrong (stale luce-org) image
even after explicit `lucebox install` + `lucebox pull` against easel.

Fix: overlay env on top of whatever `load()` returns (or `live_config()`
falls back to). Only the five top-level scalars have env hooks
(LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS) — dflash/host/model
intentionally don't.

Adds two regression tests:
- env beats config.toml when toml has no explicit value for that key,
- env still wins when toml is absent (covers the live_config fallback).

102 lucebox tests pass.
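
A minimal sketch of the env-over-toml overlay, assuming a frozen dataclass for the five top-level scalars. `TopLevel`, `overlay_env`, and every default except the image are hypothetical; the five `LUCEBOX_*` names and the `ghcr.io/luce-org/lucebox-hub` default come from this commit message.

```python
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TopLevel:
    variant: str = "cuda12"                       # placeholder default
    image: str = "ghcr.io/luce-org/lucebox-hub"
    port: int = 8080                              # placeholder default
    container: str = "lucebox"                    # placeholder default
    models: str = "~/models"                      # placeholder default


ENV_HOOKS = {
    "LUCEBOX_VARIANT": "variant",
    "LUCEBOX_IMAGE": "image",
    "LUCEBOX_PORT": "port",
    "LUCEBOX_CONTAINER": "container",
    "LUCEBOX_MODELS": "models",
}


def overlay_env(cfg, environ=None):
    """Return cfg with any set LUCEBOX_* variables applied on top."""
    environ = os.environ if environ is None else environ
    updates = {}
    for env_name, field_name in ENV_HOOKS.items():
        raw = environ.get(env_name)
        if raw:                                   # unset or empty: keep cfg value
            updates[field_name] = int(raw) if field_name == "port" else raw
    return replace(cfg, **updates)


loaded = TopLevel()                               # stands in for config_mod.load()
print(overlay_env(loaded, {"LUCEBOX_IMAGE": "registry.example/team/image"}).image)
```

Passing `environ` explicitly keeps the overlay testable without mutating the process environment.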

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g#285 CI

CI's "Lint Python surfaces touched by lucebox tooling" job ran
`ruff check .` and found 11 errors across surfaces this branch touches.
Ruff --fix handled 6 (import sorting, unused imports); 5 needed
hand-edits:

  luce-bench/src/lucebench/report.py:172  E741  rename `for l in` → `for lineup in`
  lucebox/tests/test_check.py:39, 95      E731  lambda → def stub() for the two HostFacts stubs
  lucebox/tests/test_cli.py:95            E501  wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
  lucebox/tests/test_sweep.py:174, 177    E501  wrap two CellResult constructors

22 lucebox tests touched still pass; ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Record the 2026-05-29 01:18 cron refresh, repeated conflict probes, and the fresh tmux-driven PR Luce-Org#237 Claude/Codex delegation results.
Merge PR Luce-Org#285 after it changed from draft to open during the cron run. Resolve refreshed Docker/lucebox/luce-bench conflicts by taking the PR head for feature files while preserving the server include required by the existing integration stack.

Validation:
- git diff --check
- python3 -m compileall -q lucebox/src lucebox/tests luce-bench/src luce-bench/tests harness/src
- uv run --with pytest python -m pytest lucebox/tests luce-bench/tests/test_report.py luce-bench/tests/test_smoke_area.py luce-bench/tests/test_runner.py -q
Keep the primary checkout clean after integrating PR Luce-Org#285 by ignoring the generated .docker-build/ CMake scratch directory. Update the auto-integration manifest with the final PR Luce-Org#285 merge and validation details.
@easel easel closed this May 29, 2026
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Record that PR Luce-Org#286 is closed and report integration branch status directly from easel/auto-integration.
6 participants