[FLUX.2] Add e2e text-to-image pipeline: nightly + benchmark #5359

Draft
ashokkumarkannan1 wants to merge 5 commits into main from akannan/flux2_e2e_pipeline

ashokkumarkannan1 commented Jun 24, 2026

Ticket

Problem description

What's changed

Wires the full FLUX.2-dev pipeline into nightly + benchmark CI, following the playground_v2_5 / sdxl_lightning e2e conventions. The standard diffusers Flux2Pipeline orchestrates; every compute module runs on TT via torch.compile(backend="tt"), tensor-parallel sharded with the same SPMD shard specs as the component tests.

  • Nightly (tests/torch/models/flux2/test_flux2_pipeline.py): self-contained — pipeline class + test inline, imports shard specs from third_party.tt_forge_models (no import from examples/). Added @pytest.mark.qb2_blackhole so the existing standalone-models multichip nightly job collects it; markers: tensor_parallel, nightly, model_test, large, qb2_blackhole, record_test_properties. Mesh is now selected from MESH_SHAPES[num_devices] (model-parallel axis fixed at degree 4; no-op at 4 chips). Memory strategy: the text encoder is placed → used → evicted before the transformer is placed, and the VAE is placed lazily at first decode, so peak DRAM ≈ max(component) rather than the sum.
  • Benchmark: the shared diffusion harness tests/benchmark/benchmarks/imagegen_benchmark.py is made component-set-agnostic — a model reports only the components it runs (FLUX.2 has a single text encoder, so te2 is omitted from the breakdown + measurements), with an optional per-step label and mesh_shape reporting. SDXL/Playground output is unchanged (verified on hardware, see table). A self-contained multichip benchmarks/flux2_pipeline.py sets CONVERT_SHLO_TO_SHARDY=1 before use_spmd() (required so the StableHLO handed to tt-mlir carries shardy annotations) and builds the mesh from MESH_SHAPES[num_devices], with per-component _perf timing. Entry test_flux2 in test_imagegen.py. Two-pass warmup + steady-state. optimization_level=0.
  • Matrix: flux2 entry added to .github/workflows/perf-bench-matrix.json with runs-on: qb2-blackhole (4-chip). A qb2-blackhole standalone-models nightly job (scoped to the flux2 dir, mark qb2_blackhole) added to model-test-lb-blackhole-nightly.json.
  • Machine placement: the component correctness tests (#5317, "FLUX.2: Add component tests (text encoder, transformer, VAE decoder)") run on lb-blackhole (the functional-nightly convention); the e2e pipeline + perf benchmark run on qb2-blackhole, matching the perf-matrix convention (all Blackhole multichip perf runs on qb2) and co-locating the e2e with the box it was validated on.
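
A minimal sketch of the mesh selection and environment ordering described in the bullets above. The contents of `MESH_SHAPES` and the `enable_spmd` callable are assumptions for illustration: the real table lives in the test infrastructure, and the real call is `use_spmd()`.

```python
import os

# Hypothetical table: device count -> (data-parallel, model-parallel) mesh shape.
# The model-parallel axis stays at degree 4, so a 4-chip box maps to (1, 4).
MESH_SHAPES = {4: (1, 4), 8: (2, 4), 32: (8, 4)}


def select_mesh_shape(num_devices):
    """Look up the mesh shape for the detected device count."""
    try:
        return MESH_SHAPES[num_devices]
    except KeyError:
        raise ValueError(f"no mesh shape registered for {num_devices} devices")


def configure_spmd(enable_spmd):
    """Set the shardy flag first, then enable SPMD.

    `enable_spmd` is a stand-in for use_spmd(); it is passed in so the
    ordering can be exercised without device libraries installed.
    """
    os.environ["CONVERT_SHLO_TO_SHARDY"] = "1"
    enable_spmd()
```

Passing the enable call in as a parameter keeps the ordering requirement (flag first, SPMD second) in one place and makes it checkable on a machine without hardware.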

Validated on a 4-chip qb2 Blackhole:

| Test | Result | Time |
|---|---|---|
| test_flux2_pipeline (nightly e2e) | ✅ passed | 20m59s (cold cache) |
| test_flux2 (benchmark) | ✅ passed | 19m54s (warmup+steady) |

Steady-state benchmark numbers (qb2-blackhole, 4 chips, opt_level=0, 50 steps):

| metric | value |
|---|---|
| e2e latency | 835.8s |
| warmup (incl. compile) | 336.2s |
| text encoder | 29.8s |
| transformer step mean | 13.55s |
| VAE | 6.4s |
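
As a rough cross-check of these numbers, the per-component figures can be summed and compared against the e2e latency. This assumes the e2e latency covers one text-encoder pass, 50 transformer steps at the reported mean, and one VAE decode; the PR does not itemize what makes up the remainder.

```python
# Values copied from the steady-state benchmark table (seconds).
NUM_STEPS = 50
e2e_latency_s = 835.8
text_encoder_s = 29.8
transformer_step_mean_s = 13.55
vae_s = 6.4

# Total time in the denoising loop, assuming every step costs the mean.
transformer_total_s = NUM_STEPS * transformer_step_mean_s

# Time covered by the three reported components.
accounted_s = text_encoder_s + transformer_total_s + vae_s

# Whatever the component breakdown does not capture.
unaccounted_s = e2e_latency_s - accounted_s

print(f"transformer total: {transformer_total_s:.1f}s")
print(f"accounted:         {accounted_s:.1f}s")
print(f"unaccounted:       {unaccounted_s:.1f}s")
```

Under that assumption the transformer loop accounts for 677.5s, the three components together for 713.7s, and about 122.1s of the e2e latency falls outside the reported breakdown.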

Both the nightly and benchmark passes produce a coherent 1024×1024 image.

Checklist

Single-device text_encoder/transformer are skipped (exceed single-chip DRAM);
the sharded variants run tensor-parallel with relaxed PCC (0.98). Submodule
left at main's pin — the FLUX.2 loader changes ship via the tt-forge-models PR.
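
For reference, PCC here is the Pearson correlation coefficient between the device output and a reference output. A minimal pure-Python sketch of such a check follows; the function names and flat-list inputs are illustrative and are not the test infrastructure's comparator.

```python
import math


def pcc(actual, expected):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(actual) != len(expected) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    mean_a = sum(actual) / n
    mean_e = sum(expected) / n
    cov = sum((a - mean_a) * (e - mean_e) for a, e in zip(actual, expected))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_e = sum((e - mean_e) ** 2 for e in expected)
    if var_a == 0 or var_e == 0:
        raise ValueError("PCC is undefined for constant input")
    return cov / math.sqrt(var_a * var_e)


def passes_pcc(actual, expected, required=0.98):
    """True when the correlation meets the (relaxed) threshold."""
    return pcc(actual, expected) >= required
```

A threshold of 0.98 tolerates small numerical drift from sharded execution while still rejecting outputs that are only loosely correlated with the reference.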
…732 combined commit)

Bumps third_party/tt_forge_models to f9d98fdb28, which carries both the
text-encoder OOM fix (#732) and the transformer contraction-parallel PCC
fix (#772), so the component tests and the upcoming e2e pipeline run against
the full set of FLUX.2 loader fixes.

Runs the standard diffusers Flux2Pipeline with every compute module on
Tenstorrent, tensor-parallel sharded (encoder + transformer) / replicated
(VAE), via torch.compile(backend="tt") wrappers with per-step CPU round-trips
and lazy VAE placement (mirrors the validated composite_all_tt bring-up). The
encoder is placed, used and evicted before the transformer to keep peak DRAM at
max(component) rather than the sum.
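
The placement order can be illustrated with a toy memory tracker. `DeviceMemory` is a hypothetical helper and the sizes are made-up units (the VAE size in particular is an assumption); the point is only that evicting the encoder before placing the transformer keeps the peak near the largest component instead of the sum.

```python
class DeviceMemory:
    """Tracks which components are resident and the peak total size."""

    def __init__(self):
        self.resident = {}
        self.peak = 0

    def place(self, name, size):
        self.resident[name] = size
        self.peak = max(self.peak, sum(self.resident.values()))

    def evict(self, name):
        del self.resident[name]


# Made-up sizes in arbitrary units, for illustration only.
SIZES = {"text_encoder": 24, "transformer": 32, "vae": 1}


def run_staged(sizes):
    """Encoder placed, used, evicted; then transformer; VAE placed lazily."""
    mem = DeviceMemory()
    mem.place("text_encoder", sizes["text_encoder"])
    # ... encode the prompt here ...
    mem.evict("text_encoder")
    mem.place("transformer", sizes["transformer"])
    # ... run the denoising steps here ...
    mem.place("vae", sizes["vae"])  # placed only at first decode
    return mem.peak


def run_all_resident(sizes):
    """Baseline: every component placed up front."""
    mem = DeviceMemory()
    for name, size in sizes.items():
        mem.place(name, size)
    return mem.peak
```

With these illustrative sizes the staged order peaks at 33 units (transformer plus the small VAE), against 57 when everything is resident at once.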

WIP: at 1024 on a 4-chip Blackhole branch the transformer's torch.compile(tt)
path currently has a very long cold compile and is OOM-prone; to be re-validated
after rebasing on the latest tt-mlir + tt-metal.
ashokkumarkannan1 changed the title from "[FLUX.2] Add e2e text-to-image pipeline: nightly + benchmark (4-chip blackhole)" to "[FLUX.2] Add e2e text-to-image pipeline: nightly + benchmark" on Jun 24, 2026
ashokkumarkannan1 force-pushed the akannan/flux2_e2e_pipeline branch from d5a2368 to 5ebda76 on June 24, 2026 11:38
ashokkumarkannan1 marked this pull request as draft on June 24, 2026 12:09
ashokkumarkannan1 force-pushed the akannan/flux2_e2e_pipeline branch 3 times, most recently from 5f9ca6e to d56e983 on June 24, 2026 18:14
Mirrors the image-gen e2e pattern (#5044 Playground, #5244 SDXL-Lightning,
#5291 Janus-Pro), extended for the first multichip (tensor-parallel) image-gen
model. FLUX.2 runs on 4 Blackhole chips: ~24B Mistral3 text encoder + ~32B
Flux2 transformer SPMD-sharded (model-parallel degree 4), VAE replicated.

Benchmark CI:
- imagegen_benchmark.py: make per-component perf reporting component-set
  agnostic (a model reports only the components it runs; FLUX.2 has a single
  text encoder, so te2 is omitted), with an optional step label and mesh_shape
  reporting. SDXL/Playground output is unchanged (verified on hardware).
- benchmarks/flux2_pipeline.py: new self-contained multichip pipeline. Sets
  CONVERT_SHLO_TO_SHARDY=1 before use_spmd() (required so tt-mlir gets shardy
  annotations), mesh from MESH_SHAPES[num_devices].
- test_imagegen.py: add test_flux2.
- perf-bench-matrix.json: flux2 entry on qb2-blackhole (4-chip).
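
One way to picture the component-set-agnostic reporting: the harness iterates over whatever components a model reports instead of a fixed list. The function name, dictionary layout, `te1` key, and sample timings below are assumptions for illustration and are not taken from imagegen_benchmark.py.

```python
def build_report(component_times_s, step_times_s, step_label="step", mesh_shape=None):
    """Build a perf breakdown from only the components a model actually ran."""
    # One entry per reported component; nothing is assumed to exist.
    report = {name: round(seconds, 2) for name, seconds in component_times_s.items()}
    if step_times_s:
        mean_step = sum(step_times_s) / len(step_times_s)
        report[f"{step_label} mean"] = round(mean_step, 2)
    # The mesh shape is reported only when the model supplies one.
    if mesh_shape is not None:
        report["mesh_shape"] = "x".join(str(dim) for dim in mesh_shape)
    return report


# A two-encoder, single-chip model keeps te2 and reports no mesh.
sdxl_like = build_report({"te1": 0.5, "te2": 0.7, "vae": 1.2}, [0.3, 0.5])

# A single-encoder, multichip model omits te2 and adds a step label and mesh.
flux2_like = build_report(
    {"te1": 29.8, "vae": 6.4},
    [13.5, 13.6],
    step_label="transformer step",
    mesh_shape=(1, 4),
)
```

Because absent components never enter the dictionary, a two-encoder model's output is unaffected by the single-encoder case.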

Nightly test (tests/torch/models/flux2/test_flux2_pipeline.py):
- Add @pytest.mark.qb2_blackhole and make the mesh adaptive via MESH_SHAPES
  (no-op at 4 chips).
- Add a qb2-blackhole standalone-models job (scoped to the flux2 dir) to
  model-test-lb-blackhole-nightly.json so the test is collected on 4 chips.

Validated on a 4-chip qb2 Blackhole: FLUX.2 benchmark passes with a coherent
1024x1024 image; SDXL single-chip benchmark passes unchanged.

Bring the component tests in line with the component-test PR (#5317):
- docstrings 128x128 -> 1024x1024 (FLUX.2 loader resolution per #772/#732)
- skip reasons: "requires a multi-chip mesh" instead of hardcoded "8+ chips"
- mark *_sharded tests lb_blackhole so the lb-blackhole nightly job collects them
- encoder PCC comment -> measured ~0.981

Component (correctness) tests run on lb-blackhole; the e2e pipeline runs on
qb2-blackhole (test_flux2_pipeline keeps qb2_blackhole).