[FLUX.2] Add e2e text-to-image pipeline: nightly + benchmark by ashokkumarkannan1 · Pull Request #5359 · tenstorrent/tt-xla

ashokkumarkannan1 · 2026-06-24T11:24:45Z

Ticket

[FLUX.2-dev] Wire e2e pipeline into nightly and benchmark CI #5234

Problem description

FLUX.2-dev passed component bring-up (text encoder, transformer, VAE decoder; added in #5317) and the full e2e text-to-image pipeline runs on Tenstorrent. However, no e2e pipeline was wired into benchmark CI, and the nightly e2e test carried no machine marker, so no CI job collected or ran it.
This mirrors the image-gen pattern introduced in #5044 (Playground v2.5) and extended by #5244 (SDXL-Lightning) and #5291 (Janus-Pro, 1B + 7B): a model whose components pass bring-up should get an e2e pipeline in nightly + a perf benchmark.
FLUX.2 is the first multichip (tensor-parallel) image-gen model to be wired in: the ~24B Mistral3 text encoder and ~32B Flux2 transformer are SPMD-sharded across the mesh's model axis (contraction-parallel degree 4); the VAE is replicated. It targets the 4-chip Blackhole QuietBox (qb2-blackhole).
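As a rough illustration of why the two large modules are sharded, the sketch below divides the quoted parameter counts by the tensor-parallel degree. It assumes every weight is sharded evenly and stored as 2-byte bfloat16; neither assumption is stated in this PR, and the real per-chip footprint also includes activations and any replicated tensors.

```python
def per_chip_share(params_billion: float, degree: int, bytes_per_param: int = 2):
    """Return (billions of params per chip, weight GB per chip) under even sharding.

    bytes_per_param=2 is an assumption (bfloat16); GB here means 1e9 bytes.
    """
    params_per_chip = params_billion / degree
    weight_gb = params_per_chip * bytes_per_param
    return params_per_chip, weight_gb


# Approximate sizes quoted in the PR description, model-parallel degree 4.
text_encoder = per_chip_share(24, 4)
transformer = per_chip_share(32, 4)
print(text_encoder, transformer)
```

Under these assumptions the weights alone come to roughly 6B and 8B parameters per chip for the text encoder and the transformer.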

What's changed

Wires the full FLUX.2-dev pipeline into nightly + benchmark CI, following the playground_v2_5 / sdxl_lightning e2e conventions. The standard diffusers Flux2Pipeline orchestrates, and every compute module runs on TT via torch.compile(backend="tt"), tensor-parallel sharded with the same SPMD shard specs as the component tests.

Nightly (tests/torch/models/flux2/test_flux2_pipeline.py): self-contained — pipeline class + test inline, imports shard specs from third_party.tt_forge_models (no import from examples/). Added @pytest.mark.qb2_blackhole so the existing standalone-models multichip nightly job collects it; markers: tensor_parallel, nightly, model_test, large, qb2_blackhole, record_test_properties. Mesh is now selected from MESH_SHAPES[num_devices] (model-parallel axis fixed at degree 4; no-op at 4 chips). Memory strategy: the text encoder is placed → used → evicted before the transformer is placed, and the VAE is placed lazily at first decode, so peak DRAM ≈ max(component) rather than the sum.
Benchmark: the shared diffusion harness tests/benchmark/benchmarks/imagegen_benchmark.py is made component-set-agnostic — a model reports only the components it runs (FLUX.2 has a single text encoder, so te2 is omitted from the breakdown + measurements), with an optional per-step label and mesh_shape reporting. SDXL/Playground output is unchanged (verified on hardware, see table). A self-contained multichip benchmarks/flux2_pipeline.py sets CONVERT_SHLO_TO_SHARDY=1 before use_spmd() (required so the StableHLO handed to tt-mlir carries shardy annotations) and builds the mesh from MESH_SHAPES[num_devices], with per-component _perf timing. Entry test_flux2 in test_imagegen.py. Two-pass warmup + steady-state. optimization_level=0.
Matrix: flux2 entry added to .github/workflows/perf-bench-matrix.json with runs-on: qb2-blackhole (4-chip). A qb2-blackhole standalone-models nightly job (scoped to the flux2 dir, mark qb2_blackhole) added to model-test-lb-blackhole-nightly.json.
Machine placement: the component correctness tests (#5317) run on lb-blackhole (the functional-nightly convention); the e2e pipeline + perf benchmark run on qb2-blackhole, matching the perf-matrix convention (all Blackhole multichip perf runs on qb2) and co-locating the e2e with the box it was validated on.
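The setup order described for the benchmark (set CONVERT_SHLO_TO_SHARDY=1, then enable SPMD, then pick the mesh from MESH_SHAPES[num_devices]) could be sketched roughly as follows. The table values, the (batch, model) axis order, and the helper names are hypothetical; the real MESH_SHAPES contents are not shown in this PR. The SPMD-enable call is passed in as a callable so the sketch runs without the TT/XLA stack.

```python
import os

# Hypothetical table: the repo's real MESH_SHAPES values are not shown in this PR.
# Assumed axis order is (batch, model), with the model axis fixed at degree 4.
MESH_SHAPES = {4: (1, 4), 8: (2, 4)}
MODEL_AXIS_DEGREE = 4


def select_mesh_shape(num_devices):
    """Look up the mesh shape and check the model axis keeps its fixed degree."""
    shape = MESH_SHAPES[num_devices]
    if shape[-1] != MODEL_AXIS_DEGREE:
        raise ValueError(f"model axis must be {MODEL_AXIS_DEGREE}, got {shape}")
    return shape


def enable_sharded_runtime(use_spmd):
    """Set the env var first, then call the SPMD-enable function."""
    os.environ["CONVERT_SHLO_TO_SHARDY"] = "1"
    use_spmd()


# Stand-in for use_spmd(): records what the env var was when it ran.
seen = []
enable_sharded_runtime(lambda: seen.append(os.environ.get("CONVERT_SHLO_TO_SHARDY")))
mesh_shape = select_mesh_shape(4)
```

With 4 devices the lookup returns the same shape a hardcoded 4-chip mesh would use, which is what the description means by "no-op at 4 chips".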

Validated on a 4-chip qb2 Blackhole:

Test	Result	Time
`test_flux2_pipeline` (nightly e2e)	✅ passed	20m59s (cold cache)
`test_flux2` (benchmark)	✅ passed	19m54s (warmup+steady)

Steady-state benchmark numbers (qb2-blackhole, 4 chips, opt_level=0, 50 steps):

metric	value
e2e latency	835.8s
warmup (incl. compile)	336.2s
text encoder	29.8s
transformer step mean	13.55s
VAE	6.4s
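A quick consistency check on the table, assuming the step mean applies to all 50 steps: the three reported components sum to less than the e2e latency. The PR does not say where the remainder goes, so the sketch only computes it.

```python
# Values copied from the steady-state table above (seconds).
steps = 50
text_encoder_s = 29.8
transformer_step_mean_s = 13.55
vae_s = 6.4
e2e_s = 835.8

# Assumes the mean step time holds for every one of the 50 steps.
transformer_total_s = steps * transformer_step_mean_s
component_sum_s = text_encoder_s + transformer_total_s + vae_s
unattributed_s = e2e_s - component_sum_s

print(f"transformer total: {transformer_total_s:.1f}s")
print(f"component sum:     {component_sum_s:.1f}s")
print(f"unattributed:      {unattributed_s:.1f}s")
```

The transformer loop accounts for about 678 s of the run, and about 122 s of the e2e figure is not broken out in the table.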

Both the nightly and benchmark passes produce a coherent 1024×1024 image.

Checklist

Validate nightly e2e + benchmark on 4-chip qb2 blackhole (logs - zip)
Confirm single-chip imagegen (SDXL-Lightning) benchmark unaffected by the harness change
Perf benchmark CI run (test_flux2, qb2-blackhole): https://github.com/tenstorrent/tt-xla/actions/runs/28118639108

Single-device text_encoder/transformer are skipped (exceed single-chip DRAM); the sharded variants run tensor-parallel with relaxed PCC (0.98). Submodule left at main's pin — the FLUX.2 loader changes ship via the tt-forge-models PR.

…732 combined commit) Bumps third_party/tt_forge_models to f9d98fdb28, which carries both the text-encoder OOM fix (#732) and the transformer contraction-parallel PCC fix (#772), so the component tests and the upcoming e2e pipeline run against the full set of FLUX.2 loader fixes.

Runs the standard diffusers Flux2Pipeline with every compute module on Tenstorrent, tensor-parallel sharded (encoder + transformer) / replicated (VAE), via torch.compile(backend="tt") wrappers with per-step CPU round-trips and lazy VAE placement (mirrors the validated composite_all_tt bring-up). The encoder is placed, used and evicted before the transformer to keep peak DRAM at max(component) rather than the sum. WIP: at 1024 on a 4-chip Blackhole branch the transformer's torch.compile(tt) path is currently a very long cold compile / OOM-prone; to be re-validated after rebasing on the latest tt-mlir + tt-metal.
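The peak-DRAM argument can be illustrated with a small bookkeeping sketch. The sizes are placeholders in arbitrary units, not measurements, and the sketch also evicts the transformer before placing the VAE, which the commit message does not state.

```python
class DramTracker:
    """Tracks resident component sizes and the highest total seen."""

    def __init__(self):
        self.resident = {}
        self.peak = 0

    def place(self, name, size):
        self.resident[name] = size
        self.peak = max(self.peak, sum(self.resident.values()))

    def evict(self, name):
        self.resident.pop(name, None)


def peak_staged(sizes):
    """Place, use, then evict each component before placing the next."""
    tracker = DramTracker()
    for name, size in sizes.items():
        tracker.place(name, size)
        # ... the component would run here ...
        tracker.evict(name)
    return tracker.peak


def peak_all_resident(sizes):
    """Place every component up front and keep all of them resident."""
    tracker = DramTracker()
    for name, size in sizes.items():
        tracker.place(name, size)
    return tracker.peak


# Placeholder sizes in arbitrary units, not measurements from this PR.
sizes = {"text_encoder": 24, "transformer": 32, "vae": 1}
staged = peak_staged(sizes)
eager = peak_all_resident(sizes)
print(staged, eager)
```

With these placeholders the staged order peaks at the largest component (32), while keeping everything resident peaks at the sum (57).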

Mirrors the image-gen e2e pattern (#5044 Playground, #5244 SDXL-Lightning, #5291 Janus-Pro), extended for the first multichip (tensor-parallel) image-gen model. FLUX.2 runs on 4 Blackhole chips: ~24B Mistral3 text encoder + ~32B Flux2 transformer SPMD-sharded (model-parallel degree 4), VAE replicated.

Benchmark CI:
- imagegen_benchmark.py: make per-component perf reporting component-set agnostic (a model reports only the components it runs; FLUX.2 has a single text encoder, so te2 is omitted), with an optional step label and mesh_shape reporting. SDXL/Playground output is unchanged (verified on hardware).
- benchmarks/flux2_pipeline.py: new self-contained multichip pipeline. Sets CONVERT_SHLO_TO_SHARDY=1 before use_spmd() (required so tt-mlir gets shardy annotations), mesh from MESH_SHAPES[num_devices].
- test_imagegen.py: add test_flux2.
- perf-bench-matrix.json: flux2 entry on qb2-blackhole (4-chip).

Nightly test (tests/torch/models/flux2/test_flux2_pipeline.py):
- Add @pytest.mark.qb2_blackhole and make the mesh adaptive via MESH_SHAPES (no-op at 4 chips).
- Add a qb2-blackhole standalone-models job (scoped to the flux2 dir) to model-test-lb-blackhole-nightly.json so the test is collected on 4 chips.

Validated on a 4-chip qb2 Blackhole: FLUX.2 benchmark passes with a coherent 1024x1024 image; SDXL single-chip benchmark passes unchanged.
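One way to read "component-set agnostic" reporting is that the breakdown is built from whatever timings a model supplies, rather than from a fixed list of component names. The sketch below is illustrative only: build_breakdown, the dictionary layout, and the te1/unet names are hypothetical and not taken from imagegen_benchmark.py.

```python
def build_breakdown(timings, step_label="step"):
    """Build (label, seconds) rows from only the components a model reported.

    A scalar is a one-shot component; a list holds per-step times and is
    reported as a mean under the given step label.
    """
    rows = []
    for name, value in timings.items():
        if isinstance(value, (list, tuple)):
            rows.append((f"{name} {step_label} mean", sum(value) / len(value)))
        else:
            rows.append((name, value))
    return rows


# Placeholder timings, not benchmark results.
two_encoder_model = {"te1": 1.0, "te2": 2.0, "unet": [1.0, 3.0], "vae": 0.5}
single_encoder_model = {"te1": 1.0, "transformer": [1.0, 3.0], "vae": 0.5}

two_rows = build_breakdown(two_encoder_model)
single_rows = build_breakdown(single_encoder_model, step_label="denoise step")
print(single_rows)
```

A model with a single text encoder simply never supplies a te2 entry, so no te2 row appears, and the two-encoder output keeps the rows it already had.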

Bring the component tests in line with the component-test PR (#5317):
- docstrings 128x128 -> 1024x1024 (FLUX.2 loader resolution per #772/#732)
- skip reasons: "requires a multi-chip mesh" instead of hardcoded "8+ chips"
- mark *_sharded tests lb_blackhole so the lb-blackhole nightly job collects them
- encoder PCC comment -> measured ~0.981

Component (correctness) tests run on lb-blackhole; the e2e pipeline runs on qb2-blackhole (test_flux2_pipeline keeps qb2_blackhole).

ashokkumarkannan1 added 3 commits June 22, 2026 09:05

ashokkumarkannan1 changed the title ~~[FLUX.2] Add e2e text-to-image pipeline: nightly + benchmark (4-chip blackhole)~~ [FLUX.2] Add e2e text-to-image pipeline: nightly + benchmark Jun 24, 2026

ashokkumarkannan1 force-pushed the akannan/flux2_e2e_pipeline branch from d5a2368 to 5ebda76 Compare June 24, 2026 11:38

vvukomanTT approved these changes Jun 24, 2026


nsumrakTT approved these changes Jun 24, 2026


ashokkumarkannan1 marked this pull request as draft June 24, 2026 12:09

ashokkumarkannan1 force-pushed the akannan/flux2_e2e_pipeline branch 3 times, most recently from 5f9ca6e to d56e983 Compare June 24, 2026 18:14

ashokkumarkannan1 force-pushed the akannan/flux2_e2e_pipeline branch from d56e983 to ecade08 Compare June 24, 2026 18:25

ashokkumarkannan1 requested a review from meenakshiramanathan1 June 26, 2026 09:09

ashokkumarkannan1 wants to merge 5 commits into main from akannan/flux2_e2e_pipeline
