[FLUX.2] Add e2e text-to-image pipeline: nightly + benchmark#5359
Draft
ashokkumarkannan1 wants to merge 5 commits into
Single-device text_encoder/transformer tests are skipped (they exceed single-chip DRAM); the sharded variants run tensor-parallel with relaxed PCC (0.98). Submodule left at main's pin; the FLUX.2 loader changes ship via the tt-forge-models PR.
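A minimal sketch of what a relaxed PCC gate of this kind can look like, assuming PCC here means the Pearson correlation coefficient between a reference output and the device output. The helper names `pcc` and `passes_pcc` are hypothetical and are not taken from the test harness; only the 0.98 threshold comes from the text above.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(expected) != len(actual) or len(expected) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_e = sum(expected) / len(expected)
    mean_a = sum(actual) / len(actual)
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_e * var_a)


def passes_pcc(expected, actual, required_pcc=0.98):
    """True when the correlation meets the (relaxed) threshold."""
    return pcc(expected, actual) >= required_pcc
```

In a real test the two sequences would be the flattened reference and device tensors; plain lists are used here only to keep the sketch dependency-free.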
Runs the standard diffusers Flux2Pipeline with every compute module on Tenstorrent, tensor-parallel sharded (encoder + transformer) / replicated (VAE), via torch.compile(backend="tt") wrappers with per-step CPU round-trips and lazy VAE placement (mirrors the validated composite_all_tt bring-up). The encoder is placed, used and evicted before the transformer to keep peak DRAM at max(component) rather than the sum. WIP: at 1024 on a 4-chip Blackhole branch the transformer's torch.compile(tt) path is currently a very long cold compile / OOM-prone; to be re-validated after rebasing on the latest tt-mlir + tt-metal.
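The place, use, evict ordering described above can be illustrated with a small bookkeeping model. This is a sketch under stated assumptions, not the pipeline's code: `DramTracker` is a hypothetical helper, the component sizes are made-up units rather than measured footprints, and it assumes the transformer is also evicted before the VAE is placed, which the text does not state.

```python
class DramTracker:
    """Tracks which components are resident and the peak total footprint."""

    def __init__(self):
        self.resident = {}
        self.peak = 0

    def place(self, name, size):
        self.resident[name] = size
        self.peak = max(self.peak, sum(self.resident.values()))

    def evict(self, name):
        self.resident.pop(name)


def peak_staged(sizes, order):
    """Place, use, then evict each component before placing the next one."""
    tracker = DramTracker()
    for name in order:
        tracker.place(name, sizes[name])
        # ... the component would run here ...
        tracker.evict(name)
    return tracker.peak


def peak_all_resident(sizes, order):
    """Place every component up front and keep all of them resident."""
    tracker = DramTracker()
    for name in order:
        tracker.place(name, sizes[name])
    return tracker.peak


# Hypothetical sizes in arbitrary units, for illustration only.
SIZES = {"text_encoder": 12, "transformer": 16, "vae": 1}
ORDER = ("text_encoder", "transformer", "vae")
```

With these illustrative sizes the staged schedule peaks at the largest single component, while keeping everything resident peaks at the sum.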
Force-pushed from d5a2368 to 5ebda76
vvukomanTT approved these changes Jun 24, 2026
nsumrakTT approved these changes Jun 24, 2026
Force-pushed from 5f9ca6e to d56e983
Mirrors the image-gen e2e pattern (#5044 Playground, #5244 SDXL-Lightning, #5291 Janus-Pro), extended for the first multichip (tensor-parallel) image-gen model. FLUX.2 runs on 4 Blackhole chips: ~24B Mistral3 text encoder + ~32B Flux2 transformer SPMD-sharded (model-parallel degree 4), VAE replicated.

Benchmark CI:
- imagegen_benchmark.py: make per-component perf reporting component-set agnostic (a model reports only the components it runs; FLUX.2 has a single text encoder, so te2 is omitted), with an optional step label and mesh_shape reporting. SDXL/Playground output is unchanged (verified on hardware).
- benchmarks/flux2_pipeline.py: new self-contained multichip pipeline. Sets CONVERT_SHLO_TO_SHARDY=1 before use_spmd() (required so tt-mlir gets shardy annotations), mesh from MESH_SHAPES[num_devices].
- test_imagegen.py: add test_flux2.
- perf-bench-matrix.json: flux2 entry on qb2-blackhole (4-chip).

Nightly test (tests/torch/models/flux2/test_flux2_pipeline.py):
- Add @pytest.mark.qb2_blackhole and make the mesh adaptive via MESH_SHAPES (no-op at 4 chips).
- Add a qb2-blackhole standalone-models job (scoped to the flux2 dir) to model-test-lb-blackhole-nightly.json so the test is collected on 4 chips.

Validated on a 4-chip qb2 Blackhole: FLUX.2 benchmark passes with a coherent 1024x1024 image; SDXL single-chip benchmark passes unchanged.
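As a rough illustration of component-set-agnostic reporting, the sketch below builds a report from whatever components a model actually timed, so a model without a second text encoder simply has no `te2` entry. The function `build_perf_report`, its field names, and the sample timings are hypothetical and do not reflect the actual `imagegen_benchmark.py` interface.

```python
def build_perf_report(component_times, step_label=None, mesh_shape=None):
    """Summarise per-component timings for only the components that ran.

    component_times maps a component name to a list of durations in seconds.
    Components with no samples are left out rather than reported as zero.
    """
    report = {"components": {}}
    for name, samples in component_times.items():
        if not samples:
            continue
        report["components"][name] = {
            "calls": len(samples),
            "total_s": sum(samples),
            "mean_s": sum(samples) / len(samples),
        }
    # Optional fields are only added when the caller supplies them,
    # so existing single-chip output keeps its shape.
    if step_label is not None:
        report["step_label"] = step_label
    if mesh_shape is not None:
        report["mesh_shape"] = list(mesh_shape)
    return report


# Illustrative timings only; not measured values.
single_encoder = build_perf_report(
    {"te1": [0.5], "transformer": [0.25, 0.25], "vae": [0.5]},
    step_label="denoise",
    mesh_shape=(1, 4),
)
dual_encoder = build_perf_report(
    {"te1": [0.5], "te2": [0.5], "unet": [0.25], "vae": [0.5]}
)
```

Keeping the optional fields out of the report unless they are provided is one way to leave the output of existing models unchanged.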
Force-pushed from d56e983 to ecade08
Bring the component tests in line with the component-test PR (#5317):
- docstrings 128x128 -> 1024x1024 (FLUX.2 loader resolution per #772/#732)
- skip reasons: "requires a multi-chip mesh" instead of hardcoded "8+ chips"
- mark *_sharded tests lb_blackhole so the lb-blackhole nightly job collects them
- encoder PCC comment -> measured ~0.981

Component (correctness) tests run on lb-blackhole; the e2e pipeline runs on qb2-blackhole (test_flux2_pipeline keeps qb2_blackhole).
Ticket
Problem description
`model` axis (contraction-parallel degree 4); the VAE is replicated. It targets the 4-chip Blackhole QuietBox (qb2-blackhole).
What's changed
Wires the full FLUX.2-dev pipeline into nightly + benchmark CI, following the `playground_v2_5` / `sdxl_lightning` e2e conventions. The standard diffusers `Flux2Pipeline` orchestrates; every compute module runs on TT via `torch.compile(backend="tt")`, tensor-parallel sharded with the same SPMD shard specs as the component tests.
- Nightly test (`tests/torch/models/flux2/test_flux2_pipeline.py`): self-contained, with the pipeline class + test inline; imports shard specs from `third_party.tt_forge_models` (no import from `examples/`). Added `@pytest.mark.qb2_blackhole` so the existing standalone-models multichip nightly job collects it; markers: `tensor_parallel`, `nightly`, `model_test`, `large`, `qb2_blackhole`, `record_test_properties`. Mesh is now selected from `MESH_SHAPES[num_devices]` (model-parallel axis fixed at degree 4; no-op at 4 chips). Memory strategy: the text encoder is placed → used → evicted before the transformer is placed, and the VAE is placed lazily at first decode, so peak DRAM ≈ max(component) rather than the sum.
- Benchmark CI: `tests/benchmark/benchmarks/imagegen_benchmark.py` is made component-set-agnostic: a model reports only the components it runs (FLUX.2 has a single text encoder, so `te2` is omitted from the breakdown + measurements), with an optional per-step label and `mesh_shape` reporting. SDXL/Playground output is unchanged (verified on hardware, see table). A self-contained multichip `benchmarks/flux2_pipeline.py` sets `CONVERT_SHLO_TO_SHARDY=1` before `use_spmd()` (required so the StableHLO handed to tt-mlir carries shardy annotations) and builds the mesh from `MESH_SHAPES[num_devices]`, with per-component `_perf` timing. Entry `test_flux2` in `test_imagegen.py`. Two-pass warmup + steady-state. `optimization_level=0`.
- `flux2` entry added to `.github/workflows/perf-bench-matrix.json` with `runs-on: qb2-blackhole` (4-chip).
- A `qb2-blackhole` standalone-models nightly job (scoped to the `flux2` dir, mark `qb2_blackhole`) added to `model-test-lb-blackhole-nightly.json`.
- Component (correctness) tests run on `lb-blackhole` (the functional-nightly convention); the e2e pipeline + perf benchmark run on `qb2-blackhole`, matching the perf-matrix convention (all Blackhole multichip perf runs on qb2) and co-locating the e2e with the box it was validated on.

Validated on a 4-chip qb2 Blackhole:
- `test_flux2_pipeline` (nightly e2e)
- `test_flux2` (benchmark)

Steady-state benchmark numbers (qb2-blackhole, 4 chips, opt_level=0, 50 steps):
Both the nightly and benchmark passes produce a coherent 1024×1024 image.
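The setup ordering described in this section (export `CONVERT_SHLO_TO_SHARDY=1` first, then enable SPMD, then pick the mesh by device count) can be sketched as follows. This is an assumption-laden illustration: the `MESH_SHAPES` table shown here is hypothetical and its real contents may differ, `select_mesh_shape` and `enable_spmd_with_shardy` are made-up helper names, and the `enable_spmd` callable stands in for the real `use_spmd()` call so the sketch runs without the device stack installed.

```python
import os

# Hypothetical device-count -> mesh-shape table. The repository's real
# MESH_SHAPES may hold different entries; this one keeps the last
# (model-parallel) axis fixed at degree 4, as the text describes.
MESH_SHAPES = {4: (1, 4), 8: (2, 4)}


def select_mesh_shape(num_devices, table=MESH_SHAPES):
    """Look up the mesh shape for the available device count."""
    try:
        return table[num_devices]
    except KeyError:
        raise ValueError(f"no mesh shape defined for {num_devices} devices") from None


def enable_spmd_with_shardy(enable_spmd):
    """Set the shardy conversion flag, then enable SPMD.

    The environment variable is written before the callable runs so that
    anything read during SPMD initialisation already sees it.
    """
    os.environ["CONVERT_SHLO_TO_SHARDY"] = "1"
    enable_spmd()
```

Passing the enabling function in as an argument keeps the ordering constraint in one place and makes it easy to check without hardware.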
Checklist
Validate nightly e2e + benchmark on 4-chip qb2 blackhole (logs - zip)
Confirm single-chip imagegen (SDXL-Lightning) benchmark unaffected by the harness change
Perf benchmark CI run (`test_flux2`, qb2-blackhole): https://github.com/tenstorrent/tt-xla/actions/runs/28118639108