Closed
22 commits
5e08403
fix: resolve simulator deadlock for hierarchical dataflow regions
sunwookim028 Mar 6, 2026
83905ea
Fix HLS codegen for hierarchical dataflow regions (#561)
sunwookim028 Mar 6, 2026
46cfa30
Merge remote-tracking branch 'origin/main' into feature/mesh-accelerator
sunwookim028 Mar 6, 2026
ecd0180
fix: simulator hierarchical deadlock
sunwookim028 Mar 6, 2026
812d222
Fix dataflow region nested kernel issues and OMP deadlock
sunwookim028 Mar 6, 2026
d58130e
feat: mesh architecture v1
sunwookim028 Mar 8, 2026
0a01930
feat: non-blocking streams, decoupled mesh, and performance evaluatio…
sunwookim028 Mar 9, 2026
01f25e2
feat: Catapult HLS synthesis working end-to-end for decoupled 2x1 mesh
sunwookim028 Mar 10, 2026
4580b7a
docs: add non-blocking stream walkthrough, Catapult analysis, and env…
sunwookim028 Apr 7, 2026
bc8f6ef
fix: apply stashed improvements for simulator, ip paths, and code gen…
sunwookim028 Apr 7, 2026
f8fa50c
Merge upstream main (SPMW ops #555) into feature/mesh-accelerator
sunwookim028 Apr 10, 2026
5595fab
fix: add stateful backward-compat alias after SPMW upstream merge
sunwookim028 Apr 10, 2026
89339f0
chore: remove Catapult backend and Tapa NB stream additions
sunwookim028 Apr 13, 2026
e665bbc
fix: adopt upstream Stateful API, remove lowercase stateful alias
sunwookim028 Apr 13, 2026
039e163
chore: add CLAUDE.md, move STATE.md to root, commit issues/ and notes/
sunwookim028 Apr 13, 2026
cd16345
docs: update STATE.md and issues to reflect session end state
sunwookim028 Apr 13, 2026
ce1e0b3
fix: add float16->half type mapping for Vitis HLS backend
sunwookim028 Apr 13, 2026
853d5dd
docs: add fp16 HLS notes, issues 008-010, code quality + scratch path…
sunwookim028 Apr 13, 2026
cc1963a
fix: revert TapaModuleEmitter NB-stream override decls to unblock CI
sunwookim028 Apr 14, 2026
246e704
fix: fp16 scalar math dispatch and hls:: namespace for half ops (ISSU…
sunwookim028 Apr 14, 2026
10d3753
chore: add license headers to all files changed vs origin/main
sunwookim028 Apr 14, 2026
06ce561
chore: exclude fork-local mgmt files from license check; keep test he…
sunwookim028 Apr 14, 2026
14 changes: 14 additions & 0 deletions .gitignore
@@ -94,3 +94,17 @@ sg_execution_times.rst
tests/xlscc_tests/**/test.cpp
tests/xlscc_tests/**/test_block.cpp
tests/xlscc_tests/**/test

# Catapult / HLS experiment artifacts (local only)
mlir/build-rhel8/
hls_projects/
glibc_compat/
ccc/
f32_softmax_prj/
softmax_prj/
*_csim.prj/
*_csyn.prj/
catapult_*.prj/
catapult_*.prj


21 changes: 21 additions & 0 deletions AGENTS.md
@@ -21,3 +21,24 @@
# Don'ts
- Do not modify repository structure without approval
- Do not install system packages without explicit user confirmation

# Troubleshooting & Known Issues
- **"Fail to resolve the expression as symbolic expression" in Dataflow**: When using stream arrays (e.g., `gemm_in_A[m]`), the index `m` must be a compile-time constant (like `df.get_pid()`) or statically unrollable. Using a dynamic runtime loop index or variable will cause this type-inference failure. Manually unroll the loops or use literal constants where possible.
- **"AttributeError: 'ASTContext' object has no attribute 'global_op_cache'"**: This occurs when compiling stateful variables in nested kernels inside `df.region()`. Fixed upstream; ensure `ASTContext.copy()` properly preserves `self.global_op_cache`. Documented in `ALLO_CHANGE.md`.
- **"func.call op operand type mismatch: expected operand type 'i32', but provided 'memref<i32>'"**: When passing scalars into a `df.region()` top-level function, MLIR's `_build_top` lowers the arguments to memrefs, causing a crash when passed to inner scalar-expecting implementations. **Workaround**: Always type scalar arguments in `df.region()` and `df.kernel()` as 1-element arrays, e.g., `size: int32[1]` instead of `int32`, and access via `size[0]`.
- **Duplicate annotated variable names cause silent hangs**: In Allo, `x: int32 = expr` is an annotated assignment that creates a new buffer. If the same variable name is annotated twice in the same kernel (e.g., `addr1: int32 = ...` in two separate `if` branches), Allo may silently hang during compilation or simulation. **Fix**: Use unique names for each branch (e.g., `g_addr1` / `v_addr1`).
- **Dataflow deadlock from conditional stream usage**: In `df.region()`, all inner kernels execute exactly once per invocation. If one kernel conditionally `.put()`s to a stream but another kernel unconditionally `.get()`s from the same stream, the reader blocks forever when no data is sent. **Fix**: Use explicit enable streams (`en: Stream[int32, depth]`) that the producer always `.put()`s to and the consumer always `.get()`s from. Guard the data stream `.get()` behind a check on the value read from the enable stream (e.g., `if en_val == 1:`, where `en_val` holds the result of `en.get()`).
- **Multi-instruction loop pattern**: To execute multiple instructions per `CTRL_RUN`, all kernels must iterate a fixed `IMEM_SIZE` times using `for pc in range(IMEM_SIZE):`. The controller broadcasts enable signals each iteration; compute kernels conditionally execute or skip based on the enable value.
- **Multiple calls to inner df.region cause `@stateful` global redefinition**: If a `df.kernel` calls the same `df.region` from multiple branches (e.g., different `if/elif` arms), the MLIR builder emits the region's `@stateful` global declarations once per call site, causing `error: redefinition of symbol`. **Fix**: Restructure the kernel so there is exactly ONE call site to each inner region — set up arguments in branches, then call unconditionally at the end. Use a `CTRL_NOP` sentinel value if you need a no-op path.
- **Nested OMP parallelism deadlock**: When peer `df.kernel`s inside a `df.region` call sub-`df.region`s, nested OMP parallel sections deadlock because OpenMP serializes nested parallelism by default. **Fix**: The simulator backend now auto-sets `OMP_MAX_ACTIVE_LEVELS=4`. If running tests manually, set this env var.
- **All `@df.kernel` names must be globally unique**: Across the entire module, every `@df.kernel` function must have a unique Python name. Duplicate names cause MLIR symbol collisions. Use suffixes like `_t1`, `_p0`, `_p1` to distinguish per-tile instances.
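
The enable-stream fix for conditional stream usage can be illustrated outside Allo. The sketch below is a plain-Python model only: `queue.Queue` and threads stand in for Allo streams and kernels, and the names `producer`, `consumer`, and `run` are hypothetical. It shows the shape of the pattern: the enable token is always sent and always read, and the data stream is touched only when the token says data was sent.

```python
import queue
import threading


def producer(instrs, en, data):
    # Always send an enable token; send a payload only when there is one.
    for has_payload, value in instrs:
        en.put(1 if has_payload else 0)
        if has_payload:
            data.put(value)


def consumer(count, en, data, out):
    # Always read the enable token; read the data stream only if enabled.
    for _ in range(count):
        en_val = en.get()
        if en_val == 1:
            out.append(data.get())
        else:
            out.append(None)  # nothing was sent, so nothing is read


def run(instrs):
    en = queue.Queue(maxsize=4)    # stand-in for the enable stream
    data = queue.Queue(maxsize=4)  # stand-in for the data stream
    out = []
    threads = [
        threading.Thread(target=producer, args=(instrs, en, data)),
        threading.Thread(target=consumer, args=(len(instrs), en, data, out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return out


result = run([(True, 7), (False, 0), (True, 9)])
```

If the consumer instead called `data.get()` unconditionally, the second iteration would block forever, which is the deadlock described in the list above.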

# Key File Pointers (2x2 Mesh Accelerator)
- **Implementation**: [test_mesh_accelerator.py](tests/dataflow/test_mesh_accelerator.py)
- **Architecture Plan**: [PLAN.md](tests/dataflow/PLAN.md)
- **ISA Document**: [ISA.md](tests/dataflow/ISA.md)
- **Progress Log**: [PROGRESS.md](tests/dataflow/PROGRESS.md)
- **Compiler Changes**: [ALLO_CHANGE.md](ALLO_CHANGE.md)
- **Simulator Fix**: [simulator.py](allo/backend/simulator.py) — OMP injection for nested regions
- **Upstream Issue**: [cornell-zhang/allo#561](https://github.com/cornell-zhang/allo/issues/561) — HLS codegen bugs for nested regions

102 changes: 102 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,102 @@
## This fork's project management conventions

This fork (`sunwookim028/allo`, branch `feature/mesh-accelerator`) extends upstream Allo with
non-blocking stream primitives, a tile-based hierarchical dataflow mesh, and simulator/codegen
fixes. The following conventions govern how agents should navigate and update the project.

### notes/ — Modular knowledge files

`notes/` contains standalone reference documents covering architecture decisions, environment
setup, synthesis results, and exploration work. Examples: `ENVIRONMENT.md`, `HLS_SYNTH_REPORT.md`,
`ASIC_HLS_EXPLORATION.md`, `ALLO_CHANGE.md`.

**Agent rule**: When you discover new knowledge (a synthesis result, a bug root-cause, an env
quirk), find the relevant note and update it, or create a new note if none fits. Keep notes
factual and self-contained so any agent can read one note and understand the topic.

### issues/ — Self-contained task files

`issues/` contains one `.md` file per task (`ISSUE-NNN-short-title.md`). Each file describes
the problem, the plan, acceptance criteria, and status (`OPEN` / `IN-PROGRESS` / `DONE`).

**Agent rule**: When you complete work that resolves a task, mark the issue `DONE` and update
any relevant notes in the same commit. Never close an issue file without also checking whether
`STATE.md` needs updating.

### STATE.md — Project dashboard (root)

`STATE.md` at the repo root is the single source of truth for project vision, the task board
(status table of all ISSUE-NNN items), dependency graph, and upstream watch items.

**Agent rule**: After completing a task or merging upstream changes, update the task board row
in `STATE.md` in the same commit. Keep the "Completed This Cycle" section current.

### Commit hygiene

When a commit resolves a task tracked in `issues/`:

1. Update the issue file status to `DONE`.
2. Update the `STATE.md` task board row.
3. Update any affected note in `notes/`.
4. Include all three changes in the same commit.
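
The steps above could be partially checked before committing. The following is a minimal sketch under stated assumptions: the helper name `hygiene_problems` is hypothetical, and it only checks the mechanical part (an `issues/` change must be accompanied by a `STATE.md` change). Whether a note in `notes/` is "affected" still needs human or agent judgment.

```python
def hygiene_problems(changed_paths):
    """Return a list of hygiene problems for a set of changed file paths."""
    paths = set(changed_paths)
    touches_issue = any(p.startswith("issues/") for p in paths)
    problems = []
    if touches_issue and "STATE.md" not in paths:
        problems.append(
            "issue file changed but STATE.md task board row was not updated"
        )
    return problems


# An issue update on its own is flagged.
bad = hygiene_problems(["issues/ISSUE-008-fp16-hls-synthesis-verify.md"])

# Issue, task board, and note changed together pass the check.
good = hygiene_problems(
    [
        "issues/ISSUE-008-fp16-hls-synthesis-verify.md",
        "STATE.md",
        "notes/HLS_SYNTH_REPORT.md",
    ]
)
```

The list of changed paths could come from `git diff --cached --name-only`, split on newlines.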

Any step that touches GitHub upstream (push, `gh pr create/close/comment`) requires **explicit
user approval before execution**.

---

# Building
- Always run `conda activate allo` before building or running tests
- Run `pip install -v -e .` to build the full project (includes MLIR/C++ backend)
- Read `docs/source/dive/frontend_syntax.rst` for comprehensive Allo frontend syntax reference
- Read `docs/source/dive/dataflow.rst` for the dataflow programming model (regions, kernels, streams)

# Testing
- Run `bash scripts/lint/task_lint.sh` for formatting checks
- Run `python3 -m pytest --ignore=tests/dataflow tests -v` for tests
- Prefer running a single test file instead of the full suite (full suite is slow)
- Use only software simulators (`target="llvm"` or `target="simulator"`)
- If Vitis HLS tests are needed, ask the user to run them manually
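
When running a single dataflow test by hand, the `OMP_MAX_ACTIVE_LEVELS=4` setting mentioned under Troubleshooting can be applied to the child process environment. This is a sketch, not part of the repository: `build_test_invocation` is a hypothetical helper, and the test path is one of the files listed under Key File Pointers.

```python
import os
import sys


def build_test_invocation(test_file, base_env=None):
    """Build a pytest command and environment for one test file."""
    env = dict(os.environ if base_env is None else base_env)
    # Respect a value the user already set; otherwise use the documented default.
    env.setdefault("OMP_MAX_ACTIVE_LEVELS", "4")
    cmd = [sys.executable, "-m", "pytest", test_file, "-v"]
    return cmd, env


cmd, env = build_test_invocation(
    "tests/dataflow/test_decoupled_mesh.py", base_env={}
)
```

The pair can then be passed to `subprocess.run(cmd, env=env, check=True)` from inside the activated `allo` conda environment.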

# Code style
- Make small, targeted diffs rather than large refactors, and always be concise
- Prefer general solutions instead of one-off `if/else` patches
- Place Python frontend code in `allo/`
- Place MLIR dialects and passes code in `mlir/`
- Add tests and documentation for new features in `tests/` and `docs/`

# Don'ts
- Do not modify repository structure without approval
- Do not install system packages without explicit user confirmation

# Code Quality
- All implementation must follow the project's and relevant community's established practices
- Web-search idiomatic patterns before writing non-trivial code if not 100% confident
- No ad-hoc patching: check how similar features are done in the same file/module first
- This applies to all upstream PRs, compiler changes, MLIR passes, and HLS backend code

# Filesystem
- `/work/shared/users/phd/sk3463/` — NFS home for source files and docs; **quota-limited**
- `/scratch/sk3463/` — local scratch, ~1.8 TB free; use for HLS project dirs, build artifacts, large outputs

# Troubleshooting & Known Issues
- **"Fail to resolve the expression as symbolic expression" in Dataflow**: When using stream arrays (e.g., `gemm_in_A[m]`), the index `m` must be a compile-time constant (like `df.get_pid()`) or statically unrollable. Using a dynamic runtime loop index or variable will cause this type-inference failure. Manually unroll the loops or use literal constants where possible.
- **"AttributeError: 'ASTContext' object has no attribute 'global_op_cache'"**: This occurs when compiling stateful variables in nested kernels inside `df.region()`. Fixed upstream; ensure `ASTContext.copy()` properly preserves `self.global_op_cache`. Documented in `notes/ALLO_CHANGE.md`.
- **"func.call op operand type mismatch: expected operand type 'i32', but provided 'memref<i32>'"**: When passing scalars into a `df.region()` top-level function, MLIR's `_build_top` lowers the arguments to memrefs, causing a crash when passed to inner scalar-expecting implementations. **Workaround**: Always type scalar arguments in `df.region()` and `df.kernel()` as 1-element arrays, e.g., `size: int32[1]` instead of `int32`, and access via `size[0]`.
- **Duplicate annotated variable names cause silent hangs**: In Allo, `x: int32 = expr` is an annotated assignment that creates a new buffer. If the same variable name is annotated twice in the same kernel (e.g., `addr1: int32 = ...` in two separate `if` branches), Allo may silently hang during compilation or simulation. **Fix**: Use unique names for each branch (e.g., `g_addr1` / `v_addr1`).
- **Dataflow deadlock from conditional stream usage**: In `df.region()`, all inner kernels execute exactly once per invocation. If one kernel conditionally `.put()`s to a stream but another kernel unconditionally `.get()`s from the same stream, the reader blocks forever when no data is sent. **Fix**: Use explicit enable streams (`en: Stream[int32, depth]`) that the producer always `.put()`s to and the consumer always `.get()`s from. Guard the data stream `.get()` behind a check on the value read from the enable stream (e.g., `if en_val == 1:`, where `en_val` holds the result of `en.get()`).
- **Multi-instruction loop pattern**: To execute multiple instructions per `CTRL_RUN`, all kernels must iterate a fixed `IMEM_SIZE` times using `for pc in range(IMEM_SIZE):`. The controller broadcasts enable signals each iteration; compute kernels conditionally execute or skip based on the enable value.
- **Multiple calls to inner df.region cause `@stateful` global redefinition**: If a `df.kernel` calls the same `df.region` from multiple branches (e.g., different `if/elif` arms), the MLIR builder emits the region's `@stateful` global declarations once per call site, causing `error: redefinition of symbol`. **Fix**: Restructure the kernel so there is exactly ONE call site to each inner region — set up arguments in branches, then call unconditionally at the end. Use a `CTRL_NOP` sentinel value if you need a no-op path.
- **Nested OMP parallelism deadlock**: When peer `df.kernel`s inside a `df.region` call sub-`df.region`s, nested OMP parallel sections deadlock because OpenMP serializes nested parallelism by default. **Fix**: The simulator backend now auto-sets `OMP_MAX_ACTIVE_LEVELS=4`. If running tests manually, set this env var.
- **All `@df.kernel` names must be globally unique**: Across the entire module, every `@df.kernel` function must have a unique Python name. Duplicate kernel names cause MLIR symbol collisions. Use suffixes like `_t1`, `_p0`, `_p1` to distinguish per-tile instances.
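
The "exactly ONE call site" restructuring for inner regions is easier to see in code. The sketch below uses plain Python functions as stand-ins, not Allo kernels: `inner_region`, `tile_kernel`, and the `CTRL_LOAD` opcode are hypothetical names introduced for illustration, while `CTRL_NOP` and `CTRL_RUN` follow the names used in the list above. Branches only choose arguments, and the call happens once, unconditionally, at the end.

```python
CTRL_NOP, CTRL_LOAD, CTRL_RUN = 0, 1, 2  # CTRL_LOAD is a hypothetical opcode

call_log = []


def inner_region(ctrl, addr):
    # Stand-in for an inner df.region; CTRL_NOP makes it do nothing useful.
    call_log.append((ctrl, addr))
    if ctrl == CTRL_NOP:
        return 0
    return addr * 2 if ctrl == CTRL_RUN else addr


def tile_kernel(opcode, base):
    # Default to the no-op sentinel so every path reaches the same call.
    ctrl = CTRL_NOP
    addr = 0
    if opcode == "load":
        ctrl = CTRL_LOAD
        addr = base
    elif opcode == "run":
        ctrl = CTRL_RUN
        addr = base + 1
    # Single, unconditional call site for the inner region.
    return inner_region(ctrl, addr)


results = [tile_kernel(op, 3) for op in ("load", "run", "idle")]
```

In this shape the inner region appears at one call site regardless of how many branches exist, which is what avoids the duplicated global declarations described above.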

# Key File Pointers

- **Non-blocking stream tests**: [`tests/dataflow/test_stream_nb_simple.py`](tests/dataflow/test_stream_nb_simple.py)
- **Decoupled mesh (1-CT and 2×1)**: [`tests/dataflow/test_decoupled_mesh.py`](tests/dataflow/test_decoupled_mesh.py)
- **Blocking mesh reference**: [`tests/dataflow/test_hierachical_mesh.py`](tests/dataflow/test_hierachical_mesh.py)
- **Simulator backend**: [`allo/backend/simulator.py`](allo/backend/simulator.py)
- **Compiler changes log**: [`notes/ALLO_CHANGE.md`](notes/ALLO_CHANGE.md)
- **Project state (task board)**: [`STATE.md`](STATE.md)
- **Task issues**: [`issues/`](issues/)
- **Knowledge notes**: [`notes/`](notes/)
119 changes: 119 additions & 0 deletions STATE.md
@@ -0,0 +1,119 @@
# Project State

## Vision & Scope

This fork (`sunwookim028/allo`, branch `feature/mesh-accelerator`) extends the upstream
Allo compiler (`cornell-zhang/allo`) with:

1. **Non-blocking stream semantics** — `try_put`/`try_get`/`empty`/`full` FIFO primitives
enabling valid-ready handshake protocols between dataflow kernels. Implemented in the
MLIR dialect (`AlloOps.td`), the Vitis HLS emitter, and the simulator backend.

2. **Tile-based hierarchical dataflow mesh** — a programming model and example architecture
(Memory Tile + Compute Tiles) using decoupled control (NB streams) and burst data streams.
Key example: `tests/dataflow/test_decoupled_mesh.py` (1-CT and 2×1 mesh).

3. **Upstream bug fixes** — hierarchical dataflow simulator/codegen fixes submitted as PRs.

**What we do NOT maintain:** Catapult HLS backend (removed; see `notes/ASIC_HLS_EXPLORATION.md`),
Tapa NB stream additions (removed). Primary synthesis target: Vitis HLS.
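
To make the non-blocking semantics in item 1 concrete, here is a small Python model of a bounded FIFO with `try_put`/`try_get`/`empty`/`full`. It is a sketch of the general semantics only: the class name `BoundedFifo` and the return conventions (a success flag, plus the value for `try_get`) are assumptions, not the signatures of the Allo primitives.

```python
from collections import deque


class BoundedFifo:
    """Model of a fixed-depth FIFO with non-blocking access."""

    def __init__(self, depth):
        self.depth = depth
        self._items = deque()

    def empty(self):
        return len(self._items) == 0

    def full(self):
        return len(self._items) >= self.depth

    def try_put(self, value):
        # Returns False instead of blocking when the FIFO is full.
        if self.full():
            return False
        self._items.append(value)
        return True

    def try_get(self):
        # Returns (False, None) instead of blocking when the FIFO is empty.
        if self.empty():
            return False, None
        return True, self._items.popleft()


fifo = BoundedFifo(depth=2)
puts = [fifo.try_put(v) for v in (10, 20, 30)]  # third put finds the FIFO full
gets = [fifo.try_get() for _ in range(3)]       # third get finds it empty
```

The success flag plays the role of the valid/ready handshake: a kernel that sees `False` can do other work and retry later rather than stall.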

---

## Task Board

| ID | Title | Status | Upstream? | Blocked by |
|----|-------|--------|-----------|------------|
| [ISSUE-001](issues/ISSUE-001-pr554-black-format.md) | Black formatting on PR #554 branch | **DONE** | push to fork only | — |
| [ISSUE-002](issues/ISSUE-002-pr554-add-tests.md) | Add test cases for PR #554 | **DONE** | push + comment | ISSUE-001 |
| [ISSUE-003](issues/ISSUE-003-pr563-surgery.md) | Create focused PR replacing #563 | **NEEDS-REVIEW** (CI PASSED) | close PR + open new | ISSUE-004 |
| [ISSUE-004](issues/ISSUE-004-merge-upstream-spmw.md) | Merge upstream SPMW into feature branch | **DONE** | local only | — |
| [ISSUE-005](issues/ISSUE-005-u280-inner-product-overview.md) | U280 inner product — architecture & overview | **OPEN** | standalone RTL project | — |
| [ISSUE-006](issues/ISSUE-006-u280-rtl-implementation.md) | U280 inner product — RTL implementation (5 SV files) | **OPEN** | standalone RTL project | ISSUE-005 |
| [ISSUE-007](issues/ISSUE-007-u280-packaging-and-verify.md) | U280 inner product — packaging, build, sw_emu verify | **OPEN** | standalone RTL project | ISSUE-006 |
| [ISSUE-008](issues/ISSUE-008-fp16-hls-synthesis-verify.md) | Verify float16 arithmetic + exp synthesize in Vitis HLS | **DONE** | no (local verify; informs PR #578) | — |
| [ISSUE-009](issues/ISSUE-009-fp16-builder-scalar-exp.md) | Fix scalar exp dispatch in builder.py for float16 | **DONE** | yes (PR against upstream) | ISSUE-008 |
| [ISSUE-010](issues/ISSUE-010-fp16-emitter-exp-half.md) | Fix exp(half) emitter in EmitVivadoHLS.cpp | **DONE** | yes (follow-up to PR #578) | ISSUE-008 |

---

## Dependency Graph

```
ISSUE-001 ──► ISSUE-002 ──► (PR #554 merge-ready, await maintainer)
ISSUE-004 ──► ISSUE-003 ──► (PR #577 opened, replaces #563, CI passed)

ISSUE-005 (arch) ──► ISSUE-006 (RTL) ──► ISSUE-007 (package+verify)

ISSUE-008 (fp16 csyn verify) ──► ISSUE-009 (builder.py fix, upstream PR)
                             ──► ISSUE-010 (emitter fix, conditional on Test B fail)
[standalone U280 inner product RTL kernel project]
```

- **ISSUE-001 → ISSUE-002**: Black format fix must land on the branch before test push so CI
runs clean.
- **ISSUE-004 → ISSUE-003**: The cherry-pick in ISSUE-003 is less conflict-prone once we know
SPMW merges cleanly. The merge also confirms the commit SHA of `origin/main` that the new PR targets.
- ISSUE-001 and ISSUE-004 are **independent** and can proceed in parallel.
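
The ordering constraints in the graph can be checked mechanically. The sketch below encodes the edges from the diagram above as a predecessor map and uses the standard-library `graphlib` module (Python 3.9+) to compute a valid work order and the set of tasks with no blockers. The dictionary is transcribed by hand from the diagram, so treat it as illustrative.

```python
from graphlib import TopologicalSorter

# Each key depends on the issues in its value set.
deps = {
    "ISSUE-002": {"ISSUE-001"},
    "ISSUE-003": {"ISSUE-004"},
    "ISSUE-006": {"ISSUE-005"},
    "ISSUE-007": {"ISSUE-006"},
    "ISSUE-009": {"ISSUE-008"},
    "ISSUE-010": {"ISSUE-008"},
}

# One valid order in which every issue comes after its blockers.
order = list(TopologicalSorter(deps).static_order())

# Issues that can start immediately, in parallel.
sorter = TopologicalSorter(deps)
sorter.prepare()
ready_now = set(sorter.get_ready())
```

`ready_now` contains ISSUE-001 and ISSUE-004 together, matching the note that they are independent and can proceed in parallel.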

---

## Approval Gates

Any step that touches GitHub upstream requires explicit user approval **before** execution:

- Pushing to `fork` remote (triggers CI on open PRs)
- `gh pr close` / `gh pr create` / `gh pr comment`
- `git push fork <branch>`

Steps that are purely local (edit, commit, local merge, run tests) do **not** require approval.

---

## Upstream Watch

- **PR #574** (`fix/dataflow-kernel-ordering`, zzzDavid, open) — touches `allo/dataflow.py`;
our `fix/hierarchical-dataflow-codegen` also touches it. Rebase needed if #574 lands first.
- **PR #570** (`spmw-builder-tmp`, Fangtangtang, open) — touches `allo/ir/types.py`;
our `stateful = Stateful` alias is there. Rebase needed if #570 lands first.
- Note: upstream already had a Catapult backend (PR #543, zzzDavid, merged 2026-02-05).
Our Catapult work was modifications on top of that. Both are now removed from this branch.

---

## Completed This Cycle (2026-04-13, continued)

- ISSUE-001: PR #554 black fmt — DONE
- ISSUE-002: PR #554 tests — DONE
- ISSUE-003: PR #577 opened (replaces #563) — NEEDS-REVIEW (CI PASSED — awaiting maintainer)
- ISSUE-004: SPMW merge — DONE
- Catapult/Tapa removal — DONE (commit `89339f0`)
- `stateful` alias removed; all files updated to `Stateful` — DONE (commit `e665bbc`)
- Project structure: `CLAUDE.md` created, root `STATE.md` + `notes/` + `issues/` restructured — DONE (commit `039e163`)
- PR #554 CI: PASSED; PR #577 CI: PASSED
- ISSUE-008: fp16 HLS synthesis verified — Test A PASS (half synthesizes, LUT=3986, FF=4233, 411 MHz), Test B FAIL (exp(half) ambiguous in hls_math.h → ISSUE-010) — DONE
- ISSUE-009: builder.py scalar math dispatch now handles F16Type/BF16Type — DONE (fixes allo.exp(float16) Python→MLIR lowering)
- ISSUE-010: EmitVivadoHLS.cpp math unary ops now emit hls::exp etc. for Float16Type — DONE; also fixed missing TapaHLS emitStreamTry*/Empty/Full implementations; both fp16 synth tests PASS (arith: LUT=3986/FF=4233; exp: LUT=2668/FF=2576)

---

## Out-of-Scope (Deferred)

- Catapult HLS backend: removed from branch; see `notes/ASIC_HLS_EXPLORATION.md`
- Tapa NB stream support: removed; Vitis HLS is the validated primary target
- 2×2 decoupled mesh, credit-based flow control: future work on `feature/mesh-accelerator`
- CIRCT backend: future high-priority item once upstream PRs are resolved

---

## U280 Inner Product RTL Kernel (ISSUE-005 to 007)

- Issues moved to standalone project: `u280_inner_product/issues/`
- Entry point for a fresh agent: `u280_inner_product/AGENT.md`
- The project is self-contained; move `u280_inner_product/` outside Allo to use it independently.

Harness skeleton: `u280_producer_consumer_hw.prj/` — generated by
`tests/u280_hw_deploy.py --codegen-only`. Copied into `u280_inner_product/baseline/` (complete
harness) and `u280_inner_product/harness/` (xcl2.* + utils.mk only). Run
`./run_allo.sh python tests/u280_hw_deploy.py --codegen-only` to regenerate if missing.