Particle Transformer on Versal AI Engines

A reusable software framework for mapping quantized transformer models to the AI Engine (AIE) array of the AMD Versal VCK190. This project demonstrates an integer-only transformer implementation for jet tagging at particle accelerators, featuring a code-generation framework that automatically converts high-level Python model descriptions into Vitis AIE graphs.

Paper: "Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines" - FCCM Journal
Authors: Gram Koski, Sean Lipps, Zhenghua Ma, G Abarajithan, Ryan Kastner (UC San Diego)

Project Overview

This framework addresses the challenge of deploying transformer-based neural networks in real-time, resource-constrained edge inference systems. It is motivated by low-level triggers for jet tagging at particle accelerators, where the ideal constraints are:

Input Rate: 40 MHz collision rate
Latency Budget: A few microseconds for Level-1 Trigger decisions
Throughput Requirement: O(10⁵) events per second
Constraints: Tight on-detector power and resource budgets

Traditional CPU/GPU platforms cannot meet these requirements. This project instead targets the AI Engine array of the AMD Versal VCK190 SoC for low-latency, high-throughput ML inference using quantized integer-only arithmetic. The current implementation is preliminary and unoptimized: it demonstrates the framework's feasibility and establishes a foundation for future optimization. Bottlenecks such as the integer-only softmax (55+ µs latency) currently prevent real-time operation at production scale.

Key Contributions

Modular Code-Generation Framework: Automatically generates Vitis AIE graphs from high-level Python model descriptions with composable building blocks (Dense, MHA, ResAdd, DenseSoftmax layers)
Integer-Only Transformer: Fully quantized implementation using int8 weights/activations and int32 accumulators with fixed-point rescaling, including a novel integer-only softmax
Head-Parallel MHA Mapping: Assigns attention heads to parallel AIE tiles, reducing latency by ~3.9× (1 head: 734.6 µs vs. 4 heads: 187.0 µs) and raising throughput by ~3.6×
Validation Framework: Every model automatically runs both AIE emulation and a NumPy golden reference for numerical correctness verification
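
The head-parallel mapping works because attention heads are independent of one another until their outputs are concatenated. A minimal NumPy sketch of that independence follows, assuming softmax-free attention (as in the skeleton example) and hypothetical shapes of 160 tokens by 64 features. The names `head_attention` and `mha_by_head` are illustrative only and are not part of the framework.

```python
import numpy as np

def head_attention(q, k, v):
    """Softmax-free attention for one head, accumulated in int32."""
    scores = q.astype(np.int32) @ k.astype(np.int32).T  # (tokens, tokens)
    return scores @ v.astype(np.int32)                  # (tokens, d_head)

def mha_by_head(q, k, v, num_heads):
    """Split the feature axis into heads and evaluate each head independently."""
    q_heads = np.split(q, num_heads, axis=1)
    k_heads = np.split(k, num_heads, axis=1)
    v_heads = np.split(v, num_heads, axis=1)
    # No iteration depends on another, so each head could run on its own tiles.
    outs = [head_attention(qh, kh, vh)
            for qh, kh, vh in zip(q_heads, k_heads, v_heads)]
    return np.concatenate(outs, axis=1)

rng = np.random.default_rng(0)
# Hypothetical shapes and small values chosen so int32 cannot overflow.
q, k, v = (rng.integers(-8, 8, size=(160, 64), dtype=np.int8) for _ in range(3))
out = mha_by_head(q, k, v, num_heads=4)
print(out.shape)  # (160, 64)
```

On hardware, the list comprehension would correspond to separate tile groups running concurrently rather than a sequential loop.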

Installation and Setup

Requirements

Vitis 2024.1 or later (with appropriate licenses)
Python 3.8+
NumPy and dependencies listed in environment.yaml

Setup

# Create conda environment
conda env create -f environment.yaml
conda activate particle-transformer

# Verify installation
python -c "import numpy; print('NumPy OK')"

The AIEModel Framework

The framework exposes a clean Python API for building transformer models without manually writing C++ kernels or Vitis graph code:

from model import AIEModel
from layers import DenseLayer, MHALayer, ResAddLayer

# Create model with AIE grid parameters (m, k, n)
model = AIEModel(m=4, k=8, n=8, iterations=1, dynamic_quant=True)

# Add layers
dense = DenseLayer(name='dense_0', weight=W, bias=b, relu=True)
model.add_layer(dense, inputs=[None])

mha = MHALayer(name='mha_1', Wq=Wq, Wk=Wk, Wv=Wv, Wo=Wo, num_heads=4, ...)
model.add_layer(mha, inputs=[dense])

# Forward pass: generates code, compiles, emulates, validates
output = model.forward(input_data)

Framework Workflow

Model Definition → High-level Python layer specification
Golden Reference → NumPy implementation for validation
Code Generation → C++ kernels and Vitis graph instantiation
Compilation → Vitis 2024.1 compilation
AIE Emulation → Cycle-accurate simulation
Numerical Validation → AIE output vs. NumPy reference comparison
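
The final validation step amounts to an element-wise comparison of two integer tensors. The sketch below shows one way such a check could look. The function name `compare_outputs` and the `atol` parameter are assumptions for illustration; the framework's own comparison logic is not shown in this README.

```python
import numpy as np

def compare_outputs(aie_out, golden, atol=0):
    """Return (passed, max_abs_err, num_mismatches) for two integer tensors."""
    aie_out = np.asarray(aie_out, dtype=np.int64)  # widen so differences cannot wrap
    golden = np.asarray(golden, dtype=np.int64)
    if aie_out.shape != golden.shape:
        raise ValueError(f"shape mismatch: {aie_out.shape} vs {golden.shape}")
    err = np.abs(aie_out - golden)
    mismatches = int(np.count_nonzero(err > atol))
    return mismatches == 0, int(err.max(initial=0)), mismatches

golden = np.array([[10, -3], [7, 127]], dtype=np.int8)
aie_out = np.array([[10, -3], [8, 127]], dtype=np.int8)
print(compare_outputs(aie_out, golden))  # (False, 1, 1)
```

For integer-only pipelines an exact match (`atol=0`) is the natural default; a nonzero tolerance is only useful if rounding conventions differ between the two implementations.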

Examples

Each example demonstrates a different aspect of the framework. All use quantized int8 arithmetic on a 160×8 input tensor, padded from a 150×3 jet input (3 features per row).
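
The padding from 150×3 to 160×8 can be sketched with `numpy.pad`. This assumes zeros are appended after the real rows and columns, which the README does not state explicitly. The placeholder data is arbitrary.

```python
import numpy as np

# Placeholder jet data: 150 rows, 3 features, values kept inside the int8 range.
jet = (np.arange(150 * 3).reshape(150, 3) % 100).astype(np.int8)

# Append 10 zero rows and 5 zero columns (assumed padding position).
padded = np.pad(jet, ((0, 10), (0, 5)))
print(padded.shape)  # (160, 8)
```

Both padded dimensions are multiples of 8, which would fit the `m`, `k`, `n` values in the API example above, although the exact tiling requirement is an assumption here.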

1. `skeleton.py` - Transformer Backbone (No Softmax)

Core transformer validation (2 MHA + FFN blocks, no softmax/bias). Demonstrates paper Table I results: 4 heads: 187 µs latency, 10,042 samples/s.

python examples/skeleton.py

2. `mlp.py` - Simple Multi-Layer Perceptron

Basic dense layers (8→64→8) with static quantization. Good for framework testing and understanding DenseLayer quantization.

python examples/mlp.py

3. `dense_softmax_model.py` - Dense Layer with Integer Softmax

Isolated softmax benchmarking (64→64). Shows softmax bottleneck from paper Table II: ~250× latency increase.

python examples/dense_softmax_model.py

4. `particle_transformer_no_softmax.py` - Full Transformer (No Softmax, Dynamic Quant)

Complete transformer with bias and dynamic quantization, no softmax. Production-oriented variant of skeleton.

python examples/particle_transformer_no_softmax.py

5. `particle_transformer.py` - Full Transformer (With Softmax, Static Quant)

Full transformer with optional softmax and per-head static quantization. Reference implementation.

python examples/particle_transformer.py

Example Comparison Table

| Example | Purpose | Softmax | Quantization | Bias | Complexity | Use Case |
|---|---|---|---|---|---|---|
| `skeleton.py` | Core validation | ✗ | Dynamic | ✗ | Medium | Baseline performance, paper Table I |
| `mlp.py` | Foundational blocks | ✗ | Static | ✓ | Low | Framework testing, quick iteration |
| `dense_softmax_model.py` | Softmax analysis | ✓ | Static | ✓ | Very Low | Bottleneck identification, paper Table II |
| `particle_transformer_no_softmax.py` | Transformer backbone | ✗ | Dynamic | ✓ | High | Production without softmax |
| `particle_transformer.py` | Full transformer | ✓ | Static | ✓ | High | Complete reference implementation |

Performance Results

All experiments conducted on the AMD Versal VCK190 SoC using AIE hardware emulation with randomly initialized weights (as noted in the paper).

Table I: Skeleton Model Latency & Throughput (No Bias, No Softmax)

| Configuration | Latency (ns) | Throughput (MB/s) | Throughput (samples/s) |
|---|---|---|---|
| 4 heads | 187,014.2 | 12.85 | 10,041.8 |
| 1 head | 734,559.2 | 3.55 | 2,775.3 |
| Speedup | 3.93× | 3.62× | 3.62× |

Key Insight: Distributing attention heads across AIE tiles cuts latency by 3.93× and raises throughput by 3.62×, close to the ideal 4× for four heads.
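
The speedup row of Table I can be reproduced directly from the other two rows. The short script below also checks that the two throughput columns agree with each other, assuming one sample is a 160×8 int8 tensor (1,280 bytes) and that MB means 10⁶ bytes. Both assumptions are inferred from the numbers and not stated in the table.

```python
latency_ns = {"1 head": 734_559.2, "4 heads": 187_014.2}
throughput_mb_s = {"1 head": 3.55, "4 heads": 12.85}
samples_per_s = {"1 head": 2_775.3, "4 heads": 10_041.8}

latency_speedup = latency_ns["1 head"] / latency_ns["4 heads"]
throughput_speedup = samples_per_s["4 heads"] / samples_per_s["1 head"]
print(f"{latency_speedup:.2f}x, {throughput_speedup:.2f}x")  # 3.93x, 3.62x

bytes_per_sample = 160 * 8  # int8 elements, one byte each (assumed sample size)
for cfg, mb_s in throughput_mb_s.items():
    implied = mb_s * 1e6 / bytes_per_sample
    # The implied rate lands within about 0.1% of the reported samples/s.
    print(cfg, round(implied), samples_per_s[cfg])
```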

Table II: Impact of Bias and Softmax on Single Dense Layer

| Configuration | Latency (ns) | Throughput (MB/s) |
|---|---|---|
| Dense only | 218.3 | 697.66 |
| Dense + bias | 1,695.8 | 22.41 |
| Dense + softmax (no bias) | 55,199.2 | 4.60 |
| Dense + bias + softmax | 55,392.5 | 4.59 |
| Softmax Overhead | ~250× | ~150× |

Key Insight: Integer-only softmax and associated data movement form the dominant bottleneck, introducing orders-of-magnitude latency increase. This is why skeleton.py and particle_transformer_no_softmax.py omit softmax for real-time inference.
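
To give a sense of why softmax is costly in integer arithmetic, the sketch below shows a generic lookup-table softmax. It is not the softmax described in the paper, whose details are not given here. It is a common substitute technique, and the names `build_exp_lut` and `int_softmax` as well as the `scale` value are hypothetical. Even in this simple form, each row needs a max reduction, a table lookup, a sum reduction, and a division per element.

```python
import numpy as np

def build_exp_lut(scale, frac_bits=15):
    """Offline step (float): exp(-d * scale) in fixed point for d = 0..255."""
    d = np.arange(256)
    return np.round(np.exp(-d * scale) * (1 << frac_bits)).astype(np.int32)

def int_softmax(x_q, lut, out_bits=7):
    """Row-wise softmax on int8 logits; the run-time path uses integer ops only."""
    x = x_q.astype(np.int32)
    d = x.max(axis=1, keepdims=True) - x        # pass 1: row max, d in 0..255
    e = lut[d]                                  # pass 2: table lookup
    s = e.sum(axis=1, keepdims=True)            # pass 3: row sum (always > 0)
    p = ((e << out_bits) + (s >> 1)) // s       # pass 4: rounded division
    return np.minimum(p, 127).astype(np.int8)   # saturate 128 to the int8 maximum

rng = np.random.default_rng(0)
logits = rng.integers(-128, 128, size=(4, 64), dtype=np.int8)
probs = int_softmax(logits, build_exp_lut(scale=0.05))
print(probs.shape, probs.dtype)  # (4, 64) int8
```

Outputs are probabilities scaled by 2⁷, so a value of 64 represents roughly 0.5.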

Quantization Strategy

All models use symmetric per-tensor quantization with int8 weights and activations:

Weight Format: int8 (range: -128 to 127)
Activation Format: int8 (range: -128 to 127)
Accumulator Format: int32 (for multiplication products)
Rescaling: Fixed-point with per-layer scale and shift parameters
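
The scheme above can be illustrated with a small NumPy sketch: symmetric per-tensor quantization to int8, an int32 matrix product, and a fixed-point rescale back to int8. The function names and the `mult`/`shift` parameterization are assumptions made for illustration. The framework's exact rounding and saturation behavior may differ.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Symmetric per-tensor quantization: returns (int8 tensor, float scale)."""
    qmax = 2 ** (num_bits - 1) - 1                   # 127 for int8
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def requantize(acc, mult, shift):
    """Rescale an int32 accumulator to int8 as (acc * mult) >> shift.

    Rounds half up and saturates to [-128, 127]; requires shift >= 1.
    """
    prod = acc.astype(np.int64) * mult               # widen to avoid overflow
    rounded = (prod + (1 << (shift - 1))) >> shift
    return np.clip(rounded, -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
x_q, _ = quantize_symmetric(rng.standard_normal((4, 8)))
w_q, _ = quantize_symmetric(rng.standard_normal((8, 8)))
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)    # int32 accumulator
y_q = requantize(acc, mult=1, shift=8)
print(y_q.dtype, y_q.shape)  # int8 (4, 8)
```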

Dynamic vs. Static Quantization

Dynamic Quantization (dynamic_quant=True):

Automatically calibrates quantization scales from the reference forward pass
Useful for randomly initialized weights or unknown data distributions
Used by skeleton.py and particle_transformer_no_softmax.py
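
One plausible way to calibrate from a reference pass is to pick the smallest right shift that brings the largest observed accumulator magnitude into the int8 range. The sketch below illustrates that idea only. `calibrate_shift` is a hypothetical helper, and the rule the framework actually applies is not documented in this README.

```python
import numpy as np

def calibrate_shift(acc, out_bits=8):
    """Smallest right shift that brings max |acc| within the signed out_bits range."""
    limit = 2 ** (out_bits - 1) - 1                  # 127 for int8
    max_abs = int(np.max(np.abs(acc.astype(np.int64))))
    shift = 0
    while (max_abs >> shift) > limit:
        shift += 1
    return shift

# Accumulator values observed during a (hypothetical) reference forward pass.
acc = np.array([1000, -20000, 512], dtype=np.int32)
print(calibrate_shift(acc))  # 8, since 20000 >> 8 == 78
```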

Static Quantization (dynamic_quant=False):

User-specified scale and shift parameters
Better for production with known input distributions
Enables per-head quantization for improved precision (MHA layers)
Used by mlp.py, dense_softmax_model.py, and particle_transformer.py

Framework Architecture

Layer Types

DenseLayer - Quantized matrix multiplication with optional bias and ReLU
MHALayer - Multi-head attention with per-head quantization and optional softmax
ResAddLayer - Residual connection (element-wise addition)
DenseSoftmaxLayer - Combined dense + integer softmax operation

Code Generation Pipeline

Python Model Definition
    ↓
AIEModel.add_layer() [builds DAG]
    ↓
AIEModel.forward() calls:
    ├→ _compute_golden() [NumPy reference]
    ├→ _generate_code() [C++ kernels & graph]
    ├→ _compile_and_simulate() [Vitis emulation]
    └→ _validate() [compare outputs]
    ↓
Output validation or error report
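
The DAG-building step can be pictured with a toy graph class. `TinyGraph` below is a hypothetical stand-in, not the `AIEModel` implementation. It assumes, as in the API example, that `None` in the input list denotes the model input, and it uses layer names where the real API passes layer objects.

```python
class TinyGraph:
    """Hypothetical stand-in for the DAG that add_layer() builds."""

    def __init__(self):
        self.order = []    # layer names in insertion order
        self.inputs = {}   # layer name -> producer names (None = model input)

    def add_layer(self, name, inputs):
        if name in self.inputs:
            raise ValueError(f"duplicate layer: {name}")
        for src in inputs:
            if src is not None and src not in self.inputs:
                raise ValueError(f"{name} depends on unknown layer {src}")
        self.inputs[name] = list(inputs)
        self.order.append(name)

    def edges(self):
        """List (producer, consumer) pairs, skipping the model input."""
        return [(src, dst) for dst in self.order
                for src in self.inputs[dst] if src is not None]

g = TinyGraph()
g.add_layer("dense_0", [None])
g.add_layer("mha_1", ["dense_0"])
g.add_layer("resadd_2", ["dense_0", "mha_1"])
print(g.edges())
# [('dense_0', 'mha_1'), ('dense_0', 'resadd_2'), ('mha_1', 'resadd_2')]
```

Because every layer must name producers that already exist, insertion order is already a valid topological order, so later stages can process layers in the order they were added.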

Directory Structure

aie/
├─ kernels.h          # AIE kernel library (C++)
examples/
├─ skeleton.py                      # Core transformer validation
├─ mlp.py                          # Simple MLP test
├─ dense_softmax_model.py          # Softmax benchmarking
├─ particle_transformer.py          # Full transformer (with softmax)
└─ particle_transformer_no_softmax.py  # Full transformer (no softmax)
layers/
├─ __init__.py
├─ base.py            # AIELayer abstract base class
├─ dense.py           # DenseLayer implementation
├─ dense_softmax.py   # DenseSoftmaxLayer implementation
├─ mha.py             # MHALayer (multi-head attention)
└─ resadd.py          # ResAddLayer (residual addition)
utils/
├─ integer_modules.py # Integer arithmetic utilities
├─ np_mha_linear.py   # NumPy attention reference
└─ tiling.py          # Tensor tiling utilities
model.py              # AIEModel framework

Future Work

Optimized Softmax Implementation: Reduce integer softmax latency from 55+ µs to acceptable levels for real-time inference
Extended Layer Support: Add integer-only LayerNorm, pooling, and other common transformer components
Advanced Quantization: Move beyond symmetric per-tensor scaling to per-channel or mixed-precision schemes
Trained Model Integration: Evaluate with actual trained jet tagging models and realistic calibration data (currently using random weights for validation)
Enhanced Tiling Strategies: Explore alternative tile mappings and load-balancing schemes for non-uniform workloads

Building and Running

# Run the skeleton model (baseline)
python examples/skeleton.py

# Run the simple MLP test
python examples/mlp.py

# Run softmax benchmarking
python examples/dense_softmax_model.py

# Run full transformer (no softmax)
python examples/particle_transformer_no_softmax.py

# Run full transformer (with softmax)
python examples/particle_transformer.py

Citation

If you use this framework in your research, please cite:

@article{koski2024transformer,
  title={Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines},
  author={Koski, Gram and Lipps, Sean and Ma, Zhenghua and Abarajithan, G and Kastner, Ryan},
  journal={FCCM},
  year={2024}
}

License

Open-source software released for research and development.

Contact

For questions or contributions, contact the authors at UC San Diego Department of Computer Science and Engineering.
