A reusable software framework for mapping quantized transformer models onto the AI Engine (AIE) array of the AMD Versal VCK190. This project demonstrates an integer-only transformer implementation for jet tagging at particle accelerators, featuring a code-generation framework that automatically converts high-level Python model descriptions into optimized Vitis AIE graphs.
Paper: "Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines" - FCCM Journal

Authors: Gram Koski, Sean Lipps, Zhenghua Ma, G Abarajithan, Ryan Kastner (UC San Diego)
This framework addresses the challenge of deploying transformer-based neural networks in real-time, resource-constrained edge inference systems. It is motivated by low-level triggers for jet tagging at particle accelerators, where the ideal constraints are:
- Input Rate: 40 MHz collision rate
- Latency Budget: A few microseconds for Level-1 Trigger decisions
- Throughput Requirement: O(10⁵) events per second
- Constraints: Tight on-detector power and resource budgets
Traditional CPU/GPU platforms cannot meet these requirements. This project uses the AI Engine array of the AMD Versal VCK190 SoC for low-latency, high-throughput ML inference with quantized integer-only arithmetic. It is a preliminary, unoptimized implementation that demonstrates the framework's feasibility and establishes a foundation for future optimization. Performance is currently limited by bottlenecks such as the integer-only softmax (55+ µs latency), which prevent real-time operation at production scale.
- Modular Code-Generation Framework: Automatically generates Vitis AIE graphs from high-level Python model descriptions with composable building blocks (Dense, MHA, ResAdd, DenseSoftmax layers)
- Integer-Only Transformer: Fully quantized implementation using int8 weights/activations and int32 accumulators with fixed-point rescaling, including a novel integer-only softmax
- Head-Parallel MHA Mapping: Assigns attention heads to parallel AIE tiles, reducing latency by ~3.9× (1 head: 734.6 µs vs. 4 heads: 187.0 µs) and raising throughput by ~3.6×
- Validation Framework: Every model automatically runs both AIE emulation and a NumPy golden reference for numerical correctness verification
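Head-parallel mapping is possible because each attention head reads only its own slice of Q, K, and V. The sketch below illustrates that independence in floating-point NumPy; it is not the framework's integer kernel, and the function names and the 160×64 shape are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_one_head(q, k, v):
    # q, k, v: (seq_len, d_head) slices belonging to a single head
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def mha_head_by_head(Q, K, V, num_heads):
    # Each head touches only its own column slice, so the per-head calls
    # are independent and could run concurrently (e.g. one per AIE tile).
    qs, ks, vs = (np.split(t, num_heads, axis=1) for t in (Q, K, V))
    heads = [attention_one_head(q, k, v) for q, k, v in zip(qs, ks, vs)]
    return np.concatenate(heads, axis=1)

def mha_batched(Q, K, V, num_heads):
    # Reference: all heads computed in one batched expression
    seq_len, d_model = Q.shape
    d_head = d_model // num_heads

    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(Q), split(K), split(V)
    out = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head)) @ v
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((160, 64)) for _ in range(3))
print(np.allclose(mha_head_by_head(Q, K, V, 4), mha_batched(Q, K, V, 4)))
```

Because the two functions agree, the per-head loop can be distributed across tiles without changing the result; only the final concatenation needs all heads.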
- Vitis 2024.1 or later (with appropriate licenses)
- Python 3.8+
- NumPy and the other dependencies listed in `environment.yml`
```bash
# Create conda environment
conda env create -f environment.yml
conda activate particle-transformer
# Verify installation
python -c "import numpy; print('NumPy OK')"
```

The framework exposes a clean Python API for building transformer models without manually writing C++ kernels or Vitis graph code:
```python
from model import AIEModel
from layers import DenseLayer, MHALayer, ResAddLayer

# W, b, Wq, Wk, Wv, Wo, and input_data are assumed to be defined beforehand.

# Create model with AIE grid parameters (m, k, n)
model = AIEModel(m=4, k=8, n=8, iterations=1, dynamic_quant=True)

# Add layers
dense = DenseLayer(name='dense_0', weight=W, bias=b, relu=True)
model.add_layer(dense, inputs=[None])
mha = MHALayer(name='mha_1', Wq=Wq, Wk=Wk, Wv=Wv, Wo=Wo, num_heads=4, ...)
model.add_layer(mha, inputs=[dense])

# Forward pass: generates code, compiles, emulates, validates
output = model.forward(input_data)
```

- Model Definition → High-level Python layer specification
- Golden Reference → NumPy implementation for validation
- Code Generation → C++ kernels and Vitis graph instantiation
- Compilation → Vitis 2024.1 compilation
- AIE Emulation → Cycle-accurate simulation
- Numerical Validation → AIE output vs. NumPy reference comparison
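The last step, comparing AIE output against the NumPy reference, can be sketched as below. This is only an illustration: the function name `compare_outputs`, the tolerance parameter, and the report fields are hypothetical, and the framework's own comparison may use different criteria.

```python
import numpy as np

def compare_outputs(aie_out, golden, atol=0):
    """Compare two integer tensors element-wise and summarize the mismatch."""
    aie_out = np.asarray(aie_out)
    golden = np.asarray(golden)
    if aie_out.shape != golden.shape:
        raise ValueError(f"shape mismatch: {aie_out.shape} vs {golden.shape}")
    # Widen before subtracting so int8 differences cannot wrap around
    diff = np.abs(aie_out.astype(np.int64) - golden.astype(np.int64))
    mismatches = int(np.count_nonzero(diff > atol))
    return {
        "max_abs_diff": int(diff.max()) if diff.size else 0,
        "mismatches": mismatches,
        "passed": mismatches == 0,
    }

golden = np.array([[1, -2], [127, -128]], dtype=np.int8)
print(compare_outputs(golden.copy(), golden))
```

With `atol=0` the check demands bit-exact agreement, which is a natural default for integer-only arithmetic; a small nonzero tolerance would permit differences in rounding mode.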
Each example demonstrates different aspects of the framework. All use quantized int8 arithmetic on a 160×8 input tensor (padded from 150×3 jets with 3 features).
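Padding a 150×3 input up to the 160×8 tensor shape can be sketched as follows. The helper name `pad_input` is hypothetical, and zero-filling at the bottom and right is an assumption; the examples may place or fill the padding differently.

```python
import numpy as np

def pad_input(x, rows=160, cols=8):
    """Zero-pad a 2-D tensor to (rows, cols), keeping data in the top-left corner."""
    if x.shape[0] > rows or x.shape[1] > cols:
        raise ValueError(f"input {x.shape} does not fit in ({rows}, {cols})")
    out = np.zeros((rows, cols), dtype=x.dtype)
    out[: x.shape[0], : x.shape[1]] = x
    return out

raw = np.ones((150, 3), dtype=np.int8)
print(pad_input(raw).shape)
```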
`skeleton.py`: Core transformer validation (2 MHA + FFN blocks, no softmax/bias). Demonstrates the paper's Table I results: with 4 heads, 187 µs latency and 10,042 samples/s.

```bash
python examples/skeleton.py
```

`mlp.py`: Basic dense layers (8→64→8) with static quantization. Good for framework testing and understanding DenseLayer quantization.

```bash
python examples/mlp.py
```

`dense_softmax_model.py`: Isolated softmax benchmarking (64→64). Shows the softmax bottleneck from the paper's Table II: ~250× latency increase.

```bash
python examples/dense_softmax_model.py
```

`particle_transformer_no_softmax.py`: Complete transformer with bias and dynamic quantization, no softmax. Production-oriented variant of skeleton.

```bash
python examples/particle_transformer_no_softmax.py
```

`particle_transformer.py`: Full transformer with optional softmax and per-head static quantization. Reference implementation.

```bash
python examples/particle_transformer.py
```

| Example | Purpose | Softmax | Quantization | Bias | Complexity | Use Case |
|---|---|---|---|---|---|---|
| `skeleton.py` | Core validation | ✗ | Dynamic | ✗ | Medium | Baseline performance, paper Table I |
| `mlp.py` | Foundational blocks | ✗ | Static | ✓ | Low | Framework testing, quick iteration |
| `dense_softmax_model.py` | Softmax analysis | ✓ | Static | ✓ | Very Low | Bottleneck identification, paper Table II |
| `particle_transformer_no_softmax.py` | Transformer backbone | ✗ | Dynamic | ✓ | High | Production without softmax |
| `particle_transformer.py` | Full transformer | ✓ | Static | ✓ | High | Complete reference implementation |
All experiments were conducted on the AMD Versal VCK190 SoC using AIE hardware emulation with randomly initialized weights (as noted in the paper).
| Configuration | Latency (ns) | Throughput (MB/s) | Throughput (samples/s) |
|---|---|---|---|
| 4 heads | 187,014.2 | 12.85 | 10,041.8 |
| 1 head | 734,559.2 | 3.55 | 2,775.3 |
| Speedup | 3.93× | 3.62× | 3.62× |
Key Insight: Distributing attention heads across AIE tiles reduces latency by ~3.9× and raises throughput by ~3.6× when going from 1 head to 4 heads.
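The speedup row can be rechecked directly from the two measured rows. The snippet below only redoes that arithmetic with the numbers from the table; the variable names are illustrative.

```python
# Values copied from the 1-head and 4-head rows of the table
latency_ns = {"1 head": 734_559.2, "4 heads": 187_014.2}
samples_per_s = {"1 head": 2_775.3, "4 heads": 10_041.8}

# Latency improves when it goes down, throughput when it goes up
latency_speedup = latency_ns["1 head"] / latency_ns["4 heads"]
throughput_speedup = samples_per_s["4 heads"] / samples_per_s["1 head"]

print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"throughput speedup: {throughput_speedup:.2f}x")
```

The latency ratio comes out slightly higher than the throughput ratio, which is why the two columns of the speedup row differ.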
| Configuration | Latency (ns) | Throughput (MB/s) |
|---|---|---|
| Dense only | 218.3 | 697.66 |
| Dense + bias | 1,695.8 | 22.41 |
| Dense + softmax (no bias) | 55,199.2 | 4.60 |
| Dense + bias + softmax | 55,392.5 | 4.59 |
| Softmax Overhead | ~250× | ~150× |
Key Insight: Integer-only softmax and associated data movement form the dominant bottleneck, introducing orders-of-magnitude latency increase. This is why skeleton.py and particle_transformer_no_softmax.py omit softmax for real-time inference.
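To make concrete what an integer-only softmax involves, here is a generic shift-based approximation in NumPy. It is not the paper's softmax, whose algorithm is not described in this README; the function name `int_softmax`, the fixed-point formats, and the linear approximation of the fractional power of two are all assumptions made for illustration.

```python
import numpy as np

F = 8                                              # fractional bits of the exponent
LOG2E_FX = round(1.4426950408889634 * (1 << F))    # log2(e) in fixed point

def int_softmax(q, in_frac_bits=4, out_max=127):
    """Approximate row-wise softmax using only integer operations.

    q: int8 array whose real-valued logits are q / 2**in_frac_bits.
    Returns int8 values in [0, out_max] representing probabilities * out_max.
    """
    q = q.astype(np.int64)
    d = q.max(axis=-1, keepdims=True) - q          # distance to row max, >= 0
    # exp(-x) = 2**(-x * log2(e)); t is that exponent with F fractional bits
    t = (d * LOG2E_FX) >> in_frac_bits
    n = np.minimum(t >> F, 62)                     # integer part, used as a shift
    f = t & ((1 << F) - 1)                         # fractional part
    # 2**(-f / 2**F) is approximated linearly as 1 - f / 2**(F + 1)
    mant = (1 << (F + 1)) - f
    p = mant >> n                                  # unnormalized weights
    s = p.sum(axis=-1, keepdims=True)              # >= 2**(F + 1): row max has d = 0
    return ((p * out_max + s // 2) // s).astype(np.int8)

logits = np.array([[0, 16, 32, -16]], dtype=np.int8)   # real values 0, 1, 2, -1
print(int_softmax(logits))
```

Even this simple variant needs a row-wise maximum, a row-wise sum, and a per-element division, each of which requires a reduction or a second pass over the row. That gives some intuition for why softmax costs far more than a plain dense layer.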
All models use symmetric per-tensor quantization with int8 weights and activations:
- Weight Format: int8 (range: -128 to 127)
- Activation Format: int8 (range: -128 to 127)
- Accumulator Format: int32 (for multiplication products)
- Rescaling: Fixed-point with per-layer scale and shift parameters
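The arithmetic described above (int8 operands, int32 accumulation, fixed-point rescaling with a scale and a shift) can be sketched in NumPy as follows. The function name `quantized_dense`, the round-half-up rounding, and the order of ReLU and saturation are assumptions; the framework's kernels may differ in these details.

```python
import numpy as np

def quantized_dense(x, w, bias=None, scale=1, shift=0, relu=False):
    """Integer-only dense layer: int8 x int8 -> int32 accumulate -> int8 output.

    x: int8 (M, K), w: int8 (K, N), bias: optional int32 (N,).
    The accumulator is rescaled as round(acc * scale / 2**shift).
    """
    acc = x.astype(np.int32) @ w.astype(np.int32)      # int32 accumulators
    if bias is not None:
        acc = acc + bias.astype(np.int32)
    prod = acc.astype(np.int64) * scale                # widen so the product fits
    if shift > 0:
        prod = (prod + (1 << (shift - 1))) >> shift    # round half up, then shift
    if relu:
        prod = np.maximum(prod, 0)
    return np.clip(prod, -128, 127).astype(np.int8)    # saturate to int8

x = np.array([[10, -20]], dtype=np.int8)
w = np.array([[3], [4]], dtype=np.int8)
print(quantized_dense(x, w, scale=3, shift=2))
```

Expressing the real-valued multiplier as `scale / 2**shift` keeps the whole layer free of floating-point operations: one integer multiply and one arithmetic shift per output element.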
Dynamic Quantization (dynamic_quant=True):
- Automatically calibrates quantization scales from the reference forward pass
- Useful for randomly initialized weights or unknown data distributions
- Used by `skeleton.py` and `particle_transformer_no_softmax.py`
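One plausible way to calibrate a scale and shift from a reference forward pass is max-abs calibration: map the largest accumulator magnitude observed in the golden run to 127. The sketch below assumes that strategy; the function names and the 15-bit limit on `scale` are hypothetical, and the framework's actual calibration rule is not stated here.

```python
import numpy as np

def calibrate_scale_shift(acc, out_max=127, scale_bits=15):
    """Pick integers (scale, shift) so that scale / 2**shift ~= out_max / max|acc|."""
    max_abs = int(np.abs(acc.astype(np.int64)).max())
    if max_abs == 0:
        return 1, 0
    target = out_max / max_abs
    shift = 0
    # Use the largest shift whose scale still fits in scale_bits bits
    while shift < 31 and round(target * (1 << (shift + 1))) < (1 << scale_bits):
        shift += 1
    return max(1, round(target * (1 << shift))), shift

def requantize(acc, scale, shift):
    """Apply round(acc * scale / 2**shift) and saturate to int8."""
    prod = acc.astype(np.int64) * scale
    if shift > 0:
        prod = (prod + (1 << (shift - 1))) >> shift
    return np.clip(prod, -128, 127).astype(np.int8)

# Accumulators as they might come out of a golden (reference) forward pass
acc = np.array([[32258, -1000], [5, 0]], dtype=np.int32)
scale, shift = calibrate_scale_shift(acc)
print(scale, shift, requantize(acc, scale, shift))
```

Calibrating on the observed range avoids hand-tuning parameters for randomly initialized weights, at the cost of tying the parameters to the particular input used for the reference pass.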
Static Quantization (dynamic_quant=False):
- User-specified scale and shift parameters
- Better for production with known input distributions
- Enables per-head quantization for improved precision (MHA layers)
- Used by `mlp.py`, `dense_softmax_model.py`, and `particle_transformer.py`
- DenseLayer - Quantized matrix multiplication with optional bias and ReLU
- MHALayer - Multi-head attention with per-head quantization and optional softmax
- ResAddLayer - Residual connection (element-wise addition)
- DenseSoftmaxLayer - Combined dense + integer softmax operation
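As an illustration of the simplest of these blocks, a residual addition on int8 tensors can be sketched as below. The function name is hypothetical, and the sketch assumes that both inputs share the same quantization scale and that the sum saturates rather than wraps; ResAddLayer may handle either point differently.

```python
import numpy as np

def res_add_int8(a, b):
    """Element-wise int8 addition with saturation."""
    # Widen to int16 so the sum cannot overflow before clipping
    s = a.astype(np.int16) + b.astype(np.int16)
    return np.clip(s, -128, 127).astype(np.int8)

a = np.array([100, -100, 5], dtype=np.int8)
b = np.array([100, -100, -3], dtype=np.int8)
print(res_add_int8(a, b))
```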
```text
Python Model Definition
        ↓
AIEModel.add_layer()  [builds DAG]
        ↓
AIEModel.forward() calls:
  ├→ _compute_golden()        [NumPy reference]
  ├→ _generate_code()         [C++ kernels & graph]
  ├→ _compile_and_simulate()  [Vitis emulation]
  └→ _validate()              [compare outputs]
        ↓
Output validation or error report
```
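The DAG-building and golden-reference parts of this flow can be imitated with a toy model. Everything below (`ToyLayer`, `ToyModel`, and the convention that `None` in `inputs` stands for the model input) is a hypothetical stand-in rather than the AIEModel implementation, and code generation, compilation, and validation are left out.

```python
import numpy as np

class ToyLayer:
    """Hypothetical stand-in for a layer: a name plus a NumPy reference function."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

class ToyModel:
    def __init__(self):
        self.nodes = []  # (layer, input_layers) pairs in insertion order

    def add_layer(self, layer, inputs):
        # Inputs must already be in the graph, so insertion order is topological
        known = [l for l, _ in self.nodes]
        for src in inputs:
            if src is not None and not any(src is l for l in known):
                raise ValueError(f"{layer.name}: input layer was never added")
        self.nodes.append((layer, list(inputs)))

    def forward(self, x):
        # Golden pass only: evaluate each node once, in insertion order
        results = {}
        for layer, inputs in self.nodes:
            args = [x if src is None else results[src.name] for src in inputs]
            results[layer.name] = layer.fn(*args)
        return results[self.nodes[-1][0].name]

model = ToyModel()
double = ToyLayer("double", lambda a: 2 * a)
resadd = ToyLayer("resadd", lambda a, b: a + b)
model.add_layer(double, inputs=[None])
model.add_layer(resadd, inputs=[double, None])
print(model.forward(np.array([1, 2, 3])))
```

Requiring every input to be registered before its consumer means the graph can be evaluated, or emitted as generated code, in a single pass without an explicit topological sort.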
```text
aie/
├─ kernels.h                            # AIE kernel library (C++)
examples/
├─ skeleton.py                          # Core transformer validation
├─ mlp.py                               # Simple MLP test
├─ dense_softmax_model.py               # Softmax benchmarking
├─ particle_transformer.py              # Full transformer (with softmax)
└─ particle_transformer_no_softmax.py   # Full transformer (no softmax)
layers/
├─ __init__.py
├─ base.py                              # AIELayer abstract base class
├─ dense.py                             # DenseLayer implementation
├─ dense_softmax.py                     # DenseSoftmaxLayer implementation
├─ mha.py                               # MHALayer (multi-head attention)
└─ resadd.py                            # ResAddLayer (residual addition)
utils/
├─ integer_modules.py                   # Integer arithmetic utilities
├─ np_mha_linear.py                     # NumPy attention reference
└─ tiling.py                            # Tensor tiling utilities
model.py                                # AIEModel framework
```
- Optimized Softmax Implementation: Reduce integer softmax latency from 55+ µs to acceptable levels for real-time inference
- Extended Layer Support: Add integer-only LayerNorm, pooling, and other common transformer components
- Advanced Quantization: Move beyond symmetric per-tensor scaling to per-channel or mixed-precision schemes
- Trained Model Integration: Evaluate with actual trained jet tagging models and realistic calibration data (currently using random weights for validation)
- Enhanced Tiling Strategies: Explore alternative tile mappings and load-balancing schemes for non-uniform workloads
```bash
# Run the skeleton model (baseline)
python examples/skeleton.py
# Run the simple MLP test
python examples/mlp.py
# Run softmax benchmarking
python examples/dense_softmax_model.py
# Run full transformer (no softmax)
python examples/particle_transformer_no_softmax.py
# Run full transformer (with softmax)
python examples/particle_transformer.py
```

If you use this framework in your research, please cite:
```bibtex
@article{koski2024transformer,
  title={Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines},
  author={Koski, Gram and Lipps, Sean and Ma, Zhenghua and Abarajithan, G and Kastner, Ryan},
  journal={FCCM},
  year={2024}
}
```

Open-source software released for research and development.
For questions or contributions, contact the authors at UC San Diego Department of Computer Science and Engineering.