| name | generate_report |
|---|---|
| description | Generate markdown report from profiling results. Identifies what was accelerated by SME2, identifies bottlenecks (operator-level and category-level), analyzes portable vs delegated operators, and provides kernel-level insights. Use when creating final reports, documenting profiling results, identifying optimization opportunities, or sharing performance analysis with stakeholders. |
Purpose: Generate markdown report from profiling results
When to use:
- After profiling completes (CSV files already generated by pipeline)
- When creating final documentation of profiling results
- When identifying optimization opportunities and bottlenecks
- When sharing performance analysis with stakeholders
- When analyzing what was accelerated and what needs optimization
Generates a markdown report from CSV files (derived from ETDump). The report includes:
- Accurate E2E latency extraction - Uses `Method::execute` events per run (avoids double-counting)
- Operator category breakdown - Shows where time is spent (CONV, GEMM, Data Movement, etc.)
- Bottleneck identification - Identifies operators consuming significant % of E2E time
- Acceleration analysis - Shows what was accelerated by SME2 and by how much
- Portable vs delegated analysis - Identifies portable operators that should be delegated
- Kernel-level insights - Shows which kernels were used (SME2 vs standard)
- Optimization recommendations - Provides specific optimization suggestions
Key Insight: After SME2 accelerates CONV/GEMM operations, bottlenecks often shift to Data Movement (transpose, layout changes, memory copies) and portable operators (non-delegated operators running in ExecuTorch's portable runtime). The report makes these shifts visible.
Data Sources:
- Latency: From `*_all_runs_timeline.csv` - extracts `Method::execute` events per run (accurate E2E latency, avoids double-counting)
- Operator breakdown: From `*_ops_stats.csv` (aggregated operator statistics with backend attribution)
- Kernel insights: From `*_kernels.csv` or `*_xnntrace.log` (kernel selection evidence)
- Metadata: From `manifest.json` and config files (runner paths, device info, ExecuTorch SHA)
CRITICAL - ETDump Event Structure: ETDump events are hierarchically nested. Method::execute is the outermost container representing E2E latency. Inner events (DELEGATE_CALL, OPERATOR_CALL, actual operators) are nested within it. Summing all rows counts nested events multiple times, inflating latency ~3x. Always extract latency using only Method::execute events per run. See "Understanding ETDump Event Data Structure" section below for detailed explanation.
Prerequisites:
- `.venv/` activated
- CSV files exist (generated automatically by pipeline, or manually via `analyze_results.py`)
- `manifest.json` exists in run directory
- For kernel-level insights: trace-enabled runs with `*_kernels.csv` or `*_xnntrace.log` files
```bash
source .venv/bin/activate
python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac
```

Default output: `<run-dir>/report.md`
With custom output path:
```bash
python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac \
  --out model_profiling/out_<model>/runs/mac/custom_report.md \
  --title "My Model Performance Analysis"
```

Note: The pipeline automatically generates CSV files. No PTE file is required - the script reads from the CSV files and `manifest.json`.
CRITICAL - Latency Extraction: The report script correctly extracts E2E latency using Method::execute events per run. This is essential because ETDump events are nested (see "Understanding ETDump Event Data Structure" section below). Summing all rows would inflate latency ~3x. Always verify latency values match *_pipeline_summary.json or metrics.json.
The base report provides category-level breakdown. For operator-specific bottlenecks, analyze:
```bash
# Analyze operator-specific bottlenecks (identifies top operators by E2E weight)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose
```

This generates:
- Operator-specific latency reports - Top operators by total time and % of E2E
- Portable vs delegated breakdown - Identifies portable operators consuming significant time
- Bottleneck recommendations - Operators that should be delegated or optimized
Key Output: Identifies operators that:
- Consume >5% of E2E time (potential bottlenecks)
- Are portable (not delegated) and should be considered for delegation
- Show significant time even after SME2 acceleration
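As a sketch of this threshold check, the following reads an aggregated stats CSV and flags operators above 5% of E2E time. The column names (`op_name`, `total_ms`, `backend`) are assumptions for illustration; adjust them to the actual `*_ops_stats.csv` schema.

```python
import csv
from collections import defaultdict

def find_bottlenecks(ops_stats_csv, e2e_ms, threshold=0.05):
    """Flag operators whose total time exceeds `threshold` of E2E latency.

    Column names (op_name, total_ms, backend) are assumed; adapt to the
    real *_ops_stats.csv schema.
    """
    totals = defaultdict(float)
    backends = {}
    with open(ops_stats_csv, newline="") as f:
        for row in csv.DictReader(f):
            name = row["op_name"]
            totals[name] += float(row["total_ms"] or 0)
            backends[name] = row.get("backend", "")
    # Sort by total time, descending; keep operators above the threshold.
    return [
        (name, ms, ms / e2e_ms, backends[name])
        for name, ms in sorted(totals.items(), key=lambda kv: -kv[1])
        if ms / e2e_ms > threshold
    ]
```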
For kernel-level insights (which kernels were selected), use trace-enabled runs:
```bash
# Extract kernels from xnntrace logs
python3 model_profiling/tools/xnntrace_to_kernels.py \
  --xnntrace-log model_profiling/out_<model>/runs/mac/<experiment>/*_xnntrace.log \
  --out model_profiling/out_<model>/runs/mac/<experiment>/kernels.csv

# Generate kernel comparison view (SME2-on vs SME2-off)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/mac/mac_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/mac/mac_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/mac/kernel_view.md
```

This shows:
- Which kernels were selected (SME2-accelerated vs standard)
- Kernel usage patterns (which operations benefit from SME2)
- Evidence of SME2 acceleration (kernels with `__neonsme2` or `sme2` in the name)
Note: Trace-enabled runs impact timing (logging overhead), so use separate runs for kernel analysis vs latency measurement.
For more robust statistical analysis (outlier detection, percentiles):
```bash
python3 model_profiling/tools/robust_latency_analysis.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --name "SME2-On" \
  --verbose

# Compare SME2-on vs SME2-off
python3 model_profiling/tools/robust_latency_analysis.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/mac_sme2_off/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/mac/mac_sme2_on/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose
```

Verification commands:

```bash
# Check report exists and is non-empty
test -f model_profiling/out_<model>/runs/mac/report.md && echo "✓ Report exists"
test -s model_profiling/out_<model>/runs/mac/report.md && echo "✓ Report is non-empty"

# Check key sections
grep -q "Latency Comparison" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains latency comparison"
grep -q "Operator Category Breakdown" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains category breakdown"
grep -q "Config File" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains config path"

# Check for operator analysis (if operator analysis was run)
test -f model_profiling/out_<model>/runs/mac/operator_analysis.json && echo "✓ Operator-specific analysis available"
test -f model_profiling/out_<model>/runs/mac/kernel_view.md && echo "✓ Kernel view available"
```

The report includes:
- Model path, config file, runner paths, device info, ExecuTorch SHA
- Analysis timestamp and CSV files analyzed
- Statistical table: Median, mean, min, max, std dev
- Speedup calculation: Shows acceleration factor
- Method: Uses `Method::execute` events per run (accurate E2E latency)
Key Insight: Compare median latencies to see overall speedup. High std dev indicates variability (thermal throttling, system load).
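For reference, the median-over-median speedup can be computed in a few lines (the per-run latency values below are hypothetical, not measured):

```python
import statistics

def speedup(baseline_ms, accelerated_ms):
    """Median-over-median speedup, e.g. SME2-off vs SME2-on per-run latencies."""
    return statistics.median(baseline_ms) / statistics.median(accelerated_ms)

# Hypothetical per-run latencies (ms)
print(speedup([100.0, 102.0, 98.0], [20.0, 21.0, 19.0]))  # 5.0
```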
- Time spent per category: GEMM, Convolution, Data Movement, Elementwise, Other
- SME2-on vs SME2-off comparison: Shows which categories were accelerated
- Percentage breakdown: Shows where time is spent after acceleration
Key Insights:
- GEMM/Convolution shrinking: Indicates SME2 acceleration is working
- Data Movement growing as %: Expected after SME2 - reveals next bottleneck
- Other category high: May indicate portable operators that should be delegated
- Category-level speedup: Which categories saw the biggest speedup
- Operator-level speedup: Top accelerated operators (if operator analysis available)
Key Insight: GEMM and Convolution should show 3-15x speedup with SME2. If not, check:
- Device supports SME2 (Armv9, Apple M4+)
- Runners built with SME2 enabled
- Model operators are delegated to XNNPACK
- Data Movement dominant: After SME2, data movement often becomes the bottleneck
- Action: Focus on transpose elimination, layout optimization, memory access patterns
- Other category high: May indicate portable operators
- Action: Identify portable operators and consider delegation
- Top operators by E2E weight: Operators consuming >5% of E2E time
- Portable operators consuming time: Non-delegated operators that should be delegated
- Operators not accelerated: Operators that should benefit from SME2 but don't
Key Insight: After SME2 accelerates compute, focus on:
- Data movement operators (transpose, reshape, copy) - optimize layout, reduce copies
- Portable operators (not delegated) - consider delegation to XNNPACK or other backends
- Operators with low speedup - investigate why SME2 isn't helping
Critical for optimization: Identifies operators running in ExecuTorch's portable runtime vs delegated to XNNPACK.
From operator analysis:
- Portable operators: Running in ExecuTorch portable runtime (slower, not optimized)
- Delegated operators: Running in XNNPACK backend (faster, optimized)
Key Insight: If portable operators consume significant time (>5% of E2E), consider:
- Delegation: Check if operator can be delegated to XNNPACK
- Model architecture changes: Refactor to use delegated operators
- Custom backend: Implement custom backend for specific operators
Example: If aten::add (portable) consumes 10% of E2E time, consider:
- Replacing with XNNPACK-delegated equivalent
- Fusing with other operations
- Using different model architecture
From kernel view analysis:
- Kernel selection: Which kernels were used (SME2-accelerated vs standard)
- SME2 evidence: Kernels with `__neonsme2` or `sme2` in the name
- Kernel usage patterns: Which operations benefit from SME2
Key Insight: Verify SME2 is actually being used by checking kernel names. If no SME2 kernels appear:
- Check device supports SME2
- Verify runners built with SME2 enabled
- Check operator configurations (dtype, quantization) match SME2 requirements
The report should conclude with specific, actionable recommendations:
- What to optimize next: Based on bottleneck analysis
- Model architecture changes: Operators to delegate or refactor
- Configuration changes: Quantization, dtype, backend partitioning
- Further analysis needed: If data is incomplete or unclear
Mac workflow:

```bash
# 1. Pipeline automatically generates CSV files
python3 model_profiling/scripts/mac_pipeline.py \
  --config model_profiling/configs/my_run.json

# 2. Generate base report
python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac

# 3. Generate operator-specific bottleneck analysis (CRITICAL)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/mac_sme2_on/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/mac/mac_sme2_off/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose

# 4. Generate kernel view (if trace-enabled runs available)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/mac/mac_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/mac/mac_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/mac/kernel_view.md
```

Android workflow:

```bash
# 1. Pipeline automatically generates CSV files
python3 model_profiling/scripts/android_pipeline.py \
  --config model_profiling/configs/android_config.json

# 2. Generate base report
python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/android

# 3. Generate operator-specific bottleneck analysis (CRITICAL)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/android/android_sme2_on/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/android/android_sme2_off/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/android/ \
  --verbose

# 4. Generate kernel view (if trace-enabled runs available)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/android/android_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/android/android_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/android/kernel_view.md
```

Combined report:

```bash
python3 model_profiling/scripts/generate_combined_report.py \
  --mac-dir model_profiling/out_<model>/runs/mac \
  --android-dir model_profiling/out_<model>/runs/android \
  --out model_profiling/out_<model>/runs/combined_report.md
```

CRITICAL: Understanding ETDump's nested event structure is essential for accurate latency extraction. Incorrect extraction leads to ~3x inflated latency values.
ETDump events are hierarchically nested. The structure follows this pattern:
```
Method::execute (E2E latency - outermost container)
├── Program::load_method (model loading overhead)
├── Method::init (initialization overhead)
└── DELEGATE_CALL (delegate invocation overhead)
    └── OPERATOR_CALL (operator invocation overhead)
        └── Actual operator execution (e.g., "aten::conv2d", "xnnpack::linear")
            └── Kernel execution (inside the XNNPACK backend)
```
Key Event Types:
- `Method::execute`: Outermost container - represents the entire model inference call. This is the correct E2E latency.
- `DELEGATE_CALL`: Framework overhead for invoking a delegate (e.g., XNNPACK)
- `OPERATOR_CALL`: Framework overhead for invoking an operator
- Actual operators: Real computation (e.g., `aten::conv2d`, `xnnpack::linear`, `aten::add`)
- Framework overhead: `Program::load_method`, `Method::init` (one-time setup, not per-run)
Common Mistake: Summing all duration_ms values in the timeline CSV.
Why this fails: ETDump events are nested. Each inner event's duration is already included in its parent's duration.
Example:
```
Method::execute: 100ms
└── DELEGATE_CALL: 95ms (within Method::execute)
    └── OPERATOR_CALL: 90ms (within DELEGATE_CALL)
        └── aten::conv2d: 85ms (within OPERATOR_CALL)
```
If you sum all rows:
- `Method::execute`: 100ms
- `DELEGATE_CALL`: 95ms (already counted in Method::execute)
- `OPERATOR_CALL`: 90ms (already counted in DELEGATE_CALL)
- `aten::conv2d`: 85ms (already counted in OPERATOR_CALL)
- Total (WRONG): 370ms (3.7x inflated!)
Correct extraction: Use only Method::execute duration = 100ms
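The arithmetic can be reproduced with a toy timeline (durations are the hypothetical values from the example above):

```python
# Toy timeline rows mimicking the nesting above (durations in ms).
rows = [
    {"name": "Method::execute", "duration_ms": 100.0},
    {"name": "DELEGATE_CALL",   "duration_ms": 95.0},
    {"name": "OPERATOR_CALL",   "duration_ms": 90.0},
    {"name": "aten::conv2d",    "duration_ms": 85.0},
]

# WRONG: nested events are counted on top of their parents.
wrong = sum(r["duration_ms"] for r in rows)
# RIGHT: only the outermost container represents E2E latency.
right = sum(r["duration_ms"] for r in rows if r["name"] == "Method::execute")

print(wrong, right)  # 370.0 100.0
```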
Step 1: Filter timeline CSV for Method::execute events only:
```python
method_exec_rows = [row for row in rows if row.get("name") == "Method::execute"]
```

Step 2: Group by run_index and sum durations per run:

```python
run_totals = defaultdict(float)
for row in method_exec_rows:
    run_idx = int(row.get("run_index", 0))
    duration_ms = float(row.get("duration_ms", 0) or 0)
    run_totals[run_idx] += duration_ms
```

Step 3: Return per-run latencies:

```python
return [run_totals[i] for i in sorted(run_totals)]
```

Why per-run: Each run may have multiple Method::execute events (if the model is called multiple times). Sum per run to get the total latency for that run.
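Putting the three steps together, a self-contained sketch of the extraction (using only the `name`, `duration_ms`, and `run_index` columns described in this document):

```python
import csv
from collections import defaultdict

def extract_e2e_latencies(timeline_csv):
    """Return per-run E2E latency (ms) from a *_all_runs_timeline.csv.

    Counts only Method::execute events, grouped by run_index, so nested
    DELEGATE_CALL/OPERATOR_CALL rows are never double-counted.
    """
    run_totals = defaultdict(float)
    with open(timeline_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("name") != "Method::execute":
                continue
            run_idx = int(row.get("run_index", 0) or 0)
            run_totals[run_idx] += float(row.get("duration_ms", 0) or 0)
    return [run_totals[i] for i in sorted(run_totals)]
```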
The *_all_runs_timeline.csv file contains columns:
- `name`: Event name (e.g., "Method::execute", "aten::conv2d")
- `duration_ms`: Event duration in milliseconds
- `run_index`: Which run this event belongs to (0, 1, 2, ...)
- `timestamp_ms`: When the event occurred (relative to run start)
- `parent_id`: (Optional) Parent event ID (for nested structure)
- `op_type`: (Optional) Operator type
- `backend`: (Optional) Backend name (e.g., "XNNPACK", "portable")
Critical columns for latency extraction:
- `name`: Must be "Method::execute"
- `duration_ms`: Duration of the event
- `run_index`: Groups events by run
Always verify your latency extraction is correct:
- Compare with pipeline summary:

```bash
# Check pipeline_summary.json
cat model_profiling/out_<model>/runs/mac/*_pipeline_summary.json | grep -A 5 "median"
```
- Compare with metrics.json:

```bash
# Check metrics.json
cat model_profiling/out_<model>/runs/mac/metrics.json | grep -A 5 "median"
```
- Sanity check: E2E latency should be:
  - Greater than the sum of the top operator categories (some overhead is expected)
  - Much less than the naive sum of all event durations (that sum is ~3x inflated by nesting)
  - Consistent across runs (within 10-20% variance, depending on system load)
- Use robust_latency_analysis.py for validation:

```bash
python3 model_profiling/tools/robust_latency_analysis.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose
```

This tool uses the same `Method::execute` extraction method and provides statistical validation.
| Mistake | Symptom | Fix |
|---|---|---|
| Summing all rows | Latency ~3x higher than expected | Use only Method::execute events |
| Including overhead events | Latency includes setup time | Filter out Program::load_method, Method::init |
| Not grouping by run_index | Single latency value instead of per-run | Group by run_index before summing |
| Using wrong CSV file | Missing runs or incorrect data | Use *_all_runs_timeline.csv (not *_run0_timeline.csv) |
| Not handling multiple Method::execute | Missing some runs | Sum all Method::execute events per run |
Critical: Always use Method::execute events per run, not sum of all rows.
Why: ETDump contains nested events. Summing all rows counts:
- Operator execution time
- Delegate call overhead
- Framework overhead
- Result: ~3x inflated latency
Correct method: Extract Method::execute duration per run from timeline CSV (see "Understanding ETDump Event Data Structure" section above).
Verification: Compare with *_pipeline_summary.json or metrics.json - should match within 5%.
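A minimal tolerance check for this validation might look like the following (5% threshold per the note above; values are hypothetical):

```python
def within_tolerance(extracted_ms, reference_ms, tol=0.05):
    """True if the extracted latency matches the reference within tol (5%)."""
    return abs(extracted_ms - reference_ms) <= tol * reference_ms

# A value ~3x the reference is the classic sum-of-all-rows bug.
print(within_tolerance(101.0, 100.0))   # True
print(within_tolerance(370.0, 100.0))   # False
```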
Before SME2:
- CONV/GEMM: 60-80% of total time (expected)
- Data Movement: 10-20% (hidden behind compute)
- Other: 10-20%
After SME2:
- CONV/GEMM: 20-30% (accelerated, shrunk)
- Data Movement: 40-60% (now visible, became bottleneck)
- Other: 20-30% (may include portable operators)
Key Insight: If Data Movement grows significantly after SME2, focus optimization there.
Threshold: Operators consuming >5% of E2E time are potential bottlenecks.
Priority order:
- Portable operators consuming >5% - highest priority (should be delegated)
- Data movement operators consuming >10% - high priority (optimize layout)
- Delegated operators with low speedup - investigate why SME2 isn't helping
Action: Use operator analysis to identify specific operators and their E2E weight.
How to identify: Check operator backend in *_ops_stats.csv:
- Delegated: Backend = "XNNPACK" or "qnnpack"
- Portable: Backend = "portable" or empty
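A sketch of that check, partitioning operator time by the backend column (the column names `op_name`, `total_ms`, `backend` are assumptions about the `*_ops_stats.csv` schema):

```python
import csv
from collections import defaultdict

def split_by_backend(ops_stats_csv):
    """Partition operator time into delegated vs portable buckets.

    Follows the convention above: "XNNPACK"/"qnnpack" means delegated,
    "portable" or empty means portable. Column names are assumed.
    """
    buckets = {"delegated": defaultdict(float), "portable": defaultdict(float)}
    with open(ops_stats_csv, newline="") as f:
        for row in csv.DictReader(f):
            backend = (row.get("backend") or "").lower()
            kind = "delegated" if backend in ("xnnpack", "qnnpack") else "portable"
            buckets[kind][row["op_name"]] += float(row.get("total_ms", 0) or 0)
    return buckets
```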
Key Insight: Portable operators run in ExecuTorch's portable runtime (not optimized). If they consume significant time, consider:
- Delegation: Check if XNNPACK supports the operator
- Model refactoring: Replace with delegated equivalent
- Custom backend: Implement optimized backend
Example: If aten::add (portable) consumes 10% of E2E:
- Check if XNNPACK supports elementwise add
- Consider fusing with other operations
- Consider model architecture changes
Purpose: Verify SME2 is actually being used.
Method: Check kernel names in *_kernels.csv or *_xnntrace.log:
- SME2 kernels: Contain `__neonsme2` or `sme2` in the name
- Standard kernels: Contain `__neon` or `__aarch64` but not `sme2`
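This substring check can be sketched as follows (the kernel names below are illustrative, not taken from a real trace):

```python
def classify_kernel(name):
    """Rough classification by substring, per the naming convention above."""
    n = name.lower()
    if "sme2" in n:
        return "sme2"
    if "__neon" in n or "__aarch64" in n:
        return "standard"
    return "other"

# Illustrative kernel names
kernels = ["xnn_f32_gemm__neonsme2", "xnn_f32_gemm__neon", "xnn_x8_copy__scalar"]
print([classify_kernel(k) for k in kernels])  # ['sme2', 'standard', 'other']
```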
Key Insight: If no SME2 kernels appear:
- Device may not support SME2
- Runners may not be built with SME2 enabled
- Operator configurations may not match SME2 requirements
Expected speedups:
- GEMM: 3-15x (depending on shape, dtype)
- Convolution: 3-10x (depending on kernel size, stride)
- Data Movement: No speedup (not accelerated by SME2)
- Elementwise: Minimal speedup (not compute-bound)
Key Insight: If speedup is lower than expected:
- Check device supports SME2
- Verify runners built correctly
- Check operator configurations (dtype, quantization)
- Verify operators are delegated to XNNPACK
| Issue | Symptom | Fix |
|---|---|---|
| No CSV files found | `FileNotFoundError: No CSV files found` | Run pipeline or `analyze_results.py` to generate CSV files |
| Latency values too high | Latency ~3x higher than expected | Ensure extraction uses Method::execute events (not summing all rows) |
| CSV files missing | Report cannot find timeline/stats CSV | Check CSV files exist in same directory as ETDump files |
| Config file not found | Config path in manifest is invalid | Verify config file exists at path specified in manifest.json |
| No operator analysis | Missing operator-specific bottlenecks | Run analyze_etdump_csv.py for operator-level analysis |
| No kernel view | Missing kernel-level insights | Run trace-enabled pipeline and use generate_kernel_view.py |
| Portable operators not identified | Can't tell which operators are portable | Check *_ops_stats.csv for backend column, or use analyze_etdump_csv.py |
- Virtual environment activated
- CSV files exist (generated by pipeline or `analyze_results.py`)
- `manifest.json` exists in run directory
- Base report generated successfully
- Report contains latency comparison and operator breakdown
- Latency values align with `*_pipeline_summary.json` (validation)
- Operator-specific analysis run (identifies bottlenecks)
- Kernel view generated (if trace-enabled runs available)
- Portable vs delegated analysis completed
- Optimization recommendations identified
- Always run operator-specific analysis - Base report shows categories, but operator analysis shows specific bottlenecks
- Compare SME2-on vs SME2-off - Use comparison mode in `analyze_etdump_csv.py` to see acceleration
- Identify portable operators - Check backend column in stats CSV to find portable operators
- Verify kernel selection - Use kernel view to confirm SME2 is actually being used
- Focus on high-impact optimizations - Prioritize operators consuming >5% of E2E time
- Document findings - Add insights to report for future reference
- Base report script: `model_profiling/scripts/generate_report.py`
- Operator analysis: `model_profiling/tools/analyze_etdump_csv.py`
- Kernel view: `model_profiling/tools/generate_kernel_view.py`
- Kernel extraction: `model_profiling/tools/xnntrace_to_kernels.py`
- Robust latency: `model_profiling/tools/robust_latency_analysis.py`
- Combined report: `model_profiling/scripts/generate_combined_report.py`
- CSV generation: `model_profiling/scripts/analyze_results.py`
- CSV files: `*_all_runs_timeline.csv`, `*_ops_stats.csv`
- Kernel files: `*_kernels.csv`, `*_xnntrace.log` (from trace-enabled runs)
- Metadata: `manifest.json`, `metrics.json`, `*_pipeline_summary.json`