---
name: generate_report
description: Generate markdown report from profiling results. Identifies what was accelerated by SME2, identifies bottlenecks (operator-level and category-level), analyzes portable vs delegated operators, and provides kernel-level insights. Use when creating final reports, documenting profiling results, identifying optimization opportunities, or sharing performance analysis with stakeholders.
---

Skill: Generate Performance Report

Purpose: Generate markdown report from profiling results

When to use:

  • After profiling completes (CSV files already generated by pipeline)
  • When creating final documentation of profiling results
  • When identifying optimization opportunities and bottlenecks
  • When sharing performance analysis with stakeholders
  • When analyzing what was accelerated and what needs optimization

Overview

Generates a markdown report from CSV files (derived from ETDump). The report includes:

  1. Accurate E2E latency extraction - Uses Method::execute events per run (avoids double-counting)
  2. Operator category breakdown - Shows where time is spent (CONV, GEMM, Data Movement, etc.)
  3. Bottleneck identification - Identifies operators consuming significant % of E2E time
  4. Acceleration analysis - Shows what was accelerated by SME2 and by how much
  5. Portable vs delegated analysis - Identifies portable operators that should be delegated
  6. Kernel-level insights - Shows which kernels were used (SME2 vs standard)
  7. Optimization recommendations - Provides specific optimization suggestions

Key Insight: After SME2 accelerates CONV/GEMM operations, bottlenecks often shift to Data Movement (transpose, layout changes, memory copies) and portable operators (non-delegated operators running in ExecuTorch's portable runtime). The report makes these shifts visible.

Data Sources:

  • Latency: From *_all_runs_timeline.csv - extracts Method::execute events per run (accurate E2E latency, avoids double-counting)
  • Operator breakdown: From *_ops_stats.csv (aggregated operator statistics with backend attribution)
  • Kernel insights: From *_kernels.csv or *_xnntrace.log (kernel selection evidence)
  • Metadata: From manifest.json and config files (runner paths, device info, ExecuTorch SHA)

CRITICAL - ETDump Event Structure: ETDump events are hierarchically nested. Method::execute is the outermost container representing E2E latency. Inner events (DELEGATE_CALL, OPERATOR_CALL, actual operators) are nested within it. Summing all rows counts nested events multiple times, inflating latency ~3x. Always extract latency using only Method::execute events per run. See "Understanding ETDump Event Data Structure" section below for detailed explanation.

Prerequisites:

  • .venv/ activated
  • CSV files exist (generated automatically by pipeline, or manually via analyze_results.py)
  • manifest.json exists in run directory
  • For kernel-level insights: trace-enabled runs with *_kernels.csv or *_xnntrace.log files

Steps

1. Activate Virtual Environment

source .venv/bin/activate

2. Generate Base Report

python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac

Default output: <run-dir>/report.md

With custom output path:

python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac \
  --out model_profiling/out_<model>/runs/mac/custom_report.md \
  --title "My Model Performance Analysis"

Note: Pipeline automatically generates CSV files. No PTE file required - script reads from CSV files and manifest.json.

CRITICAL - Latency Extraction: The report script correctly extracts E2E latency using Method::execute events per run. This is essential because ETDump events are nested (see "Understanding ETDump Event Data Structure" section below). Summing all rows would inflate latency ~3x. Always verify latency values match *_pipeline_summary.json or metrics.json.

3. Generate Operator-Specific Bottleneck Analysis (Critical for Insights)

The base report provides a category-level breakdown. For operator-specific bottlenecks, run:

# Analyze operator-specific bottlenecks (identifies top operators by E2E weight)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose

This generates:

  • Operator-specific latency reports - Top operators by total time and % of E2E
  • Portable vs delegated breakdown - Identifies portable operators consuming significant time
  • Bottleneck recommendations - Operators that should be delegated or optimized

Key Output: Identifies operators that:

  • Consume >5% of E2E time (potential bottlenecks)
  • Are portable (not delegated) and should be considered for delegation
  • Show significant time even after SME2 acceleration
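The bottleneck criteria above can be sketched as a small helper. This is illustrative only: the function name and the input shape (operator name mapped to total milliseconds) are assumptions, and the 5% threshold mirrors the text.

```python
# Hedged sketch: flag potential bottlenecks given per-operator totals and an
# E2E latency. The 5% threshold follows the text; the data is illustrative,
# not real profiling output.

def find_bottlenecks(op_totals_ms, e2e_ms, threshold=0.05):
    """Return (op_name, fraction_of_e2e) pairs above the threshold, sorted descending."""
    hits = [(name, total / e2e_ms) for name, total in op_totals_ms.items()
            if total / e2e_ms > threshold]
    return sorted(hits, key=lambda p: p[1], reverse=True)

ops = {"aten::conv2d": 40.0, "aten::add": 12.0, "aten::relu": 2.0}
bottlenecks = find_bottlenecks(ops, e2e_ms=100.0)
# conv2d (40%) and add (12%) exceed the 5% threshold; relu (2%) does not
```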

4. Generate Kernel-Level View (If Trace-Enabled Runs Available)

For kernel-level insights (which kernels were selected), use trace-enabled runs:

# Extract kernels from xnntrace logs
python3 model_profiling/tools/xnntrace_to_kernels.py \
  --xnntrace-log model_profiling/out_<model>/runs/mac/<experiment>/*_xnntrace.log \
  --out model_profiling/out_<model>/runs/mac/<experiment>/kernels.csv

# Generate kernel comparison view (SME2-on vs SME2-off)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/mac/mac_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/mac/mac_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/mac/kernel_view.md

This shows:

  • Which kernels were selected (SME2-accelerated vs standard)
  • Kernel usage patterns (which operations benefit from SME2)
  • Evidence of SME2 acceleration (kernels with __neonsme2 or sme2 in name)

Note: Trace-enabled runs add logging overhead that distorts timing, so keep kernel-analysis runs separate from latency-measurement runs.
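The naming evidence above can be checked with a trivial filter. The kernel names below are made-up stand-ins, not real XNNPACK kernel identifiers.

```python
# Hedged sketch: classify kernel names as SME2-accelerated vs standard using
# the "__neonsme2" / "sme2" substring evidence described above.

def is_sme2_kernel(kernel_name: str) -> bool:
    return "sme2" in kernel_name.lower()

kernels = [
    "xnn_f32_gemm_ukernel__neonsme2",   # illustrative SME2-accelerated name
    "xnn_f32_gemm_ukernel__neonfma",    # illustrative standard NEON name
]
sme2_hits = [k for k in kernels if is_sme2_kernel(k)]
```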

5. Generate Robust Latency Statistics (Optional - Enhanced Analysis)

For more robust statistical analysis (outlier detection, percentiles):

python3 model_profiling/tools/robust_latency_analysis.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --name "SME2-On" \
  --verbose

# Compare SME2-on vs SME2-off
python3 model_profiling/tools/robust_latency_analysis.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/mac_sme2_off/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/mac/mac_sme2_on/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose

6. Verify Report Completeness

# Check report exists and is non-empty
test -f model_profiling/out_<model>/runs/mac/report.md && echo "✓ Report exists"
test -s model_profiling/out_<model>/runs/mac/report.md && echo "✓ Report is non-empty"

# Check key sections
grep -q "Latency Comparison" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains latency comparison"
grep -q "Operator Category Breakdown" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains category breakdown"
grep -q "Config File" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains config path"

# Check for operator analysis (if operator analysis was run)
test -f model_profiling/out_<model>/runs/mac/operator_analysis.json && echo "✓ Operator-specific analysis available"
test -f model_profiling/out_<model>/runs/mac/kernel_view.md && echo "✓ Kernel view available"

Report Contents

The report includes:

1. Metadata

  • Model path, config file, runner paths, device info, ExecuTorch SHA
  • Analysis timestamp and CSV files analyzed

2. Latency Comparison (SME2-On vs SME2-Off)

  • Statistical table: Median, mean, min, max, std dev
  • Speedup calculation: Shows acceleration factor
  • Method: Uses Method::execute events per run (accurate E2E latency)

Key Insight: Compare median latencies to see overall speedup. High std dev indicates variability (thermal throttling, system load).
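The median comparison and the variability signal can be computed directly from the per-run latency lists. A minimal sketch with illustrative numbers:

```python
# Hedged sketch: speedup factor from medians, plus a coefficient of variation
# as a simple variability signal (high values hint at throttling or load).
from statistics import mean, median, pstdev

def speedup(latencies_off_ms, latencies_on_ms):
    return median(latencies_off_ms) / median(latencies_on_ms)

def cv_percent(latencies_ms):
    """Population std dev as a percentage of the mean."""
    return 100.0 * pstdev(latencies_ms) / mean(latencies_ms)

off = [120.0, 118.0, 122.0]
on = [60.0, 59.0, 61.0]
factor = speedup(off, on)  # median 120 ms vs median 60 ms -> 2.0x
```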

3. Operator Category Breakdown

  • Time spent per category: GEMM, Convolution, Data Movement, Elementwise, Other
  • SME2-on vs SME2-off comparison: Shows which categories were accelerated
  • Percentage breakdown: Shows where time is spent after acceleration

Key Insights:

  • GEMM/Convolution shrinking: Indicates SME2 acceleration is working
  • Data Movement growing as %: Expected after SME2 - reveals next bottleneck
  • Other category high: May indicate portable operators that should be delegated

4. Acceleration Analysis (What Was Accelerated)

  • Category-level speedup: Which categories saw the biggest speedup
  • Operator-level speedup: Top accelerated operators (if operator analysis available)

Key Insight: GEMM and Convolution should show 3-15x speedup with SME2. If not, check:

  • Device supports SME2 (Armv9, Apple M4+)
  • Runners built with SME2 enabled
  • Model operators are delegated to XNNPACK

5. Bottleneck Identification

Category-Level Bottlenecks

  • Data Movement dominant: After SME2, data movement often becomes the bottleneck
    • Action: Focus on transpose elimination, layout optimization, memory access patterns
  • Other category high: May indicate portable operators
    • Action: Identify portable operators and consider delegation

Operator-Level Bottlenecks (from operator analysis)

  • Top operators by E2E weight: Operators consuming >5% of E2E time
  • Portable operators consuming time: Non-delegated operators that should be delegated
  • Operators not accelerated: Operators that should benefit from SME2 but don't

Key Insight: After SME2 accelerates compute, focus on:

  1. Data movement operators (transpose, reshape, copy) - optimize layout, reduce copies
  2. Portable operators (not delegated) - consider delegation to XNNPACK or other backends
  3. Operators with low speedup - investigate why SME2 isn't helping

6. Portable vs Delegated Operator Analysis

Critical for optimization: Identifies operators running in ExecuTorch's portable runtime vs delegated to XNNPACK.

From operator analysis:

  • Portable operators: Running in ExecuTorch portable runtime (slower, not optimized)
  • Delegated operators: Running in XNNPACK backend (faster, optimized)

Key Insight: If portable operators consume significant time (>5% of E2E), consider:

  • Delegation: Check if operator can be delegated to XNNPACK
  • Model architecture changes: Refactor to use delegated operators
  • Custom backend: Implement custom backend for specific operators

Example: If aten::add (portable) consumes 10% of E2E time, consider:

  • Replacing with XNNPACK-delegated equivalent
  • Fusing with other operations
  • Using different model architecture

7. Kernel-Level Insights (If Available)

From kernel view analysis:

  • Kernel selection: Which kernels were used (SME2-accelerated vs standard)
  • SME2 evidence: Kernels with __neonsme2 or sme2 in name
  • Kernel usage patterns: Which operations benefit from SME2

Key Insight: Verify SME2 is actually being used by checking kernel names. If no SME2 kernels appear:

  • Check device supports SME2
  • Verify runners built with SME2 enabled
  • Check operator configurations (dtype, quantization) match SME2 requirements

8. Actionable Recommendations

The report should conclude with specific, actionable recommendations:

  1. What to optimize next: Based on bottleneck analysis
  2. Model architecture changes: Operators to delegate or refactor
  3. Configuration changes: Quantization, dtype, backend partitioning
  4. Further analysis needed: If data is incomplete or unclear

Workflows

macOS Analysis

# 1. Pipeline automatically generates CSV files
python3 model_profiling/scripts/mac_pipeline.py \
  --config model_profiling/configs/my_run.json

# 2. Generate base report
python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac

# 3. Generate operator-specific bottleneck analysis (CRITICAL)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/mac_sme2_on/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/mac/mac_sme2_off/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose

# 4. Generate kernel view (if trace-enabled runs available)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/mac/mac_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/mac/mac_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/mac/kernel_view.md

Android Analysis

# 1. Pipeline automatically generates CSV files
python3 model_profiling/scripts/android_pipeline.py \
  --config model_profiling/configs/android_config.json

# 2. Generate base report
python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/android

# 3. Generate operator-specific bottleneck analysis (CRITICAL)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/android/android_sme2_on/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/android/android_sme2_off/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/android/ \
  --verbose

# 4. Generate kernel view (if trace-enabled runs available)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/android/android_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/android/android_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/android/kernel_view.md

Combined Report (macOS + Android)

python3 model_profiling/scripts/generate_combined_report.py \
  --mac-dir model_profiling/out_<model>/runs/mac \
  --android-dir model_profiling/out_<model>/runs/android \
  --out model_profiling/out_<model>/runs/combined_report.md

Understanding ETDump Event Data Structure

CRITICAL: Understanding ETDump's nested event structure is essential for accurate latency extraction. Incorrect extraction leads to ~3x inflated latency values.

ETDump Event Hierarchy

ETDump events are hierarchically nested. The structure follows this pattern:

Method::execute (E2E latency - outermost container)
├── Program::load_method (model loading overhead)
├── Method::init (initialization overhead)
└── DELEGATE_CALL (delegate invocation overhead)
    └── OPERATOR_CALL (operator invocation overhead)
        └── Actual Operator Execution (e.g., "aten::conv2d", "xnnpack::linear")
            └── Kernel execution (inside XNNPACK backend)

Key Event Types:

  • Method::execute: Outermost container - represents the entire model inference call. This is the correct E2E latency.
  • DELEGATE_CALL: Framework overhead for invoking a delegate (e.g., XNNPACK)
  • OPERATOR_CALL: Framework overhead for invoking an operator
  • Actual operators: Real computation (e.g., aten::conv2d, xnnpack::linear, aten::add)
  • Framework overhead: Program::load_method, Method::init (one-time setup, not per-run)

Why Summing All Rows is Wrong

Common Mistake: Summing all duration_ms values in the timeline CSV.

Why this fails: ETDump events are nested. Each inner event's duration is already included in its parent's duration.

Example:

Method::execute: 100ms
  └── DELEGATE_CALL: 95ms (within Method::execute)
      └── OPERATOR_CALL: 90ms (within DELEGATE_CALL)
          └── aten::conv2d: 85ms (within OPERATOR_CALL)

If you sum all rows:

  • Method::execute: 100ms
  • DELEGATE_CALL: 95ms (already counted in Method::execute)
  • OPERATOR_CALL: 90ms (already counted in DELEGATE_CALL)
  • aten::conv2d: 85ms (already counted in OPERATOR_CALL)
  • Total (WRONG): 370ms (3.7x inflated!)

Correct extraction: Use only Method::execute duration = 100ms
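In code, with the illustrative durations from the example above, the wrong and correct extractions look like this:

```python
# Hedged sketch reproducing the example above: summing every nested row
# inflates latency, while keeping only Method::execute gives the true E2E.
rows = [
    {"name": "Method::execute", "duration_ms": 100.0},
    {"name": "DELEGATE_CALL",   "duration_ms": 95.0},
    {"name": "OPERATOR_CALL",   "duration_ms": 90.0},
    {"name": "aten::conv2d",    "duration_ms": 85.0},
]

wrong = sum(r["duration_ms"] for r in rows)        # 370.0 ms (3.7x inflated)
right = sum(r["duration_ms"] for r in rows
            if r["name"] == "Method::execute")     # 100.0 ms (true E2E)
```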

Correct Latency Extraction Method

Step 1: Filter timeline CSV for Method::execute events only:

method_exec_rows = [row for row in rows if row.get("name") == "Method::execute"]

Step 2: Group by run_index and sum durations per run:

run_totals = defaultdict(float)
for row in method_exec_rows:
    run_idx = int(row.get("run_index", 0))
    duration_ms = float(row.get("duration_ms", 0) or 0)
    run_totals[run_idx] += duration_ms

Step 3: Return per-run latencies:

return [run_totals[i] for i in sorted(run_totals)]

Why per-run: Each run may have multiple Method::execute events (if model is called multiple times). Sum per run to get total latency for that run.
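Steps 1-3 above can be combined into one helper. This is a minimal sketch assuming the documented timeline columns (name, run_index, duration_ms); the demo CSV is synthetic.

```python
# Hedged sketch: per-run E2E latency extraction from a timeline CSV, keeping
# only Method::execute rows and grouping by run_index.
import csv
import os
import tempfile
from collections import defaultdict

def extract_e2e_latencies(timeline_csv_path):
    """Return one E2E latency (ms) per run, in run order."""
    run_totals = defaultdict(float)
    with open(timeline_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("name") != "Method::execute":
                continue  # nested events are already counted in the parent
            run_idx = int(row.get("run_index", 0))
            run_totals[run_idx] += float(row.get("duration_ms", 0) or 0)
    return [run_totals[i] for i in sorted(run_totals)]

# Demo on a tiny synthetic timeline (a nested row deliberately included):
demo = (
    "name,run_index,duration_ms\n"
    "Method::execute,0,100\n"
    "DELEGATE_CALL,0,95\n"
    "Method::execute,1,102\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(demo)
    path = tmp.name
latencies = extract_e2e_latencies(path)  # one value per run
os.unlink(path)
```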

Timeline CSV Structure

The *_all_runs_timeline.csv file contains columns:

  • name: Event name (e.g., "Method::execute", "aten::conv2d")
  • duration_ms: Event duration in milliseconds
  • run_index: Which run this event belongs to (0, 1, 2, ...)
  • timestamp_ms: When the event occurred (relative to run start)
  • parent_id: (Optional) Parent event ID (for nested structure)
  • op_type: (Optional) Operator type
  • backend: (Optional) Backend name (e.g., "XNNPACK", "portable")

Critical columns for latency extraction:

  • name: Must be "Method::execute"
  • duration_ms: Duration of the event
  • run_index: Groups events by run

Verification Steps

Always verify your latency extraction is correct:

  1. Compare with pipeline summary:

    # Check pipeline_summary.json
    cat model_profiling/out_<model>/runs/mac/*_pipeline_summary.json | grep -A 5 "median"
  2. Compare with metrics.json:

    # Check metrics.json
    cat model_profiling/out_<model>/runs/mac/metrics.json | grep -A 5 "median"
  3. Sanity check: E2E latency should be:

    • Greater than the sum of top operator categories (some overhead expected)
    • Well below the naive sum of all timeline rows (that sum is ~3x inflated by nesting; if your value matches it, you summed nested events)
    • Consistent across runs (within 10-20% variance, depending on system load)
  4. Use robust_latency_analysis.py for validation:

    python3 model_profiling/tools/robust_latency_analysis.py \
      --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
      --output-dir model_profiling/out_<model>/runs/mac/ \
      --verbose

    This tool uses the same Method::execute extraction method and provides statistical validation.
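The sanity checks in step 3 can be expressed directly. A sketch with illustrative numbers; the variable names are assumptions, not output of any tool:

```python
# Hedged sketch of the sanity checks above: the E2E median should exceed the
# operator-category sum, stay well below the naive all-rows sum, and be
# consistent across runs.
from statistics import median

e2e_per_run = [100.0, 101.0, 99.0]
category_sum_ms = 92.0   # sum of top operator categories (illustrative)
all_rows_sum_ms = 370.0  # naive sum of every timeline row (illustrative)

e2e = median(e2e_per_run)
ok_overhead = category_sum_ms < e2e       # some framework overhead expected
ok_not_inflated = e2e < all_rows_sum_ms   # all-rows sum is ~3x inflated
spread = (max(e2e_per_run) - min(e2e_per_run)) / e2e
ok_stable = spread < 0.20                 # within the 10-20% variance band
```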

Common Pitfalls

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Summing all rows | Latency ~3x higher than expected | Use only Method::execute events |
| Including overhead events | Latency includes setup time | Filter out Program::load_method, Method::init |
| Not grouping by run_index | Single latency value instead of per-run | Group by run_index before summing |
| Using wrong CSV file | Missing runs or incorrect data | Use *_all_runs_timeline.csv (not *_run0_timeline.csv) |
| Not handling multiple Method::execute | Missing some runs | Sum all Method::execute events per run |

Key Insights for Report Interpretation

1. Accurate E2E Latency Extraction

Critical: Always use Method::execute events per run, not sum of all rows.

Why: ETDump contains nested events. Summing all rows counts:

  • Operator execution time
  • Delegate call overhead
  • Framework overhead
  • Result: ~3x inflated latency

Correct method: Extract Method::execute duration per run from timeline CSV (see "Understanding ETDump Event Data Structure" section above).

Verification: Compare with *_pipeline_summary.json or metrics.json - should match within 5%.
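The 5% cross-check can be automated. A sketch; the metrics field name "median_ms" is an assumption here, so check it against your actual metrics.json schema.

```python
# Hedged sketch: compare the median extracted from the timeline against the
# median recorded in metrics.json, within a relative tolerance.
import json
from statistics import median

def within_tolerance(extracted_runs_ms, metrics_json_text, tol=0.05):
    extracted = median(extracted_runs_ms)
    reference = json.loads(metrics_json_text)["median_ms"]  # assumed field name
    return abs(extracted - reference) / reference <= tol

# ~1% apart, so the check passes at the 5% tolerance
ok = within_tolerance([100.0, 102.0, 98.0], '{"median_ms": 101.0}')
```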

2. Operator Category Breakdown Interpretation

Before SME2:

  • CONV/GEMM: 60-80% of total time (expected)
  • Data Movement: 10-20% (hidden behind compute)
  • Other: 10-20%

After SME2:

  • CONV/GEMM: 20-30% (accelerated, shrunk)
  • Data Movement: 40-60% (now visible, became bottleneck)
  • Other: 20-30% (may include portable operators)

Key Insight: If Data Movement grows significantly after SME2, focus optimization there.

3. Bottleneck Identification (Operator-Level)

Threshold: Operators consuming >5% of E2E time are potential bottlenecks.

Priority order:

  1. Portable operators consuming >5% - highest priority (should be delegated)
  2. Data movement operators consuming >10% - high priority (optimize layout)
  3. Delegated operators with low speedup - investigate why SME2 isn't helping

Action: Use operator analysis to identify specific operators and their E2E weight.

4. Portable vs Delegated Analysis

How to identify: Check operator backend in *_ops_stats.csv:

  • Delegated: Backend = "XNNPACK" or "qnnpack"
  • Portable: Backend = "portable" or empty

Key Insight: Portable operators run in ExecuTorch's portable runtime (not optimized). If they consume significant time, consider:

  • Delegation: Check if XNNPACK supports the operator
  • Model refactoring: Replace with delegated equivalent
  • Custom backend: Implement optimized backend

Example: If aten::add (portable) consumes 10% of E2E:

  • Check if XNNPACK supports elementwise add
  • Consider fusing with other operations
  • Consider model architecture changes
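The backend convention above (empty or "portable" means portable; anything else, e.g. "XNNPACK", means delegated) can be applied row by row. The records below are illustrative stand-ins for *_ops_stats.csv rows, and the column names are assumptions.

```python
# Hedged sketch: attribute per-operator time to portable vs delegated buckets
# using the "backend" column convention described above.

def split_by_backend(rows):
    buckets = {"portable": 0.0, "delegated": 0.0}
    for row in rows:
        backend = (row.get("backend") or "").strip().lower()
        key = "portable" if backend in ("", "portable") else "delegated"
        buckets[key] += float(row["total_ms"])
    return buckets

rows = [
    {"name": "xnnpack::linear", "backend": "XNNPACK", "total_ms": "55"},
    {"name": "aten::add", "backend": "portable", "total_ms": "10"},
    {"name": "aten::slice", "backend": "", "total_ms": "3"},
]
buckets = split_by_backend(rows)  # portable: 13.0 ms, delegated: 55.0 ms
```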

5. Kernel-Level Verification

Purpose: Verify SME2 is actually being used.

Method: Check kernel names in *_kernels.csv or *_xnntrace.log:

  • SME2 kernels: Contain __neonsme2 or sme2 in name
  • Standard kernels: Contain __neon or __aarch64 but not sme2

Key Insight: If no SME2 kernels appear:

  • Device may not support SME2
  • Runners may not be built with SME2 enabled
  • Operator configurations may not match SME2 requirements

6. Acceleration Analysis

Expected speedups:

  • GEMM: 3-15x (depending on shape, dtype)
  • Convolution: 3-10x (depending on kernel size, stride)
  • Data Movement: No speedup (not accelerated by SME2)
  • Elementwise: Minimal speedup (not compute-bound)

Key Insight: If speedup is lower than expected:

  • Check device supports SME2
  • Verify runners built correctly
  • Check operator configurations (dtype, quantization)
  • Verify operators are delegated to XNNPACK
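Per-category speedup against the expected ranges above can be computed from the two category breakdowns. The numbers below are illustrative, not measured:

```python
# Hedged sketch: per-category speedup from SME2-off vs SME2-on category
# totals (ms), for checking against the expected ranges above.

def category_speedups(off_ms, on_ms):
    return {cat: off_ms[cat] / on_ms[cat]
            for cat in off_ms if cat in on_ms and on_ms[cat] > 0}

off = {"GEMM": 60.0, "Convolution": 30.0, "Data Movement": 10.0}
on = {"GEMM": 10.0, "Convolution": 6.0, "Data Movement": 10.0}
speedups = category_speedups(off, on)
# GEMM 6x, Convolution 5x, Data Movement 1x (not accelerated, as expected)
```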

Troubleshooting

| Issue | Symptom | Fix |
| --- | --- | --- |
| No CSV files found | FileNotFoundError: No CSV files found | Run pipeline or analyze_results.py to generate CSV files |
| Latency values too high | Latency ~3x higher than expected | Ensure extraction uses Method::execute events (not summing all rows) |
| CSV files missing | Report cannot find timeline/stats CSV | Check CSV files exist in same directory as ETDump files |
| Config file not found | Config path in manifest is invalid | Verify config file exists at path specified in manifest.json |
| No operator analysis | Missing operator-specific bottlenecks | Run analyze_etdump_csv.py for operator-level analysis |
| No kernel view | Missing kernel-level insights | Run trace-enabled pipeline and use generate_kernel_view.py |
| Portable operators not identified | Can't tell which operators are portable | Check *_ops_stats.csv for backend column, or use analyze_etdump_csv.py |

Implementation Checklist

  • Virtual environment activated
  • CSV files exist (generated by pipeline or analyze_results.py)
  • manifest.json exists in run directory
  • Base report generated successfully
  • Report contains latency comparison and operator breakdown
  • Latency values align with *_pipeline_summary.json (validation)
  • Operator-specific analysis run (identifies bottlenecks)
  • Kernel view generated (if trace-enabled runs available)
  • Portable vs delegated analysis completed
  • Optimization recommendations identified

Recommendations

  1. Always run operator-specific analysis - Base report shows categories, but operator analysis shows specific bottlenecks
  2. Compare SME2-on vs SME2-off - Use comparison mode in analyze_etdump_csv.py to see acceleration
  3. Identify portable operators - Check backend column in stats CSV to find portable operators
  4. Verify kernel selection - Use kernel view to confirm SME2 is actually being used
  5. Focus on high-impact optimizations - Prioritize operators consuming >5% of E2E time
  6. Document findings - Add insights to report for future reference

References

  • Base report script: model_profiling/scripts/generate_report.py
  • Operator analysis: model_profiling/tools/analyze_etdump_csv.py
  • Kernel view: model_profiling/tools/generate_kernel_view.py
  • Kernel extraction: model_profiling/tools/xnntrace_to_kernels.py
  • Robust latency: model_profiling/tools/robust_latency_analysis.py
  • Combined report: model_profiling/scripts/generate_combined_report.py
  • CSV generation: model_profiling/scripts/analyze_results.py

Assets

  • CSV files: *_all_runs_timeline.csv, *_ops_stats.csv
  • Kernel files: *_kernels.csv, *_xnntrace.log (from trace-enabled runs)
  • Metadata: manifest.json, metrics.json, *_pipeline_summary.json