---
name: generate_report
description: Generate markdown report from profiling results. Identifies what was accelerated by SME2, identifies bottlenecks (operator-level and category-level), analyzes portable vs delegated operators, and provides kernel-level insights. Use when creating final reports, documenting profiling results, identifying optimization opportunities, or sharing performance analysis with stakeholders.
---

Skill: Generate Performance Report

Purpose: Generate markdown report from profiling results

When to use:

  • After profiling completes (CSV files already generated by pipeline)
  • When creating final documentation of profiling results
  • When identifying optimization opportunities and bottlenecks
  • When sharing performance analysis with stakeholders
  • When analyzing what was accelerated and what needs optimization

Overview

Generates a markdown report from CSV files (derived from ETDump). The report includes:

  1. Accurate E2E latency extraction - Uses Method::execute events per run (avoids double-counting)
  2. Operator category breakdown - Shows where time is spent (CONV, GEMM, Data Movement, etc.)
  3. Bottleneck identification - Identifies operators consuming significant % of E2E time
  4. Acceleration analysis - Shows what was accelerated by SME2 and by how much
  5. Portable vs delegated analysis - Identifies portable operators that should be delegated
  6. Kernel-level insights - Shows which kernels were used (SME2 vs standard)
  7. Optimization recommendations - Provides specific optimization suggestions

Key Insight: After SME2 accelerates CONV/GEMM operations, bottlenecks often shift to Data Movement (transpose, layout changes, memory copies) and portable operators (non-delegated operators running in ExecuTorch's portable runtime). The report makes these shifts visible.

Data Sources:

  • Latency: From *_all_runs_timeline.csv - extracts Method::execute events per run (accurate E2E latency, avoids double-counting)
  • Operator breakdown: From *_ops_stats.csv (aggregated operator statistics with backend attribution)
  • Kernel insights: From *_kernels.csv or *_xnntrace.log (kernel selection evidence)
  • Metadata: From manifest.json and config files (runner paths, device info, ExecuTorch SHA)

CRITICAL - ETDump Event Structure: ETDump events are hierarchically nested. Method::execute is the outermost container representing E2E latency. Inner events (DELEGATE_CALL, OPERATOR_CALL, actual operators) are nested within it. Summing all rows counts nested events multiple times, inflating latency ~3x. Always extract latency using only Method::execute events per run. See "Understanding ETDump Event Data Structure" section below for detailed explanation.

Prerequisites:

  • .venv/ activated
  • CSV files exist (generated automatically by pipeline, or manually via analyze_results.py)
  • manifest.json exists in run directory
  • For kernel-level insights: trace-enabled runs with *_kernels.csv or *_xnntrace.log files

Steps

1. Activate Virtual Environment

source .venv/bin/activate

2. Generate Base Report

python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac

Default output: <run-dir>/report.md

With custom output path:

python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac \
  --out model_profiling/out_<model>/runs/mac/custom_report.md \
  --title "My Model Performance Analysis"

Note: Pipeline automatically generates CSV files. No PTE file required - script reads from CSV files and manifest.json.

CRITICAL - Latency Extraction: The report script correctly extracts E2E latency using Method::execute events per run. This is essential because ETDump events are nested (see "Understanding ETDump Event Data Structure" section below). Summing all rows would inflate latency ~3x. Always verify latency values match *_pipeline_summary.json or metrics.json.

3. Generate Operator-Specific Bottleneck Analysis (Critical for Insights)

The base report provides a category-level breakdown. For operator-specific bottlenecks, run:

# Analyze operator-specific bottlenecks (identifies top operators by E2E weight)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose

This generates:

  • Operator-specific latency reports - Top operators by total time and % of E2E
  • Portable vs delegated breakdown - Identifies portable operators consuming significant time
  • Bottleneck recommendations - Operators that should be delegated or optimized

Key Output: Identifies operators that:

  • Consume >5% of E2E time (potential bottlenecks)
  • Are portable (not delegated) and should be considered for delegation
  • Show significant time even after SME2 acceleration
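The bottleneck criteria above can be sketched as a small helper. This is illustrative only: the function name and the input shape (operator name mapped to total milliseconds) are assumptions, and the 5% threshold mirrors the text.

```python
# Hedged sketch: flag potential bottlenecks given per-operator totals and an
# E2E latency. The 5% threshold follows the text; the data is illustrative,
# not real profiling output.

def find_bottlenecks(op_totals_ms, e2e_ms, threshold=0.05):
    """Return (op_name, fraction_of_e2e) pairs above the threshold, sorted descending."""
    hits = [(name, total / e2e_ms) for name, total in op_totals_ms.items()
            if total / e2e_ms > threshold]
    return sorted(hits, key=lambda p: p[1], reverse=True)

ops = {"aten::conv2d": 40.0, "aten::add": 12.0, "aten::relu": 2.0}
bottlenecks = find_bottlenecks(ops, e2e_ms=100.0)
# conv2d (40%) and add (12%) exceed the 5% threshold; relu (2%) does not
```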

4. Generate Kernel-Level View (If Trace-Enabled Runs Available)

For kernel-level insights (which kernels were selected), use trace-enabled runs:

# Extract kernels from xnntrace logs
python3 model_profiling/tools/xnntrace_to_kernels.py \
  --xnntrace-log model_profiling/out_<model>/runs/mac/<experiment>/*_xnntrace.log \
  --out model_profiling/out_<model>/runs/mac/<experiment>/kernels.csv

# Generate kernel comparison view (SME2-on vs SME2-off)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/mac/mac_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/mac/mac_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/mac/kernel_view.md

This shows:

  • Which kernels were selected (SME2-accelerated vs standard)
  • Kernel usage patterns (which operations benefit from SME2)
  • Evidence of SME2 acceleration (kernels with __neonsme2 or sme2 in name)

Note: Trace-enabled runs add logging overhead that distorts timing, so keep kernel-analysis runs separate from latency-measurement runs.
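The naming evidence above can be checked with a trivial filter. The kernel names below are made-up stand-ins, not real XNNPACK kernel identifiers.

```python
# Hedged sketch: classify kernel names as SME2-accelerated vs standard using
# the "__neonsme2" / "sme2" substring evidence described above.

def is_sme2_kernel(kernel_name: str) -> bool:
    return "sme2" in kernel_name.lower()

kernels = [
    "xnn_f32_gemm_ukernel__neonsme2",   # illustrative SME2-accelerated name
    "xnn_f32_gemm_ukernel__neonfma",    # illustrative standard NEON name
]
sme2_hits = [k for k in kernels if is_sme2_kernel(k)]
```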

5. Generate Robust Latency Statistics (Optional - Enhanced Analysis)

For more robust statistical analysis (outlier detection, percentiles):

python3 model_profiling/tools/robust_latency_analysis.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --name "SME2-On" \
  --verbose

# Compare SME2-on vs SME2-off
python3 model_profiling/tools/robust_latency_analysis.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/mac_sme2_off/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/mac/mac_sme2_on/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose

6. Verify Report Completeness

# Check report exists and is non-empty
test -f model_profiling/out_<model>/runs/mac/report.md && echo "✓ Report exists"
test -s model_profiling/out_<model>/runs/mac/report.md && echo "✓ Report is non-empty"

# Check key sections
grep -q "Latency Comparison" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains latency comparison"
grep -q "Operator Category Breakdown" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains category breakdown"
grep -q "Config File" model_profiling/out_<model>/runs/mac/report.md && echo "✓ Contains config path"

# Check for operator analysis (if operator analysis was run)
test -f model_profiling/out_<model>/runs/mac/operator_analysis.json && echo "✓ Operator-specific analysis available"
test -f model_profiling/out_<model>/runs/mac/kernel_view.md && echo "✓ Kernel view available"

Report Contents

The report includes:

1. Metadata

  • Model path, config file, runner paths, device info, ExecuTorch SHA
  • Analysis timestamp and CSV files analyzed

2. Latency Comparison (SME2-On vs SME2-Off)

  • Statistical table: Median, mean, min, max, std dev
  • Speedup calculation: Shows acceleration factor
  • Method: Uses Method::execute events per run (accurate E2E latency)

Key Insight: Compare median latencies to see overall speedup. High std dev indicates variability (thermal throttling, system load).
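The median comparison and the variability signal can be computed directly from the per-run latency lists. A minimal sketch with illustrative numbers:

```python
# Hedged sketch: speedup factor from medians, plus a coefficient of variation
# as a simple variability signal (high values hint at throttling or load).
from statistics import mean, median, pstdev

def speedup(latencies_off_ms, latencies_on_ms):
    return median(latencies_off_ms) / median(latencies_on_ms)

def cv_percent(latencies_ms):
    """Population std dev as a percentage of the mean."""
    return 100.0 * pstdev(latencies_ms) / mean(latencies_ms)

off = [120.0, 118.0, 122.0]
on = [60.0, 59.0, 61.0]
factor = speedup(off, on)  # median 120 ms vs median 60 ms -> 2.0x
```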

3. Operator Category Breakdown

  • Time spent per category: GEMM, Convolution, Data Movement, Elementwise, Other
  • SME2-on vs SME2-off comparison: Shows which categories were accelerated
  • Percentage breakdown: Shows where time is spent after acceleration

Key Insights:

  • GEMM/Convolution shrinking: Indicates SME2 acceleration is working
  • Data Movement growing as %: Expected after SME2 - reveals next bottleneck
  • Other category high: May indicate portable operators that should be delegated

4. Acceleration Analysis (What Was Accelerated)

  • Category-level speedup: Which categories saw the biggest speedup
  • Operator-level speedup: Top accelerated operators (if operator analysis available)

Key Insight: GEMM and Convolution should show 3-15x speedup with SME2. If not, check:

  • Device supports SME2 (Armv9, Apple M4+)
  • Runners built with SME2 enabled
  • Model operators are delegated to XNNPACK

5. Bottleneck Identification

Category-Level Bottlenecks

  • Data Movement dominant: After SME2, data movement often becomes the bottleneck
    • Action: Focus on transpose elimination, layout optimization, memory access patterns
  • Other category high: May indicate portable operators
    • Action: Identify portable operators and consider delegation

Operator-Level Bottlenecks (from operator analysis)

  • Top operators by E2E weight: Operators consuming >5% of E2E time
  • Portable operators consuming time: Non-delegated operators that should be delegated
  • Operators not accelerated: Operators that should benefit from SME2 but don't

Key Insight: After SME2 accelerates compute, focus on:

  1. Data movement operators (transpose, reshape, copy) - optimize layout, reduce copies
  2. Portable operators (not delegated) - consider delegation to XNNPACK or other backends
  3. Operators with low speedup - investigate why SME2 isn't helping

6. Portable vs Delegated Operator Analysis

Critical for optimization: Identifies operators running in ExecuTorch's portable runtime vs delegated to XNNPACK.

From operator analysis:

  • Portable operators: Running in ExecuTorch portable runtime (slower, not optimized)
  • Delegated operators: Running in XNNPACK backend (faster, optimized)

Key Insight: If portable operators consume significant time (>5% of E2E), consider:

  • Delegation: Check if operator can be delegated to XNNPACK
  • Model architecture changes: Refactor to use delegated operators
  • Custom backend: Implement custom backend for specific operators

Example: If aten::add (portable) consumes 10% of E2E time, consider:

  • Replacing with XNNPACK-delegated equivalent
  • Fusing with other operations
  • Using different model architecture

7. Kernel-Level Insights (If Available)

From kernel view analysis:

  • Kernel selection: Which kernels were used (SME2-accelerated vs standard)
  • SME2 evidence: Kernels with __neonsme2 or sme2 in name
  • Kernel usage patterns: Which operations benefit from SME2

Key Insight: Verify SME2 is actually being used by checking kernel names. If no SME2 kernels appear:

  • Check device supports SME2
  • Verify runners built with SME2 enabled
  • Check operator configurations (dtype, quantization) match SME2 requirements

8. Actionable Recommendations

The report should conclude with specific, actionable recommendations:

  1. What to optimize next: Based on bottleneck analysis
  2. Model architecture changes: Operators to delegate or refactor
  3. Configuration changes: Quantization, dtype, backend partitioning
  4. Further analysis needed: If data is incomplete or unclear

Workflows

macOS Analysis

# 1. Pipeline automatically generates CSV files
python3 model_profiling/scripts/mac_pipeline.py \
  --config model_profiling/configs/my_run.json

# 2. Generate base report
python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/mac

# 3. Generate operator-specific bottleneck analysis (CRITICAL)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/mac/mac_sme2_on/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/mac/mac_sme2_off/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/mac/ \
  --verbose

# 4. Generate kernel view (if trace-enabled runs available)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/mac/mac_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/mac/mac_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/mac/kernel_view.md

Android Analysis

# 1. Pipeline automatically generates CSV files
python3 model_profiling/scripts/android_pipeline.py \
  --config model_profiling/configs/android_config.json

# 2. Generate base report
python3 model_profiling/scripts/generate_report.py \
  --run-dir model_profiling/out_<model>/runs/android

# 3. Generate operator-specific bottleneck analysis (CRITICAL)
python3 model_profiling/tools/analyze_etdump_csv.py \
  --timeline-csv model_profiling/out_<model>/runs/android/android_sme2_on/*_all_runs_timeline.csv \
  --compare model_profiling/out_<model>/runs/android/android_sme2_off/*_all_runs_timeline.csv \
  --name1 "SME2-Off" \
  --name2 "SME2-On" \
  --output-dir model_profiling/out_<model>/runs/android/ \
  --verbose

# 4. Generate kernel view (if trace-enabled runs available)
python3 model_profiling/tools/generate_kernel_view.py \
  --sme2-on-kernels model_profiling/out_<model>/runs/android/android_sme2_on/kernels.csv \
  --sme2-off-kernels model_profiling/out_<model>/runs/android/android_sme2_off/kernels.csv \
  --out model_profiling/out_<model>/runs/android/kernel_view.md

Combined Report (macOS + Android)

python3 model_profiling/scripts/generate_combined_report.py \
  --mac-dir model_profiling/out_<model>/runs/mac \
  --android-dir model_profiling/out_<model>/runs/android \
  --out model_profiling/out_<model>/runs/combined_report.md

Understanding ETDump Event Data Structure

CRITICAL: Understanding ETDump's nested event structure is essential for accurate latency extraction. Incorrect extraction leads to ~3x inflated latency values.

ETDump Event Hierarchy

ETDump events are hierarchically nested. The structure follows this pattern:

Method::execute (E2E latency - outermost container)
├── Program::load_method (model loading overhead)
├── Method::init (initialization overhead)
└── DELEGATE_CALL (delegate invocation overhead)
    └── OPERATOR_CALL (operator invocation overhead)
        └── Actual Operator Execution (e.g., "aten::conv2d", "xnnpack::linear")
            └── Kernel execution (inside XNNPACK backend)

Key Event Types:

  • Method::execute: Outermost container - represents the entire model inference call. This is the correct E2E latency.
  • DELEGATE_CALL: Framework overhead for invoking a delegate (e.g., XNNPACK)
  • OPERATOR_CALL: Framework overhead for invoking an operator
  • Actual operators: Real computation (e.g., aten::conv2d, xnnpack::linear, aten::add)
  • Framework overhead: Program::load_method, Method::init (one-time setup, not per-run)

Why Summing All Rows is Wrong

Common Mistake: Summing all duration_ms values in the timeline CSV.

Why this fails: ETDump events are nested. Each inner event's duration is already included in its parent's duration.

Example:

Method::execute: 100ms
  └── DELEGATE_CALL: 95ms (within Method::execute)
      └── OPERATOR_CALL: 90ms (within DELEGATE_CALL)
          └── aten::conv2d: 85ms (within OPERATOR_CALL)

If you sum all rows:

  • Method::execute: 100ms
  • DELEGATE_CALL: 95ms (already counted in Method::execute)
  • OPERATOR_CALL: 90ms (already counted in DELEGATE_CALL)
  • aten::conv2d: 85ms (already counted in OPERATOR_CALL)
  • Total (WRONG): 370ms (3.7x inflated!)

Correct extraction: Use only Method::execute duration = 100ms
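In code, with the illustrative durations from the example above, the wrong and correct extractions look like this:

```python
# Hedged sketch reproducing the example above: summing every nested row
# inflates latency, while keeping only Method::execute gives the true E2E.
rows = [
    {"name": "Method::execute", "duration_ms": 100.0},
    {"name": "DELEGATE_CALL",   "duration_ms": 95.0},
    {"name": "OPERATOR_CALL",   "duration_ms": 90.0},
    {"name": "aten::conv2d",    "duration_ms": 85.0},
]

wrong = sum(r["duration_ms"] for r in rows)        # 370.0 ms (3.7x inflated)
right = sum(r["duration_ms"] for r in rows
            if r["name"] == "Method::execute")     # 100.0 ms (true E2E)
```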

Correct Latency Extraction Method

Step 1: Filter timeline CSV for Method::execute events only:

method_exec_rows = [row for row in rows if row.get("name") == "Method::execute"]

Step 2: Group by run_index and sum durations per run:

run_totals = defaultdict(float)
for row in method_exec_rows:
    run_idx = int(row.get("run_index", 0))
    duration_ms = float(row.get("duration_ms", 0) or 0)
    run_totals[run_idx] += duration_ms

Step 3: Return per-run latencies:

return [run_totals[i] for i in sorted(run_totals)]

Why per-run: Each run may have multiple Method::execute events (if model is called multiple times). Sum per run to get total latency for that run.
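Steps 1-3 above can be combined into one helper. This is a minimal sketch assuming the documented timeline columns (name, run_index, duration_ms); the demo CSV is synthetic.

```python
# Hedged sketch: per-run E2E latency extraction from a timeline CSV, keeping
# only Method::execute rows and grouping by run_index.
import csv
import os
import tempfile
from collections import defaultdict

def extract_e2e_latencies(timeline_csv_path):
    """Return one E2E latency (ms) per run, in run order."""
    run_totals = defaultdict(float)
    with open(timeline_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("name") != "Method::execute":
                continue  # nested events are already counted in the parent
            run_idx = int(row.get("run_index", 0))
            run_totals[run_idx] += float(row.get("duration_ms", 0) or 0)
    return [run_totals[i] for i in sorted(run_totals)]

# Demo on a tiny synthetic timeline (a nested row deliberately included):
demo = (
    "name,run_index,duration_ms\n"
    "Method::execute,0,100\n"
    "DELEGATE_CALL,0,95\n"
    "Method::execute,1,102\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(demo)
    path = tmp.name
latencies = extract_e2e_latencies(path)  # one value per run
os.unlink(path)
```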

Timeline CSV Structure

The *_all_runs_timeline.csv file contains columns:

  • name: Event name (e.g., "Method::execute", "aten::conv2d")
  • duration_ms: Event duration in milliseconds
  • run_index: Which run this event belongs to (0, 1, 2, ...)
  • timestamp_ms: When the event occurred (relative to run start)
  • parent_id: (Optional) Parent event ID (for nested structure)
  • op_type: (Optional) Operator type
  • backend: (Optional) Backend name (e.g., "XNNPACK", "portable")

Critical columns for latency extraction:

  • name: Must be "Method::execute"
  • duration_ms: Duration of the event
  • run_index: Groups events by run

Verification Steps

Always verify your latency extraction is correct:

  1. Compare with pipeline summary:

    # Check pipeline_summary.json
    cat model_profiling/out_<model>/runs/mac/*_pipeline_summary.json | grep -A 5 "median"
  2. Compare with metrics.json:

    # Check metrics.json
    cat model_profiling/out_<model>/runs/mac/metrics.json | grep -A 5 "median"
  3. Sanity check: E2E latency should be:

    • Greater than the sum of top operator categories (some overhead expected)
    • Well below the naive sum of all timeline rows (that sum is ~3x inflated by nesting; if your value matches it, you summed nested events)
    • Consistent across runs (within 10-20% variance, depending on system load)
  4. Use robust_latency_analysis.py for validation:

    python3 model_profiling/tools/robust_latency_analysis.py \
      --timeline-csv model_profiling/out_<model>/runs/mac/<experiment>/*_all_runs_timeline.csv \
      --output-dir model_profiling/out_<model>/runs/mac/ \
      --verbose

    This tool uses the same Method::execute extraction method and provides statistical validation.
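The sanity checks in step 3 can be expressed directly. A sketch with illustrative numbers; the variable names are assumptions, not output of any tool:

```python
# Hedged sketch of the sanity checks above: the E2E median should exceed the
# operator-category sum, stay well below the naive all-rows sum, and be
# consistent across runs.
from statistics import median

e2e_per_run = [100.0, 101.0, 99.0]
category_sum_ms = 92.0   # sum of top operator categories (illustrative)
all_rows_sum_ms = 370.0  # naive sum of every timeline row (illustrative)

e2e = median(e2e_per_run)
ok_overhead = category_sum_ms < e2e       # some framework overhead expected
ok_not_inflated = e2e < all_rows_sum_ms   # all-rows sum is ~3x inflated
spread = (max(e2e_per_run) - min(e2e_per_run)) / e2e
ok_stable = spread < 0.20                 # within the 10-20% variance band
```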

Common Pitfalls

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Summing all rows | Latency ~3x higher than expected | Use only Method::execute events |
| Including overhead events | Latency includes setup time | Filter out Program::load_method, Method::init |
| Not grouping by run_index | Single latency value instead of per-run | Group by run_index before summing |
| Using wrong CSV file | Missing runs or incorrect data | Use *_all_runs_timeline.csv (not *_run0_timeline.csv) |
| Not handling multiple Method::execute | Missing some runs | Sum all Method::execute events per run |

Key Insights for Report Interpretation

1. Accurate E2E Latency Extraction

Critical: Always use Method::execute events per run, not sum of all rows.

Why: ETDump contains nested events. Summing all rows counts:

  • Operator execution time
  • Delegate call overhead
  • Framework overhead
  • Result: ~3x inflated latency

Correct method: Extract Method::execute duration per run from timeline CSV (see "Understanding ETDump Event Data Structure" section above).

Verification: Compare with *_pipeline_summary.json or metrics.json - should match within 5%.
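The 5% cross-check can be automated. A sketch; the metrics field name "median_ms" is an assumption here, so check it against your actual metrics.json schema.

```python
# Hedged sketch: compare the median extracted from the timeline against the
# median recorded in metrics.json, within a relative tolerance.
import json
from statistics import median

def within_tolerance(extracted_runs_ms, metrics_json_text, tol=0.05):
    extracted = median(extracted_runs_ms)
    reference = json.loads(metrics_json_text)["median_ms"]  # assumed field name
    return abs(extracted - reference) / reference <= tol

# ~1% apart, so the check passes at the 5% tolerance
ok = within_tolerance([100.0, 102.0, 98.0], '{"median_ms": 101.0}')
```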

2. Operator Category Breakdown Interpretation

Before SME2:

  • CONV/GEMM: 60-80% of total time (expected)
  • Data Movement: 10-20% (hidden behind compute)
  • Other: 10-20%

After SME2:

  • CONV/GEMM: 20-30% (accelerated, shrunk)
  • Data Movement: 40-60% (now visible, became bottleneck)
  • Other: 20-30% (may include portable operators)

Key Insight: If Data Movement grows significantly after SME2, focus optimization there.

3. Bottleneck Identification (Operator-Level)

Threshold: Operators consuming >5% of E2E time are potential bottlenecks.

Priority order:

  1. Portable operators consuming >5% - highest priority (should be delegated)
  2. Data movement operators consuming >10% - high priority (optimize layout)
  3. Delegated operators with low speedup - investigate why SME2 isn't helping

Action: Use operator analysis to identify specific operators and their E2E weight.

4. Portable vs Delegated Analysis

How to identify: Check operator backend in *_ops_stats.csv:

  • Delegated: Backend = "XNNPACK" or "qnnpack"
  • Portable: Backend = "portable" or empty

Key Insight: Portable operators run in ExecuTorch's portable runtime (not optimized). If they consume significant time, consider:

  • Delegation: Check if XNNPACK supports the operator
  • Model refactoring: Replace with delegated equivalent
  • Custom backend: Implement optimized backend

Example: If aten::add (portable) consumes 10% of E2E:

  • Check if XNNPACK supports elementwise add
  • Consider fusing with other operations
  • Consider model architecture changes
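The backend convention above (empty or "portable" means portable; anything else, e.g. "XNNPACK", means delegated) can be applied row by row. The records below are illustrative stand-ins for *_ops_stats.csv rows, and the column names are assumptions.

```python
# Hedged sketch: attribute per-operator time to portable vs delegated buckets
# using the "backend" column convention described above.

def split_by_backend(rows):
    buckets = {"portable": 0.0, "delegated": 0.0}
    for row in rows:
        backend = (row.get("backend") or "").strip().lower()
        key = "portable" if backend in ("", "portable") else "delegated"
        buckets[key] += float(row["total_ms"])
    return buckets

rows = [
    {"name": "xnnpack::linear", "backend": "XNNPACK", "total_ms": "55"},
    {"name": "aten::add", "backend": "portable", "total_ms": "10"},
    {"name": "aten::slice", "backend": "", "total_ms": "3"},
]
buckets = split_by_backend(rows)  # portable: 13.0 ms, delegated: 55.0 ms
```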

5. Kernel-Level Verification

Purpose: Verify SME2 is actually being used.

Method: Check kernel names in *_kernels.csv or *_xnntrace.log:

  • SME2 kernels: Contain __neonsme2 or sme2 in name
  • Standard kernels: Contain __neon or __aarch64 but not sme2

Key Insight: If no SME2 kernels appear:

  • Device may not support SME2
  • Runners may not be built with SME2 enabled
  • Operator configurations may not match SME2 requirements

6. Acceleration Analysis

Expected speedups:

  • GEMM: 3-15x (depending on shape, dtype)
  • Convolution: 3-10x (depending on kernel size, stride)
  • Data Movement: No speedup (not accelerated by SME2)
  • Elementwise: Minimal speedup (not compute-bound)

Key Insight: If speedup is lower than expected:

  • Check device supports SME2
  • Verify runners built correctly
  • Check operator configurations (dtype, quantization)
  • Verify operators are delegated to XNNPACK
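Per-category speedup against the expected ranges above can be computed from the two category breakdowns. The numbers below are illustrative, not measured:

```python
# Hedged sketch: per-category speedup from SME2-off vs SME2-on category
# totals (ms), for checking against the expected ranges above.

def category_speedups(off_ms, on_ms):
    return {cat: off_ms[cat] / on_ms[cat]
            for cat in off_ms if cat in on_ms and on_ms[cat] > 0}

off = {"GEMM": 60.0, "Convolution": 30.0, "Data Movement": 10.0}
on = {"GEMM": 10.0, "Convolution": 6.0, "Data Movement": 10.0}
speedups = category_speedups(off, on)
# GEMM 6x, Convolution 5x, Data Movement 1x (not accelerated, as expected)
```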

Troubleshooting

| Issue | Symptom | Fix |
| --- | --- | --- |
| No CSV files found | FileNotFoundError: No CSV files found | Run pipeline or analyze_results.py to generate CSV files |
| Latency values too high | Latency ~3x higher than expected | Ensure extraction uses Method::execute events (not summing all rows) |
| CSV files missing | Report cannot find timeline/stats CSV | Check CSV files exist in same directory as ETDump files |
| Config file not found | Config path in manifest is invalid | Verify config file exists at path specified in manifest.json |
| No operator analysis | Missing operator-specific bottlenecks | Run analyze_etdump_csv.py for operator-level analysis |
| No kernel view | Missing kernel-level insights | Run trace-enabled pipeline and use generate_kernel_view.py |
| Portable operators not identified | Can't tell which operators are portable | Check *_ops_stats.csv for backend column, or use analyze_etdump_csv.py |

Implementation Checklist

  • Virtual environment activated
  • CSV files exist (generated by pipeline or analyze_results.py)
  • manifest.json exists in run directory
  • Base report generated successfully
  • Report contains latency comparison and operator breakdown
  • Latency values align with *_pipeline_summary.json (validation)
  • Operator-specific analysis run (identifies bottlenecks)
  • Kernel view generated (if trace-enabled runs available)
  • Portable vs delegated analysis completed
  • Optimization recommendations identified

Recommendations

  1. Always run operator-specific analysis - Base report shows categories, but operator analysis shows specific bottlenecks
  2. Compare SME2-on vs SME2-off - Use comparison mode in analyze_etdump_csv.py to see acceleration
  3. Identify portable operators - Check backend column in stats CSV to find portable operators
  4. Verify kernel selection - Use kernel view to confirm SME2 is actually being used
  5. Focus on high-impact optimizations - Prioritize operators consuming >5% of E2E time
  6. Document findings - Add insights to report for future reference

References

  • Base report script: model_profiling/scripts/generate_report.py
  • Operator analysis: model_profiling/tools/analyze_etdump_csv.py
  • Kernel view: model_profiling/tools/generate_kernel_view.py
  • Kernel extraction: model_profiling/tools/xnntrace_to_kernels.py
  • Robust latency: model_profiling/tools/robust_latency_analysis.py
  • Combined report: model_profiling/scripts/generate_combined_report.py
  • CSV generation: model_profiling/scripts/analyze_results.py

Assets

  • CSV files: *_all_runs_timeline.csv, *_ops_stats.csv
  • Kernel files: *_kernels.csv, *_xnntrace.log (from trace-enabled runs)
  • Metadata: manifest.json, metrics.json, *_pipeline_summary.json