Skip to content

Add Devnet-4 metrics: block production, gossip sizes, sync status#279

Open
pablodeymo wants to merge 3 commits intomainfrom
devnet4-metrics
Open

Add Devnet-4 metrics: block production, gossip sizes, sync status#279
pablodeymo wants to merge 3 commits intomainfrom
devnet4-metrics

Conversation

@pablodeymo
Copy link
Copy Markdown
Collaborator

Motivation

Implements the metrics defined in leanMetrics PR #29 (Devnet-4 metrics). These metrics provide observability into block production performance, gossip message sizes, and node sync status — critical for monitoring multi-client devnets.

Description

Block Production Metrics (5 new)

Instrumented in propose_block() (lib.rs) and produce_block_with_signatures() (store.rs):

Metric Type Buckets Instrumentation point
lean_block_building_time_seconds Histogram 0.01–1s RAII guard over entire propose_block()
lean_block_building_payload_aggregation_time_seconds Histogram 0.1–4s RAII guard over build_block() call
lean_block_aggregated_payloads Histogram 1–128 Observed after build_block() returns
lean_block_building_success_total Counter After successful process_block()
lean_block_building_failures_total Counter On produce_block_with_signatures or process_block error

Gossip Message Size Metrics (3 new)

Instrumented in handle_gossipsub_message() (handler.rs), measuring compressed (wire) size:

Metric Type Buckets
lean_gossip_block_size_bytes Histogram 10KB–5MB
lean_gossip_attestation_size_bytes Histogram 512B–16KB
lean_gossip_aggregation_size_bytes Histogram 1KB–1MB

Sync Status Gauge (1 new)

Metric Type Labels
lean_node_sync_status IntGaugeVec status=idle|syncing|synced

The metric and API (set_sync_status()) are defined but not yet wired — will be activated when #246 (skip validator duties while syncing) merges.

Bucket Update (1 change)

lean_committee_signatures_aggregation_time_seconds buckets updated from [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0] to [0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 4.0] — the old upper bound of 1s was too low for production aggregation times.

Files changed

File Changes
crates/blockchain/src/metrics.rs 6 new statics, 7 API functions, 7 init lines, 1 bucket edit
crates/blockchain/src/lib.rs Timing guard + success/failure counters in propose_block()
crates/blockchain/src/store.rs Payload aggregation timing + payload count observation
crates/net/p2p/src/metrics.rs 3 gossip size histograms + 3 API functions
crates/net/p2p/src/gossipsub/handler.rs 3 size observations on message receive

Test plan

  • cargo fmt --all -- --check passes
  • cargo clippy --workspace -- -D warnings passes
  • cargo test --workspace --release passes (all 97 tests)
  • Deploy to devnet and verify new metrics appear on /metrics endpoint
  • Verify block production histograms record during normal operation
  • Verify gossip size histograms populate when receiving blocks/attestations

…d bucket update (leanMetrics PR #29)

  - Block production: building time, payload aggregation time, aggregated payload count,
    success/failure counters
  - Gossip message sizes: block, attestation, and aggregation compressed bytes
  - Sync status gauge with idle/syncing/synced labels (wiring deferred to PR #246)
  - Update committee_signatures_aggregation buckets from [0.005..1s] to [0.05..4s]
@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

No blocking findings.

The patch appears telemetry-only: it does not modify fork choice, attestation validation, STF, SSZ decoding, or XMSS verification behavior. I did not see a consensus or security regression in the added code paths.

One non-blocking observability note:

  1. The metric names are narrower than the code they currently time. crates/blockchain/src/lib.rs:225 starts time_block_building(), but that guard spans proposer-attestation signing, local block import, and the gossip publish attempt before it drops; it is not just “build block”. Similarly, crates/blockchain/src/store.rs:817 times all of build_block(), which includes process_slots/process_block, not only payload aggregation, despite the help text at crates/blockchain/src/metrics.rs:281 and crates/blockchain/src/metrics.rs:291. Either narrow the guard scope or rename the metrics so operators do not draw the wrong conclusions from them.

Residual risk: I reviewed the diff and surrounding code, but I did not run tests.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Kimi Code Review

The PR adds comprehensive metrics for block building and gossipsub message sizes. Overall the implementation is correct and follows Rust best practices, but there are two items to address:

1. Type safety issue in sync status metric (crates/blockchain/src/metrics.rs, lines 553–562)

The set_sync_status function accepts a raw &str, which allows typos to silently set all gauge values to 0 without updating any status to 1. This creates an invalid metrics state.

// Current: typo results in all zeros
pub fn set_sync_status(status: &str) {
    for label in &["idle", "syncing", "synced"] {
        LEAN_NODE_SYNC_STATUS
            .with_label_values(&[label])
            .set(i64::from(*label == status));  // All false if typo
    }
}

Recommendation: Use an enum to enforce valid states at compile time:

pub enum SyncStatus { Idle, Syncing, Synced }

pub fn set_sync_status(status: SyncStatus) {
    let status_str = match status { /* ... */ };
    for label in &["idle", "syncing", "synced"] {
        let value = i64::from(label == &status_str);
        LEAN_NODE_SYNC_STATUS.with_label_values(&[label]).set(value);
    }
}

2. Missing initialization for P2P metrics (crates/net/p2p/src/metrics.rs)

The new gossip size histograms (LEAN_GOSSIP_BLOCK_SIZE_BYTES, etc.) are not forced in any init() function visible in the diff. If the crate follows the pattern used in blockchain/src/metrics.rs (lines 358–370), these should be added to ensure metrics appear in Prometheus with 0 values before first use.

Minor observations:

  • crates/blockchain/src/store.rs (line 823): The explicit drop(_payload_timing) before observing the payload count is correct and prevents the histogram from including the observation overhead. Good practice.

  • crates/blockchain/src/lib.rs (lines 228, 286): The failure counter logic correctly distinguishes between block production failures and local processing failures without double-counting. The early returns ensure inc_block_building_success() (line 291) is only reached when both stages succeed.

  • Gossip message sizes: Observing `message


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 14, 2026

Greptile Summary

This PR adds nine new Prometheus metrics covering block production timing, gossip wire sizes, and node sync status, plus a bucket-range correction for lean_committee_signatures_aggregation_time_seconds. All new metrics follow existing codebase conventions (lean_ prefix, LazyLock statics, TimingGuard for RAII timing) and are correctly wired into the instrumentation points described in the PR.

Confidence Score: 5/5

  • Safe to merge; the only finding is a minor P2 metrics inconsistency that does not affect consensus or block production correctness.
  • All findings are P2 or lower. The missing inc_block_building_failures() on the sign_attestation path creates a counter/histogram count mismatch but has no impact on node behaviour. Everything else — bucket ranges, metric names, init registration, gossip size placement — is correct.
  • crates/blockchain/src/lib.rs — see comment on sign_attestation failure path.

Important Files Changed

Filename Overview
crates/blockchain/src/lib.rs Adds block-building RAII timer and success/failure counters to propose_block(); the sign_attestation failure path is missing a inc_block_building_failures() call, creating a counter/histogram count mismatch.
crates/blockchain/src/metrics.rs Adds 5 new block-production metrics (2 histograms, 2 counters, 1 payload histogram) and 1 IntGaugeVec for sync status; all are force-initialized in init() and have appropriate bucket ranges. Bucket update for lean_committee_signatures_aggregation_time_seconds is correct.
crates/blockchain/src/store.rs Adds payload-aggregation RAII timer around build_block() and observes signatures.len() (one proof per attestation, confirmed by verify_signatures invariant) as the payload count; explicit drop to stop timer before the cheap Ok return is clean and correct.
crates/net/p2p/src/gossipsub/handler.rs Wires three gossip-size observations before decompression on each topic branch; measuring compressed wire size is consistent with the PR's stated intent.
crates/net/p2p/src/metrics.rs Adds three gossip-size histograms with sensible bucket ranges (block 10KB–5MB, attestation 512B–16KB, aggregation 1KB–1MB); follows existing lazy-init pattern of the p2p metrics module.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[propose_block] -->|start| T1["_block_timing = time_block_building()"]
    T1 --> B[produce_block_with_signatures]
    B -->|start| T2["_payload_timing = time_block_building_payload_aggregation()"]
    T2 --> C[build_block]
    C --> D["drop(_payload_timing) — stops payload timer"]
    D --> E["observe_block_aggregated_payloads(signatures.len())"]
    E -->|Error| F["inc_block_building_failures() + return"]
    E -->|Ok| G[sign_attestation]
    G -->|Error| H["return ⚠️ no counter"]
    G -->|Ok| I[process_block]
    I -->|Error| J["inc_block_building_failures() + return"]
    I -->|Ok| K["inc_block_building_success()"]
    K --> L[publish to gossip]
    L --> M["_block_timing drops — records full duration"]
Loading

Comments Outside Diff (1)

  1. crates/blockchain/src/lib.rs, line 259-268 (link)

    P2 Missing failure counter on sign_attestation error path

    If sign_attestation fails, _block_timing is dropped and records a duration, but neither inc_block_building_failures() nor inc_block_building_success() is called. This means lean_block_building_time_seconds._count will exceed lean_block_building_success_total + lean_block_building_failures_total, making it impossible to reconcile the two metrics accurately.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/blockchain/src/lib.rs
    Line: 259-268
    
    Comment:
    **Missing failure counter on `sign_attestation` error path**
    
    If `sign_attestation` fails, `_block_timing` is dropped and records a duration, but neither `inc_block_building_failures()` nor `inc_block_building_success()` is called. This means `lean_block_building_time_seconds._count` will exceed `lean_block_building_success_total + lean_block_building_failures_total`, making it impossible to reconcile the two metrics accurately.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/blockchain/src/lib.rs
Line: 259-268

Comment:
**Missing failure counter on `sign_attestation` error path**

If `sign_attestation` fails, `_block_timing` is dropped and records a duration, but neither `inc_block_building_failures()` nor `inc_block_building_success()` is called. This means `lean_block_building_time_seconds._count` will exceed `lean_block_building_success_total + lean_block_building_failures_total`, making it impossible to reconcile the two metrics accurately.

```suggestion
        let Ok(proposer_signature) = self
            .key_manager
            .sign_attestation(validator_id, &proposer_attestation.data)
            .inspect_err(|err| {
                metrics::inc_block_building_failures();
                error!(%slot, %validator_id, %err, "Failed to sign proposer attestation");
            })
        else {
            return;
        };
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Add Devnet-4 metrics: block production, ..." | Re-trigger Greptile

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have enough context to write a thorough review.


Review: PR #279 — Devnet-4 Metrics

Summary

Clean, well-scoped implementation. The structure is consistent with existing metrics patterns, the init() additions are correct, and the one-hot gauge encoding for sync status is idiomatic. A few issues worth addressing.


Findings

crates/blockchain/src/lib.rs

  • Missing failure counter on signing error (lines 260–268): The sign_attestation failure path returns early without calling metrics::inc_block_building_failures():

    let Ok(proposer_signature) = self
        .key_manager
        .sign_attestation(validator_id, &proposer_attestation.data)
        .inspect_err(
            |err| error!(%slot, %validator_id, %err, "Failed to sign proposer attestation"),
        )
    else {
        return;  // no metrics::inc_block_building_failures() here
    };

    lean_block_building_failures_total will silently undercount when key signing fails (e.g., key rotation boundary, exhausted OTS index). This is a real failure mode for XMSS-based signing. Needs a metrics::inc_block_building_failures() call in that else branch.

  • lean_block_building_time_seconds upper bucket is 1.0s, but the covered scope can exceed it. The RAII guard spans all of propose_block(), which includes produce_block_with_signatures — whose own inner timer (lean_block_building_payload_aggregation_time_seconds) goes up to 4s. On a production node doing slow XMSS aggregation, block building will regularly overflow into the +Inf bucket with no fine-grained resolution. Consider aligning the upper bound with the aggregation timer (e.g. extend to [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 4.0]) or document that +Inf overflow is expected.


crates/blockchain/src/store.rs

  • _payload_timing naming with explicit drop: The _ prefix conventionally signals "intentionally unused — drop immediately on bind." Here the variable lives until the explicit drop() call, which is the right design. Using payload_timing (no underscore) avoids the misleading signal and suppresses no warning because drop() counts as usage.

crates/net/p2p/src/metrics.rs

  • Gossip metric descriptions omit "compressed." The PR description correctly states these measure compressed wire size, but the metric help strings say only "Bytes size of a gossip block/attestation message." Dashboard authors reading /metrics won't know whether the value is pre- or post-decompression. Adding "compressed" to the description strings avoids ambiguity (e.g. "Compressed (wire) bytes size of a received gossip block message").

  • No eager initialization path. The blockchain metrics are force-initialized in init() so they appear at /metrics from startup. The three new p2p gossip histograms lack that, so they won't appear until the first gossip message of each type is received. This is inconsistent with the existing pattern — dashboards querying these metrics immediately after startup will get an empty series. If p2p/metrics.rs has no init() at all, it's worth adding one and calling it from startup.


crates/blockchain/src/metrics.rs

  • set_sync_status with unknown input silently zeroes all gauges. If called with a status string outside ["idle", "syncing", "synced"], all three gauges go to 0 with no error. Since this function isn't yet wired, it's a low-priority concern, but worth using a typed enum instead of &str before it lands — the function signature gives no indication of the valid values.

  • Bucket update for lean_committee_signatures_aggregation_time_seconds (old [0.005…1.0] → new [0.05…4.0]): The change is well-justified by observed production times. The loss of resolution below 50ms for fast paths (if any) is acceptable given the XMSS aggregation cost.


Non-issues

  • drop(_payload_timing) in store.rs is correct and intentional — ? on build_block would also correctly drop the timer on error. The explicit drop cleanly separates timing from payload count observation.
  • The success/failure counter coverage at produce_block_with_signatures and process_block is correct and mutually exclusive.
  • The observe_block_aggregated_payloads(signatures.len()) correctly reflects the number of aggregated signature payloads in the built block.
  • The one-hot pattern in set_sync_status is idiomatic for Prometheus enum-style gauges.

Automated review by Claude (Anthropic) · sonnet · custom prompt


let known_block_roots = store.get_block_roots();

let _payload_timing = metrics::time_block_building_payload_aggregation();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go only when we are aggregating payloads

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Moved the timer inside build_block, so it only starts when !aggregated_payloads.is_empty().

…et_sync_status

  Address PR #279 review comments: the payload aggregation timing guard
  now only starts when there are payloads to aggregate, and the unused
  set_sync_status function and LEAN_NODE_SYNC_STATUS metric are removed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants