
[cuebot] Add memory stranded cores metric #2394

Merged: DiegoTavares merged 4 commits into AcademySoftwareFoundation:master from DiegoTavares:fragmentation_metric on Jun 4, 2026

DiegoTavares (Collaborator) commented Jun 4, 2026

Add three new Prometheus metrics: cue_cores_total, cue_cores_idle, and cue_cores_memory_stranded. These metrics are intended to show how many of the farm's cores are being wasted per allocation because there is not enough memory left to pick up more frames.

Summary by CodeRabbit

  • New Features
    • Prometheus monitoring now exports per-allocation core metrics: total, idle, and memory-stranded cores for active hosts.
  • Configuration
    • Metrics collector wired to host data so stranded-core values are included in metric exports.
  • Tests
    • Added tests to validate stranded-core statistics and per-allocation reporting.


coderabbitai (Bot, Contributor) commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 60199483-0951-449a-a901-ed12c070432e

📥 Commits

Reviewing files that changed from the base of the PR and between e6d0ef5 and cfea074.

📒 Files selected for processing (1)
  • cuebot/src/main/java/com/imageworks/spcue/StrandedCoreStats.java
🚧 Files skipped from review as they are similar to previous changes (1)
  • cuebot/src/main/java/com/imageworks/spcue/StrandedCoreStats.java

📝 Walkthrough


This PR introduces per-allocation stranded core statistics that are collected from the database and exposed as Prometheus gauge metrics. The changes add a new data model, DAO query, service layer, metrics collection logic, Spring wiring, and test coverage to measure memory-constrained cores across allocations.

Changes

Memory-stranded Core Metrics Collection

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data contract for stranded core statistics | cuebot/src/main/java/com/imageworks/spcue/StrandedCoreStats.java | Defines StrandedCoreStats with public final fields for allocation name and three core counts (total, idle, stranded) in units of 100, including Javadoc describing the metric semantics. |
| DAO layer for stranded core statistics | cuebot/src/main/java/com/imageworks/spcue/dao/HostDao.java, cuebot/src/main/java/com/imageworks/spcue/dao/postgres/HostDaoJdbc.java | Exposes getStrandedCoreStats() in HostDao interface and implements it in HostDaoJdbc with a SQL query that aggregates per-allocation core metrics from host_stat joined with alloc, filtering by memory threshold and host/lock states, mapping results to StrandedCoreStats objects. |
| Service layer for stranded core statistics | cuebot/src/main/java/com/imageworks/spcue/service/HostManager.java, cuebot/src/main/java/com/imageworks/spcue/service/HostManagerService.java | Adds getStrandedCoreStats() method to HostManager interface and implements it in HostManagerService as a read-only transactional method that delegates to the DAO layer. |
| Prometheus metrics collection and exposure | cuebot/src/main/java/com/imageworks/spcue/PrometheusMetricsCollector.java | Adds Log4j logging, a HostManager field, and three Prometheus gauges (cue_cores_total, cue_cores_idle, cue_cores_memory_stranded) labeled by environment, host, and allocation. Extends collectPrometheusMetrics() to fetch stranded core stats, scale from per-100 units to whole cores, clear and repopulate gauges, and catch/log errors without disrupting queue metrics. Adds setHostManager(...) setter. |
| Spring dependency configuration | cuebot/src/main/resources/conf/spring/applicationContext-service.xml | Wires the hostManager bean as a property dependency on the prometheusMetricsCollector bean. |
| DAO layer test coverage | cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/HostDaoTests.java | Adds testGetStrandedCoreStats to verify stranded core stats before and after changing idle memory, asserting total cores, idle cores, stranded cores transition, and allocation names. Includes findAllocStats helper to locate allocation stats in results. |
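
To make the data contract and the per-allocation aggregation in the change summary more concrete, here is a minimal in-memory sketch in Java. It is not the PR's code: the real aggregation is done by a SQL query in HostDaoJdbc, which is not shown in this conversation. The class names `HostRow`, `StrandedCoreStatsSketch`, and `StrandedCoreAggregator`, the field names, the memory threshold parameter, and the rule "idle cores on a host count as stranded when that host's idle memory is below the threshold" are all assumptions made for illustration. Only the units (100 per core) and the three counts come from the summary.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one host row; field names are assumptions.
final class HostRow {
    final String allocName;
    final int cores;          // total cores in units of 100 (100 = one core)
    final int idleCores;      // idle cores, same units
    final long idleMemoryKb;  // idle memory on the host

    HostRow(String allocName, int cores, int idleCores, long idleMemoryKb) {
        this.allocName = allocName;
        this.cores = cores;
        this.idleCores = idleCores;
        this.idleMemoryKb = idleMemoryKb;
    }
}

// Same shape as described for StrandedCoreStats: public final fields,
// one allocation name and three counts in units of 100.
final class StrandedCoreStatsSketch {
    public final String allocationName;
    public final long totalCores;
    public final long idleCores;
    public final long strandedCores;

    StrandedCoreStatsSketch(String allocationName, long totalCores,
                            long idleCores, long strandedCores) {
        this.allocationName = allocationName;
        this.totalCores = totalCores;
        this.idleCores = idleCores;
        this.strandedCores = strandedCores;
    }
}

final class StrandedCoreAggregator {
    // Sums core counts per allocation. Assumed rule: a host's idle cores are
    // stranded when its idle memory is below minMemoryKb.
    static List<StrandedCoreStatsSketch> aggregate(List<HostRow> hosts, long minMemoryKb) {
        Map<String, long[]> sums = new LinkedHashMap<>();
        for (HostRow h : hosts) {
            long[] s = sums.computeIfAbsent(h.allocName, k -> new long[3]);
            s[0] += h.cores;
            s[1] += h.idleCores;
            if (h.idleMemoryKb < minMemoryKb) {
                s[2] += h.idleCores;
            }
        }
        List<StrandedCoreStatsSketch> out = new ArrayList<>();
        for (Map.Entry<String, long[]> e : sums.entrySet()) {
            long[] s = e.getValue();
            out.add(new StrandedCoreStatsSketch(e.getKey(), s[0], s[1], s[2]));
        }
        return out;
    }

    // Scales per-100 units to whole cores, as the collector is described as doing.
    static double toWholeCores(long coreUnits) {
        return coreUnits / 100.0;
    }
}
```

With two hosts in one allocation, one of them short on memory, only the idle cores of the memory-starved host are counted as stranded, while total and idle cores are summed across both.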

Sequence Diagram(s)

sequenceDiagram
  participant PrometheusMetricsCollector
  participant HostManager
  participant HostManagerService
  participant HostDaoJdbc
  participant Database
  PrometheusMetricsCollector->>HostManager: getStrandedCoreStats()
  HostManager->>HostManagerService: delegate getStrandedCoreStats()
  HostManagerService->>HostDaoJdbc: getStrandedCoreStats()
  HostDaoJdbc->>Database: execute GET_STRANDED_CORE_STATS SQL
  Database-->>HostDaoJdbc: result rows
  HostDaoJdbc-->>HostManagerService: List<StrandedCoreStats>
  HostManagerService-->>HostManager: List<StrandedCoreStats>
  HostManager-->>PrometheusMetricsCollector: List<StrandedCoreStats>
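
The call chain in the sequence diagram can be sketched as plain Java delegation. This is a simplified illustration, not the PR's code: the `...Sketch` type names are hypothetical, rows are reduced to allocation-name strings, and the read-only transaction handling that the summary attributes to HostManagerService is omitted.

```java
import java.util.List;

// Hypothetical, reduced DAO interface; the real one returns StrandedCoreStats objects.
interface HostDaoSketch {
    List<String> getStrandedCoreStats();
}

// Hypothetical, reduced service interface.
interface HostManagerSketch {
    List<String> getStrandedCoreStats();
}

// The service only delegates to the DAO (transaction handling omitted here).
final class HostManagerServiceSketch implements HostManagerSketch {
    private final HostDaoSketch hostDao;

    HostManagerServiceSketch(HostDaoSketch hostDao) {
        this.hostDao = hostDao;
    }

    @Override
    public List<String> getStrandedCoreStats() {
        return hostDao.getStrandedCoreStats();
    }
}

// The collector receives the manager through a setter, matching the
// property-based wiring described for the Spring configuration.
final class CollectorSketch {
    private HostManagerSketch hostManager;

    void setHostManager(HostManagerSketch hostManager) {
        this.hostManager = hostManager;
    }

    // Returns how many allocation rows were fetched in one collection pass.
    int collect() {
        return hostManager.getStrandedCoreStats().size();
    }
}
```

Keeping the collector dependent only on the manager interface means a test can substitute a fake DAO without a database.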

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • lithorus
  • ramonfigueiredo

🐰 Stranded cores now take center stage,
From database pools to Prometheus gauge,
Allocation by allocation they're tracked,
Memory-starved metrics fully unpacked!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a new memory stranded cores metric to cuebot, which aligns with all file changes and the PR's core objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cuebot/src/main/java/com/imageworks/spcue/PrometheusMetricsCollector.java`:
- Around line 281-284: The stranded-core gauges (coresTotal, coresIdle,
coresMemoryStranded) are being cleared before calling
hostManager.getStrandedCoreStats(), so if that DAO call fails we drop all
series; change the flow to first call hostManager.getStrandedCoreStats() and
capture the result (handle exceptions), and only clear and repopulate the three
gauges after the read succeeds (e.g. if stats list is non-null/returned without
exception), using the same variables coresTotal, coresIdle, coresMemoryStranded
to update values; this preserves existing series on DAO failure.
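
The ordering this comment asks for (read first, then clear and repopulate only on success) can be sketched as follows. This is an illustration under assumptions, not the fix that was committed: `GaugeSketch` is a hypothetical in-memory stand-in for a labeled Prometheus gauge, `StrandedGaugeUpdater` and its `refresh` method are invented names, and the reader is modeled as a `Supplier` that returns stranded cores per allocation in units of 100.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a gauge labeled by allocation; not the Prometheus client API.
final class GaugeSketch {
    private final Map<String, Double> series = new HashMap<>();

    void clear() {
        series.clear();
    }

    void set(String allocation, double value) {
        series.put(allocation, value);
    }

    Map<String, Double> snapshot() {
        return new HashMap<>(series);
    }
}

final class StrandedGaugeUpdater {
    // Reads the stats first; the gauge is cleared and repopulated only if the
    // read succeeded, so a failed read leaves the previous series in place.
    static boolean refresh(GaugeSketch gauge, Supplier<Map<String, Long>> reader) {
        Map<String, Long> strandedUnits;
        try {
            strandedUnits = reader.get();
        } catch (RuntimeException e) {
            return false; // keep existing series; a real collector would log here
        }
        if (strandedUnits == null) {
            return false;
        }
        gauge.clear();
        for (Map.Entry<String, Long> e : strandedUnits.entrySet()) {
            gauge.set(e.getKey(), e.getValue() / 100.0); // per-100 units to whole cores
        }
        return true;
    }
}
```

If the reader throws, `refresh` returns false and the gauge still holds the values from the last successful pass, which is the behavior the comment wants to preserve.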

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab649a14-dd59-4b14-8c44-a1ca05dca291

📥 Commits

Reviewing files that changed from the base of the PR and between 50a6044 and f1dde8e.

📒 Files selected for processing (8)
  • cuebot/src/main/java/com/imageworks/spcue/PrometheusMetricsCollector.java
  • cuebot/src/main/java/com/imageworks/spcue/StrandedCoreStats.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/HostDao.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/HostDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/service/HostManager.java
  • cuebot/src/main/java/com/imageworks/spcue/service/HostManagerService.java
  • cuebot/src/main/resources/conf/spring/applicationContext-service.xml
  • cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/HostDaoTests.java

Comment thread cuebot/src/main/java/com/imageworks/spcue/PrometheusMetricsCollector.java Outdated
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Diego Tavares <dtavares@imageworks.com>

ramonfigueiredo (Collaborator) left a comment

Thanks, @DiegoTavares

Approved with 2 minor changes required/suggested.

  • Missing the Apache license header in two files.

Comment thread cuebot/src/main/java/com/imageworks/spcue/StrandedCoreStats.java
DiegoTavares and others added 2 commits June 4, 2026 13:40
Co-authored-by: Ramon Figueiredo <ramon.fgrd@gmail.com>
Signed-off-by: Diego Tavares <dtavares@imageworks.com>
@DiegoTavares DiegoTavares merged commit c8618f6 into AcademySoftwareFoundation:master Jun 4, 2026
24 of 25 checks passed