
Add sentinel_peer_info metric for Sentinel peer discovery #1097

Merged
oliver006 merged 9 commits into oliver006:master from tomatopunk:master on Apr 7, 2026

Conversation

@tomatopunk
Contributor

Add sentinel_peer_info metric for Sentinel peer discovery

Summary

When the exporter is pointed at a Redis Sentinel, it now exposes a new metric sentinel_peer_info that lists other Sentinel peers discovered via SENTINEL SENTINELS. This allows discovering all Sentinel addresses by scraping a single Sentinel instance.

Motivation

  • Operators need to know the full set of Sentinel peers (e.g. for monitoring, failover tooling, or service discovery).
  • Scraping one Sentinel is enough: SENTINEL SENTINELS <master_name> returns the other Sentinels monitoring that master. Each peer is emitted as one time series, so there is no overlap or overwriting when multiple Sentinels are monitored.

Changes

New metric: sentinel_peer_info

  • Type: Gauge (value 1, info-style).
  • Labels: master_name, master_address, name, ip, port, runid, flags.
  • Only stable, low-cardinality labels are used to avoid time-series churn (e.g. no down_after_milliseconds or voted_leader).
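As a rough sketch (not the PR's actual code), each peer entry returned by SENTINEL SENTINELS <master_name> is a flat list of alternating field names and values; the label set described above could be assembled from it like this, with hypothetical helper names:

```go
package main

import "fmt"

// peerLabels picks out the stable, low-cardinality fields used as labels
// for sentinel_peer_info. The reply for one peer from SENTINEL SENTINELS
// arrives as a flat list of alternating field names and values.
func peerLabels(reply []string) map[string]string {
	detail := map[string]string{}
	for i := 0; i+1 < len(reply); i += 2 {
		detail[reply[i]] = reply[i+1]
	}
	labels := map[string]string{}
	// Keep only the fields listed in the PR description; volatile fields
	// such as down-after-milliseconds are deliberately dropped.
	for _, k := range []string{"name", "ip", "port", "runid", "flags"} {
		labels[k] = detail[k]
	}
	return labels
}

func main() {
	reply := []string{
		"name", "abc", "ip", "192.168.1.2", "port", "26379",
		"runid", "abc", "flags", "sentinel",
		"down-after-milliseconds", "5000", // dropped: too volatile for a label
	}
	fmt.Println(peerLabels(reply)["ip"]) // prints 192.168.1.2
}
```

master_name and master_address come from the enclosing SENTINEL MASTERS loop rather than from this per-peer reply, which is why they are not extracted here.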

Example output

# HELP redis_sentinel_peer_info Other Sentinel peers discovered via SENTINEL SENTINELS (one scrape from one Sentinel)
# TYPE redis_sentinel_peer_info gauge
redis_sentinel_peer_info{master_name="mymaster",master_address="127.0.0.1:6379",name="abc...",ip="192.168.1.2",port="26379",runid="abc...",flags="sentinel"} 1

Testing

  • Existing Sentinel tests remain valid; the new metric is emitted in addition to sentinel_master_ok_sentinels.
  • No new dependencies; behavior is backward compatible (new metric only when scraping a Sentinel).

@oliver006
Owner

Thanks for the PR - I'll try to review this in the coming days.

One thing I noticed - I don't see a single new test for this. Can you please make sure that new functionality has complete code coverage?

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.11%. Comparing base (d7bb38f) to head (f710954).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
main.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1097      +/-   ##
==========================================
+ Coverage   81.88%   82.11%   +0.22%     
==========================================
  Files          20       20              
  Lines        2700     2728      +28     
==========================================
+ Hits         2211     2240      +29     
  Misses        370      370              
+ Partials      119      118       -1     


@coveralls

coveralls commented Mar 6, 2026

Coverage Report for CI Build 24058978132

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR.

Coverage increased (+0.2%) to 85.849%

Details

  • Coverage increased (+0.2%) from the base build.
  • Patch coverage: 2 uncovered changes across 1 file (39 of 41 lines covered, 95.12%).
  • No coverage regressions found.

Uncovered Changes

File Changed Covered %
main.go 2 0 0.0%



Coverage Stats

Coverage Status
Relevant Lines: 3293
Covered Lines: 2827
Line Coverage: 85.85%
Coverage Strength: 14010.85 hits per line

💛 - Coveralls

@oliver006
Owner

quick bump - are you still planning on finishing this?

@tomatopunk
Contributor Author

> quick bump - are you still planning on finishing this?

Sorry, I’ve been quite busy lately and had actually forgotten about this. But I’ve added the missing part now.

That said, regarding the high-cardinality metrics, I think it would be good for us to discuss this a bit further. The reason I included run_id and the IP/port information is to help capture certain transient issues more accurately. For example, Redis might restart on its own outside of our alerting window, or an instance could be replaced. In cases like these, the total instance count may remain the same, while the actual instances themselves have changed. In other words, it would no longer be the same run_id, and the cluster may already have completed a new election cycle.

My intention was to make the cluster state more transparent and easier to observe.

That said, I completely agree that high cardinality is a real concern. If the metrics are stored in a system like Prometheus, which does not handle high-cardinality data very well, it could lead to significant write amplification.

Do you think it’s necessary for us to expose this information? If not, I could also add an enable flag so that it is disabled by default.

@oliver006
Owner

Thanks for getting back to this PR!

> Do you think it’s necessary for us to expose this information? If not, I could also add an enable flag so that it is disabled by default.

I don't think it's needed, want to just go without it for now?

@oliver006
Owner

fwiw, tests failed

@tomatopunk
Contributor Author

> fwiw, tests failed

I added the InclSentinelPeerInfo flag, defaulting to false. The sentinel_peer_info metric is only registered and emitted when it is set to true.

I also fixed the unit test issue from this round, so it should work properly now.

One thing to call out: I removed log.SetLevel(log.DebugLevel) from exporter/info_test.go, since after TestKeyspaceStringParser runs, it leaves the log level set to debug for all following tests, which causes a large amount of log output.

By the way, I can’t trigger CI from my side, so could you please help run it? Thanks!
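A minimal sketch of that gating, using hypothetical names that mirror the description (an Options struct with InclSentinelPeerInfo defaulting to false); the actual PR wires the flag through the exporter's option struct and the Prometheus client instead of rendering strings:

```go
package main

import "fmt"

// Options is a hypothetical stand-in for the exporter's option struct.
type Options struct {
	InclSentinelPeerInfo bool // zero value false: the metric is opt-in
}

// peerInfoLines renders sentinel_peer_info samples only when the flag is
// set; with the flag off, nothing is emitted and no other behavior changes.
func peerInfoLines(opts Options, peers []map[string]string) []string {
	if !opts.InclSentinelPeerInfo {
		return nil
	}
	var out []string
	for _, p := range peers {
		out = append(out, fmt.Sprintf(
			`redis_sentinel_peer_info{name=%q,ip=%q,port=%q,runid=%q,flags=%q} 1`,
			p["name"], p["ip"], p["port"], p["runid"], p["flags"]))
	}
	return out
}

func main() {
	peers := []map[string]string{{
		"name": "abc", "ip": "192.168.1.2", "port": "26379",
		"runid": "abc", "flags": "sentinel",
	}}
	fmt.Println(len(peerInfoLines(Options{}, peers)))                           // 0: disabled by default
	fmt.Println(len(peerInfoLines(Options{InclSentinelPeerInfo: true}, peers))) // 1: one series per peer
}
```

Keeping the metric description registered while skipping emission (as suggested in review below the diff) means the gate lives in one place, matching how the exporter handles its other opt-in flags.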

if v, ok := sentinelDetailMap["runid"]; ok {
runid = v
}
flags := ""
Owner

let's use this down in line 175 and onwards instead of extracting it again down there

masterOkSentinels := 1
masterName := ""
masterAddr := ""
if len(labels) >= 2 {
Owner

Do we still want to emit the metric if labels is empty?
(plus we outright use labels down in 197 for the other metric)
would be good to be consistent here and maybe just return if labels is empty?

e.metricDescriptions[k] = newMetricDescr(opts.Namespace, k, desc.txt, desc.lbls)
}

if !opts.InclSentinelPeerInfo {
Owner

I don't think we need to delete the description (and we don't do it for any other flags), just not emitting the metric should be enough?

@oliver006 oliver006 merged commit 035deef into oliver006:master on Apr 7, 2026
9 checks passed