
More reliable peers check#3824

Merged
DavidBurkett merged 6 commits into mimblewimble:master from ardocrat:peers_fix
Apr 9, 2026

Conversation

@ardocrat
Contributor

@ardocrat ardocrat commented Apr 3, 2026

  • removed marking of a random peer as healthy
  • added an Unknown state for new peers received from a PEER_LIST request
  • added a last_attempt field to peers, updated when the state changes to Defunct or Healthy
  • check a random 128 peers (64 healthy, 32-64 unknown, 32-128 defunct) no more often than 1 hour since last_attempt
  • mark a peer as Defunct when the ping fails and it gets disconnected
  • do not save a connection time for new peers (gives a chance to clean them up without waiting 2 weeks)
  • do not crash on grin-config.toml parse when DNS resolution fails
  • reconnect to seeds when the first request fails on an empty database, to avoid sync getting stuck
  • do not save outbound peers to the connected list when there are enough outbound peers, and disconnect them immediately (useful for the peers check)
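The candidate selection in the list above can be sketched roughly as follows. This is a minimal std-only illustration, not grin's actual code: the type and function names (`PeerData`, `check_candidates`) are hypothetical, and the real implementation samples the pools randomly rather than taking the first matches.

```rust
// Hypothetical sketch of the peer-check selection described above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State { Healthy, Unknown, Defunct }

#[derive(Clone, Debug)]
struct PeerData { state: State, last_attempt: i64 }

// Re-check a peer no more often than once per hour since last_attempt.
const CHECK_INTERVAL_SECS: i64 = 60 * 60;

/// Pick up to 64 healthy, 64 unknown and 128 defunct peers whose last
/// check attempt is at least an hour old.
fn check_candidates(peers: &[PeerData], now: i64) -> Vec<PeerData> {
    let eligible = |p: &PeerData, s: State| {
        p.state == s && now - p.last_attempt >= CHECK_INTERVAL_SECS
    };
    let mut out: Vec<PeerData> = Vec::new();
    // The PR picks these at random; take the first matches here for brevity.
    out.extend(peers.iter().filter(|&p| eligible(p, State::Healthy)).take(64).cloned());
    out.extend(peers.iter().filter(|&p| eligible(p, State::Unknown)).take(64).cloned());
    out.extend(peers.iter().filter(|&p| eligible(p, State::Defunct)).take(128).cloned());
    out
}
```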

ardocrat added 2 commits April 4, 2026 01:58
… (128 healthy non-connected + 128 defuncts + 128 unknown), mark peer as defunct when ping not passed, do not crash on toml parse with dns failure
…ist only when there is not enough peers + disconnect extra peer immediately, reconnect to seeds at monitor to avoid stuck, update only defunct state to unknown when received existing peer address
Comment thread servers/src/grin/seed.rs
config.clone(),
);

if peers.enough_outbound_peers() {
Member


Why are we removing this guard? If we have enough outbound peers, we shouldn't be trying to connect to more peers.

Comment thread servers/src/grin/seed.rs Outdated
Comment thread p2p/src/peers.rs
flags: State::Unknown,
last_banned: 0,
ban_reason: ReasonForBan::None,
last_connected: 0,
Member


You should change this back to Utc::now().timestamp() or else they'll likely be removed on the next hourly expiry sweep instead of staying around for the 14-day retry window.

Comment thread p2p/src/types.rs
entry.to_socket_addrs().map_err(|_| {
serde::de::Error::custom(format!("Unable to resolve DNS: {}", entry))
});
if let Ok(socket_addrs) = socket_addrs {
Member


I think this is used in too many places to just drop entries on dns failure. The failure could be temporary, but could lead to all seeds, peers_allow, peers_preferred, etc being dropped.
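One way to address this concern would be to fall back to a previously-known address set when resolution fails, rather than dropping the entry. The helper below is hypothetical (`resolve_or_keep` does not exist in grin; the real deserializer in p2p/src/types.rs is structured differently), but it shows the shape of the idea:

```rust
use std::net::{SocketAddr, ToSocketAddrs};

// Hypothetical fallback: resolve an entry, but keep the cached addresses
// on DNS failure instead of silently dropping seeds / peers_allow /
// peers_preferred when the failure is only temporary.
fn resolve_or_keep(entry: &str, cached: &[SocketAddr]) -> Vec<SocketAddr> {
    match entry.to_socket_addrs() {
        Ok(addrs) => addrs.collect(),
        Err(_) => cached.to_vec(),
    }
}
```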

@DavidBurkett
Member

I guess I should've read the description closer before commenting. The things I called out are apparently intentional. I'm missing the whole point of your change, I guess. Can you explain the issue we're having now and why this change solves it? I'm not really seeing the big picture.

@ardocrat
Contributor Author

ardocrat commented Apr 8, 2026

I guess I should've read the description closer before commenting. The things I called out are apparently intentional. I'm missing the whole point of your change, I guess. Can you explain the issue we're having now and why this change solves it? I'm not really seeing the big picture.

Currently we have a lot of unhealthy peers in the database, and once we have enough outbound connections we never check them again, so we don't have a real picture. We also just store a lot of defunct peers and mark them as healthy at random (1 per 20 sec), which isn't accurate either. By design, nodes should only spread really healthy peers for others to connect to; some nodes accumulate a lot of defunct peers and it becomes hard to connect to them, especially if the p2p port is not open and it's impossible to get inbound peers. That's why I mark new peers as Unknown with no connected time set, and also check Defunct peers from time to time.

@DavidBurkett
Member

Ok, I at least understand where the code is coming from now, but the change is too aggressive. Your proposed fix is to repeatedly try 128 peers every 20 seconds whether we need them or not and then immediately disconnect, causing significant connection churn. That creates unnecessary load for us and for healthy peers.

It sounds like what we need is a separate probe path for checking peer status. The Unknown status is optional, but we should add things like last_seen, last_attempt, last_success, fail_count, and retry_after to the Peer DB records. Then we don't need to mark a random peer as healthy, and we'll stop sharing those bad peers with others.
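The record layout proposed above might look something like the sketch below. None of these fields exist in grin's peer store yet; the struct, field names, and backoff policy are only suggestions following the comment, not an agreed design.

```rust
// Hypothetical peer DB record with the probe-tracking fields proposed above.
struct PeerRecord {
    last_seen: i64,     // last time any message was received from the peer
    last_attempt: i64,  // last time we tried to connect or probe
    last_success: i64,  // last successful connection
    fail_count: u32,    // consecutive failed attempts
    retry_after: i64,   // earliest timestamp for the next probe
}

impl PeerRecord {
    fn record_failure(&mut self, now: i64) {
        self.last_attempt = now;
        self.fail_count += 1;
        // Exponential backoff: double the wait per consecutive failure,
        // capped at 60 * 2^10 seconds (~17 hours).
        self.retry_after = now + 60 * (1_i64 << self.fail_count.min(10));
    }

    fn record_success(&mut self, now: i64) {
        self.last_attempt = now;
        self.last_success = now;
        self.last_seen = now;
        self.fail_count = 0;
        self.retry_after = now;
    }
}
```

With fields like these, a probe loop could skip any peer whose `retry_after` is still in the future, instead of marking random peers healthy.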

@ardocrat
Contributor Author

ardocrat commented Apr 8, 2026

last_seen, last_attempt, last_success, fail_count, and retry_after

We have last_seen for ping/pong on already-connected peers now, but it's just informational and not used anywhere. So you think we should check Healthy disconnected peers only after some delay (is 1 hour since last_connected enough?) to reduce load on online peers.

I see it's worth setting last_attempt for Defunct peers, so the same peers aren't included in a check again before that delay has passed. After a success we can mark a Defunct peer as Healthy and try it again based on last_connected after 1 hour (we check peer expiration at exactly this interval).

@ardocrat
Contributor Author

ardocrat commented Apr 9, 2026

last_attempt

Added a last_attempt field to peers in the database, so that disconnected Healthy or Defunct peers are checked no more often than once per hour to avoid extra load. Also stopped asking for more peers during this check when there are already enough outbound connections (I forgot to check this in the previous commits).

@DavidBurkett DavidBurkett merged commit 90dab5f into mimblewimble:master Apr 9, 2026
12 checks passed