
More reliable peers check#3824

Merged
DavidBurkett merged 6 commits into mimblewimble:master from ardocrat:peers_fix
Apr 9, 2026

Conversation

@ardocrat
Contributor

@ardocrat ardocrat commented Apr 3, 2026

  • removed marking of a random peer as healthy
  • added an Unknown state for new peers received from a PEER_LIST request
  • added a last_attempt field to peers, updated when the state changes to Defunct or Healthy
  • check a random 128 peers (64 healthy, 32-64 unknown, 32-128 defunct) no more often than 1 hour since last_attempt
  • mark a peer as Defunct when the ping fails and it gets disconnected
  • do not save a connection time for new peers (gives a chance to clean them up without waiting 2 weeks)
  • do not crash on grin-config.toml parse when DNS resolution fails
  • reconnect to seeds when the first request fails on an empty database, to avoid sync getting stuck
  • do not save outbound peers to the connected list when there are enough outbound peers, and disconnect them immediately (useful for the peers check)
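The candidate selection in the list above can be sketched roughly as follows. This is a minimal std-only illustration, not grin's actual code: the type and function names (`PeerData`, `check_candidates`) are hypothetical, and the real implementation samples the pools randomly rather than taking the first matches.

```rust
// Hypothetical sketch of the peer-check selection described above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State { Healthy, Unknown, Defunct }

#[derive(Clone, Debug)]
struct PeerData { state: State, last_attempt: i64 }

// Re-check a peer no more often than once per hour since last_attempt.
const CHECK_INTERVAL_SECS: i64 = 60 * 60;

/// Pick up to 64 healthy, 64 unknown and 128 defunct peers whose last
/// check attempt is at least an hour old.
fn check_candidates(peers: &[PeerData], now: i64) -> Vec<PeerData> {
    let eligible = |p: &PeerData, s: State| {
        p.state == s && now - p.last_attempt >= CHECK_INTERVAL_SECS
    };
    let mut out: Vec<PeerData> = Vec::new();
    // The PR picks these at random; take the first matches here for brevity.
    out.extend(peers.iter().filter(|&p| eligible(p, State::Healthy)).take(64).cloned());
    out.extend(peers.iter().filter(|&p| eligible(p, State::Unknown)).take(64).cloned());
    out.extend(peers.iter().filter(|&p| eligible(p, State::Defunct)).take(128).cloned());
    out
}
```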

ardocrat added 2 commits April 4, 2026 01:58
… (128 healthy non-connected + 128 defuncts + 128 unknown), mark peer as defunct when ping not passed, do not crash on toml parse with dns failure
…ist only when there is not enough peers + disconnect extra peer immediately, reconnect to seeds at monitor to avoid stuck, update only defunct state to unknown when received existing peer address
Comment thread servers/src/grin/seed.rs
config.clone(),
);

if peers.enough_outbound_peers() {
Member


Why are we removing this guard? If we have enough outbound peers, we shouldn't be trying to connect to more peers.

Comment thread servers/src/grin/seed.rs Outdated
Comment thread p2p/src/peers.rs
flags: State::Unknown,
last_banned: 0,
ban_reason: ReasonForBan::None,
last_connected: 0,
Member


You should change this back to Utc::now().timestamp() or else they'll likely be removed on the next hourly expiry sweep instead of staying around for the 14-day retry window.

Comment thread p2p/src/types.rs
entry.to_socket_addrs().map_err(|_| {
serde::de::Error::custom(format!("Unable to resolve DNS: {}", entry))
});
if let Ok(socket_addrs) = socket_addrs {
Member


I think this is used in too many places to just drop entries on dns failure. The failure could be temporary, but could lead to all seeds, peers_allow, peers_preferred, etc being dropped.
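One way to address this concern would be to fall back to a previously-known address set when resolution fails, rather than dropping the entry. The helper below is hypothetical (`resolve_or_keep` does not exist in grin; the real deserializer in p2p/src/types.rs is structured differently), but it shows the shape of the idea:

```rust
use std::net::{SocketAddr, ToSocketAddrs};

// Hypothetical fallback: resolve an entry, but keep the cached addresses
// on DNS failure instead of silently dropping seeds / peers_allow /
// peers_preferred when the failure is only temporary.
fn resolve_or_keep(entry: &str, cached: &[SocketAddr]) -> Vec<SocketAddr> {
    match entry.to_socket_addrs() {
        Ok(addrs) => addrs.collect(),
        Err(_) => cached.to_vec(),
    }
}
```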

@DavidBurkett
Member

I guess I should've read the description closer before commenting. The things I called out are apparently intentional. I'm missing the whole point of your change, I guess. Can you explain the issue we're having now and why this change solves it? I'm not really seeing the big picture.

@ardocrat
Contributor Author

ardocrat commented Apr 8, 2026

I guess I should've read the description closer before commenting. The things I called out are apparently intentional. I'm missing the whole point of your change, I guess. Can you explain the issue we're having now and why this change solves it? I'm not really seeing the big picture.

Currently we have a lot of unhealthy peers in the database, and once we have enough outbound connections we never check them again, so we don't have a real picture. We also just store a lot of defunct peers and mark them as healthy at random (1 per 20 sec), which isn't accurate either. By design, nodes should only spread really healthy peers for others to connect to; some nodes accumulate a lot of defunct peers and it becomes hard to connect to them, especially if the p2p port is not open and it's impossible to get inbound peers. That's why I mark new peers as Unknown with no connected time set, and also check Defunct peers from time to time.

@DavidBurkett
Member

Ok, I at least understand where the code is coming from now, but the change is too aggressive. Your proposed fix is to repeatedly try 128 peers every 20 seconds whether we need them or not and then immediately disconnect, causing significant connection churn. That creates unnecessary load for us and for healthy peers.

It sounds like what we need is a separate probe path for checking peer status. The Unknown status is optional, but we should add things like last_seen, last_attempt, last_success, fail_count, and retry_after to the Peer DB records. Then we don't need to mark a random peer as healthy, and we'll stop sharing those bad peers with others.
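The record layout proposed above might look something like the sketch below. None of these fields exist in grin's peer store yet; the struct, field names, and backoff policy are only suggestions following the comment, not an agreed design.

```rust
// Hypothetical peer DB record with the probe-tracking fields proposed above.
struct PeerRecord {
    last_seen: i64,     // last time any message was received from the peer
    last_attempt: i64,  // last time we tried to connect or probe
    last_success: i64,  // last successful connection
    fail_count: u32,    // consecutive failed attempts
    retry_after: i64,   // earliest timestamp for the next probe
}

impl PeerRecord {
    fn record_failure(&mut self, now: i64) {
        self.last_attempt = now;
        self.fail_count += 1;
        // Exponential backoff: double the wait per consecutive failure,
        // capped at 60 * 2^10 seconds (~17 hours).
        self.retry_after = now + 60 * (1_i64 << self.fail_count.min(10));
    }

    fn record_success(&mut self, now: i64) {
        self.last_attempt = now;
        self.last_success = now;
        self.last_seen = now;
        self.fail_count = 0;
        self.retry_after = now;
    }
}
```

With fields like these, a probe loop could skip any peer whose `retry_after` is still in the future, instead of marking random peers healthy.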

@ardocrat
Contributor Author

ardocrat commented Apr 8, 2026

last_seen, last_attempt, last_success, fail_count, and retry_after

We have last_seen for ping/pong on already-connected peers now, but it's just informational and not used anywhere. So you think we should check Healthy disconnected peers only after some delay (is 1 hour since last_connected enough?) to reduce load on online peers.

I see it's worth setting last_attempt for Defunct peers, so the same peers aren't included in a check again before that delay has passed. After a success we can mark a Defunct peer as Healthy and try it again based on last_connected after 1 hour (we check peer expiration at exactly this interval).

@ardocrat
Contributor Author

ardocrat commented Apr 9, 2026

last_attempt

Added a last_attempt field to peers in the database, so that disconnected Healthy or Defunct peers are checked no more often than once per hour to avoid extra load. Also stopped asking for more peers during this check when there are already enough outbound connections (I forgot to check this in the previous commits).

@DavidBurkett DavidBurkett merged commit 90dab5f into mimblewimble:master Apr 9, 2026
12 checks passed