More reliable peers check#3824
Conversation
… (128 healthy non-connected + 128 defuncts + 128 unknown), mark peer as defunct when ping not passed, do not crash on toml parse with dns failure
…ist only when there is not enough peers + disconnect extra peer immediately, reconnect to seeds at monitor to avoid stuck, update only defunct state to unknown when received existing peer address
| config.clone(), | ||
| ); | ||
|
|
||
| if peers.enough_outbound_peers() { |
There was a problem hiding this comment.
Why are we removing this guard? If we have enough outbound peers, we shouldn't be trying to connect to more peers.
| flags: State::Unknown, | ||
| last_banned: 0, | ||
| ban_reason: ReasonForBan::None, | ||
| last_connected: 0, |
There was a problem hiding this comment.
You should change this back to Utc::now().timestamp() or else they'll likely be removed on the next hourly expiry sweep instead of staying around for the 14-day retry window.
| entry.to_socket_addrs().map_err(|_| { | ||
| serde::de::Error::custom(format!("Unable to resolve DNS: {}", entry)) | ||
| }); | ||
| if let Ok(socket_addrs) = socket_addrs { |
There was a problem hiding this comment.
I think this is used in too many places to just drop entries on dns failure. The failure could be temporary, but could lead to all seeds, peers_allow, peers_preferred, etc being dropped.
|
I guess I should've read the description closer before commenting. The things I called out are apparently intentional. I'm missing the whole point of your change, I guess. Can you explain the issue we're having now and why this change solves it? I'm not really seeing the big picture. |
Currently we have a lot unhealthy peers at database and not checking them after gaining enough outbound to have real picture, also we are just storing a lot defunct peers and marking them as healthy randomly (1 per 20 sec), whats also not true. Nodes should spread only really healthy peers to connect by design, some nodes are getting a lot of defunct peers and its getting hard to connect to them, especially if p2p port is not open and its impossible to get inbound peers. Its a reason why I mark new peers as Unknown and not specified connected time for them, also checking Defunct peers time to time. |
|
Ok, I at least understand where the code is coming from now, but the change is too aggressive. Your proposed fix is to repeatedly try 128 peers every 20 seconds whether we need them or not and then immediately disconnect, causing significant connection churn. That creates unnecessary load for us and for healthy peers. It sounds like what we need is a separate probe path for checking peer status. The Unknown status is optional, but we should add things like |
we have I see its worth to set |
…hour, store last connection attempt, do not ask for more peers when there is enough outbound
Added |
Unknownstate for new peers, received fromPEER_LISTrequestlast_attemptfield to peers, update when state is changing toDefunctorHealthylast_attemptDefunctwhen ping not passed and it got disconnectedgrin-config.tomlparse with DNS failure