Multinode Cluster Management#119
Conversation
3375e7a to
f6d8318
Compare
| sender.flush().await?; | ||
| Ok(()) |
There was a problem hiding this comment.
| sender.flush().await?; | |
| Ok(()) | |
| sender.flush().await |
| /// The sender docket handling for the remote queue server check if we can unite this and the | ||
| /// receiver with the other one, by using traits. |
There was a problem hiding this comment.
| /// The sender docket handling for the remote queue server check if we can unite this and the | |
| /// receiver with the other one, by using traits. | |
| /// The sender socket handling for the remote queue server. | |
| // TODO: check if we can unite this and the receiver with the other client, by using traits. |
| .await | ||
| .is_err() | ||
| { | ||
| // connection lost, the receiver side will trigger the teardown |
There was a problem hiding this comment.
Why not notify from here, the double message is not an issue, since we are tearing down anyway.
I don't see a downside to notifying as soon as possible.
| .await | ||
| .unwrap(); | ||
| .is_err() | ||
| { |
There was a problem hiding this comment.
Should be re-enqueueing the work here if the sender fails.
Otherwise we have unexpected loss of invocations.
| sender_handle.abort(); | ||
| receiver_handle.abort(); | ||
| notification_handle.abort(); | ||
| offload_handle.abort(); |
There was a problem hiding this comment.
If the logic loop did remove the offload sender from the queue, then the offload loop should end when it sees the sender dropped.
I think it would make more sense to keep that loop running to reenqueue all the work that might still be in the channel to make sure nothing got dropped, than to abort here.
| ) { | ||
| loop { | ||
| receiver.changed().await.unwrap(); | ||
| if receiver.changed().await.is_err() { |
| .is_err() | ||
| { | ||
| // Could not even send the initial message, let the caller retry the connection. | ||
| warn!("Failed to send initial message to remote queue, connection lost"); |
There was a problem hiding this comment.
We currently do not abort any of the functions we are locally executing, it would make sense to also add a way to release all the local debts, so we don't execute functions that are not awaited anymore.
| } | ||
| } | ||
|
|
||
| fn fetch_bytes(&self, data_id: u64) -> DandelionResult<ExportedData> { |
There was a problem hiding this comment.
I removed this function, because it was never called. I think you accidentally readded it.
| /// Drops all exported data. Used when the connection to the node that manages these | ||
| /// contexts is lost, so the worker does not hold on to contexts that will never be | ||
| /// fetched or explicitly deleted anymore. | ||
| pub fn clear_exported_data(&self) { |
There was a problem hiding this comment.
This assumes a single remote that controls all exported data.
Should make sure that is clearly documented or add a way to track the remote owner, so we can only remove those items.
| } | ||
|
|
||
| /// How long to wait between attempts to (re-)connect to the master node. | ||
| const RECONNECT_INTERVAL: std::time::Duration = std::time::Duration::from_secs(1); |
There was a problem hiding this comment.
Could become a config parameter at some point, but fine for now.
Contains: