Skip to content

Multinode Cluster Management#119

Open
tostocker wants to merge 2 commits into
mainfrom
dev/multinode-management
Open

Multinode Cluster Management#119
tostocker wants to merge 2 commits into
mainfrom
dev/multinode-management

Conversation

@tostocker

@tostocker tostocker commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Contains:

  • adds multinode reconnect logic
  • clears remote contexts if connection to master is lost
  • some logging improvements

@tostocker tostocker force-pushed the dev/multinode-management branch from 3375e7a to f6d8318 Compare June 24, 2026 13:10
@tostocker tostocker linked an issue Jun 24, 2026 that may be closed by this pull request
@tostocker tostocker requested a review from tom-kuchler June 24, 2026 13:13
Comment thread multinode/src/client.rs
Comment on lines +112 to +113
sender.flush().await?;
Ok(())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sender.flush().await?;
Ok(())
sender.flush().await

Comment thread multinode/src/client.rs
Comment on lines +199 to +200
/// The sender docket handling for the remote queue server check if we can unite this and the
/// receiver with the other one, by using traits.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// The sender docket handling for the remote queue server check if we can unite this and the
/// receiver with the other one, by using traits.
/// The sender socket handling for the remote queue server.
// TODO: check if we can unite this and the receiver with the other client, by using traits.

Comment thread multinode/src/client.rs
.await
.is_err()
{
// connection lost, the receiver side will trigger the teardown

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not notify from here, the double message is not an issue, since we are tearing down anyway.
I don't see a downside to notifying as soon as possible.

Comment thread multinode/src/client.rs
.await
.unwrap();
.is_err()
{

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be re-enqueueing the work here if the sender fails.
Otherwise we have unexpected loss of invocations.

Comment thread multinode/src/client.rs
sender_handle.abort();
receiver_handle.abort();
notification_handle.abort();
offload_handle.abort();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the logic loop did remove the offload sender from the queue, then the offload loop should end when it sees the sender dropped.
I think it would make more sense to keep that loop running to reenqueue all the work that might still be in the channel to make sure nothing got dropped, than to abort here.

Comment thread multinode/src/client.rs
) {
loop {
receiver.changed().await.unwrap();
if receiver.changed().await.is_err() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the above.

Comment thread multinode/src/client.rs
.is_err()
{
// Could not even send the initial message, let the caller retry the connection.
warn!("Failed to send initial message to remote queue, connection lost");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently do not abort any of the functions we are locally executing, it would make sense to also add a way to release all the local debts, so we don't execute functions that are not awaited anymore.

Comment thread multinode/src/data.rs
}
}

fn fetch_bytes(&self, data_id: u64) -> DandelionResult<ExportedData> {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this function, because it was never called. I think you accidentally readded it.

Comment thread multinode/src/data.rs
/// Drops all exported data. Used when the connection to the node that manages these
/// contexts is lost, so the worker does not hold on to contexts that will never be
/// fetched or explicitly deleted anymore.
pub fn clear_exported_data(&self) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes a single remote that controls all exported data.
Should make sure that is clearly documented or add a way to track the remote owner, so we can only remove those items.

Comment thread server/src/main.rs
}

/// How long to wait between attempts to (re-)connect to the master node.
const RECONNECT_INTERVAL: std::time::Duration = std::time::Duration::from_secs(1);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could become a config parameter at some point, but fine for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multinode Cluster Management

2 participants