Multinode Cluster Management by tostocker · Pull Request #119 · eth-easl/dandelion

tostocker · 2026-06-24T13:07:23Z

Contains:

adds multinode reconnect logic
clears remote contexts if connection to master is lost
some logging improvements

…better logging

tom-kuchler · 2026-06-24T13:34:10Z

+    sender.flush().await?;
+    Ok(())


Suggested change

-    sender.flush().await?;
-    Ok(())
+    sender.flush().await
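
To illustrate the suggestion above, here is a minimal synchronous sketch using `std::io::Write` as a stand-in for the async sender in the PR. The function names `flush_verbose` and `flush_direct` are hypothetical. The two forms are interchangeable only when the error type returned by `flush` is exactly the function's error type, because `?` also applies a `From` conversion that the direct return does not.

```rust
use std::io::{self, Write};

/// Verbose form: propagate the error with `?`, then return `Ok(())`.
fn flush_verbose<W: Write>(sender: &mut W) -> io::Result<()> {
    sender.flush()?;
    Ok(())
}

/// Shortened form: return the result of `flush` directly.
/// This compiles only because `flush` already returns `io::Result<()>`.
fn flush_direct<W: Write>(sender: &mut W) -> io::Result<()> {
    sender.flush()
}
```

If the surrounding function in the PR returns a different error type than the flush call, the `?` form (or an explicit `map_err`) would still be needed.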

tom-kuchler · 2026-06-24T13:53:20Z

+/// The sender docket handling for the remote queue server check if we can unite this and the
+/// receiver with the other one, by using traits.


Suggested change

-/// The sender docket handling for the remote queue server check if we can unite this and the
-/// receiver with the other one, by using traits.
+/// The sender socket handling for the remote queue server.
+// TODO: check if we can unite this and the receiver with the other client, by using traits.

tom-kuchler · 2026-06-24T13:55:02Z

+            .await
+            .is_err()
+        {
+            // connection lost, the receiver side will trigger the teardown


Why not notify from here? The double message is not an issue, since we are tearing down anyway.
I don't see a downside to notifying as soon as possible.
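
One way to make a double notification harmless is an idempotent teardown signal that both the sender and the receiver side may trigger. The sketch below is an assumption about how this could look, not the mechanism used in the PR; `TeardownSignal` and its methods are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical shared teardown flag; cloning shares the same underlying flag.
#[derive(Clone, Default)]
struct TeardownSignal {
    requested: Arc<AtomicBool>,
}

impl TeardownSignal {
    /// Requests teardown. Returns `true` only for the first caller,
    /// so a second notification from the other side is a no-op.
    fn notify(&self) -> bool {
        !self.requested.swap(true, Ordering::SeqCst)
    }

    /// Reports whether any side has requested teardown.
    fn is_requested(&self) -> bool {
        self.requested.load(Ordering::SeqCst)
    }
}
```

With this shape, whichever side notices the lost connection first starts the teardown, and the later notification changes nothing.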

tom-kuchler · 2026-06-24T14:02:31Z

            .await
-            .unwrap();
+            .is_err()
+        {


We should re-enqueue the work here if the sender fails.
Otherwise invocations are lost unexpectedly.
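
A minimal sketch of the re-enqueue idea, using the standard library channel as a stand-in for the async channel in the PR. `Invocation`, `send_or_requeue`, and the local `VecDeque` are hypothetical; the sketch relies only on the fact that `std::sync::mpsc::SendError` hands back the value that could not be sent.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for an offloaded invocation.
#[derive(Debug, PartialEq)]
struct Invocation(u64);

/// Tries to hand `work` to the remote side. If the channel is closed,
/// the work is recovered from the error and pushed onto the local queue.
/// Returns `true` if the work was sent, `false` if it was re-enqueued.
fn send_or_requeue(
    remote: &Sender<Invocation>,
    local_queue: &mut VecDeque<Invocation>,
    work: Invocation,
) -> bool {
    match remote.send(work) {
        Ok(()) => true,
        Err(failed) => {
            // `SendError` wraps the unsent value, so nothing is lost.
            local_queue.push_back(failed.0);
            false
        }
    }
}
```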

tom-kuchler · 2026-06-24T14:49:10Z

+    sender_handle.abort();
+    receiver_handle.abort();
+    notification_handle.abort();
+    offload_handle.abort();


If the logic loop removed the offload sender from the queue, the offload loop should end on its own once it sees that the sender was dropped.
Instead of aborting here, I think it would make more sense to keep that loop running so it can re-enqueue all the work that might still be in the channel and nothing gets dropped.
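
The draining behaviour described above can be sketched with the standard library channel, assuming the offload loop owns the receiving end. `drain_into_local_queue` and the `u64` work items are hypothetical placeholders. The loop needs no abort: iterating a `Receiver` yields everything still buffered and ends once all senders are dropped.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Receiver;

/// Moves every item still buffered in the offload channel back onto the
/// local queue and returns how many items were re-enqueued.
fn drain_into_local_queue(offload_rx: Receiver<u64>, local_queue: &mut VecDeque<u64>) -> usize {
    let mut requeued = 0;
    // The iterator blocks while senders exist and stops once every sender
    // has been dropped and the buffer is empty.
    for work in offload_rx {
        local_queue.push_back(work);
        requeued += 1;
    }
    requeued
}
```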

tom-kuchler · 2026-06-24T14:54:51Z

 ) {
    loop {
-        receiver.changed().await.unwrap();
+        if receiver.changed().await.is_err() {


Same as the above.

tom-kuchler · 2026-06-24T15:00:22Z

+        .is_err()
+    {
+        // Could not even send the initial message, let the caller retry the connection.
+        warn!("Failed to send initial message to remote queue, connection lost");


We currently do not abort any of the functions we are executing locally. It would make sense to also add a way to release all the local debts, so we don't execute functions that nobody is awaiting anymore.

tom-kuchler · 2026-06-24T15:01:35Z

+        }
+    }
+
+    fn fetch_bytes(&self, data_id: u64) -> DandelionResult<ExportedData> {


I removed this function because it was never called. I think you accidentally re-added it.

tom-kuchler · 2026-06-24T15:03:52Z

+    /// Drops all exported data. Used when the connection to the node that manages these
+    /// contexts is lost, so the worker does not hold on to contexts that will never be
+    /// fetched or explicitly deleted anymore.
+    pub fn clear_exported_data(&self) {


This assumes a single remote that controls all exported data.
We should either document that assumption clearly or add a way to track the remote owner, so that we only remove the items belonging to the lost remote.
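
A minimal sketch of the owner-tracking alternative, under the assumption that each exported item can be tagged with an identifier of the remote that manages it. `RemoteId`, `ExportedStore`, and `clear_exported_data_for` are hypothetical names, and the sketch uses `&mut self` for brevity where the PR's `clear_exported_data` takes `&self`.

```rust
use std::collections::HashMap;

/// Hypothetical identifier for the remote node that manages an export.
type RemoteId = u32;

/// Hypothetical store that remembers the owner of every exported item.
struct ExportedStore {
    entries: HashMap<u64, (RemoteId, Vec<u8>)>,
}

impl ExportedStore {
    fn new() -> Self {
        ExportedStore { entries: HashMap::new() }
    }

    /// Registers exported data under `data_id` on behalf of `owner`.
    fn export(&mut self, data_id: u64, owner: RemoteId, bytes: Vec<u8>) {
        self.entries.insert(data_id, (owner, bytes));
    }

    /// Drops only the data owned by `owner`; returns the number of removed items.
    fn clear_exported_data_for(&mut self, owner: RemoteId) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, (entry_owner, _)| *entry_owner != owner);
        before - self.entries.len()
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}
```

When the connection to one remote is lost, only that remote's entries are removed and data exported for other remotes stays available.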

tom-kuchler · 2026-06-24T15:07:05Z

 }

+/// How long to wait between attempts to (re-)connect to the master node.
+const RECONNECT_INTERVAL: std::time::Duration = std::time::Duration::from_secs(1);


Could become a config parameter at some point, but fine for now.
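
If the interval does become configurable, the retry loop could take it from a config value instead of a constant. The sketch below is an assumption, not the PR's implementation: `ReconnectConfig`, `max_attempts`, and `connect_with_retry` are hypothetical, and a blocking `thread::sleep` stands in for the async sleep the real code would presumably use.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical reconnect settings that could replace `RECONNECT_INTERVAL`.
struct ReconnectConfig {
    interval: Duration,
    max_attempts: u32,
}

/// Calls `connect` until it succeeds or `max_attempts` is reached,
/// sleeping for `config.interval` between failed attempts.
fn connect_with_retry<T, E>(
    config: &ReconnectConfig,
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match connect() {
            Ok(connection) => return Ok(connection),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if attempt >= config.max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(config.interval);
                attempt += 1;
            }
        }
    }
}
```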

tostocker added 2 commits June 24, 2026 07:10

Multinode reconnect logic, clear remote contexts on connection lost, …

cb2ed7a

…better logging

Added multinode reconnect test

f6d8318

tostocker force-pushed the dev/multinode-management branch from 3375e7a to f6d8318 Compare June 24, 2026 13:10

tostocker linked an issue Jun 24, 2026 that may be closed by this pull request

Multinode Cluster Management #114

Open

tostocker requested a review from tom-kuchler June 24, 2026 13:13

tom-kuchler requested changes Jun 24, 2026

Multinode Cluster Management #119
tostocker wants to merge 2 commits into main from dev/multinode-management

2 participants
