feat(health): Experimental streaming log support via SSE (and its messy implications) #711
mkoci wants to merge 44 commits into NVIDIA:main from
Conversation
Matthias247
left a comment
I have not looked at everything so far.
But one high level question upfront: Is there a specific reason that SSE connections need to make the internal event handling asynchronous? My understanding is it could still be synchronous (the events are written to channels synchronously and can then be processed on the other side at arbitrary speed).
Is the reasoning here backpressure? If yes, it could likely be handled other ways too. E.g. we could implement a channel where we just drop the newest (or oldest) events if the processing infrastructure can't keep up. But @yoks is certainly the expert on how it's designed so far and can provide more input.
```rust
/// Trait for collectors that maintain a long-lived stream (SSE, gRPC, etc.)
/// runtime.rs creates the BMC client and injects it, the collector opens the stream and maps payloads to events
pub trait StreamingCollector<B: Bmc>: Send + 'static {
```
Is there any reason this isn't using async_trait? That's much more ergonomic
There was a very specific reason I was unable to use async_trait (which I forgot); I'm not sure if that's the case here. I think we need to re-evaluate that.
Yes, I've updated the description with a bit more detail on the why of backpressure. If the BMCs already buffer naturally with DBus, what's the upside of dropping logs/events?
I don't see anything blocking swapping in async_trait; let me know and I'll add it to this PR.
I think there is no value in backpressure from the source, as you do not control the source (you could not backpressure it). So the only solution for a slow sink is to buffer/drop the tail if the buffer is full. This removes the whole async pipeline and makes it easier.
Or do I fail to see the bigger picture here?
Also, SSE->File does not require backpressure, nor does SSE->Tracing. So I think it is a characteristic of the Sink (OTEL).
Yes I think you are right. And in a system with 2 variable ends that you don't control (neither the frequency at which logs are produced nor the frequency at which the sink accepts logs), there is pretty much no way besides dropping something at some point in time.
So I think keeping the pipeline synchronous for the moment is ok. We can still change it in the future once we learn more about the problem.
@yoks I can understand what you're saying about not controlling the source.
At the same time "you could not backpressure it" I partially agree... OpenBMC buffers 10MB on the HTTP sender side. I have no idea if HPE, Dell, or others that support SSE (some only support POST) have generous buffers like OpenBMC. Point taken, and at the same time this back-pressure does work for OpenBMC and follows normal HTTP and underlying TCP patterns for sender/receiver blocking.
Big question: Are we supporting Redfish Servers that support SSE, but don't follow proper async guidelines?
HTTP/2 is not new; am I wrong in thinking that client-side blocking is not supported for SSE on Redfish servers? We are not talking about primitive Redfish servers that offer push support via POST, we're talking about applying back-pressure to Redfish servers that support SSE subscriptions only, correct? That is the only place I'm applying any back-pressure.
@Matthias247 So we are totally fine keeping health synchronous and just dropping events? If so, this is much easier. Note, some of these events could be leak events, not just OOB logs 😂
The OTLPSink side is a completely different animal. Depending on how sharding is implemented in the health crate and what type of scaling happens in the OTLP client-side, there are knobs that can be tuned there, no?
There are several points here. First, backpressure from the client buffer could work, but it does not really solve our problem. A log event is something which already happened; you still need to consume it. If you backpressure at a rate where your consumer is slower than the producer, you will lose data, and worse - you will lose recent data, not old data (depending on how it is built). So backpressure is not really possible here.
Second point - your async pipeline can be split into log/tracing/OTEL. Only OTEL is a slow consumer, so you are trying async backpressure propagation just because of OTEL.
I believe building the same buffer in front of the slow sink gives us more control over how we work with a slow consumer. One solution is a very simple buffer which drops the tail (oldest message) if it is at capacity. Another solution, which we already kind of have, is to write everything to a file and launch the OTEL exporter as a sidecar (or separate process), which will read from the file and publish it down the line. So the file becomes this buffer.
And finally, HTTP/2, while it smells like something you can do backpressure with, is not flow-aware; it works at the TCP level. So you would not be able to build a real backpressure system.
To make backpressure work properly you want a closed-loop system where the producer can wait; usually this is a lossless producer (Kafka is a simple example). Logs over HTTP/2 are lossy by definition, so backpressure brings more problems than it is worth.
If you want to make your logs durable, and not lossy, you need to write them to durable storage as soon as possible and transfer them to the logging system from there. But my impression is that we are not treating them as critical (a leak event is a good example, but we do not act on the event, we act on sensor data).
```rust
]);

let registry = collector_registry.registry();
let metrics = StreamMetrics::new(registry, collector_registry.prefix(), const_labels)?;
```
Creating new metric objects per stream would lead to accumulation of a lot of similar objects, which would then conflict during metric scraping - unless there is a per-endpoint label.
It would also lead to leaking the metric objects when the endpoint is no longer monitored (host force-deleted).
We should find a way to create the metric registries only once upfront, and make it possible for the collectors to update the latest data.
Collectors are created per-endpoint, so this should be fine.
There are two types of metrics: Prometheus sink ones and general service ones. I already started moving them into ComponentMetrics. Matt added this StreamMetrics; I think we should consolidate them and make the split between Service-level metrics and Sink-level metrics clear.
@Matthias247 agreed, it's not ideal.
@yoks Perhaps I could track this as a GitHub issue to be addressed later, or would you rather I just tackle it in this PR?
Let's create an issue. Too much to tackle in a single PR.
```rust
let connection_state = Gauge::with_opts(
    Opts::new(
        format!("{prefix}_stream_connection_state"),
        "Stream connection state per SSE readyState: 0=CONNECTING, 1=OPEN, 2=CLOSED",
```
we shouldn't be bound to what the browser and JS are doing, and could just use the labels in text form?
I'm not sure we even need this metric. Seems oddly specific
Oof, this was an artifact of testing on rack and trying to validate what was going on with connections (after pulling my hair out before discovering this)
Will remove.
```rust
    format!("log_collector_{}", endpoint.addr.hash_key()),
    metrics_prefix,
)?);
match create_log_file_writer(
```
I don't have the historic background on this. But I'm surprised we create the file writer in only one of the branches (periodic).
My take is that whether we collect logs via polling or SSE is an implementation detail on the BMC side => We do whatever is more efficient and quicker and what is supported. In any case, the result will be logs in our pipeline.
And these logs can then be written to sinks (e.g. files).
==> I'd decouple the sinks, set them up once, and make sure the logs from both paths somehow get there.
There was a problem hiding this comment.
Agree, a file writer sink is a good way to separate them. And then stream logs via the Periodic or SSE collector (to both file/OTEL etc).
I think OTLP should be modelled as a sink; as it is async and can have backpressure, I think you can reuse
Yes, they should be synchronous as long as they are not hitting a slow sink (e.g. API/OTEL). I think the File sink should be fine as well. Answered above, but a design similar to how HealthOverride is handled looks like the way they can be handled.
Pull request overview
Adds experimental Redfish LogEntries streaming via SSE to reduce bursty periodic log scrapes, and introduces an async OTLP (gRPC) export path with bounded-channel backpressure to protect the health service under load.
Changes:
- Introduces an `EventPipeline` that preserves the existing synchronous sink behavior while optionally forwarding OTLP-relevant events through a bounded async channel for backpressure.
- Adds a streaming-collector runtime abstraction + SSE log collector, with logs collection now configurable as `sse` (default) or `periodic`.
- Implements OTLP log export (proto generation, event→OTLP conversion, drain task with batching/flush/retry).
Reviewed changes
Copilot reviewed 24 out of 26 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile.toml | Formatting adjustments for env vars and rustfmt tasks. |
| crates/ssh-console/Cargo.toml | Reformat tokio features list. |
| crates/health/src/processor/mod.rs | Adds handle_and_collect to capture original + derived events for downstream forwarding. |
| crates/health/src/pipeline.rs | New async EventPipeline that forwards OTLP-relevant events via bounded channel. |
| crates/health/src/otlp/mod.rs | New OTLP module with generated proto includes + re-exports. |
| crates/health/src/otlp/drain.rs | New OTLP drain task: connect, batch, flush, retry w/ backoff. |
| crates/health/src/otlp/convert.rs | Converts internal events/logs into OTLP ExportLogsServiceRequest. |
| crates/health/src/lib.rs | Wires EventPipeline, OTLP drain lifecycle, and SSE-related error type. |
| crates/health/src/discovery/spawn.rs | Spawns logs collectors based on mode (SSE vs periodic) and injects pipeline. |
| crates/health/src/discovery/iteration.rs | Threads EventPipeline through discovery iteration/spawn. |
| crates/health/src/config.rs | Adds OTLP sink config + logs mode config/validation and updates tests. |
| crates/health/src/collectors/sensors.rs | Switches sensor collector emission to async pipeline. |
| crates/health/src/collectors/runtime.rs | Adds streaming collector trait/runtime, SSE open helper, and backoff utilities. |
| crates/health/src/collectors/nvue/rest/collector.rs | Switches NVUE collector emission to async pipeline. |
| crates/health/src/collectors/nmxt.rs | Switches NMX-T collector emission to async pipeline. |
| crates/health/src/collectors/mod.rs | Re-exports streaming runtime types and new SSE log collector. |
| crates/health/src/collectors/logs/sse.rs | New SSE log collector mapping Redfish EventService payloads into log events. |
| crates/health/src/collectors/logs/periodic.rs | Periodic logs updated to use pipeline + optional file writer + rotation tweak. |
| crates/health/src/collectors/logs/mod.rs | New logs module split (periodic + sse). |
| crates/health/src/collectors/firmware.rs | Switches firmware collector emission to async pipeline. |
| crates/health/example/config.example.toml | Updates example to document collectors.logs.mode and periodic sub-table. |
| crates/health/Cargo.toml | Adds prost/tonic-prost deps and build dep for proto compilation. |
| crates/health/build.rs | New build script fetching/compiling OTLP protos for generated gRPC client. |
| crates/health/benches/processor_pipeline.rs | Adds bench for handle_and_collect overhead. |
| crates/bmc-explorer/Cargo.toml | Reformat nv-redfish feature list. |
| Cargo.lock | Locks new prost/tonic-prost deps (and related lockfile churn). |
Comments suppressed due to low confidence (1)
crates/health/src/collectors/logs/periodic.rs:612
`last_seen_ids` is updated even when writing the log batch to disk fails. If the log file is a required output, this can permanently skip entries on the next iteration (data loss on disk). Consider only advancing `last_seen_ids` (and `total_log_count`) after a successful write, or otherwise making the failure semantics explicit (e.g., retry on next run when the writer errors).
```rust
if sinks.is_empty() {
    return Ok(None);
}

let composite_sink: Arc<dyn DataSink> =
    Arc::new(CompositeDataSink::new(sinks, metrics_manager.clone()));
let inner = EventProcessingPipeline::new(processors, composite_sink, metrics_manager);
```
build_event_pipeline returns Ok(None) when sinks is empty, before considering sinks.otlp. This makes OTLP-only deployments impossible (events will never be forwarded to the OTLP drain when all synchronous sinks are disabled). Consider treating OTLP as a consumer for pipeline creation (e.g., build the pipeline when OTLP is enabled even if sinks is empty, using a no-op/composite sink for the synchronous side).
```rust
dyn futures::TryStream<
    Ok = EventStreamPayload,
    Error = HealthError,
    Item = Result<EventStreamPayload, HealthError>,
```
SseStream is defined with dyn futures::TryStream<..., Item = ...>, but TryStream doesn’t have an Item associated type (it inherits Stream<Item = Result<Ok, Error>>). This will not compile. Drop the Item = ... constraint (or express it via an additional Stream<Item = ...> bound if needed).
```diff
-    Item = Result<EventStreamPayload, HealthError>,
```
```rust
// production CI/Docker builds should set OTLP_PROTO_DIR to a pre-fetched Dockerfile layer to avoid runtime network deps
// the curl fallback below should only be used for local development.
fn fetch_otlp_protos(out_dir: &Path) -> PathBuf {
    if let Ok(dir) = std::env::var("OTLP_PROTO_DIR") {
        let path = PathBuf::from(dir);
        if path.exists() {
            return path;
        }
    }

    let proto_dir = out_dir.join("otlp-proto");

    for proto_file in OTLP_PROTO_FILES {
        let dest = proto_dir.join(proto_file);
        if dest.exists() {
            continue;
        }

        std::fs::create_dir_all(dest.parent().unwrap()).expect("create proto parent dirs");

        let url = format!("{OTLP_PROTO_BASE_URL}/{OTLP_PROTO_VERSION}/{proto_file}");

        let status = Command::new("curl")
            .args(["-sSfL", "--create-dirs", "-o"])
            .arg(&dest)
            .arg(&url)
            .status()
            .expect("curl must be available to download OTLP proto files");
```
The build script falls back to downloading proto files via curl at build time and only declares rerun-if-env-changed=OTLP_PROTO_DIR. This can make builds non-reproducible/offline-hostile and also won’t re-run when the contents of OTLP_PROTO_DIR change. Consider (a) emitting cargo:rerun-if-changed= for the proto inputs, and (b) gating the network download behind an explicit opt-in (or vendoring the protos) so CI/release builds never depend on external network/curl availability.
```rust
loop {
    tokio::select! {
        _ = self.cancel.cancelled() => return None,
        result = endpoint.connect() => {
```
endpoint.connect() is called inside a retry loop on a single endpoint value created by Channel::from_shared(...). In tonic, connect() typically consumes the endpoint/connector, so this pattern is likely to fail to compile or only allow a single attempt. Consider constructing/cloning the endpoint per attempt (e.g., keep an Endpoint/builder and call endpoint.clone().connect().await) so retries don’t move the original value.
```diff
-        result = endpoint.connect() => {
+        result = endpoint.clone().connect() => {
```
> In tonic, connect() typically consumes the endpoint/connector
No copilot, no. If it consumed it, this would fail to compile. bad bot.
```rust
/// SSE is the preferred mode for real-time log streaming.
/// Periodic polling is retained as a fallback for BMCs that lack SSE support.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum LogCollectionMode {
    Sse,
    Periodic,
}
```
The docs on LogCollectionMode suggest periodic polling is a fallback for BMCs that lack SSE support, but the implementation selects a single global mode (collectors.logs.mode) and does not appear to auto-fallback per-endpoint when SSE is unavailable (e.g., HealthError::SseNotAvailable). Consider clarifying the comment (manual/operator fallback) or implementing an automatic downgrade path when SSE isn’t supported.
From the PR description I added after this comment:
> If we don't care about dropping, this is much easier to implement similar to the
kensimon
left a comment
Haven't looked at the code yet but this really stuck out to me, I think we ought to vendor these files rather than shelling out to curl on every build.
```rust
const OTLP_PROTO_VERSION: &str = "v1.5.0";
const OTLP_PROTO_BASE_URL: &str =
    "https://raw.githubusercontent.com/open-telemetry/opentelemetry-proto";
```
Can we put these in the repo rather than downloading them as part of build.rs? I feel like builds should be as "sealed" as possible and the external dependency management should all be in one place (i.e. Cargo.lock and the cargo cache)... shelling out to curl in a build.rs is super surprising behavior, especially if it's just to avoid vendoring 4 proto files...
kensimon
left a comment
These are all nitpicks and can be ignored, only the curl-in-build-rs issue is a "request changes" from me.
```rust
let sleep = tokio::time::sleep(self.flush_interval);
tokio::pin!(sleep);

loop {
    tokio::select! {
        _ = self.cancel.cancelled() => {
            self.flush(&mut client, &mut batch).await;
            break;
        }
        maybe_event = self.receiver.recv() => {
            match maybe_event {
                Some(event) => {
                    batch.push(event);
                    if batch.len() >= self.batch_size {
                        self.flush(&mut client, &mut batch).await;
                        sleep.as_mut().reset(tokio::time::Instant::now() + self.flush_interval);
                    }
                }
                None => {
                    self.flush(&mut client, &mut batch).await;
                    break;
                }
            }
        }
        _ = &mut sleep => {
            if !batch.is_empty() {
                self.flush(&mut client, &mut batch).await;
            }
            sleep.as_mut().reset(tokio::time::Instant::now() + self.flush_interval);
```
Nit: Consider using tokio::time's Interval here rather than hand-rolling one with tokio::time::sleep():
Suggested change:
```rust
let mut interval = tokio::time::interval(self.flush_interval);
loop {
    tokio::select! {
        _ = self.cancel.cancelled() => {
            self.flush(&mut client, &mut batch).await;
            break;
        }
        maybe_event = self.receiver.recv() => {
            match maybe_event {
                Some(event) => {
                    batch.push(event);
                    if batch.len() >= self.batch_size {
                        self.flush(&mut client, &mut batch).await;
                        interval.reset();
                    }
                }
                None => {
                    self.flush(&mut client, &mut batch).await;
                    break;
                }
            }
        }
        _ = interval.tick() => {
            if !batch.is_empty() {
                self.flush(&mut client, &mut batch).await;
            }
        }
    }
}
```
```rust
loop {
    tokio::select! {
        _ = self.cancel.cancelled() => return None,
        result = endpoint.connect() => {
            match result {
                Ok(channel) => {
                    tracing::info!(endpoint = %self.endpoint, "connected to otlp collector");
                    return Some(LogsServiceClient::new(channel));
                }
                Err(error) => {
                    let delay = backoff.next_delay();
                    tracing::warn!(
                        ?error,
                        endpoint = %self.endpoint,
                        retry_in = ?delay,
                        "failed to connect to otlp collector"
                    );
                    tokio::select! {
                        _ = self.cancel.cancelled() => return None,
                        _ = tokio::time::sleep(delay) => {}
                    }
                }
            }
        }
    }
}
```
Nit: tokio::select! is a bit of a footgun sometimes due to cancel-safety issues... this blog post covers a lot of them.
This code doesn't have any problems, but it's nice to avoid the select altogether if you can, since it's simpler to audit select! calls if there are fewer of them:
Suggested change:
```rust
loop {
    match self.cancel.run_until_cancelled(endpoint.connect()).await? {
        Ok(channel) => {
            tracing::info!(endpoint = %self.endpoint, "connected to otlp collector");
            return Some(LogsServiceClient::new(channel));
        }
        Err(error) => {
            let delay = backoff.next_delay();
            tracing::warn!(
                ?error,
                endpoint = %self.endpoint,
                retry_in = ?delay,
                "failed to connect to otlp collector"
            );
            self.cancel
                .run_until_cancelled(tokio::time::sleep(delay))
                .await?;
        }
    };
}
```
```rust
tokio::select! {
    _ = self.cancel.cancelled() => break,
    _ = tokio::time::sleep(delay) => continue,
}
```
Nit: this can be
```rust
if self
    .cancel
    .run_until_cancelled(tokio::time::sleep(delay))
    .await
    .is_none()
{
    break;
}
```
but if you disagree feel free to ignore, I don't feel strongly about it.
```rust
    return;
}

let request = build_export_request(batch);
```
Nit: It looks like all code paths are calling batch.clear() before returning... maybe we should just do that here after building the request, so that we don't have to do it in multiple return paths below?
In fact, you could have build_export_request(batch) take the batch by value instead of reference, and just do .drain(..) here. Something like:
```rust
pub fn build_export_request(
    batch: impl IntoIterator<Item = (EventContext, CollectorEvent)>,
) -> ExportLogsServiceRequest {
    // ...
}
```
and here:
```rust
let request = build_export_request(batch.drain(..));
```
Then convert_event() and resource_attributes() can take an owned CollectorEvent and EventContext, and you can avoid some of the clones.
```rust
let stream = tokio::select! {
    _ = cancel_clone.cancelled() => {
        metrics.connection_state.set(STREAM_STATE_CLOSED);
        return;
    }
    result = collector.connect() => result,
};
```
Nit: Perhaps more readable and avoids the select:
```rust
let Some(stream) = cancel_clone.run_until_cancelled(collector.connect()).await
else {
    metrics.connection_state.set(STREAM_STATE_CLOSED);
    return;
};
```
```rust
tokio::select! {
    _ = cancel_clone.cancelled() => {
        metrics.connection_state.set(STREAM_STATE_CLOSED);
        return;
    }
    _ = tokio::time::sleep(delay) => continue,
}
```
Nit
```rust
if cancel_clone
    .run_until_cancelled(tokio::time::sleep(delay))
    .await
    .is_none()
{
    metrics.connection_state.set(STREAM_STATE_CLOSED);
    return;
}
```
```rust
let item = tokio::select! {
    _ = cancel_clone.cancelled() => {
        metrics.connection_state.set(STREAM_STATE_CLOSED);
        tracing::info!(
            collector_type,
            endpoint = ?endpoint.addr,
            "streaming collector shutting down"
        );
        return;
    }
    item = stream.next() => item,
};
```
Nit
```rust
let Some(item) = cancel_clone.run_until_cancelled(stream.next()).await else {
    metrics.connection_state.set(STREAM_STATE_CLOSED);
    tracing::info!(
        collector_type,
        endpoint = ?endpoint.addr,
        "streaming collector shutting down"
    );
    return;
};
```
```rust
tokio::select! {
    _ = cancel_clone.cancelled() => {
        metrics.connection_state.set(STREAM_STATE_CLOSED);
        return;
    }
    _ = tokio::time::sleep(delay) => {}
}
```
Nit
```rust
if cancel_clone
    .run_until_cancelled(tokio::time::sleep(delay))
    .await
    .is_none()
{
    metrics.connection_state.set(STREAM_STATE_CLOSED);
    return;
}
```
```rust
const STREAM_STATE_CLOSED: f64 = 2.0;

pub struct StreamMetrics {
    connection_state: Gauge,
```
Nit: This can be an IntGauge and then you don't need the awkwardness of representing discrete states as an f64 (they can become integers at least.)
## Description
Adds support for streaming Redfish LogEntries through SSE (Server-Sent Events). Introduces an async OTLP log export pipeline that batches and ships collected log events to an OpenTelemetry collector over gRPC.
## But Why?
Periodic collection is not well suited for Redfish LogEntries or Events. On BMCs, relying completely on periodic collection can cause resource-intensive bursty behavior as Logs and Events accumulate behind a seemingly harmless Redfish endpoint. Once this accumulation occurs, a periodic scrape of the endpoint stresses network I/O on a BMC, causing red-lining behavior and downtime.
This change attempts to address this by employing SSE and modifying the existing periodic and synchronous behavior of the Health crate. To ensure the Health service is not overwhelmed, we have implemented backpressure in an async pipeline such that in times of high load, the Health service distributes the system stress across BMCs and does not drop entries. Using the same DBus journal that created load issues under periodic collection, we leverage the DBus journal to distribute the load away from the Health service as a kind of leaky bucket approach to make log/event flow more consistent.
## Dude, where's my sync Sink?
To make an omelette, sometimes you have to break some eggs. In this case, the eggs were the synchronous nature of our pipeline. SSE coming from nv-redfish is already async; likewise the collectors themselves are `tokio::spawn` tasks. If we input streaming logs/events into our synchronous pipeline, we have no way to suspend too much pressure from the BMCs or an overwhelmed OTLP client on the other end of the OTLPSink without dropping information. I really didn't want to make such impactful changes, but here is a non-exhaustive list of things I attempted:

Why not use `try_send` in the `DataSink`?
Does not block and drops logs - no backpressure.

Isn't there some way to do this with an `mpsc::SyncSender`?
Not without blocking the OS thread and stalling the tokio worker tasks - all tasks assigned to the worker...

What about something like the `OverrideQueue`?
The OverrideQueue uses a unique key to loosely bound Overrides. This works because there's a finite number of `MachineId`s and `ReportSource`s and we only care about the latest. For logs / events this leads to dropping, not to mention the queue size for logs / events has much more freedom compared to the bounded machine/report uniqueness dimension.

## Breaking Changes
## Type of Change
## Testing
## Additional Notes
- `tonic` to `0.14` globally: in Health, we need to leverage the `tonic` `0.14` `Channel`. The problem arises because `opentelemetry-proto` pins `tonic` to `0.12`, where that `Channel` is incompatible with `0.14`'s `Channel`. I looked into upgrading `opentelemetry-proto` to `0.31`, which pins `tonic` `0.14`, and saw a refactor spilling into 8 or more crates... likely out of scope for this PR, but needs to be done at some point.
- `OtlpDrainTask` currently flushes batches on count (512) or time (2s), whichever comes first. Retries with exponential backoff on transient gRPC errors. This will likely need tuning.
- `logs/periodic.rs` module.