feat(audit): stream proxy network events to NDJSON live#831
Conversation
c1e381c to
c6e2638
Compare
There was a problem hiding this comment.
Code Review
This pull request implements live streaming of network audit events to prevent data loss during crashes and avoid in-memory buffer limits. It introduces a NetworkAuditSink trait and a RecorderStreamingSink implementation, along with logic to close the audit log to ensure Merkle root integrity. Feedback was provided regarding the synchronous file I/O in the record implementation, which may block asynchronous request handlers and should be offloaded to a background task.
| impl nono_proxy::audit::NetworkAuditSink for RecorderStreamingSink { | ||
| fn record(&self, event: &NetworkAuditEvent) { | ||
| match self.recorder.lock() { | ||
| Ok(mut recorder) => { | ||
| if let Err(e) = recorder.record_network_event(event.clone()) { | ||
| tracing::warn!("Failed to stream network audit event: {}", e); | ||
| } | ||
| } | ||
| Err(e) => { | ||
| tracing::warn!( | ||
| "Audit recorder mutex poisoned while streaming network event: {}", | ||
| e | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The record implementation performs synchronous file I/O and a flush() call (via record_network_event -> append_event) while holding a Mutex lock on the AuditRecorder. Since this is called from the proxy's asynchronous request handlers, it may introduce significant latency and become a performance bottleneck under high network load. Consider offloading the recording to a background task or using an asynchronous channel to avoid blocking the proxy's worker threads.
There was a problem hiding this comment.
Thanks for flagging this — pushing back on implementing it now, with reasoning:
-
Not a new pattern.
push_eventalready took astd::sync::Mutexfrom the same async handlers before this PR (crates/nono-proxy/src/audit.rspre-PR). The sink call sits in front of that existing lock; the threading model is unchanged. -
Granularity is per-request, not per-byte. One event = one proxy decision. The proxy is already doing the TLS handshake and body relay around each event; a ~1–2 KB NDJSON line +
flush()is on the order of 100 µs on any modern SSD, well below the network round-trip the proxy already adds. -
Ordering is load-bearing.
append_eventcomputeschain_hash = H(prev_chain ‖ leaf_hash)and writes records strictly serially —verify_audit_logwalks that chain.spawn_blockingper event still serializes on the recorder mutex (plus task-hop overhead). A single-writer task + channel just relocates the lock; it does not remove it. -
A channel reintroduces the bug this PR fixes. Buffered events lost on crash is exactly the loss-on-crash gap Stream proxy network audit events to NDJSON as they occur #798 set out to close. Preserving durability would require a bounded channel with backpressure (producer blocks when full = the same blocking, just renamed) and an explicit drain in
close_audit_log. Net: more complexity for a theoretical win. -
The issue body called this out explicitly: "if proxy throughput becomes a bottleneck this should move to
spawn_blocking". The plan is to defer until benchmark data shows the sink dominates, then redesign with ordering + durability constraints in mind — not speculatively.
Happy to revisit if you have a profile/bench showing the sink is the bottleneck under realistic load.
lukehinds
left a comment
There was a problem hiding this comment.
Thanks @sikor111 , only concern is this introduces sync to async tokio threads. It would then be a hot path called for every proxied network event care of the Mutex<AuditRecorder> lock + the disk write via RecorderStreamingSink::record before returning. As push_event is called from an async Tokio task (lhandle_connection), that blocking I/O will occupy an async worker thread for the duration of each write. This would impact streaming downloads (which we plan to introduce), as each proxied connection generates at least one audit event, and that disk write now sits on the hot path. It could add measurable latency per connection under a busyu agent.
One way to address is to offload the sink call in push_event with tokio::task::spawn_blocking - but this may need push_event to become async or to fire-and-forget
c6e2638 to
7ebae11
Compare
Addresses maintainer feedback on always-further#831: `RecorderStreamingSink::record` took `Mutex<AuditRecorder>` and called `file.write_all + flush` while on a Tokio async worker thread. Every proxied connection went through that critical section, so concurrent traffic serialized on one lock and each async worker was blocked for the duration of a sync disk write — measurable latency under a busy agent and a blocker for the planned streaming downloads. Move the write off the hot path: - `NetworkAuditSink` gains a default `flush()` hook so sinks that buffer asynchronously can drain on demand. - `AuditLog::close()` invokes `sink.flush()` after flipping `closed`, so any further append by `finalize_supervised_exit` (e.g. `session_ended`) sees the queue fully drained. - `RecorderStreamingSink` is rebuilt around a bounded `sync_channel(16_384)` drained by a dedicated `nono-audit-writer` OS thread. `record()` is a non-blocking `try_send` (lock-free hot path); overflow increments a counter and warns at exponential intervals. The writer is the sole holder of the recorder mutex on the audit path, so concurrent proxy tasks no longer contend. Order preservation is mandatory because `AuditRecorder::append_event` builds a hash chain (`prev_chain -> chain_hash`) and a Merkle accumulator over `leaf_hashes`. A single-consumer channel preserves FIFO, so `verify_audit_log` continues to pass — fire-and-forget `spawn_blocking` would have raced and broken the chain. A new concurrent-writers test exercises 8 threads x 100 events and asserts chain verification succeeds end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Maciej Sikora <itmaciej.sikora@gmail.com>
|
Thanks for the review @lukehinds. You're right that the original I considered the Instead I rebuilt the sink around a bounded While doing this I noticed a subtler race in the original PR: a producer that has already passed If the writer thread itself fails to spawn, Coverage added (
Rebased onto |
Network audit events from nono-proxy were buffered in Arc<Mutex<Vec<NetworkAuditEvent>>> and only persisted to audit-events.ndjson at session exit. This dropped all events on crash, silently lost events past MAX_AUDIT_EVENTS = 4096 on long sessions, and prevented live observability via tail -f. Route each event through an optional NetworkAuditSink as it occurs. The in-memory Vec is kept only to populate SessionMetadata.network_events at exit; the durable file is what the sink writes. - nono-proxy::audit: new NetworkAuditSink trait; SharedAuditLog evolves to Arc<AuditLog> holding the buffer, an OnceLock<Arc<dyn NetworkAuditSink>>, and an AtomicBool close flag. push_event streams first, then buffers. - nono-proxy::server::ProxyHandle: set_audit_streaming_sink, audit_streaming_active, close_audit_log. - nono-cli::audit_integrity: RecorderStreamingSink wraps Arc<Mutex<AuditRecorder>>, forwards each event through record_network_event. Errors are logged via tracing::warn, never propagated; audit must not break network requests. - nono-cli::supervised_runtime: attach the sink after the recorder is created. Recorder is now Arc<Mutex<...>> so the sink can hold a clone. - nono-cli::rollback_runtime::finalize_supervised_exit: close the audit log before draining (prevents post-finalize events from extending the file past the committed Merkle root) and skip the per-event recorder push when streaming was active (no duplicates). No on-disk format change. No new CLI flag — default-on whenever the audit recorder is present. Standalone proxy use without a sink is unchanged. Refs: always-further#798 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Maciej Sikora <itmaciej.sikora@gmail.com>
Addresses maintainer feedback on always-further#831: `RecorderStreamingSink::record` took `Mutex<AuditRecorder>` and called `file.write_all + flush` while on a Tokio async worker thread. Every proxied connection went through that critical section, so concurrent traffic serialized on one lock and each async worker was blocked for the duration of a sync disk write — measurable latency under a busy agent and a blocker for the planned streaming downloads. Move the write off the hot path: - `NetworkAuditSink` gains a default `flush()` hook so sinks that buffer asynchronously can drain on demand. - `AuditLog::close()` invokes `sink.flush()` after flipping `closed`, so any further append by `finalize_supervised_exit` (e.g. `session_ended`) sees the queue fully drained. - `RecorderStreamingSink` is rebuilt around a bounded `sync_channel(16_384)` drained by a dedicated `nono-audit-writer` OS thread. `record()` is a non-blocking `try_send` (lock-free hot path); overflow increments a counter and warns at exponential intervals. The writer is the sole holder of the recorder mutex on the audit path, so concurrent proxy tasks no longer contend. Order preservation is mandatory because `AuditRecorder::append_event` builds a hash chain (`prev_chain -> chain_hash`) and a Merkle accumulator over `leaf_hashes`. A single-consumer channel preserves FIFO, so `verify_audit_log` continues to pass — fire-and-forget `spawn_blocking` would have raced and broken the chain. A new concurrent-writers test exercises 8 threads x 100 events and asserts chain verification succeeds end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Maciej Sikora <itmaciej.sikora@gmail.com>
Addresses open review items on the streaming-audit sink: Race fix: post-flush events could previously slip past `AuditLog::close` and land in `audit-events.ndjson` after `record_session_ended`, extending the file past the Merkle root committed to `SessionMetadata`. The race window is small but real: a producer can pass `is_closed()` and only reach `sink.record()` after `close()` has flushed and returned. The writer thread now sets a sticky `done` flag while processing the `Flush` command and silently drops any subsequent `Event`, so the seal holds regardless of producer/close interleaving. `verify_audit_log` keeps verifying end-to-end. Lifecycle and observability: - `RecorderStreamingSink` exposes `dropped_events()` and logs the final count on `Drop`, so the operator never silently loses count of queue-full or writer-gone drops. - `Disconnected` send errors now rate-limit their `warn!` (exponential backoff) to match the `Full` path; a crashed writer no longer floods the log. - `flush()` is idempotent: extra calls re-set `done`, ack, and return — exercised by a new test. - If the writer thread itself fails to spawn, `supervised_runtime` degrades to the legacy "drain at session_ended" path with a `warn!` instead of failing the run. Streaming durability is best-effort. Tests (+4): - queue-full overflow drops events and bumps the counter; total (recorded + dropped) equals the burst submitted. - post-flush events are silently dropped and the recorder length stays pinned across a sleep that would otherwise let the writer append. - `Drop` on the sink joins the writer cleanly: recorder `Arc` strong count drops to 1 and the recorder remains fully usable. - `flush()` is idempotent and the seal stays sticky across repeated flushes. Docstrings: `NetworkAuditSink::flush`, `AuditLog::close`, and `RecorderStreamingSink` now spell out the seal contract and the `close`/`flush` race that motivates it. `WriterCmd` variants have docstrings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Maciej Sikora <itmaciej.sikora@gmail.com>
9faf7c8 to
e7d5c0a
Compare
Closes #798
Disclosure: This pull request is being opened by Claude (an AI coding agent) operating on behalf of @sikor111, per the Coding Agent Contribution Policy in
CLAUDE.md.Summary
Network audit events from
nono-proxywere buffered inArc<Mutex<Vec<NetworkAuditEvent>>>and only persisted toaudit-events.ndjsonat session exit. That meant any crash/OOM/hard kill of the supervisor lost every event of the session, long sessions silently dropped events pastMAX_AUDIT_EVENTS = 4096, and operators could nottail -fto watch decisions live.This PR routes each event through an optional
NetworkAuditSinkas it occurs. The in-memoryVecis kept solely to populateSessionMetadata.network_eventsat exit; the durable file is what the sink writes.Approach
nono-proxy::audit: newNetworkAuditSinktrait.SharedAuditLogevolves toArc<AuditLog>holding the existing buffer,OnceLock<Arc<dyn NetworkAuditSink>>, andAtomicBoolclose flag.push_eventstreams first, then buffers; respects the close flag.nono-proxy::server::ProxyHandle:set_audit_streaming_sink,audit_streaming_active,close_audit_log.nono-cli::audit_integrity:RecorderStreamingSinkwrapsArc<Mutex<AuditRecorder>>and forwards each event throughrecord_network_event. Errors are logged viatracing::warn, never propagated — audit recording must not break an in-flight network request.nono-cli::supervised_runtime: attaches the sink after the recorder is created. Recorder is nowArc<Mutex<...>>so the sink can hold a clone.nono-cli::rollback_runtime::finalize_supervised_exit: closes the audit log before draining (prevents post-finalize events from extending the file past the committed Merkle root) and skips the per-event recorder push when streaming was active (no duplicates).No on-disk format change. No new CLI flag — default-on whenever the audit recorder is present. Standalone proxy use without a sink is unchanged.
Files consulted
crates/nono-proxy/src/audit.rs—SharedAuditLog,push_event,MAX_AUDIT_EVENTScrates/nono-proxy/src/server.rs—ProxyHandle,drain_audit_eventscrates/nono-cli/src/audit_integrity.rs—AuditRecorder::record_network_event,append_event,verify_audit_logcrates/nono-cli/src/supervised_runtime.rs— recorder constructioncrates/nono-cli/src/rollback_runtime.rs—finalize_supervised_exitTest plan
make ci) green-9mid-flight, confirmaudit-events.ndjsoncontains pre-kill eventstail -f audit-events.ndjsonduring a session shows events liveverify_audit_logstill succeeds end-of-session (no events past Merkle root)Agent Compliance Check
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com