Skip to content

internal/logger: tiered cleanup / rewrite (infallible API, typed OTEL bridge, CP-safety) #387

Description

@schmitthub

Scope

internal/logger carries broader quality problems beyond the (now-fixed) exit
lag. This issue tracks the tiered cleanup / rewrite plan below.

Tier 0 — the CLI exit lag — is fixed and shipped in #388. Logger.Close
is now caller-owned (Close(ctx); the hardcoded 5s context.Background() is
gone, the caller's ctx is the flush deadline), plus review follow-ups that
removed the context.Canceled/DeadlineExceeded suppression so Close
returns the true provider.Shutdown outcome, and made clawker-cp surface
its close error to stderr without poisoning the teardown-integrity exit code.
The collector-down Close timeout path now has regression coverage
(TestClose_CanceledContext_ReturnsPromptly,
TestClose_LiveContext_BoundedByExportTimeout). Tiers 1–4 below remain.

Architectural intent that still governs the rewrite: enabling telemetry must
never make the CLI hang or crash when the collector is down, and this package is
imported on the CP boot path, where panic/os.Exit footguns matter given
the "CP crash is a security incident" invariant.


internal/logger package review — cleanup plan (tiered)

A package-isolated Go best-practices review (5 lenses), grounded against pinned
deps (otel/sdk/log@v0.20.0, otelslog@v0.19.0, rs/zerolog@v1.35.1) and stdlib
log/slog. Tracking the findings here as a sequenced cleanup.

Tier 1 — API ergonomics

  • Fallible construction + fallible accessor force the
    if log, err := f.Logger(); err == nil { ... } guard at ~34 call sites
    (and silently drop the log line on the error path). Make the logger
    infallible: the accessor returns a bare, never-nil *Logger (degrade to
    Nop()/stderr internally); keep the error only at the single factory wiring
    site.
  • *zerolog.Event leaks through the public API (Debug()/Info()/… return it)
    → the wrapper isn't a real boundary, the backend isn't swappable, and it
    necessitates the Zerolog() escape hatch.
  • Compress bool zero-value lie: doc says default true, but the zero value
    is false and there's no coercer (unlike the numeric rotation fields).

Tier 2 — OTEL↔zerolog bridge correctness (otel_writer.go)

The bridge serializes zerolog → JSON bytes, json.Unmarshals, and re-emits as a
log.Record. Anti-pattern — no official bridge does this (otelslog/otelzap
read typed fields directly → typed log.KeyValue). The round-trip causes real
data corruption:

  • int64↔float64 type flipping; >2^53 precision loss on IDs/byte-counts
    (everything decodes to float64, then is heuristically guessed back to int).
  • nested objects/arrays re-stringified instead of log.MapValue/SliceValue.
  • []byte → base64 string instead of KindBytes.
  • timestamps truncated to whole seconds (zerolog default
    TimeFieldFormat = RFC3339; the package never overrides it).
  • zerolog.TraceLevel dropped → SeverityUndefined.

CLAUDE.md's "all structured fields preserved" is inaccurate.

Grounded nuance (deepwiki on rs/zerolog): zerolog — unlike slog/zap — does
not expose typed fields to a Hook or LevelWriter (fields are mid-flight
JSON in e.buf; substance of rs/zerolog#493). A LevelWriter delivers the
level directly (structurally fixes the severity mapping), but full type
fidelity requires capturing typed fields at clawker's own *Logger boundary
before zerolog serializes
— i.e. this rewrite depends on the Tier 1 API
change
(stop passing through the raw *zerolog.Event).

Tier 3 — safety / hygiene (matters on the CP boot path)

  • otel.SetErrorHandler(...) is a process-global mutation inside the
    constructor
    (last-writer-wins, test pollution).
  • Fatal() on a shared library logger → os.Exit footgun; per the CP-crash
    invariant an unintended exit strands eBPF. (3 external callers.)
  • With(...) panics on odd-count / non-string key — a log call must never
    crash a (CP) goroutine; slog degrades with !BADKEY.

Tier 4 — nits + test gaps

  • interface{}any (the file is the lone holdout); dead/reserved mTLS
    file-path triple (YAGNI); the sync.Pool in the writer buys little;
    thread-safety doc overstated (post-Close emits are silently dropped).
  • Test gaps: no concurrency test under -race despite the thread-safety
    claim; OTEL fallback asserted only as "didn't error"; low-value accessor tests
    (Options_Custom, LogFilePath, Compress).

Proposed sequencing

Recommendation: B now, C as an immediate focused follow-up. TDD, adding the
missing -race concurrency test and the typed-field bridge fidelity tests.

Related (out of scope here)

A separate, smaller command-exit concern: the changelog notifier has no HTTP
cache/TTL and fetches raw.githubusercontent.com on every interactive run (the
update check is 24h-TTL-gated). Tracked separately — not part of this logger issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinggoPull requests that update go codetech-debtCode quality, refactoring, and architectural cleanup

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions