
Show child process cpu usage in dtop #1880

Open
aclauer wants to merge 7 commits into dev from andrew/feat/dtop-subprocess-cpu-usage

Conversation


@aclauer aclauer commented Apr 18, 2026

Problem

dtop only shows CPU usage for the Python workers spawned by DimOS; native modules spawned by those workers don't show up in the CPU statistics. This PR also adds a --log flag to record dtop statistics to a file and a dtop-plot command to generate plots of CPU usage from those logs.

Closes DIM-XXX

Solution

Read the PIDs of any child processes a worker spawns and include their CPU usage in a drop-down under the main worker.

dtop

Breaking Changes

None

How to Test

dimos --dtop --replay --replay-db=go2_bigoffice run unitree-go2

and

dtop

When dimos spawns the viewer, it will show up as a subprocess of the rerun bridge worker.
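
To exercise the new logging and plotting path, run dtop with the new flag and then feed the log to the plotter (--log without a value falls back to a timestamped default filename; the exact dtop-plot argument form shown here is an assumption):

dtop --log

and then

dtop-plot dtop_YYYYMMDD_HHMMSS.jsonl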

Contributor License Agreement

  • I have read and approved the CLA.

@aclauer aclauer changed the title from "Initial subprocess display" to "Show child process cpu usage in dtop" on Apr 18, 2026
Comment on lines +130 to +142
try:
    proc = _get_process(pid)
    for child in proc.children(recursive=False):
        child_proc = _get_process(child.pid)
        try:
            name = child_proc.name()
            cpu = child_proc.cpu_percent(interval=None)
            result.append(ChildProcessStats(pid=child.pid, name=name, cpu_percent=cpu))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
except (psutil.NoSuchProcess, psutil.AccessDenied):
    pass
return result
Presumably _get_process raises (psutil.NoSuchProcess, psutil.AccessDenied). If so, surround just that function call with a try/except; nesting the try/excepts is confusing.
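
A minimal sketch of the flattened structure, assuming _get_process and the psutil accessors are the only calls that raise, and reusing the ChildProcessStats and _get_process names from this PR:

```python
import psutil

def collect_children_stats(pid: int) -> list[ChildProcessStats]:
    result: list[ChildProcessStats] = []
    try:
        # Only the parent lookup/enumeration is guarded here.
        children = _get_process(pid).children(recursive=False)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        return result  # parent already gone or inaccessible
    for child in children:
        try:
            child_proc = _get_process(child.pid)
            result.append(
                ChildProcessStats(
                    pid=child.pid,
                    name=child_proc.name(),
                    cpu_percent=child_proc.cpu_percent(interval=None),
                )
            )
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # child exited between enumeration and sampling
    return result
```

Each failure point gets its own try/except at the same level, so a vanishing child no longer has to be distinguished from a vanishing parent by nesting depth.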

Comment thread dimos/utils/cli/dtop.py Outdated
parser.add_argument(
    "--log",
    nargs="?",
    const=f"dtop_{time.strftime('%Y%m%d_%H%M%S')}.jsonl",

So it's automatically git-ignored.

Suggested change
const=f"dtop_{time.strftime('%Y%m%d_%H%M%S')}.jsonl",
const=f"dtop_{time.strftime('%Y%m%d_%H%M%S')}.ignore.jsonl",

@aclauer aclauer marked this pull request as ready for review April 24, 2026 21:13
@aclauer aclauer requested a review from paul-nechifor April 24, 2026 21:13
greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR extends dtop to show CPU usage of child processes spawned by each worker (e.g. native modules), adds a --log flag to record stats as JSONL, and introduces a new dtop-plot CLI to generate matplotlib plots from those logs.

  • P1 — dtop_plot.py line 54: msg.get("workers", []) on a pandas Series returns NaN (not []) when the column value is null; iterating over it raises TypeError. Since NaN is truthy, an "or []" fallback won't help; guard with an explicit isinstance check before iterating.

Confidence Score: 4/5

Safe to merge with the NaN guard in dtop_plot._load fixed; the TUI changes are solid.

One P1 finding (TypeError crash in dtop-plot on edge-case log files) caps the score at 4. The core TUI and monitoring changes are well-structured with no critical bugs.

dimos/utils/cli/dtop_plot.py — _load NaN handling for missing "workers" rows.

Important Files Changed

| Filename | Overview |
| --- | --- |
| dimos/core/resource_monitor/stats.py | Adds the ChildProcessStats dataclass and collect_children_stats; inconsistently skips _proc_cache cleanup for dead children, unlike collect_process_stats. |
| dimos/core/resource_monitor/monitor.py | Aggregates child CPU into the parent worker's cpu_percent and passes the children list through to WorkerStats; straightforward and correct. |
| dimos/utils/cli/dtop.py | Adds child-process sub-rows and the --log argparse flag, and refactors CPU rendering into _cpu_metric; pid variable shadowing between the outer and inner loops is a minor readability concern. |
| dimos/utils/cli/dtop_plot.py | New plotting tool; msg.get("workers", []) on a pandas Series returns NaN instead of [] for null rows, causing a TypeError crash on edge-case log files. |
| pyproject.toml | Registers the new dtop-plot entry point; no issues. |

Sequence Diagram

sequenceDiagram
    participant SM as StatsMonitor
    participant S as stats.py
    participant LCM as LCM Bus
    participant DT as dtop (ResourceSpyApp)
    participant DP as dtop-plot

    SM->>S: collect_process_stats(worker_pid)
    S-->>SM: ProcessStats (worker cpu, mem…)
    SM->>S: collect_children_stats(worker_pid)
    S-->>SM: list[ChildProcessStats]
    SM->>SM: aggregate child cpu_percent into WorkerStats
    SM->>LCM: publish WorkerStats (incl. children[])

    LCM-->>DT: _on_msg(msg)
    DT->>DT: update _child_cpu_history[pid]
    DT->>DT: _make_child_line() per child
    DT->>DT: write JSONL line to log file (if --log)

    DP->>DP: pd.read_json(log)
    DP->>DP: _load() → DataFrame + labels
    DP->>DP: _plot() → save PNG


Comment thread dimos/utils/cli/dtop.py
self._latest = msg
self._last_msg_time = time.monotonic()
if self._log_file:
    self._log_file.write(json.dumps({"ts": time.time(), **msg}) + "\n")

P2 Log file not flushed between writes

Each _on_msg call writes a line to _log_file but never calls flush(). Because Python's file I/O is buffered by default, lines written near a crash or SIGKILL will silently stay in the OS/Python buffer and never reach disk. Adding a flush() after the write ensures each message is durable.

Suggested change
self._log_file.write(json.dumps({"ts": time.time(), **msg}) + "\n")
self._log_file.write(json.dumps({"ts": time.time(), **msg}) + "\n")
self._log_file.flush()
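
Alternatively, if per-line durability is the goal, the file can be opened line-buffered so every newline triggers a flush (a sketch; buffering=1 selects line buffering for text-mode files):

```python
self._log_file = open(log_path, "a", buffering=1) if log_path else None
```

Either way the data is only handed to the OS, which is enough to survive a crash or SIGKILL of the process; neither flush() nor line buffering fsyncs to disk.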

Comment thread dimos/utils/cli/dtop.py
) -> None:
    super().__init__()
    self._topic_name = topic_name
    self._log_file = open(log_path, "a") if log_path else None

P2 File handle leak if __init__ raises after open()

_log_file is opened before autoconf, PickleLCM(), and subscribe(). If any of those subsequent calls throw, on_unmount is never called and the file handle is leaked. A try/except (or opening the file later, e.g. in on_mount) would prevent this.
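
A sketch of the deferred-open variant, assuming Textual's on_mount lifecycle hook and the attribute names from the diff above (the constructor signature is an assumption):

```python
def __init__(self, topic_name: str, log_path: str | None = None) -> None:
    super().__init__()
    self._topic_name = topic_name
    self._log_path = log_path
    self._log_file = None  # opened lazily once the app is running

def on_mount(self) -> None:
    # on_mount only runs after __init__ (and the rest of startup) succeeded,
    # so a matching on_unmount will get the chance to close the handle.
    if self._log_path:
        self._log_file = open(self._log_path, "a")
```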

Comment thread dimos/utils/cli/dtop.py
parts.append(Rule(title=title, style=border_style))
parts.extend(self._make_lines(d, stale, ranges, self._cpu_history[role]))
for child in d.get("children", []):
    pid = child.get("pid", 0)

P2 pid variable shadowed by inner loop

The outer for loop's tuple unpacking binds pid as a string (the worker pid, for display), but this inner assignment overwrites it with an integer child pid. The outer pid is rebound at the start of each outer iteration, so there's no runtime bug, but the shadowing is confusing and could easily introduce one if code is ever added between the inner loop and the next outer iteration.

Suggested change
pid = child.get("pid", 0)
child_pid = child.get("pid", 0)
if child_pid not in self._child_cpu_history:
    self._child_cpu_history[child_pid] = deque(maxlen=_SPARK_WIDTH * 2)
if not stale:
    self._child_cpu_history[child_pid].append(child.get("cpu_percent", 0.0))
parts.append(self._make_child_line(child, stale, self._child_cpu_history[child_pid]))

rows = []
for _, msg in raw.iterrows():
    ts = msg["ts"]
    rows.append({"ts": ts, "role": _COORDINATOR, **msg[_COORDINATOR]})

P2 KeyError on malformed log lines

msg[_COORDINATOR] raises KeyError if any line in the JSONL file is missing the "coordinator" key (e.g., a truncated line written during an unclean shutdown). Wrapping the row processing in a try/except KeyError and skipping bad rows would make the tool more robust.

Suggested change
rows.append({"ts": ts, "role": _COORDINATOR, **msg[_COORDINATOR]})
try:
    ts = msg["ts"]
    rows.append({"ts": ts, "role": _COORDINATOR, **msg[_COORDINATOR]})
except (KeyError, TypeError):
    continue

for _, msg in raw.iterrows():
    ts = msg["ts"]
    rows.append({"ts": ts, "role": _COORDINATOR, **msg[_COORDINATOR]})
    for w in msg.get("workers", []):

P1 msg.get("workers", []) returns NaN, not [], on pandas null rows

pd.read_json(path, lines=True) creates a "workers" column for the whole DataFrame. If any log line is missing the "workers" key (e.g. a coordinator-only message from an older build, or a partially-written line), pandas fills that row with NaN. A pandas Series get(key, default) only falls back to default when the key is absent from the index — not when the value is NaN. So msg.get("workers", []) returns NaN for those rows, and for w in NaN raises TypeError: 'float' object is not iterable.

Since the key exists in the Series index, the default never kicks in, and because NaN is truthy, falling back with "or []" doesn't help either. Guard with an explicit check:

        workers = msg.get("workers")
        for w in workers if isinstance(workers, list) else []:
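
A standalone repro of the Series.get behavior (not code from this PR):

```python
import pandas as pd

# Records as pd.read_json(..., lines=True) would produce them; the first
# record has no "workers" key, so pandas fills that cell with NaN (a float).
df = pd.DataFrame([{"ts": 1.0}, {"ts": 2.0, "workers": [{"pid": 7}]}])
row = df.iloc[0]

row.get("workers", [])  # -> nan; the default only applies to missing keys
bool(float("nan"))      # -> True; NaN is truthy, so `nan or []` is still nan
workers = row.get("workers")
for w in workers if isinstance(workers, list) else []:
    pass  # safe: the NaN row is skipped instead of raising TypeError
```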

