CPU and Max RSS Analysis tools by ChrisPaulBennett · Pull Request #6663 · cylc/cylc-flow

ChrisPaulBennett · 2025-03-12T09:14:38Z

This is part of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.

This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.

This adds a Python profiler script. The profiler will be run by Cylc in the same cgroup as the Cylc task. It will periodically poll cgroups and save the data to a file. Cylc will then store these values in the SQL database file.
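As a rough illustration of what one polling step could look like, here is a minimal sketch that reads CPU time and peak memory from a cgroup v2 directory and writes the latest values to files. It is not the code in this PR: the function names and the `max_rss` output file name are hypothetical (only `cpu_time` appears in the tracebacks in this thread), and it assumes the cgroup v2 files `cpu.stat` and `memory.peak` are available.

```python
import tempfile
from pathlib import Path


def read_cpu_seconds(cgroup_dir: Path) -> float:
    """Return total CPU time in seconds from a cgroup v2 cpu.stat file."""
    for line in (cgroup_dir / "cpu.stat").read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            # usage_usec is reported in microseconds
            return int(value) / 1_000_000
    raise ValueError("usage_usec not found in cpu.stat")


def read_peak_memory_bytes(cgroup_dir: Path) -> int:
    """Return peak memory usage in bytes from a cgroup v2 memory.peak file."""
    return int((cgroup_dir / "memory.peak").read_text().strip())


def sample(cgroup_dir: Path, out_dir: Path) -> None:
    """Take one sample and overwrite the output files with the latest values."""
    (out_dir / "cpu_time").write_text(str(read_cpu_seconds(cgroup_dir)))
    # "max_rss" is a hypothetical file name chosen for this sketch
    (out_dir / "max_rss").write_text(str(read_peak_memory_bytes(cgroup_dir)))


# Demonstrate against a fake cgroup directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    fake_cgroup = Path(tmp, "cgroup")
    out = Path(tmp, "out")
    fake_cgroup.mkdir()
    out.mkdir()
    (fake_cgroup / "cpu.stat").write_text(
        "usage_usec 2500000\nuser_usec 2000000\nsystem_usec 500000\n"
    )
    (fake_cgroup / "memory.peak").write_text("1048576\n")
    sample(fake_cgroup, out)
    print((out / "cpu_time").read_text(), (out / "max_rss").read_text())
```

A real profiler would call something like `sample` in a loop with a delay between polls, pointing at the job's actual cgroup directory.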

Linked to:
cylc/cylc-ui#2100
cylc/cylc-uiserver#675

Check List

I have read CONTRIBUTING.md and added my name as a Code Contributor.
Contains logically grouped changes (else tidy your branch by rebase).
Does not contain off-topic changes (use other PRs for other changes).
Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
Tests are included (or explain why tests are not needed).
Changelog entry included if this is a change that can affect users
Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

oliver-sanders

🎉

oliver-sanders

👍

oliver-sanders · 2025-04-03T10:59:35Z

+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#-------------------------------------------------------------------------------
+# cylc profile test


This test will run regular background jobs, no slurm / pbs / whatever, so no cgroups.

I think this is testing that the profiler will not cause the job to fail, even if it cannot poll cgroups? Which is worthwhile testing.

We should test the job's stderr for the line(s) written by the profiler script complaining of the fault.

@ChrisPaulBennett

The profiler actually fails in this test, but the test passes anyway because it doesn't check whether the profiler did anything useful.

I've had a crack at a test here: ChrisPaulBennett#1

A couple of the sub-tests don't pass at the moment because the cpu/memory are not returned if the job fails.

oliver-sanders · 2025-04-03T11:03:48Z

(please ignore the manylinux test failures, we'll be removing this test on master shortly)

wxtim · 2025-04-16T13:50:42Z

I'm getting lots of failures with this (admittedly nasty) workflow on localhost:

[task parameters]
    time = 1..10
    reps = 1..5
[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task<time><reps>
[runtime]
    [[task<time><reps>]]
        script = sleep $CYLC_TASK_PARAM_time

About 2/3 of tasks have FileNotFoundError: [Errno 2] No such file or directory: 'cpu_time' - It looks to me like the profiler fails if the task exits too fast?

Full Traceback

Traceback (most recent call last):
  File "/home/users/tim.pillinger/conda-envs/cylc39/bin/cylc", line 8, in <module>
    sys.exit(main())
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/cylc.py", line 702, in main
    execute_cmd(command, *cmd_args)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/cylc.py", line 333, in execute_cmd
    entry_point.load()(*args)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/terminal.py", line 298, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 62, in main
    get_config(options)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 180, in get_config
    profile(process, cgroup_version, args.delay)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 159, in profile
    write_data(str(cpu_time), "cpu_time")
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 103, in write_data
    with open(filename, 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'cpu_time'

oliver-sanders · 2025-04-16T14:17:45Z

Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup, but jobs that exit faster than the profiler's poll interval are an edge case that we should handle.
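One way to cover the fast-exit edge case is to take a sample straight away, before the first wait, and to treat a failed sample as a warning rather than a crash. The sketch below shows that pattern under stated assumptions: `poll_until_stopped` and its arguments are hypothetical names, not functions from `profiler.py`, and the stop signal is modelled with a `threading.Event`.

```python
import sys
import threading
from typing import Callable


def poll_until_stopped(
    take_sample: Callable[[], None],
    stop: threading.Event,
    delay: float,
) -> int:
    """Sample immediately, then every `delay` seconds until `stop` is set.

    Returns the number of successful samples.
    """
    successes = 0
    while True:
        try:
            take_sample()
            successes += 1
        except OSError as exc:
            # Covers FileNotFoundError; report it but keep the job alive.
            print(f"profiler: sample failed: {exc}", file=sys.stderr)
        # Event.wait returns True as soon as the event is set, so a job that
        # finishes mid-interval ends the loop without waiting the full delay.
        if stop.wait(delay):
            return successes


# A "job" that has already finished still gets exactly one sample.
finished = threading.Event()
finished.set()
samples = []
print(poll_until_stopped(lambda: samples.append("ok"), finished, delay=10))
```

Sampling before waiting means a task shorter than the poll interval still records one data point, and catching `OSError` keeps a vanished file or directory from producing a traceback like the one above.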

wxtim · 2025-04-17T12:59:20Z

> Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup

Probably need some user safety rails/warnings about that

oliver-sanders · 2025-04-17T13:01:33Z

> Probably need some user safety rails/warnings about that

It's difficult for us to say which job runners do or do not support cgroup profiling. The best we can do is to document it.

ChrisPaulBennett · 2025-04-28T10:23:28Z

I'm not sure how to deal with the linting failure. My Perl is rusty, at best.
If I add "export", as the error code recommends, the test fails. If I remove it, the test also fails.
Dave Matthews' recommendations have been implemented.

oliver-sanders · 2025-05-07T11:57:17Z

Works fine for me:

$ ctb -v tests/functional/jobscript/02-profiler.t -p '*'
ok 1 - 02-profiler-validate
ok 2 - 02-profiler-run
ok    20179 ms ( 0.01 usr  0.01 sys +  3.16 cusr  1.10 csys =  4.28 CPU)
[12:56:44]
All tests successful.
Files=1, Tests=2, 24 wallclock secs ( 0.02 usr  0.01 sys +  3.16 cusr  1.10 csys =  4.29 CPU)
Result: PASS

$ git diff
diff --git a/tests/functional/jobscript/02-profiler.t b/tests/functional/jobscript/02-profiler.t
index 1d8dbc548..601d12971 100644
--- a/tests/functional/jobscript/02-profiler.t
+++ b/tests/functional/jobscript/02-profiler.t
@@ -16,7 +16,7 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #-------------------------------------------------------------------------------
 # cylc profile test
-REQUIRE_PLATFORM='runner:?(pbs|slurm)'
+export REQUIRE_PLATFORM='runner:?(pbs|slurm)'
 . "$(dirname "$0")/test_header"
 #-------------------------------------------------------------------------------
 set_test_number 2

$ etc/bin/shellchecker 
$ echo $?
0

ChrisPaulBennett · 2026-05-07T14:55:41Z

I've tested the CPU times as a sanity check that the numbers are correct, and it looks good to me.
I've got two flow.cylc files: one serial and one parallel.
FOO, FOOT and FOOL do some amount of compute.
BAR, BOOL and PUB do twice the amount of compute.
In the serial workflow you should see both wall clock time and CPU time scale together (roughly double). In the parallel workflow you should see the CPU time double (the same amount of work still has to be done), but the wall clock time should stay roughly the same (twice as many cores doing the work).
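The expectation can be written down as a small idealised model. This is only a sketch: it assumes identical CPU-bound processes, no scheduling overhead, and (for the parallel case) one core per process. The function name and the 10 second unit of work are hypothetical and are not measurements from these workflows.

```python
import math


def expected_times(n_procs: int, cpu_per_proc: float, cores: int) -> tuple:
    """Idealised (wall clock, CPU) seconds for identical CPU-bound processes."""
    # CPU time counts all the work done, however it is spread across cores.
    cpu = n_procs * cpu_per_proc
    # Wall clock time depends on how many batches the cores need to run.
    wall = math.ceil(n_procs / cores) * cpu_per_proc
    return wall, cpu


UNIT = 10.0  # hypothetical CPU seconds for one python process

# Serial: one process at a time, so wall clock and CPU time both double.
print("serial   FOO", expected_times(3, UNIT, cores=1))
print("serial   BAR", expected_times(6, UNIT, cores=1))

# Parallel: CPU time doubles, wall clock stays the same with twice the cores.
print("parallel FOO", expected_times(3, UNIT, cores=3))
print("parallel BAR", expected_times(6, UNIT, cores=6))
```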

Serial

#!Jinja2
#

[scheduler]
    UTC mode = True
    allow implicit tasks = True

[scheduling]
    initial cycle point = 2019-12-09T09:00Z
    [[graph]]
        R1 = foo_cold => foo_start
        R1/T00 = foo_start[^] => FOO
        T00, T12 = """
            cycle_end[-PT12H] => FOO
            FOO:succeed-all => BAR
            BAR:succeed-any => wipe_bar
            BAR:succeed-all & wipe_bar => cycle_end
        """

[runtime]

    [[root]]
        platform = spice

    [[FOO]]
        script = /usr/bin/time -v bash -c 'for x in 1 2 3; do python -c "for x in range(100000000): (x / 1.234567) ** 2.3456789"; done'
        [[[directives]]]
            --mem=1000
            --ntasks=2
    [[BAR]]
        script = /usr/bin/time -v bash -c 'for x in 1 2 3 4 5 6; do python -c "for x in range(100000000): (x / 1.234567) ** 2.3456789"; done'
        [[[directives]]]
            --mem=500
            --ntasks=2

Parallel

#!Jinja2
#

[scheduler]
    UTC mode = True
    allow implicit tasks = True

[scheduling]
    initial cycle point = 2019-12-09T09:00Z
    [[graph]]
        R1 = foo_cold => foo_start
        R1/T00 = foo_start[^] => FOO
        T00, T12 = """
            cycle_end[-PT12H] => FOO
            FOO:succeed-all => BAR
            BAR:succeed-any => wipe_bar
            BAR:succeed-all & wipe_bar => cycle_end
        """

[runtime]

    [[root]]
        platform = spice

    [[FOO]]
        script = /usr/bin/time -v bash -c 'for x in 1 2 3; do python -c "for x in range(100000000): (x / 1.234567) ** 2.3456789" & done; wait'
        [[[directives]]]
            --mem=1000
            --ntasks=2
    [[BAR]]
        script = /usr/bin/time -v bash -c 'for x in 1 2 3 4 5 6; do python -c "for x in range(100000000): (x / 1.234567) ** 2.3456789" & done; wait'
        [[[directives]]]
            --mem=500
            --ntasks=2

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>

oliver-sanders · 2026-05-19T13:24:28Z

+                     most circumstances
+                     ''')
+                Conf('polling interval', VDR.V_INTEGER,
+                     default=10,


@dpmatthews, should we consider reducing this to 1?

This has been discussed: information on the impacts of polling is scarce and contradictory. We will leave this at 10s to play it safe.

MetRonnie

ChrisPaulBennett#4

@oliver-sanders might want to cast your eye over this too

Fix profiler silent failures

oliver-sanders · 2026-06-11T10:21:42Z

Still getting errors running some test HPC jobs. @ChrisPaulBennett, could you try putting together a test workflow to replicate:

INFO - [Errno 2] No such file or directory: '/sys/fs/cgroupmemory//pbspro.service/jobid/.../memory.stat'. Unable to find memory usage data. This error came from the Cylc profiler and is not a problem with your workflow. Statistics gathering for the analysis view may be incomplete.
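The `cgroupmemory//` fragment in that path looks like the result of building the path by string concatenation, although that is a guess from the message alone. The sketch below reproduces the same shape of path and contrasts it with `pathlib` joining. The function names and the `1234.example` job ID are hypothetical, and the elided part of the real path is not reconstructed here.

```python
from pathlib import PurePosixPath


def memory_stat_concat(root: str, controller: str, cgroup: str) -> str:
    """Build the path by string concatenation (fragile)."""
    # Nothing separates root from controller, and a cgroup name that already
    # starts with "/" produces a doubled slash.
    return root + controller + "/" + cgroup + "/memory.stat"


def memory_stat_joined(root: str, controller: str, cgroup: str) -> PurePosixPath:
    """Build the path with pathlib, which inserts the separators itself."""
    # Joining onto an absolute right-hand side would discard everything to
    # its left, so strip the leading "/" from the cgroup name first.
    return PurePosixPath(root) / controller / cgroup.lstrip("/") / "memory.stat"


ROOT = "/sys/fs/cgroup"
CGROUP = "/pbspro.service/jobid/1234.example"  # hypothetical job cgroup name

print(memory_stat_concat(ROOT, "memory", CGROUP))
print(memory_stat_joined(ROOT, "memory", CGROUP))
```

The `lstrip("/")` matters: cgroup names read from `/proc/<pid>/cgroup` start with a slash, and `pathlib` treats an absolute right-hand operand as a replacement for the whole path.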


oliver-sanders · 2026-06-12T14:57:44Z

tests/functional/jobscript/03-profiler-e2e.t is failing for _local_slurm_indep_tcp and _local_slurm_shared_tcp with Cgroup not found messages in the job.err files.

oliver-sanders · 2026-06-12T12:52:55Z

-                if "max" not in line:
-                    return int(line)
+            memory_max_file = cgroup_memory_path / "memory.max"
+            line = memory_max_file.read_text().splitlines()[0]


This reads the whole file, we might want to stick with only reading the first line, i.e:

with open(...) as myfile: line = myfile.readline()
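Spelled out a little further, the suggestion might look like the sketch below. The helper names are hypothetical; it assumes a cgroup v2 `memory.max` file, which holds either a byte count or the literal string `max` when no limit is set.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_first_line(path: Path) -> str:
    """Return the first line of a file without reading the rest of it."""
    with open(path) as handle:
        # readline stops at the first newline, unlike read_text().splitlines()
        return handle.readline().rstrip("\n")


def parse_memory_max(path: Path) -> Optional[int]:
    """Return the memory limit in bytes, or None if the cgroup is unlimited."""
    line = read_first_line(path)
    return None if line == "max" else int(line)


# Demonstrate with temporary files standing in for memory.max.
with tempfile.TemporaryDirectory() as tmp:
    limited = Path(tmp, "limited")
    limited.write_text("1073741824\n")
    unlimited = Path(tmp, "unlimited")
    unlimited.write_text("max\n")
    print(parse_memory_max(limited), parse_memory_max(unlimited))
```

For a one-line file such as `memory.max` the difference is negligible, but the same helper is safe to reuse on larger files.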

oliver-sanders · 2026-06-12T15:00:04Z

+# NOTE: This test will run the Cylc profiler on the given test platform.
+# The test platform may need to be configured for this to work (e.g.
+# "cgroups path" may need to be set).
+export REQUIRE_PLATFORM='runner:?(pbs|slurm) comms:tcp'


Making this change allows us to run the test on remote platforms (i.e. _remote*) as well as local ones (i.e. _local*). I think it's because tests are hardcoded to run on local platforms only, so you have to "unlock" remote testing explicitly.

Suggested change

-export REQUIRE_PLATFORM='runner:?(pbs|slurm) comms:tcp'
+export REQUIRE_PLATFORM='loc:* runner:?(pbs|slurm) comms:tcp'

This is how I got it to run on _remote_pbs_indep_tcp.

ChrisPaulBennett marked this pull request as draft March 12, 2025 09:19

oliver-sanders reviewed Mar 12, 2025

oliver-sanders added this to the 8.x milestone Mar 12, 2025

oliver-sanders assigned ChrisPaulBennett Mar 12, 2025

This was referenced Mar 13, 2025

CPU and Max RSS Analysis tools cylc/cylc-ui#2100

Open

CPU and Max RSS Analysis tools cylc/cylc-uiserver#675

Open

ChrisPaulBennett force-pushed the cylc_profiler branch 2 times, most recently from fb1b12b to c5d30b3 Compare March 21, 2025 11:37

ChrisPaulBennett force-pushed the cylc_profiler branch 3 times, most recently from 30a7bb0 to 7091711 Compare April 2, 2025 08:35

ChrisPaulBennett marked this pull request as ready for review April 2, 2025 14:20

oliver-sanders reviewed Apr 3, 2025

oliver-sanders reviewed Apr 10, 2025

Comment thread cylc/flow/cfgspec/globalcfg.py Outdated

ChrisPaulBennett force-pushed the cylc_profiler branch from 4f3d03a to 49fcbc8 Compare April 15, 2025 07:51

ChrisPaulBennett requested a review from oliver-sanders April 15, 2025 09:39

ChrisPaulBennett force-pushed the cylc_profiler branch from 68b0687 to 66acd1f Compare April 17, 2025 09:41

oliver-sanders reviewed Apr 23, 2025

Comment thread cylc/flow/etc/job.sh Outdated

oliver-sanders reviewed Apr 24, 2025

Comment thread cylc/flow/cfgspec/globalcfg.py Outdated

oliver-sanders reviewed Apr 24, 2025

Comment thread cylc/flow/etc/job.sh Outdated

oliver-sanders reviewed Apr 24, 2025

Comment thread cylc/flow/etc/job.sh Outdated

oliver-sanders reviewed Apr 24, 2025

Comment thread cylc/flow/etc/job.sh

ChrisPaulBennett requested a review from oliver-sanders April 28, 2025 14:02

Merge branch 'master' into cylc_profiler

8425e58

MetRonnie reviewed May 11, 2026

Comment thread cylc/flow/cfgspec/globalcfg.py Outdated

Comment thread cylc/flow/job_file.py Outdated

MetRonnie self-requested a review May 11, 2026 16:10

ChrisPaulBennett and others added 3 commits May 14, 2026 17:35

Update cylc/flow/cfgspec/globalcfg.py

f33adf8

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>

Code review changes

8340427

Typo

77801d3

MetRonnie reviewed May 15, 2026

Comment thread tests/unit/scripts/test_profiler.py Outdated

ChrisPaulBennett and others added 5 commits May 18, 2026 08:40

Update cylc/flow/scripts/profiler.py

1980831

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>

Update unit tests

5506e7e

Update functional tests

3ff17b2

Code review changes

30db51a

Code review changes

bdb71b3

MetRonnie requested changes May 18, 2026

Comment thread cylc/flow/scripts/profiler.py Outdated

Comment thread tests/unit/scripts/test_profiler.py Outdated

Code review changes

bedf564

oliver-sanders requested review from MetRonnie and oliver-sanders May 19, 2026 13:12

oliver-sanders reviewed May 19, 2026

MetRonnie requested changes May 19, 2026

MetRonnie and others added 2 commits May 19, 2026 16:20

Fix profiler silent failures

7008d51

Merge pull request #4 from MetRonnie/profiler-fix

c66a1b4

Fix profiler silent failures

oliver-sanders requested review from MetRonnie and oliver-sanders June 9, 2026 13:56

ChrisPaulBennett added 3 commits June 12, 2026 13:49

Removed manual string mangling

58bac9f

Merge remote-tracking branch 'ChrisPaulBennett/cylc_profiler' into cy…

f584d2f

…lc_profiler

Removed manual string mangling

4f64cfa

oliver-sanders reviewed Jun 12, 2026

