
[MISC] Add full support of AMD GPU to unit test and benchmark infra.#2680

Draft
v01dXYZ wants to merge 24 commits into Genesis-Embodied-AI:main from v01dXYZ:fix-rocm-compatibility

Conversation

@v01dXYZ

@v01dXYZ v01dXYZ commented Apr 8, 2026

Description

  • Fall back to rocm-smi if nvidia-smi is not found
  • Use the KFD sysfs interface instead of the NVIDIA proc interface
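The second bullet can be sketched as follows — a minimal, hedged example of enumerating AMD GPU nodes from the KFD sysfs topology (paths and the `cpu_cores_count` filter taken from the PR diff; `list_kfd_gpu_nodes` is an illustrative name, not the PR's actual helper):

```python
from pathlib import Path

KFD_TOPOLOGY = Path("/sys/devices/virtual/kfd/kfd/topology/nodes")

def list_kfd_gpu_nodes():
    """Enumerate KFD topology nodes, keeping only GPU nodes
    (nodes whose 'cpu_cores_count' property is 0).
    Returns an empty dict when KFD sysfs is absent (non-AMD host)."""
    gpu_nodes = {}
    if not KFD_TOPOLOGY.is_dir():
        return gpu_nodes
    for node_path in KFD_TOPOLOGY.iterdir():
        props = {}
        # Each properties file is a list of "name value" lines.
        for line in (node_path / "properties").read_text().splitlines():
            if line.strip():
                name, value = line.split()
                props[name] = int(value)
        if props.get("cpu_cores_count") == 0:
            gpu_nodes[int(node_path.name)] = props
    return gpu_nodes
```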

Related Issue

This is related to #2679. I had a hard time finding the specs of the GitHub runners, so I don't know which AMD GPUs are used to test Genesis.

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

To make benchmark/memory reporting work on ROCm.

How Has This Been / Can This Be Tested?

Yes, though only manually.

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of the CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed: No new/modified API
  • I tested my changes and added instructions on how to test it for reviewers: I didn't test my changes on NVIDIA machines
  • I have added tests to cover my changes.
  • All new and existing tests passed.


@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87d60d8a4d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread tests/conftest.py Outdated
@v01dXYZ v01dXYZ marked this pull request as draft April 8, 2026 17:34
@v01dXYZ v01dXYZ changed the title (WIP) [FIX] some ROCm incompatibilities (WIP) [FIX] fix some ROCm incompatibilities Apr 8, 2026
@v01dXYZ
Author

v01dXYZ commented Apr 9, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 453d137c0a


Comment thread tests/monitor_test_mem.py Outdated
Comment on lines +121 to +122
except FileNotFoundError:
pass


P1 Badge Catch nvidia-smi command failures before ROCm fallback

is_mem_monitoring_supported() now allows monitoring when nvidia-smi fails but rocm-smi works, yet get_cuda_usage() only falls back when nvidia-smi is missing. On ROCm hosts where nvidia-smi is present but exits non-zero (raising subprocess.CalledProcessError), the monitor process crashes before trying rocm-smi, so --mem-monitoring-filepath is reported as supported but fails at runtime.
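A hedged sketch of the fallback behavior this comment asks for — treating a present-but-failing nvidia-smi the same as a missing one, so the monitor falls through to rocm-smi (the helper name and structure are illustrative, not the PR's actual code):

```python
import subprocess

def query_first_available(commands, timeout=10):
    """Return the output of the first command that runs successfully.
    Catches both a missing binary (FileNotFoundError) and a non-zero
    exit (CalledProcessError), so a broken nvidia-smi on a ROCm host
    falls through to rocm-smi instead of crashing the monitor."""
    for cmd in commands:
        try:
            return subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=timeout)
        except (FileNotFoundError, subprocess.CalledProcessError,
                subprocess.TimeoutExpired, OSError):
            continue  # try the next tool
    return None

# e.g. query_first_available([["nvidia-smi"], ["rocm-smi"]])
```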


Author


I assume that on a ROCm machine there is no NVIDIA tooling; otherwise it would be hard to distinguish between the two.

@v01dXYZ
Author

v01dXYZ commented Apr 9, 2026

To avoid allocation failures, I set device_memory_GB=32 when testing manually.

@v01dXYZ v01dXYZ marked this pull request as ready for review April 9, 2026 09:36
@v01dXYZ v01dXYZ changed the title (WIP) [FIX] fix some ROCm incompatibilities [FIX] fix some ROCm incompatibilities Apr 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 453d137c0a


Comment thread tests/conftest.py Outdated
Comment on lines +286 to +287
hip_uuid = "".join([chr(int(device_uuid[i : i + 2], 16)) for i in range(0, len(device_uuid), 2)])
unique_id = int(hip_uuid, 16)


P1 Badge Decode ROCm UUID bytes correctly before integer conversion

amdgpu_get_node_rank_from_uuid() converts each UUID byte to a character and then calls int(..., 16), but that intermediate string will contain non-hex characters for almost all valid UUIDs (e.g., byte 0x7d becomes }), so int raises ValueError before any node lookup happens. On ROCm systems this propagates from _torch_get_gpu_idx() and can fail GPU test setup as soon as device-index validation runs, so memory/device routing checks break even when KFD topology is present.


Author


No: the HIP UUID is already a 16-character hex representation of an int64 integer. PyTorch converts this 16-character hex string into a 32-character hex string (so it is a hex representation of a hex representation).
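The double-hex encoding described above can be demonstrated with a small round-trip (the `unique_id` value here is made up for illustration):

```python
# Hypothetical KFD "unique_id" of a GPU (an int64).
unique_id = 0x075BCD15

# HIP stores it as a 16-character hex string...
hip_uuid = f"{unique_id:016x}"

# ...and PyTorch re-hex-encodes those ASCII bytes into 32 hex chars
# (a hex representation of a hex representation).
torch_uuid = hip_uuid.encode("ascii").hex()

# Reversing the outer encoding, as amdgpu_get_node_rank_from_uuid() does,
# recovers the original hex string and hence the integer id.
decoded = bytes.fromhex(torch_uuid).decode("ascii")
assert decoded == hip_uuid
assert int(decoded, 16) == unique_id
```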

@v01dXYZ
Author

v01dXYZ commented Apr 9, 2026

Below are the results of the mem/speed benchmarks on MI300X:

Details

MEM:

env=duck_in_box_easy    | gjk_collision=True    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=35445
env=duck_in_box_easy    | gjk_collision=False   | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36424
env=duck_in_box_hard    | gjk_collision=True    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36398
env=duck_in_box_hard    | gjk_collision=False   | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36399
env=duck_in_box_hard    | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3249
env=anymal_random       | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36832
env=anymal_uniform      | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36834
env=anymal_zero         | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36835
env=anymal_zero         | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3685
env=anymal_uniform_kinematic    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36837
env=anymal_uniform_kinematic    | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3687
env=go2         | gjk_collision=True    | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36856
env=go2         | constraint_solver=CG  | gjk_collision=False   | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36860
env=go2         | constraint_solver=Newton      | gjk_collision=False   | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36880
env=franka_accessors    | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3692
env=franka_accessors    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36927
env=franka_free         | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36942
env=franka      | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36928
env=franka_random       | gjk_collision=False   | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36926
env=franka_random       | gjk_collision=True    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36919
env=franka_random       | constraint_solver=CG  | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36925
env=franka_random       | constraint_solver=Newton      | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36929
env=franka_random       | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3773
env=box_pyramid_3       | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36924
env=box_pyramid_4       | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36926
env=box_pyramid_5       | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36927
env=box_pyramid_6       | gjk_collision=True    | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36928

SPEED:

env=duck_in_box_easy    | batch_size=30000      | use_contact_island=False      | gjk_collision=True    | dtype=ndarray         | backend=amdgpu        | compile_time=48.5     | runtime_fps=3041448.0        | realtime_factor=30414.5
env=duck_in_box_easy    | batch_size=30000      | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=75.5     | runtime_fps=4409587.0        | realtime_factor=44095.9
env=duck_in_box_hard    | batch_size=30000      | use_contact_island=False      | gjk_collision=True    | dtype=ndarray         | backend=amdgpu        | compile_time=49.1     | runtime_fps=653952.0  | realtime_factor=6539.5
env=duck_in_box_hard    | batch_size=30000      | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=76.5     | runtime_fps=1942384.0        | realtime_factor=19423.8
env=duck_in_box_hard    | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=37.4     | runtime_fps=2308.0    | realtime_factor=23.1
env=anymal_random       | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=77.6     | runtime_fps=1334529.0         | realtime_factor=13345.3
env=anymal_uniform      | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=77.6     | runtime_fps=1703289.0         | realtime_factor=17032.9
env=anymal_zero         | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=78.0     | runtime_fps=2006120.0         | realtime_factor=20061.2
env=anymal_zero         | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=38.6     | runtime_fps=2897.0    | realtime_factor=29.0
env=anymal_uniform_kinematic    | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=83.5     | runtime_fps=1466308.0         | realtime_factor=14663.1
env=anymal_uniform_kinematic    | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=42.6     | runtime_fps=861.0     | realtime_factor=8.6
env=go2         | batch_size=4096       | use_contact_island=False      | gjk_collision=True    | dtype=ndarray         | backend=amdgpu        | compile_time=52.0     | runtime_fps=506838.0  | realtime_factor=5068.4
env=go2         | batch_size=4096       | constraint_solver=CG  | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=78.7     | runtime_fps=540631.0  | realtime_factor=5406.3
env=go2         | batch_size=4096       | constraint_solver=Newton      | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=79.1     | runtime_fps=686389.0  | realtime_factor=6863.9
env=franka_accessors    | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=38.7     | runtime_fps=760.0     | realtime_factor=7.6
env=franka_accessors    | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.1     | runtime_fps=3370704.0         | realtime_factor=33707.0
env=franka_free         | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.9     | runtime_fps=4551981.0         | realtime_factor=45519.8
env=franka      | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=80.1     | runtime_fps=3117229.0         | realtime_factor=31172.3
env=franka_random       | batch_size=30000      | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=80.2     | runtime_fps=2548096.0        | realtime_factor=25481.0
env=franka_random       | batch_size=30000      | use_contact_island=False      | gjk_collision=True    | dtype=ndarray         | backend=amdgpu        | compile_time=53.0     | runtime_fps=1984557.0        | realtime_factor=19845.6
env=franka_random       | batch_size=30000      | constraint_solver=CG  | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.4     | runtime_fps=2241193.0        | realtime_factor=22411.9
env=franka_random       | batch_size=30000      | constraint_solver=Newton      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.5     | runtime_fps=2557212.0 | realtime_factor=25572.1
env=franka_random       | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=39.4     | runtime_fps=2669.0    | realtime_factor=26.7
env=box_pyramid_3       | batch_size=4096       | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.3     | runtime_fps=186131.0  | realtime_factor=1861.3
env=box_pyramid_4       | batch_size=4096       | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=78.0     | runtime_fps=74858.0   | realtime_factor=748.6
env=box_pyramid_5       | batch_size=4096       | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.7     | runtime_fps=25651.0   | realtime_factor=256.5

MEM:

env batch_size backend dtype gjk_collision constraint_solver max_mem_mb
duck_in_box_easy 30000 gpu ndarray True - 35445
duck_in_box_easy 30000 gpu ndarray False - 36424
duck_in_box_hard 30000 gpu ndarray True - 36398
duck_in_box_hard 30000 gpu ndarray False - 36399
duck_in_box_hard 0 cpu ndarray - - 3249
anymal_random 30000 gpu ndarray - - 36832
anymal_uniform 30000 gpu ndarray - - 36834
anymal_zero 30000 gpu ndarray - - 36835
anymal_zero 0 cpu ndarray - - 3685
anymal_uniform_kinematic 30000 gpu ndarray - - 36837
anymal_uniform_kinematic 0 cpu ndarray - - 3687
go2 4096 gpu ndarray True - 36856
go2 4096 gpu ndarray False CG 36860
go2 4096 gpu ndarray False Newton 36880
franka_accessors 0 cpu ndarray - - 3692
franka_accessors 30000 gpu ndarray - - 36927
franka_free 30000 gpu ndarray - - 36942
franka 30000 gpu ndarray - - 36928
franka_random 30000 gpu ndarray False - 36926
franka_random 30000 gpu ndarray True - 36919
franka_random 30000 gpu ndarray - CG 36925
franka_random 30000 gpu ndarray - Newton 36929
franka_random 0 cpu ndarray - - 3773
box_pyramid_3 4096 gpu ndarray - - 36924
box_pyramid_4 4096 gpu ndarray - - 36926
box_pyramid_5 4096 gpu ndarray - - 36927
box_pyramid_6 4096 gpu ndarray True - 36928

SPEED:

env batch_size constraint_solver use_contact_island dtype backend gjk_collision runtime_fps compile_time realtime_factor
duck_in_box_easy 30000 - False ndarray amdgpu True 3.04145e+06 48.5 30414.5
duck_in_box_easy 30000 - False ndarray amdgpu False 4.40959e+06 75.5 44095.9
duck_in_box_hard 30000 - False ndarray amdgpu True 653952 49.1 6539.5
duck_in_box_hard 30000 - False ndarray amdgpu False 1.94238e+06 76.5 19423.8
duck_in_box_hard 0 - False ndarray cpu - 2308 37.4 23.1
anymal_random 30000 - False ndarray amdgpu - 1.33453e+06 77.6 13345.3
anymal_uniform 30000 - False ndarray amdgpu - 1.70329e+06 77.6 17032.9
anymal_zero 30000 - False ndarray amdgpu - 2.00612e+06 78 20061.2
anymal_zero 0 - False ndarray cpu - 2897 38.6 29
anymal_uniform_kinematic 30000 - False ndarray amdgpu - 1.46631e+06 83.5 14663.1
anymal_uniform_kinematic 0 - False ndarray cpu - 861 42.6 8.6
go2 4096 - False ndarray amdgpu True 506838 52 5068.4
go2 4096 CG False ndarray amdgpu False 540631 78.7 5406.3
go2 4096 Newton False ndarray amdgpu False 686389 79.1 6863.9
franka_accessors 0 - False ndarray cpu - 760 38.7 7.6
franka_accessors 30000 - False ndarray amdgpu - 3.3707e+06 79.1 33707
franka_free 30000 - False ndarray amdgpu - 4.55198e+06 79.9 45519.8
franka 30000 - False ndarray amdgpu - 3.11723e+06 80.1 31172.3
franka_random 30000 - False ndarray amdgpu False 2.5481e+06 80.2 25481
franka_random 30000 - False ndarray amdgpu True 1.98456e+06 53 19845.6
franka_random 30000 CG False ndarray amdgpu - 2.24119e+06 79.4 22411.9
franka_random 30000 Newton False ndarray amdgpu - 2.55721e+06 79.5 25572.1
franka_random 0 - False ndarray cpu - 2669 39.4 26.7
box_pyramid_3 4096 - False ndarray amdgpu - 186131 79.3 1861.3
box_pyramid_4 4096 - False ndarray amdgpu - 74858 78 748.6
box_pyramid_5 4096 - False ndarray amdgpu - 25651 79.7 256.5

RTX 6000

MEM:

env batch_size backend dtype gjk_collision constraint_solver max_mem_mb
duck_in_box_easy 30000 gpu ndarray True - 3858
duck_in_box_easy 30000 gpu ndarray False - 3252
duck_in_box_hard 30000 gpu ndarray True - 13974
duck_in_box_hard 30000 gpu ndarray False - 13368
duck_in_box_hard 0 cpu ndarray - - 804
anymal_random 30000 gpu ndarray - - 12414
anymal_uniform 30000 gpu ndarray - - 12414
anymal_zero 30000 gpu ndarray - - 12416
anymal_zero 0 cpu ndarray - - 1228
anymal_uniform_kinematic 30000 gpu ndarray - - 12738
anymal_uniform_kinematic 0 cpu ndarray - - 1230
go2 4096 gpu ndarray True - 4580
go2 4096 gpu ndarray False CG 3974
go2 4096 gpu ndarray False Newton 3976
franka_accessors 0 cpu ndarray - - 1236
franka_accessors 30000 gpu ndarray - - 16268
franka_free 30000 gpu ndarray - - 16270
franka 30000 gpu ndarray - - 16272
franka_random 30000 gpu ndarray False - 16274
franka_random 30000 gpu ndarray True - 16884
franka_random 30000 gpu ndarray - CG 16278
franka_random 30000 gpu ndarray - Newton 16278
franka_random 0 cpu ndarray - - 1250
box_pyramid_3 4096 gpu ndarray - - 2104
box_pyramid_4 4096 gpu ndarray - - 3386
box_pyramid_5 4096 gpu ndarray - - 6620
box_pyramid_6 4096 gpu ndarray True - 10622
box_pyramid_6 4096 gpu ndarray False - 9984
g1_fall 4096 gpu ndarray - Newton 5026
dex_hand 4096 gpu ndarray - - 8644

SPEED:

env batch_size constraint_solver use_contact_island dtype backend gjk_collision runtime_fps compile_time realtime_factor
duck_in_box_easy 30000 - False ndarray cuda True 6.95194e+06 27.2 69519.4
duck_in_box_easy 30000 - False ndarray cuda False 6.9787e+06 42.6 69787
duck_in_box_hard 30000 - False ndarray cuda True 3.1793e+06 27.5 31793
duck_in_box_hard 30000 - False ndarray cuda False 6.94348e+06 43.8 69434.9
duck_in_box_hard 0 - False ndarray cpu - 4551 22.4 45.5
anymal_random 30000 - False ndarray cuda - 6.91582e+06 43.3 69158.1
anymal_uniform 30000 - False ndarray cuda - 6.93112e+06 43.4 69311.2
anymal_zero 30000 - False ndarray cuda - 6.94098e+06 43.6 69409.8
anymal_zero 0 - False ndarray cpu - 5634 22.4 56.3
anymal_uniform_kinematic 30000 - False ndarray cuda - 6.46777e+06 46.2 64677.7
anymal_uniform_kinematic 0 - False ndarray cpu - 1905 24.5 19.1
go2 4096 - False ndarray cuda True 889852 28.4 8898.5
go2 4096 CG False ndarray cuda False 424786 52.8 4247.9
go2 4096 Newton False ndarray cuda False 399131 44.5 3991.3
franka_accessors 0 - False ndarray cpu - 1662 22.9 16.6
franka_accessors 30000 - False ndarray cuda - 1.76582e+06 52.9 17658.2
franka_free 30000 - False ndarray cuda - 1.82898e+06 44.6 18289.8
franka 30000 - False ndarray cuda - 5.43482e+06 44.9 54348.2
franka_random 30000 - False ndarray cuda False 6.73956e+06 44.1 67395.6
franka_random 30000 - False ndarray cuda True 6.69475e+06 28.1 66947.5
franka_random 30000 CG False ndarray cuda - 6.98663e+06 43.8 69866.3
franka_random 30000 Newton False ndarray cuda - 6.72559e+06 44.1 67255.9
franka_random 0 - False ndarray cpu - 5043 22.8 50.4
box_pyramid_3 4096 - False ndarray cuda - 655001 44 6550
box_pyramid_4 4096 - False ndarray cuda - 243413 43.4 2434.1
box_pyramid_5 4096 - False ndarray cuda - 87378 43.7 873.8
box_pyramid_6 4096 - False ndarray cuda True 40378 27.8 403.8
box_pyramid_6 4096 - False ndarray cuda False 36980 43.8 369.8
g1_fall 4096 Newton False ndarray cuda - 99842 43.9 499.2
dex_hand 4096 - False ndarray cuda - 12757 61 797.3

Comment thread tests/conftest.py Outdated
Comment on lines +113 to +123
try:
    assert sys.platform.startswith("linux")
    subprocess.check_output(["nvidia-smi"], stderr=subprocess.STDOUT, timeout=10)
    return True, None
except Exception as exc:  # platform or nvidia-smi unavailable
    pass

try:
    subprocess.check_output(["rocm-smi"], stderr=subprocess.STDOUT, timeout=10)
    return True, None
except Exception as exc:
    pass
Collaborator


  • Avoid duplicated logic
  • Never catch generic Exception unless it is absolutely necessary:
utilities = ("nvidia-smi", "rocm-smi")
for exe in utilities:
    try:
        subprocess.check_output([exe], stderr=subprocess.STDOUT, timeout=10)
        return True, None
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        pass

return False, f"None of the executables {utilities} exited successfully (code 0)."

Comment thread tests/conftest.py Outdated
Comment on lines +237 to +309
if NVIDIA_GPU_INTERFACE_PATH.is_dir():
    return tuple(i for i, _ in enumerate(NVIDIA_GPU_INTERFACE_PATH.iterdir()))
if KFD_SYSFS_PATH.is_dir():
    return tuple(i for i, _ in enumerate(get_kfd_gpu_nodes_properties()))

warnings.warn(
    f"'{NVIDIA_GPU_INTERFACE_PATH!s}' or '{KFD_SYSFS_PATH!s}' is not available. Multi-GPU support will be disabled. This is expected "
    "on WSL2 where the NVIDIA proc interface is not mounted.",
    stacklevel=2,
)

return (0,)


def parse_kfd_node_properties(kfd_properties_str: str):
    props = {}
    for l in kfd_properties_str.split("\n"):
        l = l.strip()

        if not l:
            continue

        name, value_str = l.split()
        props[name] = int(value_str)

    return props


KFD_SYSFS_PATH = Path("/sys/devices/virtual/kfd/kfd/topology")


def get_kfd_gpu_nodes_properties():
    kfd_sysfs_path_nodes = Path(KFD_SYSFS_PATH) / "nodes"

    gpu_nodes_properties = {}

    for node_path in kfd_sysfs_path_nodes.iterdir():
        with (node_path / "properties").open() as node_properties_f:
            properties_str = node_properties_f.read()
        node_props = parse_kfd_node_properties(properties_str)

        if node_props["cpu_cores_count"] == 0:
            gpu_nodes_properties[int(node_path.name)] = node_props

    return gpu_nodes_properties


def amdgpu_get_node_rank_from_uuid(device_uuid):
    device_uuid = device_uuid.replace("-", "")
    hip_uuid = "".join([chr(int(device_uuid[i : i + 2], 16)) for i in range(0, len(device_uuid), 2)])
    unique_id = int(hip_uuid, 16)

    gpu_nodes_properties = get_kfd_gpu_nodes_properties()

    for node_rank, gpu_props in enumerate(gpu_nodes_properties.values()):
        if gpu_props["unique_id"] == unique_id:
            return node_rank

    return -1


NVIDIA_GPU_INTERFACE_PATH = Path("/proc/driver/nvidia/gpus/")


def nvidia_get_node_rank_from_uuid(device_uuid):
    for device_idx, device_path in enumerate(NVIDIA_GPU_INTERFACE_PATH.iterdir()):
        with (device_path / "information").open() as f:
            device_info = f.read()
        if re.search(rf"GPU UUID:\s+GPU-{device_uuid}", device_info):
            return device_idx

    return -1
Collaborator


This is starting to look very fragile. Is there any library out there that could be used to hide all this logic?

@duburcqa
Collaborator

What about this plan?

 Plan: Cross-vendor GPU info module for tests
 Context

 PR #2680 adds ROCm/AMD support to the test infrastructure, but the vendor-specific GPU queries (device count, VRAM, UUID mapping, per-process memory) are scattered across conftest.py and monitor_test_mem.py with if nvidia ... elif amd ...
 branching. A review comment flagged this as fragile. We need a clean OOP abstraction that isolates vendor-specific logic and makes it easy to maintain and extend.

 Only NVIDIA and AMD GPUs need multi-GPU support.

 Design

 Create tests/gpu_info.py with:

 GpuBackend (ABC)
   ├── NvidiaBackend   # /proc/driver/nvidia/gpus/ + nvidia-smi
   └── AmdBackend      # KFD sysfs + rocm-smi

 detect_gpu_backend() -> GpuBackend | None

 Abstract interface — 4 methods

 ┌──────────────────────────────────┬─────────────────┬────────────────────────────────────────┐
 │              Method              │     Returns     │                Used by                 │
 ├──────────────────────────────────┼─────────────────┼────────────────────────────────────────┤
 │ get_device_count()               │ int             │ conftest._get_gpu_indices              │
 ├──────────────────────────────────┼─────────────────┼────────────────────────────────────────┤
 │ get_device_vram_mib()            │ tuple[int, ...] │ conftest.pytest_xdist_auto_num_workers │
 ├──────────────────────────────────┼─────────────────┼────────────────────────────────────────┤
 │ get_device_index_from_uuid(uuid) │ int             │ conftest._torch_get_gpu_idx            │
 ├──────────────────────────────────┼─────────────────┼────────────────────────────────────────┤
 │ get_per_process_vram_mib()       │ dict[int, int]  │ monitor_test_mem.get_cuda_usage        │
 └──────────────────────────────────┴─────────────────┴────────────────────────────────────────┘

 NvidiaBackend

 - get_device_count: enumerate /proc/driver/nvidia/gpus/
 - get_device_vram_mib: nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
 - get_device_index_from_uuid: read /proc/driver/nvidia/gpus/*/information, match UUID
 - get_per_process_vram_mib: parse nvidia-smi output (existing logic from monitor_test_mem.py)

 AmdBackend

 - get_device_count: enumerate KFD sysfs GPU nodes (filter out CPU nodes)
 - get_device_vram_mib: rocm-smi --showmeminfo vram -d 0-255
 - get_device_index_from_uuid: match against KFD unique_id property
 - get_per_process_vram_mib: parse rocm-smi --showpids output

 detect_gpu_backend()

 Returns NvidiaBackend if /proc/driver/nvidia/gpus/ exists, AmdBackend if KFD sysfs exists, else None.

 Files to modify

 1. Create tests/gpu_info.py — the new module
 2. tests/conftest.py — replace inline GPU queries with gpu_info calls:
   - is_mem_monitoring_supported() → detect_gpu_backend() is not None
   - _get_gpu_indices() → backend.get_device_count()
   - _torch_get_gpu_idx() → backend.get_device_index_from_uuid()
   - pytest_xdist_auto_num_workers() → backend.get_device_vram_mib()
   - Remove NVIDIA_GPU_INTERFACE_PATH, KFD_SYSFS_PATH, and all KFD helper functions
 3. tests/monitor_test_mem.py — replace get_cuda_usage() with backend.get_per_process_vram_mib()
   - Remove parse_nvidia_smi_output_for_mem_per_process and parse_rocm_smi_output_for_mem_per_process
 4. tests/test_utils.py — move the rocm-smi parsing test to test the new module instead

 What stays untouched

 - _get_egl_index() — NVIDIA-specific EGL logic, stays in conftest (it will call backend.get_device_count() instead of _get_gpu_indices())
 - Intel XPU fallback in pytest_xdist_auto_num_workers — stays as-is (only triggers when no NVIDIA/AMD backend is detected)
 - genesis/utils/misc.py — uses PyTorch APIs, not raw GPU queries

 Verification

 - python -m pytest tests/test_utils.py -k rocm_smi — existing rocm-smi parsing test should pass (moved to test gpu_info)
 - python -m pytest tests/test_utils.py -k gpu_info — if we add unit tests for the new module
 - On NVIDIA machine: python -c "from tests.gpu_info import detect_gpu_backend; b = detect_gpu_backend(); print(b, b.get_device_count())"
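The abstract interface from the plan can be sketched roughly as follows (a sketch under the plan's assumptions: the class and method names come from the table above, the non-implemented methods are stubbed, and the real backends would wrap nvidia-smi/rocm-smi parsing):

```python
from abc import ABC, abstractmethod
from pathlib import Path


class GpuBackend(ABC):
    """Vendor-specific GPU queries behind a common interface."""

    @abstractmethod
    def get_device_count(self) -> int: ...

    @abstractmethod
    def get_device_vram_mib(self) -> tuple: ...

    @abstractmethod
    def get_device_index_from_uuid(self, uuid: str) -> int: ...

    @abstractmethod
    def get_per_process_vram_mib(self) -> dict: ...


NVIDIA_GPU_INTERFACE_PATH = Path("/proc/driver/nvidia/gpus/")
KFD_SYSFS_PATH = Path("/sys/devices/virtual/kfd/kfd/topology")


class NvidiaBackend(GpuBackend):
    def get_device_count(self):
        # Enumerate the NVIDIA proc interface, one directory per GPU.
        return sum(1 for _ in NVIDIA_GPU_INTERFACE_PATH.iterdir())

    # Remaining methods would wrap nvidia-smi / proc parsing (stubbed here).
    def get_device_vram_mib(self): raise NotImplementedError
    def get_device_index_from_uuid(self, uuid): raise NotImplementedError
    def get_per_process_vram_mib(self): raise NotImplementedError


class AmdBackend(GpuBackend):
    def get_device_count(self):
        # Enumerate KFD topology nodes (a real backend would filter CPU nodes).
        return sum(1 for _ in (KFD_SYSFS_PATH / "nodes").iterdir())

    # Remaining methods would wrap rocm-smi / KFD sysfs parsing (stubbed here).
    def get_device_vram_mib(self): raise NotImplementedError
    def get_device_index_from_uuid(self, uuid): raise NotImplementedError
    def get_per_process_vram_mib(self): raise NotImplementedError


def detect_gpu_backend():
    if NVIDIA_GPU_INTERFACE_PATH.is_dir():
        return NvidiaBackend()
    if KFD_SYSFS_PATH.is_dir():
        return AmdBackend()
    return None
```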

@duburcqa duburcqa changed the title [FIX] fix some ROCm incompatibilities [MISC] Add full support of AMD GPU to unit test and benchmark infra. Apr 10, 2026
@v01dXYZ v01dXYZ marked this pull request as draft April 11, 2026 10:05
@hughperkins
Collaborator

Note: not required for merge, but for the benchmarks I'd be interested in seeing a side-by-side comparison of the CUDA and AMD GPUs for runtime FPS, i.e. one column for the CUDA card, one column for the AMD card, and one column showing the ratio of the two (and/or provide a link to a downloadable CSV file for each, so I can easily compare myself).
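Such a ratio column could be computed from two CSV exports along these lines (a sketch: the column names are assumed, and the sample values are copied from the `franka`/`go2` rows of the tables in this thread):

```python
import csv
import io

# Hypothetical excerpts of the two benchmark CSV exports.
cuda_csv = "env,runtime_fps\nfranka,5434820\ngo2,889852\n"
amd_csv = "env,runtime_fps\nfranka,3117229\ngo2,506838\n"

def load(text):
    """Map env name -> runtime FPS from a CSV string."""
    return {row["env"]: float(row["runtime_fps"])
            for row in csv.DictReader(io.StringIO(text))}

cuda, amd = load(cuda_csv), load(amd_csv)
# Side-by-side rows with a cuda/amd ratio column, for envs present in both.
for env in sorted(cuda.keys() & amd.keys()):
    ratio = cuda[env] / amd[env]
    print(f"{env:10s} cuda={cuda[env]:>10.0f} amd={amd[env]:>10.0f} ratio={ratio:.2f}")
```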

@v01dXYZ
Author

v01dXYZ commented Apr 12, 2026

My commits are not clean; I was trying to make this work on packet.ai. If you try to test with this provider, it seems only west2 instances currently have nvidia-smi with per-process metrics. The reason there are so many problems is that these are not real VMs but k8s pods.

@hughperkins To answer your request, I was able to run the speed/mem benchmarks on both a Hot Aisle MI300X and a Packet.ai RTX9000PRO. Those are the tables above; I'll reformat them.

@hughperkins
Collaborator

> My commits are not clean, I was trying to make it work on packet.ai. If you try to test it with this provider, it seems right now only west2 instances are able to have nvidia-smi with per-process metrics. The reason there are so many problems is because it's not real VMs but a k8s pod.
>
> @hughperkins To answer your request, I was able to run the speed/mem benchmark on both Hot Aisle MI300X and Packet.ai RTX9000PRO. That are the tables above. I'll reformat them.

Note: I've experienced issues with nvidia-smi previously.

You might have success migrating to this branch: https://github.com/Genesis-Embodied-AI/Genesis/compare/main...hughperkins:Genesis:hp/mem-monitor-faster?expand=1

@v01dXYZ

This comment was marked as outdated.

@hughperkins
Collaborator

Yes, using nvidia-ml-py could also be easier and more robust (no parsing of CLI-tool output!).

Here is the comparison table:

| benchmark config | mi300x runtime_fps | rtx6000pro runtime_fps | mi300x compile_time | rtx6000pro compile_time | mi300x realtime_factor | rtx6000pro realtime_factor |
|---|---|---|---|---|---|---|
| ('duck_in_box_easy', '30000', '-', 'False', 'ndarray', 'True') | 3041448.0 | 6.95194e+06 | 48.5 | 27.2 | 30414.5 | 69519.4 |
| ('duck_in_box_easy', '30000', '-', 'False', 'ndarray', 'False') | 4409587.0 | 6.9787e+06 | 75.5 | 42.6 | 44095.9 | 69787 |
| ('duck_in_box_hard', '30000', '-', 'False', 'ndarray', 'True') | 653952.0 | 3.1793e+06 | 49.1 | 27.5 | 6539.5 | 31793 |
| ('duck_in_box_hard', '30000', '-', 'False', 'ndarray', 'False') | 1942384.0 | 6.94348e+06 | 76.5 | 43.8 | 19423.8 | 69434.9 |
| ('anymal_random', '30000', '-', 'False', 'ndarray', '-') | 1334529.0 | 6.91582e+06 | 77.6 | 43.3 | 13345.3 | 69158.1 |
| ('anymal_uniform', '30000', '-', 'False', 'ndarray', '-') | 1703289.0 | 6.93112e+06 | 77.6 | 43.4 | 17032.9 | 69311.2 |
| ('anymal_zero', '30000', '-', 'False', 'ndarray', '-') | 2006120.0 | 6.94098e+06 | 78.0 | 43.6 | 20061.2 | 69409.8 |
| ('anymal_uniform_kinematic', '30000', '-', 'False', 'ndarray', '-') | 1466308.0 | 6.46777e+06 | 83.5 | 46.2 | 14663.1 | 64677.7 |
| ('go2', '4096', '-', 'False', 'ndarray', 'True') | 506838.0 | 889852 | 52.0 | 28.4 | 5068.4 | 8898.5 |
| ('go2', '4096', 'CG', 'False', 'ndarray', 'False') | 540631.0 | 424786 | 78.7 | 52.8 | 5406.3 | 4247.9 |
| ('go2', '4096', 'Newton', 'False', 'ndarray', 'False') | 686389.0 | 399131 | 79.1 | 44.5 | 6863.9 | 3991.3 |
| ('franka_accessors', '30000', '-', 'False', 'ndarray', '-') | 3370704.0 | 1.76582e+06 | 79.1 | 52.9 | 33707.0 | 17658.2 |
| ('franka_free', '30000', '-', 'False', 'ndarray', '-') | 4551981.0 | 1.82898e+06 | 79.9 | 44.6 | 45519.8 | 18289.8 |
| ('franka', '30000', '-', 'False', 'ndarray', '-') | 3117229.0 | 5.43482e+06 | 80.1 | 44.9 | 31172.3 | 54348.2 |
| ('franka_random', '30000', '-', 'False', 'ndarray', 'False') | 2548096.0 | 6.73956e+06 | 80.2 | 44.1 | 25481.0 | 67395.6 |
| ('franka_random', '30000', '-', 'False', 'ndarray', 'True') | 1984557.0 | 6.69475e+06 | 53.0 | 28.1 | 19845.6 | 66947.5 |
| ('franka_random', '30000', 'CG', 'False', 'ndarray', '-') | 2241193.0 | 6.98663e+06 | 79.4 | 43.8 | 22411.9 | 69866.3 |
| ('franka_random', '30000', 'Newton', 'False', 'ndarray', '-') | 2557212.0 | 6.72559e+06 | 79.5 | 44.1 | 25572.1 | 67255.9 |
| ('box_pyramid_3', '4096', '-', 'False', 'ndarray', '-') | 186131.0 | 655001 | 79.3 | 44 | 1861.3 | 6550 |
| ('box_pyramid_4', '4096', '-', 'False', 'ndarray', '-') | 74858.0 | 243413 | 78.0 | 43.4 | 748.6 | 2434.1 |
| ('box_pyramid_5', '4096', '-', 'False', 'ndarray', '-') | 25651.0 | 87378 | 79.7 | 43.7 | 256.5 | 873.8 |
| ('box_pyramid_6', '4096', '-', 'False', 'ndarray', 'True') | - | 40378 | - | 27.8 | - | 403.8 |
| ('box_pyramid_6', '4096', '-', 'False', 'ndarray', 'False') | - | 36980 | - | 43.8 | - | 369.8 |
| ('g1_fall', '4096', 'Newton', 'False', 'ndarray', '-') | - | 99842 | - | 43.9 | - | 499.2 |
| ('dex_hand', '4096', '-', 'False', 'ndarray', '-') | - | 12757 | - | 61 | - | 797.3 |
Script to generate it

  • Formatting seems inconsistent
    • one runtime_fps column uses non-scientific notation and the other uses scientific, which makes them hard to compare, I feel
  • would be very useful to have a ratio column for the runtime fps, I feel
    • in our other benchmarking tables, the ratio is evaluated as (v2/v1 - 1) * 100
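
The ratio convention mentioned above, (v2/v1 - 1) * 100, expresses v2 as a percentage change relative to v1. A tiny sketch (which card plays v1 vs v2 is a choice; here v1 is assumed to be the rtx6000pro):

```python
def pct_change(v1, v2):
    """Percentage change of v2 relative to v1; negative means v2 is smaller."""
    return (v2 / v1 - 1) * 100

# e.g. comparing mi300x (v2) against rtx6000pro (v1) runtime fps for go2,
# using the values from the table above:
print(pct_change(889852, 506838))  # about -43, i.e. mi300x ~43% slower
```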

@v01dXYZ
Author

v01dXYZ commented Apr 12, 2026

@hughperkins here is the speed table as it should be.

To reproduce:

SPEED

| benchmark config | mi300x runtime_fps | rtx6000pro runtime_fps | runtime_fps mi300x/rtx6000pro | mi300x compile_time | rtx6000pro compile_time | compile_time mi300x/rtx6000pro | mi300x realtime_factor | rtx6000pro realtime_factor | realtime_factor mi300x/rtx6000pro |
|---|---|---|---|---|---|---|---|---|---|
| ('duck_in_box_easy', 30000, '-', 'False', 'ndarray', 'True') | 3041448 | 6951939 | 0.437 | 48 | 27 | 1.78 | 30414 | 69519 | 0.437 |
| ('duck_in_box_easy', 30000, '-', 'False', 'ndarray', 'False') | 4409587 | 6978696 | 0.632 | 75 | 42 | 1.79 | 44095 | 69787 | 0.632 |
| ('duck_in_box_hard', 30000, '-', 'False', 'ndarray', 'True') | 653952 | 3179303 | 0.206 | 49 | 27 | 1.81 | 6539 | 31793 | 0.206 |
| ('duck_in_box_hard', 30000, '-', 'False', 'ndarray', 'False') | 1942384 | 6943485 | 0.28 | 76 | 43 | 1.77 | 19423 | 69434 | 0.28 |
| ('anymal_random', 30000, '-', 'False', 'ndarray', '-') | 1334529 | 6915815 | 0.193 | 77 | 43 | 1.79 | 13345 | 69158 | 0.193 |
| ('anymal_uniform', 30000, '-', 'False', 'ndarray', '-') | 1703289 | 6931124 | 0.246 | 77 | 43 | 1.79 | 17032 | 69311 | 0.246 |
| ('anymal_zero', 30000, '-', 'False', 'ndarray', '-') | 2006120 | 6940983 | 0.289 | 78 | 43 | 1.81 | 20061 | 69409 | 0.289 |
| ('anymal_uniform_kinematic', 30000, '-', 'False', 'ndarray', '-') | 1466308 | 6467770 | 0.227 | 83 | 46 | 1.8 | 14663 | 64677 | 0.227 |
| ('go2', 4096, '-', 'False', 'ndarray', 'True') | 506838 | 889852 | 0.57 | 52 | 28 | 1.86 | 5068 | 8898 | 0.57 |
| ('go2', 4096, 'CG', 'False', 'ndarray', 'False') | 540631 | 424786 | 1.27 | 78 | 52 | 1.5 | 5406 | 4247 | 1.27 |
| ('go2', 4096, 'Newton', 'False', 'ndarray', 'False') | 686389 | 399131 | 1.72 | 79 | 44 | 1.8 | 6863 | 3991 | 1.72 |
| ('franka_accessors', 30000, '-', 'False', 'ndarray', '-') | 3370704 | 1765821 | 1.91 | 79 | 52 | 1.52 | 33707 | 17658 | 1.91 |
| ('franka_free', 30000, '-', 'False', 'ndarray', '-') | 4551981 | 1828975 | 2.49 | 79 | 44 | 1.8 | 45519 | 18289 | 2.49 |
| ('franka', 30000, '-', 'False', 'ndarray', '-') | 3117229 | 5434819 | 0.574 | 80 | 44 | 1.82 | 31172 | 54348 | 0.574 |
| ('franka_random', 30000, '-', 'False', 'ndarray', 'False') | 2548096 | 6739563 | 0.378 | 80 | 44 | 1.82 | 25481 | 67395 | 0.378 |
| ('franka_random', 30000, '-', 'False', 'ndarray', 'True') | 1984557 | 6694750 | 0.296 | 53 | 28 | 1.89 | 19845 | 66947 | 0.296 |
| ('franka_random', 30000, 'CG', 'False', 'ndarray', '-') | 2241193 | 6986628 | 0.321 | 79 | 43 | 1.84 | 22411 | 69866 | 0.321 |
| ('franka_random', 30000, 'Newton', 'False', 'ndarray', '-') | 2557212 | 6725591 | 0.38 | 79 | 44 | 1.8 | 25572 | 67255 | 0.38 |
| ('box_pyramid_3', 4096, '-', 'False', 'ndarray', '-') | 186131 | 655001 | 0.284 | 79 | 44 | 1.8 | 1861 | 6550 | 0.284 |
| ('box_pyramid_4', 4096, '-', 'False', 'ndarray', '-') | 74858 | 243413 | 0.308 | 78 | 43 | 1.81 | 748 | 2434 | 0.307 |
| ('box_pyramid_5', 4096, '-', 'False', 'ndarray', '-') | 25651 | 87378 | 0.294 | 79 | 43 | 1.84 | 256 | 873 | 0.293 |
| ('box_pyramid_6', 4096, '-', 'False', 'ndarray', 'True') | - | 40378 | nan | - | 27 | nan | - | 403 | nan |
| ('box_pyramid_6', 4096, '-', 'False', 'ndarray', 'False') | - | 36980 | nan | - | 43 | nan | - | 369 | nan |
| ('g1_fall', 4096, 'Newton', 'False', 'ndarray', '-') | - | 99842 | nan | - | 43 | nan | - | 499 | nan |
| ('dex_hand', 4096, '-', 'False', 'ndarray', '-') | - | 12757 | nan | - | 61 | nan | - | 797 | nan |

@hughperkins
Collaborator

Interesting:

  • slower on many things (but not by much; all within a single order of magnitude)
  • faster on a few things, by a similar factor

Any thoughts on what the things that run faster on the AMD GPU have in common? (I glanced at them, but it seems not obvious to me by simple inspection.) Might be worth running profiling on them, to compare the kernel times between CUDA and AMD. Notes:

  • for CUDA we use pytorch profiling. I'm not sure if something similar exists for AMD?
  • recommend disabling CUDA graph (QD_GRAPH=0) when running profiling, otherwise you just get a giant "white space" in the profile where the graph was running, and no detailed kernel timings.

@hughperkins
Collaborator

hughperkins commented Apr 12, 2026

(It would also be nice to have a table summarizing the key statistics of each graphics card. Things like:

  • how much global memory?
  • size of L1 cache
  • size of L2 cache
  • how many GPU cores
  • etc ...

Edit: To be clear, none of the benchmarks or GPU comparisons is part of this PR, or required for merge. I'm also happy to move the perf discussion elsewhere perhaps 🤔
)

@v01dXYZ
Author

v01dXYZ commented Apr 12, 2026

@hughperkins We could create an issue to discuss the MI300X's disappointing performance wrt its specs.

@hughperkins
Collaborator

> @hughperkins We could create an issue to discuss MI300X disappointing performance wrt its specs.

Works for me. (Or a Discussion perhaps? Since an issue tends to be for something fairly concrete and well-defined, with a clear 'definition of done', I feel.)
