
[MISC] Add full support of AMD GPU to unit test and benchmark infra.#2680

Draft
v01dXYZ wants to merge 24 commits into Genesis-Embodied-AI:main from v01dXYZ:fix-rocm-compatibility

Conversation

@v01dXYZ

@v01dXYZ v01dXYZ commented Apr 8, 2026

Description

  • Fall back to rocm-smi if nvidia-smi is not found
  • Use the KFD sysfs interface instead of the NVIDIA proc interface
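The second bullet can be sketched as follows — a minimal, hedged example of enumerating AMD GPU nodes from the KFD sysfs topology (paths and the `cpu_cores_count` filter taken from the PR diff; `list_kfd_gpu_nodes` is an illustrative name, not the PR's actual helper):

```python
from pathlib import Path

KFD_TOPOLOGY = Path("/sys/devices/virtual/kfd/kfd/topology/nodes")

def list_kfd_gpu_nodes():
    """Enumerate KFD topology nodes, keeping only GPU nodes
    (nodes whose 'cpu_cores_count' property is 0).
    Returns an empty dict when KFD sysfs is absent (non-AMD host)."""
    gpu_nodes = {}
    if not KFD_TOPOLOGY.is_dir():
        return gpu_nodes
    for node_path in KFD_TOPOLOGY.iterdir():
        props = {}
        # Each properties file is a list of "name value" lines.
        for line in (node_path / "properties").read_text().splitlines():
            if line.strip():
                name, value = line.split()
                props[name] = int(value)
        if props.get("cpu_cores_count") == 0:
            gpu_nodes[int(node_path.name)] = props
    return gpu_nodes
```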

Related Issue

This is related to #2679. I had a hard time finding the specs of the GitHub runners, so I don't know which AMD GPUs are used to test Genesis.

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

To make benchmark/memory reporting work on ROCm.

How Has This Been / Can This Be Tested?

Yes, though only manually.

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of the CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed: No new/modified API
  • I tested my changes and added instructions on how to test it for reviewers: I didn't test my changes on NVIDIA machines
  • I have added tests to cover my changes.
  • All new and existing tests passed.


@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87d60d8a4d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread tests/conftest.py Outdated
@v01dXYZ v01dXYZ marked this pull request as draft April 8, 2026 17:34
@v01dXYZ v01dXYZ changed the title (WIP) [FIX] some ROCm incompatibilities (WIP) [FIX] fix some ROCm incompatibilities Apr 8, 2026
@v01dXYZ
Author

v01dXYZ commented Apr 9, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 453d137c0a


Comment thread tests/monitor_test_mem.py Outdated
Comment on lines +121 to +122
except FileNotFoundError:
pass


P1 Badge Catch nvidia-smi command failures before ROCm fallback

is_mem_monitoring_supported() now allows monitoring when nvidia-smi fails but rocm-smi works, yet get_cuda_usage() only falls back when nvidia-smi is missing. On ROCm hosts where nvidia-smi is present but exits non-zero (raising subprocess.CalledProcessError), the monitor process crashes before trying rocm-smi, so --mem-monitoring-filepath is reported as supported but fails at runtime.
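A hedged sketch of the fallback behavior this comment asks for — treating a present-but-failing nvidia-smi the same as a missing one, so the monitor falls through to rocm-smi (the helper name and structure are illustrative, not the PR's actual code):

```python
import subprocess

def query_first_available(commands, timeout=10):
    """Return the output of the first command that runs successfully.
    Catches both a missing binary (FileNotFoundError) and a non-zero
    exit (CalledProcessError), so a broken nvidia-smi on a ROCm host
    falls through to rocm-smi instead of crashing the monitor."""
    for cmd in commands:
        try:
            return subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=timeout)
        except (FileNotFoundError, subprocess.CalledProcessError,
                subprocess.TimeoutExpired, OSError):
            continue  # try the next tool
    return None

# e.g. query_first_available([["nvidia-smi"], ["rocm-smi"]])
```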


Author


I assume that on a ROCm machine there is no NVIDIA tooling; otherwise it would be hard to distinguish between the two.

@v01dXYZ
Author

v01dXYZ commented Apr 9, 2026

To avoid allocation failures, I set device_memory_GB=32 when testing manually.

@v01dXYZ v01dXYZ marked this pull request as ready for review April 9, 2026 09:36
@v01dXYZ v01dXYZ changed the title (WIP) [FIX] fix some ROCm incompatibilities [FIX] fix some ROCm incompatibilities Apr 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 453d137c0a


Comment thread tests/conftest.py Outdated
Comment on lines +286 to +287
hip_uuid = "".join([chr(int(device_uuid[i : i + 2], 16)) for i in range(0, len(device_uuid), 2)])
unique_id = int(hip_uuid, 16)


P1 Badge Decode ROCm UUID bytes correctly before integer conversion

amdgpu_get_node_rank_from_uuid() converts each UUID byte to a character and then calls int(..., 16), but that intermediate string will contain non-hex characters for almost all valid UUIDs (e.g., byte 0x7d becomes }), so int raises ValueError before any node lookup happens. On ROCm systems this propagates from _torch_get_gpu_idx() and can fail GPU test setup as soon as device-index validation runs, so memory/device routing checks break even when KFD topology is present.


Author


No: the HIP UUID is already a 16-character hex representation of an int64 integer. PyTorch converts this 16-character hex string into a 32-character hex string (so it is a hex representation of a hex representation).
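The double-hex encoding described above can be demonstrated with a small round-trip (the `unique_id` value here is made up for illustration):

```python
# Hypothetical KFD "unique_id" of a GPU (an int64).
unique_id = 0x075BCD15

# HIP stores it as a 16-character hex string...
hip_uuid = f"{unique_id:016x}"

# ...and PyTorch re-hex-encodes those ASCII bytes into 32 hex chars
# (a hex representation of a hex representation).
torch_uuid = hip_uuid.encode("ascii").hex()

# Reversing the outer encoding, as amdgpu_get_node_rank_from_uuid() does,
# recovers the original hex string and hence the integer id.
decoded = bytes.fromhex(torch_uuid).decode("ascii")
assert decoded == hip_uuid
assert int(decoded, 16) == unique_id
```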

@v01dXYZ
Author

v01dXYZ commented Apr 9, 2026

Below are the results of the mem/speed benchmarks on MI300X:

Details

MEM:

env=duck_in_box_easy    | gjk_collision=True    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=35445
env=duck_in_box_easy    | gjk_collision=False   | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36424
env=duck_in_box_hard    | gjk_collision=True    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36398
env=duck_in_box_hard    | gjk_collision=False   | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36399
env=duck_in_box_hard    | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3249
env=anymal_random       | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36832
env=anymal_uniform      | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36834
env=anymal_zero         | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36835
env=anymal_zero         | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3685
env=anymal_uniform_kinematic    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36837
env=anymal_uniform_kinematic    | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3687
env=go2         | gjk_collision=True    | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36856
env=go2         | constraint_solver=CG  | gjk_collision=False   | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36860
env=go2         | constraint_solver=Newton      | gjk_collision=False   | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36880
env=franka_accessors    | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3692
env=franka_accessors    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36927
env=franka_free         | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36942
env=franka      | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36928
env=franka_random       | gjk_collision=False   | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36926
env=franka_random       | gjk_collision=True    | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36919
env=franka_random       | constraint_solver=CG  | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36925
env=franka_random       | constraint_solver=Newton      | batch_size=30000      | backend=gpu   | dtype=ndarray         | max_mem_mb=36929
env=franka_random       | batch_size=0  | backend=cpu   | dtype=ndarray         | max_mem_mb=3773
env=box_pyramid_3       | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36924
env=box_pyramid_4       | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36926
env=box_pyramid_5       | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36927
env=box_pyramid_6       | gjk_collision=True    | batch_size=4096       | backend=gpu   | dtype=ndarray         | max_mem_mb=36928

SPEED:

env=duck_in_box_easy    | batch_size=30000      | use_contact_island=False      | gjk_collision=True    | dtype=ndarray         | backend=amdgpu        | compile_time=48.5     | runtime_fps=3041448.0        | realtime_factor=30414.5
env=duck_in_box_easy    | batch_size=30000      | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=75.5     | runtime_fps=4409587.0        | realtime_factor=44095.9
env=duck_in_box_hard    | batch_size=30000      | use_contact_island=False      | gjk_collision=True    | dtype=ndarray         | backend=amdgpu        | compile_time=49.1     | runtime_fps=653952.0  | realtime_factor=6539.5
env=duck_in_box_hard    | batch_size=30000      | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=76.5     | runtime_fps=1942384.0        | realtime_factor=19423.8
env=duck_in_box_hard    | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=37.4     | runtime_fps=2308.0    | realtime_factor=23.1
env=anymal_random       | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=77.6     | runtime_fps=1334529.0         | realtime_factor=13345.3
env=anymal_uniform      | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=77.6     | runtime_fps=1703289.0         | realtime_factor=17032.9
env=anymal_zero         | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=78.0     | runtime_fps=2006120.0         | realtime_factor=20061.2
env=anymal_zero         | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=38.6     | runtime_fps=2897.0    | realtime_factor=29.0
env=anymal_uniform_kinematic    | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=83.5     | runtime_fps=1466308.0         | realtime_factor=14663.1
env=anymal_uniform_kinematic    | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=42.6     | runtime_fps=861.0     | realtime_factor=8.6
env=go2         | batch_size=4096       | use_contact_island=False      | gjk_collision=True    | dtype=ndarray         | backend=amdgpu        | compile_time=52.0     | runtime_fps=506838.0  | realtime_factor=5068.4
env=go2         | batch_size=4096       | constraint_solver=CG  | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=78.7     | runtime_fps=540631.0  | realtime_factor=5406.3
env=go2         | batch_size=4096       | constraint_solver=Newton      | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=79.1     | runtime_fps=686389.0  | realtime_factor=6863.9
env=franka_accessors    | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=38.7     | runtime_fps=760.0     | realtime_factor=7.6
env=franka_accessors    | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.1     | runtime_fps=3370704.0         | realtime_factor=33707.0
env=franka_free         | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.9     | runtime_fps=4551981.0         | realtime_factor=45519.8
env=franka      | batch_size=30000      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=80.1     | runtime_fps=3117229.0         | realtime_factor=31172.3
env=franka_random       | batch_size=30000      | use_contact_island=False      | gjk_collision=False   | dtype=ndarray         | backend=amdgpu        | compile_time=80.2     | runtime_fps=2548096.0        | realtime_factor=25481.0
env=franka_random       | batch_size=30000      | use_contact_island=False      | gjk_collision=True    | dtype=ndarray         | backend=amdgpu        | compile_time=53.0     | runtime_fps=1984557.0        | realtime_factor=19845.6
env=franka_random       | batch_size=30000      | constraint_solver=CG  | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.4     | runtime_fps=2241193.0        | realtime_factor=22411.9
env=franka_random       | batch_size=30000      | constraint_solver=Newton      | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.5     | runtime_fps=2557212.0 | realtime_factor=25572.1
env=franka_random       | batch_size=0  | use_contact_island=False      | dtype=ndarray         | backend=cpu   | compile_time=39.4     | runtime_fps=2669.0    | realtime_factor=26.7
env=box_pyramid_3       | batch_size=4096       | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.3     | runtime_fps=186131.0  | realtime_factor=1861.3
env=box_pyramid_4       | batch_size=4096       | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=78.0     | runtime_fps=74858.0   | realtime_factor=748.6
env=box_pyramid_5       | batch_size=4096       | use_contact_island=False      | dtype=ndarray         | backend=amdgpu        | compile_time=79.7     | runtime_fps=25651.0   | realtime_factor=256.5

MEM:

env batch_size backend dtype gjk_collision constraint_solver max_mem_mb
duck_in_box_easy 30000 gpu ndarray True - 35445
duck_in_box_easy 30000 gpu ndarray False - 36424
duck_in_box_hard 30000 gpu ndarray True - 36398
duck_in_box_hard 30000 gpu ndarray False - 36399
duck_in_box_hard 0 cpu ndarray - - 3249
anymal_random 30000 gpu ndarray - - 36832
anymal_uniform 30000 gpu ndarray - - 36834
anymal_zero 30000 gpu ndarray - - 36835
anymal_zero 0 cpu ndarray - - 3685
anymal_uniform_kinematic 30000 gpu ndarray - - 36837
anymal_uniform_kinematic 0 cpu ndarray - - 3687
go2 4096 gpu ndarray True - 36856
go2 4096 gpu ndarray False CG 36860
go2 4096 gpu ndarray False Newton 36880
franka_accessors 0 cpu ndarray - - 3692
franka_accessors 30000 gpu ndarray - - 36927
franka_free 30000 gpu ndarray - - 36942
franka 30000 gpu ndarray - - 36928
franka_random 30000 gpu ndarray False - 36926
franka_random 30000 gpu ndarray True - 36919
franka_random 30000 gpu ndarray - CG 36925
franka_random 30000 gpu ndarray - Newton 36929
franka_random 0 cpu ndarray - - 3773
box_pyramid_3 4096 gpu ndarray - - 36924
box_pyramid_4 4096 gpu ndarray - - 36926
box_pyramid_5 4096 gpu ndarray - - 36927
box_pyramid_6 4096 gpu ndarray True - 36928

SPEED:

env batch_size constraint_solver use_contact_island dtype backend gjk_collision runtime_fps compile_time realtime_factor
duck_in_box_easy 30000 - False ndarray amdgpu True 3.04145e+06 48.5 30414.5
duck_in_box_easy 30000 - False ndarray amdgpu False 4.40959e+06 75.5 44095.9
duck_in_box_hard 30000 - False ndarray amdgpu True 653952 49.1 6539.5
duck_in_box_hard 30000 - False ndarray amdgpu False 1.94238e+06 76.5 19423.8
duck_in_box_hard 0 - False ndarray cpu - 2308 37.4 23.1
anymal_random 30000 - False ndarray amdgpu - 1.33453e+06 77.6 13345.3
anymal_uniform 30000 - False ndarray amdgpu - 1.70329e+06 77.6 17032.9
anymal_zero 30000 - False ndarray amdgpu - 2.00612e+06 78 20061.2
anymal_zero 0 - False ndarray cpu - 2897 38.6 29
anymal_uniform_kinematic 30000 - False ndarray amdgpu - 1.46631e+06 83.5 14663.1
anymal_uniform_kinematic 0 - False ndarray cpu - 861 42.6 8.6
go2 4096 - False ndarray amdgpu True 506838 52 5068.4
go2 4096 CG False ndarray amdgpu False 540631 78.7 5406.3
go2 4096 Newton False ndarray amdgpu False 686389 79.1 6863.9
franka_accessors 0 - False ndarray cpu - 760 38.7 7.6
franka_accessors 30000 - False ndarray amdgpu - 3.3707e+06 79.1 33707
franka_free 30000 - False ndarray amdgpu - 4.55198e+06 79.9 45519.8
franka 30000 - False ndarray amdgpu - 3.11723e+06 80.1 31172.3
franka_random 30000 - False ndarray amdgpu False 2.5481e+06 80.2 25481
franka_random 30000 - False ndarray amdgpu True 1.98456e+06 53 19845.6
franka_random 30000 CG False ndarray amdgpu - 2.24119e+06 79.4 22411.9
franka_random 30000 Newton False ndarray amdgpu - 2.55721e+06 79.5 25572.1
franka_random 0 - False ndarray cpu - 2669 39.4 26.7
box_pyramid_3 4096 - False ndarray amdgpu - 186131 79.3 1861.3
box_pyramid_4 4096 - False ndarray amdgpu - 74858 78 748.6
box_pyramid_5 4096 - False ndarray amdgpu - 25651 79.7 256.5

RTX 6000

MEM:

env batch_size backend dtype gjk_collision constraint_solver max_mem_mb
duck_in_box_easy 30000 gpu ndarray True - 3858
duck_in_box_easy 30000 gpu ndarray False - 3252
duck_in_box_hard 30000 gpu ndarray True - 13974
duck_in_box_hard 30000 gpu ndarray False - 13368
duck_in_box_hard 0 cpu ndarray - - 804
anymal_random 30000 gpu ndarray - - 12414
anymal_uniform 30000 gpu ndarray - - 12414
anymal_zero 30000 gpu ndarray - - 12416
anymal_zero 0 cpu ndarray - - 1228
anymal_uniform_kinematic 30000 gpu ndarray - - 12738
anymal_uniform_kinematic 0 cpu ndarray - - 1230
go2 4096 gpu ndarray True - 4580
go2 4096 gpu ndarray False CG 3974
go2 4096 gpu ndarray False Newton 3976
franka_accessors 0 cpu ndarray - - 1236
franka_accessors 30000 gpu ndarray - - 16268
franka_free 30000 gpu ndarray - - 16270
franka 30000 gpu ndarray - - 16272
franka_random 30000 gpu ndarray False - 16274
franka_random 30000 gpu ndarray True - 16884
franka_random 30000 gpu ndarray - CG 16278
franka_random 30000 gpu ndarray - Newton 16278
franka_random 0 cpu ndarray - - 1250
box_pyramid_3 4096 gpu ndarray - - 2104
box_pyramid_4 4096 gpu ndarray - - 3386
box_pyramid_5 4096 gpu ndarray - - 6620
box_pyramid_6 4096 gpu ndarray True - 10622
box_pyramid_6 4096 gpu ndarray False - 9984
g1_fall 4096 gpu ndarray - Newton 5026
dex_hand 4096 gpu ndarray - - 8644

SPEED:

env batch_size constraint_solver use_contact_island dtype backend gjk_collision runtime_fps compile_time realtime_factor
duck_in_box_easy 30000 - False ndarray cuda True 6.95194e+06 27.2 69519.4
duck_in_box_easy 30000 - False ndarray cuda False 6.9787e+06 42.6 69787
duck_in_box_hard 30000 - False ndarray cuda True 3.1793e+06 27.5 31793
duck_in_box_hard 30000 - False ndarray cuda False 6.94348e+06 43.8 69434.9
duck_in_box_hard 0 - False ndarray cpu - 4551 22.4 45.5
anymal_random 30000 - False ndarray cuda - 6.91582e+06 43.3 69158.1
anymal_uniform 30000 - False ndarray cuda - 6.93112e+06 43.4 69311.2
anymal_zero 30000 - False ndarray cuda - 6.94098e+06 43.6 69409.8
anymal_zero 0 - False ndarray cpu - 5634 22.4 56.3
anymal_uniform_kinematic 30000 - False ndarray cuda - 6.46777e+06 46.2 64677.7
anymal_uniform_kinematic 0 - False ndarray cpu - 1905 24.5 19.1
go2 4096 - False ndarray cuda True 889852 28.4 8898.5
go2 4096 CG False ndarray cuda False 424786 52.8 4247.9
go2 4096 Newton False ndarray cuda False 399131 44.5 3991.3
franka_accessors 0 - False ndarray cpu - 1662 22.9 16.6
franka_accessors 30000 - False ndarray cuda - 1.76582e+06 52.9 17658.2
franka_free 30000 - False ndarray cuda - 1.82898e+06 44.6 18289.8
franka 30000 - False ndarray cuda - 5.43482e+06 44.9 54348.2
franka_random 30000 - False ndarray cuda False 6.73956e+06 44.1 67395.6
franka_random 30000 - False ndarray cuda True 6.69475e+06 28.1 66947.5
franka_random 30000 CG False ndarray cuda - 6.98663e+06 43.8 69866.3
franka_random 30000 Newton False ndarray cuda - 6.72559e+06 44.1 67255.9
franka_random 0 - False ndarray cpu - 5043 22.8 50.4
box_pyramid_3 4096 - False ndarray cuda - 655001 44 6550
box_pyramid_4 4096 - False ndarray cuda - 243413 43.4 2434.1
box_pyramid_5 4096 - False ndarray cuda - 87378 43.7 873.8
box_pyramid_6 4096 - False ndarray cuda True 40378 27.8 403.8
box_pyramid_6 4096 - False ndarray cuda False 36980 43.8 369.8
g1_fall 4096 Newton False ndarray cuda - 99842 43.9 499.2
dex_hand 4096 - False ndarray cuda - 12757 61 797.3

Comment thread tests/conftest.py Outdated
Comment on lines +113 to +123
try:
    assert sys.platform.startswith("linux")
    subprocess.check_output(["nvidia-smi"], stderr=subprocess.STDOUT, timeout=10)
    return True, None
except Exception as exc:  # platform or nvidia-smi unavailable
    pass

try:
    subprocess.check_output(["rocm-smi"], stderr=subprocess.STDOUT, timeout=10)
    return True, None
except Exception as exc:
    pass
Collaborator


  • Avoid duplicated logic
  • Never catch generic Exception unless it is absolutely necessary:
utilities = ("nvidia-smi", "rocm-smi")
for exe in utilities:
    try:
        subprocess.check_output([exe], stderr=subprocess.STDOUT, timeout=10)
        return True, None
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        pass

return False, f"None of the executables {utilities} exited successfully (code 0)."

Comment thread tests/conftest.py Outdated
Comment on lines +237 to +309
if NVIDIA_GPU_INTERFACE_PATH.is_dir():
    return tuple(i for i, _ in enumerate(NVIDIA_GPU_INTERFACE_PATH.iterdir()))
if KFD_SYSFS_PATH.is_dir():
    return tuple(i for i, _ in enumerate(get_kfd_gpu_nodes_properties()))

warnings.warn(
    f"'{NVIDIA_GPU_INTERFACE_PATH!s}' or '{KFD_SYSFS_PATH!s}' is not available. Multi-GPU support will be disabled. This is expected "
    "on WSL2 where the NVIDIA proc interface is not mounted.",
    stacklevel=2,
)

return (0,)


def parse_kfd_node_properties(kfd_properties_str: str):
    props = {}
    for l in kfd_properties_str.split("\n"):
        l = l.strip()

        if not l:
            continue

        name, value_str = l.split()
        props[name] = int(value_str)

    return props


KFD_SYSFS_PATH = Path("/sys/devices/virtual/kfd/kfd/topology")


def get_kfd_gpu_nodes_properties():
    kfd_sysfs_path_nodes = Path(KFD_SYSFS_PATH) / "nodes"

    gpu_nodes_properties = {}

    for node_path in kfd_sysfs_path_nodes.iterdir():
        with (node_path / "properties").open() as node_properties_f:
            properties_str = node_properties_f.read()
        node_props = parse_kfd_node_properties(properties_str)

        if node_props["cpu_cores_count"] == 0:
            gpu_nodes_properties[int(node_path.name)] = node_props

    return gpu_nodes_properties


def amdgpu_get_node_rank_from_uuid(device_uuid):
    device_uuid = device_uuid.replace("-", "")
    hip_uuid = "".join([chr(int(device_uuid[i : i + 2], 16)) for i in range(0, len(device_uuid), 2)])
    unique_id = int(hip_uuid, 16)

    gpu_nodes_properties = get_kfd_gpu_nodes_properties()

    for node_rank, gpu_props in enumerate(gpu_nodes_properties.values()):
        if gpu_props["unique_id"] == unique_id:
            return node_rank

    return -1


NVIDIA_GPU_INTERFACE_PATH = Path("/proc/driver/nvidia/gpus/")


def nvidia_get_node_rank_from_uuid(device_uuid):
    for device_idx, device_path in enumerate(NVIDIA_GPU_INTERFACE_PATH.iterdir()):
        with (device_path / "information").open() as f:
            device_info = f.read()
        if re.search(rf"GPU UUID:\s+GPU-{device_uuid}", device_info):
            return device_idx

    return -1
Collaborator


This is starting to look very fragile. Is there any library out there that could be used to hide all this logic?

@duburcqa
Collaborator

What about this plan?

 Plan: Cross-vendor GPU info module for tests
 Context

 PR #2680 adds ROCm/AMD support to the test infrastructure, but the vendor-specific GPU queries (device count, VRAM, UUID mapping, per-process memory) are scattered across conftest.py and monitor_test_mem.py with if nvidia ... elif amd ...
 branching. A review comment flagged this as fragile. We need a clean OOP abstraction that isolates vendor-specific logic and makes it easy to maintain and extend.

 Only NVIDIA and AMD GPUs need multi-GPU support.

 Design

 Create tests/gpu_info.py with:

 GpuBackend (ABC)
   ├── NvidiaBackend   # /proc/driver/nvidia/gpus/ + nvidia-smi
   └── AmdBackend      # KFD sysfs + rocm-smi

 detect_gpu_backend() -> GpuBackend | None

 Abstract interface — 4 methods

 ┌──────────────────────────────────┬─────────────────┬────────────────────────────────────────┐
 │              Method              │     Returns     │                Used by                 │
 ├──────────────────────────────────┼─────────────────┼────────────────────────────────────────┤
 │ get_device_count()               │ int             │ conftest._get_gpu_indices              │
 ├──────────────────────────────────┼─────────────────┼────────────────────────────────────────┤
 │ get_device_vram_mib()            │ tuple[int, ...] │ conftest.pytest_xdist_auto_num_workers │
 ├──────────────────────────────────┼─────────────────┼────────────────────────────────────────┤
 │ get_device_index_from_uuid(uuid) │ int             │ conftest._torch_get_gpu_idx            │
 ├──────────────────────────────────┼─────────────────┼────────────────────────────────────────┤
 │ get_per_process_vram_mib()       │ dict[int, int]  │ monitor_test_mem.get_cuda_usage        │
 └──────────────────────────────────┴─────────────────┴────────────────────────────────────────┘

 NvidiaBackend

 - get_device_count: enumerate /proc/driver/nvidia/gpus/
 - get_device_vram_mib: nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
 - get_device_index_from_uuid: read /proc/driver/nvidia/gpus/*/information, match UUID
 - get_per_process_vram_mib: parse nvidia-smi output (existing logic from monitor_test_mem.py)

 AmdBackend

 - get_device_count: enumerate KFD sysfs GPU nodes (filter out CPU nodes)
 - get_device_vram_mib: rocm-smi --showmeminfo vram -d 0-255
 - get_device_index_from_uuid: match against KFD unique_id property
 - get_per_process_vram_mib: parse rocm-smi --showpids output

 detect_gpu_backend()

 Returns NvidiaBackend if /proc/driver/nvidia/gpus/ exists, AmdBackend if KFD sysfs exists, else None.

 Files to modify

 1. Create tests/gpu_info.py — the new module
 2. tests/conftest.py — replace inline GPU queries with gpu_info calls:
   - is_mem_monitoring_supported() → detect_gpu_backend() is not None
   - _get_gpu_indices() → backend.get_device_count()
   - _torch_get_gpu_idx() → backend.get_device_index_from_uuid()
   - pytest_xdist_auto_num_workers() → backend.get_device_vram_mib()
   - Remove NVIDIA_GPU_INTERFACE_PATH, KFD_SYSFS_PATH, and all KFD helper functions
 3. tests/monitor_test_mem.py — replace get_cuda_usage() with backend.get_per_process_vram_mib()
   - Remove parse_nvidia_smi_output_for_mem_per_process and parse_rocm_smi_output_for_mem_per_process
 4. tests/test_utils.py — move the rocm-smi parsing test to test the new module instead

 What stays untouched

 - _get_egl_index() — NVIDIA-specific EGL logic, stays in conftest (it will call backend.get_device_count() instead of _get_gpu_indices())
 - Intel XPU fallback in pytest_xdist_auto_num_workers — stays as-is (only triggers when no NVIDIA/AMD backend is detected)
 - genesis/utils/misc.py — uses PyTorch APIs, not raw GPU queries

 Verification

 - python -m pytest tests/test_utils.py -k rocm_smi — existing rocm-smi parsing test should pass (moved to test gpu_info)
 - python -m pytest tests/test_utils.py -k gpu_info — if we add unit tests for the new module
 - On NVIDIA machine: python -c "from tests.gpu_info import detect_gpu_backend; b = detect_gpu_backend(); print(b, b.get_device_count())"
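The abstract interface from the plan can be sketched roughly as follows (a sketch under the plan's assumptions: the class and method names come from the table above, the non-implemented methods are stubbed, and the real backends would wrap nvidia-smi/rocm-smi parsing):

```python
from abc import ABC, abstractmethod
from pathlib import Path


class GpuBackend(ABC):
    """Vendor-specific GPU queries behind a common interface."""

    @abstractmethod
    def get_device_count(self) -> int: ...

    @abstractmethod
    def get_device_vram_mib(self) -> tuple: ...

    @abstractmethod
    def get_device_index_from_uuid(self, uuid: str) -> int: ...

    @abstractmethod
    def get_per_process_vram_mib(self) -> dict: ...


NVIDIA_GPU_INTERFACE_PATH = Path("/proc/driver/nvidia/gpus/")
KFD_SYSFS_PATH = Path("/sys/devices/virtual/kfd/kfd/topology")


class NvidiaBackend(GpuBackend):
    def get_device_count(self):
        # Enumerate the NVIDIA proc interface, one directory per GPU.
        return sum(1 for _ in NVIDIA_GPU_INTERFACE_PATH.iterdir())

    # Remaining methods would wrap nvidia-smi / proc parsing (stubbed here).
    def get_device_vram_mib(self): raise NotImplementedError
    def get_device_index_from_uuid(self, uuid): raise NotImplementedError
    def get_per_process_vram_mib(self): raise NotImplementedError


class AmdBackend(GpuBackend):
    def get_device_count(self):
        # Enumerate KFD topology nodes (a real backend would filter CPU nodes).
        return sum(1 for _ in (KFD_SYSFS_PATH / "nodes").iterdir())

    # Remaining methods would wrap rocm-smi / KFD sysfs parsing (stubbed here).
    def get_device_vram_mib(self): raise NotImplementedError
    def get_device_index_from_uuid(self, uuid): raise NotImplementedError
    def get_per_process_vram_mib(self): raise NotImplementedError


def detect_gpu_backend():
    if NVIDIA_GPU_INTERFACE_PATH.is_dir():
        return NvidiaBackend()
    if KFD_SYSFS_PATH.is_dir():
        return AmdBackend()
    return None
```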

@duburcqa duburcqa changed the title [FIX] fix some ROCm incompatibilities [MISC] Add full support of AMD GPU to unit test and benchmark infra. Apr 10, 2026
@v01dXYZ v01dXYZ marked this pull request as draft April 11, 2026 10:05
@hughperkins
Collaborator

Note: not required for merge, but for the benchmarks I'd be interested in seeing a side-by-side comparison of the CUDA and AMD GPUs for runtime FPS, i.e. one column for the CUDA card, one column for the AMD card, and one column showing the ratio of the two (and/or provide a link to a downloadable CSV file for each, so I can easily compare myself).
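Such a ratio column could be computed from two CSV exports along these lines (a sketch: the column names are assumed, and the sample values are copied from the `franka`/`go2` rows of the tables in this thread):

```python
import csv
import io

# Hypothetical excerpts of the two benchmark CSV exports.
cuda_csv = "env,runtime_fps\nfranka,5434820\ngo2,889852\n"
amd_csv = "env,runtime_fps\nfranka,3117229\ngo2,506838\n"

def load(text):
    """Map env name -> runtime FPS from a CSV string."""
    return {row["env"]: float(row["runtime_fps"])
            for row in csv.DictReader(io.StringIO(text))}

cuda, amd = load(cuda_csv), load(amd_csv)
# Side-by-side rows with a cuda/amd ratio column, for envs present in both.
for env in sorted(cuda.keys() & amd.keys()):
    ratio = cuda[env] / amd[env]
    print(f"{env:10s} cuda={cuda[env]:>10.0f} amd={amd[env]:>10.0f} ratio={ratio:.2f}")
```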

@v01dXYZ
Author

v01dXYZ commented Apr 12, 2026

My commits are not clean; I was trying to make this work on packet.ai. If you try to test with this provider, it seems only west2 instances currently have nvidia-smi with per-process metrics. The reason there are so many problems is that these are not real VMs but k8s pods.

@hughperkins To answer your request, I was able to run the speed/mem benchmarks on both a Hot Aisle MI300X and a Packet.ai RTX9000PRO. Those are the tables above; I'll reformat them.

@hughperkins
Collaborator

> My commits are not clean, I was trying to make it work on packet.ai. If you try to test it with this provider, it seems right now only west2 instances are able to have nvidia-smi with per-process metrics. The reason there are so many problems is because it's not real VMs but a k8s pod.
>
> @hughperkins To answer your request, I was able to run the speed/mem benchmark on both Hot Aisle MI300X and Packet.ai RTX9000PRO. That are the tables above. I'll reformat them.

Note: I've experienced issues with nvidia-smi previously.

You might have success migrating to this branch: https://github.com/Genesis-Embodied-AI/Genesis/compare/main...hughperkins:Genesis:hp/mem-monitor-faster?expand=1

@v01dXYZ

This comment was marked as outdated.

@hughperkins
Collaborator

Yes, using nvidia-ml-py could also be easier and more robust (no parsing of CLI-tool output!).

Here is the comparison table:

| benchmark config | mi300x runtime_fps | rtx6000pro runtime_fps | mi300x compile_time | rtx6000pro compile_time | mi300x realtime_factor | rtx6000pro realtime_factor |
|---|---|---|---|---|---|---|
| ('duck_in_box_easy', '30000', '-', 'False', 'ndarray', 'True') | 3041448.0 | 6.95194e+06 | 48.5 | 27.2 | 30414.5 | 69519.4 |
| ('duck_in_box_easy', '30000', '-', 'False', 'ndarray', 'False') | 4409587.0 | 6.9787e+06 | 75.5 | 42.6 | 44095.9 | 69787 |
| ('duck_in_box_hard', '30000', '-', 'False', 'ndarray', 'True') | 653952.0 | 3.1793e+06 | 49.1 | 27.5 | 6539.5 | 31793 |
| ('duck_in_box_hard', '30000', '-', 'False', 'ndarray', 'False') | 1942384.0 | 6.94348e+06 | 76.5 | 43.8 | 19423.8 | 69434.9 |
| ('anymal_random', '30000', '-', 'False', 'ndarray', '-') | 1334529.0 | 6.91582e+06 | 77.6 | 43.3 | 13345.3 | 69158.1 |
| ('anymal_uniform', '30000', '-', 'False', 'ndarray', '-') | 1703289.0 | 6.93112e+06 | 77.6 | 43.4 | 17032.9 | 69311.2 |
| ('anymal_zero', '30000', '-', 'False', 'ndarray', '-') | 2006120.0 | 6.94098e+06 | 78.0 | 43.6 | 20061.2 | 69409.8 |
| ('anymal_uniform_kinematic', '30000', '-', 'False', 'ndarray', '-') | 1466308.0 | 6.46777e+06 | 83.5 | 46.2 | 14663.1 | 64677.7 |
| ('go2', '4096', '-', 'False', 'ndarray', 'True') | 506838.0 | 889852 | 52.0 | 28.4 | 5068.4 | 8898.5 |
| ('go2', '4096', 'CG', 'False', 'ndarray', 'False') | 540631.0 | 424786 | 78.7 | 52.8 | 5406.3 | 4247.9 |
| ('go2', '4096', 'Newton', 'False', 'ndarray', 'False') | 686389.0 | 399131 | 79.1 | 44.5 | 6863.9 | 3991.3 |
| ('franka_accessors', '30000', '-', 'False', 'ndarray', '-') | 3370704.0 | 1.76582e+06 | 79.1 | 52.9 | 33707.0 | 17658.2 |
| ('franka_free', '30000', '-', 'False', 'ndarray', '-') | 4551981.0 | 1.82898e+06 | 79.9 | 44.6 | 45519.8 | 18289.8 |
| ('franka', '30000', '-', 'False', 'ndarray', '-') | 3117229.0 | 5.43482e+06 | 80.1 | 44.9 | 31172.3 | 54348.2 |
| ('franka_random', '30000', '-', 'False', 'ndarray', 'False') | 2548096.0 | 6.73956e+06 | 80.2 | 44.1 | 25481.0 | 67395.6 |
| ('franka_random', '30000', '-', 'False', 'ndarray', 'True') | 1984557.0 | 6.69475e+06 | 53.0 | 28.1 | 19845.6 | 66947.5 |
| ('franka_random', '30000', 'CG', 'False', 'ndarray', '-') | 2241193.0 | 6.98663e+06 | 79.4 | 43.8 | 22411.9 | 69866.3 |
| ('franka_random', '30000', 'Newton', 'False', 'ndarray', '-') | 2557212.0 | 6.72559e+06 | 79.5 | 44.1 | 25572.1 | 67255.9 |
| ('box_pyramid_3', '4096', '-', 'False', 'ndarray', '-') | 186131.0 | 655001 | 79.3 | 44 | 1861.3 | 6550 |
| ('box_pyramid_4', '4096', '-', 'False', 'ndarray', '-') | 74858.0 | 243413 | 78.0 | 43.4 | 748.6 | 2434.1 |
| ('box_pyramid_5', '4096', '-', 'False', 'ndarray', '-') | 25651.0 | 87378 | 79.7 | 43.7 | 256.5 | 873.8 |
| ('box_pyramid_6', '4096', '-', 'False', 'ndarray', 'True') | - | 40378 | - | 27.8 | - | 403.8 |
| ('box_pyramid_6', '4096', '-', 'False', 'ndarray', 'False') | - | 36980 | - | 43.8 | - | 369.8 |
| ('g1_fall', '4096', 'Newton', 'False', 'ndarray', '-') | - | 99842 | - | 43.9 | - | 499.2 |
| ('dex_hand', '4096', '-', 'False', 'ndarray', '-') | - | 12757 | - | 61 | - | 797.3 |
Script to generate it

  • Formatting seems inconsistent
    • one runtime_fps column uses non-scientific notation and the other uses scientific, which makes them hard to compare, I feel
  • would be very useful to have a ratio column for the runtime fps, I feel
    • in our other benchmarking tables, the ratio is evaluated as (v2/v1 - 1) * 100
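
The ratio convention mentioned above, (v2/v1 - 1) * 100, expresses v2 as a percentage change relative to v1. A tiny sketch (which card plays v1 vs v2 is a choice; here v1 is assumed to be the rtx6000pro):

```python
def pct_change(v1, v2):
    """Percentage change of v2 relative to v1; negative means v2 is smaller."""
    return (v2 / v1 - 1) * 100

# e.g. comparing mi300x (v2) against rtx6000pro (v1) runtime fps for go2,
# using the values from the table above:
print(pct_change(889852, 506838))  # about -43, i.e. mi300x ~43% slower
```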

@v01dXYZ
Author

v01dXYZ commented Apr 12, 2026

@hughperkins here is the speed table as it should be.

To reproduce:

SPEED

| benchmark config | mi300x runtime_fps | rtx6000pro runtime_fps | runtime_fps mi300x/rtx6000pro | mi300x compile_time | rtx6000pro compile_time | compile_time mi300x/rtx6000pro | mi300x realtime_factor | rtx6000pro realtime_factor | realtime_factor mi300x/rtx6000pro |
|---|---|---|---|---|---|---|---|---|---|
| ('duck_in_box_easy', 30000, '-', 'False', 'ndarray', 'True') | 3041448 | 6951939 | 0.437 | 48 | 27 | 1.78 | 30414 | 69519 | 0.437 |
| ('duck_in_box_easy', 30000, '-', 'False', 'ndarray', 'False') | 4409587 | 6978696 | 0.632 | 75 | 42 | 1.79 | 44095 | 69787 | 0.632 |
| ('duck_in_box_hard', 30000, '-', 'False', 'ndarray', 'True') | 653952 | 3179303 | 0.206 | 49 | 27 | 1.81 | 6539 | 31793 | 0.206 |
| ('duck_in_box_hard', 30000, '-', 'False', 'ndarray', 'False') | 1942384 | 6943485 | 0.28 | 76 | 43 | 1.77 | 19423 | 69434 | 0.28 |
| ('anymal_random', 30000, '-', 'False', 'ndarray', '-') | 1334529 | 6915815 | 0.193 | 77 | 43 | 1.79 | 13345 | 69158 | 0.193 |
| ('anymal_uniform', 30000, '-', 'False', 'ndarray', '-') | 1703289 | 6931124 | 0.246 | 77 | 43 | 1.79 | 17032 | 69311 | 0.246 |
| ('anymal_zero', 30000, '-', 'False', 'ndarray', '-') | 2006120 | 6940983 | 0.289 | 78 | 43 | 1.81 | 20061 | 69409 | 0.289 |
| ('anymal_uniform_kinematic', 30000, '-', 'False', 'ndarray', '-') | 1466308 | 6467770 | 0.227 | 83 | 46 | 1.8 | 14663 | 64677 | 0.227 |
| ('go2', 4096, '-', 'False', 'ndarray', 'True') | 506838 | 889852 | 0.57 | 52 | 28 | 1.86 | 5068 | 8898 | 0.57 |
| ('go2', 4096, 'CG', 'False', 'ndarray', 'False') | 540631 | 424786 | 1.27 | 78 | 52 | 1.5 | 5406 | 4247 | 1.27 |
| ('go2', 4096, 'Newton', 'False', 'ndarray', 'False') | 686389 | 399131 | 1.72 | 79 | 44 | 1.8 | 6863 | 3991 | 1.72 |
| ('franka_accessors', 30000, '-', 'False', 'ndarray', '-') | 3370704 | 1765821 | 1.91 | 79 | 52 | 1.52 | 33707 | 17658 | 1.91 |
| ('franka_free', 30000, '-', 'False', 'ndarray', '-') | 4551981 | 1828975 | 2.49 | 79 | 44 | 1.8 | 45519 | 18289 | 2.49 |
| ('franka', 30000, '-', 'False', 'ndarray', '-') | 3117229 | 5434819 | 0.574 | 80 | 44 | 1.82 | 31172 | 54348 | 0.574 |
| ('franka_random', 30000, '-', 'False', 'ndarray', 'False') | 2548096 | 6739563 | 0.378 | 80 | 44 | 1.82 | 25481 | 67395 | 0.378 |
| ('franka_random', 30000, '-', 'False', 'ndarray', 'True') | 1984557 | 6694750 | 0.296 | 53 | 28 | 1.89 | 19845 | 66947 | 0.296 |
| ('franka_random', 30000, 'CG', 'False', 'ndarray', '-') | 2241193 | 6986628 | 0.321 | 79 | 43 | 1.84 | 22411 | 69866 | 0.321 |
| ('franka_random', 30000, 'Newton', 'False', 'ndarray', '-') | 2557212 | 6725591 | 0.38 | 79 | 44 | 1.8 | 25572 | 67255 | 0.38 |
| ('box_pyramid_3', 4096, '-', 'False', 'ndarray', '-') | 186131 | 655001 | 0.284 | 79 | 44 | 1.8 | 1861 | 6550 | 0.284 |
| ('box_pyramid_4', 4096, '-', 'False', 'ndarray', '-') | 74858 | 243413 | 0.308 | 78 | 43 | 1.81 | 748 | 2434 | 0.307 |
| ('box_pyramid_5', 4096, '-', 'False', 'ndarray', '-') | 25651 | 87378 | 0.294 | 79 | 43 | 1.84 | 256 | 873 | 0.293 |
| ('box_pyramid_6', 4096, '-', 'False', 'ndarray', 'True') | - | 40378 | nan | - | 27 | nan | - | 403 | nan |
| ('box_pyramid_6', 4096, '-', 'False', 'ndarray', 'False') | - | 36980 | nan | - | 43 | nan | - | 369 | nan |
| ('g1_fall', 4096, 'Newton', 'False', 'ndarray', '-') | - | 99842 | nan | - | 43 | nan | - | 499 | nan |
| ('dex_hand', 4096, '-', 'False', 'ndarray', '-') | - | 12757 | nan | - | 61 | nan | - | 797 | nan |

@hughperkins
Collaborator

Interesting:

  • slower on many things (but not by much; all within a single order of magnitude)
  • faster on a few things, by a similar factor

Any thoughts on what the things that run faster on the AMD GPU have in common? (I glanced at them, but it seems not obvious to me by simple inspection.) Might be worth running profiling on them, to compare the kernel times between CUDA and AMD. Notes:

  • for CUDA we use pytorch profiling. I'm not sure if something similar exists for AMD?
  • recommend disabling CUDA graph (QD_GRAPH=0) when running profiling, otherwise you just get a giant "white space" in the profile where the graph was running, and no detailed kernel timings.

@hughperkins
Collaborator

hughperkins commented Apr 12, 2026

(It would also be nice to have a table summarizing the key statistics of each graphics card. Things like:

  • how much global memory?
  • size of L1 cache
  • size of L2 cache
  • how many GPU cores
  • etc ...

Edit: To be clear, none of the benchmarks or GPU comparisons is part of this PR, or required for merge. I'm also happy to move the perf discussion elsewhere perhaps 🤔
)

@v01dXYZ
Author

v01dXYZ commented Apr 12, 2026

@hughperkins We could create an issue to discuss the MI300X's disappointing performance wrt its specs.

@hughperkins
Collaborator

> @hughperkins We could create an issue to discuss MI300X disappointing performance wrt its specs.

Works for me. (Or a Discussion perhaps? Since an issue tends to be for something fairly concrete and well-defined, with a clear 'definition of done', I feel.)
