[MISC] Add full support of AMD GPU to unit test and benchmark infra. #2680

v01dXYZ wants to merge 24 commits into Genesis-Embodied-AI:main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87d60d8a4d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 453d137c0a
```python
except FileNotFoundError:
    pass
```
Catch nvidia-smi command failures before ROCm fallback
is_mem_monitoring_supported() now allows monitoring when nvidia-smi fails but rocm-smi works, yet get_cuda_usage() only falls back when nvidia-smi is missing. On ROCm hosts where nvidia-smi is present but exits non-zero (raising subprocess.CalledProcessError), the monitor process crashes before trying rocm-smi, so --mem-monitoring-filepath is reported as supported but fails at runtime.
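A minimal sketch of the fallback pattern this review describes, assuming a hypothetical helper name (`query_first_working`) rather than the PR's actual `get_cuda_usage()`: each candidate command is tried in order, and a present-but-failing `nvidia-smi` (raising `CalledProcessError`) falls through to `rocm-smi` instead of crashing the monitor process.

```python
import subprocess

def query_first_working(commands, timeout=10):
    # Hypothetical helper: run candidate commands in order and return the
    # first successful output. Catching CalledProcessError in addition to
    # FileNotFoundError means a present-but-broken nvidia-smi binary does
    # not abort the fallback to rocm-smi.
    for cmd in commands:
        try:
            return subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=timeout)
        except (FileNotFoundError, subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
            continue
    raise RuntimeError(f"none of {commands} succeeded")
```

`get_cuda_usage()` could then delegate to a single call such as `query_first_working([["nvidia-smi", ...], ["rocm-smi", ...]])` (flags omitted here; they are the PR's, not this sketch's).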
I assume that on a ROCm machine there is no NVIDIA tooling; otherwise it would be hard to distinguish between the two.
In order to avoid having allocation failures, I put
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 453d137c0a
```python
hip_uuid = "".join([chr(int(device_uuid[i : i + 2], 16)) for i in range(0, len(device_uuid), 2)])
unique_id = int(hip_uuid, 16)
```
Decode ROCm UUID bytes correctly before integer conversion
amdgpu_get_node_rank_from_uuid() converts each UUID byte to a character and then calls int(..., 16), but that intermediate string will contain non-hex characters for almost all valid UUIDs (e.g., byte 0x7d becomes }), so int raises ValueError before any node lookup happens. On ROCm systems this propagates from _torch_get_gpu_idx() and can fail GPU test setup as soon as device-index validation runs, so memory/device routing checks break even when KFD topology is present.
No, because the HIP UUID is actually already the 16-character hex representation of an int64 integer. PyTorch converts this 16-character hex representation into a 32-character one (so it is a hex representation of a hex representation).
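A small sketch of that double decoding, under a hypothetical helper name; only `bytes.fromhex` and `int(..., 16)` are doing the work:

```python
def amdgpu_unique_id_from_torch_uuid(device_uuid: str) -> int:
    # PyTorch reports the HIP UUID as the hex encoding of a string that is
    # itself the 16-character hex representation of the KFD unique_id
    # (an int64) -- i.e. a hex representation of a hex representation.
    device_uuid = device_uuid.replace("-", "")
    hip_uuid = bytes.fromhex(device_uuid).decode("ascii")  # outer layer
    return int(hip_uuid, 16)                               # inner layer

# Round-trip with a made-up unique_id, encoded the way PyTorch would report it.
unique_id = 0x1234ABCD5678EF90
torch_style_uuid = format(unique_id, "016x").encode("ascii").hex()
assert amdgpu_unique_id_from_torch_uuid(torch_style_uuid) == unique_id
```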
Below are the results of the mem/speed benchmarks on MI300X and RTX 6000.

MI300X — MEM: … SPEED: …

RTX 6000 — MEM: … SPEED: …
```python
try:
    assert sys.platform.startswith("linux")
    subprocess.check_output(["nvidia-smi"], stderr=subprocess.STDOUT, timeout=10)
    return True, None
except Exception as exc:  # platform or nvidia-smi unavailable
    pass

try:
    subprocess.check_output(["rocm-smi"], stderr=subprocess.STDOUT, timeout=10)
    return True, None
except Exception as exc:
    pass
```
- Avoid duplicated logic
- Never catch a generic `Exception` unless it is absolutely necessary:
```python
utilities = ("nvidia-smi", "rocm-smi")
for exe in utilities:
    try:
        # check_output already raises CalledProcessError on a non-zero
        # exit, so no `check=True` argument is needed (or accepted) here.
        subprocess.check_output([exe], stderr=subprocess.STDOUT, timeout=10)
        return True, None
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        pass
return False, f"None of the executables {utilities} exited successfully (code 0)."
```

```python
    if NVIDIA_GPU_INTERFACE_PATH.is_dir():
        return tuple(i for i, _ in enumerate(NVIDIA_GPU_INTERFACE_PATH.iterdir()))
    if KFD_SYSFS_PATH.is_dir():
        return tuple(i for i, _ in enumerate(get_kfd_gpu_nodes_properties()))

    warnings.warn(
        f"'{NVIDIA_GPU_INTERFACE_PATH!s}' or '{KFD_SYSFS_PATH!s}' is not available. Multi-GPU support will be disabled. "
        "This is expected on WSL2 where the NVIDIA proc interface is not mounted.",
        stacklevel=2,
    )

    return (0,)


def parse_kfd_node_properties(kfd_properties_str: str):
    props = {}
    for line in kfd_properties_str.split("\n"):
        line = line.strip()
        if not line:
            continue
        name, value_str = line.split()
        props[name] = int(value_str)
    return props


KFD_SYSFS_PATH = Path("/sys/devices/virtual/kfd/kfd/topology")


def get_kfd_gpu_nodes_properties():
    kfd_sysfs_path_nodes = KFD_SYSFS_PATH / "nodes"
    gpu_nodes_properties = {}

    for node_path in kfd_sysfs_path_nodes.iterdir():
        with (node_path / "properties").open() as node_properties_f:
            properties_str = node_properties_f.read()
            node_props = parse_kfd_node_properties(properties_str)

        if node_props["cpu_cores_count"] == 0:
            gpu_nodes_properties[int(node_path.name)] = node_props

    return gpu_nodes_properties


def amdgpu_get_node_rank_from_uuid(device_uuid):
    device_uuid = device_uuid.replace("-", "")
    hip_uuid = "".join(chr(int(device_uuid[i : i + 2], 16)) for i in range(0, len(device_uuid), 2))
    unique_id = int(hip_uuid, 16)

    gpu_nodes_properties = get_kfd_gpu_nodes_properties()
    for node_rank, gpu_props in enumerate(gpu_nodes_properties.values()):
        if gpu_props["unique_id"] == unique_id:
            return node_rank

    return -1


NVIDIA_GPU_INTERFACE_PATH = Path("/proc/driver/nvidia/gpus/")


def nvidia_get_node_rank_from_uuid(device_uuid):
    for device_idx, device_path in enumerate(NVIDIA_GPU_INTERFACE_PATH.iterdir()):
        with (device_path / "information").open() as f:
            device_info = f.read()
            if re.search(rf"GPU UUID:\s+GPU-{device_uuid}", device_info):
                return device_idx

    return -1
```
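For reference, the KFD properties parsing can be exercised with a synthetic blob (values made up for illustration; the real files live under `/sys/devices/virtual/kfd/kfd/topology/nodes/<n>/properties`):

```python
def parse_kfd_node_properties(kfd_properties_str: str) -> dict:
    # Same logic as the helper in the diff: each non-empty line is "<name> <int>".
    props = {}
    for line in kfd_properties_str.splitlines():
        line = line.strip()
        if line:
            name, value_str = line.split()
            props[name] = int(value_str)
    return props

sample = """
cpu_cores_count 0
simd_count 1216
unique_id 42
"""
props = parse_kfd_node_properties(sample)
assert props == {"cpu_cores_count": 0, "simd_count": 1216, "unique_id": 42}
# cpu_cores_count == 0 is what marks a node as a GPU node in the PR's code.
```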
This is starting to look very fragile. Is there a library out there that could be used to hide all this logic?
What about this plan?
Note: not required for merge, but for the benchmarks I'd be interested in seeing a side-by-side comparison of the CUDA and AMD GPUs for runtime FPS, i.e. one column for the CUDA card, one column for the AMD card, and one column showing the ratio of the two (and/or a link to a downloadable CSV file for each, so I can easily compare myself).
My commits are not clean; I was trying to make it work on …

@hughperkins To answer your request, I was able to run the speed/mem benchmarks on both Hot Aisle MI300X and Packet.ai RTX9000PRO. Those are the tables above. I'll reformat them.
Note: I've experienced issues with nvidia-smi previously. You might have success migrating to this branch: https://github.com/Genesis-Embodied-AI/Genesis/compare/main...hughperkins:Genesis:hp/mem-monitor-faster?expand=1
@hughperkins here is the table for speed, as it should be. To reproduce:

SPEED: …
Interesting: any thoughts on what the things that run faster on the AMD GPU have in common? (I glanced at them, but it's not obvious to me by simple inspection. It might be worth running profiling on them to compare the kernel times between CUDA and AMD.) Notes:
(It would also be nice to have a table summarizing the key statistics of each graphics card. Things like: …)

Edit: To be clear, none of the benchmarks or GPU comparisons are part of this PR, or required for merge. I'm also happy to move the perf discussion elsewhere perhaps 🤔
@hughperkins We could create an issue to discuss the MI300X's disappointing performance relative to its specs.
Works for me. (Or a Discussion perhaps? An issue tends to be for something fairly concrete and well-defined, with a clear 'definition of done', I feel.)
Description
- Use `rocm-smi` if `nvidia-smi` is not found

Related Issue
This is related to #2679. I had a hard time finding the specs of the GitHub runners, so I don't know which AMD GPUs are used to test Genesis.
Resolves Genesis-Embodied-AI/Genesis#
Motivation and Context
To have benchmark/mem-reporting work on ROCm.
How Has This Been / Can This Be Tested?
Yes, manually though.
Screenshots (if appropriate):
Checklist:
- I have read the Submitting Code Changes section of the CONTRIBUTING document.