[FEATURE] Add nvidia-smi fallback for GPU detection in cloud environments#2694
[FEATURE] Add nvidia-smi fallback for GPU detection in cloud environments#2694vlordier wants to merge 9 commits intoGenesis-Embodied-AI:mainfrom
Conversation
- Add history_length parameter to ContactForce sensor options - Override read() to return historical force readings from ring buffer - Update return format to include history dimension - Extend ring buffer size to accommodate history
There was a problem hiding this comment.
Pull request overview
This PR primarily targets improved GPU detection in Linux cloud/container environments by falling back to nvidia-smi when the NVIDIA /proc interface is unavailable, but it also introduces additional sensor and terrain-related API changes.
Changes:
- Add
nvidia-smifallbacks for multi-GPU detection and UUID-based GPU indexing in test setup. - Add
history_lengthtoContactForcesensor options and adjust sensor buffering logic. - Add terrain sampling helpers (
get_height_at,get_normal_at) toRigidEntity.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/conftest.py | Adds nvidia-smi fallback paths for GPU enumeration and UUID matching when /proc/driver/nvidia/gpus/ is missing. |
| genesis/options/sensors/options.py | Introduces history_length option for ContactForce sensor and documents intended history behavior. |
| genesis/engine/sensors/sensor_manager.py | Updates buffer length calculation to account for sensors that declare history_length. |
| genesis/engine/sensors/contact_force.py | Implements history_length behavior in ContactForceSensor.read() and adjusts debug drawing data source. |
| genesis/engine/entities/rigid_entity/rigid_entity.py | Adds public terrain query helpers for height and surface normal interpolation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return self.read(envs_idx) | ||
|
|
There was a problem hiding this comment.
read_ground_truth() now delegates to read(), which returns measured (noisy/quantized/clipped/delayed) values when history_length == 1. This breaks the contract of read_ground_truth() and will fail existing tests that expect ground-truth force (e.g., tests/test_sensors.py asserts ground truth is 0.0 before contact). Implement a separate ground-truth path that reads from SensorManager.get_cloned_from_cache(..., is_ground_truth=True) (and, if supporting history, a ground-truth history buffer).
| return self.read(envs_idx) | |
| envs_idx = self._sanitize_envs_idx(envs_idx) | |
| history_length = self._options.history_length | |
| if history_length == 1: | |
| return self._get_formatted_data( | |
| self._manager.get_cloned_from_cache(self, is_ground_truth=True), envs_idx | |
| ) | |
| buffered_data = self._manager._buffered_data[gs.tc_float] | |
| cache_slice = slice(self._cache_idx, self._cache_idx + 3) | |
| n_envs = self._manager._sim.n_envs | |
| history_data = [] | |
| for i in range(history_length): | |
| hist = buffered_data.at(i, envs_idx, cache_slice) | |
| if n_envs == 0: | |
| hist = hist.reshape(3) | |
| else: | |
| hist = hist.reshape(n_envs, 3) | |
| history_data.append(hist) | |
| result = torch.stack(history_data, dim=1) | |
| return result.squeeze(1) if n_envs == 0 else result |
| if n_envs == 0: | ||
| hist = hist.reshape(3) | ||
| else: | ||
| hist = hist.reshape(n_envs, 3) |
There was a problem hiding this comment.
For history_length > 1, read() pulls from self._manager._buffered_data[gs.tc_float], but this ring buffer is populated with shared_ground_truth_cache in _update_shared_cache() before delay/noise/clipping/quantization are applied. That means the returned history bypasses the documented measured-sensor processing (and can disagree with read() for history_length == 1). Consider storing the processed cache into a dedicated history ring buffer, or moving the ring-buffer set() to after processing so the buffer contains measured values.
| hist = hist.reshape(n_envs, 3) | |
| hist = hist.reshape(n_envs, 3) | |
| hist = self._get_formatted_data(hist, envs_idx) |
| history_data = [] | ||
| for i in range(history_length): | ||
| hist = buffered_data.at(i, envs_idx, cache_slice) | ||
| if n_envs == 0: | ||
| hist = hist.reshape(3) | ||
| else: | ||
| hist = hist.reshape(n_envs, 3) |
There was a problem hiding this comment.
The history path reshapes each hist using n_envs = self._manager._sim.n_envs, but buffered_data.at(i, envs_idx, ...) returns data for len(envs_idx) environments (which may be a subset). Using n_envs will produce incorrect shapes or a reshape error when envs_idx is not all environments. Reshape based on len(envs_idx) (and keep the non-batched n_envs == 0 behavior consistent with Sensor._get_formatted_data).
| history_data = [] | |
| for i in range(history_length): | |
| hist = buffered_data.at(i, envs_idx, cache_slice) | |
| if n_envs == 0: | |
| hist = hist.reshape(3) | |
| else: | |
| hist = hist.reshape(n_envs, 3) | |
| selected_n_envs = len(envs_idx) if n_envs != 0 else 0 | |
| history_data = [] | |
| for i in range(history_length): | |
| hist = buffered_data.at(i, envs_idx, cache_slice) | |
| if n_envs == 0: | |
| hist = hist.reshape(3) | |
| else: | |
| hist = hist.reshape(selected_n_envs, 3) |
| result = torch.stack(history_data, dim=1) | ||
| return result.squeeze(1) if n_envs == 0 else result |
There was a problem hiding this comment.
In the non-batched case (sim.n_envs == 0) and history_length > 1, torch.stack(history_data, dim=1) stacks 1D tensors of shape (3,) along dim=1, producing shape (3, history_length) (transposed vs the docstring expectation). If you want (history_length, 3) for the single-env case, stack along dim=0 (or ensure tensors are shaped (1,3) before stacking).
| result = torch.stack(history_data, dim=1) | |
| return result.squeeze(1) if n_envs == 0 else result | |
| stack_dim = 0 if n_envs == 0 else 1 | |
| result = torch.stack(history_data, dim=stack_dim) | |
| return result |
| quat = self._link.get_quat(env_idx).reshape((4,)) | ||
|
|
||
| force = self.read(env_idx).reshape((3,)) | ||
| force = self._manager.get_cloned_from_cache(self, is_ground_truth=False)[0, :3].reshape((3,)) |
There was a problem hiding this comment.
_draw_debug() now always reads get_cloned_from_cache(...)[0, :3], which hard-codes environment 0 and ignores env_idx = context.rendered_envs_idx[0]. If rendered_envs_idx[0] != 0 (or if you later change debug rendering to other envs), the debug arrow will visualize the wrong environment’s force. Index the cloned cache with the correct env index (or reuse self.read(env_idx) while ensuring history-length handling doesn’t break debug).
| force = self._manager.get_cloned_from_cache(self, is_ground_truth=False)[0, :3].reshape((3,)) | |
| cache = self._manager.get_cloned_from_cache(self, is_ground_truth=False) | |
| force = cache[env_idx, :3].reshape((3,)) if env_idx is not None else cache[:3].reshape((3,)) |
| The maximum output absolute force per each axis. Values above this will be clipped. Default is infinity. | ||
| history_length : int, optional | ||
| The number of historical force readings to store and return. Default is 1 (current value only). | ||
| When > 1, the sensor returns a history buffer of shape (history_length, 3) per environment. |
There was a problem hiding this comment.
The docstring says the history buffer is of shape (history_length, 3) per environment, but the sensor system generally returns batched tensors shaped (n_envs, ...) when n_envs > 0. Please clarify the expected return shape for history_length > 1 (e.g., (n_envs, history_length, 3) for batched, (history_length, 3) for single-env) and ensure the implementation matches it.
| When > 1, the sensor returns a history buffer of shape (history_length, 3) per environment. | |
| When > 1, the sensor returns a history buffer with shape (history_length, 3) for a single environment, | |
| or (n_envs, history_length, 3) when returned in batched form across multiple environments. |
| @gs.assert_built | ||
| def get_height_at(self, x: float, y: float) -> float: | ||
| """ | ||
| Get terrain height at world position (x, y). | ||
|
|
||
| Uses bilinear interpolation from the height field. | ||
|
|
||
| Parameters |
There was a problem hiding this comment.
This PR introduces additional public API/features beyond the stated GPU-detection fallback (e.g., new RigidEntity.get_height_at() / get_normal_at() terrain helpers). Either update the PR description/title to reflect these additions or split them into a separate PR to keep review scope focused and reduce merge risk.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f2fbb32004
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| """ | ||
| Read the ground truth sensor data (without noise). | ||
| """ | ||
| return self.read(envs_idx) |
There was a problem hiding this comment.
Return cached ground truth in read_ground_truth
read_ground_truth() now directly calls read(), which uses the non-ground-truth path when history_length == 1. That means sensors configured with noise/bias/delay/clipping/quantization can return processed values from read_ground_truth() instead of true contact forces, violating the sensor API contract and breaking existing expectations for ground-truth reads.
Useful? React with 👍 / 👎.
| if n_envs == 0: | ||
| hist = hist.reshape(3) | ||
| else: | ||
| hist = hist.reshape(n_envs, 3) |
There was a problem hiding this comment.
Reshape history using requested env count
When history_length > 1, each history slice is reshaped with global n_envs rather than len(envs_idx). If callers request a subset via read(envs_idx=...), buffered_data.at() returns data sized to that subset, so reshape(n_envs, 3) can raise a runtime error (or produce invalid shape assumptions) in parallel scenes.
Useful? React with 👍 / 👎.
| envs_idx = self._sanitize_envs_idx(envs_idx) | ||
| history_length = self._options.history_length | ||
|
|
||
| buffered_data = self._manager._buffered_data[gs.tc_float] |
There was a problem hiding this comment.
Read history from processed cache, not raw buffer
The history branch reads from _buffered_data, but _update_shared_cache() writes shared_ground_truth_cache into that ring buffer before applying delay, noise, clipping, and quantization. So read() returns raw undelayed forces whenever history_length > 1, which silently changes semantics versus the history_length == 1 path and contradicts the method contract (“with noise applied if applicable”).
Useful? React with 👍 / 👎.
| x_idx = x / h_scale | ||
| y_idx = y / h_scale |
There was a problem hiding this comment.
Transform world query into terrain-local coordinates
get_height_at() is documented to accept world (x, y) but directly converts those values to grid indices without accounting for the terrain entity pose. Since _load_terrain allows non-zero morph.pos/morph.quat, translated or rotated terrains will return incorrect heights from this method (and get_normal_at() has the same frame mismatch).
Useful? React with 👍 / 👎.
- Fix reshape crash in read() for batched env subsets: use n_query_envs instead of global n_envs when reshaping history slices - Restore read_ground_truth() contract: return noise-free ground truth cache instead of delegating to read() which returns processed data - Fix _draw_debug() env index: use cache[env_idx] instead of hardcoded cache[0] for correct environment in multi-env scenes Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Adds entity.get_height_at(x, y) to query terrain height at world position using bilinear interpolation from height field - Adds entity.get_normal_at(x, y) to compute surface normal at world position from height field gradient - Both methods handle boundary conditions gracefully Closes Genesis-Embodied-AI#2094
…se transform - Fix coordinate indexing: hf[y,x] -> hf[x,y] since heightfield is stored as [row, col] where row corresponds to x - Add pose transformation: convert world coords to terrain local frame using inv_transform_by_trans_quat(terrain_pos, terrain_quat) - Transform normals back to world frame with transform_by_quat - Height now includes terrain z-offset (terrain_pos[2]) Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…ents - Add fallback to in _get_gpu_indices() when /proc/driver/nvidia/gpus/ is unavailable - Add fallback to in _torch_get_gpu_idx() when /proc interface is missing - Handles cloud GPU instances and containers lacking /proc interface - Graceful degradation to single-GPU mode when both methods fail Closes Genesis-Embodied-AI#2683
fb6b2ae to
ccf9c5e
Compare
|
Supersedes by #2680. Closing. |
Summary
nvidia-smicommands when/proc/driver/nvidia/gpus/interface is unavailableChanges
tests/conftest.py:_get_gpu_indices()using--list-gpusto count GPUs_torch_get_gpu_idx()using--query-gpu=uuidto match GPU UUIDUsage
This change is transparent to users - Genesis will automatically detect GPUs using:
/proc/driver/nvidia/gpus/(primary method, unchanged)nvidia-smicommands (fallback for cloud/container environments)Benefit
/dev/nvidia*access to utilize multi-GPU detectionCloses #2683