Skip to content

[FEATURE] Add nvidia-smi fallback for GPU detection in cloud environments#2694

Closed
vlordier wants to merge 9 commits intoGenesis-Embodied-AI:mainfrom
vlordier:feature/nvidia-smi-fallback
Closed

[FEATURE] Add nvidia-smi fallback for GPU detection in cloud environments#2694
vlordier wants to merge 9 commits intoGenesis-Embodied-AI:mainfrom
vlordier:feature/nvidia-smi-fallback

Conversation

@vlordier
Copy link
Copy Markdown

Summary

  • Adds fallback to nvidia-smi commands when /proc/driver/nvidia/gpus/ interface is unavailable
  • Supports cloud GPU instances and containers lacking the /proc interface
  • Maintains backward compatibility - existing behavior unchanged when /proc interface is available

Changes

  • tests/conftest.py:
    • Added nvidia-smi fallback in _get_gpu_indices() using --list-gpus to count GPUs
    • Added nvidia-smi fallback in _torch_get_gpu_idx() using --query-gpu=uuid to match GPU UUID
    • Both fallbacks gracefully degrade to single-GPU mode when nvidia-smi is also unavailable

Usage

This change is transparent to users - Genesis will automatically detect GPUs using:

  1. /proc/driver/nvidia/gpus/ (primary method, unchanged)
  2. nvidia-smi commands (fallback for cloud/container environments)
  3. Single GPU assumption (final fallback)

Benefit

  • Enables Genesis to run on cloud GPU providers that don't mount the NVIDIA proc interface
  • Allows containers with only /dev/nvidia* access to utilize multi-GPU detection
  • Prevents unnecessary warnings and fallbacks to single-GPU mode in supported environments

Closes #2683

- Add history_length parameter to ContactForce sensor options
- Override read() to return historical force readings from ring buffer
- Update return format to include history dimension
- Extend ring buffer size to accommodate history
Copilot AI review requested due to automatic review settings April 11, 2026 07:20
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR primarily targets improved GPU detection in Linux cloud/container environments by falling back to nvidia-smi when the NVIDIA /proc interface is unavailable, but it also introduces additional sensor and terrain-related API changes.

Changes:

  • Add nvidia-smi fallbacks for multi-GPU detection and UUID-based GPU indexing in test setup.
  • Add history_length to ContactForce sensor options and adjust sensor buffering logic.
  • Add terrain sampling helpers (get_height_at, get_normal_at) to RigidEntity.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/conftest.py Adds nvidia-smi fallback paths for GPU enumeration and UUID matching when /proc/driver/nvidia/gpus/ is missing.
genesis/options/sensors/options.py Introduces history_length option for ContactForce sensor and documents intended history behavior.
genesis/engine/sensors/sensor_manager.py Updates buffer length calculation to account for sensors that declare history_length.
genesis/engine/sensors/contact_force.py Implements history_length behavior in ContactForceSensor.read() and adjusts debug drawing data source.
genesis/engine/entities/rigid_entity/rigid_entity.py Adds public terrain query helpers for height and surface normal interpolation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread genesis/engine/sensors/contact_force.py Outdated
Comment on lines +221 to +222
return self.read(envs_idx)

Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_ground_truth() now delegates to read(), which returns measured (noisy/quantized/clipped/delayed) values when history_length == 1. This breaks the contract of read_ground_truth() and will fail existing tests that expect ground-truth force (e.g., tests/test_sensors.py asserts ground truth is 0.0 before contact). Implement a separate ground-truth path that reads from SensorManager.get_cloned_from_cache(..., is_ground_truth=True) (and, if supporting history, a ground-truth history buffer).

Suggested change
return self.read(envs_idx)
envs_idx = self._sanitize_envs_idx(envs_idx)
history_length = self._options.history_length
if history_length == 1:
return self._get_formatted_data(
self._manager.get_cloned_from_cache(self, is_ground_truth=True), envs_idx
)
buffered_data = self._manager._buffered_data[gs.tc_float]
cache_slice = slice(self._cache_idx, self._cache_idx + 3)
n_envs = self._manager._sim.n_envs
history_data = []
for i in range(history_length):
hist = buffered_data.at(i, envs_idx, cache_slice)
if n_envs == 0:
hist = hist.reshape(3)
else:
hist = hist.reshape(n_envs, 3)
history_data.append(hist)
result = torch.stack(history_data, dim=1)
return result.squeeze(1) if n_envs == 0 else result

Copilot uses AI. Check for mistakes.
Comment thread genesis/engine/sensors/contact_force.py Outdated
if n_envs == 0:
hist = hist.reshape(3)
else:
hist = hist.reshape(n_envs, 3)
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For history_length > 1, read() pulls from self._manager._buffered_data[gs.tc_float], but this ring buffer is populated with shared_ground_truth_cache in _update_shared_cache() before delay/noise/clipping/quantization are applied. That means the returned history bypasses the documented measured-sensor processing (and can disagree with read() for history_length == 1). Consider storing the processed cache into a dedicated history ring buffer, or moving the ring-buffer set() to after processing so the buffer contains measured values.

Suggested change
hist = hist.reshape(n_envs, 3)
hist = hist.reshape(n_envs, 3)
hist = self._get_formatted_data(hist, envs_idx)

Copilot uses AI. Check for mistakes.
Comment thread genesis/engine/sensors/contact_force.py Outdated
Comment on lines +204 to +210
history_data = []
for i in range(history_length):
hist = buffered_data.at(i, envs_idx, cache_slice)
if n_envs == 0:
hist = hist.reshape(3)
else:
hist = hist.reshape(n_envs, 3)
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The history path reshapes each hist using n_envs = self._manager._sim.n_envs, but buffered_data.at(i, envs_idx, ...) returns data for len(envs_idx) environments (which may be a subset). Using n_envs will produce incorrect shapes or a reshape error when envs_idx is not all environments. Reshape based on len(envs_idx) (and keep the non-batched n_envs == 0 behavior consistent with Sensor._get_formatted_data).

Suggested change
history_data = []
for i in range(history_length):
hist = buffered_data.at(i, envs_idx, cache_slice)
if n_envs == 0:
hist = hist.reshape(3)
else:
hist = hist.reshape(n_envs, 3)
selected_n_envs = len(envs_idx) if n_envs != 0 else 0
history_data = []
for i in range(history_length):
hist = buffered_data.at(i, envs_idx, cache_slice)
if n_envs == 0:
hist = hist.reshape(3)
else:
hist = hist.reshape(selected_n_envs, 3)

Copilot uses AI. Check for mistakes.
Comment on lines +213 to +214
result = torch.stack(history_data, dim=1)
return result.squeeze(1) if n_envs == 0 else result
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the non-batched case (sim.n_envs == 0) and history_length > 1, torch.stack(history_data, dim=1) stacks 1D tensors of shape (3,) along dim=1, producing shape (3, history_length) (transposed vs the docstring expectation). If you want (history_length, 3) for the single-env case, stack along dim=0 (or ensure tensors are shaped (1,3) before stacking).

Suggested change
result = torch.stack(history_data, dim=1)
return result.squeeze(1) if n_envs == 0 else result
stack_dim = 0 if n_envs == 0 else 1
result = torch.stack(history_data, dim=stack_dim)
return result

Copilot uses AI. Check for mistakes.
Comment thread genesis/engine/sensors/contact_force.py Outdated
quat = self._link.get_quat(env_idx).reshape((4,))

force = self.read(env_idx).reshape((3,))
force = self._manager.get_cloned_from_cache(self, is_ground_truth=False)[0, :3].reshape((3,))
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_draw_debug() now always reads get_cloned_from_cache(...)[0, :3], which hard-codes environment 0 and ignores env_idx = context.rendered_envs_idx[0]. If rendered_envs_idx[0] != 0 (or if you later change debug rendering to other envs), the debug arrow will visualize the wrong environment’s force. Index the cloned cache with the correct env index (or reuse self.read(env_idx) while ensuring history-length handling doesn’t break debug).

Suggested change
force = self._manager.get_cloned_from_cache(self, is_ground_truth=False)[0, :3].reshape((3,))
cache = self._manager.get_cloned_from_cache(self, is_ground_truth=False)
force = cache[env_idx, :3].reshape((3,)) if env_idx is not None else cache[:3].reshape((3,))

Copilot uses AI. Check for mistakes.
The maximum output absolute force per each axis. Values above this will be clipped. Default is infinity.
history_length : int, optional
The number of historical force readings to store and return. Default is 1 (current value only).
When > 1, the sensor returns a history buffer of shape (history_length, 3) per environment.
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says the history buffer is of shape (history_length, 3) per environment, but the sensor system generally returns batched tensors shaped (n_envs, ...) when n_envs > 0. Please clarify the expected return shape for history_length > 1 (e.g., (n_envs, history_length, 3) for batched, (history_length, 3) for single-env) and ensure the implementation matches it.

Suggested change
When > 1, the sensor returns a history buffer of shape (history_length, 3) per environment.
When > 1, the sensor returns a history buffer with shape (history_length, 3) for a single environment,
or (n_envs, history_length, 3) when returned in batched form across multiple environments.

Copilot uses AI. Check for mistakes.
Comment on lines +4157 to +4164
@gs.assert_built
def get_height_at(self, x: float, y: float) -> float:
"""
Get terrain height at world position (x, y).

Uses bilinear interpolation from the height field.

Parameters
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces additional public API/features beyond the stated GPU-detection fallback (e.g., new RigidEntity.get_height_at() / get_normal_at() terrain helpers). Either update the PR description/title to reflect these additions or split them into a separate PR to keep review scope focused and reduce merge risk.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2fbb32004

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread genesis/engine/sensors/contact_force.py Outdated
"""
Read the ground truth sensor data (without noise).
"""
return self.read(envs_idx)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Return cached ground truth in read_ground_truth

read_ground_truth() now directly calls read(), which uses the non-ground-truth path when history_length == 1. That means sensors configured with noise/bias/delay/clipping/quantization can return processed values from read_ground_truth() instead of true contact forces, violating the sensor API contract and breaking existing expectations for ground-truth reads.

Useful? React with 👍 / 👎.

Comment thread genesis/engine/sensors/contact_force.py Outdated
if n_envs == 0:
hist = hist.reshape(3)
else:
hist = hist.reshape(n_envs, 3)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reshape history using requested env count

When history_length > 1, each history slice is reshaped with global n_envs rather than len(envs_idx). If callers request a subset via read(envs_idx=...), buffered_data.at() returns data sized to that subset, so reshape(n_envs, 3) can raise a runtime error (or produce invalid shape assumptions) in parallel scenes.

Useful? React with 👍 / 👎.

envs_idx = self._sanitize_envs_idx(envs_idx)
history_length = self._options.history_length

buffered_data = self._manager._buffered_data[gs.tc_float]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Read history from processed cache, not raw buffer

The history branch reads from _buffered_data, but _update_shared_cache() writes shared_ground_truth_cache into that ring buffer before applying delay, noise, clipping, and quantization. So read() returns raw undelayed forces whenever history_length > 1, which silently changes semantics versus the history_length == 1 path and contradicts the method contract (“with noise applied if applicable”).

Useful? React with 👍 / 👎.

Comment on lines +4182 to +4183
x_idx = x / h_scale
y_idx = y / h_scale
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Transform world query into terrain-local coordinates

get_height_at() is documented to accept world (x, y) but directly converts those values to grid indices without accounting for the terrain entity pose. Since _load_terrain allows non-zero morph.pos/morph.quat, translated or rotated terrains will return incorrect heights from this method (and get_normal_at() has the same frame mismatch).

Useful? React with 👍 / 👎.

vlordier and others added 5 commits April 11, 2026 15:00
- Fix reshape crash in read() for batched env subsets: use n_query_envs
  instead of global n_envs when reshaping history slices
- Restore read_ground_truth() contract: return noise-free ground truth
  cache instead of delegating to read() which returns processed data
- Fix _draw_debug() env index: use cache[env_idx] instead of hardcoded
  cache[0] for correct environment in multi-env scenes

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Adds entity.get_height_at(x, y) to query terrain height at world position
  using bilinear interpolation from height field
- Adds entity.get_normal_at(x, y) to compute surface normal at world position
  from height field gradient
- Both methods handle boundary conditions gracefully

Closes Genesis-Embodied-AI#2094
…se transform

- Fix coordinate indexing: hf[y,x] -> hf[x,y] since heightfield is
  stored as [row, col] where row corresponds to x
- Add pose transformation: convert world coords to terrain local frame
  using inv_transform_by_trans_quat(terrain_pos, terrain_quat)
- Transform normals back to world frame with transform_by_quat
- Height now includes terrain z-offset (terrain_pos[2])

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…ents

- Add fallback to  in _get_gpu_indices() when /proc/driver/nvidia/gpus/ is unavailable
- Add fallback to  in _torch_get_gpu_idx() when /proc interface is missing
- Handles cloud GPU instances and containers lacking /proc interface
- Graceful degradation to single-GPU mode when both methods fail

Closes Genesis-Embodied-AI#2683
@vlordier vlordier force-pushed the feature/nvidia-smi-fallback branch from fb6b2ae to ccf9c5e Compare April 11, 2026 13:28
@duburcqa
Copy link
Copy Markdown
Collaborator

Supersedes by #2680. Closing.

@duburcqa duburcqa closed this Apr 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Fallback to nvidia-smi -L if no /proc/driver/nvidia/gpus

3 participants