PyTorch bindings for the NVIDIA Optical Flow SDK, providing hardware-accelerated optical flow computation on NVIDIA GPUs with end-to-end PyTorch integration from Python.
Please read more about the NVIDIA Optical Flow SDK here: https://developer.nvidia.com/optical-flow-sdk
- Hardware-accelerated optical flow using the dedicated optical flow engine in NVIDIA GPUs. No gradients are computed; this is for inference only.
- Frame interpolation, ROI, and other additional SDK features are not supported.
- Configurable speed/quality preset (slow, medium, fast) and grid size (1, 2, 4).
- Support for the ABGR8 format, namely RGB images.
- End-to-end GPU processing with PyTorch.
- Bidirectional optical flow computation (forward and backward) in a single call. This is supported by the SDK, but not exposed in the wrappers they provide.
The package comes with basic functionality for optical flow:
- .flo reader and writer
- Optical Flow common metrics
- Visualization utilities
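For reference, the most common optical flow metric is the average end-point error (EPE): the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below is a plain NumPy illustration of that definition, not the package's own implementation; the function name endpoint_error is hypothetical.

```python
import numpy as np


def endpoint_error(flow_pred: np.ndarray, flow_gt: np.ndarray) -> float:
    """Average end-point error between two (H, W, 2) flow fields."""
    if flow_pred.shape != flow_gt.shape or flow_pred.shape[-1] != 2:
        raise ValueError("expected two flow arrays of identical shape (H, W, 2)")
    # Per-pixel Euclidean distance between predicted and ground-truth vectors.
    diff = flow_pred.astype(np.float64) - flow_gt.astype(np.float64)
    per_pixel = np.linalg.norm(diff, axis=-1)
    return float(per_pixel.mean())


gt = np.zeros((4, 5, 2), dtype=np.float32)
pred = gt.copy()
pred[..., 0] = 3.0  # x-displacement error
pred[..., 1] = 4.0  # y-displacement error
epe = endpoint_error(pred, gt)  # every vector is off by (3, 4)
```

Lower is better; an EPE of 0 means the two flow fields are identical.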
- NVIDIA GPU with Optical Flow SDK support (Turing, Ampere or Ada)
- Tested on Linux (Ubuntu); the SDK is compatible with Windows too. Read optical_flow_sdk>Read_Me.pdf for Windows instructions.
- CUDA toolkit >=10.2
- Linux driver (version reported by nvidia-smi) >=528.85
- GCC >= 5.1
- CMake >= 3.14
- When you pip install torch, it comes with its own CUDA binaries. Get the same or higher CUDA toolkit version as your PyTorch installation.
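To find out which CUDA version your PyTorch build ships with, you can query torch.version.cuda. The sketch below compares it against a toolkit version you supply by hand; the helper cuda_version_tuple and the "12.8" placeholder are assumptions for illustration only.

```python
import torch


def cuda_version_tuple(version: str) -> tuple:
    """Turn a version string such as "12.8" into (12, 8) for comparison."""
    return tuple(int(part) for part in version.split(".")[:2])


torch_cuda = torch.version.cuda  # None on CPU-only PyTorch builds
print(f"PyTorch {torch.__version__}, built against CUDA {torch_cuda}")

toolkit = "12.8"  # placeholder: replace with the version from `nvcc --version`
if torch_cuda is not None:
    ok = cuda_version_tuple(toolkit) >= cuda_version_tuple(torch_cuda)
    print("toolkit is new enough" if ok else "toolkit is older than PyTorch's CUDA")
```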
The prebuilt wheels are compiled against PyTorch's CUDA binaries to ensure compatibility and standalone functionality without requiring the full CUDA toolkit. Whether they work still depends on the exact PyTorch and CUDA versions you have; they will very likely work with Torch>=2.10 and CUDA>=12.8.
uv add torch-nvidia-of-sdk
# or
uv add torch-nvidia-of-sdk[full] # To have headless opencv for visualization examples
If you still use pip:
pip install torch-nvidia-of-sdk
# or
pip install torch-nvidia-of-sdk[full] # To have headless opencv for visualization examples
This repository uses uv. A one-shot command to build, install and test the package would be:
rm -rf build _skbuild .venv && CC=gcc CXX=g++ uv sync --extra full --reinstall-package torch-nvidia-of-sdk && uv run --extra full examples/minimal_example.py
--reinstall-package forces uv to re-compile the package. Clearing caches is not really needed but I'm paranoid.
--extra full is analogous to pip extras pip install torch-nvidia-of-sdk[full]. It just adds headless opencv for visualization
CC=gcc CXX=g++ uv build --wheel --package torch-nvidia-of-sdk will build a wheel in dist/ that you can install with pip.
Try the minimal example to get started quickly:
# Run the minimal example (uses sample frames from assets/)
uv run --extra full examples/minimal_example.py
This will:
- Load two sample frames from the assets/ directory
- Compute optical flow using NVOF
- Generate visualizations and save results to output/
See examples/README.md for more examples and tutorials.
import torch
import numpy as np
from of import TorchNVOpticalFlow
from of.io import read_flo, write_flo
from of.visualization import flow_to_color
# Load your images (RGB format, uint8)
img1 = torch.from_numpy(np.array(...)).cuda() # Shape: (H, W, 3)
img2 = torch.from_numpy(np.array(...)).cuda()
# Initialize optical flow engine
flow_engine = TorchNVOpticalFlow(
    width=img1.shape[1],
    height=img1.shape[0],
    gpu_id=0,
    preset="medium",  # "slow", "medium", or "fast"
    grid_size=1,  # 1, 2, or 4
)
# Compute optical flow
flow = flow_engine.compute_flow(img1, img2, upsample=True)
# Flow is a (H, W, 2) tensor where flow[..., 0] is x-displacement, flow[..., 1] is y-displacement
print(f"Flow shape: {flow.shape}")
# Visualize flow as RGB image
flow_rgb = flow_to_color(flow.cpu().numpy())
# Save flow to .flo file
write_flo("output_flow.flo", flow)
Since v5.0.3, compute_flow and compute_flow_bidirectional are fully asynchronous and stream-ordered (except for upsample=True, see below). The binding binds the Optical Flow engine's input/output streams to the caller's current PyTorch CUDA stream (via nvOFSetIOCudaStreams) before every execute:
- The OFA hardware engine waits for all prior work on the current stream (e.g. the input copies, or any torch ops producing the input tensors) before reading its inputs.
- OFA completion is signaled back onto the same stream, so the output copy — and any torch op you run afterwards on that stream — is correctly ordered behind it.
- The Python call returns immediately; no host synchronization happens anymore. The returned tensor's contents materialize when the stream reaches them, exactly like any regular asynchronous torch CUDA op.
The purpose: the OFA is a dedicated hardware engine (like NVENC/NVDEC) that runs physically in parallel with the SMs. With stream-ordered semantics you can compute flow on a side stream while your main stream runs compute kernels, hiding the flow latency entirely:
import torch
from of import TorchNVOpticalFlow
engine = TorchNVOpticalFlow(width=W, height=H, gpu_id=0, preset="fast", grid_size=4)
side = torch.cuda.Stream()
flows_ready = torch.cuda.Event()
# Submit flow computation on a side stream (returns immediately; OFA runs in parallel)
with torch.cuda.stream(side):
    flow = engine.compute_flow(ref_rgba, alt_rgba, upsample=False)
flow.record_stream(torch.cuda.current_stream()) # allocator safety across streams
flows_ready.record(side)
heavy_compute(x) # main-stream SM work, overlaps with the OFA
# Order the main stream behind the flow result before consuming it (GPU-side wait, no host stall)
torch.cuda.current_stream().wait_event(flows_ready)
use(flow)
Notes:
- Plain single-stream usage needs no code changes: everything stays on your current (default) stream and is ordered as before, just without the host stalls (2 per call in <= 5.0.2).
- Synchronizing consumers (.cpu(), .numpy(), print(flow)) behave as with any async torch op: they sync implicitly. Only non-torch consumers (raw pointer access, custom CUDA code on other streams) need the event/wait_event discipline shown above.
- Exception, upsample=True: the SDK's upsampling kernel is launched by NVIDIA's utils on the default stream with no stream argument, so this path still host-synchronizes around the upsample for correctness. Prefer upsample=False in latency-sensitive pipelines and upsample the flow yourself in torch if needed.
- One engine instance is not thread-safe (it owns a single set of SDK input/output buffers); calling it sequentially from one thread, on whichever stream, is fine.
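Upsampling the flow yourself in torch could look like the sketch below, which resizes a channels-last flow tensor with torch.nn.functional.interpolate. The helper name upsample_flow is hypothetical. The sketch assumes the flow vectors are already expressed in full-resolution pixel units, so only the spatial grid is resized; if your vectors turn out to be in grid units, multiply them by the grid size as well. Check this against your own data before relying on it.

```python
import torch
import torch.nn.functional as F


def upsample_flow(flow: torch.Tensor, out_height: int, out_width: int) -> torch.Tensor:
    """Bilinearly resize a (h, w, 2) flow field to (out_height, out_width, 2)."""
    # interpolate expects (N, C, H, W): move channels first and add a batch dim.
    nchw = flow.permute(2, 0, 1).unsqueeze(0)
    up = F.interpolate(
        nchw, size=(out_height, out_width), mode="bilinear", align_corners=False
    )
    # Back to the (H, W, 2) layout used throughout this README.
    return up.squeeze(0).permute(1, 2, 0).contiguous()


coarse = torch.ones(4, 6, 2)  # stand-in for a grid_size=4 result; works on CPU or GPU
full = upsample_flow(coarse, 16, 24)
```

Because interpolate is a regular torch op, it is enqueued on the current CUDA stream like any other kernel and does not add a host synchronization.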
TorchNVOpticalFlow(
    width: int,
    height: int,
    gpu_id: int = 0,
    preset: str = "medium",
    grid_size: int = 1,
    bidirectional: bool = False
)
Parameters:
- width: Width of input images in pixels
- height: Height of input images in pixels
- gpu_id: CUDA device ID (default: 0)
- preset: Speed/quality preset. Options:
  - "slow": Highest quality, slowest
  - "medium": Balanced (recommended)
  - "fast": Fastest, lower quality
- grid_size: Output grid size. Options: 1, 2, or 4
  - 1: Full resolution output (default)
  - 2/4: Downsampled output (faster, use with upsample=True to restore resolution)
- bidirectional: Enable bidirectional flow computation (forward and backward)
compute_flow computes forward optical flow between two frames.
It is asynchronous and ordered on the caller's current CUDA stream since v5.0.3 (the upsample=True path host-synchronizes; see Asynchronous Execution & CUDA Streams).
Parameters:
- input: First frame as CUDA tensor of shape (H, W, 4), dtype uint8, RGBA format
- reference: Second frame as CUDA tensor of shape (H, W, 4), dtype uint8, RGBA format
- upsample: If True and grid_size > 1, upsample flow to full resolution (default: True)
- disable_temporal_hints: If True, the OFA does not seed this call's search from the previous call's result (v5.0.4). Set it when successive calls are not consecutive frames of one video (e.g. alternating (ref, alt_k) pairs), where the stale seeding degrades the vectors (default: False)
Returns:
- torch.Tensor: Optical flow of shape (H, W, 2), dtype float32
  - flow[..., 0]: Horizontal displacement (x)
  - flow[..., 1]: Vertical displacement (y)
Example:
flow = flow_engine.compute_flow(img1_rgba, img2_rgba, upsample=True)
compute_flow_bidirectional computes both forward and backward optical flow.
Parameters:
- input: First frame as CUDA tensor of shape (H, W, 4), dtype uint8, RGBA format
- reference: Second frame as CUDA tensor of shape (H, W, 4), dtype uint8, RGBA format
- upsample: If True and grid_size > 1, upsample flows to full resolution (default: True)
- disable_temporal_hints: See compute_flow (default: False)
Returns:
- Tuple[torch.Tensor, torch.Tensor]: Forward and backward flows, each of shape (H, W, 2)
Example:
forward_flow, backward_flow = flow_engine.compute_flow_bidirectional(
    img1_rgba, img2_rgba, upsample=True
)
Get the output shape for the current configuration.
Returns:
- List[int]: Output shape as [height, width, 2]
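As a rough guide to how grid_size affects that shape, the sketch below assumes each output dimension is the input dimension divided by grid_size and rounded up, which matches how NVIDIA's SDK sample code sizes its output buffers as far as I know. The function expected_output_shape is hypothetical and not part of this package; in real code, prefer the engine's own shape query.

```python
from typing import List


def expected_output_shape(height: int, width: int, grid_size: int) -> List[int]:
    """Estimate the [height, width, 2] flow shape before any upsampling."""
    if grid_size not in (1, 2, 4):
        raise ValueError("grid_size must be 1, 2, or 4")
    # Ceiling division: a partial grid cell at the border still gets a vector.
    out_h = (height + grid_size - 1) // grid_size
    out_w = (width + grid_size - 1) // grid_size
    return [out_h, out_w, 2]


shape_full = expected_output_shape(1080, 1920, 1)
shape_coarse = expected_output_shape(1080, 1920, 4)
```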
read_flo reads optical flow from a .flo file (Middlebury format).
Parameters:
- filepath: Path to .flo file (str or Path)
Returns:
- np.ndarray: Flow array of shape (H, W, 2), dtype float32
write_flo writes optical flow to a .flo file (Middlebury format).
Parameters:
- filepath: Output file path (str or Path)
- flow: Flow array of shape (H, W, 2) (numpy array or torch tensor)
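For readers unfamiliar with the Middlebury .flo layout, the sketch below writes and reads it using only the standard library: a float32 magic number 202021.25, then int32 width and height, then row-major interleaved float32 (u, v) pairs, all little-endian. The names write_flo_sketch and read_flo_sketch are hypothetical; this illustrates the file format and is not the package's read_flo / write_flo.

```python
import os
import struct
import tempfile

FLO_MAGIC = 202021.25  # float32 sanity tag; its little-endian bytes spell "PIEH"


def write_flo_sketch(path, rows):
    """rows: list of H rows, each a list of W (u, v) tuples."""
    height, width = len(rows), len(rows[0])
    flat = [component for row in rows for uv in row for component in uv]
    with open(path, "wb") as f:
        f.write(struct.pack("<fii", FLO_MAGIC, width, height))  # 12-byte header
        f.write(struct.pack(f"<{len(flat)}f", *flat))


def read_flo_sketch(path):
    with open(path, "rb") as f:
        magic, width, height = struct.unpack("<fii", f.read(12))
        if magic != FLO_MAGIC:
            raise ValueError("not a Middlebury .flo file")
        count = width * height * 2
        flat = struct.unpack(f"<{count}f", f.read(count * 4))
    # Regroup the flat sequence into H rows of W (u, v) tuples.
    return [
        [(flat[2 * (y * width + x)], flat[2 * (y * width + x) + 1]) for x in range(width)]
        for y in range(height)
    ]


flow_rows = [[(1.0, -2.0), (0.5, 0.25)], [(0.0, 0.0), (3.0, 4.0)]]
flo_path = os.path.join(tempfile.mkdtemp(), "example.flo")
write_flo_sketch(flo_path, flow_rows)
restored = read_flo_sketch(flo_path)
```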
This repository includes several examples in the examples/ directory.
See examples/README.md for detailed documentation and usage instructions.
- disable_temporal_hints argument on compute_flow / compute_flow_bidirectional (default False, preserving previous behavior). The OFA seeds each call's search from the previous call's result; when successive calls are not consecutive frames of one video (e.g. alternating (ref, alt_k) pairs across sliding batches), that seeding degrades the vectors: identical inputs can return nonzero, call-order-dependent flow. Set it to True for non-consecutive pairs, per NVIDIA's recommendation (maps to NV_OF_EXECUTE_INPUT_PARAMS::disableTemporalHints).
- Stream-ordered asynchronous execution. compute_flow / compute_flow_bidirectional now bind the OF engine's input/output streams to the caller's current PyTorch CUDA stream (nvOFSetIOCudaStreams) before each execute, and the two host cudaStreamSynchronize calls per invocation were removed. Calls return immediately and results materialize on the stream. Purpose: remove host stalls from real-time pipelines and allow overlapping OFA work with SM compute via side streams (see Asynchronous Execution & CUDA Streams).
- upsample=True still host-synchronizes around the SDK's default-stream upsampling kernel.
- Added SetIOCudaStreams to the NvOFCuda / NvOFCudaAPI SDK wrapper classes.
- Building against PyTorch's CUDA binaries to ensure compatibility and standalone functionality without requiring the full CUDA toolkit.
- CSEM Dataset now checks that the expected images directory exists.