JuanFMontesinos/NvidiaOpticalFlowSDKforPytorch

Torch Optical Flow

PyTorch bindings for the NVIDIA Optical Flow SDK, providing hardware-accelerated optical flow computation on NVIDIA GPUs with end-to-end PyTorch integration from Python.

Please read more about the NVIDIA Optical Flow SDK here: https://developer.nvidia.com/optical-flow-sdk

What's this repo about?

  • Hardware-accelerated optical flow using a dedicated processor in Nvidia GPUs. No gradients are computed; this is for inference only.
  • Frame interpolation, ROI, and other additional SDK features are not supported.
  • Configurable speed preset (slow, medium, fast) and grid size (1, 2, 4).
  • Support for the ABGR8 input format, that is, 8-bit color (RGB) images.
  • End-to-end GPU processing with PyTorch.
  • Bidirectional optical flow computation (forward and backward) in a single call. This is supported by the SDK, but not exposed in the wrappers they provide.

The package comes with basic functionality for optical flow:

  • .flo reader and writer
  • Optical Flow common metrics
  • Visualization utilities
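The metrics API of this package is not documented in this README, so the following is only a dependency-free sketch of one common optical flow metric, the average endpoint error (EPE): the mean Euclidean distance between predicted and ground-truth flow vectors. The name `average_endpoint_error` and the nested-list layout are assumptions made for illustration; they are not this package's API.

```python
import math

def average_endpoint_error(pred, gt):
    """Mean Euclidean distance between two flow fields.

    Both inputs are nested lists shaped (H, W, 2) holding (dx, dy) per pixel.
    Hypothetical helper for illustration; not part of this package.
    """
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred, gt):
        for (pu, pv), (gu, gv) in zip(pred_row, gt_row):
            total += math.hypot(pu - gu, pv - gv)
            count += 1
    if count == 0:
        raise ValueError("flow fields are empty")
    return total / count

# One exact vector and one vector that is off by (3, 4), i.e. 5 pixels.
pred = [[(1.0, 0.0), (0.0, 0.0)]]
gt = [[(1.0, 0.0), (3.0, 4.0)]]
epe = average_endpoint_error(pred, gt)
print(epe)  # 2.5
```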

Requirements

System Requirements

  • NVIDIA GPU with Optical Flow SDK support (Turing, Ampere or Ada)
  • Tested on Linux (Ubuntu). The SDK is compatible with Windows too; read optical_flow_sdk>Read_Me.pdf for Windows instructions.

Software Requirements

  • CUDA toolkit >=10.2
  • NVIDIA Linux driver >= 528.85 (the driver version reported by nvidia-smi)
  • GCC >= 5.1
  • CMake >= 3.14
  • When you pip install torch, it comes with its own CUDA binaries. Install a CUDA toolkit version equal to or higher than the one your PyTorch installation was built with.
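As a rough sketch of the "same or higher" rule for the CUDA toolkit, the helper below compares two version strings numerically. The function names are hypothetical. In a real environment the PyTorch side would typically come from `torch.version.cuda` and the toolkit side from the release printed by `nvcc --version`; both are passed in as plain strings here so the sketch runs without either installed.

```python
def parse_version(text):
    """Turn '12.8' into (12, 8) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.strip().split("."))

def toolkit_is_new_enough(torch_cuda, toolkit_cuda):
    """True if the system toolkit is the same or newer than PyTorch's CUDA build."""
    return parse_version(toolkit_cuda) >= parse_version(torch_cuda)

# Hypothetical values; substitute the versions found on your machine.
print(toolkit_is_new_enough("12.8", "12.10"))  # True: 12.10 is newer than 12.8
print(toolkit_is_new_enough("12.8", "12.4"))   # False: toolkit is older
```

Comparing tuples of integers avoids the string-comparison trap where "12.10" sorts before "12.8".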

Installation

Precompiled binaries (experimental)

The precompiled wheels are built against PyTorch's CUDA binaries, so they work standalone without the full CUDA toolkit. Whether they work for you depends on your exact PyTorch and CUDA versions; they will very likely work with Torch>=2.10 and CUDA>=12.8.

uv add torch-nvidia-of-sdk
# or
uv add torch-nvidia-of-sdk[full] # To have headless opencv for visualization examples

If you still use pip:

pip install torch-nvidia-of-sdk
# or
pip install torch-nvidia-of-sdk[full] # To have headless opencv for visualization examples

Build from Source (Recommended if precompiled binaries do not work)

This repository uses uv. A one-shot command to build, install and test the package would be:

rm -rf build _skbuild .venv  && CC=gcc CXX=g++ uv sync --extra full --reinstall-package torch-nvidia-of-sdk && uv run --extra full examples/minimal_example.py

--reinstall-package forces uv to re-compile the package. Clearing the caches is not really needed, but I'm paranoid. --extra full is analogous to the pip extra in pip install torch-nvidia-of-sdk[full]; it just adds headless opencv for visualization.

Compiling your own wheel

CC=gcc CXX=g++ uv build --wheel --package torch-nvidia-of-sdk will build a wheel in dist/ that you can install with pip.

Quick Start

Try the minimal example to get started quickly:

# Run the minimal example (uses sample frames from assets/)
uv run --extra full examples/minimal_example.py

This will:

  1. Load two sample frames from the assets/ directory
  2. Compute optical flow using NVOF
  3. Generate visualizations and save results to output/

See examples/README.md for more examples and tutorials.

Basic Usage

import torch
import numpy as np
from of import TorchNVOpticalFlow
from of.io import read_flo, write_flo
from of.visualization import flow_to_color

# Load your images (RGB format, uint8)
img1 = torch.from_numpy(np.array(...)).cuda()  # Shape: (H, W, 3)
img2 = torch.from_numpy(np.array(...)).cuda()

# Initialize optical flow engine
flow_engine = TorchNVOpticalFlow(
    width=img1.shape[1],
    height=img1.shape[0],
    gpu_id=0,
    preset="medium",  # "slow", "medium", or "fast"
    grid_size=1,      # 1, 2, or 4
)

# Compute optical flow
flow = flow_engine.compute_flow(img1, img2, upsample=True)

# Flow is a (H, W, 2) tensor where flow[..., 0] is x-displacement, flow[..., 1] is y-displacement
print(f"Flow shape: {flow.shape}")

# Visualize flow as RGB image
flow_rgb = flow_to_color(flow.cpu().numpy())

# Save flow to .flo file
write_flo("output_flow.flo", flow)

Asynchronous Execution & CUDA Streams (v5.0.3)

Since v5.0.3, compute_flow and compute_flow_bidirectional are fully asynchronous and stream-ordered (except for upsample=True, see below). The binding binds the Optical Flow engine's input/output streams to the caller's current PyTorch CUDA stream (via nvOFSetIOCudaStreams) before every execute:

  • The OFA hardware engine waits for all prior work on the current stream (e.g. the input copies, or any torch ops producing the input tensors) before reading its inputs.
  • OFA completion is signaled back onto the same stream, so the output copy — and any torch op you run afterwards on that stream — is correctly ordered behind it.
  • The Python call returns immediately; no host synchronization happens anymore. The returned tensor's contents materialize when the stream reaches them, exactly like any regular asynchronous torch CUDA op.

The purpose: the OFA is a dedicated hardware engine (like NVENC/NVDEC) that runs physically in parallel with the SMs. With stream-ordered semantics you can compute flow on a side stream while your main stream runs compute kernels, hiding the flow latency entirely:

import torch
from of import TorchNVOpticalFlow

engine = TorchNVOpticalFlow(width=W, height=H, gpu_id=0, preset="fast", grid_size=4)

side = torch.cuda.Stream()
main = torch.cuda.current_stream()  # stream that will consume the flow
flows_ready = torch.cuda.Event()

# Inputs were produced on the main stream, so the side stream must wait for them
side.wait_stream(main)

# Submit flow computation on a side stream (returns immediately; OFA runs in parallel)
with torch.cuda.stream(side):
    flow = engine.compute_flow(ref_rgba, alt_rgba, upsample=False)
    # Allocator safety across streams: flow is allocated on `side` but used on `main`
    flow.record_stream(main)
flows_ready.record(side)

heavy_compute(x)  # main-stream SM work, overlaps with the OFA

# Order the main stream behind the flow result before consuming it (GPU-side wait, no host stall)
main.wait_event(flows_ready)
use(flow)

Notes:

  • Plain single-stream usage needs no code changes: everything stays on your current (default) stream and is ordered as before — just without the host stalls (2 per call in <= 5.0.2).
  • Synchronizing consumers (.cpu(), .numpy(), print(flow)) behave as with any async torch op: they sync implicitly. Only non-torch consumers (raw pointer access, custom CUDA code on other streams) need the event/wait_event discipline shown above.
  • Exception — upsample=True: the SDK's upsampling kernel is launched by NVIDIA's utils on the default stream with no stream argument, so this path still host-synchronizes around the upsample for correctness. Prefer upsample=False in latency-sensitive pipelines and upsample the flow yourself in torch if needed.
  • One engine instance is not thread-safe (it owns a single set of SDK input/output buffers); calling it sequentially from one thread, on whichever stream, is fine.
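One way to respect the thread-safety note above is to give every thread its own engine. The sketch below does that with `threading.local`. It assumes that creating several engine instances on one GPU is acceptable, which this README does not state, and `engine_for_this_thread` is a hypothetical helper, not part of the package. A plain `object` stands in for the engine so the sketch runs without a GPU.

```python
import threading

_thread_state = threading.local()

def engine_for_this_thread(factory):
    """Return the calling thread's engine, creating it on first use."""
    if not hasattr(_thread_state, "engine"):
        _thread_state.engine = factory()
    return _thread_state.engine

# Stand-in factory so the sketch runs anywhere. In real code this would be
# something like: lambda: TorchNVOpticalFlow(width=W, height=H, gpu_id=0)
make_engine = object

seen = []

def worker():
    first = engine_for_this_thread(make_engine)
    second = engine_for_this_thread(make_engine)  # reuses the same instance
    seen.append((first, second))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread reuses its own instance across calls, and no instance is ever shared between threads.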

API Reference

Core Class: TorchNVOpticalFlow

Constructor

TorchNVOpticalFlow(
    width: int,
    height: int,
    gpu_id: int = 0,
    preset: str = "medium",
    grid_size: int = 1,
    bidirectional: bool = False
)

Parameters:

  • width: Width of input images in pixels
  • height: Height of input images in pixels
  • gpu_id: CUDA device ID (default: 0)
  • preset: Speed/quality preset. Options:
    • "slow": Highest quality, slowest
    • "medium": Balanced (recommended)
    • "fast": Fastest, lower quality
  • grid_size: Output grid size. Options: 1, 2, or 4
    • 1: Full resolution output (default)
    • 2/4: Downsampled output (faster, use with upsample=True to restore resolution)
  • bidirectional: Enable bidirectional flow computation (forward and backward)

Methods

compute_flow(input, reference, upsample=True, disable_temporal_hints=False)

Compute forward optical flow between two frames.

Asynchronous and ordered on the caller's current CUDA stream since v5.0.3 (the upsample=True path host-synchronizes; see Asynchronous Execution & CUDA Streams).

Parameters:

  • input: First frame as CUDA tensor of shape (H, W, 4), dtype uint8, RGBA format
  • reference: Second frame as CUDA tensor of shape (H, W, 4), dtype uint8, RGBA format
  • upsample: If True and grid_size > 1, upsample flow to full resolution (default: True)
  • disable_temporal_hints: If True, the OFA does not seed this call's search from the previous call's result (v5.0.4). Set it when successive calls are not consecutive frames of one video — e.g. alternating (ref, alt_k) pairs — where the stale seeding degrades the vectors (default: False)

Returns:

  • torch.Tensor: Optical flow of shape (H, W, 2), dtype float32
    • flow[..., 0]: Horizontal displacement (x)
    • flow[..., 1]: Vertical displacement (y)

Example:

flow = flow_engine.compute_flow(img1_rgba, img2_rgba, upsample=True)
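The disable_temporal_hints case described above, one reference frame paired with several alternates, could be wrapped as in the following sketch. `flows_against_reference` is a hypothetical helper, and `RecordingEngine` is only a stand-in that mimics the compute_flow signature documented here so the example runs without a GPU; with the real package you would pass a TorchNVOpticalFlow instance and CUDA tensors instead.

```python
def flows_against_reference(engine, ref, alternates):
    """Compute flow from one reference frame to each alternate frame.

    The pairs are not consecutive video frames, so temporal hints are disabled.
    """
    return [
        engine.compute_flow(ref, alt, upsample=False, disable_temporal_hints=True)
        for alt in alternates
    ]

class RecordingEngine:
    """Stand-in for TorchNVOpticalFlow that records how it was called."""

    def __init__(self):
        self.calls = []

    def compute_flow(self, input, reference, upsample=True, disable_temporal_hints=False):
        self.calls.append((input, reference, upsample, disable_temporal_hints))
        return f"flow({input}->{reference})"

engine = RecordingEngine()
flows = flows_against_reference(engine, "ref", ["alt_0", "alt_1"])
print(flows)  # ['flow(ref->alt_0)', 'flow(ref->alt_1)']
```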

compute_flow_bidirectional(input, reference, upsample=True, disable_temporal_hints=False)

Compute both forward and backward optical flow.

Parameters:

  • input: First frame as CUDA tensor of shape (H, W, 4), dtype uint8, RGBA format
  • reference: Second frame as CUDA tensor of shape (H, W, 4), dtype uint8, RGBA format
  • upsample: If True and grid_size > 1, upsample flows to full resolution (default: True)
  • disable_temporal_hints: See compute_flow (default: False)

Returns:

  • Tuple[torch.Tensor, torch.Tensor]: Forward and backward flows, each of shape (H, W, 2)

Example:

forward_flow, backward_flow = flow_engine.compute_flow_bidirectional(
    img1_rgba, img2_rgba, upsample=True
)
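A common use of a forward/backward pair is a consistency check: where the two flows do not roughly cancel out, the pixel is likely occluded or badly matched. The sketch below is a dependency-free illustration on nested lists, not part of this package. It assumes the forward flow maps the first frame to the second and the backward flow maps the second to the first; `consistency_mask` and the 0.5 pixel tolerance are hypothetical choices.

```python
import math

def consistency_mask(forward, backward, tol=0.5):
    """Mark pixels whose forward and backward flows roughly cancel out.

    forward/backward are nested lists shaped (H, W, 2) with (dx, dy) in pixels.
    """
    h, w = len(forward), len(forward[0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            fx, fy = forward[y][x]
            # Nearest pixel in the second frame that (x, y) lands on.
            tx, ty = round(x + fx), round(y + fy)
            if 0 <= tx < w and 0 <= ty < h:
                bx, by = backward[ty][tx]
                row.append(math.hypot(fx + bx, fy + by) <= tol)
            else:
                row.append(False)  # left the frame, so it cannot be verified
        mask.append(row)
    return mask

forward = [[(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]]
backward = [[(0.0, 0.0), (-1.0, 0.0), (0.0, 0.0)]]
mask = consistency_mask(forward, backward)
print(mask)  # [[True, False, True]]
```

On real outputs the same test would normally be vectorized in torch rather than looped in Python.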

output_shape()

Get the output shape for the current configuration.

Returns:

  • List[int]: Output shape as [height, width, 2]
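For intuition about how grid_size affects the output size, the sketch below assumes each flow vector covers a grid_size x grid_size block and that partial blocks at the border still produce a vector (ceiling division). That rounding rule is an assumption, and `expected_output_shape` is a hypothetical helper; treat output_shape() on a real engine as the authority.

```python
def expected_output_shape(height, width, grid_size):
    """Estimated [height, width, 2] of the flow grid before upsampling."""
    if grid_size not in (1, 2, 4):
        raise ValueError("grid_size must be 1, 2, or 4")
    out_h = (height + grid_size - 1) // grid_size  # ceiling division
    out_w = (width + grid_size - 1) // grid_size
    return [out_h, out_w, 2]

print(expected_output_shape(1080, 1920, 1))  # [1080, 1920, 2]
print(expected_output_shape(1080, 1920, 4))  # [270, 480, 2]
print(expected_output_shape(1081, 1920, 4))  # [271, 480, 2]
```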

I/O Utilities (of.io)

read_flo(filepath)

Read optical flow from .flo file (Middlebury format).

Parameters:

  • filepath: Path to .flo file (str or Path)

Returns:

  • np.ndarray: Flow array of shape (H, W, 2), dtype float32

write_flo(filepath, flow)

Write optical flow to .flo file (Middlebury format).

Parameters:

  • filepath: Output file path (str or Path)
  • flow: Flow array of shape (H, W, 2) (numpy array or torch tensor)
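For readers curious about what these functions read and write, here is a standard-library-only sketch of the Middlebury .flo layout: a float32 magic number 202021.25, then int32 width and int32 height, then width x height interleaved (dx, dy) float32 pairs in row-major order, all little-endian. `write_flo_plain` and `read_flo_plain` are hypothetical names used for illustration; they are not this package's implementation.

```python
import array
import os
import struct
import sys
import tempfile

FLO_MAGIC = 202021.25  # sanity-check value; its little-endian bytes spell "PIEH"

def write_flo_plain(path, rows):
    """Write a flow field given as nested lists shaped (H, W, 2)."""
    height, width = len(rows), len(rows[0])
    data = array.array("f", (c for row in rows for vec in row for c in vec))
    if sys.byteorder == "big":
        data.byteswap()  # the file format is little-endian
    with open(path, "wb") as f:
        f.write(struct.pack("<fii", FLO_MAGIC, width, height))
        f.write(data.tobytes())

def read_flo_plain(path):
    """Read a .flo file back into nested lists shaped (H, W, 2)."""
    with open(path, "rb") as f:
        magic, width, height = struct.unpack("<fii", f.read(12))
        if magic != FLO_MAGIC:
            raise ValueError("not a .flo file")
        data = array.array("f")
        data.frombytes(f.read(width * height * 2 * 4))
    if sys.byteorder == "big":
        data.byteswap()
    values = iter(data)
    return [[(next(values), next(values)) for _ in range(width)] for _ in range(height)]

# Round trip on a tiny 2x2 field whose values are exact in float32.
flow = [[(1.5, -2.0), (0.0, 0.25)], [(3.0, 4.0), (-1.0, 0.5)]]
path = os.path.join(tempfile.mkdtemp(), "demo.flo")
write_flo_plain(path, flow)
restored = read_flo_plain(path)
```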

Examples

This repository includes several examples in the examples/ directory. See examples/README.md for detailed documentation and usage instructions.

Changelog

5.0.4 (2026-06-13)

  • disable_temporal_hints argument on compute_flow / compute_flow_bidirectional (default False, preserving previous behavior). The OFA seeds each call's search from the previous call's result; when successive calls are not consecutive frames of one video (e.g. alternating (ref, alt_k) pairs across sliding batches), that seeding degrades the vectors — identical inputs can return nonzero, call-order-dependent flow. Set it to True for non-consecutive pairs, per NVIDIA's recommendation (maps to NV_OF_EXECUTE_INPUT_PARAMS::disableTemporalHints).

5.0.3 (2026-06-12)

  • Stream-ordered asynchronous execution. compute_flow / compute_flow_bidirectional now bind the OF engine's input/output streams to the caller's current PyTorch CUDA stream (nvOFSetIOCudaStreams) before each execute, and the two host cudaStreamSynchronize calls per invocation were removed. Calls return immediately and results materialize on the stream. Purpose: remove host stalls from real-time pipelines and allow overlapping OFA work with SM compute via side streams (see Asynchronous Execution & CUDA Streams).
  • upsample=True still host-synchronizes around the SDK's default-stream upsampling kernel.
  • Added SetIOCudaStreams to the NvOFCuda / NvOFCudaAPI SDK wrapper classes.

5.0.2 (2026-02-05)

  • Building against PyTorch's CUDA binaries to ensure compatibility and standalone functionality without requiring the full CUDA toolkit.
  • CSEM Dataset now checks that the expected images directory exists.

About

This repository provides a PyTorch binding for Nvidia's Optical Flow SDK. It permits using Nvidia's optical flow chip directly on Torch tensors, end to end on the GPU without host copies.
