Networking error in Docker due to host IP detection (workaround: set VLLM_HOST_IP) #743

@insop

🐛 Describe the bug

Description

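In a Docker environment, running the command below fails with networking errors: a link-local IPv6 address (fe80::/10) is chosen as the head node address, but the c10d client socket then tries to resolve it as IPv4 and cannot connect.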

Command

uv run python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

Workaround

Based on the logic in monarch_executor.py, the host IP can be overridden via an environment variable:

if host_ip := os.environ.get("VLLM_HOST_IP"):
    return host_ip

Setting the following resolves the issue in my environment:

export VLLM_HOST_IP=127.0.0.1

A more robust _get_host_ip() (e.g., preferring IPv4 or avoiding link-local IPv6 addresses in containers) could help. I'm happy to open a PR if that would be useful.
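To make the suggestion concrete, here is a minimal sketch of what such a selection could look like. It is not the code in monarch_executor.py; the names pick_host_ip and local_candidates are hypothetical, and the ranking (routable IPv4 first, then routable IPv6, then loopback, never link-local) is an assumption about what would work well in containers.

```python
import ipaddress
import os
import socket


def local_candidates():
    """Return address strings the local hostname resolves to (hypothetical helper)."""
    try:
        infos = socket.getaddrinfo(socket.gethostname(), None)
    except socket.gaierror:
        return []
    # info[4] is the sockaddr tuple; its first element is the address string.
    return [info[4][0] for info in infos]


def pick_host_ip(candidates, env=None):
    """Pick a host IP, honoring VLLM_HOST_IP and avoiding link-local addresses."""
    env = os.environ if env is None else env
    # Keep the existing override behavior: an explicit setting always wins.
    if host_ip := env.get("VLLM_HOST_IP"):
        return host_ip

    parsed = []
    for raw in candidates:
        try:
            # Drop any zone suffix such as "%eth0" before parsing.
            parsed.append(ipaddress.ip_address(raw.split("%", 1)[0]))
        except ValueError:
            continue  # Skip anything that is not an IP literal.

    def rank(addr):
        # Lower is better; 3 means "do not use".
        if addr.is_link_local or addr.is_unspecified or addr.is_multicast:
            return 3
        if addr.is_loopback:
            return 2
        return 0 if addr.version == 4 else 1

    usable = sorted(parsed, key=rank)
    if usable and rank(usable[0]) < 3:
        return str(usable[0])
    # Nothing usable found: fall back to IPv4 loopback (assumed single-node default).
    return "127.0.0.1"


if __name__ == "__main__":
    print(pick_host_ip(local_candidates()))
```

With the address from the log below as the only candidate, this sketch falls back to 127.0.0.1, which matches the manual workaround above. A multi-node setup would need a routable address instead, so the loopback fallback is only reasonable for single-host runs.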

Error message

  (EngineCore_DP0 pid=8447) [2026-01-29 01:26:35] INFO monarch_executor.py:386: [actor=<root>] [MonarchExecutor] Head node: fe80::222:48ff:fe49:ba90:51391
  (EngineCore_DP0 pid=8447) [2026-01-29 01:26:35] INFO monarch_executor.py:393: [actor=<root>] [MonarchExecutor] Using allocated GPUs: ['1']
  WARNING 01-29 01:26:43 [worker_base.py:301] [actor=<root>.<forge.actors.vllm.v1.forge_executor.ForgeWorkerWrapper vllm_workers{'procs': 0/1}>] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
  INFO 01-29 01:26:47 [parallel_state.py:1203] [actor=<root>.<forge.actors.vllm.v1.forge_executor.ForgeWorkerWrapper vllm_workers{'procs': 0/1}>] world_size=1 rank=0 local_rank=0 distributed_init_method=env:// backend=nccl
  [W129 01:26:47.186735869 socket.cpp:767] [c10d] The client socket has failed to connect to [train16node-master]:51391 (errno: 22 - Invalid argument).
  [W129 01:26:47.186767930 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
  [W129 01:26:47.876868742 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
  [W129 01:26:48.613986049 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
  [W129 01:26:49.280112312 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
  [W129 01:26:51.628231771 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
  [W129 01:26:54.085405747 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
  [W129 01:26:59.063519507 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
  [W129 01:27:07.065667397 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
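
The repeated "IPv4 network addresses ... cannot be retrieved" warnings look like what the resolver reports when it is asked for an IPv4 result for an IPv6 literal. The sketch below tries to show that outside of c10d using only the standard library. The helper name resolves_as_ipv4 is hypothetical, and the exact error code reported can differ between platforms and libc implementations.

```python
import socket


def resolves_as_ipv4(host, port):
    """Return True if `host` can be resolved as a numeric IPv4 address."""
    try:
        socket.getaddrinfo(
            host,
            port,
            family=socket.AF_INET,        # Ask for IPv4 results only.
            type=socket.SOCK_STREAM,
            flags=socket.AI_NUMERICHOST,  # Numeric literal only, no DNS lookup.
        )
    except socket.gaierror:
        # An IPv6 literal cannot satisfy an IPv4-only request.
        return False
    return True


if __name__ == "__main__":
    # Address and port taken from the log above.
    print(resolves_as_ipv4("fe80::222:48ff:fe49:ba90", 51391))
    print(resolves_as_ipv4("127.0.0.1", 51391))
```

If this check fails for the detected head node address, setting VLLM_HOST_IP to an IPv4 address as in the workaround should avoid the connection loop.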

Environment

git log -1
commit cd9e295c49b2a1a6e07eea2d77fa295613729638 (HEAD -> main, origin/main, origin/HEAD)
Author: Jiyue Wang <JenniferWang@users.noreply.github.com>
Date:   Wed Jan 28 16:40:10 2026 -0500

    [vllm] Upgrade vllm version to v0.13.0 (#737)

# Check core components
python -c "import torch, forge, monarch, vllm; print('All imports successful')"

# Check specific versions
python -c "
import torch
import forge
import vllm

print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'vLLM: {vllm.__version__}')
print(f'CUDA: {torch.version.cuda}')
"
All imports successful
PyTorch: 2.9.0+cu128
TorchForge: 
vLLM: 0.13.0
CUDA: 12.8

Versions

No response
