locateanything-coreml

NVIDIA's LocateAnything-3B running natively on Apple Silicon — pure CoreML, no PyTorch at runtime.

Open-vocabulary object detection, visual grounding, and pointing on your Mac. Built for dataset pre-annotation, local vision tooling, and labeling workflows that need strong grounding without a PyTorch runtime.

Source code lives in this GitHub repo. Model files live on Hugging Face.

Highlights

  • Native Mac runtime: CoreML packages plus numpy; no PyTorch dependency at inference time.
  • Flexible prompts: locate free-text categories and return boxes or points in original-image pixel coordinates.
  • Useful interfaces: CLI, Python API, localhost REST server, and MCP stdio tool share the same deterministic pipeline.
  • Annotation-ready exports: write JSON, COCO, YOLO, and Label Studio outputs for fast review in existing labeling tools.
  • Follow along: star the repo to track benchmark updates, quantization work, and native Mac examples.

Quickstart

pip install locateanything-coreml
locateanything photo.jpg --categories person,car --temperature 0

Or run it without installing into an environment — the model cache is shared either way:

uvx --from locateanything-coreml locateanything photo.jpg --categories person,car
pipx install locateanything-coreml   # isolated install, `locateanything` on PATH

First run downloads the model snapshot (~7.2 GB) from devin-lai/LocateAnything-3B-CoreML and caches it. Inference is deterministic by default (temperature=0); set --temperature 0.7 when you want sampled output.

Python API

from locateanything_coreml import LocateAnything

model = LocateAnything.from_pretrained()
detections = model.detect("photo.jpg", ["person", "car"])

Boxes and points are returned in original-image pixel coordinates.

Performance

Measured on an Apple M5 / 32 GB (1536×1024 input, categories person,car), comparing post-load inference against the PyTorch MPS bf16 reference:

Metric                    CoreML Optimized   PyTorch MPS bf16   Improvement
Post-load inference time  11.7 s             12.7 s             ~1.1x faster
Generation time           7.64 s             12.56 s            ~1.6x faster
Prefill time              1.72 s             7.97 s             ~4.6x faster
Tokens per second         17.55 TPS          10.35 TPS          ~1.7x higher throughput

Model load on the first call is a one-time cost of about 35 s; packages stay warm afterwards (see the server and MCP modes for paying it once per session).
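The improvement column and the effect of the one-time load can be checked with a few lines of arithmetic. This is only a sketch: the numbers are copied from the table above, the helper names are made up for illustration, and the amortized figure assumes every image costs the same post-load time as the benchmark image.

```python
# Figures copied from the Performance table (seconds, tokens per second).
coreml = {"post_load_s": 11.7, "generation_s": 7.64, "prefill_s": 1.72, "tps": 17.55}
pytorch = {"post_load_s": 12.7, "generation_s": 12.56, "prefill_s": 7.97, "tps": 10.35}


def speedup(metric: str) -> float:
    """Return a ratio where values above 1 favor CoreML."""
    if metric == "tps":
        # Throughput: higher is better, so divide CoreML by PyTorch.
        return coreml[metric] / pytorch[metric]
    # Timings: lower is better, so divide PyTorch by CoreML.
    return pytorch[metric] / coreml[metric]


def amortized_seconds_per_image(n_images: int, load_s: float = 35.0) -> float:
    """Average wall time per image when the model load is paid once."""
    return (load_s + n_images * coreml["post_load_s"]) / n_images


for name in coreml:
    print(f"{name}: {speedup(name):.2f}x")
for n in (1, 10, 100):
    print(f"{n} images: {amortized_seconds_per_image(n):.2f} s/image")
```

With one image the load dominates; over a batch the per-image cost approaches the post-load inference time, which is the case for the server and MCP modes.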

Full benchmark notes live on the Hugging Face model card.

Accuracy

Verified against the PyTorch (MPS, bf16) reference:

  • mean IoU 0.988 (6/6 boxes) and 0.963 greedy (20/20 boxes) on the validation images;
  • against an fp32 ground-truth run, total box-coordinate delta of the CoreML pipeline (11 px) is smaller than the bf16 reference's own delta (12 px) — i.e. conversion noise is below the reference's own precision noise.

Why this instead of GroundingDINO / OWLv2 / YOLO-World?

All four do open-vocabulary detection. The difference is what they need at runtime and what they give back:

locateanything-coreml GroundingDINO / OWLv2 YOLO-World
Runtime stack on a Mac CoreML + numpy, no PyTorch PyTorch (MPS) PyTorch / ONNX
Open-vocabulary boxes ✅ (prompt-tuned)
Pointing + visual grounding ✅ (VLM-based)
Counting via language
Real-time video rates ❌ (~12 s/image) ❌ on Mac
MCP / REST / annotation exports built in

Pick YOLO-World when you need real-time speed. Pick this when you want grounding quality from a 3B VLM on a Mac without dragging in a PyTorch stack — e.g. pre-annotating datasets, adding local MCP vision, or batch-checking images with free-text categories.

How it works

  • Fixed-canvas vision: every input is stretched to a 1036×1036 canvas (74×74 patch grid), so one converted vision package covers any source resolution; box decoding in 0–1000 normalized coordinates undoes the stretch for free.
  • Stateful decoder: the Qwen2 36-layer decoder is converted with CoreML StateType KV-cache state; attention masks are passed as inputs, enabling the reference's hybrid MTP (parallel box decoding) / AR fallback generation outside the graph.
  • fp16 with fp32 islands: packages are fp16, with RMSNorm / softmax / lm_head pinned to fp32 to match reference numerics.
  • Pure-numpy generation: masks, top-p sampling, repetition penalty, and parallel box decoding are verbatim numpy ports of the reference inference code.
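The first bullet's claim that normalized coordinates undo the stretch "for free" can be illustrated with a short sketch. It assumes boxes come out as (x1, y1, x2, y2) on a 0–1000 scale relative to the full canvas; the function name and the coordinate order are assumptions for illustration, not the package's actual API.

```python
CANVAS = 1036            # square canvas side in pixels, from the text above
GRID = 74                # patches per side, from the text above
PATCH = CANVAS // GRID   # pixels per patch side implied by those two numbers


def decode_box(norm_box, orig_w, orig_h, scale=1000):
    """Map a 0..scale normalized box back to original-image pixels.

    The stretch to the square canvas scales x and y independently, and the
    normalized coordinates are fractions of the canvas. Multiplying each
    fraction by the original width or height therefore lands directly in
    original pixels, with no separate un-stretch step.
    """
    x1, y1, x2, y2 = norm_box
    return (
        x1 / scale * orig_w,
        y1 / scale * orig_h,
        x2 / scale * orig_w,
        y2 / scale * orig_h,
    )


# A box covering the lower-middle region of a 1536x1024 image.
print(PATCH, decode_box((250, 500, 750, 1000), 1536, 1024))
```

Going through canvas pixels first (x / 1000 * 1036, then * orig_w / 1036) gives the same result, since the canvas size cancels out.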

CLI reference

locateanything INPUT --categories CATS [-o OUT.png] [--out-json OUT.json]
               [--format {json,coco,yolo,labelstudio}]
               [--models-dir DIR] [--repo-id REPO] [--revision REV]
               [--compute-units {cpu_and_gpu,cpu_only,all}]
               [--generation-mode {fast,slow,hybrid}]
               [--temperature T] [--top-p P] [--repetition-penalty R]
               [--seed N] [--max-new-tokens N]
locateanything serve [--port 8765] [--models-dir DIR] [--repo-id REPO] [--revision REV]
locateanything mcp [--models-dir DIR] [--repo-id REPO] [--revision REV]

--temperature 0 gives deterministic greedy decoding. --revision pins a Hugging Face branch, tag, or commit.
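To make the interaction between --temperature, --top-p, and --seed concrete, here is a minimal sketch of temperature plus nucleus (top-p) sampling over one step of logits. It is not the project's generation code: it uses only the standard library rather than numpy, omits the repetition penalty, and the function name and signature are hypothetical.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_p=1.0, rng=None):
    """Pick one token index from a list of logits."""
    if temperature <= 0:
        # Temperature 0: greedy decoding, always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax with temperature; subtract the peak for numerical stability.
    peak = max(logits)
    weights = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus: keep the most likely tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens; a seeded rng makes the run repeatable.
    rng = rng if rng is not None else random.Random()
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [0.1, 2.0, 0.3, 1.5]
print(sample_token(logits))  # greedy
print(sample_token(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Passing the same seeded generator state reproduces the same sampled sequence, which is the role a seed plays when temperature is above 0.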

Use as an MCP tool

pip install 'locateanything-coreml[mcp]'
locateanything mcp

Client configuration:

{
  "mcpServers": {
    "locateanything": {"command": "locateanything", "args": ["mcp"]}
  }
}

Or zero-install via uvx:

{
  "mcpServers": {
    "locateanything": {
      "command": "uvx",
      "args": ["--from", "locateanything-coreml[mcp]", "locateanything", "mcp"]
    }
  }
}

One tool is exposed — detect_objects(image_path, categories, temperature=0, save_annotated=false) — returning a structured result with detections in pixel coordinates, and optionally writing the annotated image so the client can look at the result. The first call loads the models (~35 s); they stay warm for the session.

Run as a local server

Pay the ~35 s model load once, then detect in seconds per request:

locateanything serve --port 8765
curl -s localhost:8765/detect \
  -d '{"image_path": "/abs/path/photo.jpg", "categories": ["person", "car"]}'

GET /openapi.json exposes the local API schema. image_b64 (base64-encoded image bytes) can be sent instead of image_path; "format" selects json (default), coco, yolo, or labelstudio; generation parameters (temperature, seed, ...) match the CLI. The server binds to 127.0.0.1 only: it is a local single-user tool, not a production server.
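The same request as the curl call can be sent from Python with the standard library. This is a sketch under assumptions: the helper names are hypothetical, the Content-Type header is an addition the curl example does not send, and the response is assumed to be JSON, which may not hold for every "format" value. Check GET /openapi.json for the actual schema.

```python
import json
import urllib.request


def build_detect_request(image_path, categories, host="127.0.0.1", port=8765, **params):
    """Build a POST to /detect; extra keyword arguments go into the JSON body."""
    payload = {"image_path": image_path, "categories": list(categories), **params}
    return urllib.request.Request(
        f"http://{host}:{port}/detect",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
        method="POST",
    )


def detect(image_path, categories, timeout=120, **params):
    """Send the request to a running `locateanything serve` and parse the reply."""
    request = build_detect_request(image_path, categories, **params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# Example (requires the server to be running):
# result = detect("/abs/path/photo.jpg", ["person", "car"], temperature=0)
```

Keeping request construction separate from the network call makes the payload easy to inspect or log without a server running.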

Export for annotation tools

locateanything photo.jpg --categories person,car --format coco
locateanything photo.jpg --categories person,car --format yolo
locateanything photo.jpg --categories person,car --format labelstudio

Label Studio output is a ready-to-import prediction (boxes and points); COCO and YOLO are boxes-only (points are skipped with a warning). YOLO writes <input>.detections.txt; the others write <input>.detections.json. Use it to pre-annotate datasets, then correct by hand.
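For orientation, the box layouts that COCO and YOLO conventionally expect differ from corner-style pixel boxes. The sketch below shows the standard conversions, assuming detections are (x1, y1, x2, y2) in original-image pixels and that the class id is the category's index in the requested list. The function names are hypothetical, and the files written by --format may differ in detail.

```python
def to_coco_bbox(box):
    """COCO stores a box as [x, y, width, height] in pixels."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def to_yolo_line(box, class_id, img_w, img_h):
    """YOLO stores 'class cx cy w h', each normalized to 0..1 by image size."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


categories = ["person", "car"]
box = (384, 512, 1152, 1024)  # example box in a 1536x1024 image
print(to_coco_bbox(box))
print(to_yolo_line(box, categories.index("person"), 1536, 1024))
```

Neither layout has a slot for a single point, which is consistent with points being skipped in the COCO and YOLO exports.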

More integration notes for MCP clients, REST callers, Label Studio, CVAT, and Roboflow are in docs/INTEGRATIONS.md.

Requirements

  • macOS on Apple Silicon (CoreML execution)
  • Python ≥ 3.10
  • ~8 GB disk for the model snapshot; 32 GB RAM recommended

Project layout

  • src/locateanything_coreml/: Python package, CLI, REST server, MCP server, preprocessing, generation, and export adapters.
  • examples/: demo image and small client snippets.
  • docs/: release and integration notes.
  • Model weights are intentionally excluded from GitHub and downloaded from Hugging Face on first use.

Roadmap

  • Video inference
  • Standardized detector interface for MCP clients and annotation pipelines — MCP server, local REST server, COCO / YOLO / Label Studio exports (v0.2)
  • Public OpenAPI schema for localhost integrations
  • Quantized (int8/int4) decoder variants
  • Batch dataset inference
  • Swift / native example app

License

Code is open source under MIT. The CoreML model weights are not MIT licensed; they are a derivative of NVIDIA's LocateAnything-3B under the NVIDIA License for research and evaluation use only. See NOTICE.

Citation

@article{locateanything2026,
  title={LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding},
  journal={arXiv preprint arXiv:2605.27365},
  year={2026}
}
