locateanything-coreml

NVIDIA's LocateAnything-3B running natively on Apple Silicon — pure CoreML, no PyTorch at runtime.

Open-vocabulary object detection, visual grounding, and pointing on your Mac. Built for dataset pre-annotation, local vision tooling, and labeling workflows that need strong grounding without a PyTorch runtime.

Source code lives in this GitHub repo. Model files live on Hugging Face.

Highlights

Native Mac runtime: CoreML packages plus numpy; no PyTorch dependency at inference time.
Flexible prompts: locate free-text categories and return boxes or points in original-image pixel coordinates.
Useful interfaces: CLI, Python API, localhost REST server, and MCP stdio tool share the same deterministic pipeline.
Annotation-ready exports: write JSON, COCO, YOLO, and Label Studio outputs for fast review in existing labeling tools.
Follow along: star the repo to track benchmark updates, quantization work, and native Mac examples.

Quickstart

pip install locateanything-coreml
locateanything photo.jpg --categories person,car --temperature 0

Or run it without installing into an environment — the model cache is shared either way:

uvx --from locateanything-coreml locateanything photo.jpg --categories person,car
pipx install locateanything-coreml   # isolated install, `locateanything` on PATH

First run downloads the model snapshot (~7.2 GB) from devin-lai/LocateAnything-3B-CoreML and caches it. Inference is deterministic by default (temperature=0); set --temperature 0.7 when you want sampled output.

Python API

from locateanything_coreml import LocateAnything

model = LocateAnything.from_pretrained()
detections = model.detect("photo.jpg", ["person", "car"])

Boxes and points are returned in original-image pixel coordinates.

Performance

Measured on an Apple M5 / 32 GB (1536×1024 input, categories person,car), comparing post-load inference against the PyTorch MPS bf16 reference:

Metric	CoreML Optimized	PyTorch MPS bf16	Improvement
Post-load inference time	11.7 s	12.7 s	~1.1x faster
Generation time	7.64 s	12.56 s	~1.6x faster
Prefill time	1.72 s	7.97 s	~4.6x faster
Tokens per second	17.55 TPS	10.35 TPS	~1.7x higher throughput

Model load on first call is a one-time ~35 s; packages stay warm afterwards (see the server and MCP modes for paying it once per session).

Full benchmark notes live on the Hugging Face model card.

Accuracy

Verified against the PyTorch (MPS, bf16) reference:

mean IoU 0.988 (6/6 boxes) and 0.963 greedy (20/20 boxes) on the validation images;
against an fp32 ground-truth run, total box-coordinate delta of the CoreML pipeline (11 px) is smaller than the bf16 reference's own delta (12 px) — i.e. conversion noise is below the reference's own precision noise.

Why this instead of GroundingDINO / OWLv2 / YOLO-World?

All four do open-vocabulary detection. The difference is what they need at runtime and what they give back:

	locateanything-coreml	GroundingDINO / OWLv2	YOLO-World
Runtime stack on a Mac	CoreML + numpy, no PyTorch	PyTorch (MPS)	PyTorch / ONNX
Open-vocabulary boxes	✅	✅	✅ (prompt-tuned)
Pointing + visual grounding	✅ (VLM-based)	❌	❌
Counting via language	✅	❌	❌
Real-time video rates	❌ (~12 s/image)	❌ on Mac	✅
MCP / REST / annotation exports built in	✅	❌	❌

Pick YOLO-World when you need real-time speed. Pick this when you want grounding quality from a 3B VLM on a Mac without dragging in a PyTorch stack — e.g. pre-annotating datasets, adding local MCP vision, or batch-checking images with free-text categories.

How it works

Fixed-canvas vision: every input is stretched to a 1036×1036 canvas (74×74 patch grid), so one converted vision package covers any source resolution; box decoding in 0–1000 normalized coordinates undoes the stretch for free.
Stateful decoder: the Qwen2 36-layer decoder is converted with CoreML StateType KV-cache state; attention masks are passed as inputs, enabling the reference's hybrid MTP (parallel box decoding) / AR fallback generation outside the graph.
fp16 with fp32 islands: packages are fp16, with RMSNorm / softmax / lm_head pinned to fp32 to match reference numerics.
Pure-numpy generation: masks, top-p sampling, repetition penalty, and parallel box decoding are verbatim numpy ports of the reference inference code.

CLI reference

locateanything INPUT --categories CATS [-o OUT.png] [--out-json OUT.json]
               [--format {json,coco,yolo,labelstudio}]
               [--models-dir DIR] [--repo-id REPO] [--revision REV]
               [--compute-units {cpu_and_gpu,cpu_only,all}]
               [--generation-mode {fast,slow,hybrid}]
               [--temperature T] [--top-p P] [--repetition-penalty R]
               [--seed N] [--max-new-tokens N]
locateanything serve [--port 8765] [--models-dir DIR] [--repo-id REPO] [--revision REV]
locateanything mcp [--models-dir DIR] [--repo-id REPO] [--revision REV]

--temperature 0 gives deterministic greedy decoding. --revision pins a Hugging Face branch, tag, or commit.

Use as an MCP tool

pip install 'locateanything-coreml[mcp]'
locateanything mcp

Client configuration:

{
  "mcpServers": {
    "locateanything": {"command": "locateanything", "args": ["mcp"]}
  }
}

Or zero-install via uvx:

{
  "mcpServers": {
    "locateanything": {
      "command": "uvx",
      "args": ["--from", "locateanything-coreml[mcp]", "locateanything", "mcp"]
    }
  }
}

One tool is exposed — detect_objects(image_path, categories, temperature=0, save_annotated=false) — returning a structured result with detections in pixel coordinates, and optionally writing the annotated image so the client can look at the result. The first call loads the models (~35 s); they stay warm for the session.

Run as a local server

Pay the ~35 s model load once, then detect in seconds per request:

locateanything serve --port 8765

curl -s localhost:8765/detect \
  -d '{"image_path": "/abs/path/photo.jpg", "categories": ["person", "car"]}'

GET /openapi.json exposes the local API schema. image_b64 (base64 image bytes) works instead of image_path; "format" selects json (default) / coco / yolo / labelstudio; generation parameters (temperature, seed, ...) match the CLI. Binds 127.0.0.1 only — a local single-user tool, not a production server.

Export for annotation tools

locateanything photo.jpg --categories person,car --format coco
locateanything photo.jpg --categories person,car --format yolo
locateanything photo.jpg --categories person,car --format labelstudio

Label Studio output is a ready-to-import prediction (boxes and points); COCO and YOLO are boxes-only (points are skipped with a warning). YOLO writes <input>.detections.txt; the others write <input>.detections.json. Use it to pre-annotate datasets, then correct by hand.

More integration notes for MCP clients, REST callers, Label Studio, CVAT, and Roboflow are in docs/INTEGRATIONS.md.

Requirements

macOS on Apple Silicon (CoreML execution)
Python ≥ 3.10
~8 GB disk for the model snapshot; 32 GB RAM recommended

Project layout

src/locateanything_coreml/: Python package, CLI, REST server, MCP server, preprocessing, generation, and export adapters.
examples/: demo image and small client snippets.
docs/: release and integration notes.
Model weights are intentionally excluded from GitHub and downloaded from Hugging Face on first use.

Roadmap

Video inference
Standardized detector interface for MCP clients and annotation pipelines — MCP server, local REST server, COCO / YOLO / Label Studio exports (v0.2)
Public OpenAPI schema for localhost integrations
Quantized (int8/int4) decoder variants
Batch dataset inference
Swift / native example app

License

Code is open source under MIT. The CoreML model weights are not MIT licensed; they are a derivative of NVIDIA's LocateAnything-3B under the NVIDIA License for research and evaluation use only. See NOTICE.

Citation

@article{locateanything2026,
  title={LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding},
  journal={arXiv preprint arXiv:2605.27365},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.github		.github
docs		docs
examples		examples
src/locateanything_coreml		src/locateanything_coreml
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

locateanything-coreml

Highlights

Quickstart

Python API

Performance

Accuracy

Why this instead of GroundingDINO / OWLv2 / YOLO-World?

How it works

CLI reference

Use as an MCP tool

Run as a local server

Export for annotation tools

Requirements

Project layout

Roadmap

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

locateanything-coreml

Highlights

Quickstart

Python API

Performance

Accuracy

Why this instead of GroundingDINO / OWLv2 / YOLO-World?

How it works

CLI reference

Use as an MCP tool

Run as a local server

Export for annotation tools

Requirements

Project layout

Roadmap

License

Citation

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages