NVIDIA's LocateAnything-3B running natively on Apple Silicon — pure CoreML, no PyTorch at runtime.
Open-vocabulary object detection, visual grounding, and pointing on your Mac. Built for dataset pre-annotation, local vision tooling, and labeling workflows that need strong grounding without a PyTorch runtime.
Source code lives in this GitHub repo. Model files live on Hugging Face.
- Native Mac runtime: CoreML packages plus numpy; no PyTorch dependency at inference time.
- Flexible prompts: locate free-text categories and return boxes or points in original-image pixel coordinates.
- Useful interfaces: CLI, Python API, localhost REST server, and MCP stdio tool share the same deterministic pipeline.
- Annotation-ready exports: write JSON, COCO, YOLO, and Label Studio outputs for fast review in existing labeling tools.
- Follow along: star the repo to track benchmark updates, quantization work, and native Mac examples.
pip install locateanything-coreml
locateanything photo.jpg --categories person,car --temperature 0

Or run it without installing into an environment. The model cache is shared either way:
uvx --from locateanything-coreml locateanything photo.jpg --categories person,car
pipx install locateanything-coreml # isolated install, `locateanything` on PATH

First run downloads the model snapshot (~7.2 GB) from devin-lai/LocateAnything-3B-CoreML and caches it. Inference is deterministic by default (temperature=0); set --temperature 0.7 when you want sampled output.
from locateanything_coreml import LocateAnything
model = LocateAnything.from_pretrained()
detections = model.detect("photo.jpg", ["person", "car"])

Boxes and points are returned in original-image pixel coordinates.
Measured on an Apple M5 / 32 GB (1536×1024 input, categories person,car),
comparing post-load inference against the PyTorch MPS bf16 reference:
| Metric | CoreML Optimized | PyTorch MPS bf16 | Improvement |
|---|---|---|---|
| Post-load inference time | 11.7 s | 12.7 s | ~1.1x faster |
| Generation time | 7.64 s | 12.56 s | ~1.6x faster |
| Prefill time | 1.72 s | 7.97 s | ~4.6x faster |
| Tokens per second | 17.55 TPS | 10.35 TPS | ~1.7x higher throughput |
Model load on first call is a one-time ~35 s; packages stay warm afterwards (see the server and MCP modes for paying it once per session).
Full benchmark notes live on the Hugging Face model card.
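
The Improvement column can be re-derived from the measured values. The short sketch below does that arithmetic; the variable names and dictionary layout are illustrative only and are not part of the package.

```python
# Recompute the "Improvement" column from the measured values in the table.
# Each tuple is (CoreML optimized, PyTorch MPS bf16).
timings_s = {
    "post_load_inference": (11.7, 12.7),
    "generation": (7.64, 12.56),
    "prefill": (1.72, 7.97),
}
tokens_per_s = (17.55, 10.35)

# For durations, speedup = reference time / CoreML time.
speedups = {name: ref / coreml for name, (coreml, ref) in timings_s.items()}
# For throughput the ratio is inverted: CoreML rate / reference rate.
throughput_gain = tokens_per_s[0] / tokens_per_s[1]

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x faster")
print(f"throughput: {throughput_gain:.1f}x higher")
```

Rounded to one decimal, the ratios come out as 1.1, 1.6, 4.6, and 1.7, matching the table.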
Verified against the PyTorch (MPS, bf16) reference:
- mean IoU 0.988 (6/6 boxes) and 0.963 greedy (20/20 boxes) on the validation images;
- against an fp32 ground-truth run, total box-coordinate delta of the CoreML pipeline (11 px) is smaller than the bf16 reference's own delta (12 px) — i.e. conversion noise is below the reference's own precision noise.
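
For readers who want to run a similar comparison on their own images, here is a minimal sketch of the two metrics mentioned above. It assumes boxes are `(x1, y1, x2, y2)` tuples in pixels, that the two lists are already paired by index, and that "total box-coordinate delta" means the sum of absolute per-coordinate differences. Those are assumptions of this sketch, not a description of the project's validation script, and the sample boxes are made up.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_boxes, ref_boxes):
    """Mean IoU over boxes paired by index (assumes matching order)."""
    if not pred_boxes or len(pred_boxes) != len(ref_boxes):
        raise ValueError("box lists must be non-empty and of equal length")
    return sum(box_iou(p, r) for p, r in zip(pred_boxes, ref_boxes)) / len(pred_boxes)


def total_coordinate_delta(pred_boxes, ref_boxes):
    """Sum of absolute per-coordinate differences, in pixels."""
    return sum(
        abs(pc - rc)
        for p, r in zip(pred_boxes, ref_boxes)
        for pc, rc in zip(p, r)
    )


# Hypothetical boxes, not taken from the validation images.
reference = [(100, 100, 300, 300), (50, 60, 150, 200)]
converted = [(102, 100, 300, 298), (50, 60, 150, 200)]
print(mean_iou(converted, reference))
print(total_coordinate_delta(converted, reference))
```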
All four do open-vocabulary detection. The difference is what they need at runtime and what they give back:
| | locateanything-coreml | GroundingDINO / OWLv2 | YOLO-World |
|---|---|---|---|
| Runtime stack on a Mac | CoreML + numpy, no PyTorch | PyTorch (MPS) | PyTorch / ONNX |
| Open-vocabulary boxes | ✅ | ✅ | ✅ (prompt-tuned) |
| Pointing + visual grounding | ✅ (VLM-based) | ❌ | ❌ |
| Counting via language | ✅ | ❌ | ❌ |
| Real-time video rates | ❌ (~12 s/image) | ❌ on Mac | ✅ |
| MCP / REST / annotation exports built in | ✅ | ❌ | ❌ |
Pick YOLO-World when you need real-time speed. Pick this when you want grounding quality from a 3B VLM on a Mac without dragging in a PyTorch stack — e.g. pre-annotating datasets, adding local MCP vision, or batch-checking images with free-text categories.
- Fixed-canvas vision: every input is stretched to a 1036×1036 canvas (74×74 patch grid), so one converted vision package covers any source resolution; box decoding in 0–1000 normalized coordinates undoes the stretch for free.
- Stateful decoder: the Qwen2 36-layer decoder is converted with CoreML StateType KV-cache state; attention masks are passed as inputs, enabling the reference's hybrid MTP (parallel box decoding) / AR fallback generation outside the graph.
- fp16 with fp32 islands: packages are fp16, with RMSNorm / softmax / lm_head pinned to fp32 to match reference numerics.
- Pure-numpy generation: masks, top-p sampling, repetition penalty, and parallel box decoding are verbatim numpy ports of the reference inference code.
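
The fixed-canvas point is easier to see with numbers. Because the stretch is applied independently per axis, a coordinate expressed on a 0–1000 scale maps straight onto the original width or height, and the 1036 px canvas cancels out. The sketch below illustrates this; the function names and the `(x1, y1, x2, y2)` ordering are assumptions made for the example, not the package's internal API.

```python
import math

CANVAS = 1036           # square canvas side in pixels, from the text
GRID = 74               # patches per side, from the text
PATCH = CANVAS // GRID  # 14 px per patch


def decode_box(norm_box, image_width, image_height, scale=1000):
    """Map a 0-1000 normalized box directly to original-image pixels."""
    x1, y1, x2, y2 = norm_box
    return (
        x1 * image_width / scale,
        y1 * image_height / scale,
        x2 * image_width / scale,
        y2 * image_height / scale,
    )


def decode_box_via_canvas(norm_box, image_width, image_height, scale=1000):
    """Same mapping the long way: normalized -> canvas px -> original px."""
    sx = image_width / CANVAS   # undo the horizontal stretch
    sy = image_height / CANVAS  # undo the vertical stretch
    x1, y1, x2, y2 = (v * CANVAS / scale for v in norm_box)
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


# A 1536x1024 image, the resolution used in the benchmark above.
box = decode_box((250, 500, 750, 1000), 1536, 1024)
print(box)  # (384.0, 512.0, 1152.0, 1024.0)
```

Both functions agree up to floating-point rounding, which is why no per-image canvas bookkeeping is needed when decoding.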
locateanything INPUT --categories CATS [-o OUT.png] [--out-json OUT.json]
[--format {json,coco,yolo,labelstudio}]
[--models-dir DIR] [--repo-id REPO] [--revision REV]
[--compute-units {cpu_and_gpu,cpu_only,all}]
[--generation-mode {fast,slow,hybrid}]
[--temperature T] [--top-p P] [--repetition-penalty R]
[--seed N] [--max-new-tokens N]
locateanything serve [--port 8765] [--models-dir DIR] [--repo-id REPO] [--revision REV]
locateanything mcp [--models-dir DIR] [--repo-id REPO] [--revision REV]
--temperature 0 gives deterministic greedy decoding. --revision pins a
Hugging Face branch, tag, or commit.
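
To illustrate why `--temperature 0` is deterministic while a positive temperature is not, here is a generic sketch of greedy versus temperature plus top-p (nucleus) sampling. It is written in plain Python for brevity and is not the package's numpy implementation; `sample_token` and its parameters are hypothetical names, and repetition penalty is left out.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_p=1.0, rng=None):
    """Pick a token index from raw logits.

    temperature <= 0 -> greedy argmax (deterministic).
    temperature > 0  -> softmax sampling restricted to the top-p nucleus.
    """
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
print(sample_token(logits))  # greedy: always index 1
print(sample_token(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Passing a seeded `random.Random` plays the role that `--seed` plays on the command line: sampled output becomes repeatable for a given seed.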
pip install 'locateanything-coreml[mcp]'
locateanything mcp

Client configuration:

{
  "mcpServers": {
    "locateanything": {"command": "locateanything", "args": ["mcp"]}
  }
}

Or zero-install via uvx:

{
  "mcpServers": {
    "locateanything": {
      "command": "uvx",
      "args": ["--from", "locateanything-coreml[mcp]", "locateanything", "mcp"]
    }
  }
}

One tool is exposed: detect_objects(image_path, categories, temperature=0, save_annotated=false). It returns a structured result with detections in pixel coordinates and can optionally write the annotated image so the client can inspect the result.
The first call loads the models (~35 s); they stay warm for the session.
Pay the ~35 s model load once, then detect in seconds per request:
locateanything serve --port 8765

curl -s localhost:8765/detect \
  -d '{"image_path": "/abs/path/photo.jpg", "categories": ["person", "car"]}'

GET /openapi.json exposes the local API schema. image_b64 (base64 image bytes) works instead of image_path; "format" selects json (default) / coco / yolo / labelstudio; generation parameters (temperature, seed, ...) match the CLI. The server binds 127.0.0.1 only: it is a local single-user tool, not a production server.
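
The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call: the `/detect` path, port 8765, and the `image_path`, `categories`, and `format` fields come from the description above, while the helper names, the `Content-Type` header, and the assumption that the response body is JSON are choices of this example.

```python
import json
import urllib.request


def build_detect_request(image_path, categories, fmt="json",
                         host="127.0.0.1", port=8765):
    """Build (but do not send) a POST request for the local /detect endpoint."""
    payload = {"image_path": image_path, "categories": categories, "format": fmt}
    return urllib.request.Request(
        f"http://{host}:{port}/detect",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the server accepts an explicit JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect(image_path, categories, timeout=120, **kwargs):
    """Send the request and parse the body, assuming it is JSON."""
    request = build_detect_request(image_path, categories, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# With `locateanything serve --port 8765` running in another terminal:
#     result = detect("/abs/path/photo.jpg", ["person", "car"])
```

The generous timeout is deliberate: the first request after startup may include the one-time model load.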
locateanything photo.jpg --categories person,car --format coco
locateanything photo.jpg --categories person,car --format yolo
locateanything photo.jpg --categories person,car --format labelstudio

Label Studio output is a ready-to-import prediction (boxes and points); COCO and YOLO are boxes-only (points are skipped with a warning). YOLO writes <input>.detections.txt; the others write <input>.detections.json.
Use it to pre-annotate datasets, then correct by hand.
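
If you need to post-process detections yourself, the box conventions of the two boxes-only formats are simple to reproduce. The sketch below shows the standard COCO and YOLO box encodings starting from an `(x1, y1, x2, y2)` pixel box; the helper names are hypothetical, and the package's own exporters may differ in field layout and rounding.

```python
def to_coco_bbox(box):
    """(x1, y1, x2, y2) pixels -> COCO [x, y, width, height] in pixels."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def to_yolo_line(class_id, box, image_width, image_height):
    """(x1, y1, x2, y2) pixels -> 'class cx cy w h', each normalized to 0-1."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / image_width
    cy = (y1 + y2) / 2 / image_height
    w = (x2 - x1) / image_width
    h = (y2 - y1) / image_height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical detection on a 1536x1024 image.
box = (384, 512, 1152, 1024)
print(to_coco_bbox(box))               # [384, 512, 768, 512]
print(to_yolo_line(0, box, 1536, 1024))  # 0 0.500000 0.750000 0.500000 0.500000
```

YOLO labels use integer class indices, so free-text categories need a fixed name-to-index mapping that stays the same across the whole dataset.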
More integration notes for MCP clients, REST callers, Label Studio, CVAT, and Roboflow are in docs/INTEGRATIONS.md.
- macOS on Apple Silicon (CoreML execution)
- Python ≥ 3.10
- ~8 GB disk for the model snapshot; 32 GB RAM recommended
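
A quick way to check the first two requirements before installing is sketched below. `check_requirements` is a hypothetical helper, not something the package ships. Note that `platform.machine()` reports `x86_64` when Python itself runs under Rosetta, even on Apple Silicon hardware.

```python
import platform
import sys


def check_requirements(system=None, machine=None, version=None):
    """Return a list of unmet requirements; an empty list means all clear."""
    system = system or platform.system()
    machine = machine or platform.machine()
    version = tuple(version or sys.version_info[:2])

    problems = []
    if system != "Darwin":
        problems.append("macOS is required for CoreML execution")
    if machine != "arm64":
        problems.append("an arm64 (Apple Silicon) Python build is required")
    if version < (3, 10):
        problems.append("Python 3.10 or newer is required")
    return problems


for problem in check_requirements():
    print("unmet:", problem)
```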
- src/locateanything_coreml/: Python package, CLI, REST server, MCP server, preprocessing, generation, and export adapters.
- examples/: demo image and small client snippets.
- docs/: release and integration notes.
- Model weights are intentionally excluded from GitHub and downloaded from Hugging Face on first use.
- Video inference
- Standardized detector interface for MCP clients and annotation pipelines — MCP server, local REST server, COCO / YOLO / Label Studio exports (v0.2)
- Public OpenAPI schema for localhost integrations
- Quantized (int8/int4) decoder variants
- Batch dataset inference
- Swift / native example app
Code is open source under MIT. The CoreML model weights are not MIT licensed; they are a derivative of NVIDIA's LocateAnything-3B under the NVIDIA License for research and evaluation use only. See NOTICE.
@article{locateanything2026,
title={LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding},
journal={arXiv preprint arXiv:2605.27365},
year={2026}
}