This repository contains the reference implementation for FiNDR (Fine-grained Name Discovery via Reasoning), a fully automated framework for vocabulary-free fine-grained image recognition using reasoning-augmented Large Multi-Modal Models (LMMs).
**Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs**
Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer
Mohamed bin Zayed University of Artificial Intelligence
[arXiv] [CVPR 2026 (soon)]
FiNDR removes the need for predefined or human-curated label vocabularies by discovering, verifying, and using fine-grained semantic class names directly from unlabelled images. Our approach challenges the assumption that human-defined vocabularies represent an upper bound for fine-grained recognition performance by outperforming zero-shot classifiers that rely on ground-truth class names.
Traditional fine-grained recognition methods rely on fixed, human-defined label vocabularies, which limits scalability and robustness in open-world scenarios. FiNDR addresses this limitation by leveraging modern reasoning-capable large multi-modal models together with vision–language models to automatically induce fine-grained class names directly from visual data.
The entire pipeline operates without any predefined vocabulary, manual annotation, or supervised training.
- Introduces the first reasoning-augmented LMM framework for vocabulary-free fine-grained image recognition.
- Proposes a fully automated end-to-end pipeline that discovers, verifies, and uses semantic class names.
- Achieves state-of-the-art performance across multiple fine-grained benchmarks in the vocabulary-free setting.
- Outperforms zero-shot classifiers that rely on ground-truth class names, challenging the assumption that human-curated vocabularies represent an upper bound.
- Demonstrates that open-source LMMs, with carefully designed prompts, can match proprietary reasoning-enabled models.
Given a small unlabelled discovery set, FiNDR operates in three stages:
A reasoning-enabled LMM:
- Infers dataset-level meta-information (e.g., category, granularity, domain expertise)
- Generates fine-grained candidate class names for each image using step-by-step reasoning
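The discovery stage above can be sketched as a prompt builder. This is an illustrative sketch, not the repository's actual API: the function name, the prompt wording, and the meta-information keys (`expertise`, `category`, `granularity`) are assumptions mirroring the meta-information listed above.

```python
# Hypothetical sketch of the discovery-stage prompt (not the repo's exact code):
# the LMM first infers dataset-level meta-information, which then conditions
# a step-by-step request for fine-grained candidate class names per image.

def build_discovery_prompt(meta: dict, num_candidates: int = 3) -> str:
    """Compose a reasoning prompt for one discovery-set image."""
    return (
        f"You are an expert in {meta['expertise']}.\n"
        f"The image belongs to the broad category '{meta['category']}' "
        f"at '{meta['granularity']}' granularity.\n"
        "Reason step by step about the distinguishing visual attributes, "
        f"then propose {num_candidates} fine-grained class names.\n"
        "Answer with one name per line."
    )

# Example meta-information for a bird dataset (values are illustrative)
prompt = build_discovery_prompt(
    {"expertise": "ornithology", "category": "birds", "granularity": "species"}
)
```

The resulting string would be sent to the chosen LMM (e.g., via the OpenAI-compatible client) together with the image.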
A vision–language model (e.g., CLIP):
- Measures visual–semantic alignment between images and candidate names
- Filters and ranks candidate labels to form a refined vocabulary
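The filtering step can be illustrated with a small NumPy sketch. The embeddings below are synthetic stand-ins; in the actual pipeline they would come from CLIP's image and text encoders, and the exact ranking rule here (mean cosine alignment, top-k) is an assumption for illustration.

```python
import numpy as np

def filter_candidates(img_embs, txt_embs, names, top_k=2):
    """Rank candidate names by mean cosine alignment with the discovery
    images and keep the top_k best-aligned ones."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    sim = img @ txt.T                 # (num_images, num_names) cosine matrix
    score = sim.mean(axis=0)          # mean alignment per candidate name
    order = np.argsort(score)[::-1][:top_k]
    return [names[i] for i in order]

# Synthetic example: two discovery images, three candidate names
names = ["indigo bunting", "blue grosbeak", "lazuli bunting"]
txt_embs = np.eye(3)                               # stand-in text embeddings
img_embs = np.array([[0.9, 0.1, 0.0],
                     [0.8, 0.0, 0.2]])             # stand-in image embeddings
kept = filter_candidates(img_embs, txt_embs, names, top_k=2)
```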
Textual and visual prototypes are combined into a lightweight vision–language classifier, which is used at inference time to assign human-readable fine-grained labels to unseen images.
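A minimal sketch of this final stage, assuming a simple convex blend of the two prototype types (the blend weight `alpha` and the nearest-prototype rule are assumptions, not the paper's exact formulation):

```python
import numpy as np

def _norm(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_prototypes(text_protos, visual_protos, alpha=0.5):
    """Blend per-class textual and visual prototypes (alpha is assumed)."""
    return _norm(alpha * _norm(text_protos) + (1 - alpha) * _norm(visual_protos))

def classify(image_emb, prototypes, class_names):
    """Assign the class whose combined prototype is most similar."""
    scores = _norm(image_emb) @ prototypes.T
    return class_names[int(np.argmax(scores))]

# Toy example with two classes and 2-D embeddings
class_names = ["Abyssinian", "Bengal"]
text_protos = np.array([[1.0, 0.0], [0.0, 1.0]])
visual_protos = np.array([[0.9, 0.1], [0.1, 0.9]])
protos = build_prototypes(text_protos, visual_protos)
pred = classify(np.array([0.2, 0.8]), protos, class_names)
```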
- Install Python 3.9.16 (skip if installed already):
```sh
# installing Python with conda, but you can use any other method
conda create -n findr python=3.9.16 -y
conda activate findr
```

- Install dependencies with pip:

```sh
pip install -r envs/pip_requirements.txt
```

- Or install dependencies with Conda:

```sh
conda env create -f envs/conda_environment.yml
# or
conda create --name findr --file envs/conda_requirements.txt
```

- Activate the environment and install CLIP (not available as a conda package):

```sh
conda activate findr
pip install git+https://github.com/openai/CLIP.git
```

- Install the client library for your chosen LMM:
  a. For Qwen-VL (used in our experiments): `pip install openai`
  b. For Gemini: `pip install google-genai`
  c. For ChatGPT: `pip install openai`
  d. For other LMMs, please refer to their respective repositories for installation instructions.
For dataset download and preparation, please follow a beautifully written guide available here.
All metadata needed for the supported datasets is provided in the `data/data_stats.py` file.
The discovered class names are already provided in the `data/guessed_classnames/` directory for all supported datasets.
[Optional] To re-discover class names, run:

```sh
# For a custom dataset, update the generation config in the script
python -m data.generate_classnames
```

To perform vocabulary-free classification on supported datasets, run the corresponding evaluation scripts:

```sh
sh run/eval_birds.sh
sh run/eval_cars.sh
sh run/eval_dogs.sh
sh run/eval_flowers.sh
sh run/eval_pets.sh
```

Repository structure:

```
e-finer/
├── configs/     # Configuration files for experiments
├── data/        # Dataset loaders, preprocessing, generated in-context sentences
├── datasets/    # Fine-grained datasets
├── envs/        # Environment setup files
├── models/      # Vision-language interfaces
├── run/         # Entry-point scripts for experiments
├── utils/       # Helper utilities
└── README.md
```
If you find this work useful, please consider citing it:
```bibtex
@misc{demidov2025thinkinglabelsvocabularyfreefinegrained,
      title={Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs},
      author={Dmitry Demidov and Zaigham Zaheer and Zongyan Han and Omkar Thawakar and Rao Anwer},
      year={2025},
      eprint={2512.18897},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.18897},
}
```
For questions or collaborations:
- Dmitry Demidov – dmitry.demidov@mbzuai.ac.ae
This project builds upon and integrates ideas from CLIP, FineR, E-FineR, and recent advances in reasoning-based LMMs (Qwen, Gemini). We are thankful to the corresponding authors for making their code public.