This repository contains the reference implementation for FiNDR (Fine-grained Name Discovery via Reasoning), a fully automated framework for vocabulary-free fine-grained image recognition using reasoning-augmented Large Multi-Modal Models (LMMs).
**Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs**
Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer
Mohamed bin Zayed University of Artificial Intelligence
[arXiv] [CVPR 2026 (soon)]
FiNDR removes the need for predefined or human-curated label vocabularies by discovering, verifying, and using fine-grained semantic class names directly from unlabelled images. Our approach challenges the assumption that human-defined vocabularies represent an upper bound for fine-grained recognition performance by outperforming zero-shot classifiers that rely on ground-truth class names.
Traditional fine-grained recognition methods rely on fixed, human-defined label vocabularies, which limits scalability and robustness in open-world scenarios. FiNDR addresses this limitation by leveraging modern reasoning-capable large multi-modal models together with vision–language models to automatically induce fine-grained class names directly from visual data.
The entire pipeline operates without any predefined vocabulary, manual annotation, or supervised training.
- Introduces the first reasoning-augmented LMM framework for vocabulary-free fine-grained image recognition.
- Proposes a fully automated end-to-end pipeline that discovers, verifies, and uses semantic class names.
- Achieves state-of-the-art performance across multiple fine-grained benchmarks in the vocabulary-free setting.
- Outperforms zero-shot classifiers that rely on ground-truth class names, challenging the assumption that human-curated vocabularies represent an upper bound.
- Demonstrates that open-source LMMs, with carefully designed prompts, can match proprietary reasoning-enabled models.
Given a small unlabelled discovery set, FiNDR operates in three stages:
A reasoning-enabled LMM:
- Infers dataset-level meta-information (e.g., category, granularity, domain expertise)
- Generates fine-grained candidate class names for each image using step-by-step reasoning
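The discovery stage above can be sketched as a prompt builder. This is an illustrative sketch, not the repository's actual API: the function name, the prompt wording, and the meta-information keys (`expertise`, `category`, `granularity`) are assumptions mirroring the meta-information listed above.

```python
# Hypothetical sketch of the discovery-stage prompt (not the repo's exact code):
# the LMM first infers dataset-level meta-information, which then conditions
# a step-by-step request for fine-grained candidate class names per image.

def build_discovery_prompt(meta: dict, num_candidates: int = 3) -> str:
    """Compose a reasoning prompt for one discovery-set image."""
    return (
        f"You are an expert in {meta['expertise']}.\n"
        f"The image belongs to the broad category '{meta['category']}' "
        f"at '{meta['granularity']}' granularity.\n"
        "Reason step by step about the distinguishing visual attributes, "
        f"then propose {num_candidates} fine-grained class names.\n"
        "Answer with one name per line."
    )

# Example meta-information for a bird dataset (values are illustrative)
prompt = build_discovery_prompt(
    {"expertise": "ornithology", "category": "birds", "granularity": "species"}
)
```

The resulting string would be sent to the chosen LMM (e.g., via the OpenAI-compatible client) together with the image.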
A vision–language model (e.g., CLIP):
- Measures visual–semantic alignment between images and candidate names
- Filters and ranks candidate labels to form a refined vocabulary
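The filtering step can be illustrated with a small NumPy sketch. The embeddings below are synthetic stand-ins; in the actual pipeline they would come from CLIP's image and text encoders, and the exact ranking rule here (mean cosine alignment, top-k) is an assumption for illustration.

```python
import numpy as np

def filter_candidates(img_embs, txt_embs, names, top_k=2):
    """Rank candidate names by mean cosine alignment with the discovery
    images and keep the top_k best-aligned ones."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    sim = img @ txt.T                 # (num_images, num_names) cosine matrix
    score = sim.mean(axis=0)          # mean alignment per candidate name
    order = np.argsort(score)[::-1][:top_k]
    return [names[i] for i in order]

# Synthetic example: two discovery images, three candidate names
names = ["indigo bunting", "blue grosbeak", "lazuli bunting"]
txt_embs = np.eye(3)                               # stand-in text embeddings
img_embs = np.array([[0.9, 0.1, 0.0],
                     [0.8, 0.0, 0.2]])             # stand-in image embeddings
kept = filter_candidates(img_embs, txt_embs, names, top_k=2)
```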
Textual and visual prototypes are combined into a lightweight vision–language classifier, which is used at inference time to assign human-readable fine-grained labels to unseen images.
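A minimal sketch of this final stage, assuming a simple convex blend of the two prototype types (the blend weight `alpha` and the nearest-prototype rule are assumptions, not the paper's exact formulation):

```python
import numpy as np

def _norm(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_prototypes(text_protos, visual_protos, alpha=0.5):
    """Blend per-class textual and visual prototypes (alpha is assumed)."""
    return _norm(alpha * _norm(text_protos) + (1 - alpha) * _norm(visual_protos))

def classify(image_emb, prototypes, class_names):
    """Assign the class whose combined prototype is most similar."""
    scores = _norm(image_emb) @ prototypes.T
    return class_names[int(np.argmax(scores))]

# Toy example with two classes and 2-D embeddings
class_names = ["Abyssinian", "Bengal"]
text_protos = np.array([[1.0, 0.0], [0.0, 1.0]])
visual_protos = np.array([[0.9, 0.1], [0.1, 0.9]])
protos = build_prototypes(text_protos, visual_protos)
pred = classify(np.array([0.2, 0.8]), protos, class_names)
```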
- Install Python 3.9.16 (skip if installed already):
```sh
# installing Python with conda, but you can use any other method
conda create -n findr python=3.9.16 -y
conda activate findr
```

- Install dependencies with pip:

```sh
pip install -r envs/pip_requirements.txt
```

- Or install dependencies with Conda:

```sh
conda env create -f envs/conda_environment.yml
# or
conda create --name findr --file envs/conda_requirements.txt
```

- Activate the environment and install CLIP (not available as a conda package):

```sh
conda activate findr
pip install git+https://github.com/openai/CLIP.git
```

- Install the client library for your chosen LMM:
  a. For Qwen-VL (used in our experiments): `pip install openai`
  b. For Gemini: `pip install google-genai`
  c. For ChatGPT: `pip install openai`
  d. For other LMMs, please refer to their respective repositories for installation instructions.
For dataset download and preparation, please follow a beautifully written guide available here.
All metadata needed for the supported datasets is provided in the `data/data_stats.py` file.
The discovered class names are already provided in the `data/guessed_classnames/` directory for all supported datasets.
[Optional] To re-discover class names, run:

```sh
# For a custom dataset, update the generation config in the script
python -m data.generate_classnames
```

To perform vocabulary-free classification on supported datasets, run the corresponding evaluation scripts:

```sh
sh run/eval_birds.sh
sh run/eval_cars.sh
sh run/eval_dogs.sh
sh run/eval_flowers.sh
sh run/eval_pets.sh
```

Repository structure:

```
e-finer/
├── configs/     # Configuration files for experiments
├── data/        # Dataset loaders, preprocessing, generated in-context sentences
├── datasets/    # Fine-grained datasets
├── envs/        # Environment setup files
├── models/      # Vision-language interfaces
├── run/         # Entry-point scripts for experiments
├── utils/       # Helper utilities
└── README.md
```
If you find this work useful, please consider citing it:
```bibtex
@misc{demidov2025thinkinglabelsvocabularyfreefinegrained,
      title={Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs},
      author={Dmitry Demidov and Zaigham Zaheer and Zongyan Han and Omkar Thawakar and Rao Anwer},
      year={2025},
      eprint={2512.18897},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.18897},
}
```
For questions or collaborations:
- Dmitry Demidov – dmitry.demidov@mbzuai.ac.ae
This project builds upon and integrates ideas from CLIP, FineR, E-FineR, and recent advances in reasoning-based LMMs (Qwen, Gemini). We are thankful to the corresponding authors for making their code public.