Hi, I am trying to reproduce the MiniCPM-V-4.6 results reported in the README using VLMEvalKit, but my scores are noticeably lower: about 5.4 points on AI2D and about 11.0 points on MMMU.
Model:
Evaluation tool:
- VLMEvalKit
- Datasets: AI2D_TEST, MMMU_DEV_VAL
- Judge: exact_matching
- Mode: all
Command:
python run.py \
--data MMMU_DEV_VAL AI2D_TEST \
--model MiniCPM-V-4_6 \
--mode all \
--judge exact_matching \
--reuse \
--verbose
Dataset (VLMEvalKit / README name) | Observed | README
AI2D_TEST / AI2D | 78.76 | 84.2
MMMU_DEV_VAL / MMMU | 42.56 | 53.6
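To make the size of the gap explicit, here is a small sketch that subtracts my observed scores from the README numbers. It assumes the README's "AI2D" and "MMMU" rows correspond to the AI2D_TEST and MMMU_DEV_VAL splits I ran, which I have not confirmed. The function name score_gaps is only illustrative.

```python
# Scores copied from the tables above.
OBSERVED = {"AI2D_TEST": 78.76, "MMMU_DEV_VAL": 42.56}

# Assumption: README "AI2D" maps to AI2D_TEST and README "MMMU" maps to
# MMMU_DEV_VAL. The README may use a different split or protocol.
README = {"AI2D_TEST": 84.2, "MMMU_DEV_VAL": 53.6}


def score_gaps(observed: dict, reported: dict) -> dict:
    """Return reported minus observed for each dataset, rounded to 2 decimals."""
    return {name: round(reported[name] - observed[name], 2) for name in observed}


if __name__ == "__main__":
    for name, gap in score_gaps(OBSERVED, README).items():
        print(f"{name}: {gap:.2f} points below the README")
    # AI2D_TEST: 5.44 points below the README
    # MMMU_DEV_VAL: 11.04 points below the README
```

The MMMU gap is roughly twice the AI2D gap, so the mismatch may not be the same for CoT and non-CoT datasets.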
I see this warning during generation: Setting pad_token_id to eos_token_id:248044 for open-end generation.
I am not using an official MiniCPM-V-4.6 adapter from upstream VLMEvalKit. I added a local MiniCPM_V_4_6 adapter in vlmeval/vlm/minicpm_v.py and registered it as MiniCPM-V-4_6. It inherits from MiniCPM_V_4 for prompt building, CoT, and answer extraction, and only changes loading and generation to use AutoModelForImageTextToText and AutoProcessor for openbmb/MiniCPM-V-4.6.
Parameters:
downsample_mode = "16x"
max_slice_nums = 36
num_beams = 3
do_sample = False
max_new_tokens = 2048 for CoT datasets
max_new_tokens = 1024 for non-CoT datasets
add_generation_prompt = True
processor_kwargs.images_kwargs.downsample_mode = "16x"
processor_kwargs.images_kwargs.max_slice_nums = 36
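For clarity, the settings above can be grouped as in the sketch below. This is not the exact code from my adapter: the helper names (build_processor_kwargs, build_generation_kwargs, CHAT_TEMPLATE_KWARGS) are hypothetical and only illustrate which value goes where. The values themselves are the ones listed above.

```python
from typing import Any, Dict

# Passed when applying the chat template (value taken from the list above).
CHAT_TEMPLATE_KWARGS: Dict[str, Any] = {"add_generation_prompt": True}


def build_processor_kwargs() -> Dict[str, Any]:
    """Image-side settings, nested as processor_kwargs.images_kwargs.*"""
    return {
        "images_kwargs": {
            "downsample_mode": "16x",
            "max_slice_nums": 36,
        }
    }


def build_generation_kwargs(use_cot: bool) -> Dict[str, Any]:
    """Decoding settings; only the token budget depends on whether CoT is used."""
    return {
        "num_beams": 3,
        "do_sample": False,
        "max_new_tokens": 2048 if use_cot else 1024,
    }


if __name__ == "__main__":
    print(build_processor_kwargs())
    print(build_generation_kwargs(use_cot=True))
    print(build_generation_kwargs(use_cot=False))
```

If the official protocol uses different values for any of these (for example greedy decoding instead of num_beams = 3, or a different max_slice_nums), that would be useful to know.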
Could you please help check whether my evaluation setup is correct?
If there is any mismatch with the official evaluation protocol, could you suggest the recommended way to reproduce the reported AI2D/MMMU results for MiniCPM-V-4.6?
Please let me know if additional logs, prediction files, or environment details would be useful.