Reproduction gap for MiniCPM-V-4.6 on AI2D_TEST and MMMU_DEV_VAL with VLMEvalKit #1115

Description

@github-haoyan

Hi, I am trying to reproduce the MiniCPM-V-4.6 results reported in the README using VLMEvalKit, but my scores on both benchmarks are well below the reported ones.

Model:

  • openbmb/MiniCPM-V-4.6

Evaluation tool:

  • VLMEvalKit
  • Datasets: AI2D_TEST, MMMU_DEV_VAL
  • Judge: exact_matching
  • Mode: all

Command:

python run.py \
  --data MMMU_DEV_VAL AI2D_TEST \
  --model MiniCPM-V-4_6 \
  --mode all \
  --judge exact_matching \
  --reuse \
  --verbose

Dataset | Observed (VLMEvalKit) | README
AI2D_TEST (README: AI2D) | 78.76 | 84.2
MMMU_DEV_VAL (README: MMMU) | 42.56 | 53.6
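
To make the size of the gap explicit, here is a minimal sketch that subtracts the observed scores from the README scores above. It assumes the README's AI2D and MMMU entries correspond to the AI2D_TEST and MMMU_DEV_VAL splits, which is not confirmed here; the dictionary names are only illustrative.

```python
# Scores copied from the numbers reported above.
observed = {"AI2D_TEST": 78.76, "MMMU_DEV_VAL": 42.56}
# Assumption: README "AI2D" and "MMMU" map to these two VLMEvalKit splits.
readme = {"AI2D_TEST": 84.2, "MMMU_DEV_VAL": 53.6}

# Absolute gap in points, rounded to two decimals to hide float noise.
gaps = {name: round(readme[name] - observed[name], 2) for name in observed}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points below the README")
```

The MMMU gap is roughly twice the AI2D gap, which may hint at a difference in the CoT or answer-extraction path rather than in image preprocessing alone.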

During generation I also see this warning: Setting pad_token_id to eos_token_id:248044 for open-end generation.

I am not using an official MiniCPM-V-4.6 adapter from upstream VLMEvalKit. Instead, I added a local MiniCPM_V_4_6 adapter in vlmeval/vlm/minicpm_v.py and registered it as MiniCPM-V-4_6. It inherits from MiniCPM_V_4 for prompt building, CoT, and answer extraction, and changes only loading and generation, which use AutoModelForImageTextToText and AutoProcessor with openbmb/MiniCPM-V-4.6.

Parameters:
downsample_mode = "16x"
max_slice_nums = 36
num_beams = 3
do_sample = False
max_new_tokens = 2048 for CoT datasets
max_new_tokens = 1024 for non-CoT datasets
add_generation_prompt = True
processor_kwargs.images_kwargs.downsample_mode = "16x"
processor_kwargs.images_kwargs.max_slice_nums = 36
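
For reference, a minimal sketch of how the decoding settings above could be collected into keyword arguments for generation. The helper name build_generation_kwargs and the is_cot flag are hypothetical and not part of VLMEvalKit or the local adapter. Passing pad_token_id explicitly is the usual way to avoid the pad_token_id warning in transformers.

```python
from typing import Any, Dict


def build_generation_kwargs(is_cot: bool, eos_token_id: int) -> Dict[str, Any]:
    """Collect the decoding settings listed above into one dict (hypothetical helper)."""
    return {
        "num_beams": 3,
        "do_sample": False,
        # 2048 new tokens for CoT datasets, 1024 for non-CoT datasets.
        "max_new_tokens": 2048 if is_cot else 1024,
        # Setting pad_token_id explicitly avoids the
        # "Setting pad_token_id to eos_token_id" warning.
        "pad_token_id": eos_token_id,
    }


# 248044 is the eos_token_id printed in the warning quoted in this issue.
kwargs = build_generation_kwargs(is_cot=True, eos_token_id=248044)
print(kwargs)
```

The resulting dict would typically be unpacked into the model's generate() call alongside the processor outputs.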

Could you please help check whether my evaluation setup is correct?

If there is any mismatch with the official evaluation protocol, could you suggest the recommended way to reproduce the reported AI2D/MMMU results for MiniCPM-V-4.6?

Please let me know if additional logs, prediction files, or environment details would be useful.
