Wenting Chen1,* Shengyuan Liu2,* Boyun Zheng2 Jipeng Zhang3 Wenxuan Wang3 Dejun Fan4 Raymond Shing Yan Tang5 Yuen Tung Lam6 Shannon Melissa Chan7 Lei Xing1 Jiancong Hu4,† Yixuan Yuan2,†
1 Department of Radiation Oncology, Stanford University, CA, USA
2 Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
3 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
4 The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
5 Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong SAR, China
6 The Nethersole School of Nursing, The Chinese University of Hong Kong, Hong Kong SAR, China
7 Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China
* These authors contributed equally.
† Correspondence to Yixuan Yuan and Jiancong Hu.
Gastrointestinal diseases affect 2.86 billion people globally, with capsule endoscopy (CE) providing crucial diagnostics but requiring manual review of over 60,000 frames per examination, a process associated with a 17.4% disease miss rate. While artificial intelligence shows promise for CE analysis, existing endoscopic vision-language models (VLMs) lack multi-video understanding capability and cannot replicate the systematic multi-evidence reasoning by which gastroenterologists integrate findings across anatomical regions to synthesize cohesive diagnoses. Here we introduce CE-R1, an adaptive foundation model with evidence-based clinical reasoning capabilities specifically designed for gastroenterology. CE-R1 incorporates a dynamic router that assesses query complexity and selectively routes cases to either a lightweight model for straightforward questions or a deep reasoning model that generates transparent, step-by-step diagnostic thought processes. To enable this capability, we construct CE-Bench, the first large-scale multimodal CE dataset, comprising 502,066 visual question-answering pairs with chain-of-thought reasoning annotations, spanning 70 fine-grained clinical sub-tasks across five core diagnostic categories: anatomy identification, endoscopic findings recognition, disease diagnosis, treatment planning, and medical report generation. Comprehensive evaluation on both in-distribution and out-of-distribution datasets from four independent hospitals demonstrates that CE-R1 achieves 86.7% overall accuracy, substantially outperforming state-of-the-art VLMs (best baseline: 24.6%) and surpassing average physician performance (39.9%) by 21.1%. CE-R1 maintains superior generalization across external validation sets (65.1–81.9% accuracy).
Critically, the multi-evidence clinical reasoning capability delivers substantial performance gains on complex diagnostic tasks: CE-R1 surpasses its non-reasoning counterpart by 8.5% in disease diagnosis, demonstrating the clinical value of transparent, step-by-step diagnostic processes. These results establish CE-R1 as a robust foundation model for comprehensive CE analysis with immediate applications in clinical decision support and medical education.
Install the requirements:

```shell
conda env create -f environment.yml
pip install git+https://github.com/huggingface/transformers.git@v4.49.0
pip install -e .
pip install -e ".[torch,metrics]"
```

Please download the public datasets and our pre-trained models from this repository (https://huggingface.co/datasets/Valentina007/CE_R1_data/).
Please make sure this folder (CE_R1_data) is under the same directory as the current folder (CE_R1):

```shell
hf download Valentina007/CE_R1_data
```

Directory structure of this folder (CE_R1_data):
./anno and ./data contain part of the data in CE-Bench, including the public kid-v1, kid-v2, and kvasir-capsule datasets.
./models contains the pre-trained models of CE-R1.
```
├── anno
│   ├── kid-v1-image_test.json
│   ├── kid-v2-image_test.json
│   ├── kvasir-capsule-image_test.json
│   └── kvasir-capsule-videoclip_test.json
├── data
│   ├── kid-dataset-1
│   ├── kid-dataset-2
│   ├── kvasir-capsule-labelled_images
│   └── video_clips_v1
└── models
    ├── deep
    ├── lite
    └── router_models
```
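As a quick sanity check before running any tests, you can verify that the downloaded folder matches the layout above. This is a minimal sketch we provide for convenience; the helper `missing_entries` is not part of the released scripts.

```python
from pathlib import Path

# Expected entries under CE_R1_data, taken from the directory tree above.
EXPECTED = [
    "anno/kid-v1-image_test.json",
    "anno/kid-v2-image_test.json",
    "anno/kvasir-capsule-image_test.json",
    "anno/kvasir-capsule-videoclip_test.json",
    "data/kid-dataset-1",
    "data/kid-dataset-2",
    "data/kvasir-capsule-labelled_images",
    "data/video_clips_v1",
    "models/deep",
    "models/lite",
    "models/router_models",
]

def missing_entries(root):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    # CE_R1_data is expected to sit next to the CE_R1 folder.
    missing = missing_entries("../CE_R1_data")
    if missing:
        print("Missing entries:", missing)
    else:
        print("CE_R1_data layout looks complete.")
```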
```shell
INPUT_PATH_IMG="/path/to/input_image.png"
QUESTION_IMG="Your question can be put here."
python test_single.py --path "$INPUT_PATH_IMG" --question "$QUESTION_IMG"
```
All the results will be saved at ./results/model_output.
In ./results/model_output/final_results.json, you will get output in the following form:
```json
{
  "input_path": "/path/to/input_image.png",
  "question": "Your question",
  "probability": 0.3223,
  "model_version": "lite",
  "model_type": "lite",
  "media_type": "image",
  "generated_response": "Final output from CE-R1"
}
```

This result contains the input to CE-R1, the probability from the router, the model type used, and the output of CE-R1. When the probability from the router is larger than 0.5, CE-R1-Deep is used; otherwise, CE-R1-Lite is used.
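The routing rule itself is a simple threshold on the router's score. A minimal sketch (the function name `select_model` is ours, not from the released code):

```python
def select_model(probability):
    """Route a query based on the router's complexity score.

    Scores above 0.5 go to the deep reasoning model (CE-R1-Deep);
    everything else is handled by the lightweight CE-R1-Lite.
    """
    return "deep" if probability > 0.5 else "lite"

# The example output above had probability 0.3223, hence the lite model.
print(select_model(0.3223))  # lite
```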
Here, we provide an example with a WCE image or video as input:

```shell
sh ./lanuch/test_img_single.sh
```
Please contact me if you have any questions (wentchen AT stanford dot edu).

