
## Dataset Overview

This dataset consists of three main subsets:

| Dataset | Description | Path |
| --- | --- | --- |
| `full` | The standard dataset used for all benchmarks in the paper. | `data-full` |
| `sample` | A smaller subset sampled from `full` for quick testing. | `data-sample` |
| `visualize` | A subset of `full` used for error visualization. | `data-visualize` |

The data-full directory contains the following subsets:

| Category | Subset | Path |
| --- | --- | --- |
| Reasoning | code | `data-full/code.jsonl` |
| Reasoning | math | `data-full/math.jsonl` |
| Knowledge | law | `data-full/law.jsonl` |
| Knowledge | medicine | `data-full/medicine.jsonl` |
| Knowledge | business | `data-full/business.jsonl` |
| Non-English | french | `data-full/french.jsonl` |
| Non-English | japanese | `data-full/japanese.jsonl` |
| Non-English | chinese | `data-full/chinese.jsonl` |
| Skills | translation (en->zh) | `data-full/en2zh.jsonl` |
| Skills | translation (zh->en) | `data-full/zh2en.jsonl` |
| Skills | summarization | `data-full/sum.jsonl` |
| Long-Context | 8k ctx | `data-full/long8.jsonl` |
| Long-Context | 16k ctx | `data-full/long16.jsonl` |
| Long-Context | 24k ctx | `data-full/long24.jsonl` |
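
Each subset is stored as JSON Lines, one record per line. A minimal loading sketch is below; the `question`/`answer` field names and the demo file are assumptions for illustration — check the actual `.jsonl` files for the exact schema.

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, one per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Throwaway file standing in for a real subset such as data-full/math.jsonl
Path("demo.jsonl").write_text(
    '{"question": "What is 2+2?", "answer": "4"}\n'
    '{"question": "What is 3*3?", "answer": "9"}\n',
    encoding="utf-8",
)
records = load_jsonl("demo.jsonl")
print(len(records))  # 2
```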

## Dataset Construction

This section describes how to build the dataset from scratch.

1. Download the original data from Hugging Face:

   ```bash
   huggingface-cli download RyokoAI/ShareGPT52K --repo-type dataset --local-dir share-gpt
   ```

   This downloads `sg_90k_part1.json` and `sg_90k_part2.json`.

2. Modify the `DATA_PATH` variable in `tools/clean.py` and `tools/clean_for_longctx.py` to point to the downloaded JSON files.

3. Register for an API key on the DeepSeek official website and export it as an environment variable:

   ```bash
   export DEEPSEEK_API_KEY="your_api_key"
   ```

4. Run the data processing scripts:

   ```bash
   python tools/clean.py
   python tools/clean_for_longctx.py
   ```
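
A quick preflight check before step 4 can catch a missing API key or a misplaced download. This is a sketch, not part of the repository's tooling; the file paths assume the `share-gpt` directory from step 1.

```python
import os
from pathlib import Path

def preflight(data_files, env_var="DEEPSEEK_API_KEY"):
    """Return a list of problems; an empty list means the scripts can run."""
    problems = []
    if not os.environ.get(env_var):
        problems.append(f"{env_var} is not set")
    problems += [f"missing file: {p}" for p in data_files if not Path(p).exists()]
    return problems

# Paths from step 1; adjust if you downloaded to a different directory.
issues = preflight(["share-gpt/sg_90k_part1.json", "share-gpt/sg_90k_part2.json"])
for issue in issues:
    print("warning:", issue)
```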

After the automated construction, we manually deduplicated the data and curated a selection of high-quality question-answer pairs for each task.
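
The deduplication itself was manual; purely as an illustration, exact duplicates can be flagged first with a normalized-hash pass like the one below (the `question` field name is an assumption about the record schema).

```python
import hashlib

def dedup_exact(records, key="question"):
    """Keep only the first record for each normalized (lowercased,
    whitespace-collapsed) value of `key`."""
    seen, kept = set(), []
    for rec in records:
        normalized = " ".join(rec[key].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept

data = [
    {"question": "What is 2+2?"},
    {"question": "what is  2+2?"},  # duplicate after normalization
    {"question": "What is 3+3?"},
]
print(len(dedup_exact(data)))  # 2
```

Near-duplicates (paraphrases) would still need human review, which is why a hash pass can only assist, not replace, the manual curation described above.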