This dataset consists of three main subsets:
| Dataset | Description | Path |
|---|---|---|
| full | The standard dataset used for all benchmarks in the paper. | data-full |
| sample | A smaller subset sampled from full for quick testing. | data-sample |
| visualize | A subset from full used for error visualization. | data-visualize |

The data-full directory contains the following subsets:
| Category | Subset | Path |
|---|---|---|
| Reasoning | code | data-full/code.jsonl |
| Reasoning | math | data-full/math.jsonl |
| Knowledge | law | data-full/law.jsonl |
| Knowledge | medicine | data-full/medicine.jsonl |
| Knowledge | business | data-full/business.jsonl |
| Non-English | french | data-full/french.jsonl |
| Non-English | japanese | data-full/japanese.jsonl |
| Non-English | chinese | data-full/chinese.jsonl |
| Skills | translation (en->zh) | data-full/en2zh.jsonl |
| Skills | translation (zh->en) | data-full/zh2en.jsonl |
| Skills | summarization | data-full/sum.jsonl |
| Long-Context | 8k ctx | data-full/long8.jsonl |
| Long-Context | 16k ctx | data-full/long16.jsonl |
| Long-Context | 24k ctx | data-full/long24.jsonl |
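
Each subset above is a JSON Lines file, so it can be read one record per line. The following is a minimal sketch, not part of the repository: the helper names `load_jsonl` and `subset_path` are hypothetical, and no assumption is made about the fields inside each record.

```python
import json
from pathlib import Path


def subset_path(subset, root="data-full"):
    """Build the path of a subset file, e.g. 'math' -> data-full/math.jsonl."""
    return Path(root) / f"{subset}.jsonl"


def load_jsonl(path):
    """Read a JSON Lines file into a list of parsed records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `load_jsonl(subset_path("math"))` would return the records of the math subset, and `subset_path("long8")` points at the 8k long-context file.
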
This section describes how to build the dataset from scratch.

- Download the original data from Hugging Face.

  `huggingface-cli download RyokoAI/ShareGPT52K --repo-type dataset --local-dir share-gpt`

  This will download `sg_90k_part1.json` and `sg_90k_part2.json`.

- Modify the `DATA_PATH` variable in `tools/clean.py` and `tools/clean_for_longctx.py` to point to the downloaded JSON files.

- Register for an API key on the DeepSeek official website and export it as an environment variable.

  `export DEEPSEEK_API_KEY="your_api_key"`

- Run the data processing scripts.

  `python tools/clean.py`

  `python tools/clean_for_longctx.py`

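
Before running the processing scripts, it can help to confirm that the prerequisites from the steps above are in place. The sketch below is a hypothetical pre-flight check, not part of the repository. It assumes the two JSON files sit directly inside the `share-gpt` directory; adjust the path if your download layout differs.

```python
import os
from pathlib import Path

# File names taken from the download step above.
EXPECTED_FILES = ("sg_90k_part1.json", "sg_90k_part2.json")


def preflight(download_dir="share-gpt", env=None):
    """Return a list of problems found; an empty list means everything is ready."""
    env = os.environ if env is None else env
    problems = []
    for name in EXPECTED_FILES:
        candidate = Path(download_dir) / name
        if not candidate.is_file():
            problems.append(f"missing file: {candidate}")
    if not env.get("DEEPSEEK_API_KEY"):
        problems.append("DEEPSEEK_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

An empty result means both source files were found and the API key is exported in the current shell.
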
After the automated construction, we manually deduplicated the data and curated a selection of high-quality question-answer pairs for each task.
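
The deduplication described here was done by hand. As a possible aid for that kind of review, the sketch below drops exact duplicates after normalizing whitespace and case. It is an illustration only, not the procedure used for this dataset, and the `question` field name is a hypothetical placeholder for whichever field holds the prompt text.

```python
import hashlib
import re


def normalize(text):
    """Collapse runs of whitespace, trim the ends, and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_exact_duplicates(records, key="question"):
    """Keep the first record for each distinct normalized value of `key`."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(normalize(record[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

This only catches pairs that are identical after normalization; near-duplicates and quality judgments would still need manual review.
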