
## Dataset Overview

This dataset consists of three main subsets:

| Dataset | Description | Path |
| --- | --- | --- |
| `full` | The standard dataset used for all benchmarks in the paper. | `data-full` |
| `sample` | A smaller subset sampled from `full` for quick testing. | `data-sample` |
| `visualize` | A subset of `full` used for error visualization. | `data-visualize` |

The data-full directory contains the following subsets:

| Category | Subset | Path |
| --- | --- | --- |
| Reasoning | code | `data-full/code.jsonl` |
| Reasoning | math | `data-full/math.jsonl` |
| Knowledge | law | `data-full/law.jsonl` |
| Knowledge | medicine | `data-full/medicine.jsonl` |
| Knowledge | business | `data-full/business.jsonl` |
| Non-English | french | `data-full/french.jsonl` |
| Non-English | japanese | `data-full/japanese.jsonl` |
| Non-English | chinese | `data-full/chinese.jsonl` |
| Skills | translation (en->zh) | `data-full/en2zh.jsonl` |
| Skills | translation (zh->en) | `data-full/zh2en.jsonl` |
| Skills | summarization | `data-full/sum.jsonl` |
| Long-Context | 8k ctx | `data-full/long8.jsonl` |
| Long-Context | 16k ctx | `data-full/long16.jsonl` |
| Long-Context | 24k ctx | `data-full/long24.jsonl` |
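
Each subset is stored as JSON Lines, one record per line. A minimal loading sketch is below; the `question`/`answer` field names and the demo file are assumptions for illustration — check the actual `.jsonl` files for the exact schema.

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, one per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Throwaway file standing in for a real subset such as data-full/math.jsonl
Path("demo.jsonl").write_text(
    '{"question": "What is 2+2?", "answer": "4"}\n'
    '{"question": "What is 3*3?", "answer": "9"}\n',
    encoding="utf-8",
)
records = load_jsonl("demo.jsonl")
print(len(records))  # 2
```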

## Dataset Construction

This section describes how to build the dataset from scratch.

1. Download the original data from Hugging Face:

   ```bash
   huggingface-cli download RyokoAI/ShareGPT52K --repo-type dataset --local-dir share-gpt
   ```

   This downloads `sg_90k_part1.json` and `sg_90k_part2.json`.

2. Modify the `DATA_PATH` variable in `tools/clean.py` and `tools/clean_for_longctx.py` to point to the downloaded JSON files.

3. Register for an API key on the DeepSeek official website and export it as an environment variable:

   ```bash
   export DEEPSEEK_API_KEY="your_api_key"
   ```

4. Run the data processing scripts:

   ```bash
   python tools/clean.py
   python tools/clean_for_longctx.py
   ```
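
A quick preflight check before step 4 can catch a missing API key or a misplaced download. This is a sketch, not part of the repository's tooling; the file paths assume the `share-gpt` directory from step 1.

```python
import os
from pathlib import Path

def preflight(data_files, env_var="DEEPSEEK_API_KEY"):
    """Return a list of problems; an empty list means the scripts can run."""
    problems = []
    if not os.environ.get(env_var):
        problems.append(f"{env_var} is not set")
    problems += [f"missing file: {p}" for p in data_files if not Path(p).exists()]
    return problems

# Paths from step 1; adjust if you downloaded to a different directory.
issues = preflight(["share-gpt/sg_90k_part1.json", "share-gpt/sg_90k_part2.json"])
for issue in issues:
    print("warning:", issue)
```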

After the automated construction, we manually deduplicated the data and curated a selection of high-quality question-answer pairs for each task.
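
The deduplication itself was manual; purely as an illustration, exact duplicates can be flagged first with a normalized-hash pass like the one below (the `question` field name is an assumption about the record schema).

```python
import hashlib

def dedup_exact(records, key="question"):
    """Keep only the first record for each normalized (lowercased,
    whitespace-collapsed) value of `key`."""
    seen, kept = set(), []
    for rec in records:
        normalized = " ".join(rec[key].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept

data = [
    {"question": "What is 2+2?"},
    {"question": "what is  2+2?"},  # duplicate after normalization
    {"question": "What is 3+3?"},
]
print(len(dedup_exact(data)))  # 2
```

Near-duplicates (paraphrases) would still need human review, which is why a hash pass can only assist, not replace, the manual curation described above.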