TabDomainExtractor: Automatic Domain Classification of Tabular Datasets Using Large Language Models (LLMs)
This repository implements an automated solution for determining the domains of tabular datasets using Large Language Models (LLM). The project leverages models such as GPT-4o and Deepseek to classify tabular data into thematic domains (e.g., medicine, finance) using zero-shot, one-shot, and few-shot learning approaches. It supports stratified splitting (StratifiedShuffleSplit) and K-fold cross-validation with experimental scripts for different sizes of datasets. The method is evaluated on a collected benchmark of 258 OpenML datasets labeled across 14 domains, demonstrating strong performance in domain classification tasks.
Manually labeling the domain of numerous datasets is time-consuming and subjective. This project automates that process. The core idea is that the semantic meaning of column names and some lines in a table strongly indicates its overall domain. An LLM, with its extensive world knowledge, is perfectly suited to interpret these column names and provide a consistent, accurate domain classification.
- Comprehensive Benchmark: Evaluated on a manually curated benchmark of 258 datasets from OpenML across 14 distinct domains.
- Automated Domain Classification: Identifies thematic domains of tabular data using LLM.
- Supported Models: gpt-4o and deepseek-chat via OpenRouter API.
- Classification Modes: Zero-shot, one-shot, and few-shot learning.
- Flexible Partitioning: Supports stratified splitting and K-fold cross-validation.
- Experimentation: Includes scripts for processing subsets (5 and 50 rows) for testing.
TabDomainExtractor/
├── baseline/ # Baseline implementations
│ ├── d4_baseline.py # Baseline using D4 system approach
│ ├── embeddings_of_columns_baseline.py # Baseline using column embeddings
│ └── metafeatures_baseline.py # Baseline using traditional meta-features
├── benchmark/ # Benchmarking utilities and results
│ ├── benchmark.json # Benchmark dataset definitions
│ └── collecting_benchmark.ipynb # Notebook for collecting benchmark data
├── experiments/ # Experimental runs with different configurations
│ ├── 50rows_deepseek.py # Experiment with 50 rows using DeepSeek
│ ├── 50rows_gpt4o.py # Experiment with 50 rows using GPT-4o
│ ├── 5rows_deepseek.py # Experiment with 5 rows using DeepSeek
│ └── 5rows_gpt4o.py # Experiment with 5 rows using GPT-4o
├── metafeatures_experiments/ # Experiments based on metafeatures
│ ├── metafeatures_experiment.py # Main experiment with metafeatures
├── nyc_experiments/ # Experimental runs with different configurations on NYC Open Data datasets
│ ├── nyc_5rows_deepseek.py # Experiment with 5 rows using DeepSeek
│ ├── nyc_5rows_gpt4o.py # Experiment with 5 rows using GPT-4o
│ └── nyc_datasets.json # NYC Open Data datasets definition
├── prompts/ # LLM prompt templates
│ ├── few_shot_prompt.txt # Few-shot learning prompt template
│ └── zero_shot_prompt.txt # Zero-shot prompt template
├── .gitignore
├── requirements.txt
└── README.md
- Clone the repository:
git clone https://github.com/ITMO-NSS-team/TabDomain_Extractor.git
cd TabDomain_Extractor
- Install dependencies:
pip install -r requirements.txt
- Configure the OpenRouter API key: create a .env file in the project root:
echo "OPENROUTER_API_KEY=your-api-key-here" > .env