AI Generated Text Detection

This repository contains the source code and resources for a binary classification project that detects AI-generated text. The project is based on a Kaggle competition and combines several classical machine learning models with a fine-tuned DistilRoBERTa model.

Project Structure

data/: Contains pre-processed and post-processed training and test datasets in CSV format. Training datasets can be augmented with custom-generated synthetic data. The test set provided is a placeholder and should be replaced for actual use.
model_checkpoints/: Stores trained models' checkpoints.
EDA.ipynb: Jupyter notebook for exploratory data analysis on the training set.
generate_synthetic_essays.ipynb: Notebook for generating synthetic training data using Mistral-7B-Instruct.
data_processing.py: Processes the training and test sets, tokenizes and vectorizes texts, and saves the resulting sparse matrices as NPZ files in the data/ folder.
optuna.ipynb: Contains hyperparameter optimization for classical ML models (Ridge, Multinomial Naive Bayes, SVM, and XGBoost) and visualizations of optimization history and parameter importance.
classical_models_training.py: Trains the four classical ML models and saves them as .pkl files in the model_checkpoints/ folder.
distilroberta_training.py: Fine-tunes the pre-trained DistilRoBERTa-base model on the training set and saves the checkpoint to the model_checkpoints/ folder.
inference.py: Loads the trained classical ML models and DistilRoBERTa, combines them in a weighted ensemble, and makes predictions on the test set.
pseudo_labeling.py: Implements pseudo-labeling, which uses the ensemble's confident predictions as additional labels to improve accuracy.
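
The weighted ensembling described for inference.py can be illustrated with a short sketch. This is not the repository's code: the function name `weighted_ensemble`, the model keys, the weights, and the scores are all hypothetical, and the sketch assumes every model's output has already been mapped to a score in [0, 1] for the "AI-generated" class.

```python
import numpy as np


def weighted_ensemble(probs, weights):
    """Blend per-model scores for the 'AI-generated' class with a weighted average.

    probs:   dict mapping a model name to a sequence of scores in [0, 1]
    weights: dict mapping the same model names to non-negative weights
    Both dicts and their keys are hypothetical, chosen for illustration only.
    """
    names = sorted(probs)
    # Shape (n_models, n_samples): one row of scores per model.
    stacked = np.stack([np.asarray(probs[name], dtype=float) for name in names])
    w = np.array([weights[name] for name in names], dtype=float)
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()  # normalize so the blended score stays in [0, 1]
    return w @ stacked


# Made-up scores for three test essays from two of the models.
model_probs = {
    "ridge": [0.9, 0.2, 0.6],
    "distilroberta": [0.7, 0.1, 0.4],
}
# Made-up weights; in practice they would be tuned on validation data.
model_weights = {"ridge": 1.0, "distilroberta": 3.0}

blended = weighted_ensemble(model_probs, model_weights)
print(blended)
```

Normalizing the weights keeps the blended score on the same scale as the inputs, so a single decision threshold can be applied regardless of how many models are included.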

Setup

Clone the Repository:

git clone https://github.com/q-xZzz/ai-text-detection.git

Install Dependencies:
```
pip install -r requirements.txt
```

Running the Project

To get the project up and running, follow these steps:

(Optional) Exploratory Data Analysis: Open EDA.ipynb with Jupyter Notebook or JupyterLab to explore the training dataset.
(Optional) Generating Synthetic Training Data: Use generate_synthetic_essays.ipynb to create additional synthetic data for training.
Data Processing: Run python data_processing.py to tokenize and vectorize the datasets, and save the processed data for training and testing.
(Optional) Hyperparameter Optimization: Launch optuna.ipynb to find the optimal hyperparameters for the classical ML models.
Model Training:
- For classical ML models, execute python classical_models_training.py.
- For DistilRoBERTa, run python distilroberta_training.py.
Inference: Use python inference.py to load the trained models, ensemble them, and make predictions on the test set.
Pseudo Labeling: Use pseudo_labeling.py to treat the ensemble's confident predictions as additional labels and further improve accuracy.
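
The core of pseudo-labeling is choosing which predictions are confident enough to reuse as labels. The sketch below shows one common way to do that, under stated assumptions: the function name `select_pseudo_labels`, the 0.05 and 0.95 thresholds, and the example scores are hypothetical and are not taken from pseudo_labeling.py.

```python
import numpy as np


def select_pseudo_labels(probs, low=0.05, high=0.95):
    """Pick confidently predicted samples and assign them hard labels.

    probs: predicted scores in [0, 1] for the 'AI-generated' class
    low, high: hypothetical confidence thresholds (assumptions, not repo values)
    Returns the indices of the confident samples and their 0/1 pseudo-labels.
    """
    probs = np.asarray(probs, dtype=float)
    confident = (probs <= low) | (probs >= high)
    indices = np.flatnonzero(confident)
    # Scores at or above `high` become label 1; those at or below `low` become 0.
    labels = (probs[indices] >= high).astype(int)
    return indices, labels


# Made-up ensemble scores for five unlabeled essays.
test_probs = [0.99, 0.50, 0.02, 0.80, 0.97]
idx, labels = select_pseudo_labels(test_probs)
print(idx, labels)
```

The selected rows would then be appended to the training set with their pseudo-labels before retraining. Tighter thresholds add fewer examples but reduce the risk of training on wrong labels.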
