This repository contains the source code and resources for a binary classification project that detects AI-generated text. The project is based on the Kaggle competition and combines several classical machine learning models with a fine-tuned DistilRoBERTa model.
- `data/`: Contains pre-processed and post-processed training and test datasets in CSV format. Training datasets can be augmented with custom-generated synthetic data. The test set provided is a placeholder and should be replaced for actual use.
- `model_checkpoints/`: Stores checkpoints of the trained models.
- `EDA.ipynb`: Jupyter notebook for exploratory data analysis on the training set.
- `generate_synthetic_essays.ipynb`: Notebook for generating synthetic training data using Mistral-7B Instruct.
- `data_processing.py`: Processes the training and test sets, tokenizes and vectorizes texts, and saves the resulting sparse matrices as NPZ files in the `data/` folder.
- `optuna.ipynb`: Contains hyperparameter optimization for the classical ML models (Ridge, Multinomial Naive Bayes, SVM, and XGBoost) and visualizations of optimization history and parameter importance.
- `classical_models_training.py`: Trains the four classical ML models and saves them as `.pkl` files in the `model_checkpoints/` folder.
- `distilroberta_training.py`: Fine-tunes the pre-trained DistilRoBERTa-base model on the training set and saves the checkpoint to the `model_checkpoints/` folder.
- `inference.py`: Loads the trained classical ML models and DistilRoBERTa, then ensembles them with weights to make predictions on the test set.
- `pseudo_labeling.py`: Implements advanced pseudo-labeling techniques that leverage confident predictions to improve accuracy.
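The `data_processing.py` entry above describes vectorizing texts and storing the resulting sparse matrices as NPZ files. The following is a minimal sketch of that idea, not the script itself. It assumes a TF-IDF vectorizer from scikit-learn, which the description does not confirm, and the file names, toy texts, and vectorizer settings are hypothetical.

```python
import os
import tempfile

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the essay texts; the real CSV files live in data/.
train_texts = ["the cat sat on the mat", "a model wrote this essay"]
test_texts = ["the essay sat on the mat"]

# Assumption: TF-IDF over word unigrams and bigrams.
# The actual script may tokenize and vectorize differently.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary from training text only
X_test = vectorizer.transform(test_texts)        # reuse that vocabulary for the test set

# Hypothetical file names; a temporary directory stands in for data/.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train_features.npz")
test_path = os.path.join(out_dir, "test_features.npz")
sparse.save_npz(train_path, X_train)
sparse.save_npz(test_path, X_test)

# Round trip: load the training matrix back from disk.
X_loaded = sparse.load_npz(train_path)
```

Fitting the vectorizer on the training texts only, then calling `transform` on the test texts, keeps both matrices in the same feature space without leaking test vocabulary into training.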
- Clone the Repository:
git clone https://github.com/q-xZzz/ai-text-detection.git
- Install Dependencies:
pip install -r requirements.txt
To get the project up and running, follow these steps:
- (optional) Exploratory Data Analysis: Open `EDA.ipynb` with Jupyter Notebook or JupyterLab to explore the training dataset.
- (optional) Generating Synthetic Training Data: Use `generate_synthetic_essays.ipynb` to create additional synthetic data for training.
- Data Processing: Run `python data_processing.py` to tokenize and vectorize the datasets and save the processed data for training and testing.
- (optional) Hyperparameter Optimization: Launch `optuna.ipynb` to find the optimal hyperparameters for the classical ML models.
- Model Training:
  - For classical ML models, execute `python classical_models_training.py`.
  - For DistilRoBERTa, run `python distilroberta_training.py`.
- Inference: Use `python inference.py` to load the trained models, ensemble them, and make predictions on the test set.
- Pseudo Labeling: Advanced pseudo-labeling techniques can be applied using `pseudo_labeling.py` to further refine the model's accuracy by leveraging confident predictions.
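The inference and pseudo-labeling steps above can be sketched roughly as follows: a weighted average blends the per-model scores, and only predictions far from the decision boundary are kept as pseudo-labels. This is an illustration under stated assumptions, not the code in `inference.py` or `pseudo_labeling.py`. The function names, model weights, scores, and the 0.1/0.9 thresholds are all hypothetical.

```python
def weighted_ensemble(model_probs, weights):
    """Blend per-model scores for the positive class with a weighted average."""
    total = sum(weights.values())
    n_samples = len(next(iter(model_probs.values())))
    return [
        sum(weights[name] * probs[i] for name, probs in model_probs.items()) / total
        for i in range(n_samples)
    ]


def select_pseudo_labels(probs, low=0.1, high=0.9):
    """Return (index, label) pairs for predictions confident enough to reuse as labels."""
    selected = []
    for i, p in enumerate(probs):
        if p >= high:
            selected.append((i, 1))  # confidently AI-generated
        elif p <= low:
            selected.append((i, 0))  # confidently human-written
    return selected


# Hypothetical scores in [0, 1] that each of three texts is AI-generated.
model_probs = {
    "ridge": [0.95, 0.40, 0.05],
    "distilroberta": [0.99, 0.60, 0.01],
}
# Hypothetical weights; the values used by the project are not stated here.
weights = {"ridge": 1.0, "distilroberta": 3.0}

blended = weighted_ensemble(model_probs, weights)
pseudo = select_pseudo_labels(blended)
print(pseudo)  # the middle text is too uncertain, so it is left out
```

The selected pairs would then be appended to the training set before retraining. Tightening the thresholds trades fewer pseudo-labels for a lower risk of training on wrong ones.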