Duplicate Question Detection

Overview

This project applies Natural Language Processing (NLP) and Machine Learning techniques to determine whether two questions share the same intent, that is, whether they are duplicates of each other. Semantic, syntactic, and structural features are extracted from each question pair, and classifiers trained on those features label the pair as duplicate or non-duplicate.

Dataset

The analysis uses questions.csv, stored in the Dataset folder. It contains pairs of questions with the following key columns:

question1 & question2: The raw text of the question pairs.
is_duplicate: The target variable (1 = Duplicate, 0 = Not a duplicate).
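
As a quick illustration of the column layout above, the following minimal sketch reads question pairs and computes the share of duplicates. It uses only the standard library and a made-up two-row sample in place of questions.csv; the `load_pairs` helper and the sample rows are hypothetical, and the notebook itself may load the data differently (for example with Pandas).

```python
import csv
import io

# Hypothetical two-row sample mirroring the columns described above.
SAMPLE_CSV = """question1,question2,is_duplicate
How do I learn Python?,What is the best way to learn Python?,1
How do I learn Python?,How do I cook rice?,0
"""

def load_pairs(handle):
    """Return (question1, question2, label) tuples from an open CSV file."""
    pairs = []
    for row in csv.DictReader(handle):
        pairs.append((row["question1"], row["question2"], int(row["is_duplicate"])))
    return pairs

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
# Fraction of pairs labelled 1 (duplicate).
duplicate_rate = sum(label for _, _, label in pairs) / len(pairs)
print(f"{len(pairs)} pairs, {duplicate_rate:.0%} duplicates")
```

To run this against the real file, pass `open("Dataset/questions.csv", newline="", encoding="utf-8")` instead of the in-memory sample; that path is assumed from the folder name mentioned above.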

Methodology

The project follows a four-stage NLP pipeline:

Data Preprocessing: Cleaning text by removing HTML tags, expanding contractions, stripping punctuation, removing stopwords, and applying WordNet lemmatization.
Feature Engineering: Four groups of features are extracted to capture text similarity:
- Basic & Length Features: Jaccard similarity, length differences, mean lengths, and longest common substring ratios.
- Fuzzy Features: Ratios using the FuzzyWuzzy library (QRatio, partial ratio, token sort, token set).
- Semantic Embeddings: Cosine similarity calculated using HuggingFace's pre-trained Sentence-BERT model (all-MiniLM-L6-v2).
- Syntactic & Keyword Features: Part-of-Speech (POS) tagging overlap (nouns, verbs, adjectives) and RAKE (Rapid Automatic Keyword Extraction) overlaps.
Model Building & Evaluation: The data is split into training and testing sets, and multiple classifiers are trained and compared:
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
- XGBoost Classifier
- LightGBM Classifier
Visualization & Clustering: Dimensionality reduction using t-SNE to visualize the feature space of the question pairs, and KMeans clustering evaluated with silhouette scores.
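
The preprocessing and basic feature steps above can be sketched roughly as follows. This is a simplified, standard-library-only illustration, not the notebook's code: the tiny `STOPWORDS` set stands in for NLTK's list, contraction expansion and lemmatization are omitted, and the exact feature definitions (lengths counted in tokens, longest common substring normalized by the shorter string) are assumptions.

```python
import re
import string
from difflib import SequenceMatcher

# Small illustrative stopword list (assumption); the project uses NLTK.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "i", "to", "of", "what", "how"}

def clean(text):
    """Lowercase, drop HTML tags and punctuation, remove stopwords."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

def jaccard(tokens_a, tokens_b):
    """Size of the token intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def longest_common_substring_ratio(a, b):
    """Longest shared substring length over the shorter string's length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))

def pair_features(q1, q2):
    """Compute a few basic similarity features for one question pair."""
    t1, t2 = clean(q1), clean(q2)
    s1, s2 = " ".join(t1), " ".join(t2)
    return {
        "jaccard": jaccard(t1, t2),
        "len_diff": abs(len(t1) - len(t2)),
        "mean_len": (len(t1) + len(t2)) / 2,
        "lcs_ratio": longest_common_substring_ratio(s1, s2),
    }

features = pair_features(
    "How do I learn <b>Python</b>?",
    "What is the best way to learn Python?",
)
print(features)
```

Each pair becomes one row of numeric features like these; the fuzzy, embedding, POS, and RAKE features would be appended as further columns before the classifiers are trained.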

Technologies Used

Python 3
Pandas & NumPy: Data manipulation
NLTK, FuzzyWuzzy & RAKE: Text processing, string matching, and keyword extraction
Sentence-Transformers (HuggingFace): Deep learning text embeddings
Scikit-Learn, XGBoost, & LightGBM: Machine learning modeling and evaluation
Matplotlib: Data visualization (plotting the t-SNE projection)

How to Run

Clone this repository or download the files.
Ensure you have the required libraries installed (e.g., transformers, sentence-transformers, fuzzywuzzy, xgboost, lightgbm, rake-nltk).
Open Duplicate_Question.ipynb in Google Colab or Jupyter Notebook.
Upload the questions.csv dataset when prompted by the notebook.
Run all cells to process the text, extract features, and evaluate the classification models.
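
Before running the notebook, a short check like the one below can report which of the libraries listed above are not yet installed. It is a convenience sketch rather than part of the project; the `missing_packages` helper is hypothetical, and the mapping from pip names to import names is an assumption based on the packages' usual module names.

```python
from importlib.util import find_spec

# Assumed mapping: pip package name -> top-level module name.
REQUIRED_MODULES = {
    "transformers": "transformers",
    "sentence-transformers": "sentence_transformers",
    "fuzzywuzzy": "fuzzywuzzy",
    "xgboost": "xgboost",
    "lightgbm": "lightgbm",
    "rake-nltk": "rake_nltk",
}

def missing_packages(required):
    """Return the pip names whose top-level module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]

missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Install before running the notebook: pip install " + " ".join(missing))
else:
    print("All listed packages are importable.")
```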
