This repository contains a Twitter sentiment analysis project using classical machine learning models. The goal is to classify tweets into four sentiment categories: Positive, Negative, Neutral, and Irrelevant. The project includes full preprocessing, feature extraction, model training, evaluation, and saving the final model for deployment.
The dataset used for this project is taken from Kaggle: Twitter Entity Sentiment Analysis.
- Training set: twitter_training.csv
- Validation set: twitter_validation.csv
- Columns:
  - ID → Tweet ID
  - Topic → Topic of the tweet
  - Sentiment → Sentiment label (Positive, Negative, Neutral, Irrelevant)
  - Text → Original tweet text
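
As a rough illustration of how these files might be loaded, here is a minimal pandas sketch. It assumes the CSVs have no header row and that the columns appear in the order listed above. The `load_tweets` helper and the in-memory sample rows are made up for the example and are not part of the repository.

```python
import io

import pandas as pd

# Assumed column order, taken from the column list in this README.
COLUMNS = ["ID", "Topic", "Sentiment", "Text"]


def load_tweets(source):
    """Read a tweets CSV (file path or file-like object) into a DataFrame."""
    # header=None assumes there is no header row; remove it if your file has one.
    df = pd.read_csv(source, header=None, names=COLUMNS)
    # Rows without text cannot be vectorized later, so drop them here.
    return df.dropna(subset=["Text"]).reset_index(drop=True)


# Tiny in-memory stand-in for twitter_training.csv (made-up rows).
sample = io.StringIO(
    '1,Borderlands,Positive,"I love this game"\n'
    '2,Borderlands,Negative,"This update is terrible"\n'
    "3,Nvidia,Neutral,\n"
)
df = load_tweets(sample)
print(df["Sentiment"].value_counts())
```

For the real data, pass the file path instead, for example `load_tweets("twitter_training.csv")`.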
- Text Cleaning
  - Lowercasing, HTML decoding
  - Remove URLs, mentions (@user), hashtags, emojis, and special characters
  - Tokenization, stopwords removal, lemmatization
- Label Encoding
  - Convert sentiment labels into integers for model training
- Feature Extraction
  - TF-IDF vectorization with unigrams and bigrams
  - Max features: 10,000
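
The preprocessing steps above could look roughly like the sketch below. It is an approximation under stated assumptions, not the project's exact code: the regular expressions and the `clean_text` helper are illustrative, scikit-learn's built-in English stopword list stands in for whatever list the project uses, and lemmatization is left out because it needs an extra library such as NLTK or spaCy. The sample tweets are invented.

```python
import html
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_text(text):
    """Lowercase, decode HTML entities, strip noise, and drop stopwords."""
    text = html.unescape(text).lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    # Removes digits, punctuation, emojis, and other special characters.
    text = NON_LETTER_RE.sub(" ", text)
    # Whitespace tokenization; lemmatization is intentionally omitted here.
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)


# Made-up tweets, one per sentiment class.
texts = [
    "I LOVED it &amp; @user check https://t.co/x #fun 😀!",
    "Worst update so far, the servers keep crashing",
    "The patch releases on Friday",
    "Anyone selling concert tickets?",
]
labels = ["Positive", "Negative", "Neutral", "Irrelevant"]

cleaned = [clean_text(t) for t in texts]

# LabelEncoder sorts class names alphabetically before assigning integers.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Unigrams and bigrams, capped at 10,000 features as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)
X = vectorizer.fit_transform(cleaned)

print(cleaned[0])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(X.shape)
```

Fit the vectorizer and the encoder on the training set only, then reuse them with `transform` on the validation set so no validation vocabulary leaks into training.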

Models trained and compared:
- Logistic Regression
- Multinomial Naive Bayes
- Decision Tree
- Random Forest (final model)
- Linear SVM
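
One way to set up these five candidates with scikit-learn is sketched below. Only the Random Forest size (200 estimators) comes from this README; every other hyperparameter, the `build_candidates` helper, and the toy tweets are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def build_candidates(seed=42):
    """Return the five classifiers keyed by display name."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Multinomial Naive Bayes": MultinomialNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=seed),
        "Linear SVM": LinearSVC(random_state=seed),
    }


# Made-up tweets, two per class, just to show the fit loop running.
texts = [
    "love this game so much", "best update this year",
    "hate the new patch", "servers are terrible again",
    "patch notes released today", "maintenance starts at noon",
    "selling my old bike", "weather is nice today",
]
labels = [
    "Positive", "Positive", "Negative", "Negative",
    "Neutral", "Neutral", "Irrelevant", "Irrelevant",
]

X = TfidfVectorizer().fit_transform(texts)

fitted = {}
for name, model in build_candidates().items():
    fitted[name] = model.fit(X, labels)
    # Training accuracy on toy data only; it says nothing about real performance.
    print(f"{name}: {model.score(X, labels):.3f}")
```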
Evaluation metrics: Accuracy, Precision, Recall, F1-score, Confusion Matrix, ROC Curve
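
The metrics listed above can be computed with `sklearn.metrics`, as in the sketch below. It uses a synthetic four-class dataset and Logistic Regression as stand-ins, macro averaging for precision, recall, and F1, and a one-vs-rest ROC AUC in place of a plotted ROC curve. All of these choices, along with the `evaluate` helper, are assumptions rather than the project's confirmed setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Collect the main classification metrics for a fitted model."""
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        # One-vs-rest AUC needs class probabilities, so the model must
        # implement predict_proba (LinearSVC, for example, does not).
        "roc_auc_ovr": roc_auc_score(
            y_test, model.predict_proba(X_test), multi_class="ovr"
        ),
    }


# Synthetic four-class data standing in for the TF-IDF features.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
results = evaluate(model, X_test, y_test)
for name, value in results.items():
    print(name, value, sep=": ")
```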
Final Model: Random Forest with 200 estimators trained on the full dataset.
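
A minimal sketch of training and saving such a model follows. It bundles the TF-IDF step and the classifier into one scikit-learn `Pipeline` and persists it with `joblib`. The pipeline layout, the `train_final_model` helper, the file name `sentiment_rf.joblib`, and the toy tweets are all hypothetical; the repository may save the vectorizer and model separately or use another format.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def train_final_model(texts, labels, seed=42):
    """Fit TF-IDF plus a 200-tree Random Forest as a single pipeline."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)),
        ("clf", RandomForestClassifier(n_estimators=200, random_state=seed)),
    ])
    return pipeline.fit(texts, labels)


# Made-up, already-cleaned tweets standing in for the full dataset.
texts = [
    "love this game so much", "best update this year",
    "hate the new patch", "servers are terrible again",
    "patch notes released today", "maintenance starts at noon",
    "selling my old bike", "weather is nice today",
]
labels = [
    "Positive", "Positive", "Negative", "Negative",
    "Neutral", "Neutral", "Irrelevant", "Irrelevant",
]

model = train_final_model(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_rf.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

print(restored.predict(["love the new update"]))
```

Saving the vectorizer together with the classifier means deployment code only has to load one file and call `predict` on text that has been cleaned the same way as the training data.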
The table below shows the accuracy of different machine learning models on the validation set:
| Model | Accuracy |
|---|---|
| Logistic Regression | 0.865 |
| Naive Bayes | 0.745 |
| Decision Tree | 0.886 |
| Random Forest | 0.946 |
| Linear SVM | 0.904 |
Key Insight:
- Random Forest achieved the highest validation accuracy (0.946) of the five models compared.
- It demonstrates the best balance between precision, recall, and F1-score.
- Therefore, Random Forest is selected as the final model for deployment and further predictions.