This notebook was created as a follow-up to a poster presentation by Minkyung Kim at the CSEDU 2025 conference. It explores complementary techniques for paraphrase evaluation using lexical, semantic, and syntactic similarity features.
The notebook includes:
- Jaccard Similarity (Lexical)
- Sentence-BERT Cosine Similarity (Semantic)
- Tree Edit Distance using spaCy & ZSS (Syntactic)
- Visual analyses of the scores
- Supervised classification models (Logistic Regression, XGBoost, SVM)
- Hyperparameter tuning with
GridSearchCV
It uses the MRPC dataset from the GLUE benchmark for experimentation.
You can open the notebook in Google Colab using the following link:
Once opened:
- Click
Runtime > Run allto execute all cells.
If the Colab runtime doesn't start properly or if resources are limited:
- You can still copy the code from each cell in the notebook.
- Then, paste it into your own new Colab notebook via site : colab.research.google.com.
- This will let you run everything independently.
The notebook installs all required libraries automatically via pip. These include:
datasetsspacysentence-transformerszssscikit-learnxgboostmatplotlibseaborn
It also downloads the spaCy models:
en_core_web_smen_core_web_md
This prototype is a humble contribution to the ongoing research on paraphrase evaluation for EFL learners.
Feel free to reuse, modify, or extend it.
With kind regards,
Mikel Vandeloise
Université de Namur, Belgium