Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs
Alexandria covers 13 Arab countries, 11 domains, and 107K community-driven samples.
This repository accompanies the Alexandria paper and collects the project assets used to build and evaluate a benchmark for Dialectal Arabic Machine Translation, Arabic dialect translation, English-to-dialect Arabic translation, dialect-to-English translation, and multi-turn conversational MT. Alexandria is organized into four splits: Train, Dev, Public Test, and Private Test. This repository focuses on the materials behind the dataset creation pipeline: prompt templates for English source conversation generation, participant guidelines for translation and revision, and the evaluation area for benchmarking Arabic MT systems and LLMs on the Alexandria public test set.
Alexandria is introduced in the paper Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs. The project targets a persistent gap in Arabic NLP: strong support for Modern Standard Arabic but much weaker coverage of dialectal Arabic, especially in realistic, culturally grounded conversational settings.
This repository organizes the Alexandria resources used across the creation workflow and the public-test evaluation setup for dialectal Arabic MT, city-level Arabic dialect translation, culturally grounded machine translation, code-switching-aware translation, gender-aware translation, and Arabic LLM evaluation.
Keywords: dialectal Arabic machine translation, Arabic dialect translation, conversational machine translation, multi-turn MT, city-level dialect benchmarking, English-dialect parallel data, culturally grounded Arabic NLP, code-switching, persona-aware translation, gender-conditioned translation, Arabic LLM evaluation.
Alexandria contains 107,631 turns in total, spanning 13 Arab country contexts and 11 domains. The data is annotated at city level, organized as multi-turn conversations, and parallel in both directions (English <-> dialectal Arabic). Turns average 13.23 words, and the Distinct-2 lexical diversity is 0.826.
Alexandria is organized into four standard benchmark splits:
- Train
- Dev
- Public Test
- Private Test
The Public Test split is intended for open benchmarking and reproducible reporting, while the Private Test split supports held-out evaluation.
You can access Alexandria directly from Hugging Face using the datasets library. The example below loads a specific country subset and reads the first English and dialectal turns from the training split.
- Hugging Face dataset: UBC-NLP/alexandria
```python
from datasets import load_dataset

# Load a specific country subset (e.g., 'MA' for Morocco, 'EG' for Egypt) with a specific split
train_data = load_dataset("UBC-NLP/alexandria", name="MA", split="train")
test_data = load_dataset("UBC-NLP/alexandria", name="MA", split="test")

# View the first parallel turn of the first conversation from the train set
first_conv = train_data[0]
eng_turn = first_conv['english_conversation'][0]
dialect_turn = first_conv['dialectal_conversation'][0]

print(f"English: {eng_turn['text']}")
print(f"Dialect: {dialect_turn['text']}")
```

The 13 dialect settings covered in Alexandria are Jordanian Arabic, Lebanese Arabic, Palestinian Arabic, Syrian Arabic, Saudi Arabic, Omani Arabic, Yemeni Arabic, Egyptian Arabic, Sudanese Arabic, Libyan Arabic, Moroccan Arabic, Mauritanian Arabic, and Tunisian Arabic.
Regional grouping in the table below:
- Levant: JO (Jordanian Arabic), LB (Lebanese Arabic), PS (Palestinian Arabic), SY (Syrian Arabic)
- Gulf: SA (Saudi Arabic), OM (Omani Arabic), YE (Yemeni Arabic)
- Nile: EG (Egyptian Arabic), SD (Sudanese Arabic)
- Maghreb: LY (Libyan Arabic), MA (Moroccan Arabic), MR (Mauritanian Arabic), TN (Tunisian Arabic)

| Domain | JO | LB | PS | SY | SA | OM | YE | EG | SD | LY | MA | MR | TN | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agriculture/Farming | 825 | 1140 | 1770 | 931 | 1162 | 915 | 529 | 583 | 163 | 231 | 570 | 970 | 481 | 10270 |
| Commerce/Transactions | 750 | 1004 | 1595 | 749 | 1020 | 650 | 579 | 506 | 201 | 160 | 445 | 757 | 401 | 8817 |
| Construction/Real Estate | 859 | 995 | 1761 | 861 | 1161 | 974 | 696 | 660 | 225 | 271 | 574 | 673 | 485 | 10195 |
| Education/Academia | 816 | 1191 | 1513 | 831 | 1017 | 1079 | 563 | 549 | 170 | 220 | 601 | 863 | 551 | 9964 |
| Energy/Resources | 786 | 1048 | 1715 | 928 | 1177 | 937 | 587 | 625 | 189 | 243 | 447 | 719 | 470 | 9871 |
| Everyday/Social | 967 | 1215 | 1697 | 787 | 1020 | 888 | 642 | 604 | 175 | 210 | 595 | 824 | 550 | 10174 |
| Healthcare/Medical | 727 | 1240 | 1728 | 781 | 1043 | 895 | 548 | 487 | 164 | 253 | 556 | 948 | 522 | 9892 |
| Legal/Financial | 693 | 1006 | 1566 | 757 | 857 | 753 | 496 | 539 | 177 | 174 | 481 | 642 | 412 | 8553 |
| Logistics/Transport | 842 | 1020 | 1512 | 950 | 1234 | 842 | 629 | 646 | 189 | 187 | 593 | 877 | 515 | 10036 |
| Professional/Workplace | 845 | 1220 | 1810 | 959 | 1112 | 866 | 549 | 645 | 178 | 253 | 480 | 709 | 526 | 10152 |
| Tourism/Hospitality | 720 | 1161 | 1596 | 884 | 1004 | 815 | 608 | 608 | 190 | 216 | 567 | 878 | 460 | 9707 |
| Total | 8830 | 12240 | 18263 | 9418 | 11807 | 9614 | 6426 | 6452 | 2021 | 2418 | 5909 | 8860 | 5373 | 107631 |
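
As a quick sanity check on the table, the per-country totals can be rolled up into the four regional groups. The sketch below copies the "Total" row and the regional grouping from this README; the names `COUNTRY_TOTALS`, `REGIONS`, and `region_totals` are illustrative and not part of the repository.

```python
# Per-country turn counts, copied from the "Total" row of the table above.
COUNTRY_TOTALS = {
    "JO": 8830, "LB": 12240, "PS": 18263, "SY": 9418,
    "SA": 11807, "OM": 9614, "YE": 6426,
    "EG": 6452, "SD": 2021,
    "LY": 2418, "MA": 5909, "MR": 8860, "TN": 5373,
}

# Regional grouping as listed in this README.
REGIONS = {
    "Levant": ["JO", "LB", "PS", "SY"],
    "Gulf": ["SA", "OM", "YE"],
    "Nile": ["EG", "SD"],
    "Maghreb": ["LY", "MA", "MR", "TN"],
}


def region_totals(country_totals, regions):
    """Sum per-country turn counts into per-region turn counts."""
    return {
        region: sum(country_totals[code] for code in codes)
        for region, codes in regions.items()
    }


totals = region_totals(COUNTRY_TOTALS, REGIONS)
grand_total = sum(COUNTRY_TOTALS.values())
for region, count in totals.items():
    print(f"{region:<8} {count:>6} turns ({100 * count / grand_total:.1f}%)")
```

With the figures from the table, this gives 48,751 turns for the Levant, 27,847 for the Gulf, 8,473 for the Nile, and 22,560 for the Maghreb, which together add up to the reported 107,631.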
Alexandria is designed to extend prior Arabic dialect MT resources with broader domain coverage, multi-turn conversational structure, local context, code-switching support, gender-direction annotations, and persona roles.
| Dataset | # Sentence Pairs / Turns | # Dialects | Granularity | Src Type | Direction | # Domains | Avg. words | Distinct-2 | LC | CS | GD | PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PADIC (Meftouh et al., 2015) | 38K | 6 | Country | Sentence | Eng <-> Dialect | 1 | 6.77 | 0.782 | No | No | No | No |
| MADAR (Bouamor et al., 2018) | 100K | 13 | City | Sentence | Eng <-> Dialect | 1 | 5.73 | 0.768 | No | No | No | No |
| FLORES+ (Team et al., 2022) | 16K | 9 | Country | Sentence | Eng <-> Dialect | 3 | 18.39 | 0.898 | No | No | No | No |
| Alexandria (ours) | 107K | 13 | City | Multi-turn | Eng <-> Dialect | 11 | 13.23 | 0.826 | Yes | Yes | Yes | Yes |

LC = Local Context, CS = Code-Switching, GD = gender-direction annotations, PR = persona roles.
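
For readers unfamiliar with the Distinct-2 column: Distinct-n is commonly defined as the number of unique n-grams divided by the total number of n-grams. The sketch below illustrates that general definition only. Whitespace tokenization and pooling n-grams over the whole corpus are assumptions made here, so it may not reproduce the figures reported in the table.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams, pooled over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: plain whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # Guard against empty input or texts shorter than n tokens.
    return len(unique) / total if total else 0.0


# Toy sentences for illustration, not taken from the dataset.
sample = ["the market opens early", "the market closes late"]
print(round(distinct_n(sample, n=2), 3))  # 5 unique bigrams out of 6 -> 0.833
```

A higher value means less repetition of word pairs; the score depends on tokenization and corpus size, so comparisons are most meaningful under the same setup.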
```
.
├── evaluation_code/
├── guidelines/
│   ├── Alexandria_MT_Revision_Phase_Guidelines.pdf
│   └── Alexandria_MT_Translation_Phase_Guidelines.pdf
├── images/
│   └── alexandria_overview.webp
└── prompts/
    ├── coversations_generation_prompt.txt
    ├── *_prompt.txt
    └── topics_examples/
```

The prompts/ directory contains the per-domain prompts used to generate the English source conversations that were later translated into local dialects and languages. It includes:
- Domain-specific prompt files (for topic generation) for: agriculture_farming, commerce_transactions, construction_real_estate, education_academia, energy_resources, everyday_social, healthcare_medical, legal_financial, logistics_transportation, professional_workplace, and tourism_hospitality
- Example topic files under prompts/topics_examples/ for the same set of domains
- A shared instruction template (for conversation generation) in coversations_generation_prompt.txt
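
Note that the shared template also matches the `*_prompt.txt` pattern shown in the repository tree, so a script that collects the domain prompts needs to skip it. The sketch below shows one way to do this. It assumes that domain prompts are named `<domain>_prompt.txt`, which is inferred from the tree rather than confirmed, and `domain_prompt_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Shared template name as it appears in the repository tree (spelling kept as-is).
SHARED_TEMPLATE = "coversations_generation_prompt.txt"
SUFFIX = "_prompt.txt"


def domain_prompt_files(prompts_dir):
    """Map each domain slug to its prompt file, skipping the shared template.

    Assumption: domain prompts follow the '<domain>_prompt.txt' naming pattern.
    """
    files = {}
    for path in sorted(Path(prompts_dir).glob("*" + SUFFIX)):
        if path.name == SHARED_TEMPLATE:
            continue  # matches the glob but is not a domain prompt
        files[path.name[: -len(SUFFIX)]] = path
    return files


if __name__ == "__main__":
    # Run from the repository root; prints nothing if prompts/ is absent.
    for domain, path in domain_prompt_files("prompts").items():
        print(f"{domain} -> {path}")
```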
The guidelines/ directory contains the documents given to participants during the human data creation stages:
- Alexandria_MT_Translation_Phase_Guidelines.pdf for the translation phase
- Alexandria_MT_Revision_Phase_Guidelines.pdf for the revision phase
The evaluation_code/ directory contains the code for benchmarking your own models on Alexandria. The public evaluation setup is centered on the Public Test split.
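
The scripts in evaluation_code/ remain the reference for benchmarking. As a rough illustration of what preparing one conversation for multi-turn evaluation in either direction might look like, the sketch below turns a record into per-turn examples with a short source-side history. The field names `english_conversation`, `dialectal_conversation`, and `text` follow the loading snippet in this README; `build_mt_examples`, the direction labels, and `max_history` are hypothetical and do not describe the repository's actual pipeline.

```python
from typing import Dict, Iterator


def build_mt_examples(conversation: Dict, direction: str = "en2dia",
                      max_history: int = 2) -> Iterator[Dict]:
    """Yield one translation example per turn, with preceding source turns as context.

    Assumption: English and dialectal turns are aligned by index.
    """
    en = [turn["text"] for turn in conversation["english_conversation"]]
    dia = [turn["text"] for turn in conversation["dialectal_conversation"]]
    if len(en) != len(dia):
        raise ValueError("English and dialectal conversations differ in length")

    if direction == "en2dia":
        src, ref = en, dia
    elif direction == "dia2en":
        src, ref = dia, en
    else:
        raise ValueError(f"Unknown direction: {direction!r}")

    for i, (source, reference) in enumerate(zip(src, ref)):
        # Keep at most `max_history` earlier source turns as context.
        history = src[max(0, i - max_history):i]
        yield {"history": history, "source": source, "reference": reference}


# Placeholder record for illustration; the strings are not real dataset content.
toy = {
    "english_conversation": [{"text": f"EN turn {i}"} for i in (1, 2, 3)],
    "dialectal_conversation": [{"text": f"DIA turn {i}"} for i in (1, 2, 3)],
}
for example in build_mt_examples(toy, direction="dia2en", max_history=1):
    print(example)
```

Each yielded `reference` can then be compared against a system's output for the matching `source` using whichever metric the evaluation setup prescribes.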
Alexandria spans 11 practical domains designed to reflect everyday and specialized communication across Arab communities:
- Agriculture and farming
- Commerce and transactions
- Construction and real estate
- Education and academia
- Energy and resources
- Everyday social interactions
- Healthcare and medical settings
- Legal and financial settings
- Logistics and transportation
- Professional workplace communication
- Tourism and hospitality
If you use this repository or the Alexandria dataset in your research, please cite the paper:
```bibtex
@inproceedings{el-mekki-etal-2026-alexandria,
    title = "Alexandria: A Multi-Domain Dialectal {A}rabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse {LLM}s",
    author = "EL Mekki, Abdellah and
      Magdy, Samar M. and
      Atou, Houdaifa and
      AbuHweidi, Ruwa and
      Qawasmeh, Baraah and
      Nacar, Omer and
      Al-hibiri, Thikra and
      Saadie, Razan and
      Alsayadi, Hamzah A. and
      Hammouda, Nadia Ghezaiel and
      Alkhazimi, Alshima Mohammed and
      Hamod, Aya and
      Al-Ghafri, Al-Yas Yaqoob and
      El-Sayed, Wesam and
      al Sharji, Asila Ismail and
      Ballout, Mohamad and
      Belfathi, Anas and
      Ghaddar, Karim and
      Sibaee, Serry and
      Aoun, Alaa and
      Aseri, Aeej Mohammed and
      Abureesh, Lina and
      Bashiti, Ahlam and
      Yousef, Majdal and
      Hafiz, Abdulaziz and
      Mohamed, Yehdih and
      Hamedtou, Emira and
      Emehah, Brakehe and
      Alhamouri, Rahaf and
      Nafea, Youssef and
      El Aatar, Aya and
      Al-Dhabyani, Walid and
      Hamed, Emhemed S. and
      Shatnawi, Sara and
      Alwajih, Fakhraddin and
      Elkhidir, Khalid and
      Alasmari, Ashwag and
      Gerrio, Abdurrahman and
      Alshahri, Omar Said and
      Elmadany, AbdelRahim A. and
      Berrada, Ismail and
      Al-kathiri, Amir Azad Adli and
      Zaraket, Fadi and
      Jarrar, Mustafa and
      EL Hadj, Yahya Mohamed and
      Alhuzali, Hassan and
      Abdul-Mageed, Muhammad",
    editor = "Liakata, Maria and
      Moreira, Viviane P. and
      Zhang, Jiajun and
      Jurgens, David",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-long.1503/",
    pages = "32567--32592",
    ISBN = "979-8-89176-390-6"
}
```

For questions, corrections, or feedback, please open an issue in this repository.
