Gsoc26h development #54
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| # Python | ||
| __pycache__/ | ||
| *.py[cod] | ||
| *.egg-info/ | ||
| dist/ | ||
| build/ | ||
| .eggs/ | ||
| *.egg | ||
| .env | ||
| .venv/ | ||
| venv/ | ||
|
|
||
| # Outputs (large files — don't commit) | ||
| outputs/models/ | ||
| outputs/logs/ | ||
| outputs/results/*.json | ||
| *.bin | ||
| *.safetensors | ||
| *.pt | ||
| *.pth | ||
|
|
||
| # Data (too large for git — use DVC or Drive links) | ||
| data/raw/ | ||
| data/processed/ | ||
| data/training/ | ||
| data/feedback/ | ||
|
|
||
| # Keep ontology properties (small, curated) | ||
| !data/ontology/dbpedia_properties.json | ||
| !data/ontology/property_descriptions_hi.json | ||
|
|
||
| # Jupyter | ||
| .ipynb_checkpoints/ | ||
| *.ipynb_checkpoints | ||
|
|
||
| # HuggingFace cache | ||
| .cache/ | ||
| *.hf_cache/ | ||
|
|
||
| # Streamlit | ||
| .streamlit/ | ||
|
|
||
| # Colab credentials | ||
| *.json | ||
| !data/ontology/*.json | ||
| !configs/*.json | ||
|
|
||
| # OS | ||
| .DS_Store | ||
| Thumbs.db | ||
|
|
||
| # IDE | ||
| .vscode/ | ||
| .idea/ | ||
| *.swp |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,291 @@ | ||
| # DBpedia Hindi Chapter — Neural Relational Triple Extraction | ||
|
|
||
| > Building a Hindi knowledge graph from Wikipedia, one sentence at a time. | ||
|
|
||
| []() | ||
| [](https://www.python.org/) | ||
| [](https://www.dbpedia.org/) | ||
| []() | ||
|
|
||
| --- | ||
|
|
||
| ## 🌏 The Problem | ||
|
|
||
| Hindi is spoken by over **600 million people**. Hindi Wikipedia has **160,000+ articles** packed with factual knowledge. Yet, when researchers, search engines, or AI systems need *structured* Hindi facts, they hit a wall. | ||
|
|
||
| DBpedia is the world's largest open knowledge graph extracted from Wikipedia — but its **Hindi chapter is sparse**. Why? Because the relational knowledge sitting inside Hindi Wikipedia articles is locked in **free text**, not structured infoboxes: | ||
|
|
||
| > *"ताज महल का निर्माण शाहजहाँ ने करवाया था।"* | ||
| > (The Taj Mahal was built by Shah Jahan.) | ||
|
|
||
| A human reads this and knows: **Taj Mahal — `builder` → Shah Jahan**. | ||
|
|
||
| A machine, looking for an infobox or a clean table? Finds nothing. | ||
|
|
||
| **Result:** A massive gap between what Hindi Wikipedia *contains* and what Hindi DBpedia *structures*. | ||
|
|
||
| --- | ||
|
|
||
| ## 🎯 What This Project Does | ||
|
|
||
| This project closes that gap by building an **end-to-end pipeline** that reads free-text Hindi sentences and outputs DBpedia-compatible structured triples — ready to be ingested into the knowledge graph. | ||
|
|
||
| ``` | ||
| Hindi Sentence → Extract → Align to Ontology → Validate → RDF Triple | ||
| ``` | ||
|
Review comment on lines +33 to +35: Add language identifiers to fenced code blocks. At Line 33, Line 65, Line 154, and Line 229, fenced blocks are missing a language tag. Also applies to: 65-110, 154-177, 229-249. markdownlint-cli2 (0.22.1): [warning] 33-33: Fenced code blocks should have a language specified (MD040, fenced-code-language) |
||
|
|
||
| For the example above, the pipeline produces: | ||
|
|
||
| ```turtle | ||
| :Taj_Mahal dbo:builder :Shah_Jahan . | ||
| ``` | ||
|
|
||
| That's a structured fact a machine can query, reason over, and link to other knowledge. | ||
|
|
||
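A minimal sketch of how such a triple could be written out as Turtle, using only string formatting. The `dbo:` namespace is the standard DBpedia ontology IRI; the resource base IRI and both helper names (`to_resource_name`, `to_turtle`) are assumptions made for illustration, not part of the project's code.

```python
# dbo: is the standard DBpedia ontology namespace.
DBO = "http://dbpedia.org/ontology/"
# Assumption for this sketch: resources live under the Hindi chapter base IRI.
RES = "http://hi.dbpedia.org/resource/"


def to_resource_name(label: str) -> str:
    """Turn an entity label into a Wikipedia-style name (spaces -> underscores)."""
    return "_".join(label.strip().split())


def to_turtle(subject: str, prop: str, obj: str) -> str:
    """Serialize one (subject, property, object) triple as a small Turtle document."""
    lines = [
        f"@prefix : <{RES}> .",
        f"@prefix dbo: <{DBO}> .",
        "",
        f":{to_resource_name(subject)} dbo:{prop} :{to_resource_name(obj)} .",
    ]
    return "\n".join(lines)


print(to_turtle("Taj Mahal", "builder", "Shah Jahan"))
```

Labels containing characters that are not valid in Turtle local names would need escaping or full IRIs; a dedicated RDF library would normally handle that.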
| --- | ||
|
|
||
| ## 🧩 Why Hindi Is Hard | ||
|
|
||
| Generic information extraction tools fail on Hindi for specific linguistic reasons: | ||
|
|
||
| | Hindi Feature | What It Means | Why It Breaks Generic IE | | ||
| |---|---|---| | ||
| | **Free word order** | Subject, object, verb can appear in any order | Pattern-based extractors miss arguments | | ||
| | **Postposition system** | Tiny words like *का, ने, में* carry the relation | English-trained models ignore them | | ||
| | **Pro-drop** | Subjects often implicit | Extractors return empty subjects | | ||
| | **Verb-final syntax** | The action comes at the end | Left-to-right parsers miss the predicate | | ||
| | **Copula relations** | *"है"* (is) hides the real relationship | Models extract *"है"* as the predicate — meaningless | | ||
|
|
||
| These aren't edge cases. They show up in nearly every Hindi Wikipedia sentence. | ||
|
|
||
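To make the postposition row concrete, here is a small sketch that separates a trailing postposition from an argument span, so that the entity and the relation-bearing marker can be handled separately. The postposition list is a short illustrative sample and `split_postposition` is a hypothetical helper, not a function from IndIE or this repository.

```python
# Illustrative sample of common Hindi postpositions (not exhaustive).
POSTPOSITIONS = {"का", "की", "के", "ने", "में", "को", "से", "पर"}


def split_postposition(span: str):
    """Split a span into (entity, postposition); postposition is None if absent."""
    tokens = span.split()
    # Only strip when something remains to serve as the entity.
    if len(tokens) > 1 and tokens[-1] in POSTPOSITIONS:
        return " ".join(tokens[:-1]), tokens[-1]
    return " ".join(tokens), None


for span in ["ताज महल का", "शाहजहाँ ने", "ताज महल"]:
    print(split_postposition(span))
```

Whitespace tokenization is a simplification; postpositions attached to pronouns or written without a space would need a proper tokenizer.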
| --- | ||
|
|
||
| ## 🏗️ The Pipeline | ||
|
|
||
| ``` | ||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ HINDI WIKIPEDIA SENTENCE │ | ||
| │ "ताज महल का निर्माण शाहजहाँ ने करवाया था।" │ | ||
| └─────────────────────────────┬────────────────────────────────────┘ | ||
| ▼ | ||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ 1️⃣ RULE-BASED EXTRACTION (IndIE) │ | ||
| │ Identifies candidate arguments using Hindi syntax rules │ | ||
| │ Output: subject = "ताज महल", object = "शाहजहाँ" │ | ||
| └─────────────────────────────┬────────────────────────────────────┘ | ||
| ▼ | ||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ 2️⃣ FINE-TUNED SMALL LANGUAGE MODEL (Gemma-3 + LoRA) │ | ||
| │ Predicts the relation, conditioned on the arguments │ | ||
| │ Output: predicate = "का निर्माण" │ | ||
| └─────────────────────────────┬────────────────────────────────────┘ | ||
| ▼ | ||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ 3️⃣ ONTOLOGY ALIGNMENT LAYER │ | ||
| │ Maps the Hindi surface predicate to a DBpedia property │ | ||
| │ using multilingual sentence embeddings │ | ||
| │ Output: dbo:builder (confidence: 0.87) │ | ||
| └─────────────────────────────┬────────────────────────────────────┘ | ||
| ▼ | ||
| ┌───────────┴───────────┐ | ||
| │ Confidence > 0.45? │ | ||
| └───────────┬───────────┘ | ||
| │ | ||
| ┌───────────────┴───────────────┐ | ||
| ▼ ▼ | ||
| ┌─────────────────┐ ┌──────────────────────┐ | ||
| │ ✅ Accepted │ │ 🧑 HUMAN REVIEW │ | ||
| │ Direct to KG │ │ Reviewer corrects │ | ||
| └────────┬────────┘ │ → retraining data │ | ||
| │ └──────────┬───────────┘ | ||
| │ │ | ||
| └───────────────┬───────────────┘ | ||
| ▼ | ||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ 4️⃣ RDF SERIALIZATION │ | ||
| │ :Taj_Mahal dbo:builder :Shah_Jahan . │ | ||
| └─────────────────────────────┬────────────────────────────────────┘ | ||
| ▼ | ||
| 📊 DBpedia Hindi Knowledge Graph | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## 💡 The Key Ideas | ||
|
|
||
| ### 1. Predicates are the hard part | ||
|
|
||
| Multilingual language models, even in zero-shot mode, are surprisingly good at identifying *who* and *what* in a Hindi sentence. They struggle with the **relation between them**. | ||
|
|
||
| This insight — that subjects and objects come for free, but predicates need work — shapes the entire architecture. Fine-tuning focuses where it matters; the easy slots are left alone. | ||
|
|
||
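One way to picture "predicates conditioned on arguments" is a prompt that fixes the subject and object and asks only for the relation. The template wording and the function name `build_predicate_prompt` below are hypothetical; the actual prompt format used for fine-tuning is not specified here.

```python
# Hypothetical prompt template: the arguments are given, only the relation is open.
PROMPT_TEMPLATE = (
    "Sentence: {sentence}\n"
    "Subject: {subject}\n"
    "Object: {obj}\n"
    "Relation (Hindi surface form):"
)


def build_predicate_prompt(sentence: str, subject: str, obj: str) -> str:
    """Build a relation-only prompt conditioned on already-extracted arguments."""
    return PROMPT_TEMPLATE.format(sentence=sentence, subject=subject, obj=obj)


prompt = build_predicate_prompt(
    "ताज महल का निर्माण शाहजहाँ ने करवाया था।", "ताज महल", "शाहजहाँ"
)
print(prompt)
```

Because the subject and object slots are already filled, the model's output space shrinks to a short relation phrase, which is the slot the text identifies as the hard one.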
| ### 2. Don't ask the model to do schema alignment | ||
|
|
||
| Asking a language model to output `dbo:birthPlace` from raw Hindi text is asking it to do two things at once: understand the relation **and** know the DBpedia ontology. Models hallucinate properties that don't exist. | ||
|
|
||
| A separate **ontology alignment layer** handles the second task. The model produces a Hindi surface form ("का जन्म"); a multilingual embedding model maps it to the right DBpedia property via cosine similarity. Each component does one job well. | ||
|
|
||
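The alignment step can be sketched as a nearest-neighbour lookup under cosine similarity. The three-dimensional vectors below are toy stand-ins for real multilingual sentence embeddings, and `align` is a hypothetical helper; the embedding model itself is outside this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(predicate_vec, property_vecs):
    """Return (best_property, similarity) for a predicate embedding."""
    scores = {name: cosine(predicate_vec, vec) for name, vec in property_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy vectors; in practice these would come from an embedding model.
PROPERTY_VECS = {
    "dbo:builder": [0.9, 0.1, 0.0],
    "dbo:birthPlace": [0.1, 0.9, 0.1],
    "dbo:capital": [0.0, 0.2, 0.9],
}

prop, score = align([0.8, 0.2, 0.1], PROPERTY_VECS)
print(prop, round(score, 2))
```

Since the candidate set is restricted to known properties, this layer cannot return a property that does not exist, which is the failure the text attributes to direct generation.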
| ### 3. Humans review what matters | ||
|
|
||
| A naive review pipeline asks a human to check every extraction — expensive and boring. A smarter one asks humans only when the model is **uncertain**. Confidence scores from the alignment layer become the trigger: high confidence → straight into the KG, low confidence → human review queue. | ||
|
|
||
| Every correction becomes new training data. The system gets better with use. | ||
|
|
||
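A minimal sketch of confidence-based routing, using the 0.45 threshold shown in the pipeline diagram. The function names and the fields of the correction record are assumptions; only the JSONL format for reviewer corrections is taken from the repository layout.

```python
import json

THRESHOLD = 0.45  # threshold shown in the pipeline diagram


def route(confidence: float, threshold: float = THRESHOLD) -> str:
    """Accept confident alignments directly; send the rest to human review."""
    return "accept" if confidence > threshold else "review"


def correction_record(sentence: str, predicted: str, corrected: str) -> str:
    """Serialize one reviewer correction as a single JSONL line (fields are illustrative)."""
    record = {"sentence": sentence, "predicted": predicted, "corrected": corrected}
    # ensure_ascii=False keeps Devanagari text readable in the file.
    return json.dumps(record, ensure_ascii=False)


print(route(0.87))
print(route(0.30))
print(correction_record("ताज महल का निर्माण शाहजहाँ ने करवाया था।", "dbo:architect", "dbo:builder"))
```

Each line produced by `correction_record` could be appended to a feedback file and later converted into training pairs.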
| --- | ||
|
|
||
| ## 🔬 Error Taxonomy | ||
|
|
||
| When extraction fails, *how* it fails matters more than aggregate accuracy. Five failure modes are tracked across every evaluation: | ||
|
|
||
| | Error Type | What Happens | Example | | ||
| |---|---|---| | ||
| | **Predicate Normalization Failure** | Surface Hindi extracted instead of DBpedia property | `का निर्माण` → should be `dbo:builder` | | ||
| | **Language Mixing** | Model outputs English on Hindi input | `was born in` → should be `dbo:birthPlace` | | ||
| | **Implicit Relation Error** | Copula extracted as predicate | `है` → should be `dbo:capital` | | ||
| | **Argument Span Error** | Wrong subject/object boundaries | Captures *"ताज महल का"* instead of *"ताज महल"* | | ||
| | **Missing Triple** | No extraction at all | Sentence skipped entirely | | ||
|
|
||
| This taxonomy turns *"the model is 60% accurate"* into actionable diagnostics: *"language mixing dropped from 40% to 5%, but argument span errors are still our biggest loss."* | ||
|
|
||
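As a rough illustration of how predicate-level failure modes could be tagged automatically, the sketch below classifies a predicted predicate string and tallies a per-error-type breakdown. The heuristics, the copula list, and the label names are assumptions for this example; argument span errors are omitted because detecting them requires comparison against gold spans.

```python
from collections import Counter

# Illustrative copula forms; a real list would be more complete.
COPULAS = {"है", "हैं", "था", "थी", "थे"}


def has_devanagari(text: str) -> bool:
    """True if any character falls in the Devanagari Unicode block."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


def classify_predicate(predicate):
    """Return an error label, or None if the predicate looks like a dbo: property."""
    if predicate is None or not predicate.strip():
        return "missing_triple"
    p = predicate.strip()
    if p in COPULAS:
        return "implicit_relation_error"
    if has_devanagari(p):
        return "predicate_normalization_failure"
    if not p.startswith("dbo:"):
        return "language_mixing"
    return None


def breakdown(predicates):
    """Map each error label to (count, percentage of all predictions)."""
    counts = Counter(
        label for label in map(classify_predicate, predicates) if label is not None
    )
    total = len(predicates)
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}


print(breakdown(["का निर्माण", "was born in", "है", "dbo:builder", None]))
```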
| --- | ||
|
|
||
| ## 📦 What's Inside | ||
|
|
||
| ``` | ||
| GSoC26_H/ | ||
| ├── 📄 README.md | ||
| ├── 📄 requirements.txt | ||
| │ | ||
| ├── 📓 notebooks/ ← Phase-by-phase Colab notebooks | ||
| │ | ||
| ├── 🐍 src/ | ||
| │ ├── baseline/ ← Rule-based and zero-shot baselines | ||
| │ ├── ontology/ ← Hindi → DBpedia property alignment | ||
| │ ├── finetune/ ← LoRA training pipeline | ||
| │ ├── evaluation/ ← Error taxonomy + metrics | ||
| │ └── pipeline/ ← End-to-end extractor + RDF export | ||
| │ | ||
| ├── 🖥️ hitl/ ← Human-in-the-loop Streamlit interface | ||
| │ | ||
| ├── 📊 data/ | ||
| │ ├── ontology/ ← DBpedia properties with Hindi surface forms | ||
| │ ├── training/ ← Instruction-tuning pairs | ||
| │ └── feedback/ ← Reviewer corrections (JSONL) | ||
| │ | ||
| ├── ⚙️ configs/ ← LoRA configs, hyperparameters | ||
| └── 📈 results/ ← Evaluation outputs, ablation tables | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## 🚀 Getting Started | ||
|
|
||
| ### Run in Google Colab (no setup) | ||
|
|
||
| The Phase 1 baseline notebook reproduces all three baselines on the full Hindi-BenchIE benchmark using a free T4 GPU. | ||
|
|
||
| ```python | ||
| !git clone https://github.com/dbpedia/neural-extraction-framework.git | ||
| %cd neural-extraction-framework/GSoC26_H | ||
| !pip install -r requirements.txt | ||
| # Open notebooks/01_week1_baselines.ipynb and run all cells | ||
| ``` | ||
|
|
||
| ### Run locally | ||
|
|
||
| ```bash | ||
| git clone https://github.com/dbpedia/neural-extraction-framework.git | ||
| cd neural-extraction-framework/GSoC26_H | ||
|
|
||
| python -m venv venv | ||
| source venv/bin/activate # Windows: venv\Scripts\activate | ||
| pip install -r requirements.txt | ||
|
|
||
| jupyter notebook notebooks/01_week1_baselines.ipynb | ||
| ``` | ||
|
|
||
| **Hardware:** Phase 1 runs on a free Colab T4. Phase 2 (LoRA fine-tuning) needs ~10 GB of VRAM; a Colab T4 is sufficient with 4-bit quantization. | ||
|
|
||
| --- | ||
|
|
||
| ## 📊 Evaluation | ||
|
|
||
| All systems are evaluated on **Hindi-BenchIE** using the official `BenchIEDetailedComparator` from the GSoC 2025 Hindi pipeline. This ensures numbers are directly comparable across years. | ||
|
|
||
| The evaluation uses **fact-cluster matching** instead of string overlap: | ||
| - Each gold annotation lists *all* valid surface forms of the same fact | ||
| - A prediction is correct if it matches *any* form in the cluster | ||
| - This avoids penalizing valid paraphrases, a known weakness of older OIE benchmarks | ||
|
|
||
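The matching rule in the list above can be sketched as follows. This is a simplified illustration, not the `BenchIEDetailedComparator`: precision is computed over predictions, recall over gold clusters, and the function name and example triples are made up for the sketch.

```python
def score(predictions, gold_clusters):
    """Fact-cluster scoring.

    predictions: list of (subject, predicate, object) tuples
    gold_clusters: list of sets, each holding all valid forms of one fact
    """
    covered = set()  # indices of gold clusters hit by at least one prediction
    true_pos = 0
    for pred in predictions:
        for idx, cluster in enumerate(gold_clusters):
            if pred in cluster:  # matching any form in the cluster counts
                true_pos += 1
                covered.add(idx)
                break
    precision = true_pos / len(predictions) if predictions else 0.0
    recall = len(covered) / len(gold_clusters) if gold_clusters else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [
    {("ताज महल", "का निर्माण", "शाहजहाँ"), ("ताज महल", "निर्माण करवाया", "शाहजहाँ")},
    {("ताज महल", "में स्थित", "आगरा")},
]
preds = [
    ("ताज महल", "निर्माण करवाया", "शाहजहाँ"),  # a paraphrase listed in cluster 0
    ("ताज महल", "है", "शाहजहाँ"),  # matches no cluster
]
print(score(preds, gold))
```

The first prediction is credited even though it differs from the first listed gold form, which is the behaviour that plain string overlap against a single reference would miss.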
| Results are reported as: | ||
| - **Aggregate**: Precision, Recall, F1 across the full benchmark | ||
| - **Per-system slot accuracy**: Subject / Predicate / Object separately | ||
| - **Per-error-type breakdown**: Counts and percentages for each of the 5 failure modes | ||
|
|
||
| --- | ||
|
|
||
| ## 🛣️ Roadmap | ||
|
|
||
| ``` | ||
| Phase 1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Baselines & Ablation | ||
| ✅ Reproduce IndIE, Gemma-3 zero-shot, GSoC25_H pipelines | ||
| ✅ Aggregate P/R/F1 on Hindi-BenchIE | ||
| 🔄 Per-error-type breakdown across all systems | ||
|
|
||
| Phase 2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Fine-Tuning + Ontology Alignment | ||
| ⏳ Build training dataset from BenchIE gold annotations | ||
| ⏳ LoRA fine-tune Gemma-3 with iterative slot prompting | ||
| ⏳ Extend ontology alignment to full DBpedia property coverage | ||
|
|
||
| Phase 3 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Human-in-the-Loop Feedback | ||
| ⏳ Production Streamlit annotation interface | ||
| ⏳ Confidence-based review queue | ||
| ⏳ Annotation round on 50–100 sentences | ||
|
|
||
| Phase 4 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Iteration & Release | ||
| ⏳ Retrain with reviewer-corrected data | ||
| ⏳ Final pipeline, dataset, evaluation report | ||
| ⏳ Integration documentation for DBpedia maintainers | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## 📚 Built On | ||
|
|
||
| This work builds on four pieces of prior work: | ||
|
|
||
| - **[BenchIE](https://aclanthology.org/2022.acl-long.307/)** — Fact-cluster evaluation that doesn't get fooled by surface paraphrase | ||
| - **[MILIE](https://aclanthology.org/2022.acl-long.555/)** — Iterative slot extraction; predicates conditioned on arguments | ||
| - **[OpenIE Survey 2024](https://aclanthology.org/2024.findings-emnlp.222/)** — State of the field, where neural methods help and where they don't | ||
| - **IndIE** — Rule-based Hindi OIE; provides the strong argument-extraction foundation | ||
|
|
||
| And on the prior DBpedia GSoC Hindi work in [`GSoC24_H/`](../GSoC24_H/) and [`GSoC25_H/`](../GSoC25_H/). | ||
|
|
||
| --- | ||
|
|
||
| ## 🌟 Why It Matters | ||
|
|
||
| Every triple extracted is one more queryable fact about Hindi-speaking history, geography, science, and culture — added to a knowledge graph used by researchers, search engines, voice assistants, and AI systems worldwide. | ||
|
|
||
| A pipeline that works for Hindi can be adapted for **other low-resource languages** such as Bengali, Tamil, Telugu, and Marathi: hundreds of millions more speakers, and more low-resource Wikipedias waiting to become structured knowledge. | ||
|
|
||
| Closing the Hindi gap is the first step. | ||
|
|
||
| --- | ||
|
|
||
| ## 🤝 Contributing | ||
|
|
||
| This project is part of Google Summer of Code 2026 with the DBpedia Association. | ||
|
|
||
| - 💬 **Forum:** [forum.dbpedia.org](https://forum.dbpedia.org/) | ||
| - 💼 **Slack:** [dbpedia.slack.com](https://dbpedia.slack.com/) | ||
| - 🌐 **Project home:** [dbpedia.org](https://www.dbpedia.org/) | ||
|
|
||
| Contributions, issues, and discussion welcome. | ||
|
|
||
| --- | ||
|
|
||
| <p align="center"> | ||
| <i>Part of the DBpedia Neural Extraction Framework</i><br> | ||
| <a href="https://www.dbpedia.org/">🏛️ DBpedia Association</a> | ||
| </p> | ||
Fix empty badge links (broken markdown links).
At Line 5 and Line 8, the badge links use empty `()` targets, which creates broken links in rendered docs. Please either add valid URLs or remove the link wrappers.
markdownlint-cli2 (0.22.1):
[warning] 5-5: No empty links (MD042, no-empty-links)
[warning] 8-8: No empty links (MD042, no-empty-links)