
Gsoc26h development #54

Open
singhhnitin wants to merge 20 commits into dbpedia:main from singhhnitin:gsoc26h-development

Conversation


@singhhnitin singhhnitin commented May 30, 2026


Summary by CodeRabbit

  • New Features
    • Added a zero-shot Gemma Hindi triple-extraction baseline with simultaneous and iterative prompting.
    • Added ontology predicate alignment with confidence scoring and a Streamlit human-in-the-loop review workflow.
    • Added a CLI tool to generate synthetic “noisy” training data for staged fine-tuning.
  • Documentation
    • Added detailed project and HITL interface READMEs.
  • Data
    • Added curated Hindi DBpedia properties JSON.
  • Configuration
    • Added LoRA/QLoRA training configuration and pinned dependency lists.
  • Evaluation
    • Added an error taxonomy, baseline workflow notebook, and Phase‑1 ablation report.
  • Chores
    • Updated git ignore rules for common local/ML artifacts.


coderabbitai Bot commented May 30, 2026


Review Change Stack

Warning

Review limit reached

@singhhnitin, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 44 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer to the docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 615373ce-4de1-4e7a-a286-262f1cc0f0d9

📥 Commits

Reviewing files that changed from the base of the PR and between 3a1e89d and 8be47aa.

📒 Files selected for processing (5)
  • GSoC26_H/configs/lora_config.yaml
  • GSoC26_H/data/GSoC26_H/data/ground_truth/ground_truth_benchie_triples.jsonl
  • GSoC26_H/data/ontology/dbpedia_properties.json
  • GSoC26_H/src/evaluation/error_taxonomy.py
  • GSoC26_H/src/ontology/alignment.py
📝 Walkthrough

Walkthrough

Adds project scaffolding, Hindi→DBpedia extraction and alignment code, a Phase 1 Colab evaluation notebook, a Streamlit HITL review app, and a CLI for generating and validating synthetic noisy training data.

Changes

Hindi Triple Extraction Pipeline

| Layer | File(s) | Summary |
| --- | --- | --- |
| Repository setup and training inputs | GSoC26_H/.gitignore, requirements.txt, GSoC26_H/requirements.txt, GSoC26_H/configs/lora_config.yaml, GSoC26_H/data/ontology/dbpedia_properties.json | Repository ignore rules, dependency lists, LoRA/QLoRA training configuration, and curated DBpedia property metadata are added. |
| Project and results documentation | GSoC26_H/README.md, GSoC26_H/results/*.md | The main project README and result writeups describe the pipeline, evaluation framing, model comparisons, ground-truth creation, noisy dataset generation, and Wikipedia-based synthetic data collection. |
| Zero-shot extraction and evaluation model | GSoC26_H/src/baseline/gemma_zero_shot.py, GSoC26_H/src/evaluation/error_taxonomy.py, GSoC26_H/src/ontology/alignment.py | Gemma zero-shot extraction, error classification, and ontology alignment code define the model output shape, parsing, taxonomy, alignment scoring, diagnostics, and persistence. |
| Phase 1 baseline notebook | GSoC26_H/notebooks/01_week1_baselines.ipynb | The Colab notebook mounts Drive, installs dependencies, clones the comparison repo, runs three baselines, and aggregates evaluation output into a CSV table. |
| HITL review app | GSoC26_H/hitl/README.md, GSoC26_H/hitl/hitl_app.py, GSoC26_H/hitl/requirement.txt | The Streamlit review app, its README, and its dependency file implement queue-based accept/modify/reject/skip review for alignment suggestions. |
| Synthetic noisy data generation | GSoC26_H/src/finetune/generate_noisy_data.py | The noisy-data CLI builds low-score seeds, formats chat history, calls an external model in parallel, validates generated JSONL, and prints output statistics. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • mommi84
  • smilingprogrammer
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 67.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title is too generic and only names the project branch, so it does not clearly describe the main change. | Use a concise, specific title that summarizes the primary change, such as adding the Hindi triple-extraction pipeline and related docs/configs. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 14

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@GSoC26_H/configs/lora_config.yaml`:
- Around line 40-44: The config sets load_best_model_at_end: true but lacks
compatible strategies; add evaluation_strategy: "steps" (or "epoch") and a
matching save_strategy: "steps" (or "epoch") or "best" so checkpointing is
compatible with load_best_model_at_end, optionally set metric_for_best_model if
you need a metric other than the default "loss", and ensure save_steps is a
multiple of eval_steps (keep save_steps: 50 and eval_steps: 50 or adjust
accordingly); leave save_total_limit as-is or adjust if you want fewer retained
checkpoints.
- Line 10: The config currently sets bnb_4bit_compute_dtype: "bfloat16" and
training.bf16: true which is incompatible with NVIDIA T4 GPUs; update
bnb_4bit_compute_dtype to "float16", set training.fp16: true and set
training.bf16: false (or remove bf16) so the code uses FP16 on T4; locate these
keys (bnb_4bit_compute_dtype, training.bf16, training.fp16) in the
lora_config.yaml and make the value changes accordingly.

In `@GSoC26_H/notebooks/01_week1_baselines.ipynb`:
- Around line 127-135: The clone step currently checks out the default branch
HEAD which can drift; update the cloning logic around gsoc_path and the
neural-extraction-framework checkout so it pins to a known revision: add a
constant commit/tag (e.g. GSOC25H_COMMIT) and after cloning run a git checkout
of that commit (or use git clone --branch <tag> --single-branch) against the
neural-extraction-framework repo, ensuring gsoc25h (which uses PROJECT) points
to the pinned revision; keep the existing exists-check and print messages but
ensure the repo is forced/checked out to the pinned commit when cloning or when
already present.
- Around line 324-331: The parser currently swallows all exceptions with a bare
"except Exception: pass" and returns unvalidated JSON directly from json.loads,
so update the block that cleans and extracts the JSON (the
re.sub/re.search/match logic and json.loads call) to only catch
json.JSONDecodeError, validate that the decoded payload is a dict (assign to a
local payload variable), and return a sanitized dict with 'subject', 'relation',
'object' coerced to strings (using payload.get(..., '') and str(...)) plus
'raw': response; otherwise fall back to the empty-triple response as before—this
ensures you don't hide parser bugs and that later .strip() calls won't fail on
non-string fields.
- Around line 309-319: The extract_triple function currently tokenizes
PROMPT.format(sentence=...) directly, which bypasses Gemma’s instruction-tuned (-it)
chat/assistant formatting; update the input creation to use the tokenizer's chat
template API (e.g., tokenizer.apply_chat_template or similar) with
add_generation_prompt=True and tokenize=True for the "google/gemma-3-1b-it"
checkpoint, then move the resulting tensors to model.device before calling
model.generate (leave the generate(...) call unchanged); reference
extract_triple, PROMPT, tokenizer.apply_chat_template, and model.generate when
making the change.

In `@GSoC26_H/README.md`:
- Around line 5-8: Fix the broken badge links in the README by removing the
empty link wrappers or replacing them with valid targets: update the markdown
for the [![Status] badge and the [![License] badge (the lines showing
[![Status](... )() and [![License](... )()]) so they either point to proper URLs
or are converted to plain images without trailing () link wrappers; ensure the
other badges remain unchanged.
- Around line 33-35: Update the fenced code blocks in the README that are
missing language tags: add ```text``` for the plain pipeline line "Hindi
Sentence  →  Extract  →  Align to Ontology  →  Validate  →  RDF Triple" and
other plain text blocks, use ```bash``` for shell/command examples, and
```turtle``` for RDF/Turtle snippets found in the blocks around the referenced
ranges (lines shown in the review). Ensure each opening triple-backtick includes
the appropriate language identifier so linting and syntax highlighting work
correctly.

In `@GSoC26_H/requirements.txt`:
- Around line 2-4: Update the vulnerable dependency floors in requirements.txt
by bumping torch and transformers to known patched/minimum safe versions (e.g.,
replace torch>=2.1.0 and transformers>=4.40.0 with pinned or raised minima such
as torch>=2.1.1+ or an agreed-tested exact version and transformers>=4.40.1 or a
tested pin), keep or update peft as needed, and add a short comment in
requirements.txt documenting the security baseline and the advisories addressed;
ensure you run your dependency resolver/test suite after changing the entries to
confirm compatibility.

In `@GSoC26_H/src/baseline/gemma_zero_shot.py`:
- Line 303: The code currently calls text = re.sub(r"```(?:json)?", "",
text).strip().rstrip("```").strip(), but rstrip("```") is misleading and
redundant because re.sub already removes the triple-backtick fences and rstrip
treats its argument as a character set; update the assignment (in
gemma_zero_shot.py where variable text is set) to remove the rstrip("```") and
leave a single .strip() so the line becomes a simple re.sub(...).strip().
- Around line 274-289: The triple parsing loop in gemma_zero_shot.py can crash
when a non-dict element is present because HindiTriple(**t) raises TypeError
which isn’t caught; update the loop that builds triples (the for t in
data["triples"] section) to first verify each element is a dict (or convert/skip
non-dict) and catch both ValidationError and TypeError around HindiTriple(**t)
so malformed or non-dict entries are skipped safely; ensure the same defensive
check is applied wherever HindiTriple(...) is constructed (including the
fallback branch that builds a single HindiTriple from
subject/predicate/object_).
- Around line 207-215: _build_chat_prompt currently emits a {"role":"system",
...} entry which breaks Gemma 1.x/2.x because apply_chat_template rejects a
system role; instead build the chat as only user messages by embedding
SYSTEM_PROMPT into the first user content (e.g., concatenate SYSTEM_PROMPT +
"\n\n" + user_message) and remove the system-role object so the list contains
only {"role":"user", "content": ...} before calling
self._tokenizer.apply_chat_template(chat, tokenize=False,
add_generation_prompt=True); update references to SYSTEM_PROMPT and the chat
variable in the _build_chat_prompt method accordingly.
- Around line 190-192: The tokenization in _generate re-tokenizes the string
returned by _build_chat_prompt (which calls apply_chat_template(...,
tokenize=False, add_generation_prompt=True)) and can duplicate BOS; update the
tokenizer call that produces inputs (the call assigning to inputs in _generate)
to pass add_special_tokens=False so it does not re-add model special tokens when
tokenizing the already-formatted prompt.
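
The `rstrip("```")` finding above rests on the fact that `str.rstrip` treats its argument as a set of characters rather than a literal suffix. A minimal sketch of the simplified cleanup, assuming a hypothetical helper name `strip_fences` that does not appear in the PR:

```python
import re


def strip_fences(text: str) -> str:
    """Remove triple-backtick fence markers (optionally tagged json) and trim whitespace."""
    return re.sub(r"```(?:json)?", "", text).strip()


fenced = '```json\n{"subject": "a"}\n```'
print(strip_fences(fenced))

# rstrip receives a character set, so it removes every trailing backtick,
# not one literal three-backtick suffix.
print("value`````".rstrip("```"))
```

Because the regular expression has already removed every fence marker, the extra `rstrip` call adds nothing and a single `.strip()` is enough.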

In `@GSoC26_H/src/evaluation/error_taxonomy.py`:
- Around line 72-75: classify_error currently mutates the same ExtractionResult
and can leave stale derived fields; before computing effective_pred (and before
any early returns) clear/reset derived state like self.error_notes,
self.error_type (and any other fields set by previous classifications) so a
reclassification reflects current predicates/alignment; update the logic in
classify_error (which reads aligned_predicate and pred_predicate) to explicitly
reset those derived fields at the top so save() will not persist contradictory
data after alignment or manual correction.
- Around line 109-115: The current check uses substring matching (any(c in raw
for c in HINDI_COPULAS)) which mislabels longer predicates that merely contain a
copula; change this to test token-level membership instead: first keep the
exact-match check (raw in HINDI_COPULAS) then tokenize raw (e.g., split on
whitespace or use regex word boundaries) and check if any token is equal to a
member of HINDI_COPULAS before setting ErrorType.IMPLICIT_RELATION and composing
self.error_notes referencing self.pred_predicate and self.gold_predicate.
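
The difference between substring and token-level copula matching can be sketched as follows. The contents of `HINDI_COPULAS` are not shown in this review, so the set below is a hypothetical stand-in, and `is_copula_predicate` is an illustrative name rather than a function from the PR:

```python
# Hypothetical stand-in for the PR's HINDI_COPULAS; the real contents may differ.
HINDI_COPULAS = {"है", "हैं", "था", "थी", "थे"}


def is_copula_predicate(raw: str, copulas: set[str] = HINDI_COPULAS) -> bool:
    """Return True if the predicate is a copula or contains one as a whole token."""
    raw = raw.strip()
    if raw in copulas:
        return True
    # Whitespace tokenization: only whole tokens count as a match.
    return any(token in copulas for token in raw.split())


# "born in Hyderabad": the first word begins with the same characters as the copula "है".
predicate = "हैदराबाद में जन्म"
print(any(c in predicate for c in HINDI_COPULAS))  # substring check
print(is_copula_predicate(predicate))               # token-level check
```

In this example the substring check fires on the place name while the token-level check does not, which is the mislabelling the comment describes.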

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 93ae304b-105b-48f9-9ec6-f12f5d11b0b3

📥 Commits

Reviewing files that changed from the base of the PR and between d8b1dd8 and c7a5b79.

📒 Files selected for processing (15)
  • GSoC26_H/.gitignore
  • GSoC26_H/README.md
  • GSoC26_H/configs/lora_config.yaml
  • GSoC26_H/data/ontology/dbpedia_properties.json
  • GSoC26_H/notebooks/01_week1_baselines.ipynb
  • GSoC26_H/requirements.txt
  • GSoC26_H/src/__init__.py
  • GSoC26_H/src/baseline/__init__.py
  • GSoC26_H/src/baseline/gemma_zero_shot.py
  • GSoC26_H/src/evaluation/__init__.py
  • GSoC26_H/src/evaluation/error_taxonomy.py
  • GSoC26_H/src/finetune/__init__.py
  • GSoC26_H/src/ontology/__init__.py
  • GSoC26_H/src/ontology/alignment.py
  • GSoC26_H/src/pipeline/__init__.py

use_4bit: true # QLoRA: 4-bit quantization (required for T4)
bnb_4bit_quant_type: "nf4"
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: "bfloat16"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does NVIDIA T4 (Turing) support native bfloat16 training in PyTorch/Transformers, and what precision settings are recommended for QLoRA on T4?

💡 Result:

The NVIDIA T4 GPU (Turing architecture) does not provide native hardware support for bfloat16 (BF16) [1][2][3]. While PyTorch may allow you to specify dtype=torch.bfloat16 or enable bf16=True in training arguments, this will either trigger an error or fall back to slower, emulated software computations rather than using optimized hardware kernels [4][2].

For QLoRA training on an NVIDIA T4, the recommended precision settings are as follows:

  1. Computation Precision: Use torch.float16 (FP16) instead of bfloat16 for the bnb_4bit_compute_dtype parameter in your BitsAndBytesConfig [5][3][6]. Since the T4 lacks native BF16 support, using FP16 is generally faster, more memory-efficient, and serves as the standard, stable choice for this hardware [5][3].
  2. Quantization Type: Use nf4 (Normal Float 4) as the bnb_4bit_quant_type [7][3]. NF4 is specifically designed for neural network weights and provides superior accuracy compared to standard integer-based 4-bit quantization [7][3][8].
  3. Additional Configuration: Enabling bnb_4bit_use_double_quant=True is highly recommended to further reduce memory overhead by quantizing the quantization constants [3][6].

In summary, while BF16 is preferred on newer architectures (Ampere and later) for its numerical stability, FP16 is the correct and most efficient choice for QLoRA fine-tuning on the Turing-based NVIDIA T4 [2][3][9].


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="GSoC26_H/configs/lora_config.yaml"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE"
  exit 1
fi

echo "==== File: $FILE ===="
nl -ba "$FILE" | sed -n '1,120p'

Repository: dbpedia/neural-extraction-framework

Length of output: 191


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="GSoC26_H/configs/lora_config.yaml"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE"
  exit 1
fi

echo "==== File: $FILE (lines 1-120) ===="
cat -n "$FILE" | sed -n '1,120p'

Repository: dbpedia/neural-extraction-framework

Length of output: 3253


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "load_best_model_at_end" -S .
rg -n "lora_config\.ya?ml" -S .
rg -n "bf16|bfloat16|bnb_4bit_compute_dtype" -S .
rg -n "TrainingArguments" -S .

Repository: dbpedia/neural-extraction-framework

Length of output: 149


🏁 Script executed:

#!/bin/bash
set -uo pipefail

ls -R GSoC26_H | head -n 200

rg -n "configs/lora_config\.ya?ml|lora_config\.ya?ml" -S GSoC26_H || true
rg -n "TrainingArguments\(" -S GSoC26_H . || true
rg -n "Trainer\(" -S GSoC26_H . || true
rg -n "bf16|bfloat16" -S GSoC26_H . || true
rg -n "evaluation_strategy|save_strategy" -S GSoC26_H . || true

Repository: dbpedia/neural-extraction-framework

Length of output: 26298


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Focus only on Python source to avoid ipynb noise
rg -n --type py "lora_config\.ya?ml|safe_load|yaml\.safe_load|load_best_model_at_end|TrainingArguments\(|Trainer\(" GSoC26_H/src || true
rg -n --type py "eval_steps|save_steps|evaluation_strategy|save_strategy|bf16|bfloat16|fp16" GSoC26_H/src || true

echo "---- GSoC26_H/README.md ----"
sed -n '1,200p' GSoC26_H/README.md || true

echo "---- finetune dir listing ----"
find GSoC26_H/src/finetune -maxdepth 3 -type f -print
for f in $(find GSoC26_H/src/finetune -type f -name "*.py" -print); do
  echo "==== $f (head) ===="
  sed -n '1,200p' "$f"
done

Repository: dbpedia/neural-extraction-framework

Length of output: 9447


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search only in Python sources to find the actual training entrypoint and TrainingArguments wiring
rg -n --type py "TrainingArguments\(" . || true
rg -n --type py "Trainer\(" . || true
rg -n --type py "evaluation_strategy|save_strategy|eval_strategy" . || true
rg -n --type py "load_best_model_at_end" . || true

# Find any direct reference to the lora config file / YAML keys
rg -n --type py "lora_config\.ya?ml|bnb_4bit_compute_dtype|bf16:|training\.bf16|bnb_4bit_compute_dtype" . || true

Repository: dbpedia/neural-extraction-framework

Length of output: 181


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Broader searches across repo (including ipynb) to find trainer wiring and config usage
rg -n "lora_config\.ya?ml" . || true
rg -n "load_best_model_at_end" . || true
rg -n "evaluation_strategy|save_strategy" . || true
rg -n "SFTTrainer|trl\.SFTTrainer|Seq2SeqTrainingArguments|Trainer\(" . || true
rg -n "from transformers import TrainingArguments|TrainingArguments" . || true
rg -n "bf16|bfloat16|fp16" GSoC26_H -S . || true

Repository: dbpedia/neural-extraction-framework

Length of output: 50391


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Where load_best_model_at_end is mentioned (code + notebooks)
echo "== load_best_model_at_end occurrences =="
rg -n --max-count 50 "load_best_model_at_end" . || true

# 2) Find evaluation/save strategy wiring (code + notebooks)
echo "== evaluation_strategy / save_strategy occurrences =="
rg -n --max-count 50 "evaluation_strategy|save_strategy" . || true

# 3) Find SFTTrainer / Trainer usage
echo "== SFTTrainer / Trainer occurrences =="
rg -n --max-count 50 "SFTTrainer|Seq2SeqTrainer|Trainer\(" . || true

# 4) Find config loading / yaml usage for lora_config.yaml
echo "== YAML loading occurrences (lora_config + yaml.safe_load) =="
rg -n --max-count 50 "lora_config\.ya?ml|yaml\.safe_load|safe_load|load\(.+\.ya?ml" . || true

# 5) Specifically in GSoC26_H (to reduce noise)
echo "== GSoC26_H: SFTTrainer / evaluation_strategy / save_strategy =="
rg -n --max-count 50 "SFTTrainer|evaluation_strategy|save_strategy|Trainer\(" GSoC26_H . || true

Repository: dbpedia/neural-extraction-framework

Length of output: 675


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n --max-count 50 "TrainingArguments" GSoC26_H . || true
rg -n --max-count 50 "SFTTrainer|Seq2SeqTrainer" GSoC26_H . || true
rg -n --max-count 50 "eval_steps|save_steps|evaluation_strategy|save_strategy" GSoC26_H . || true

Repository: dbpedia/neural-extraction-framework

Length of output: 475


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find any reference to this exact YAML path/name
rg -n --max-count 100 "lora_config\.ya?ml|configs/lora_config\.ya?ml|GSoC26_H/configs/lora_config\.ya?ml" . || true

# Look for loading YAML into training args in notebooks too
rg -n --max-count 100 "yaml\.safe_load|yaml\.load|OmegaConf|Hydra|ruamel|load_config" GSoC26_H . || true

# Look for finetune scripts that might build TrainingArguments using TRL SFTTrainer
rg -n --max-count 100 "SFTTrainer|training_args|TrainingArguments|trl\.|Seq2SeqTrainingArguments" . || true

# Restrict to notebooks if code isn't present in .py
rg -n --max-count 50 "SFTTrainer|TrainingArguments|bf16|evaluation_strategy|save_strategy|load_best_model_at_end" GSoC26_H/notebooks . || true

Repository: dbpedia/neural-extraction-framework

Length of output: 2298


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search within the GSoC26_H notebooks only (small output, no full dumps)
rg -n "lora_config\.ya?ml|configs/lora_config\.ya?ml|SFTTrainer|TrainingArguments\(|load_best_model_at_end|evaluation_strategy|save_strategy|bf16|bfloat16|fp16" GSoC26_H/notebooks . || true

Repository: dbpedia/neural-extraction-framework

Length of output: 36096


Fix BF16 precision for the stated T4 target

GSoC26_H/configs/lora_config.yaml sets bnb_4bit_compute_dtype: "bfloat16" (line 10) and training.bf16: true while targeting “Google Colab Pro (T4 GPU)”. T4 has no native BF16 support, so this may fail or fall back to slower emulation/silently degrade precision/performance.

  bnb_4bit_compute_dtype: "bfloat16"   # line 10
  fp16: false                            # line 37
  bf16: true                             # line 38

Switch to FP16 for T4: use bnb_4bit_compute_dtype: "float16" and set training.fp16: true / training.bf16: false.
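
As a rough illustration of the rule above, the precision keys can be derived from the GPU's CUDA compute capability. This is a sketch under stated assumptions: `pick_precision` is a hypothetical helper, not part of the PR, and it relies on native BF16 being available from compute capability 8.0 (Ampere) onward, whereas the T4 reports 7.5.

```python
def pick_precision(compute_capability: tuple[int, int]) -> dict:
    """Choose QLoRA precision settings from a CUDA compute capability (major, minor)."""
    # Native BF16 arrived with Ampere (8.0); the T4 is Turing (7.5).
    has_native_bf16 = compute_capability >= (8, 0)
    return {
        "bnb_4bit_compute_dtype": "bfloat16" if has_native_bf16 else "float16",
        "bf16": has_native_bf16,
        "fp16": not has_native_bf16,
    }


print(pick_precision((7, 5)))  # T4-class GPU
print(pick_precision((8, 0)))  # A100-class GPU
```

In a notebook, the tuple could come from `torch.cuda.get_device_capability()`, so the same config works on both T4 and newer runtimes.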

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/configs/lora_config.yaml` at line 10, The config currently sets
bnb_4bit_compute_dtype: "bfloat16" and training.bf16: true which is incompatible
with NVIDIA T4 GPUs; update bnb_4bit_compute_dtype to "float16", set
training.fp16: true and set training.bf16: false (or remove bf16) so the code
uses FP16 on T4; locate these keys (bnb_4bit_compute_dtype, training.bf16,
training.fp16) in the lora_config.yaml and make the value changes accordingly.

Comment on lines +40 to +44
save_steps: 50
eval_steps: 50
save_total_limit: 3
load_best_model_at_end: true
report_to: "none" # Set to "wandb" if you use W&B


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For Hugging Face Transformers TrainingArguments, what fields are required when load_best_model_at_end=True (evaluation_strategy, save_strategy, metric_for_best_model, save/eval steps)?

💡 Result:

When using load_best_model_at_end=True in Hugging Face Transformers TrainingArguments, the following requirements and constraints must be met to ensure the trainer correctly tracks, saves, and reloads the best model [1][2]:

Required Configuration:

  1. Evaluation Strategy: You must set an evaluation strategy (eval_strategy or the deprecated evaluation_strategy) [1]. The trainer relies on regular evaluations to determine the best checkpoint [1].
  2. Save Strategy: Generally, your save_strategy must match your eval_strategy (e.g., both set to "epoch" or both set to "steps") [1][2].
     • Exception: If save_strategy is set to "best", it does not need to match the eval_strategy [1][2].
     • Constraint: You cannot use "no" for the save_strategy, as the trainer must be able to save checkpoints to identify and reload the best one [3][4].
  3. Steps Alignment (for "steps" strategy): If both eval_strategy and save_strategy are set to "steps", the save_steps must be a multiple of the eval_steps [1][2]. This ensures that an evaluation occurs at every save point, allowing the trainer to associate a metric score with that saved checkpoint [1].

Optional but Recommended Configuration:

  • metric_for_best_model: By default, the trainer uses "loss" as the metric [1]. If you want to track a different metric (e.g., "accuracy" or "f1"), you must specify it using this field [1][5]. The metric name must be one returned by your evaluation (often with or without an "eval_" prefix) [1].
  • greater_is_better: This field defaults based on the metric name: it is True if the metric name does not end in "loss", and False otherwise [1]. If you are using a custom metric, you may need to explicitly set this to tell the trainer whether higher or lower values are preferred [1].

Important Behavior Notes:

  • Checkpoint Retention: When load_best_model_at_end=True is enabled, the best checkpoint is always retained, even if save_total_limit is reached (the best checkpoint is protected from deletion) [1][6].
  • Evaluation Source: When using multiple evaluation datasets, ensure metric_for_best_model references exactly one of them (e.g., "eval_dataset1_loss") so the trainer knows which metric to optimize [7].


Add compatible evaluation_strategy/save_strategy for load_best_model_at_end: true

With load_best_model_at_end: true set, Transformers also requires an evaluation strategy (not "no") and a compatible save strategy. Since the config only sets eval_steps/save_steps, add:

  • evaluation_strategy: "steps" (or "epoch")
  • save_strategy: "steps" (or "epoch") or save_strategy: "best"
  • Optionally set metric_for_best_model only if you’re not using the default "loss"
  • If using "steps" for both, ensure save_steps is a multiple of eval_steps (50/50 is OK once strategies are set)
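
The constraints listed above can be checked before training starts. The following is a minimal sketch that mirrors the rules as described in this comment; it is not the validation code Transformers itself runs, `check_best_model_config` is a hypothetical helper, and the fallback values ("no" for the evaluation strategy, "steps" for the save strategy) are assumptions about the library defaults.

```python
def check_best_model_config(cfg: dict) -> list[str]:
    """Return a list of problems with a load_best_model_at_end setup (empty if none)."""
    if not cfg.get("load_best_model_at_end"):
        return []
    problems = []
    # Accept both the current and the deprecated key for the evaluation strategy.
    eval_strategy = cfg.get("eval_strategy", cfg.get("evaluation_strategy", "no"))
    save_strategy = cfg.get("save_strategy", "steps")

    if eval_strategy == "no":
        problems.append("an evaluation strategy other than 'no' is required")
    if save_strategy == "no":
        problems.append("save_strategy must not be 'no'")
    elif save_strategy != "best" and save_strategy != eval_strategy:
        problems.append("save_strategy must match the evaluation strategy or be 'best'")

    if eval_strategy == save_strategy == "steps":
        save_steps, eval_steps = cfg.get("save_steps"), cfg.get("eval_steps")
        if save_steps and eval_steps and save_steps % eval_steps != 0:
            problems.append("save_steps must be a multiple of eval_steps")
    return problems


current = {"load_best_model_at_end": True, "save_steps": 50, "eval_steps": 50}
fixed = dict(current, eval_strategy="steps", save_strategy="steps")
print(check_best_model_config(current))
print(check_best_model_config(fixed))
```

Running such a check when the YAML is loaded would surface the missing strategies early instead of at trainer construction time.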
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/configs/lora_config.yaml` around lines 40 - 44, The config sets
load_best_model_at_end: true but lacks compatible strategies; add
evaluation_strategy: "steps" (or "epoch") and a matching save_strategy: "steps"
(or "epoch") or "best" so checkpointing is compatible with
load_best_model_at_end, optionally set metric_for_best_model if you need a
metric other than the default "loss", and ensure save_steps is a multiple of
eval_steps (keep save_steps: 50 and eval_steps: 50 or adjust accordingly); leave
save_total_limit as-is or adjust if you want fewer retained checkpoints.

Comment on lines +127 to +135
"if not os.path.exists(f'{gsoc_path}/neural-extraction-framework'):\n",
" os.chdir(gsoc_path)\n",
" !git clone --depth=1 https://github.com/dbpedia/neural-extraction-framework.git\n",
" print('Cloned.')\n",
"else:\n",
" print('Already cloned.')\n",
"\n",
"gsoc25h = f'{PROJECT}/data/gsoc25h/neural-extraction-framework/GSoC25_H'\n",
"print(f'GSoC25_H path: {gsoc25h}')"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin the reused GSoC25_H checkout to a known revision.

This clones the default branch HEAD, so the evaluator path, shipped extraction files, and reported baselines can drift as the upstream repo changes. That makes the Week 1 table non-reproducible.
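
One way to pin the checkout is sketched below. `PINNED_COMMIT` is a placeholder, not a real revision, and `pin_commands` and `checkout_pinned` are hypothetical helpers. The sketch also assumes the remote allows fetching a single commit by its full SHA.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/dbpedia/neural-extraction-framework.git"
# Placeholder: replace with the revision used for the reported baselines.
PINNED_COMMIT = "<pinned-commit-sha>"


def pin_commands(dest: str, commit: str, repo_url: str = REPO_URL) -> list[list[str]]:
    """Build the git commands that bring the checkout at dest to exactly commit."""
    commands = []
    if not (Path(dest) / ".git").exists():
        # Fresh checkout: create an empty repository and register the remote.
        commands.append(["git", "init", dest])
        commands.append(["git", "-C", dest, "remote", "add", "origin", repo_url])
    # Fetch only the pinned commit, then detach HEAD onto it.
    commands.append(["git", "-C", dest, "fetch", "--depth=1", "origin", commit])
    commands.append(["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"])
    return commands


def checkout_pinned(dest: str, commit: str = PINNED_COMMIT) -> None:
    """Run the commands, raising if any git step fails."""
    for cmd in pin_commands(dest, commit):
        subprocess.run(cmd, check=True)
```

Because the fetch and checkout steps also run when the directory already exists, an older clone is moved to the pinned revision instead of being silently reused.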

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/notebooks/01_week1_baselines.ipynb` around lines 127 - 135, The
clone step currently checks out the default branch HEAD which can drift; update
the cloning logic around gsoc_path and the neural-extraction-framework checkout
so it pins to a known revision: add a constant commit/tag (e.g. GSOC25H_COMMIT)
and after cloning run a git checkout of that commit (or use git clone --branch
<tag> --single-branch) against the neural-extraction-framework repo, ensuring
gsoc25h (which uses PROJECT) points to the pinned revision; keep the existing
exists-check and print messages but ensure the repo is forced/checked out to the
pinned commit when cloning or when already present.

Comment on lines +309 to +319
"def extract_triple(sentence: str) -> dict:\n",
" prompt = PROMPT.format(sentence=sentence)\n",
" inputs = tokenizer(prompt, return_tensors='pt').to(model.device)\n",
" with torch.no_grad():\n",
" outputs = model.generate(\n",
" **inputs,\n",
" max_new_tokens=80,\n",
" temperature=0.1,\n",
" do_sample=True,\n",
" pad_token_id=tokenizer.eos_token_id,\n",
" )\n",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For google/gemma-3-1b-it with Hugging Face Transformers, what is the official inference format for prompts? Do the official examples use tokenizer.apply_chat_template(..., add_generation_prompt=True) for -it checkpoints?

💡 Result:

Official inference prompt format (HF Transformers) for google/gemma-3-1b-it: use the model/tokenizer’s built-in chat template, and build a tokenizable prompt from a messages list of {role, content} objects via tokenizer.apply_chat_template(..., add_generation_prompt=True) (or, equivalently for multimodal variants, processor.apply_chat_template). The model card for google/gemma-3-1b-it explicitly shows using tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, ...) before calling the generation pipeline/model.[1]

Do the official examples use tokenizer.apply_chat_template(..., add_generation_prompt=True) for -it checkpoints? Yes. The google/gemma-3-1b-it model card’s example constructs inputs = tokenizer.apply_chat_template(..., add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt") ....[1]

Additionally, Hugging Face’s Transformers documentation for Gemma 3 demonstrates the same pattern (apply the chat template with add_generation_prompt=True and pass the resulting inputs to generate).[2][3]

Citations:


Use Gemma’s chat template for the -it checkpoint (don’t tokenize the raw prompt string).

For google/gemma-3-1b-it, Hugging Face’s official examples build inputs via tokenizer.apply_chat_template(..., add_generation_prompt=True, tokenize=True, ...) before generate(). Tokenizing PROMPT.format(sentence=...) directly bypasses the -it chat/assistant formatting, which can materially change the measured baseline.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/notebooks/01_week1_baselines.ipynb` around lines 309 - 319, The
extract_triple function currently tokenizes PROMPT.format(sentence=...) directly,
which bypasses the chat/assistant formatting that Gemma’s instruction-tuned
(-it) checkpoints expect; update the input creation to use the tokenizer's chat
template API (e.g., tokenizer.apply_chat_template or similar) with
add_generation_prompt=True and tokenize=True for the "google/gemma-3-1b-it"
checkpoint, then move the resulting tensors to model.device before calling
model.generate (leave the generate(...) call unchanged); reference
extract_triple, PROMPT, tokenizer.apply_chat_template, and model.generate when
making the change.
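
A minimal sketch of the input construction described above, assuming a Hugging Face tokenizer for an instruction-tuned checkpoint whose chat template accepts plain string content. The helper name `build_chat_inputs` is hypothetical and not part of the notebook; the keyword arguments mirror the ones quoted from the model card in the web-query result.

```python
from typing import Any


def build_chat_inputs(tokenizer: Any, prompt: str, device: Any) -> Any:
    """Wrap the prompt in a single user turn and apply the chat template.

    `tokenizer` is assumed to expose apply_chat_template() the way
    Hugging Face tokenizers do; `device` is whatever model.device returns.
    """
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # append the marker that opens the model's turn
        tokenize=True,
        return_dict=True,            # return input_ids together with attention_mask
        return_tensors="pt",
    )
    return inputs.to(device)
```

With this helper, the existing call would presumably stay as `model.generate(**inputs, ...)`, since the returned mapping carries the same keys as the output of a plain `tokenizer(...)` call.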

Comment on lines +324 to +331
" try:\n",
" clean = re.sub(r'```json|```', '', response).strip()\n",
" match = re.search(r'\\{.*?\\}', clean, re.DOTALL)\n",
" if match:\n",
" return json.loads(match.group(0))\n",
" except Exception:\n",
" pass\n",
" return {'subject': '', 'relation': '', 'object': '', 'raw': response}\n",

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t swallow parse failures in the evaluation run.

except Exception: pass turns parser bugs and malformed model output into the empty-triple fallback, which silently depresses recall. It also returns unvalidated JSON; if json.loads yields non-string fields, the later .strip() calls can still fail.

Suggested fix
     try:
         clean = re.sub(r'```json|```', '', response).strip()
         match = re.search(r'\{.*?\}', clean, re.DOTALL)
         if match:
-            return json.loads(match.group(0))
-    except Exception:
-        pass
+            payload = json.loads(match.group(0))
+            if isinstance(payload, dict):
+                return {
+                    'subject': str(payload.get('subject', '')),
+                    'relation': str(payload.get('relation', '')),
+                    'object': str(payload.get('object', '')),
+                    'raw': response,
+                }
+    except json.JSONDecodeError:
+        pass
     return {'subject': '', 'relation': '', 'object': '', 'raw': response}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/notebooks/01_week1_baselines.ipynb` around lines 324 - 331, The
parser currently swallows all exceptions with a bare "except Exception: pass"
and returns unvalidated JSON directly from json.loads, so update the block that
cleans and extracts the JSON (the re.sub/re.search/match logic and json.loads
call) to only catch json.JSONDecodeError, validate that the decoded payload is a
dict (assign to a local payload variable), and return a sanitized dict with
'subject', 'relation', 'object' coerced to strings (using payload.get(..., '')
and str(...)) plus 'raw': response; otherwise fall back to the empty-triple
response as before—this ensures you don't hide parser bugs and that later
.strip() calls won't fail on non-string fields.
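
One way to make parse failures visible rather than silent is to tally them next to the metrics. The sketch below is an illustration under assumptions, not the notebook's code: `parse_stats` and the reason labels are hypothetical names, and the regular expressions are the ones shown in the snippet above.

```python
import json
import re
from collections import Counter

parse_stats = Counter()  # hypothetical tally, meant to be reported next to P/R/F1


def parse_triple(response: str) -> dict:
    """Return a string-valued triple and record why parsing failed, if it did."""
    empty = {"subject": "", "relation": "", "object": "", "raw": response}
    clean = re.sub(r"```json|```", "", response).strip()
    match = re.search(r"\{.*?\}", clean, re.DOTALL)
    if match is None:
        parse_stats["no_json_object"] += 1
        return empty
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        parse_stats["invalid_json"] += 1
        return empty
    parse_stats["ok"] += 1
    # Coerce every field to str so later .strip() calls cannot fail.
    return {
        "subject": str(payload.get("subject", "")),
        "relation": str(payload.get("relation", "")),
        "object": str(payload.get("object", "")),
        "raw": response,
    }
```

Printing `parse_stats` after the evaluation loop separates "the model produced no usable JSON" from "the model produced a wrong triple", which the empty-triple fallback alone cannot do.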

Comment on lines +207 to +215
def _build_chat_prompt(self, user_message: str) -> str:
"""Format as Gemma chat template."""
chat = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_message},
]
return self._tokenizer.apply_chat_template(
chat, tokenize=False, add_generation_prompt=True
)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does google/gemma-2-2b-it apply_chat_template raise "System role not supported" when passed a system role message?

💡 Result:

Yes, the google/gemma-2-2b-it model's chat template explicitly raises a TemplateError with the message "System role not supported" if a message with the role "system" is passed to apply_chat_template [1][2][3]. This occurs because the Gemma instruction-tuned models are designed to support only two roles: "user" and "model" [4]. They were not trained with system instructions [5], and therefore the official chat template includes a check that triggers this error to prevent invalid conversation structures [1][2]. To provide system-level instructions for this model, the recommended approach is to include the instructions directly within the initial user prompt [4]. Alternatively, some users have bypassed this by modifying the chat template or prepending instructions to the user message [6][7][5].

Citations:


Portability: avoid using role: "system" with apply_chat_template for Gemma 1.x/2.x

_build_chat_prompt always passes {"role": "system", ...}. For google/gemma-2-2b-it, apply_chat_template raises TemplateError: System role not supported (and Gemma 1.1 follows the same constraint), so swapping model IDs would crash. Embed SYSTEM_PROMPT into the first "user" message instead of relying on a native system role.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/baseline/gemma_zero_shot.py` around lines 207 - 215,
_build_chat_prompt currently emits a {"role":"system", ...} entry which breaks
Gemma 1.x/2.x because apply_chat_template rejects a system role; instead build
the chat as only user messages by embedding SYSTEM_PROMPT into the first user
content (e.g., concatenate SYSTEM_PROMPT + "\n\n" + user_message) and remove the
system-role object so the list contains only {"role":"user", "content": ...}
before calling self._tokenizer.apply_chat_template(chat, tokenize=False,
add_generation_prompt=True); update references to SYSTEM_PROMPT and the chat
variable in the _build_chat_prompt method accordingly.
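
The merge described above can be sketched as a small pure function. The `SYSTEM_PROMPT` text here is a placeholder and `build_chat_messages` is a hypothetical name; the real method would pass the resulting list to `apply_chat_template` as before.

```python
# Placeholder text; the real SYSTEM_PROMPT lives in gemma_zero_shot.py.
SYSTEM_PROMPT = "Extract (subject, predicate, object) triples from the Hindi sentence."


def build_chat_messages(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> list:
    """Fold the system instructions into the first user turn.

    Only the "user" role is emitted, so templates that reject a "system"
    role are not triggered.
    """
    merged = f"{system_prompt}\n\n{user_message}" if system_prompt else user_message
    return [{"role": "user", "content": merged}]
```

Keeping the message construction separate from tokenization makes it easy to unit-test without loading a model.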

Comment on lines +274 to +289
if "triples" in data and isinstance(data["triples"], list):
for t in data["triples"]:
try:
triples.append(HindiTriple(**t))
except ValidationError as e:
pass # Skip malformed triples
elif "subject" in data and "predicate" in data:
obj_key = "object_" if "object_" in data else "object"
try:
triples.append(HindiTriple(
subject=data.get("subject", ""),
predicate=data.get("predicate", ""),
object=data.get(obj_key, data.get("object", "")),
))
except ValidationError:
pass

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden triple parsing against non-dict list elements.

If the model returns {"triples": [...]} where an element is not a dict (e.g. a string), HindiTriple(**t) raises TypeError, which is not caught by except ValidationError and aborts the whole extraction. Given LLM output is unpredictable, guard the element type.

🛡️ Proposed fix
         if "triples" in data and isinstance(data["triples"], list):
             for t in data["triples"]:
+                if not isinstance(t, dict):
+                    continue
                 try:
                     triples.append(HindiTriple(**t))
                 except ValidationError as e:
                     pass  # Skip malformed triples
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/baseline/gemma_zero_shot.py` around lines 274 - 289, The triple
parsing loop in gemma_zero_shot.py can crash when a non-dict element is present
because HindiTriple(**t) raises TypeError which isn’t caught; update the loop
that builds triples (the for t in data["triples"] section) to first verify each
element is a dict (or convert/skip non-dict) and catch both ValidationError and
TypeError around HindiTriple(**t) so malformed or non-dict entries are skipped
safely; ensure the same defensive check is applied wherever HindiTriple(...) is
constructed (including the fallback branch that builds a single HindiTriple from
subject/predicate/object_).
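
A self-contained sketch of the guarded loop follows. It uses a plain dataclass as a stand-in for `HindiTriple` (an assumption; the PR's class appears to be a Pydantic model), so the failure it demonstrates is the `TypeError` raised by `**` unpacking and by missing fields.

```python
from dataclasses import dataclass


@dataclass
class HindiTriple:
    """Stand-in for the PR's model, used only for illustration."""
    subject: str
    predicate: str
    object: str


def parse_triples(data: dict) -> list:
    """Collect well-formed triples and skip everything else."""
    triples = []
    items = data.get("triples")
    if not isinstance(items, list):
        return triples
    for item in items:
        if not isinstance(item, dict):
            continue  # e.g. a bare string or number emitted by the model
        try:
            triples.append(HindiTriple(**item))
        except (TypeError, ValueError):
            continue  # missing or unexpected keys, or failed validation
    return triples
```

With a Pydantic model the `ValueError` clause would also cover `ValidationError`, which to my knowledge derives from `ValueError`; catching it by name, as the PR does, works as well.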

text = text.strip()

# Remove markdown code fences if present
text = re.sub(r"```(?:json)?", "", text).strip().rstrip("```").strip()

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the misleading/redundant rstrip("```").

str.rstrip("```") strips any trailing backtick characters, not the literal ``` substring (Ruff B005). It is also redundant: the preceding re.sub already removes all ``` fences (opening and closing).

♻️ Proposed fix
-        text = re.sub(r"```(?:json)?", "", text).strip().rstrip("```").strip()
+        text = re.sub(r"```(?:json)?", "", text).strip()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        text = re.sub(r"```(?:json)?", "", text).strip().rstrip("```").strip()
+        text = re.sub(r"```(?:json)?", "", text).strip()
🧰 Tools
🪛 Ruff (0.15.14)

[warning] 303-303: Using .strip() with multi-character strings is misleading

(B005)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/baseline/gemma_zero_shot.py` at line 303, The code currently
calls text = re.sub(r"```(?:json)?", "", text).strip().rstrip("```").strip(),
but rstrip("```") is misleading and redundant because re.sub already removes the
triple-backtick fences and rstrip treats its argument as a character set; update
the assignment (in gemma_zero_shot.py where variable text is set) to remove the
rstrip("```") and leave a single .strip() so the line becomes a simple
re.sub(...).strip().
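
The character-set behaviour of `rstrip` is easy to check in isolation. In this short sketch, `FENCE` and `strip_code_fences` are illustrative names; the fence marker is built from single backticks purely to keep the example readable.

```python
import re

FENCE = "`" * 3  # the Markdown code-fence marker


def strip_code_fences(text: str) -> str:
    """Remove opening and closing fences, with an optional json tag."""
    return re.sub(re.escape(FENCE) + r"(?:json)?", "", text).strip()


# rstrip() treats its argument as a set of characters, so a single trailing
# backtick that belongs to the payload is removed too.
payload = "`name`"
over_stripped = payload.rstrip(FENCE)

fenced = FENCE + 'json\n{"a": 1}\n' + FENCE
cleaned = strip_code_fences(fenced)
```

Here `over_stripped` loses its closing backtick even though no fence was present, while `cleaned` shows that the substitution alone already handles both fences.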

Comment on lines +72 to +75
def classify_error(self) -> "ExtractionResult":
"""Auto-classify error type. Call after setting pred_* fields."""

effective_pred = self.aligned_predicate or self.pred_predicate

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reset derived state before reclassifying.

classify_error() mutates the same ExtractionResult. If it is called again after alignment or manual correction, the old error_notes can survive even when the result becomes CORRECT, and save() will serialize a contradictory record.

Suggested fix
     def classify_error(self) -> "ExtractionResult":
         """Auto-classify error type. Call after setting pred_* fields."""
+        self.error_notes = ""
 
         effective_pred = self.aligned_predicate or self.pred_predicate
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     def classify_error(self) -> "ExtractionResult":
         """Auto-classify error type. Call after setting pred_* fields."""
+        self.error_notes = ""
         effective_pred = self.aligned_predicate or self.pred_predicate
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/evaluation/error_taxonomy.py` around lines 72 - 75,
classify_error currently mutates the same ExtractionResult and can leave stale
derived fields; before computing effective_pred (and before any early returns)
clear/reset derived state like self.error_notes, self.error_type (and any other
fields set by previous classifications) so a reclassification reflects current
predicates/alignment; update the logic in classify_error (which reads
aligned_predicate and pred_predicate) to explicitly reset those derived fields
at the top so save() will not persist contradictory data after alignment or
manual correction.
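
To illustrate why the reset matters, here is a reduced sketch of a result object that is classified twice. The field set and the `PREDICATE_MISMATCH` label are simplifications invented for the example; the real `ExtractionResult` has more fields and a richer taxonomy.

```python
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    gold_predicate: str
    pred_predicate: str = ""
    aligned_predicate: str = ""
    error_type: str = ""
    error_notes: str = ""

    def classify_error(self) -> "ExtractionResult":
        """Re-derive error_type and error_notes from the current predicates."""
        # Clear derived state first so an earlier classification cannot leak through.
        self.error_type = ""
        self.error_notes = ""
        effective_pred = self.aligned_predicate or self.pred_predicate
        if effective_pred == self.gold_predicate:
            self.error_type = "CORRECT"
        else:
            self.error_type = "PREDICATE_MISMATCH"  # hypothetical label
            self.error_notes = f"expected {self.gold_predicate!r}, got {effective_pred!r}"
        return self
```

Calling `classify_error()` once with a surface predicate and again after setting `aligned_predicate` to the gold value ends with `error_type == "CORRECT"` and empty notes, because the reset runs before any branch.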

Comment thread GSoC26_H/src/evaluation/error_taxonomy.py
Evaluated 3 baseline systems on full Hindi-BenchIE (112 sentences):
- IndIE: P=0.44, R=0.49, F1=0.46
- Zero-shot Gemma-3-1B: P=0, R=0, F1=0
- GSoC25_H best: P=0.21, R=0.58, F1=0.31 (matches README claim)

Discovered new error type PREDICATE_PLACEHOLDER affecting 28.9% of IndIE
failures, reduced to 8.2% in GSoC25_H. Argument span errors are 0%
across all systems, confirming the predicate slot is the sole failure
mode and motivating Phase 2 fine-tuning.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@GSoC26_H/results/phase1_ablation_table.md`:
- Line 28: The sentence "All three systems share the same downstream gap:
surface Hindi predicates that are not aligned to DBpedia ontology — 82% of
GSoC25_H's failures." is ambiguous because the 82% refers to false positives,
not all failures; update that sentence to explicitly say "82% of GSoC25_H's
false positives" (or replace "failures" with "false positives") and, if you want
to mention totals, add the counts (e.g., "GSoC25_H has 536 FPs + 104 FNs = 640
total failures, and predicate normalization accounts for 81.9% of FPs") so the
statistic is precise and unambiguous when referring to the Error Type
Distribution table.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 174337e8-bee3-4e08-888f-cf00dfa16881

📥 Commits

Reviewing files that changed from the base of the PR and between c7a5b79 and 3d6cd3a.

⛔ Files ignored due to path filters (1)
  • GSoC26_H/results/phase1_ablation_table.csv is excluded by !**/*.csv
📒 Files selected for processing (2)
  • GSoC26_H/notebooks/02_week2_error_analysis.ipynb
  • GSoC26_H/results/phase1_ablation_table.md


2. **GSoC25_H reduces IndIE's placeholder failures from 28.9% → 8.2%** by using the 12B LLM to fill empty predicate slots.

3. **All three systems share the same downstream gap:** surface Hindi predicates that are not aligned to DBpedia ontology — 82% of GSoC25_H's failures.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify "failures" to specify "false positives".

The finding states "82% of GSoC25_H's failures," but this percentage refers specifically to false positives (from the Error Type Distribution table), not all failures. GSoC25_H has 536 FPs + 104 FNs = 640 total failures, so predicate normalization accounts for 81.9% of FPs (about 439 errors) but only 68.6% of all failures.

📝 Proposed fix for precision
-3. **All three systems share the same downstream gap:** surface Hindi predicates that are not aligned to DBpedia ontology — 82% of GSoC25_H's failures.
+3. **All three systems share the same downstream gap:** surface Hindi predicates that are not aligned to DBpedia ontology — 82% of GSoC25_H's false positives.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-3. **All three systems share the same downstream gap:** surface Hindi predicates that are not aligned to DBpedia ontology — 82% of GSoC25_H's failures.
+3. **All three systems share the same downstream gap:** surface Hindi predicates that are not aligned to DBpedia ontology — 82% of GSoC25_H's false positives.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/results/phase1_ablation_table.md` at line 28, The sentence "All
three systems share the same downstream gap: surface Hindi predicates that are
not aligned to DBpedia ontology — 82% of GSoC25_H's failures." is ambiguous
because the 82% refers to false positives, not all failures; update that
sentence to explicitly say "82% of GSoC25_H's false positives" (or replace
"failures" with "false positives") and, if you want to mention totals, add the
counts (e.g., "GSoC25_H has 536 FPs + 104 FNs = 640 total failures, and
predicate normalization accounts for 81.9% of FPs") so the statistic is precise
and unambiguous when referring to the Error Type Distribution table.
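
The arithmetic behind the two percentages can be reproduced in a few lines. The counts and the 81.9% share are the figures quoted in this comment; the calculation assumes, as the comment does, that predicate-normalization errors occur only among false positives.

```python
false_positives = 536
false_negatives = 104
predicate_share_of_fps = 0.819  # share quoted from the Error Type Distribution table

total_failures = false_positives + false_negatives
predicate_errors = predicate_share_of_fps * false_positives
share_of_all_failures = predicate_errors / total_failures

print(f"total failures: {total_failures}")
print(f"predicate-normalization errors: {predicate_errors:.0f}")
print(f"share of all failures: {share_of_all_failures:.1%}")
```

The gap between the two shares comes entirely from the denominator: 536 false positives versus 640 failures overall.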

Added README for Human-in-the-Loop Review Interface with setup instructions and deployment details.
This script generates noisy synthetic datasets for staged fine-tuning by utilizing flawed examples from a previous dataset. It implements functions for seed selection, generation, and validation of the output data.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (4)
GSoC26_H/src/finetune/generate_noisy_data.py (3)

250-251: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Blind except Exception hides API/transport failures.

Swallowing every exception as return False makes it impossible to distinguish transient API errors (timeouts, rate limits) from genuine empty responses, which can silently depress the success rate. Consider logging the exception (at least at debug level) so batch failures are diagnosable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/finetune/generate_noisy_data.py` around lines 250 - 251, The
bare `except Exception:` block that returns False without logging swallows all
errors, making it impossible to diagnose API or transport failures. Add a
logging statement (at least at debug level) in the exception handler before the
return False statement to capture the exception details. This will allow
developers to distinguish between transient API errors and genuine empty
responses while maintaining the False return behavior for the normal flow.

Source: Linters/SAST tools
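
A minimal sketch of the logging variant, assuming the failing call can be wrapped as a zero-argument callable. The logger name and `run_generation_step` are illustrative and do not appear in the script.

```python
import logging
from typing import Callable

logger = logging.getLogger("generate_noisy_data")  # hypothetical logger name


def run_generation_step(step: Callable[[], bool]) -> bool:
    """Run one generation step, keeping the False-on-failure contract."""
    try:
        return step()
    except Exception:
        # exc_info=True attaches the traceback, so timeouts and rate limits
        # can be told apart from empty responses when debug logging is on.
        logger.debug("generation step failed", exc_info=True)
        return False
```

The return value is unchanged for callers; only the diagnostic trail differs, and it stays quiet unless the logger is set to DEBUG.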


365-371: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Prefer importlib.util over exec(compile(...)) for loading the gen script.

The manual source-splitting + exec approach is fragile (the if __name__ == "__main__": string match breaks on quoting/whitespace variations) and triggers code-execution warnings. importlib.util loads the file as a proper module; the __main__ guard already prevents the CLI block from running since the module name won't be "__main__".

♻️ Proposed refactor
-    import types
-    with open(args.gen_script, "r", encoding="utf-8") as f:
-        source = f.read().split('if __name__ == "__main__":')[0]
-    gen_module = types.ModuleType("synthetic_data_gen_2")
-    exec(compile(source, args.gen_script, "exec"), gen_module.__dict__)
+    import importlib.util
+    spec = importlib.util.spec_from_file_location("synthetic_data_gen_2", args.gen_script)
+    gen_module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(gen_module)

Note: the S102/no-exec and path-traversal static-analysis hints here are low-risk in practice since --gen-script is a developer-supplied local path, not untrusted input.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/finetune/generate_noisy_data.py` around lines 365 - 371, The
current approach of manually splitting source code on the string `if __name__ ==
"__main__":` and using `exec(compile(...))` is fragile and triggers
code-execution warnings. Replace this manual module loading with
`importlib.util.spec_from_file_location` to create a module spec from the gen
script file specified in `args.gen_script`, then use
`importlib.util.module_from_spec` and `spec.loader.exec_module` to properly load
the module as `gen_module`. This approach avoids the fragile string-splitting
logic and naturally respects the `__main__` guard since the loaded module name
won't be `__main__`.

Source: Linters/SAST tools
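
The loading pattern can be exercised end to end with a throwaway file. In the sketch below the script body and `make_example` are invented for the demonstration, and the `None` check is an extra precaution that the proposed refactor does not include.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_module_from_path(name: str, path: str) -> ModuleType:
    """Load a Python file as a module without touching sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path!r} as a Python module")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Throwaway script with a CLI block that must not run on import.
SCRIPT = '''
def make_example():
    return {"hindi_sentence": "demo"}

if __name__ == "__main__":
    raise SystemExit("CLI block should not run on import")
'''

with tempfile.TemporaryDirectory() as tmp:
    script_path = Path(tmp) / "synthetic_data_gen_2.py"
    script_path.write_text(SCRIPT, encoding="utf-8")
    gen_module = load_module_from_path("synthetic_data_gen_2", str(script_path))
    example = gen_module.make_example()
```

Because the module is executed under the name `synthetic_data_gen_2`, the guard evaluates to false and no source splitting is needed.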


102-103: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

random.seed() mutates global RNG state shared by generation.

Seeding the global random module here also fixes the starting state for the strategy/relation selection in _build_prompt, which runs concurrently across worker threads. Using a dedicated random.Random(rng_seed) instance for seed sampling keeps reproducibility local and avoids surprising side effects on the generation diversity.

♻️ Proposed refactor
-    random.seed(rng_seed)
-    sampled = random.sample(low_score_pool, seed_count)
+    rng = random.Random(rng_seed)
+    sampled = rng.sample(low_score_pool, seed_count)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/finetune/generate_noisy_data.py` around lines 102 - 103, The
code is using random.seed() which mutates the global RNG state that is shared
across worker threads, causing unexpected side effects on generation diversity
in concurrent operations like _build_prompt. Replace the random.seed(rng_seed)
call with creating a dedicated random.Random(rng_seed) instance, then use this
local instance's sample() method to draw from low_score_pool instead of calling
random.sample(). This approach maintains reproducibility for the seed sampling
while preventing the global RNG state mutation that affects other concurrent
operations.
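
A short sketch of the local-RNG approach, using a hypothetical helper `sample_seeds` in place of the script's own function:

```python
import random


def sample_seeds(pool: list, seed_count: int, rng_seed: int) -> list:
    """Draw a reproducible sample without reseeding the module-level RNG."""
    rng = random.Random(rng_seed)  # private generator; global state is untouched
    return rng.sample(pool, seed_count)


pool = list(range(100))
global_state_before = random.getstate()
first = sample_seeds(pool, 5, rng_seed=42)
second = sample_seeds(pool, 5, rng_seed=42)
global_state_after = random.getstate()
```

The two samples are identical because each call builds its generator from the same seed, and the global state is the same before and after, so other code that relies on `random` keeps its own sequence.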
requirements.txt (1)

1-1: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Pin Streamlit to a tested release.

streamlit>=1.35.0 leaves the Cloud build free to pick newer releases; Streamlit's docs recommend pinning the version in requirements.txt, which is safer for a deployed HITL app. (docs.streamlit.io)

Possible fix
-streamlit>=1.35.0
+streamlit==<tested-version>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@requirements.txt` at line 1, The Streamlit dependency in requirements.txt is
specified with a minimum version constraint (>=1.35.0) which allows Cloud builds
to pick newer releases that may contain breaking changes. Pin Streamlit to a
specific tested version by changing the constraint from >= to == (for example,
streamlit==1.35.0) to ensure reproducible and stable deployments for the HITL
application.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@GSoC26_H/hitl/hitl_app.py`:
- Around line 362-372: Dynamic values from the JSONL data (sentence, subject,
relation, object) are being directly injected into HTML without escaping before
rendering with unsafe_allow_html=True, creating a security vulnerability. Import
the html module and use html.escape() to sanitize each value retrieved from
row.get() calls before inserting them into the chip_html and sentence_box HTML
strings. Apply this escaping to all four fields (sentence, subject, relation,
object) in both the affected sections (lines 362-372 and 377-388) to prevent
malicious or malformed content from being rendered as raw HTML.
- Around line 292-299: The queue initialization and retrieval does not exclude
items that have already been reviewed. Items are added to
st.session_state.decisions when reviewed, but the queue in view_queue continues
to display all items including those already decided. Filter out reviewed items
from the queue by comparing items in st.session_state.queue against
st.session_state.decisions using a unique identifier (such as row ID or index)
to exclude already-reviewed items before assigning to the queue variable. This
filtering should be applied in all places where st.session_state.queue is
retrieved, including the initialization block and subsequent queue refreshes
throughout the application.
- Around line 429-440: The code saves the placeholder text "— type a custom
property below —" as the final_dbo_uri when a user selects the placeholder
option but leaves the custom property input empty. Add a validation check before
the save_decision call that ensures if new_prop equals the placeholder option,
the custom_prop must be non-empty and stripped. If this validation fails
(placeholder selected with no custom input), either prevent the save_decision
from being called by using a conditional check or display a warning to the user,
ensuring only valid property URIs are persisted.

In `@GSoC26_H/hitl/README.md`:
- Around line 23-26: The README statement about Streamlit Community Cloud's
requirements.txt handling is misleading because it suggests only the local
folder file is used and excludes repo-root usage. Revise the text in the
deployment section (around lines 23-26) to clarify that Streamlit Community
Cloud searches for requirements.txt first in the entrypoint directory
(GSoC26_H/hitl/) and then falls back to the repo root if no file exists in the
entrypoint directory, rather than stating it uses the local file "not the repo
root", which incorrectly implies the repo-root file is never used.

In `@GSoC26_H/src/finetune/generate_noisy_data.py`:
- Around line 376-378: The print statement in the seed-building section attempts
to log scores from seeds by calling `s.get('score')`, but the seed dictionaries
constructed in the `build_noisy_few_shot_seeds` function only contain
`hindi_sentence`, `thought_process`, and `extracted_triplets` keys with no
`score` field. Modify the `build_noisy_few_shot_seeds` function to include the
score value from each example's judgement in the seed dictionary construction
(add `"score": ex["judgement"]["score"]` to the seed dict when iterating through
examples), so that the subsequent log message will reflect the actual seed
quality scores instead of printing None values.
- Around line 149-164: The _parse_generated_json function only catches
json.JSONDecodeError and IndexError, but when the model returns valid JSON that
is not a dictionary (like a list, string, or number), the .get() calls on parsed
raise an AttributeError which is not caught, causing the entire batch to fail.
Fix this by either adding AttributeError to the except clause alongside
json.JSONDecodeError and IndexError, or by adding a type check
(isinstance(parsed, dict)) before attempting to call .get() on the parsed value
to ensure only dictionary objects are validated.

---

Nitpick comments:
In `@GSoC26_H/src/finetune/generate_noisy_data.py`:
- Around line 250-251: The bare `except Exception:` block that returns False
without logging swallows all errors, making it impossible to diagnose API or
transport failures. Add a logging statement (at least at debug level) in the
exception handler before the return False statement to capture the exception
details. This will allow developers to distinguish between transient API errors
and genuine empty responses while maintaining the False return behavior for the
normal flow.
- Around line 365-371: The current approach of manually splitting source code on
the string `if __name__ == "__main__":` and using `exec(compile(...))` is
fragile and triggers code-execution warnings. Replace this manual module loading
with `importlib.util.spec_from_file_location` to create a module spec from the
gen script file specified in `args.gen_script`, then use
`importlib.util.module_from_spec` and `spec.loader.exec_module` to properly load
the module as `gen_module`. This approach avoids the fragile string-splitting
logic and naturally respects the `__main__` guard since the loaded module name
won't be `__main__`.
- Around line 102-103: The code is using random.seed() which mutates the global
RNG state that is shared across worker threads, causing unexpected side effects
on generation diversity in concurrent operations like _build_prompt. Replace the
random.seed(rng_seed) call with creating a dedicated random.Random(rng_seed)
instance, then use this local instance's sample() method to draw from
low_score_pool instead of calling random.sample(). This approach maintains
reproducibility for the seed sampling while preventing the global RNG state
mutation that affects other concurrent operations.

In `@requirements.txt`:
- Line 1: The Streamlit dependency in requirements.txt is specified with a
minimum version constraint (>=1.35.0) which allows Cloud builds to pick newer
releases that may contain breaking changes. Pin Streamlit to a specific tested
version by changing the constraint from >= to == (for example,
streamlit==1.35.0) to ensure reproducible and stable deployments for the HITL
application.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1739d1c5-d446-4295-a26c-7a0cf89ddaad

📥 Commits

Reviewing files that changed from the base of the PR and between 3d6cd3a and 3013015.

📒 Files selected for processing (5)
  • GSoC26_H/hitl/README.md
  • GSoC26_H/hitl/hitl_app.py
  • GSoC26_H/hitl/requirement.txt
  • GSoC26_H/src/finetune/generate_noisy_data.py
  • requirements.txt
✅ Files skipped from review due to trivial changes (1)
  • GSoC26_H/hitl/requirement.txt

Comment thread GSoC26_H/hitl/hitl_app.py
Comment on lines +292 to +299
if "queue" not in st.session_state:
queue = sorted(all_rows, key=lambda r: r.get("score", 0), reverse=True)
st.session_state.queue = queue
st.session_state.idx = 0
st.session_state.decisions = []

queue = st.session_state.queue
total = len(queue)

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reviewed items are never excluded from the queue.

decisions increments progress, but the same examples remain in view_queue, so users can re-decide items and metrics drift from true completion.

Suggested fix
 if "queue" not in st.session_state:
-    queue = sorted(all_rows, key=lambda r: r.get("score", 0), reverse=True)
+    queue = sorted(all_rows, key=lambda r: r.get("score", 0), reverse=True)
+    for i, r in enumerate(queue):
+        r["_qid"] = i
     st.session_state.queue = queue
     st.session_state.idx = 0
     st.session_state.decisions = []
@@
-    if filter_mode == "Auto-aligned (score ≥ 0.55)":
-        view_queue = [r for r in queue if r.get("score", 0) >= THRESHOLD]
+    reviewed_qids = {d.get("_qid") for d in st.session_state.decisions}
+    remaining = [r for r in queue if r.get("_qid") not in reviewed_qids]
+    if filter_mode == "Auto-aligned (score ≥ 0.55)":
+        view_queue = [r for r in remaining if r.get("score", 0) >= THRESHOLD]
     elif filter_mode == "Needs review (score < 0.55)":
-        view_queue = [r for r in queue if r.get("score", 0) < THRESHOLD]
+        view_queue = [r for r in remaining if r.get("score", 0) < THRESHOLD]
     else:
-        view_queue = queue
+        view_queue = remaining
@@
 def save_decision(action, **extra):
     decision = {
+        "_qid": row.get("_qid"),
         "sentence": row.get("sentence", ""),

Also applies to: 317-323, 342-343, 396-413

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/hitl/hitl_app.py` around lines 292 - 299, The queue initialization
and retrieval does not exclude items that have already been reviewed. Items are
added to st.session_state.decisions when reviewed, but the queue in view_queue
continues to display all items including those already decided. Filter out
reviewed items from the queue by comparing items in st.session_state.queue
against st.session_state.decisions using a unique identifier (such as row ID or
index) to exclude already-reviewed items before assigning to the queue variable.
This filtering should be applied in all places where st.session_state.queue is
retrieved, including the initialization block and subsequent queue refreshes
throughout the application.
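
The filtering described above can be sketched without Streamlit. This is a hedged illustration only: `build_queue`, `remaining_rows`, and the short `filter_mode` values are hypothetical names, while `_qid` and the 0.55 threshold follow the suggested fix.

```python
THRESHOLD = 0.55  # value quoted in the filter labels of the suggested fix

def build_queue(all_rows):
    """Sort rows by score (highest first) and tag each with a stable queue id."""
    queue = sorted(all_rows, key=lambda r: r.get("score", 0), reverse=True)
    for i, row in enumerate(queue):
        row["_qid"] = i
    return queue

def remaining_rows(queue, decisions, filter_mode="all"):
    """Return rows that have no recorded decision, optionally filtered by score."""
    reviewed = {d.get("_qid") for d in decisions}
    remaining = [r for r in queue if r.get("_qid") not in reviewed]
    if filter_mode == "auto":      # auto-aligned rows
        return [r for r in remaining if r.get("score", 0) >= THRESHOLD]
    if filter_mode == "review":    # rows that still need a human look
        return [r for r in remaining if r.get("score", 0) < THRESHOLD]
    return remaining

# Made-up rows, used only to show the behaviour.
rows = [
    {"sentence": "a", "score": 0.9},
    {"sentence": "b", "score": 0.3},
    {"sentence": "c", "score": 0.6},
]
queue = build_queue(rows)
decisions = [{"_qid": 0, "action": "accept"}]  # the top-scored row was reviewed
print([r["sentence"] for r in remaining_rows(queue, decisions)])  # ['c', 'b']
```

Keying on an id assigned once at queue creation keeps the exclusion stable even when two rows share the same sentence text.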

Comment thread GSoC26_H/hitl/hitl_app.py
Comment on lines +362 to +372
st.markdown(f'<div class="sentence-box">{row.get("sentence", "")}</div>', unsafe_allow_html=True)

st.markdown("<br>**Extracted triple**", unsafe_allow_html=True)
chip_html = (
    '<div class="chip-row">'
    f'<div class="chip"><div class="chip-label">Subject</div><div class="chip-value">{row.get("subject","")}</div></div>'
    f'<div class="chip"><div class="chip-label">Relation</div><div class="chip-value">{row.get("relation","")}</div></div>'
    f'<div class="chip"><div class="chip-label">Object</div><div class="chip-value">{row.get("object","")}</div></div>'
    '</div>'
)
st.markdown(chip_html, unsafe_allow_html=True)

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Escape dynamic values before rendering with unsafe_allow_html=True.

JSONL-sourced fields are injected into raw HTML without escaping, so malformed/malicious content can alter the UI and reviewer decisions.

Suggested fix
 import streamlit as st
 import json
 import os
+import html
 from datetime import datetime
@@
-    st.markdown(f'<div class="sentence-box">{row.get("sentence", "")}</div>', unsafe_allow_html=True)
+    sentence = html.escape(str(row.get("sentence", "")))
+    st.markdown(f'<div class="sentence-box">{sentence}</div>', unsafe_allow_html=True)
@@
+    subject = html.escape(str(row.get("subject", "")))
+    relation = html.escape(str(row.get("relation", "")))
+    obj = html.escape(str(row.get("object", "")))
     chip_html = (
         '<div class="chip-row">'
-        f'<div class="chip"><div class="chip-label">Subject</div><div class="chip-value">{row.get("subject","")}</div></div>'
-        f'<div class="chip"><div class="chip-label">Relation</div><div class="chip-value">{row.get("relation","")}</div></div>'
-        f'<div class="chip"><div class="chip-label">Object</div><div class="chip-value">{row.get("object","")}</div></div>'
+        f'<div class="chip"><div class="chip-label">Subject</div><div class="chip-value">{subject}</div></div>'
+        f'<div class="chip"><div class="chip-label">Relation</div><div class="chip-value">{relation}</div></div>'
+        f'<div class="chip"><div class="chip-label">Object</div><div class="chip-value">{obj}</div></div>'
         '</div>'
     )
@@
-        st.markdown(f'<div class="suggestion-uri">{dbo_uri}</div>', unsafe_allow_html=True)
+        st.markdown(f'<div class="suggestion-uri">{html.escape(str(dbo_uri))}</div>', unsafe_allow_html=True)
@@
-        f'<div class="suggestion-caption">Confidence {score:.2f} · matched via {method or "—"}</div>'
+        f'<div class="suggestion-caption">Confidence {score:.2f} · matched via {html.escape(str(method or "—"))}</div>'
     )

Also applies to: 377-388

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/hitl/hitl_app.py` around lines 362 - 372, Dynamic values from the
JSONL data (sentence, subject, relation, object) are being directly injected
into HTML without escaping before rendering with unsafe_allow_html=True,
creating a security vulnerability. Import the html module and use html.escape()
to sanitize each value retrieved from row.get() calls before inserting them into
the chip_html and sentence_box HTML strings. Apply this escaping to all four
fields (sentence, subject, relation, object) in both the affected sections
(lines 362-372 and 377-388) to prevent malicious or malformed content from being
rendered as raw HTML.
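
To make the risk concrete, here is a small sketch of the escaping step outside Streamlit. The helper `render_chip` and the sample row are hypothetical; only `html.escape` from the standard library is relied on.

```python
import html

def render_chip(label: str, value) -> str:
    """Build one chip's HTML with the dynamic value escaped."""
    safe = html.escape(str(value))  # escapes <, >, & and quotes
    return (
        f'<div class="chip"><div class="chip-label">{label}</div>'
        f'<div class="chip-value">{safe}</div></div>'
    )

# A made-up row whose subject contains markup.
row = {"subject": '<script>alert("x")</script>'}
chip = render_chip("Subject", row.get("subject", ""))
print(chip)
```

The printed chip contains `&lt;script&gt;` rather than a live tag, so the value shows up as text when the string is passed to `st.markdown(..., unsafe_allow_html=True)`.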

Comment thread GSoC26_H/hitl/hitl_app.py
Comment on lines +429 to +440
new_prop = st.selectbox(
    "Correct dbo: property",
    options=["— type a custom property below —"] + PROPERTY_OPTIONS,
    key=f"modify_select_{idx}",
    label_visibility="collapsed",
)
custom_prop = ""
if new_prop == "— type a custom property below —":
    custom_prop = st.text_input("Custom property", placeholder="dbo:somePropertyName", key=f"custom_{idx}")
if st.button("Save correction", key=f"save_mod_{idx}"):
    final_uri = custom_prop.strip() if custom_prop.strip() else new_prop
    save_decision("modify", final_dbo_uri=final_uri)

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Prevent saving the placeholder as final_dbo_uri.

If the user keeps the default option and leaves custom input empty, "— type a custom property below —" is persisted as a property URI.

Suggested fix
 if st.session_state.get("show_modify"):
@@
     if st.button("Save correction", key=f"save_mod_{idx}"):
-        final_uri = custom_prop.strip() if custom_prop.strip() else new_prop
-        save_decision("modify", final_dbo_uri=final_uri)
+        if new_prop == "— type a custom property below —" and not custom_prop.strip():
+            st.warning("Select a curated property or enter a custom `dbo:` property.")
+        else:
+            final_uri = custom_prop.strip() if custom_prop.strip() else new_prop
+            save_decision("modify", final_dbo_uri=final_uri)
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 431-431: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/hitl/hitl_app.py` around lines 429 - 440, The code saves the
placeholder text "— type a custom property below —" as the final_dbo_uri when a
user selects the placeholder option but leaves the custom property input empty.
Add a validation check before the save_decision call that ensures if new_prop
equals the placeholder option, the custom_prop must be non-empty and stripped.
If this validation fails (placeholder selected with no custom input), either
prevent the save_decision from being called by using a conditional check or
display a warning to the user, ensuring only valid property URIs are persisted.
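
One way to keep this validation testable is to move it into a pure function and call it from the button handler. The sketch below assumes a hypothetical helper `resolve_final_uri` and sample `PROPERTY_OPTIONS` values; the placeholder string is copied from the quoted snippet. It also shows the iterable-unpacking form that Ruff's RUF005 asks for.

```python
from typing import Optional

PLACEHOLDER = "— type a custom property below —"  # copied from the quoted code
PROPERTY_OPTIONS = ["dbo:birthPlace", "dbo:author"]  # sample values for illustration

# RUF005: iterable unpacking instead of list concatenation.
options = [PLACEHOLDER, *PROPERTY_OPTIONS]

def resolve_final_uri(selected: str, custom: str) -> Optional[str]:
    """Return the property to persist, or None when nothing valid was chosen."""
    custom = custom.strip()
    if custom:
        return custom          # a typed property wins, as in the original logic
    if selected == PLACEHOLDER:
        return None            # placeholder kept and no custom text: reject
    return selected

print(resolve_final_uri(PLACEHOLDER, "   "))          # None
print(resolve_final_uri(PLACEHOLDER, " dbo:spouse "))  # dbo:spouse
print(resolve_final_uri("dbo:birthPlace", ""))         # dbo:birthPlace
```

In the app, a `None` result would map to the `st.warning(...)` branch of the suggested fix, and any other value would go to `save_decision`.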

Comment thread GSoC26_H/hitl/README.md
Comment on lines +23 to +26
Deployed via [Streamlit Community Cloud](https://share.streamlit.io) —
point it at this repo, branch `gsoc26h-development`, main file path
`GSoC26_H/hitl/hitl_app.py`. Streamlit Cloud uses the `requirements.txt`
in this same folder (not the repo root) for a fast, lightweight build.

📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n GSoC26_H/hitl/README.md | sed -n '20,30p'

Repository: dbpedia/neural-extraction-framework

Length of output: 517


🌐 Web query:

Streamlit Community Cloud requirements.txt file organization placement repo root entrypoint

💡 Result:

In Streamlit Community Cloud, you have flexibility regarding where to place your requirements.txt file and your entrypoint file [1].

Requirements File Placement

Streamlit Community Cloud automatically detects your Python dependencies by searching for a recognized dependency file (such as requirements.txt) [1]. It searches in the following order [1]:

  1. The directory containing your app's entrypoint file [1].
  2. The root of your repository [1].

If you have multiple dependency files or files in both locations, the one in the directory of your entrypoint file takes precedence [1]. You should use only one dependency file (e.g., requirements.txt) per app [1]. Other supported files include pyproject.toml, Pipfile, environment.yml, and uv.lock, which are also searched for using this priority [1].

Important: Unlike requirements.txt, if you use a packages.txt file for system-level dependencies (apt-get), it must be placed at the root of your repository [2].

Entrypoint File

Your entrypoint file (typically your main Python script, such as streamlit_app.py) defines the starting point of your application [3][4]. You can place this file anywhere in your repository; however, you must specify its path in the Streamlit Community Cloud deployment settings under the Main file path field during the app setup or edit process [5].

For the most straightforward configuration, it is common practice to place both your requirements.txt and your entrypoint file in the root of your repository [1][6]. If you choose to organize your repository with subdirectories, ensure your entrypoint file is correctly identified in the Streamlit cloud dashboard [5].


Clarify the Streamlit Cloud dependency-file note.

The README states that Streamlit Cloud uses the requirements.txt "in this same folder (not the repo root)", which is misleading. Streamlit Community Cloud actually searches for requirements.txt in two locations in order: first in the directory containing the entrypoint file (GSoC26_H/hitl/), then in the repo root. It will use the repo-root file if no file exists in the entrypoint directory. The current wording may incorrectly suggest that a repo-root requirements.txt won't be used.

Fix
-Streamlit Cloud uses the `requirements.txt` in this same folder (not the repo root) for a fast, lightweight build.
+Streamlit Cloud can use a `requirements.txt` either in the repo root or alongside `hitl_app.py`.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-Deployed via [Streamlit Community Cloud](https://share.streamlit.io)
-point it at this repo, branch `gsoc26h-development`, main file path
-`GSoC26_H/hitl/hitl_app.py`. Streamlit Cloud uses the `requirements.txt`
-in this same folder (not the repo root) for a fast, lightweight build.
+Deployed via [Streamlit Community Cloud](https://share.streamlit.io)
+point it at this repo, branch `gsoc26h-development`, main file path
+`GSoC26_H/hitl/hitl_app.py`. Streamlit Cloud can use a `requirements.txt` either in the repo root or alongside `hitl_app.py`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/hitl/README.md` around lines 23 - 26, The README statement about
Streamlit Community Cloud's requirements.txt handling is misleading because it
suggests only the local folder file is used and excludes repo-root usage. Revise
the text in the deployment section (around lines 23-26) to clarify that
Streamlit Community Cloud searches for requirements.txt first in the entrypoint
directory (GSoC26_H/hitl/) and then falls back to the repo root if no file
exists in the entrypoint directory, rather than stating it uses the local file
"not the repo root", which incorrectly implies the repo-root file is never used.

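The search order described in this comment can be modelled in a few lines. This is only a sketch of the documented behaviour as summarised above, not Streamlit's actual implementation: `pick_requirements` is a hypothetical helper, and it looks at `requirements.txt` only, ignoring the other supported dependency files.

```python
import tempfile
from pathlib import Path
from typing import Optional

def pick_requirements(repo_root: Path, entrypoint: Path) -> Optional[Path]:
    """Return the requirements.txt that wins: entrypoint folder first, then repo root."""
    for folder in (entrypoint.parent, repo_root):
        candidate = folder / "requirements.txt"
        if candidate.is_file():
            return candidate
    return None

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    app_dir = root / "GSoC26_H" / "hitl"
    app_dir.mkdir(parents=True)
    entry = app_dir / "hitl_app.py"
    entry.touch()

    # Only the repo-root file exists: it is used as the fallback.
    (root / "requirements.txt").write_text("streamlit\n")
    first_rel = pick_requirements(root, entry).relative_to(root).as_posix()

    # A file next to the entrypoint now takes precedence.
    (app_dir / "requirements.txt").write_text("streamlit\n")
    second_rel = pick_requirements(root, entry).relative_to(root).as_posix()

print(first_rel)   # requirements.txt
print(second_rel)  # GSoC26_H/hitl/requirements.txt
```

The first result is the case the README wording hides: with no file in the entrypoint folder, the repo-root file is still picked up.
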
Comment thread GSoC26_H/src/finetune/generate_noisy_data.py
Comment on lines +149 to +164
def _parse_generated_json(raw: str) -> Optional[dict]:
    """Strip optional markdown code fences and parse the model's JSON output."""
    try:
        cleaned = raw.strip()
        if cleaned.startswith("```"):
            cleaned = cleaned.split("```")[1].replace("json", "", 1)
        parsed = json.loads(cleaned)
        if not (
            parsed.get("hindi_sentence")
            and parsed.get("thought_process")
            and isinstance(parsed.get("extracted_triplets"), list)
        ):
            return None
        return parsed
    except (json.JSONDecodeError, IndexError):
        return None

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Non-dict JSON output crashes the entire batch.

_parse_generated_json only catches json.JSONDecodeError and IndexError. If the model returns valid JSON that is not an object (e.g. a list, string, or number), parsed.get(...) at lines 156-159 raises AttributeError. Since line 253 calls this outside any try, the exception propagates through future.result() at line 301 and terminates the whole run_batch run — losing all in-progress work for a single malformed response. For a resilient mass-generation script this is a real availability risk.

🛡️ Proposed fix: guard for non-dict payloads
         parsed = json.loads(cleaned)
+        if not isinstance(parsed, dict):
+            return None
         if not (
             parsed.get("hindi_sentence")
             and parsed.get("thought_process")
             and isinstance(parsed.get("extracted_triplets"), list)
         ):
             return None
         return parsed
-    except (json.JSONDecodeError, IndexError):
+    except (json.JSONDecodeError, IndexError, AttributeError):
         return None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/finetune/generate_noisy_data.py` around lines 149 - 164, The
_parse_generated_json function only catches json.JSONDecodeError and IndexError,
but when the model returns valid JSON that is not a dictionary (like a list,
string, or number), the .get() calls on parsed raise an AttributeError which is
not caught, causing the entire batch to fail. Fix this by either adding
AttributeError to the except clause alongside json.JSONDecodeError and
IndexError, or by adding a type check (isinstance(parsed, dict)) before
attempting to call .get() on the parsed value to ensure only dictionary objects
are validated.
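
A standalone version of the guarded parser makes the failure mode easy to reproduce. The sketch below is an assumption-laden rewrite, not the PR's code: it is renamed `parse_generated_json`, builds the fence marker from single backticks to keep the example readable, and uses made-up payloads.

```python
import json
from typing import Optional

FENCE = "`" * 3  # markdown code-fence marker

def parse_generated_json(raw: str) -> Optional[dict]:
    """Parse model output, returning None for anything that is not a valid record."""
    try:
        cleaned = raw.strip()
        if cleaned.startswith(FENCE):
            # Keep the text between the first pair of fences and drop a "json" tag.
            cleaned = cleaned.split(FENCE)[1].replace("json", "", 1)
        parsed = json.loads(cleaned)
    except (json.JSONDecodeError, IndexError):
        return None
    if not isinstance(parsed, dict):  # valid JSON, but a list, string, or number
        return None
    if not (
        parsed.get("hindi_sentence")
        and parsed.get("thought_process")
        and isinstance(parsed.get("extracted_triplets"), list)
    ):
        return None
    return parsed

good = '{"hindi_sentence": "x", "thought_process": "y", "extracted_triplets": []}'
print(parse_generated_json("[1, 2]"))                               # None
print(parse_generated_json(FENCE + "json\n" + good + "\n" + FENCE) is not None)  # True
```

With the `isinstance` check in place, a list-shaped response is treated like any other malformed output instead of raising inside a worker.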

Comment thread GSoC26_H/src/finetune/generate_noisy_data.py
Comment on lines +376 to +378
seeds = build_noisy_few_shot_seeds(scored_examples)
print(f"Built {len(seeds)} noisy few-shot seeds "
      f"(scores: {[s.get('score') for s in seeds]})")

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Seed-score diagnostic always prints None.

The seed dicts built in build_noisy_few_shot_seeds contain only hindi_sentence, thought_process, and extracted_triplets (lines 109-113) — there is no score key. So s.get('score') yields None for every seed, defeating the purpose of this log line, which is meant to surface the (low) quality of the sampled seeds.

Consider carrying the score forward in the seed dict, e.g. add "score": ex["judgement"]["score"] in the seed construction loop, then this log will reflect actual values.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/src/finetune/generate_noisy_data.py` around lines 376 - 378, The
print statement in the seed-building section attempts to log scores from seeds
by calling `s.get('score')`, but the seed dictionaries constructed in the
`build_noisy_few_shot_seeds` function only contain `hindi_sentence`,
`thought_process`, and `extracted_triplets` keys with no `score` field. Modify
the `build_noisy_few_shot_seeds` function to include the score value from each
example's judgement in the seed dictionary construction (add `"score":
ex["judgement"]["score"]` to the seed dict when iterating through examples), so
that the subsequent log message will reflect the actual seed quality scores
instead of printing None values.
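
A minimal sketch of carrying the score forward is shown below. The real selection logic of `build_noisy_few_shot_seeds` is not visible in this thread, so the function here is a hypothetical stand-in named `build_seeds_with_score`; the `max_score` and `limit` parameters and the sample records are assumptions, and only the `judgement.score` path comes from the comment above.

```python
def build_seeds_with_score(scored_examples, max_score=7, limit=3):
    """Pick low-scoring examples as few-shot seeds and keep their score."""
    seeds = []
    for ex in scored_examples:
        score = ex["judgement"]["score"]
        if score > max_score:
            continue  # keep only noisy (low-quality) examples
        seeds.append({
            "hindi_sentence": ex["hindi_sentence"],
            "thought_process": ex["thought_process"],
            "extracted_triplets": ex["extracted_triplets"],
            "score": score,  # carried forward so the log line shows real values
        })
        if len(seeds) == limit:
            break
    return seeds

# Made-up records with placeholder text instead of real Hindi sentences.
examples = [
    {"hindi_sentence": "s1", "thought_process": "t1", "extracted_triplets": [], "judgement": {"score": 9}},
    {"hindi_sentence": "s2", "thought_process": "t2", "extracted_triplets": [], "judgement": {"score": 4}},
    {"hindi_sentence": "s3", "thought_process": "t3", "extracted_triplets": [], "judgement": {"score": 6}},
]
seeds = build_seeds_with_score(examples)
print(f"Built {len(seeds)} noisy few-shot seeds "
      f"(scores: {[s.get('score') for s in seeds]})")
# Built 2 noisy few-shot seeds (scores: [4, 6])
```

If the extra key must not reach the prompt, it can be dropped when the seeds are serialized into few-shot examples.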

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (4)
GSoC26_H/results/ground_truth_summary.md (1)

13-24: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to fenced code block.

The method steps block is missing a language identifier. Add text to satisfy markdownlint MD040.

- ```
+ ```text
  Step 1: Parse hindi_benchie_gold.txt → 139 canonical Cluster-1 triples
  ...
  Step 5: All 139 entries manually verified by contributor; 6 borderline cases
          corrected; 15 data-corruption entries (ERROR in reasoning) regenerated;
          4 reasoning/answer contradictions fixed

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/results/ground_truth_summary.md` around lines 13 - 24, The fenced
methods block in ground_truth_summary.md is missing a language tag, so update
the markdown fence used for the Step 1–Step 5 block to specify text and satisfy
markdownlint MD040. Locate the block containing the numbered pipeline steps and
change the opening fence to a text-labeled fence while leaving the content
unchanged.

GSoC26_H/results/embedding_model_comparison.md (1)

59-65: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to fenced code block.

The pipeline configuration block is missing a language identifier. Add text or yaml to satisfy markdownlint MD040 and improve syntax highlighting.

- ```
+ ```text
  Stage 1: intfloat/multilingual-e5-large-instruct
           → top-15 candidates per predicate (no threshold)
  
  Stage 2: openai/gpt-oss-120b (NVIDIA hosted)
           → disambiguates using full sentence context + 15 diverse few-shot examples
           → outputs final dbo: property or NONE

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/results/embedding_model_comparison.md` around lines 59 - 65, The
fenced pipeline configuration block in embedding_model_comparison.md is missing
a language specifier, so update that Markdown code fence to include a valid
identifier such as text or yaml. Locate the block containing the Stage 1 / Stage
2 configuration and adjust the opening fence only so markdownlint MD040 is
satisfied and syntax highlighting works.

GSoC26_H/results/noisy_dataset_summary.md (1)

39-47: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to fenced code block.

The staged training plan block is missing a language identifier. Add text to satisfy markdownlint MD040.

- ```
+ ```text
  Phase 1 (noisy training data):
    Original score<8 examples:     ~11,367
    ...
  Phase 2 (clean refinement data):
    Original score≥9 examples:      8,633 (unchanged, verified quality)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/results/noisy_dataset_summary.md` around lines 39 - 47, The fenced
block in the staged training plan is missing a language specifier, so update the
markdown snippet in the noisy dataset summary to use a text-labeled fence.
Locate the code block containing the Phase 1 and Phase 2 training data summary
and add the text identifier to the opening fence so it satisfies markdownlint
MD040.

GSoC26_H/results/wikipedia_generation_summary.md (1)

43-47: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to fenced code block.

The prompt schema block is missing a language identifier. Add json or text to satisfy markdownlint MD040.

- ```
+ ```json
  System: Extract all subject-relation-object triplets...
  User: <real Hindi Wikipedia sentence>
  Assistant: {"thought_process": "...", "extracted_triplets": [...]}

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@GSoC26_H/results/wikipedia_generation_summary.md` around lines 43 - 47, The
fenced prompt schema block is missing a language tag, so update the markdown
snippet in the summary to use an appropriate specifier such as json or text on
the opening fence. Locate the example block containing System/User/Assistant and
adjust only the fence syntax so it satisfies markdownlint MD040 while keeping
the content unchanged.


🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@GSoC26_H/results/embedding_model_comparison.md`:

  • Around line 59-65: The fenced pipeline configuration block in
    embedding_model_comparison.md is missing a language specifier, so update that
    Markdown code fence to include a valid identifier such as text or yaml. Locate
    the block containing the Stage 1 / Stage 2 configuration and adjust the opening
    fence only so markdownlint MD040 is satisfied and syntax highlighting works.

In `@GSoC26_H/results/ground_truth_summary.md`:

  • Around line 13-24: The fenced methods block in ground_truth_summary.md is
    missing a language tag, so update the markdown fence used for the Step 1–Step 5
    block to specify text and satisfy markdownlint MD040. Locate the block
    containing the numbered pipeline steps and change the opening fence to a
    text-labeled fence while leaving the content unchanged.

In `@GSoC26_H/results/noisy_dataset_summary.md`:

  • Around line 39-47: The fenced block in the staged training plan is missing a
    language specifier, so update the markdown snippet in the noisy dataset summary
    to use a text-labeled fence. Locate the code block containing the Phase 1 and
    Phase 2 training data summary and add the text identifier to the opening fence
    so it satisfies markdownlint MD040.

In `@GSoC26_H/results/wikipedia_generation_summary.md`:

  • Around line 43-47: The fenced prompt schema block is missing a language tag,
    so update the markdown snippet in the summary to use an appropriate specifier
    such as json or text on the opening fence. Locate the example block containing
    System/User/Assistant and adjust only the fence syntax so it satisfies
    markdownlint MD040 while keeping the content unchanged.
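
All four nitpicks concern opening fences without a language tag, so a quick local scan can find them before a linter run. The sketch below is a simplified approximation of what MD040 looks for, not a replacement for markdownlint: the function name `fences_missing_language` is hypothetical, and indentation rules and nested containers are ignored.

```python
import re

# An opening or closing fence: three or more backticks or tildes, then an optional info string.
FENCE_RE = re.compile(r"^\s*(`{3,}|~{3,})(.*)$")
FENCE = "`" * 3  # used to build the sample document below

def fences_missing_language(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that carry no language tag."""
    missing = []
    open_marker = None
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE_RE.match(line)
        if not match:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_marker is None:
            open_marker = marker          # this line opens a block
            if not info:
                missing.append(lineno)
        elif marker[0] == open_marker[0] and len(marker) >= len(open_marker) and not info:
            open_marker = None            # matching bare fence closes the block
    return missing

# Made-up document: the first block is untagged, the second is tagged "text".
doc = "\n".join(["Intro", FENCE, "Step 1", FENCE, "", FENCE + "text", "Stage 1", FENCE])
print(fences_missing_language(doc))  # [2]
```

Pointing it at the four files listed above would report the opening-fence lines that need a `text` tag.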

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5f088cb-c5f8-44b0-93d6-2dc09a4e465d

📥 Commits

Reviewing files that changed from the base of the PR and between 3013015aed21534ef4ddf50e559c35854c3d6e23 and 3a1e89d2a1e716527ca38ac148bfdeff8ad6c5a9.

📒 Files selected for processing (4)
  • GSoC26_H/results/embedding_model_comparison.md
  • GSoC26_H/results/ground_truth_summary.md
  • GSoC26_H/results/noisy_dataset_summary.md
  • GSoC26_H/results/wikipedia_generation_summary.md

Updated error taxonomy to include a new error type for Predicate Placeholder and adjusted related comments and logic.
Updated GPU target and model configuration for A2 and T4. Adjusted alignment threshold and added ground truth file for evaluation.