Fix incorrect variable reference for reference answers#90

Open

kibitzing wants to merge 1 commit into lmarena:main from kibitzing:patch-1

Conversation

@kibitzing

Replaced answer_dir with model_answers when loading reference answers.
answer_dir is a path string, not a mapping of model names to answers.
Using model_answers correctly retrieves the answers for reference models.
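
A minimal sketch of the one-line change, assuming a list comprehension over the configured reference models (ref_models and the sample data below are hypothetical; only the answer_dir to model_answers swap is from this PR):

model_answers = {"ref-model": {"uid-1": {"messages": []}}}  # hypothetical: model -> uid -> answer
ref_models = ["ref-model"]                                  # hypothetical reference config

# Before (the bug): answer_dir held a path string such as "data/model_answer",
# so answer_dir[model] could never return a model's answers.
# ref_answers = [answer_dir[model] for model in ref_models]

# After (the fix): model_answers maps model names to their answers.
ref_answers = [model_answers[model] for model in ref_models]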

@coderabbitai

coderabbitai Bot commented Oct 31, 2025

Walkthrough

The change modifies gen_judgment.py to source reference answers from the in-memory model_answers dictionary (model_answers[model]) instead of indexing the answer_dir path string (answer_dir[model]). Only the data retrieval source changes; the subsequent logic for constructing reference prompts is unaltered.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Reference Answer Sourcing: gen_judgment.py | Modified reference answer retrieval to use the in-memory model_answers dictionary instead of the directory-based answer_dir lookup when a reference set is provided |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify that model_answers[model] contains the expected data structure and completeness compared to the previous answer_dir[model] source
  • Confirm downstream usage of ref_answers remains compatible with the new data source

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "Fix incorrect variable reference for reference answers" clearly and specifically summarizes the main change in the pull request: the changeset replaces answer_dir with model_answers when loading reference answers, exactly what the title indicates. The title is concise, avoids generic language, and directly communicates the primary intent of the change to a reviewer scanning the history. |
| Description Check | ✅ Passed | The description is clearly related to the changeset and provides meaningful context. It explains what was changed (replacing answer_dir with model_answers), why the change was necessary (answer_dir is a path string, not a mapping of model names to answers), and the benefit (correctly retrieving reference model answers). The description is neither vague nor unrelated to the changeset, meeting the lenient pass criteria. |



coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
gen_judgment.py (1)

183-186: Fix reference handling: reference is constructed as a list but used as a single dict.

The code at line 184 creates kwargs["reference"] as a list of reference answers, but this list is passed directly to pairwise_judgment(), which expects a single reference object. At line 42, pairwise_judgment() attempts dict-style access (reference["messages"][-1]), which will fail when reference is a list.

Either:

  • Line 184 should assign a single reference (e.g., kwargs["reference"] = ref_answers[0][uid] if only one reference is needed)
  • Or judgment() should iterate through the reference answers and call pairwise_judgment() separately for each one (a sketch of this option follows the list)
  • Or the reference list needs to be unpacked before being passed to pairwise_judgment()
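
A minimal sketch of the iteration option, with names taken from this review comment (ref_answers, uid, and the pairwise_judgment call shape are assumptions, not verified against the file):

def judge_against_each_reference(question, baseline, answer, ref_answers,
                                 uid, configs, settings):
    # Unpack the per-question reference from each reference model's answers;
    # fall back to a single None so the loop still runs without references.
    references = [ref[uid] for ref in ref_answers] if ref_answers else [None]
    results = []
    for reference in references:
        # Each call receives a single dict (or None), matching the
        # reference["messages"][-1] access described at line 42.
        results.append(pairwise_judgment(question, baseline, answer,
                                         reference, configs, settings))
    return results
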
🧹 Nitpick comments (1)
gen_judgment.py (1)

9-16: Remove duplicate import.

The load_questions function is imported twice (lines 10 and 12).

Apply this diff to remove the duplicate:

 from utils.completion import (
     load_questions,
     registered_api_completion,
-    load_questions,
     load_model_answers,
     get_endpoint,
     make_config,
 )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 196f6b8 and 0c6045f.

📒 Files selected for processing (1)
  • gen_judgment.py (1 hunk)
🔇 Additional comments (1)
gen_judgment.py (1)

140-144: Correct fix for reference answer retrieval.

The change correctly retrieves reference answers from model_answers (a dictionary mapping model names to their answers) instead of answer_dir (a string path). This aligns with how model_answers is used throughout the code (e.g., lines 170, 178-181).
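
To illustrate the type mismatch (the shapes below are hypothetical; the nesting of model_answers is inferred from the lines cited above):

answer_dir = "data/model_answer"        # a path string
model_answers = {                       # model name -> question uid -> answer record
    "ref-model": {
        "uid-1": {"model": "ref-model", "messages": []},
    },
}

model_answers["ref-model"]["uid-1"]   # the fixed lookup returns the answer record
# answer_dir["ref-model"]             # the old bug: TypeError, string indices must be integers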

@coderabbitai

coderabbitai Bot commented Nov 10, 2025

Warning

Docstring generation is disabled for your repository or organization.

@coderabbitai

coderabbitai Bot commented Nov 10, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the content of gen_judgment.py to understand the code structure
cat gen_judgment.py

Length of output: 6309


@qazwsx74269: I'll generate comprehensive docstrings for the functions in gen_judgment.py. Here are the docstrings for the module and all functions:

Module-level docstring (add at the top of the file after imports):

"""
Arena Hard Auto - Judgment Generation Module

This module implements pairwise comparison judgments between model answers using
a judge model. It supports reference-based evaluation and produces judgment scores
through LLM-based evaluation.

The module can be run as a CLI tool to generate judgments for multiple models in
parallel, with results stored in JSONL format.
"""

For get_score function:

def get_score(judgment, patterns):
    """
    Extract a score from a judgment text using regex patterns.
    
    Attempts to match the judgment text against a list of regex patterns and
    returns the last non-empty match found. The judgment is converted to uppercase
    before matching.
    
    Args:
        judgment (str): The judgment text to extract a score from.
        patterns (list[str]): List of regex patterns to match against the judgment.
    
    Returns:
        str or None: The extracted score string with newlines stripped, or None if
            no matches are found.
    """

For pairwise_judgment function:

def pairwise_judgment(question, baseline, answer, reference, configs, settings):
    """
    Perform a pairwise judgment comparison between a baseline and candidate answer.
    
    Constructs a prompt with the question and both answers, optionally including
    a reference answer. Sends the prompt to a judge model API and extracts a score
    from the response.
    
    Args:
        question (dict): Question dictionary containing 'prompt', 'category', and 'uid'.
        baseline (dict): Baseline answer dictionary with 'messages' containing the answer.
        answer (dict): Candidate answer dictionary with 'messages' containing the answer.
        reference (dict or None): Optional reference answer dictionary, or None if not used.
        configs (dict): Configuration dictionary with 'prompt_template', 'temperature',
            'max_tokens', and 'regex_patterns'.
        settings (dict): Endpoint settings dictionary with 'endpoints' and 'api_type'.
    
    Returns:
        dict or None: Dictionary containing 'score', 'judgment', and 'prompt' fields,
            or None if the API call fails.
    """

For judgment function:

def judgment(args):
    """
    Execute a complete judgment cycle with two rounds of pairwise comparisons.
    
    Performs two rounds of pairwise judgment: first comparing answer vs baseline,
    then baseline vs answer (swapped positions). Results from both rounds are
    appended to an output file in JSONL format.
    
    Args:
        args (dict): Arguments dictionary containing:
            - question (dict): Question data with 'uid', 'category', and 'prompt'.
            - answer (dict): Candidate answer with 'model' name and 'messages'.
            - baseline (dict): Baseline answer with 'model' name and 'messages'.
            - reference (dict or None): Optional reference answer or None.
            - configs (dict): Configuration with 'judge_model' and judgment parameters.
            - settings (dict): Endpoint settings for the judge API.
            - output_file (str): Path to the output JSONL file.
    
    Returns:
        None: Results are written directly to the output file.
    """

These docstrings follow the Google style guide and provide clear documentation for:

  • Purpose of each function
  • Parameter descriptions with types
  • Return value descriptions with types
  • Important behavioral notes

Would you like me to provide the complete file with all docstrings integrated, or would you prefer any modifications to the docstring style or content?

