Fix incorrect variable reference for reference answers #90
kibitzing wants to merge 1 commit into lmarena:main from
Conversation
Walkthrough: The change modifies gen_judgment.py so that reference answers are loaded from `model_answers` instead of `answer_dir`.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
gen_judgment.py (1)
**183-186: Fix reference handling: `reference` is constructed as a list but used as a single dict.**

The code at line 184 creates `kwargs["reference"]` as a list of reference answers, but this list is passed directly to `pairwise_judgment()`, which expects a single reference object. At line 42, `pairwise_judgment()` attempts dict-style access (`reference["messages"][-1]`), which will fail when `reference` is a list. Either:

- Line 184 should assign a single reference (e.g., `kwargs["reference"] = ref_answers[0][uid]` if only one reference is needed)
- Or `judgment()` should iterate through the reference answers and call `pairwise_judgment()` separately for each one (see the sketch below)
- Or the reference list needs to be unpacked before being passed to `pairwise_judgment()`
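A minimal sketch of the second option, assuming `kwargs` holds the fields built around line 184 and that each entry of `kwargs["reference"]` is a single answer dict. The wrapper name `judgment_with_references` is hypothetical; only `pairwise_judgment` comes from the file:

```python
def judgment_with_references(kwargs):
    """Sketch: call pairwise_judgment() once per reference answer."""
    results = []
    # Fall back to a single None so the no-reference case still judges once.
    for ref in kwargs["reference"] or [None]:
        results.append(
            pairwise_judgment(
                question=kwargs["question"],
                baseline=kwargs["baseline"],
                answer=kwargs["answer"],
                reference=ref,  # a single dict (or None), never a list
                configs=kwargs["configs"],
                settings=kwargs["settings"],
            )
        )
    return results
```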
🧹 Nitpick comments (1)
gen_judgment.py (1)
**9-16: Remove duplicate import.**

The `load_questions` function is imported twice (lines 10 and 12). Apply this diff to remove the duplicate:

```diff
 from utils.completion import (
     load_questions,
     registered_api_completion,
-    load_questions,
     load_model_answers,
     get_endpoint,
     make_config,
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
gen_judgment.py (1 hunks)
🔇 Additional comments (1)
gen_judgment.py (1)
**140-144: Correct fix for reference answer retrieval.**

The change correctly retrieves reference answers from `model_answers` (a dictionary mapping model names to their answers) instead of `answer_dir` (a string path). This aligns with how `model_answers` is used throughout the code (e.g., lines 170, 178-181).
Warning: Docstrings generation is disabled for your repository or organization.
Tip: For best results, initiate chat on the files or code changes.

🧩 Analysis chain: 🏁 Script executed (length of output: 6309)

@qazwsx74269: I'll generate comprehensive docstrings for the functions in gen_judgment.py.

Module-level docstring (add at the top of the file after imports):

```python
"""
Arena Hard Auto - Judgment Generation Module

This module implements pairwise comparison judgments between model answers using
a judge model. It supports reference-based evaluation and produces judgment scores
through LLM-based evaluation.

The module can be run as a CLI tool to generate judgments for multiple models in
parallel, with results stored in JSONL format.
"""
```

For `get_score`:

```python
def get_score(judgment, patterns):
    """
    Extract a score from a judgment text using regex patterns.

    Attempts to match the judgment text against a list of regex patterns and
    returns the last non-empty match found. The judgment is converted to
    uppercase before matching.

    Args:
        judgment (str): The judgment text to extract a score from.
        patterns (list[str]): List of regex patterns to match against the judgment.

    Returns:
        str or None: The extracted score string with newlines stripped, or None
        if no matches are found.
    """
```
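A hypothetical usage sketch of the behavior this docstring describes; the pattern and judgment text below are illustrative assumptions, not values from the repo:

```python
import re

# Illustrative pattern and judgment text (assumptions, not from the repo).
patterns = [r"\[\[([AB<>=]+)\]\]"]
judgment_text = "Assistant A gives the stronger answer.\nVerdict: [[A>B]]"

score = None
for pattern in patterns:
    # Uppercase before matching, keep the last non-empty match, as documented.
    matches = [m for m in re.findall(pattern, judgment_text.upper()) if m]
    if matches:
        score = matches[-1].strip("\n")

print(score)  # -> A>B
```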
"""For def pairwise_judgment(question, baseline, answer, reference, configs, settings):
"""
Perform a pairwise judgment comparison between a baseline and candidate answer.
Constructs a prompt with the question and both answers, optionally including
a reference answer. Sends the prompt to a judge model API and extracts a score
from the response.
Args:
question (dict): Question dictionary containing 'prompt', 'category', and 'uid'.
baseline (dict): Baseline answer dictionary with 'messages' containing the answer.
answer (dict): Candidate answer dictionary with 'messages' containing the answer.
reference (dict or None): Optional reference answer dictionary, or None if not used.
configs (dict): Configuration dictionary with 'prompt_template', 'temperature',
'max_tokens', and 'regex_patterns'.
settings (dict): Endpoint settings dictionary with 'endpoints' and 'api_type'.
Returns:
dict or None: Dictionary containing 'score', 'judgment', and 'prompt' fields,
or None if the API call fails.
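A hypothetical sketch of the optional-reference step this docstring describes; the helper name, template fields, and `"content"` key are assumptions, though the `reference["messages"][-1]` access is the one the out-of-diff comment flags:

```python
# Hypothetical helper; field names and template handling are assumptions.
def build_user_prompt(question, baseline, answer, reference, template):
    fields = {
        "question": question["prompt"],
        "answer_a": baseline["messages"][-1]["content"],
        "answer_b": answer["messages"][-1]["content"],
        "reference": "",
    }
    if reference is not None:
        # Dict-style access on a single reference; passing a list here is
        # exactly the failure mode described in the out-of-diff comment.
        fields["reference"] = reference["messages"][-1]["content"]
    return template.format(**fields)
```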
"""For def judgment(args):
"""
Execute a complete judgment cycle with two rounds of pairwise comparisons.
Performs two rounds of pairwise judgment: first comparing answer vs baseline,
then baseline vs answer (swapped positions). Results from both rounds are
appended to an output file in JSONL format.
Args:
args (dict): Arguments dictionary containing:
- question (dict): Question data with 'uid', 'category', and 'prompt'.
- answer (dict): Candidate answer with 'model' name and 'messages'.
- baseline (dict): Baseline answer with 'model' name and 'messages'.
- reference (dict or None): Optional reference answer or None.
- configs (dict): Configuration with 'judge_model' and judgment parameters.
- settings (dict): Endpoint settings for the judge API.
- output_file (str): Path to the output JSONL file.
Returns:
None: Results are written directly to the output file.
"""These docstrings follow the Google style guide and provide clear documentation for:
Would you like me to provide the complete file with all docstrings integrated, or would you prefer any modifications to the docstring style or content? |
Replaced `answer_dir` with `model_answers` when loading reference answers. `answer_dir` is a path string, not a mapping of model names to answers. Using `model_answers` correctly retrieves the answers for reference models.
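A hedged sketch of the data shapes behind the fix, assumed from the review discussion rather than taken from the repo:

```python
# Assumed shapes: model_answers maps model name -> {question uid -> answer},
# while answer_dir is just a path string.
model_answers = {
    "ref-model": {
        "q-001": {"model": "ref-model",
                  "messages": [{"role": "assistant", "content": "42"}]},
    },
}
answer_dir = "data/arena-hard/model_answer"
ref_models = ["ref-model"]

# After the fix: dict lookup by model name works.
ref_answers = [model_answers[m] for m in ref_models]

# Before the fix, this indexed a string with a string key, which raises
# "TypeError: string indices must be integers":
# ref_answers = [answer_dir[m] for m in ref_models]
```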