Fix incorrect variable reference for reference answers #90
kibitzing wants to merge 1 commit into lmarena:main from
Conversation
Walkthrough: The change modifies gen_judgment.py so that reference answers are loaded from `model_answers` instead of `answer_dir`.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
gen_judgment.py (1)
**183-186: Fix reference handling: `reference` is constructed as a list but used as a single dict.**

The code at line 184 creates `kwargs["reference"]` as a list of reference answers, but this list is passed directly to `pairwise_judgment()`, which expects a single reference object. At line 42, `pairwise_judgment()` attempts dict-style access (`reference["messages"][-1]`), which will fail when `reference` is a list. Either:

- Line 184 should assign a single reference (e.g., `kwargs["reference"] = ref_answers[0][uid]` if only one reference is needed)
- Or `judgment()` should iterate through the reference answers and call `pairwise_judgment()` separately for each one (see the sketch below)
- Or the reference list needs to be unpacked before being passed to `pairwise_judgment()`
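A minimal sketch of the second option, assuming `kwargs` holds the fields built around line 184 and that each entry of `kwargs["reference"]` is a single answer dict. The wrapper name `judgment_with_references` is hypothetical; only `pairwise_judgment` comes from the file:

```python
def judgment_with_references(kwargs):
    """Sketch: call pairwise_judgment() once per reference answer."""
    results = []
    # Fall back to a single None so the no-reference case still judges once.
    for ref in kwargs["reference"] or [None]:
        results.append(
            pairwise_judgment(
                question=kwargs["question"],
                baseline=kwargs["baseline"],
                answer=kwargs["answer"],
                reference=ref,  # a single dict (or None), never a list
                configs=kwargs["configs"],
                settings=kwargs["settings"],
            )
        )
    return results
```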
🧹 Nitpick comments (1)
gen_judgment.py (1)
**9-16: Remove duplicate import.**

The `load_questions` function is imported twice (lines 10 and 12). Apply this diff to remove the duplicate:

```diff
 from utils.completion import (
     load_questions,
     registered_api_completion,
-    load_questions,
     load_model_answers,
     get_endpoint,
     make_config,
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
gen_judgment.py (1 hunks)
🔇 Additional comments (1)
gen_judgment.py (1)
**140-144: Correct fix for reference answer retrieval.**

The change correctly retrieves reference answers from `model_answers` (a dictionary mapping model names to their answers) instead of `answer_dir` (a string path). This aligns with how `model_answers` is used throughout the code (e.g., lines 170, 178-181).
Warning: Docstrings generation is disabled for your repository or organization.
Tip: For best results, initiate chat on the files or code changes.

🧩 Analysis chain: 🏁 Script executed (length of output: 6309)

@qazwsx74269: I'll generate comprehensive docstrings for the functions in gen_judgment.py.

Module-level docstring (add at the top of the file after imports):

```python
"""
Arena Hard Auto - Judgment Generation Module

This module implements pairwise comparison judgments between model answers using
a judge model. It supports reference-based evaluation and produces judgment scores
through LLM-based evaluation.

The module can be run as a CLI tool to generate judgments for multiple models in
parallel, with results stored in JSONL format.
"""
```

For `get_score`:

```python
def get_score(judgment, patterns):
    """
    Extract a score from a judgment text using regex patterns.

    Attempts to match the judgment text against a list of regex patterns and
    returns the last non-empty match found. The judgment is converted to
    uppercase before matching.

    Args:
        judgment (str): The judgment text to extract a score from.
        patterns (list[str]): List of regex patterns to match against the judgment.

    Returns:
        str or None: The extracted score string with newlines stripped, or None
        if no matches are found.
    """
```
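A hypothetical usage sketch of the behavior this docstring describes; the pattern and judgment text below are illustrative assumptions, not values from the repo:

```python
import re

# Illustrative pattern and judgment text (assumptions, not from the repo).
patterns = [r"\[\[([AB<>=]+)\]\]"]
judgment_text = "Assistant A gives the stronger answer.\nVerdict: [[A>B]]"

score = None
for pattern in patterns:
    # Uppercase before matching, keep the last non-empty match, as documented.
    matches = [m for m in re.findall(pattern, judgment_text.upper()) if m]
    if matches:
        score = matches[-1].strip("\n")

print(score)  # -> A>B
```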
"""For def pairwise_judgment(question, baseline, answer, reference, configs, settings):
"""
Perform a pairwise judgment comparison between a baseline and candidate answer.
Constructs a prompt with the question and both answers, optionally including
a reference answer. Sends the prompt to a judge model API and extracts a score
from the response.
Args:
question (dict): Question dictionary containing 'prompt', 'category', and 'uid'.
baseline (dict): Baseline answer dictionary with 'messages' containing the answer.
answer (dict): Candidate answer dictionary with 'messages' containing the answer.
reference (dict or None): Optional reference answer dictionary, or None if not used.
configs (dict): Configuration dictionary with 'prompt_template', 'temperature',
'max_tokens', and 'regex_patterns'.
settings (dict): Endpoint settings dictionary with 'endpoints' and 'api_type'.
Returns:
dict or None: Dictionary containing 'score', 'judgment', and 'prompt' fields,
or None if the API call fails.
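A hypothetical sketch of the optional-reference step this docstring describes; the helper name, template fields, and `"content"` key are assumptions, though the `reference["messages"][-1]` access is the one the out-of-diff comment flags:

```python
# Hypothetical helper; field names and template handling are assumptions.
def build_user_prompt(question, baseline, answer, reference, template):
    fields = {
        "question": question["prompt"],
        "answer_a": baseline["messages"][-1]["content"],
        "answer_b": answer["messages"][-1]["content"],
        "reference": "",
    }
    if reference is not None:
        # Dict-style access on a single reference; passing a list here is
        # exactly the failure mode described in the out-of-diff comment.
        fields["reference"] = reference["messages"][-1]["content"]
    return template.format(**fields)
```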
"""For def judgment(args):
"""
Execute a complete judgment cycle with two rounds of pairwise comparisons.
Performs two rounds of pairwise judgment: first comparing answer vs baseline,
then baseline vs answer (swapped positions). Results from both rounds are
appended to an output file in JSONL format.
Args:
args (dict): Arguments dictionary containing:
- question (dict): Question data with 'uid', 'category', and 'prompt'.
- answer (dict): Candidate answer with 'model' name and 'messages'.
- baseline (dict): Baseline answer with 'model' name and 'messages'.
- reference (dict or None): Optional reference answer or None.
- configs (dict): Configuration with 'judge_model' and judgment parameters.
- settings (dict): Endpoint settings for the judge API.
- output_file (str): Path to the output JSONL file.
Returns:
None: Results are written directly to the output file.
"""These docstrings follow the Google style guide and provide clear documentation for:
Would you like me to provide the complete file with all docstrings integrated, or would you prefer any modifications to the docstring style or content? |
Replaced `answer_dir` with `model_answers` when loading reference answers. `answer_dir` is a path string, not a mapping of model names to answers. Using `model_answers` correctly retrieves the answers for reference models.
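A hedged sketch of the data shapes behind the fix, assumed from the review discussion rather than taken from the repo:

```python
# Assumed shapes: model_answers maps model name -> {question uid -> answer},
# while answer_dir is just a path string.
model_answers = {
    "ref-model": {
        "q-001": {"model": "ref-model",
                  "messages": [{"role": "assistant", "content": "42"}]},
    },
}
answer_dir = "data/arena-hard/model_answer"
ref_models = ["ref-model"]

# After the fix: dict lookup by model name works.
ref_answers = [model_answers[m] for m in ref_models]

# Before the fix, this indexed a string with a string key, which raises
# "TypeError: string indices must be integers":
# ref_answers = [answer_dir[m] for m in ref_models]
```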