Description:
I observed a significant flaw in the current evaluation protocol when testing Deep Research Agents (DRAs) on the HLE dataset. Because the official code does not restrict the agent from accessing the benchmark's own data, or unencrypted mirrors of it, during the search process, the model can (and in practice often does) perform "Search-Time Data Contamination."
How it happens:
Direct Search: When the agent processes a difficult HLE question, its internal planning often triggers a search for the exact problem statement.
Finding Mirrors: The agent can quickly locate unencrypted JSON/CSV files on third-party HuggingFace repositories, GitHub forks, or discussion forums (e.g., Reddit, Twitter, or academic mirrors).
Cheat Extraction: The model extracts the ground truth label from these files rather than solving the problem via reasoning.
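The failure mode above is mechanically detectable after the fact: if any page fetched during a rollout contains a long verbatim span of the question, the run is contaminated. A minimal sketch of such a check, assuming the harness logs fetched page text (the function name, threshold, and trace format are hypothetical, not from the HLE codebase):

```python
# Hypothetical sketch: flag a rollout as contaminated if any fetched page
# contains a long verbatim substring of the benchmark question.
# Names (flag_contaminated, MIN_OVERLAP) are illustrative, not from HLE.
import re

MIN_OVERLAP = 50  # chars of verbatim overlap that count as a "direct hit"

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so reformatting doesn't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def flag_contaminated(question: str, fetched_pages: list[str]) -> bool:
    q = normalize(question)
    for page in fetched_pages:
        p = normalize(page)
        # Slide a window over the question; any long shared span is suspicious.
        for i in range(max(1, len(q) - MIN_OVERLAP + 1)):
            if q[i:i + MIN_OVERLAP] in p:
                return True
    return False
```

This only catches exact-string mirrors; paraphrased dumps would need fuzzier matching (e.g., n-gram overlap).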
Evidence/Examples:
Recent audits (e.g., "Search-Time Data Contamination", arXiv:2508.13180) show that deep-research agents such as Perplexity's and OpenAI's Deep Research can find the ground truth for roughly 3% of HLE questions simply by navigating to ungated third-party uploads.
Impact:
This invalidates the core purpose of HLE. If the dataset isn't encrypted or the search space isn't sandboxed, we aren't measuring Reasoning Capability; we are measuring Retrieval Efficiency. Any Deep Research benchmark that remains in plaintext on the web is "dead on arrival" for tool-use agents.
Suggested Mitigations:
Input Obfuscation: Encrypt or hash the questions/labels in the public repository so that search engines cannot index the plaintext and an exact-string query for a question never surfaces its answer.
Search Sandboxing: In the evaluation code, implement a domain blacklist (e.g., block github.com, huggingface.co, lastexam.ai) during the testing phase.
Dynamic Evaluation: Use versions of questions where the constants or logic are slightly altered (dynamic noise) to prevent exact-string match retrieval.
Process-Based Verification: Evaluate the agent's internal reasoning steps rather than just the final answer to detect if it "jumped" to a conclusion found online.
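For Input Obfuscation, one option is to publish only a salted hash of each label and grade by hashing the model's canonicalized answer, so a plaintext mirror of the repository never contains the ground truth. A sketch under assumed conventions (the salt, canonicalization, and function names are illustrative, not HLE's actual scheme):

```python
# Hypothetical sketch: grade against a salted SHA-256 of the label instead of
# shipping plaintext answers. Salt and names are illustrative, not HLE's.
import hashlib

SALT = b"hle-public-release-v1"  # assumed per-release salt, shipped with the data

def canonicalize(answer: str) -> str:
    """Normalize an answer so formatting differences don't break hash equality."""
    return " ".join(answer.strip().lower().split())

def hash_label(answer: str) -> str:
    return hashlib.sha256(SALT + canonicalize(answer).encode()).hexdigest()

def grade(model_answer: str, stored_hash: str) -> bool:
    return hash_label(model_answer) == stored_hash

# The public repo would store only hash_label(gold), never gold itself.
```

The obvious limitation: hash grading only works for exact-match answers; free-form responses still need a judge holding the plaintext server-side.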
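For Search Sandboxing, a minimal version is a hostname blocklist applied to every URL the agent's browsing tool is about to fetch; matching on registrable domains rather than exact hosts also catches subdomain mirrors. A sketch (the blocked domains come from this post; the wrapper names are hypothetical):

```python
# Hypothetical sketch: refuse to fetch URLs whose host falls under a blocked
# domain. Catches subdomains (e.g. datasets.huggingface.co) as well.
from urllib.parse import urlsplit

BLOCKED_DOMAINS = {"github.com", "huggingface.co", "lastexam.ai"}

def is_blocked(url: str) -> bool:
    host = (urlsplit(url).hostname or "").lower()
    # Match the host itself or any subdomain of a blocked entry.
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

def sandboxed_fetch(url: str, fetch):
    """Wrap the agent's real fetch tool; raise instead of touching a mirror."""
    if is_blocked(url):
        raise PermissionError(f"blocked during evaluation: {url}")
    return fetch(url)
```

A blocklist is inherently reactive (new mirrors appear constantly); an allowlist or fully offline corpus is the stricter variant of the same idea.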
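For Dynamic Evaluation, questions whose answers are computable can be re-templated with fresh constants each run, with the gold answer recomputed from a per-question solver, so an exact-string web hit returns a stale value. A sketch, assuming such a templated item format (the dataclass, field names, and example question are illustrative):

```python
# Hypothetical sketch: re-template a question with fresh constants per run so
# retrieved mirror answers no longer match. Names are illustrative.
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class DynamicItem:
    template: str                              # question text with {n}-style slots
    sample: Callable[[random.Random], dict]    # draws fresh constants
    solve: Callable[[dict], str]               # recomputes the gold answer

def instantiate(item: DynamicItem, seed: int) -> tuple[str, str]:
    rng = random.Random(seed)
    consts = item.sample(rng)
    return item.template.format(**consts), item.solve(consts)

# Example item: trailing zeros of n! (illustrative, not an HLE question).
def zeros(n: int) -> int:
    total, p = 0, 5
    while p <= n:
        total += n // p
        p *= 5
    return total

item = DynamicItem(
    template="How many trailing zeros does {n}! have?",
    sample=lambda rng: {"n": rng.randint(50, 500)},
    solve=lambda c: str(zeros(c["n"])),
)
```

This only covers questions with programmable answer functions; for purely verbal HLE items, surface-level paraphrasing is the closest analogue.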
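For Process-Based Verification, one cheap signal is where the gold answer first appears in the trace: if its first occurrence is inside a tool observation rather than the agent's own reasoning, it was likely copied, not derived. A sketch over an assumed trace format (the role labels and function name are hypothetical; real traces need format-aware parsing):

```python
# Hypothetical sketch: flag runs where the gold answer first surfaces inside a
# retrieved observation rather than in the agent's own reasoning.
def answer_was_retrieved(trace: list[dict], gold: str) -> bool:
    """trace: ordered steps shaped like {"role": "tool"|"thought", "content": str}."""
    g = gold.strip().lower()
    for step in trace:
        if g in step["content"].lower():
            # First appearance in a tool observation => likely copied, not derived.
            return step["role"] == "tool"
    return False
```

Substring matching will misfire on short or common answers (e.g. "2"); a production version would anchor on token boundaries and step timestamps.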
Question to the community:
How is the official leaderboard currently handling this? Is there a "closed-net" environment for these evaluations that I missed?