
[Issue] Chapter 12.3: GAIA benchmark dataset format differs between old and new versions #347

@zzhRooT1998


1. Affected Chapter

Chapter 12.3

2. Issue Type

Code Error

3. Problem Description

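The latest gaia-benchmark dataset no longer contains a metadata.jsonl file; it ships a metadata.parquet file instead. The two formats can be converted into each other. I have already fixed this locally and can submit a PR if needed.
The latest dataset documentation also describes this change: https://huggingface.co/datasets/gaia-benchmark/GAIA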
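The fallback described above can be sketched roughly as follows. This is only an illustration, not the code in `gaia_evaluation_tool.py`: the helper names `find_metadata_file`, `load_metadata`, and `parquet_to_jsonl` are hypothetical, and it assumes that `metadata.parquet` sits in the same split directory where `metadata.jsonl` used to be. The parquet branches rely on `pandas` with a parquet engine such as `pyarrow` installed.

```python
import json
from pathlib import Path


def find_metadata_file(split_dir):
    """Return the first metadata file found in split_dir, or None.

    Prefers the older metadata.jsonl and falls back to metadata.parquet.
    (Assumption: both layouts keep the file directly in the split directory.)
    """
    for name in ("metadata.jsonl", "metadata.parquet"):
        candidate = Path(split_dir) / name
        if candidate.is_file():
            return candidate
    return None


def load_metadata(path):
    """Load metadata records as a list of dicts from a .jsonl or .parquet file."""
    path = Path(path)
    if path.suffix == ".jsonl":
        with path.open(encoding="utf-8") as fh:
            # One JSON object per non-empty line.
            return [json.loads(line) for line in fh if line.strip()]
    if path.suffix == ".parquet":
        # Imported lazily so the JSONL path works without pandas installed.
        import pandas as pd
        return pd.read_parquet(path).to_dict(orient="records")
    raise ValueError(f"Unsupported metadata format: {path.name}")


def parquet_to_jsonl(parquet_path, jsonl_path):
    """Convert metadata.parquet into the older metadata.jsonl layout."""
    import pandas as pd
    df = pd.read_parquet(parquet_path)
    df.to_json(jsonl_path, orient="records", lines=True, force_ascii=False)
```

With helpers like these, a loader could call `find_metadata_file` on the `2023/validation` directory and pass the result to `load_metadata`, so either dataset version yields the same list of records.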


4. Reproduction Materials

Code:
from hello_agents import SimpleAgent, HelloAgentsLLM
from hello_agents.tools import GAIAEvaluationTool

# Official GAIA system prompt (from the paper)

GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].

YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.

If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise.

If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise.

If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string."""

# 1. Create the agent (using the official GAIA system prompt)

llm = HelloAgentsLLM()
agent = SimpleAgent(
    name="TestAgent",
    llm=llm,
    system_prompt=GAIA_SYSTEM_PROMPT  # Key point: use the official GAIA prompt
)

# 2. Create the GAIA evaluation tool

gaia_tool = GAIAEvaluationTool()

# 3. Run the evaluation in one call

results = gaia_tool.run(
    agent=agent,
    level=1,               # Level 1: simple tasks
    max_samples=5,         # Evaluate 5 samples
    export_results=True,   # Export results in GAIA format
    generate_report=True   # Generate an evaluation report
)

# 4. Inspect the results

print(f"精确匹配率: {results['exact_match_rate']:.2%}")
print(f"部分匹配率: {results['partial_match_rate']:.2%}")
print(f"正确数: {results['exact_matches']}/{results['total_samples']}")

Log:

GAIA一键评估

配置:
智能体: TestAgent
难度级别: 1
样本数量: 5

============================================================
步骤1: 运行HelloAgents评估

正在从HuggingFace下载: gaia-benchmark/GAIA
📥 下载GAIA数据集...
Fetching 119 files: 100%|██████████| 119/119 [00:00<00:00, 1954.12it/s]
Traceback (most recent call last):
  File "D:\pyApp\evaluation.venv\Lib\site-packages\hello_agents\tools\builtin\gaia_evaluation_tool.py", line 115, in run
    results = self._run_evaluation(agent, level, max_samples, local_data_dir)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\pyApp\evaluation.venv\Lib\site-packages\hello_agents\tools\builtin\gaia_evaluation_tool.py", line 169, in _run_evaluation
    raise ValueError("数据集加载失败或为空")
ValueError: 数据集加载失败或为空
Traceback (most recent call last):
  File "D:\pyApp\evaluation\gaia_evaluate.py", line 36, in <module>
    print(f"精确匹配率: {results['exact_match_rate']:.2%}")
                    ~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'exact_match_rate'
✓ 数据集下载完成: D:\pyApp\evaluation\data\gaia
⚠️ 未找到metadata文件: D:\pyApp\evaluation\data\gaia\2023\validation\metadata.jsonl
✅ GAIA数据集加载完成
数据源: gaia-benchmark/GAIA
分割: validation
级别: 1
样本数: 0

❌ 评估失败: 数据集加载失败或为空
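The second traceback shows that when the dataset fails to load, the returned `results` has no `exact_match_rate` key, so the final `print` calls raise `KeyError`. A caller-side guard could look like the sketch below. The helper name `summarize_results` and the English output wording are hypothetical; only the four key names come from the reproduction script, and the shape of the failure result is an assumption.

```python
# Keys the reproduction script reads from the evaluation result.
REQUIRED_KEYS = (
    "exact_match_rate",
    "partial_match_rate",
    "exact_matches",
    "total_samples",
)


def summarize_results(results):
    """Format evaluation metrics, or explain which ones are missing.

    Assumption: a failed run returns a dict (or None) without the metric keys.
    """
    if not isinstance(results, dict):
        return "Evaluation returned no results."
    missing = [key for key in REQUIRED_KEYS if key not in results]
    if missing:
        return "Evaluation produced no metrics (missing: " + ", ".join(missing) + ")"
    return (
        f"Exact match: {results['exact_match_rate']:.2%}, "
        f"partial match: {results['partial_match_rate']:.2%}, "
        f"correct: {results['exact_matches']}/{results['total_samples']}"
    )
```

Calling `print(summarize_results(results))` in place of the three `print` lines would report the failed run without hiding the original `ValueError` behind a second exception.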

5. Additional Information

No response

Verification

  • I have read the relevant chapter documentation
  • I have searched existing Issues and confirmed this hasn't been reported
  • I have tried basic troubleshooting (restart, reinstall dependencies, etc.)
