AI agents were making headlines last week.
At Microsoft’s Build 2025, CEO Satya Nadella introduced the vision of an “open agentic web” and showcased an upgraded GitHub Copilot serving as a multi-agent teammate powered by Azure AI Foundry.
Google’s I/O 2025 quickly followed with an array of agentic AI innovations: the new Agent Mode in Gemini 2.5, the open beta of the coding assistant Jules, and native support for the Model Context Protocol, which enables smoother inter-agent collaboration.
OpenAI isn’t sitting still, either. They upgraded their Operator, the web-browsing agent, to the new o3 model, which brings more autonomy, reasoning, and contextual awareness to everyday tasks.
Across all the announcements, one keyword keeps popping up: GAIA. Everyone seems to be racing to report their GAIA scores, but do you actually know what it is?
If you are curious to learn more about what’s behind the GAIA scores, you are in the right place. In this blog, let’s unpack the GAIA Benchmark and discuss what it is, how it works, and why you should care about those numbers when choosing LLM agent tools.
1. Agentic AI Evaluation: From Problem to Solution
LLM agents are AI systems built around an LLM core that can autonomously perform tasks by combining natural language understanding with reasoning, planning, memory, and tool use.
Unlike a standard LLM, they are not just passive responders to prompts. Instead, they initiate actions, adapt to context, and collaborate with humans (or even with other agents) to solve complex tasks.
As these agents grow more capable, an important question naturally follows: How do we figure out how good they are?
We need standard benchmark evaluations.
For a while, the LLM community has relied on benchmarks that are great at testing specific LLM skills in isolation, e.g., knowledge recall on MMLU, arithmetic reasoning on GSM8K, snippet-level code generation on HumanEval, or single-turn language understanding on SuperGLUE.
These tests are certainly valuable. But here’s the catch: evaluating a full-fledged AI assistant is a totally different game.
An assistant needs to autonomously plan, decide, and act over multiple steps. These dynamic, real-world skills weren’t the main focus of those “older” evaluation paradigms.
This quickly highlighted a gap: we need a way to measure that all-around practical intelligence.
Enter GAIA.
2. GAIA Unpacked: What’s Under the Hood?
GAIA stands for the General AI Assistants benchmark [1]. This benchmark was introduced to specifically evaluate LLM agents on their ability to act as general-purpose AI assistants. It is the result of a collaborative effort by researchers from Meta-FAIR, Meta-GenAI, Hugging Face, and others associated with the AutoGPT initiative.
To better understand it, let’s break this benchmark down by looking at its structure, how it scores results, and what sets it apart from other benchmarks.
2.1 GAIA’s Structure
GAIA is fundamentally a question-driven benchmark: LLM agents are tasked with answering a set of curated questions. Doing so requires them to demonstrate a broad suite of abilities, including but not limited to:
- Logical reasoning
- Multi-modality understanding, e.g., interpreting images, data presented in non-textual formats, etc.
- Web browsing for retrieving information
- Use of various software tools, e.g., code interpreters, file manipulators, etc.
- Strategic planning
- Aggregation of information from disparate sources
Let’s take a look at one of the “hard” GAIA questions.
Which of the fruits shown in the 2008 painting Embroidery from Uzbekistan were served as part of the October 1949 breakfast menu for the ocean liner later used as a floating prop in the film The Last Voyage? Give the items as a comma-separated list, ordering them clockwise from the 12 o’clock position in the painting and using the plural form of each fruit.
Solving this question forces an agent to (1) perform image recognition to label the fruits in the painting, (2) research film trivia to learn the ship’s name, (3) retrieve and parse a 1949 historical menu, (4) intersect the two fruit lists, and (5) format the answer exactly as requested. This showcases multiple skill pillars in one go.
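To make this workflow concrete, here is a minimal sketch of how an agent loop might decompose the question into tool calls. Every function name below (recognize_fruits, web_search, fetch_menu) is a hypothetical placeholder for whatever tools a given agent framework provides; GAIA itself does not prescribe any toolset.

```python
# Hypothetical tool stubs; GAIA does not prescribe a particular toolset.
def recognize_fruits(image_path: str) -> list[str]:
    """Label the fruits in an image, clockwise from the 12 o'clock position."""
    raise NotImplementedError("plug in a real vision tool here")

def web_search(query: str) -> str:
    """Return a short textual answer from a web-search tool."""
    raise NotImplementedError("plug in a real search tool here")

def fetch_menu(ship: str, date: str) -> list[str]:
    """Retrieve and parse a historical menu into a list of items."""
    raise NotImplementedError("plug in a real document-retrieval tool here")

def solve_question() -> str:
    # (1) Label the fruits in the 2008 painting, preserving clockwise order.
    painting_fruits = recognize_fruits("embroidery_from_uzbekistan.jpg")
    # (2) Find the ocean liner used as a floating prop in "The Last Voyage".
    ship = web_search("ocean liner floating prop film The Last Voyage")
    # (3) Retrieve the October 1949 breakfast menu for that ship.
    menu_items = fetch_menu(ship, "October 1949")
    # (4) Intersect the two fruit lists, keeping the painting's clockwise order.
    served = [f for f in painting_fruits if f.lower() in {m.lower() for m in menu_items}]
    # (5) Format exactly as requested: a comma-separated list.
    return ", ".join(served)
```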
In total, the benchmark consists of 466 curated questions. They are divided into a public development/validation set and a private test set of 300 questions, the answers to which are withheld to power the official leaderboard. A unique characteristic of GAIA is that every question is designed to have an unambiguous, factual answer. This greatly simplifies the evaluation process and ensures consistency in scoring.
The GAIA questions are organized into three difficulty levels. The idea behind this design is to probe progressively more complex capabilities:
- Level 1: These tasks are intended to be solvable by very proficient LLMs. They typically require fewer than five steps to complete and only involve minimal tool usage.
- Level 2: These tasks demand more complex reasoning and the proper usage of multiple tools. The solution generally involves between five and ten steps.
- Level 3: These are the most challenging tasks in the benchmark. Successfully answering these questions requires long-term planning and the sophisticated integration of diverse tools.
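If you want to poke at the public portion yourself, the sketch below shows one way to load and inspect it. It assumes the dataset is hosted as gaia-benchmark/GAIA on the Hugging Face Hub (access is gated, so you need to accept the terms and authenticate) and that the config and column names (“2023_all”, “Question”, “Level”, “Final answer”) match the dataset card; treat those identifiers as assumptions to verify rather than guarantees.

```python
# Sketch: explore GAIA's public validation split with the `datasets` library.
# Repo id, config, and column names are assumptions; check the dataset card.
from collections import Counter
from datasets import load_dataset

validation = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

# How many questions sit at each difficulty level (1, 2, or 3)?
print(Counter(example["Level"] for example in validation))

# Peek at a single task: the question, its level, and the ground-truth answer.
sample = validation[0]
print(sample["Question"], sample["Level"], sample["Final answer"], sep="\n")
```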
Now that we understand what GAIA tests, let’s examine how it measures success.
2.2 GAIA’s Scoring
The performance of an LLM agent is primarily measured along two main dimensions: accuracy and cost.
Accuracy is undoubtedly the main metric for assessing performance. What’s special about GAIA is that accuracy is usually not reported as just an overall score across all questions; individual scores for each of the three difficulty levels are also reported, giving a clear breakdown of an agent’s capabilities when handling questions of varying complexity.
Cost is measured in USD and reflects the total API spend incurred by an agent to attempt all tasks in the evaluation set. This metric is highly valuable in practice because it captures the efficiency and cost-effectiveness of deploying the agent in the real world. A high-performing agent that incurs excessive costs would be impractical at scale; conversely, a cost-effective agent might be preferable in production even when it achieves slightly lower accuracy.
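As a simplified illustration (not the official GAIA scorer, which applies more careful answer normalization for numbers, lists, and strings), here is how these two dimensions might be aggregated from a run’s raw results:

```python
# Simplified GAIA-style aggregation: overall accuracy, per-level accuracy, total cost.
# Each result record is assumed to look like:
#   {"level": 1, "prediction": "42", "answer": "42", "cost_usd": 0.12}
from collections import defaultdict

def score_run(results: list[dict]) -> dict:
    per_level = defaultdict(lambda: {"correct": 0, "total": 0})
    total_correct, total_cost = 0, 0.0

    for r in results:
        # Quasi-exact match: case- and whitespace-insensitive comparison.
        correct = r["prediction"].strip().lower() == r["answer"].strip().lower()
        per_level[r["level"]]["total"] += 1
        per_level[r["level"]]["correct"] += int(correct)
        total_correct += int(correct)
        total_cost += r["cost_usd"]

    return {
        "overall_accuracy": total_correct / len(results),
        "per_level_accuracy": {
            lvl: s["correct"] / s["total"] for lvl, s in sorted(per_level.items())
        },
        "total_cost_usd": round(total_cost, 2),
    }
```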
To give you a clearer sense of what accuracy actually looks like in practice, consider the following reference points:
- Humans achieve around 92% accuracy on GAIA tasks.
- As a comparison, early LLM agents (powered by GPT-4 with plugin support) started with scores around 15%.
- More recent top-performing agents, e.g., h2oGPTe from H2O.ai (powered by Claude-3.7-sonnet), have delivered ~74% overall score, with level 1/2/3 scores being 86%, 74.8%, and 53%, respectively.
These numbers show how much agents have improved, but also how challenging GAIA remains, even for the top LLM agent systems.
But what makes GAIA’s difficulty so meaningful for evaluating real-world agent capabilities?
2.3 GAIA’s Guiding Principles
What makes GAIA stand out isn’t just that it’s difficult; it’s that the difficulty is carefully designed to test the kinds of skills that agents need in practical, real-world scenarios. Behind this design are a few important principles:
- Real-world difficulty: GAIA tasks are intentionally challenging. They usually require multi-step reasoning, cross-modal understanding, and the use of tools or APIs. Those requirements closely mirror the kinds of tasks agents would face in real applications.
- Human interpretability: Even though these tasks can be challenging for LLM agents, they remain intuitively understandable for humans. This makes it easier for researchers and practitioners to analyze errors and trace agent behavior.
- Non-gameability: Getting the right answer means the agent has to fully solve the task, not just guess or use pattern-matching. GAIA also discourages overfitting by requiring reasoning traces and avoiding questions with easily searchable answers.
- Simplicity of evaluation: Answers to GAIA questions are designed to be concise, factual, and unambiguous. This allows for automated (and objective) scoring, thus making large-scale comparisons more reliable and reproducible.
With a clearer understanding of GAIA under the hood, the next question is: how should we interpret these scores when we see them in research papers, product announcements, or vendor comparisons?
3. Putting GAIA Scores to Work
Not all GAIA scores are created equal, and headline numbers should be taken with a pinch of salt. Here are four key things to keep in mind:
- Prioritize private test set results. When looking at GAIA scores, always remember to check how the scores are calculated. Is it based on the public validation set or the private test set? The questions and answers for the validation set are widely available online. So it is highly likely that the models might have “memorized” them during their training rather than deriving solutions from genuine reasoning. The private test set is the “real exam”, while the public set is more of an “open book exam.”
- Look beyond overall accuracy, dig into difficulty levels. While the overall accuracy score gives a general idea, it is often better to take a deeper look at how exactly the agent performs for different difficulty levels. Pay particular attention to Level 3 tasks, because strong performance there signals significant advancements in an agent’s capabilities for long-term planning and sophisticated tool usage and integration.
- Seek cost-effective solutions. Always aim to identify agents that offer the best performance for a given cost. We’re seeing significant progress here. For example, the recent Knowledge Graph of Thoughts (KGoT) architecture [2] can solve up to 57 tasks from the GAIA validation set (165 total tasks) at approximately $5 total cost with GPT-4o mini, compared to earlier versions of Hugging Face Agents, which solve around 29 tasks at $187 using GPT-4o (see the quick cost-per-task calculation after this list).
- Be aware of potential dataset imperfections. About 5% of the GAIA data (across both validation and test sets) contains errors or ambiguities in the ground-truth answers. While this makes evaluation tricky, there’s a silver lining: testing LLM agents on questions with imperfect answers can reveal which agents truly reason versus merely regurgitate their training data.
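To make the cost-effectiveness point concrete, here is the back-of-the-envelope arithmetic behind the KGoT example above, using only the figures quoted in this section:

```python
# Cost per solved GAIA validation task, based on the figures quoted above.
def cost_per_solved_task(total_cost_usd: float, tasks_solved: int) -> float:
    return total_cost_usd / tasks_solved

kgot = cost_per_solved_task(5.0, 57)         # ~$0.09 per solved task (GPT-4o mini)
hf_agents = cost_per_solved_task(187.0, 29)  # ~$6.45 per solved task (GPT-4o)

print(f"KGoT: ${kgot:.2f}/task, HF Agents: ${hf_agents:.2f}/task, "
      f"roughly {hf_agents / kgot:.0f}x cheaper per solved task for KGoT")
```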
4. Conclusion
In this post, we’ve unpacked GAIA, an agent evaluation benchmark that has quickly become the go-to option in the field. The main points to remember:
- GAIA is a reality check for AI assistants. It’s specifically designed to test a sophisticated suite of abilities of LLM agents as AI assistants. These skills include complex reasoning, handling different types of information, web browsing, and using various tools effectively.
- Look beyond the headline numbers. Check the test set source, difficulty breakdowns, and cost-effectiveness.
GAIA represents a significant step toward evaluating LLM agents the way we actually want to use them: as autonomous assistants that can handle the messy, multi-faceted challenges of the real world.
New evaluation frameworks may well emerge, but GAIA’s core principles (real-world relevance, human interpretability, and resistance to gaming) will likely stay central to how we measure AI agents.
References
[1] Mialon et al., GAIA: a benchmark for General AI Assistants, 2023, arXiv.
[2] Besta et al., Affordable AI Assistants with Knowledge Graph of Thoughts, 2025, arXiv.