• About
  • Privacy Policy
  • Disclaimer
  • Contact
Soft Bliss Academy
No Result
View All Result
  • Home
  • Artificial Intelligence
  • Software Development
  • Machine Learning
  • Research & Academia
  • Startups
  • Home
  • Artificial Intelligence
  • Software Development
  • Machine Learning
  • Research & Academia
  • Startups
Soft Bliss Academy
No Result
View All Result
Home Artificial Intelligence

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

softbliss by softbliss
April 8, 2025
in Artificial Intelligence
0
FACTS Grounding: A new benchmark for evaluating the factuality of large language models
0
SHARES
1
VIEWS
Share on FacebookShare on Twitter


Responsibility & Safety

Published
17 December 2024
Authors

FACTS team


Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations

Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can “hallucinate” false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their applications in the real world.

Today, we’re introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.

We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we’re also launching the FACTS leaderboard on Kaggle. We’ve already tested leading LLMs using FACTS Grounding and have populated the initial leaderboard with their grounding scores. We will maintain and update the leaderboard as the field advances.

Current leaderboard ranking

FACTS Grounding dataset

To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset comprises 1,719 examples, each carefully crafted to require long-form responses grounded in the context document provided. Each example comprises a document, a system instruction requiring the LLM to exclusively reference the provided document, and an accompanying user request.

An example from the FACTS Grounding dataset

All examples are divided into a “public” set (860) and a “private” (859) held out set. We are releasing the public set today so anyone can use it to evaluate an LLM. Of course, we know that issues of benchmark contamination and leaderboard hacking are important to protect against, so following standard industry practice, we are keeping the private evaluation set held out. The FACTS leaderboard scores are the average performance across both public and private sets.

To ensure a diversity of inputs, the FACTS Grounding examples include documents with a variety of lengths, up to a maximum of 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide ranging, including requests for summarization, Q&A generation, and rewriting tasks. We did not include any examples that could require creativity, mathematics, or complex reasoning – capabilities which might require the model to apply more advanced reasoning in addition to grounding.

Collective judgement by leading LLMs

To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that is both a comprehensive answer to the user request and fully attributable to that document.

FACTS Grounding evaluates model responses automatically using three frontier LLM judges — namely Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We selected a combination of different judges to mitigate any potential bias of a judge giving higher scores to the responses produced by a member of its own model family. The automatic judge models were comprehensively evaluated against a held-out test set to find the best performing judging prompt templates and to verify agreement with human raters.

Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility, and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged as factually accurate if they are fully grounded in information contained in the provided document, with no hallucinations.

With the eligibility and grounding accuracy of a given LLM response evaluated separately by multiple AI judge models, the results are then aggregated to determine if the LLM has dealt with the example successfully. The final score for the overall grounding task is the average of all judge models’ scores across all examples. Find more details of our FACTS Grounding evaluation methodology in our paper.

A factually correct response that fails to properly address the user’s request fails the benchmarking example. Here we see three instances of model responses that the automated LLM judges considered ineligible

FACTS Grounding will continue to evolve

We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we aim to grow and iterate FACTS Grounding as the field progresses, continually raising the bar.

We encourage the AI community to engage with FACTS Grounding, evaluate their models on the open set of examples or to submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continuous research and development will continue to improve AI systems.

Acknowledgements

FACTS is a collaboration between Google DeepMind and Google Research.
FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating.

We are also very grateful for contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, and Sasha Goldshtein.

We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.

Tags: BenchmarkEvaluatingFACTSfactualityGroundingLanguageLargeModels
Previous Post

AI Integration Shifts from “If” to “How” in Education

Next Post

How 5G Technology Changes Mobile App Development: Impact and Outcomes

softbliss

softbliss

Related Posts

Soham Mazumdar, Co-Founder & CEO of WisdomAI – Interview Series
Artificial Intelligence

Soham Mazumdar, Co-Founder & CEO of WisdomAI – Interview Series

by softbliss
June 6, 2025
Gemini 2.5’s native audio capabilities
Artificial Intelligence

Gemini 2.5’s native audio capabilities

by softbliss
June 5, 2025
AI stirs up the recipe for concrete in MIT study | MIT News
Artificial Intelligence

AI stirs up the recipe for concrete in MIT study | MIT News

by softbliss
June 5, 2025
Artificial Intelligence

Mistral AI Introduces Mistral Code: A Customizable AI Coding Assistant for Enterprise Workflows

by softbliss
June 4, 2025
Hierarchical Coordination in Multi-Agent Tasks
Artificial Intelligence

Hierarchical Coordination in Multi-Agent Tasks

by softbliss
June 4, 2025
Next Post
How 5G Technology Changes Mobile App Development: Impact and Outcomes

How 5G Technology Changes Mobile App Development: Impact and Outcomes

Premium Content

These Principal Foundation Money Myths Activities are Fun

These Principal Foundation Money Myths Activities are Fun

April 1, 2025
Florida A&M Hires DeSantis Ally as President

Florida A&M Hires DeSantis Ally as President

May 17, 2025

Gemini Robotics brings AI into the physical world

March 24, 2025

Browse by Category

  • Artificial Intelligence
  • Machine Learning
  • Research & Academia
  • Software Development
  • Startups

Browse by Tags

Amazon API App Artificial Blog Build Building Business Data Development Digital Framework Future Gemini Generative Google Guide Impact Innovation Intelligence Interview Key Language Large Learning LLM LLMs Machine Microsoft MIT model Models News NVIDIA opinion OReilly Research Science Series Startup Startups students Tech Tools Video

Soft Bliss Academy

Welcome to SoftBliss Academy, your go-to source for the latest news, insights, and resources on Artificial Intelligence (AI), Software Development, Machine Learning, Startups, and Research & Academia. We are passionate about exploring the ever-evolving world of technology and providing valuable content for developers, AI enthusiasts, entrepreneurs, and anyone interested in the future of innovation.

Categories

  • Artificial Intelligence
  • Machine Learning
  • Research & Academia
  • Software Development
  • Startups

Recent Posts

  • HealthKois launches $300M fund to drive scalable healthcare innovation in India
  • Developing an Inquiry- and Asset-Based Approach to Grammar
  • On-Demand Alcohol Delivery App Development: Complete Guide

© 2025 https://softblissacademy.online/- All Rights Reserved

No Result
View All Result
  • Home
  • Artificial Intelligence
  • Software Development
  • Machine Learning
  • Research & Academia
  • Startups

© 2025 https://softblissacademy.online/- All Rights Reserved

Are you sure want to unlock this post?
Unlock left : 0
Are you sure want to cancel subscription?