AAAI 2026  ·  Oral Presentation (top ~4.5% of accepted papers)
23,680 submissions (record high)  ·  17.6% acceptance rate  ·  ~188 oral papers total

Efficient Hallucination Detection: Adaptive Bayesian Estimation
of Semantic Entropy with Guided Semantic Exploration

Qiyao Sun¹, Xingming Li¹, Xixiang He¹, Ao Cheng¹, Xuanyu Ji¹, Hailun Lu², Runke Huang³, Qingyong Hu²†

† Corresponding author

¹ National University of Defense Technology  ·  ² Intelligent Game and Decision Lab, Beijing  ·  ³ The Chinese University of Hong Kong, Shenzhen

Abstract

Large language models (LLMs) have achieved remarkable success in various natural language processing tasks, yet they remain prone to generating factually incorrect outputs—known as "hallucinations". While recent approaches have shown promise for hallucination detection by repeatedly sampling from LLMs and quantifying the semantic inconsistency among the generated responses, they rely on fixed sampling budgets that fail to adapt to query complexity, resulting in computational inefficiency.

We propose an Adaptive Bayesian Estimation framework for Semantic Entropy with Guided Semantic Exploration, which dynamically adjusts sampling requirements based on observed uncertainty. Our approach employs a hierarchical Bayesian framework to model the semantic distribution, enabling dynamic control of sampling iterations through variance-based thresholds that terminate generation once sufficient certainty is achieved. We also develop a perturbation-based importance sampling strategy to systematically explore the semantic space.

Extensive experiments on four QA datasets demonstrate that our method achieves superior hallucination detection performance with significant efficiency gains. In low-budget scenarios, our approach requires about 50% fewer samples to match the detection performance of existing methods, while delivering an average AUROC improvement of 12.6% under the same sampling budget.
Teaser: comparison of fixed sampling vs. adaptive Bayesian approach

Figure 1. Comparison of fixed sampling (a) versus our adaptive Bayesian approach (b). Fixed sampling wastes computational resources on simple queries while failing to discover semantic diversity in complex cases. Our method dynamically adjusts sampling based on variance thresholds, enabling efficient and accurate hallucination detection.

Algorithm Highlights

Key results on hallucination detection benchmarks

~50%
Fewer LLM Queries
Low-budget hallucination detection with the same accuracy
12.6%
AUROC Gain (N=2)
Average improvement over the strongest baseline
23/24
State-of-the-Art
Best AUROC across all model × dataset settings
4
QA Benchmarks
CoQA · TriviaQA · TruthfulQA · SimpleQA

Method Overview

Adaptive Bayesian Estimation with Guided Semantic Exploration

Adaptive Sampling Loop: given an input query, the LLM first generates N₀ initial responses. A Bayesian update then yields the posterior over the number of meanings K and the variance Var[h|D]. If Var[h|D] ≥ γ, guided exploration perturbs semantically critical tokens and another response is sampled; once Var[h|D] < γ, the loop terminates and the final semantic entropy estimate is returned.
1. Hierarchical Bayesian Framework

Models the semantic distribution via a Dirichlet prior over semantic categories. Marginalizes over the unknown number of meanings K to compute E[h|D] and Var[h|D] with tighter posterior bounds through generation probability constraints.
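As a minimal sketch of the Dirichlet-based posterior (conditioning on a fixed number of categories rather than marginalizing over K as the paper does), the expected entropy E[h|D] and its variance Var[h|D] can be estimated by Monte Carlo over posterior draws:

```python
import numpy as np

def entropy_posterior(counts, alpha=1.0, n_draws=5000, rng=None):
    """Monte Carlo estimate of E[h|D] and Var[h|D] under a
    symmetric Dirichlet(alpha) prior over semantic-cluster probabilities.

    counts: observed responses per semantic cluster, e.g. [3, 1].
    Note: the paper additionally marginalizes over the unknown number
    of meanings K and tightens the posterior with generation-probability
    constraints; this sketch fixes K = len(counts).
    """
    rng = rng or np.random.default_rng(0)
    counts = np.asarray(counts, dtype=float)
    # Dirichlet-multinomial conjugacy: posterior is Dirichlet(alpha + counts)
    draws = rng.dirichlet(alpha + counts, size=n_draws)
    # Entropy of each sampled category distribution (0 log 0 := 0)
    h = -np.sum(np.where(draws > 0, draws * np.log(draws), 0.0), axis=1)
    return h.mean(), h.var()

# Agreement (all 4 answers share one meaning) vs. a 2/2 semantic split
e_agree, v_agree = entropy_posterior([4, 0])
e_split, v_split = entropy_posterior([2, 2])
```

Intuitively, concentrated counts push the posterior toward low-entropy distributions (agreement, likely factual), while a split pushes it toward high entropy (disagreement, likely hallucination).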

2. Variance-Based Adaptive Sampling

Dynamically terminates sampling once the posterior variance falls below the threshold γ. Simple queries converge quickly, while complex queries receive additional exploration, yielding ≈50% fewer samples in low-budget settings.
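The variance-gated loop can be sketched as follows. The `generate` and `cluster` callables are hypothetical placeholders for the LLM sampler and the semantic-equivalence clustering step; the stopping rule itself is the point:

```python
import numpy as np

def adaptive_sample(generate, cluster, gamma=0.01, n0=2, n_max=10,
                    alpha=1.0, n_draws=4000, rng=None):
    """Sample until Var[h|D] < gamma or the budget n_max is exhausted.

    generate():          draws one LLM response (hypothetical helper).
    cluster(responses):  maps responses to semantic-cluster labels.
    Returns (estimated semantic entropy, number of samples used).
    """
    rng = rng or np.random.default_rng(0)
    responses = [generate() for _ in range(n0)]
    while True:
        labels = cluster(responses)
        counts = np.bincount(labels).astype(float)
        # Posterior over cluster probabilities: Dirichlet(alpha + counts)
        draws = rng.dirichlet(alpha + counts, size=n_draws)
        h = -np.sum(np.where(draws > 0, draws * np.log(draws), 0.0), axis=1)
        if h.var() < gamma or len(responses) >= n_max:
            return h.mean(), len(responses)
        responses.append(generate())  # spend one more sample and re-check
```

A query whose responses all fall in one cluster stops at the initial N₀ samples, while disagreeing responses keep the variance above γ and trigger further sampling.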

3. Guided Semantic Exploration

Identifies semantically critical tokens via importance weights and systematically perturbs them to discover diverse interpretations. Importance sampling maintains unbiased estimates while accelerating variance convergence.

Performance vs. Sampling Budget

AUROC as a function of sampling budget N — Llama-3.1-8B

Our method (solid blue) consistently outperforms baselines at every sampling budget N=1–10. At N=2, we achieve an average 12.6% higher AUROC than the strongest baseline.

Benchmark Results

AUROC on four QA datasets across three LLMs (higher is better)

Table columns: Dataset · P(True) · SAR · SE · SE-SDLG · Ours (↑)

Bold values indicate state-of-the-art. Our method achieves the highest AUROC in 23 out of 24 experimental settings.

Authors

Qiyao Sun · National University of Defense Technology
Xingming Li · National University of Defense Technology
Xixiang He · National University of Defense Technology
Ao Cheng · National University of Defense Technology
Xuanyu Ji · National University of Defense Technology
Hailun Lu · Intelligent Game and Decision Lab, Beijing
Runke Huang · The Chinese University of Hong Kong, Shenzhen
Qingyong Hu · Intelligent Game and Decision Lab, Beijing (Corresponding author)

BibTeX

@inproceedings{sun2026efficienthallucination,
  title     = {Efficient Hallucination Detection: Adaptive {Bayesian} Estimation of Semantic Entropy with Guided Semantic Exploration},
  author    = {Sun, Qiyao and Li, Xingming and He, Xixiang and Cheng, Ao and Ji, Xuanyu and Lu, Hailun and Huang, Runke and Hu, Qingyong},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
  note      = {Oral Presentation}
}

Get in Touch

For questions about the paper, method, or potential collaborations, please contact the corresponding author.

Qingyong Hu

Corresponding Author  ·  Intelligent Game and Decision Lab, Beijing

huqingyong15@outlook.com

We welcome questions about the adaptive Bayesian sampling framework, the guided semantic exploration strategy, and potential collaboration opportunities in LLM hallucination detection and uncertainty quantification. For code/implementation questions, you may also reach out to the first authors at sunqiyao18@nudt.edu.cn or lixingming@nudt.edu.cn.
