Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs
Summary
arXiv:2606.02628v1 Announce Type: new Abstract: We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three $7$B--$8$B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in $4$-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves $0.904$--$1.000$ AUROC on held-out splits, while sampling-based detectors do not exceed $0.541$ AUROC under the same protocol.
Why It Matters
This Industrial AI development deepens the link between AI compute and industrial productivity. For Asia, it is a signal worth tracking: it shapes who supplies, who scales, and who sets the standard over the next five years.
Key Facts
- SectorIndustrial AI
- Market—
- ImpactMedium (50/100)
- SignalFunding Research