- Updated: January 31, 2026
FFE-Hallu: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian Language

Direct Answer
The paper introduces FFE‑Hallu, a benchmark specifically designed to surface and measure hallucinations that occur when large language models (LLMs) handle fixed figurative expressions such as idioms and proverbs. By focusing on Persian idiomatic language, the authors reveal systematic failure modes that current multilingual models struggle with, highlighting a gap that directly impacts real‑world applications ranging from translation services to conversational agents.
Background: Why This Problem Is Hard
Figurative language—idioms, proverbs, and set phrases—encodes cultural knowledge that rarely follows compositional semantics. For example, the Persian proverb “آب در کوزه و ما تشنه می‌مانیم” (literally “water is in the jug, yet we remain thirsty”) is said of someone who overlooks a resource close at hand while searching for it elsewhere—not a literal description of water. When LLMs generate text, they rely on statistical patterns learned from massive corpora. This approach works well for literal sentences but often collapses for fixed expressions that require:
- Cultural grounding: Understanding the historical or societal context that gives an idiom its meaning.
- Non‑compositional semantics: Recognizing that the whole phrase’s meaning cannot be derived from its parts.
- Cross‑lingual alignment: Mapping an idiom in one language to an appropriate equivalent in another, rather than a word‑by‑word translation.
Existing evaluation suites—such as GLUE, SuperGLUE, or even multilingual benchmarks like XGLUE—focus on syntactic or factual correctness but rarely stress these cultural‑semantic nuances. Consequently, models can appear high‑performing while silently hallucinating or mis‑rendering idioms, leading to mistranslations, confusing chatbot replies, and degraded user trust.
What the Researchers Propose
The authors propose a three‑pronged framework:
- Dataset Construction: A curated collection of Persian fixed figurative expressions (idioms, proverbs, sayings) paired with three task formats—generation, detection, and translation.
- Hallucination Scoring: A set of metrics that quantify how often a model produces a literal, unrelated, or semantically distorted output when prompted with a figurative expression.
- Benchmark Suite (FFE‑Hallu): An open‑source evaluation harness that runs the three tasks across any multilingual LLM, reporting both raw performance and hallucination propensity.
Key components include:
- Expression Bank: Over 2,000 Persian idioms annotated with literal meanings, cultural notes, and recommended target‑language equivalents.
- Prompt Templates: Carefully designed prompts that elicit generation, ask the model to flag whether a sentence contains a figurative expression, or request a translation.
- Evaluation Scripts: Automated pipelines that compare model outputs against gold references and compute hallucination rates using lexical overlap, semantic similarity, and human‑in‑the‑loop verification.
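The scoring components above can be sketched as a small Python routine. This is an illustrative reconstruction, not the released harness: `lexical_overlap` stands in for the surface-form check, and `semantic_sim` is a pluggable callable (in practice a multilingual sentence-encoder wrapper) returning a 0–1 similarity; the `sim_threshold` value is an assumption.

```python
from difflib import SequenceMatcher


def lexical_overlap(output: str, reference: str) -> float:
    """Character-level overlap ratio between model output and a gold reference."""
    return SequenceMatcher(None, output, reference).ratio()


def score_response(output: str, idiom: str, gold_meaning: str,
                   semantic_sim, sim_threshold: float = 0.7) -> dict:
    """Combine the checks described above into one verdict.

    A response that reproduces the idiom's surface form but scores low on
    semantic similarity to the gold figurative meaning is treated as a
    hallucination (e.g., a literalized rendering).
    """
    surface_ok = idiom in output                        # surface-form check
    meaning_score = semantic_sim(output, gold_meaning)  # figurative fidelity
    hallucinated = surface_ok and meaning_score < sim_threshold
    return {
        "surface_ok": surface_ok,
        "meaning_score": meaning_score,
        "hallucinated": hallucinated,
    }
```

A real deployment would swap `lexical_overlap` for an encoder-based similarity and add the rule-based and human-in-the-loop checks the paper describes.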
How It Works in Practice
When an LLM is evaluated with FFE‑Hallu, the workflow proceeds as follows:
- Input Selection: The harness selects an idiom from the Expression Bank, e.g., “دست به کار شدن” (to get to work).
- Prompt Generation: Depending on the task, the system builds a prompt. For generation, it might ask, “Write a short story that includes the idiom ‘دست به کار شدن’.” For detection, it presents a sentence and asks, “Does this sentence contain a Persian idiom?” For translation, it asks, “Translate the idiom ‘دست به کار شدن’ into English preserving its figurative meaning.”
- Model Invocation: The LLM processes the prompt and returns a response.
- Automatic Scoring: The response is first checked for surface form correctness (e.g., does it contain the exact idiom?). Then a semantic similarity model (often a multilingual sentence‑encoder) assesses whether the intended figurative meaning is retained. Finally, a rule‑based hallucination detector flags literal or unrelated content.
- Human Validation (Optional): For borderline cases, human annotators verify whether the model’s output truly respects the idiom’s meaning.
- Report Generation: The suite aggregates scores across the three tasks, producing a concise dashboard that highlights both competence and hallucination risk.
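The workflow above can be sketched as a single evaluation loop. The prompt wording follows the examples quoted earlier, but the template strings, function names, and scoring interface here are illustrative assumptions, not the published harness code.

```python
from typing import Callable

# Hypothetical prompt templates for the three task formats
# (wording adapted from the examples in the text; the released
# harness may phrase them differently).
PROMPTS = {
    "generation": "Write a short story that includes the idiom '{idiom}'.",
    "detection": "Does this sentence contain a Persian idiom? '{sentence}'",
    "translation": ("Translate the idiom '{idiom}' into English, "
                    "preserving its figurative meaning."),
}


def evaluate_idiom(idiom: str, sentence: str,
                   model: Callable[[str], str],
                   score: Callable[[str, str], float]) -> dict:
    """Run one idiom through all three tasks: build the prompt,
    invoke the model, and score each response."""
    report = {}
    for task, template in PROMPTS.items():
        prompt = template.format(idiom=idiom, sentence=sentence)
        response = model(prompt)  # model invocation step
        report[task] = {"response": response, "score": score(task, response)}
    return report
```

Aggregating these per-idiom reports over the whole Expression Bank yields the dashboard of competence and hallucination risk the suite produces.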
This pipeline differs from prior benchmarks by explicitly separating “correctness” from “figurative fidelity.” A model could achieve a high BLEU score on a translation task yet still hallucinate the idiom’s meaning; FFE‑Hallu surfaces that discrepancy.
Evaluation & Results
The authors evaluated five prominent multilingual LLMs, including:
- GPT‑3.5‑Turbo (OpenAI)
- LLaMA‑2‑13B (Meta)
- Mistral‑7B (Mistral AI)
- Google‑Gemini‑Pro (Google)
- Claude‑2 (Anthropic)
Each model was tested on 500 idioms across the three tasks. The key findings were:
| Model | Generation Accuracy | Detection F1 | Translation Fidelity | Hallucination Rate |
|---|---|---|---|---|
| GPT‑3.5‑Turbo | 78 % | 84 % | 71 % | 22 % |
| LLaMA‑2‑13B | 65 % | 70 % | 58 % | 34 % |
| Mistral‑7B | 62 % | 68 % | 55 % | 38 % |
| Google‑Gemini‑Pro | 81 % | 86 % | 74 % | 18 % |
| Claude‑2 | 79 % | 85 % | 73 % | 20 % |
Even the top‑performing models (Gemini‑Pro, Claude‑2) hallucinated roughly one‑fifth of the time on at least one task. The most common error pattern was “literalization”: the model reproduced the idiom’s words but interpreted them literally, e.g., translating “دست به کار شدن” as “hand to work” instead of “to get to work.” Notably, strong detection scores did not prevent these errors: models frequently failed to recognize hallucinations in their own outputs.
Human annotators confirmed that automatic hallucination scores aligned with perceived quality in >90 % of sampled cases, validating the benchmark’s reliability.
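The “literalization” pattern lends itself to a simple rule-based check. The sketch below is a hypothetical heuristic, not the paper’s detector: it assumes each idiom carries a per-word literal gloss (the Expression Bank’s literal-meaning annotations could supply one) and flags translations that track the gloss word-by-word while missing the figurative equivalent.

```python
def is_literalized(translation: str, word_gloss: list[str],
                   figurative_equiv: str) -> bool:
    """Heuristic: a translation containing most of the idiom's literal
    per-word gloss but not its figurative equivalent is likely a
    word-by-word (literalized) rendering."""
    text = translation.lower()
    gloss_hits = sum(word.lower() in text for word in word_gloss)
    half_covered = gloss_hits >= len(word_gloss) / 2
    return half_covered and figurative_equiv.lower() not in text
```

Applied to the example from the results, “hand to work” would be flagged while “to get to work” would pass; an encoder-based similarity check would still be needed for subtler distortions.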
Why This Matters for AI Systems and Agents
Figurative language is pervasive in user‑generated content, marketing copy, and conversational interfaces. When an AI assistant misinterprets or fabricates idioms, the user experience degrades in subtle but consequential ways:
- Customer Support Bots: A bot that translates a Persian proverb literally could provide misleading advice, eroding trust.
- Cross‑Cultural Content Generation: Marketing teams relying on LLMs for localized copy may inadvertently produce culturally inappropriate slogans.
- Knowledge Retrieval Systems: Search engines that index hallucinated idioms can propagate misinformation across the web.
By exposing these failure modes, FFE‑Hallu equips engineers with a diagnostic tool to:
- Benchmark new multilingual models before deployment.
- Identify specific idioms that trigger hallucinations and apply targeted fine‑tuning.
- Integrate detection modules that flag potentially hallucinated outputs for human review.
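The last point—flagging potentially hallucinated outputs for human review—can be wired up as a three-way triage on the semantic-fidelity score. The thresholds below are illustrative assumptions, not values from the paper; the idea mirrors the benchmark’s optional human-validation step for borderline cases.

```python
def triage(meaning_score: float,
           release_at: float = 0.8,
           reject_below: float = 0.5) -> str:
    """Route a scored output: auto-release confident ones, auto-flag
    clear hallucinations, and send the borderline band to annotators.
    Thresholds are illustrative, not taken from the paper."""
    if meaning_score >= release_at:
        return "release"
    if meaning_score < reject_below:
        return "flag_hallucination"
    return "human_review"
```

In a production agent, only the middle band incurs human-review cost, which keeps the loop affordable while still catching subtle figurative errors.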
For organizations building AI‑driven agents, incorporating FFE‑Hallu into the evaluation pipeline can reduce the risk of cultural missteps and improve overall robustness. Learn more about building reliable AI pipelines at UBOS AI Pipelines.
What Comes Next
While FFE‑Hallu makes a strong case for systematic evaluation, several limitations remain:
- Language Scope: The current release focuses on Persian; extending to other low‑resource languages will test the benchmark’s generality.
- Figurative Variety: Idioms are just one class; metaphors, sarcasm, and cultural jokes present additional challenges.
- Model Adaptation: Simple fine‑tuning on the Expression Bank improves scores modestly, but deeper architectural changes (e.g., incorporating cultural knowledge graphs) may be required for substantial gains.
Future research directions include:
- Developing cross‑cultural grounding modules that inject region‑specific knowledge during inference.
- Exploring hallucination self‑detection mechanisms that let a model audit its own output before responding.
- Creating a multilingual figurative benchmark suite that unifies idioms, proverbs, and metaphor datasets across dozens of languages.
Practitioners interested in contributing to the next iteration of the benchmark or integrating it with their own evaluation pipelines can find resources and community forums at UBOS Community Hub.
Reference
For the full technical details, see the original arXiv preprint: FFE‑Hallu: Hallucinations in Fixed Figurative Expressions.