- Updated: June 13, 2026
Explaining is Harder Than Predicting Alone: Evaluating Concept‑based Explanations of MLLMs as ICL Visual Classifiers
Direct Answer
The paper introduces a systematic evaluation of concept‑based explanations generated by frozen multimodal large language models (MLLMs) during few‑shot in‑context learning (ICL). It shows that requiring models to produce formal, machine‑verifiable explanations actually lowers classification accuracy, revealing a gap between predictive performance and explainability.
Background: Why This Problem Is Hard
Multimodal large language models have become the de facto standard for tasks that combine vision and language, such as image classification from textual prompts. In‑context learning lets these models adapt to new categories with only a handful of labeled examples, eliminating the need for costly fine‑tuning. However, the internal reasoning that leads from the visual input to the textual label remains a black box.
Existing explainability techniques for pure language models—like chain‑of‑thought prompting—rely on the model verbalizing its reasoning steps. When the same idea is transferred to MLLMs, two major obstacles appear:
- Opaque visual grounding: The model must align pixel‑level features with abstract concepts, a process that is not directly observable.
- Mismatch between natural language and formal logic: Researchers have tried to force models to output Description Logics (DL) axioms or other structured representations, assuming that a more “formal” explanation will improve trust and debugging. In practice, the models were never instruction‑tuned for such outputs, leading to degraded performance.
Consequently, practitioners lack reliable methods to verify whether an MLLM’s prediction is based on the right visual cues, limiting the deployment of these systems in safety‑critical domains.
What the Researchers Propose
The authors design a five‑tier evaluation framework that progressively tightens the requirements on the explanations produced by an MLLM:
- Baseline classification: The model simply returns a label for each image.
- Free‑form textual description: The model describes visual features it deems relevant.
- Concept extraction: The model lists high‑level concepts (e.g., “striped pattern”, “metallic surface”).
- Structured concept mapping: The model maps concepts to predefined ontology nodes.
- Formal DL axiom generation: The model produces logical statements that can be parsed by a reasoner.
Each tier adds a layer of formal rigor, moving from loosely structured language toward machine‑verifiable logic. The framework treats the explanation as a first‑class output that can be judged independently of the label.
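One way to make the ordering of the tiers concrete is an ordered enumeration. The sketch below is illustrative only: the names `ExplanationTier` and `is_stricter` are assumptions introduced here and do not come from the paper; only the ordering is taken from the list above.

```python
from enum import IntEnum


class ExplanationTier(IntEnum):
    """The five tiers, ordered from least to most formal.

    Identifiers are hypothetical; the ordering follows the article.
    """

    BASELINE_CLASSIFICATION = 1
    FREE_FORM_DESCRIPTION = 2
    CONCEPT_EXTRACTION = 3
    STRUCTURED_CONCEPT_MAPPING = 4
    FORMAL_DL_AXIOMS = 5


def is_stricter(a: ExplanationTier, b: ExplanationTier) -> bool:
    """Return True if tier `a` imposes more formal structure than tier `b`."""
    return a > b


if __name__ == "__main__":
    # Print the tiers in order of increasing formality.
    for tier in ExplanationTier:
        print(tier.value, tier.name)
```

Using `IntEnum` keeps the tiers comparable, which is convenient when results are later grouped or plotted by level of formality.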
How It Works in Practice
The experimental pipeline consists of three interacting components:
- Frozen MLLM (the “student”): A pre‑trained multimodal model such as GPT‑4V or LLaVA that receives a prompt containing a few labeled examples and a new image.
- Prompt templates: Carefully crafted instructions that ask the model to produce outputs at each of the five tiers. For the formal tier, the template includes a mini‑ontology and asks for DL axioms.
- LLM‑as‑judge evaluator: An independent large language model (e.g., Claude) that receives the model’s prediction and explanation, then scores explanation quality, relevance, and logical consistency.
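The prompt templates can be pictured as a small function that assembles an instruction, an optional ontology, the few‑shot examples, and the query. This is a minimal sketch under stated assumptions: the instruction wording, the tier keys, the `<image:...>` placeholder syntax, and the function `build_prompt` are all hypothetical. The paper's actual templates are not reproduced in this article, and a real MLLM API would receive images as image inputs rather than text placeholders.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical instruction text per tier; not the paper's wording.
TIER_INSTRUCTIONS = {
    "baseline": "Return only the class label for the query image.",
    "description": "Describe the relevant visual features, then give the label.",
    "concepts": "List the high-level concepts you see, then give the label.",
    "mapping": "Map each concept to a node of the ontology, then give the label.",
    "dl_axioms": "Write Description Logics axioms over the ontology that justify the label.",
}

# Tiers whose template must include an ontology.
ONTOLOGY_TIERS = ("mapping", "dl_axioms")


def build_prompt(
    tier: str,
    examples: Sequence[Tuple[str, str]],
    query_image: str,
    ontology: Optional[str] = None,
) -> str:
    """Assemble a text prompt for one tier from few-shot (image, label) pairs."""
    if tier not in TIER_INSTRUCTIONS:
        raise ValueError(f"unknown tier: {tier!r}")
    if tier in ONTOLOGY_TIERS and not ontology:
        raise ValueError(f"tier {tier!r} requires an ontology")

    lines = [TIER_INSTRUCTIONS[tier]]
    if tier in ONTOLOGY_TIERS:
        lines += ["Ontology:", ontology]
    # Few-shot context: one line per labeled example.
    for image_ref, label in examples:
        lines.append(f"<image:{image_ref}> -> {label}")
    # The query image, with the answer left open.
    lines.append(f"<image:{query_image}> -> ?")
    return "\n".join(lines)
```

Keeping the few‑shot examples identical across tiers and varying only the instruction is what lets differences in accuracy be attributed to the explanation requirement.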
The workflow proceeds as follows:
- The user supplies k labeled image‑text pairs (few‑shot context).
- The frozen MLLM processes a new query image and, depending on the tier, returns a label, a description, a concept list, a mapping, or a DL axiom.
- The LLM‑as‑judge compares the output against a gold‑standard reference (derived from human annotations) and assigns a quality score.
This design isolates the effect of explanation constraints: the model's weights, including its visual encoder, stay frozen, so only the prompt changes from tier to tier.
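The workflow above can be sketched as a loop over query images. Everything named here (`ModelOutput`, `JudgedResult`, `run_tier`, and the callable signatures) is a hypothetical interface chosen for illustration; the model and the judge are passed in as plain callables so that no specific vendor API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ModelOutput:
    label: str
    explanation: str  # empty string for the baseline tier


@dataclass
class JudgedResult:
    correct: bool       # predicted label matches the gold label
    judge_score: float  # explanation quality in [0, 1]


# (few_shot_examples, query_image, tier) -> ModelOutput
Model = Callable[[Sequence[Tuple[str, str]], str, str], ModelOutput]
# (model_output, gold_explanation) -> score in [0, 1]
Judge = Callable[[ModelOutput, str], float]


def run_tier(
    model: Model,
    judge: Judge,
    tier: str,
    few_shot: Sequence[Tuple[str, str]],
    queries: Sequence[Tuple[str, str, str]],
) -> List[JudgedResult]:
    """Run one tier over (image, gold_label, gold_explanation) queries."""
    results = []
    for image_ref, gold_label, gold_explanation in queries:
        output = model(few_shot, image_ref, tier)
        score = judge(output, gold_explanation)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge score out of range: {score}")
        results.append(JudgedResult(output.label == gold_label, score))
    return results
```

Because the label check and the judge score are recorded separately, prediction and explanation can be analyzed independently, as the framework intends.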

Evaluation & Results
The authors evaluated four state‑of‑the‑art MLLMs across two benchmark datasets covering everyday objects and fine‑grained categories. They measured two axes:
- Predictive accuracy: The proportion of correct class labels.
- Explanation quality: The LLM‑as‑judge score, ranging from 0 (nonsensical) to 1 (perfectly aligned with the gold explanation).
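Both axes reduce to simple averages. The helpers below are a minimal sketch with hypothetical names (`accuracy`, `mean_judge_score`); how the paper aggregates judge scores is not stated in this article, so the plain mean is an assumption.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Proportion of predicted labels that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def mean_judge_score(scores: Sequence[float]) -> float:
    """Average explanation score; each score must lie in [0, 1]."""
    if not scores:
        raise ValueError("cannot average an empty set of scores")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("judge scores must lie in [0, 1]")
    return sum(scores) / len(scores)
```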
Key observations include:
- Baseline classification achieved the highest accuracy (≈93.8%).
- Introducing free‑form descriptions caused a modest drop (≈92.5%).
- Requiring structured concept mapping lowered accuracy further (≈91.2%).
- Generating formal DL axioms resulted in the lowest accuracy (≈90.1%).
- When the model succeeded in producing high‑quality, class‑discriminative visual features, the explanation score strongly correlated (Pearson ≈ 0.78) with correct predictions.
These results indicate a monotonic trade‑off across the reported tiers: the more formal the explanation requirement, the more the model’s predictive performance suffers, for a total drop of roughly 3.7 percentage points between the baseline and the formal DL tier. The degradation is attributed not to a lack of visual understanding but to the absence of instruction‑tuning for formal outputs.
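The trend and the correlation can be checked with a few lines of arithmetic. The accuracy figures below are the approximate values listed above (the article gives no figure for the concept extraction tier, so it is omitted). The `pearson` helper is a generic textbook implementation, not the paper's analysis code, and applying it to binary correctness values is an assumption about how such a coefficient could be computed.

```python
import math
from typing import Sequence

# Approximate accuracies (percent) as reported in the article.
REPORTED_ACCURACY = [
    ("baseline classification", 93.8),
    ("free-form description", 92.5),
    ("structured concept mapping", 91.2),
    ("formal DL axioms", 90.1),
]


def is_strictly_decreasing(values: Sequence[float]) -> bool:
    """True if each value is smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    accs = [acc for _, acc in REPORTED_ACCURACY]
    print("monotonic drop:", is_strictly_decreasing(accs))
    print("total drop (points):", round(accs[0] - accs[-1], 1))
```

To relate explanation quality to correctness, `pearson` would be called with per‑image judge scores as `xs` and 0/1 correctness indicators as `ys`.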
Why This Matters for AI Systems and Agents
For practitioners building AI agents that must justify decisions—such as autonomous inspection drones, medical image triage tools, or compliance‑focused chatbots—the findings raise two practical concerns:
- Explainability cannot be bolted on after the fact: Expecting a frozen MLLM to generate formal, verifiable explanations without dedicated fine‑tuning leads to poorer predictions, which may be unacceptable in high‑stakes environments.
- Instruction‑tuning is a missing piece: To obtain both high accuracy and trustworthy explanations, developers need to train or fine‑tune models on datasets that pair images with structured concept annotations and logical statements.
These insights bear directly on the design of AI marketing agents that must explain why a particular creative asset was selected, and on platforms such as UBOS, where explainable multimodal components can be orchestrated alongside other AI services.
What Comes Next
While the study clarifies the current limits of frozen MLLMs, several avenues remain open:
- Instruction‑tuning pipelines: Curating large‑scale datasets that align visual concepts with formal ontologies could enable models to learn the mapping end‑to‑end.
- Hybrid architectures: Combining a frozen visual encoder with a lightweight, trainable reasoning module may preserve classification strength while providing structured explanations.
- Human‑in‑the‑loop evaluation: Extending the LLM‑as‑judge framework with real user studies would validate whether the observed correlation between explanation quality and correctness translates to user trust.
- Cross‑modal reasoning benchmarks: Developing standardized tasks that require both accurate classification and logical justification could drive community adoption of explainable ICL.
Addressing these challenges will be essential for deploying multimodal agents that are not only performant but also auditable and transparent.
For a deeper dive into the methodology and raw data, consult the original arXiv paper. Readers interested in building explainable AI workflows can explore UBOS’s resources and integration guides.
