Updated: March 12, 2026

Multimodal Modular Chain of Thoughts in Energy Performance Certificate Assessment

Direct Answer

The paper introduces Multimodal Modular Chain of Thoughts (MMCoT), a prompt‑driven framework that combines vision‑language models with structured reasoning to estimate Energy Performance Certificate (EPC) ratings from a handful of photographs. By breaking the assessment into a sequence of interpretable sub‑tasks, MMCoT produces more accurate predictions whose errors tend to stay within one grade of the true rating, while keeping data collection and computational costs low. That makes it attractive for regions where full‑scale EPC surveys are impractical.

Background: Why This Problem Is Hard

Energy Performance Certificates are mandatory in many jurisdictions to inform buyers, renters, and policymakers about a building’s energy efficiency. Generating a reliable EPC traditionally requires:

- Detailed on‑site audits (thermal imaging, blower‑door tests, utility bill analysis).
- Specialist expertise to interpret building fabric, HVAC systems, and occupancy patterns.
- Significant time and financial resources, often prohibitive for small residential portfolios or emerging markets.

Consequently, large swaths of the housing stock remain unevaluated, limiting the effectiveness of energy‑saving incentives and carbon‑reduction targets. Existing AI‑based attempts to automate EPC estimation have focused on either:

- Purely visual models that infer energy performance from façade images but ignore interior attributes, leading to noisy predictions.
- Tabular or sensor‑driven approaches that need extensive IoT deployments or historical consumption data, which are rarely available at scale.

Both strategies struggle with the “data‑scarce” reality of many regions: limited labeled EPC datasets, heterogeneous building styles, and the need to respect privacy constraints. Moreover, EPC ratings are ordinal (A‑G), so a model that treats them as independent categories has no notion that confusing A with G is worse than confusing A with B. Its errors can land several grades away from the truth, which undermines trust.

What the Researchers Propose

MMCoT reframes EPC estimation as a **modular chain of thoughts** that mirrors how a human assessor would reason:

- Attribute Extraction: From a set of exterior and interior photos, the vision‑language model identifies concrete building features (e.g., wall insulation type, window glazing, heating system).
- Intermediate Reasoning: Each extracted attribute feeds into a dedicated prompt that infers a sub‑score (e.g., “What is the likely U‑value of the windows?”).
- Score Aggregation: The sub‑scores are combined through a final prompt that respects the ordinal hierarchy of EPC grades, producing a single rating.

Key components include:

- Vision‑Language Backbone (e.g., CLIP‑based encoder) that converts images and textual cues into a shared embedding space.
- Structured Prompt Library that defines the chain of thoughts, each prompt being a self‑contained reasoning step.
- Attribute Propagation Engine that passes intermediate results forward, ensuring later prompts have access to earlier inferences.
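
To make the prompt library and propagation engine concrete, here is a minimal Python sketch. It is not the authors' implementation: `PromptStep`, `run_chain`, the `propagate` flag, and the stub model are all hypothetical names introduced for illustration, and the model is replaced by canned answers so the example runs without a real vision‑language backbone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface: (rendered prompt, image reference) -> answer text.
ModelFn = Callable[[str, str], str]


class _Default(dict):
    """Render missing placeholders as 'unknown' instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return "unknown"


@dataclass
class PromptStep:
    name: str       # key under which this step's answer is stored
    image_key: str  # which uploaded photo the step inspects
    template: str   # prompt text; {placeholders} name earlier steps


def run_chain(steps: List[PromptStep], images: Dict[str, str],
              model: ModelFn, propagate: bool = True) -> Dict[str, str]:
    """Run steps in order; with propagate=True, earlier answers fill later prompts."""
    context: Dict[str, str] = {}
    for step in steps:
        visible = context if propagate else {}
        prompt = step.template.format_map(_Default(visible))
        context[step.name] = model(prompt, images[step.image_key])
    return context


# Stub model with canned answers (assumption: stands in for a real VLM call).
CANNED = {"wall.jpg": "mineral wool", "window.jpg": "double glazing"}
seen_prompts: List[str] = []


def stub_model(prompt: str, image: str) -> str:
    seen_prompts.append(prompt)
    return CANNED[image]


steps = [
    PromptStep("insulation", "wall",
               "What insulation material is most likely present?"),
    PromptStep("glazing", "window",
               "Wall insulation: {insulation}. What glazing type is most likely?"),
]
images = {"wall": "wall.jpg", "window": "window.jpg"}

result = run_chain(steps, images, stub_model)
print(result)
```

Calling `run_chain(..., propagate=False)` runs the same prompts in isolation, which is one way to picture a configuration without attribute propagation.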

How It Works in Practice

The practical workflow can be visualized as a pipeline:

[Figure: MMCoT architecture diagram]

Step‑by‑Step Interaction

- Data Ingestion: A field technician uploads 5–7 high‑resolution photos (front façade, roof, kitchen, living room, heating boiler). No floor plans or utility bills are required.
- Vision‑Language Encoding: Each image is paired with a short textual cue (e.g., “show the heating system”) and fed into the multimodal encoder, producing a set of feature vectors.
- Modular Prompt Execution:
  - Prompt A – Insulation Detection: “Based on the wall photo, what insulation material is most likely present?” The model returns “Mineral wool” with a confidence score.
  - Prompt B – Window Efficiency: Uses the window photo and the insulation result to estimate the glazing type.
  - …additional prompts for HVAC, lighting, and building orientation.
- Attribute Propagation: The output of Prompt A becomes an input variable for Prompt B, enabling cumulative reasoning rather than isolated guesses.
- Final EPC Synthesis: A concluding prompt aggregates all sub‑scores, explicitly asking the model to “choose the EPC grade that best fits the inferred attributes while preserving the A‑to‑G order.”
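
The final synthesis step can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `build_synthesis_prompt` and `parse_grade` are hypothetical helpers, and the requested `Grade: X` reply format is an added convention that makes the answer easy to validate against the seven EPC bands.

```python
import re
from typing import Dict, Optional

GRADES = "ABCDEFG"  # EPC bands, best to worst


def build_synthesis_prompt(attributes: Dict[str, str]) -> str:
    """Render the concluding prompt from the propagated attributes."""
    lines = [f"- {name}: {value}" for name, value in attributes.items()]
    return (
        "Inferred building attributes:\n"
        + "\n".join(lines)
        + "\nChoose the EPC grade that best fits the inferred attributes "
        "while preserving the A-to-G order. "
        'Answer in the form "Grade: X".'  # reply format is an assumption
    )


def parse_grade(reply: str) -> Optional[str]:
    """Extract the letter after 'Grade:'; return None if no valid band is found."""
    match = re.search(r"grade:\s*([A-G])\b", reply, re.IGNORECASE)
    return match.group(1).upper() if match else None


prompt = build_synthesis_prompt(
    {"insulation": "mineral wool", "glazing": "double glazing"}
)
print(prompt)
print(parse_grade("Grade: C"))
```

Returning `None` for an unparseable reply lets the caller retry or flag the property for manual review instead of silently recording an invalid grade.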

What sets MMCoT apart is the **explicit chain**. Instead of a monolithic black box that maps images to a grade, the system surfaces intermediate hypotheses that can be inspected, corrected, or enriched with domain knowledge. This modularity also makes it straightforward to swap in a more powerful vision encoder or add new attribute prompts without retraining the entire pipeline.

Evaluation & Results

The authors validated MMCoT on a curated multimodal dataset of 81 residential properties across the United Kingdom, each annotated with a ground‑truth EPC rating from the national registry. The experimental protocol compared three configurations:

- Baseline Instruction‑Only Prompting: A single prompt that asks the model to predict the EPC directly from all photos.
- MMCoT (Full Chain): The proposed modular approach with attribute propagation.
- Ablation – No Propagation: Modular prompts executed independently, without passing intermediate results.

Key findings:

| Metric | Baseline | MMCoT | Ablation (no propagation) |
| --- | --- | --- | --- |
| Overall accuracy | 62 % | 78 % | 70 % |
| Mean absolute error (MAE), in grade distance | 1.2 | 0.6 | 0.9 |
| Recall (grades A‑G) | 55 % – 68 % | 71 % – 84 % | 63 % – 77 % |

Beyond raw numbers, the confusion matrix revealed that MMCoT’s errors were overwhelmingly confined to **adjacent grades** (e.g., predicting B instead of A), reflecting an understanding of the ordinal nature of EPC. In contrast, the baseline occasionally jumped two or more grades, a pattern that would erode stakeholder confidence.
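
The ordinal metrics discussed here are easy to reproduce. The sketch below computes accuracy, MAE in grade distance, and the share of errors that fall on an adjacent grade. The labels are made up for illustration and are not data from the paper; `grade_distance` and `summarize` are hypothetical helper names.

```python
from typing import Dict, List

GRADES = "ABCDEFG"  # EPC bands, best to worst


def grade_distance(true_grade: str, pred_grade: str) -> int:
    """Number of bands between two grades (A vs. C -> 2)."""
    return abs(GRADES.index(true_grade) - GRADES.index(pred_grade))


def summarize(true: List[str], pred: List[str]) -> Dict[str, float]:
    dists = [grade_distance(t, p) for t, p in zip(true, pred)]
    errors = [d for d in dists if d > 0]
    adjacent = sum(1 for d in errors if d == 1)
    return {
        "accuracy": dists.count(0) / len(dists),
        "mae": sum(dists) / len(dists),
        # Fraction of wrong predictions that miss by exactly one band.
        "adjacent_share_of_errors": adjacent / len(errors) if errors else 0.0,
    }


# Made-up labels for illustration only; not results from the paper.
true = ["C", "D", "D", "E", "B"]
pred = ["C", "C", "D", "G", "B"]
stats = summarize(true, pred)
print(stats)
```

Two models with the same accuracy can differ sharply on MAE and adjacent share, which is why those ordinal measures are reported alongside accuracy.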

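Statistical testing (paired t‑test, p < 0.01) indicated that the performance uplift is statistically significant. In the ablation, modular prompting without propagation lifts accuracy from 62 % to 70 %, and adding **attribute propagation** contributes the remaining 8 points (70 % to 78 %). Chained reasoning therefore accounts for roughly half of the overall gain, with the modular decomposition supplying the rest.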

Why This Matters for AI Systems and Agents

For practitioners building AI‑augmented sustainability tools, MMCoT offers several actionable takeaways:

- Modular Prompt Design enables rapid prototyping. Engineers can add or replace a reasoning module (e.g., a new prompt for solar panel assessment) without retraining the entire model.
- Ordinal Awareness is baked into the final synthesis step, reducing catastrophic misclassifications that plague flat classification approaches.
- Low‑Cost Data Requirements mean that a field‑app can collect a handful of photos and still deliver a credible EPC estimate, opening doors for large‑scale pre‑screening in emerging markets.
- Explainability is inherent: each intermediate output can be logged and presented to auditors, which supports regulatory demands for transparency.

These properties align well with the design of autonomous agents that must reason over heterogeneous inputs and produce trustworthy decisions. For example, an energy‑optimization agent could invoke MMCoT as a sub‑routine to prioritize retrofits based on the most likely EPC upgrade path.
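
As a rough sketch of that sub‑routine pattern, the snippet below ranks assessed properties so the lowest‑rated ones surface first. It is a deliberate simplification: `Assessment` and `prioritize` are hypothetical names, the grades are made up, and a real agent would weigh retrofit cost and the likely upgrade path instead of the current grade alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List

GRADES = "ABCDEFG"  # EPC bands, best to worst


@dataclass
class Assessment:
    property_id: str
    grade: str
    # Logged intermediate outputs, kept so an auditor can inspect the chain.
    attributes: Dict[str, str] = field(default_factory=dict)


def prioritize(assessments: List[Assessment], worst_n: int) -> List[Assessment]:
    """Return the worst_n assessments, lowest-rated (closest to G) first."""
    ranked = sorted(assessments, key=lambda a: GRADES.index(a.grade), reverse=True)
    return ranked[:worst_n]


# Made-up assessments for illustration only.
assessments = [
    Assessment("p1", "C", {"glazing": "double glazing"}),
    Assessment("p2", "F", {"glazing": "single glazing"}),
    Assessment("p3", "E", {"insulation": "none visible"}),
]
queue = prioritize(assessments, 2)
print([a.property_id for a in queue])
```

Keeping the `attributes` alongside each grade means the agent can justify why a property was queued, not just report its rank.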

Developers looking to integrate this capability can start by leveraging the energy assessment platform at ubos.tech, which already supports multimodal uploads and prompt orchestration.

What Comes Next

While MMCoT marks a significant step forward, several limitations remain:

- Dataset Scale: The evaluation used only 81 properties. Larger, more diverse datasets (including high‑rise apartments and historic buildings) are needed to confirm generalizability.
- Dynamic Features: Current prompts focus on static attributes. Incorporating temporal data (e.g., short‑term energy consumption spikes) could refine the final grade.
- Model Dependency: The framework relies on a single vision‑language backbone. Exploring ensembles or newer multimodal transformers may boost robustness.

Future research directions include:

- Extending the chain to cover **post‑occupancy monitoring**, allowing agents to update EPC predictions as real‑world usage data arrives.
- Integrating **domain‑specific knowledge graphs** that encode building codes, enabling the prompts to reason about compliance constraints.
- Deploying MMCoT in a **real‑time mobile app** where on‑device inference can provide instant feedback to surveyors, reducing latency and privacy concerns.

Practitioners interested in experimenting with the architecture can explore the open‑source orchestration tools described on ubos.tech’s agents hub. For collaborations, feedback, or licensing inquiries, please reach out via the contact page.

References

Peng, Z., & Bentley, P. J. (2026). Multimodal Modular Chain of Thoughts in Energy Performance Certificate Assessment. arXiv preprint arXiv:2603.00115v1.

Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.
