- Updated: March 11, 2026
- 7 min read
HVR-Met: A Hypothesis-Verification-Replanning Agentic System for Extreme Weather Diagnosis
Direct Answer
The paper introduces HVR‑Met, a multi‑agent system that diagnoses extreme weather events through a closed‑loop “Hypothesis‑Verification‑Replanning” (HVR) process, tightly integrating expert meteorological knowledge into autonomous reasoning. This matters because it bridges the gap between high‑performing deep‑learning forecasts and the nuanced, step‑by‑step diagnostic reasoning required to understand and respond to rare, high‑impact weather anomalies.
Background: Why This Problem Is Hard
Extreme weather diagnosis is fundamentally different from routine forecasting. While modern neural networks excel at predicting temperature, precipitation, or wind speed a few days ahead, they struggle when a signal deviates sharply from historical patterns. The difficulty stems from three intertwined challenges:
- Multi‑step logical reasoning: Detecting an anomalous radar echo, correlating it with atmospheric instability indices, and then deciding whether it signals a tornado, flash flood, or a false alarm requires a chain of conditional inferences that current end‑to‑end models do not natively support.
- Dynamic tool invocation: Expert meteorologists routinely consult specialized tools—numerical weather prediction (NWP) ensembles, satellite‑image processors, and domain‑specific statistical indices. An AI system must learn when and how to call these heterogeneous resources, often in real time.
- Prior expert judgment: Decades of climatology research embed tacit knowledge (e.g., “a rapid pressure drop combined with high low‑level shear is a classic supercell precursor”). Translating that tacit expertise into a machine‑readable form is non‑trivial.
Existing approaches—pure deep‑learning pipelines, rule‑based expert systems, or hybrid ensembles—each hit a ceiling. Pure DL models lack interpretability and cannot guarantee the stepwise validation a meteorologist expects. Rule‑based systems are brittle and cannot scale to the high dimensionality of modern sensor data. Hybrid ensembles improve accuracy but still operate in a “predict‑then‑post‑process” fashion, missing the iterative verification loop that is essential for extreme‑event diagnostics.
What the Researchers Propose
HVR‑Met reframes extreme‑weather diagnosis as an autonomous, iterative reasoning task performed by a coalition of specialized agents. The core contribution is the Hypothesis‑Verification‑Replanning loop:
- Hypothesis Generation: A “Hypothesis Agent” ingests raw sensor streams (radar, satellite, surface observations) and proposes a set of plausible extreme‑weather scenarios (e.g., “a potential derecho”, “incipient cyclogenesis”).
- Verification: A suite of “Verification Agents” each owns a domain‑specific tool (e.g., a high‑resolution NWP model, a convective‑allowing simulation, a statistical severe‑storm index). They evaluate the hypothesis against these tools, returning confidence scores and diagnostic evidence.
- Replanning: A “Planner Agent” aggregates the verification feedback, prunes low‑confidence hypotheses, and either finalizes a diagnosis or triggers a new hypothesis cycle with refined constraints.
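To make the control flow concrete, here is a minimal Python sketch of the loop. Everything below is an illustrative stand‑in, not the paper's implementation: the stub agents, the confidence scores, and the 0.75 threshold are assumptions, whereas HVR‑Met's real agents wrap LLM reasoning, NWP ensembles, and satellite tooling.

```python
"""Minimal, self-contained sketch of a Hypothesis-Verification-Replanning loop.
Every agent below is a toy stub standing in for the LLM-driven agents and
domain tools described in the paper."""

from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    label: str                          # e.g. "incipient cyclogenesis"
    confidence: float = 0.0
    evidence: list = field(default_factory=list)

def hypothesis_agent(obs: dict) -> list[Hypothesis]:
    """Propose candidate extreme-weather scenarios from raw observations."""
    return [Hypothesis("supercell / tornado"), Hypothesis("derecho"),
            Hypothesis("flash flood")]

def nwp_agent(h, obs):        # stub for a short-range ensemble forecast
    return (0.9 if obs["cape"] > 2500 and "tornado" in h.label else 0.4,
            "ensemble supports convective initiation")

def satellite_agent(h, obs):  # stub for multispectral signature extraction
    return (0.8 if obs["overshooting_tops"] else 0.3, "IR overshooting tops")

def index_agent(h, obs):      # stub for legacy severe-storm indices
    return (min(obs["stp"] / 3.0, 1.0) if "tornado" in h.label else 0.5,
            f"STP = {obs['stp']}")

def planner(obs: dict, threshold: float = 0.75, max_cycles: int = 3) -> Hypothesis:
    """Hypothesis -> Verification -> Replanning, repeated until confident."""
    best = None
    for _ in range(max_cycles):
        candidates = hypothesis_agent(obs)
        for h in candidates:
            reports = [agent(h, obs) for agent in (nwp_agent, satellite_agent, index_agent)]
            h.confidence = sum(score for score, _ in reports) / len(reports)
            h.evidence = [note for _, note in reports]
        best = max(candidates, key=lambda h: h.confidence)
        if best.confidence >= threshold:
            return best                  # confident diagnosis with evidence trail
        obs = {**obs, "refined": True}   # replanning: tighten constraints and retry
    return best                          # still uncertain: hand off to a forecaster

if __name__ == "__main__":
    diagnosis = planner({"cape": 3100, "stp": 2.4, "overshooting_tops": True})
    print(diagnosis.label, round(diagnosis.confidence, 2), diagnosis.evidence)
```

The key design point is that verification results feed back into the next hypothesis cycle rather than being a terminal filter, which is what distinguishes the loop from a single‑pass pipeline.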
Crucially, expert knowledge is embedded at three levels:
- Pre‑trained knowledge graphs that encode meteorological ontologies (e.g., relationships between shear, CAPE, and tornado genesis).
- Prompt templates curated by seasoned forecasters, guiding each agent’s language model to ask the right questions.
- Rule‑based sanity checks that enforce physical consistency (e.g., mass conservation across verification steps).
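As an illustration of the third layer, a physical‑consistency check can be written as a plain validation rule that flags verifier output violating basic constraints. The field names, units, and tolerances below are assumptions made for the sketch; the paper's actual rule set is not reproduced here.

```python
# Illustrative sanity checks enforcing physical consistency on verifier output.
# Field names, units, and tolerances are assumptions, not the paper's rules.

def check_physical_consistency(report: dict) -> list[str]:
    """Return a list of violations; an empty list means the report passes."""
    violations = []
    # Relative humidity must stay within physical bounds.
    if not 0.0 <= report["rh_percent"] <= 100.0:
        violations.append("relative humidity outside [0, 100] %")
    # CAPE is non-negative by definition.
    if report["cape_j_per_kg"] < 0:
        violations.append("negative CAPE")
    # Crude column-moisture balance: precipitation cannot exceed available
    # precipitable water plus moisture inflow over the verification window.
    if report["precip_mm"] > report["pwat_mm"] + report["moisture_inflow_mm"]:
        violations.append("precipitation exceeds available column moisture")
    return violations

report = {"rh_percent": 97.0, "cape_j_per_kg": 3100.0,
          "precip_mm": 42.0, "pwat_mm": 38.0, "moisture_inflow_mm": 10.0}
print(check_physical_consistency(report))   # -> [] (report is physically plausible)
```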
How It Works in Practice
The operational workflow can be visualized as a pipeline of interacting agents, each with a well‑defined API. Below is a conceptual step‑by‑step description:
- Data Ingestion: A Data Broker streams real‑time observations into a shared knowledge base.
- Hypothesis Agent Activation: Triggered by the detection of an anomalous signal (e.g., a sudden radar reflectivity spike), the agent queries the knowledge base and produces a ranked list of candidate extreme‑weather events.
- Parallel Verification: For each candidate, a set of Verification Agents runs concurrently (a minimal fan‑out sketch follows this list):
  - Numerical Agent launches a short‑range ensemble forecast centered on the candidate’s spatiotemporal window.
  - Satellite Agent extracts multispectral signatures to confirm convective development.
  - Statistical Agent computes legacy indices (e.g., STP, VIL) to cross‑validate the hypothesis.
- Evidence Aggregation: The Planner Agent receives a structured report (confidence, evidence snippets, error margins) from each verifier.
- Decision Logic: If the aggregated confidence exceeds a calibrated threshold, the system emits a final diagnosis (e.g., “tornado‑watch likely within 30 km radius”). Otherwise, the Planner refines the hypothesis space (e.g., narrows the geographic focus) and restarts the loop.
- Human‑in‑the‑Loop Hand‑off: The final diagnosis, together with the evidence trail, is presented to forecasters via a dashboard, allowing them to accept, modify, or override the AI recommendation.
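The paper does not prescribe a particular concurrency mechanism, but because the verifiers are independent, a natural reading of the parallel‑verification and decision steps is a fan‑out/fan‑in pattern. The sketch below uses Python's standard `concurrent.futures`; the `Report` schema, the verifier bodies, and the 0.75 threshold are assumptions for illustration only.

```python
# Illustrative fan-out of independent Verification Agents plus the planner's
# threshold decision. Report schema, verifier bodies, and the threshold value
# are assumptions for this sketch, not values from the paper.

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Report:
    verifier: str
    confidence: float       # 0..1
    evidence: str
    error_margin: float

def numerical_agent(candidate: str) -> Report:
    return Report("numerical", 0.82, "ensemble agrees on convective initiation", 0.05)

def satellite_agent(candidate: str) -> Report:
    return Report("satellite", 0.76, "overshooting tops in IR imagery", 0.08)

def statistical_agent(candidate: str) -> Report:
    return Report("statistical", 0.69, "STP above climatological threshold", 0.10)

def verify_in_parallel(candidate: str) -> list[Report]:
    verifiers = [numerical_agent, satellite_agent, statistical_agent]
    with ThreadPoolExecutor(max_workers=len(verifiers)) as pool:
        return list(pool.map(lambda v: v(candidate), verifiers))

def decide(reports: list[Report], threshold: float = 0.75) -> str:
    mean_conf = sum(r.confidence for r in reports) / len(reports)
    if mean_conf >= threshold:
        return f"emit diagnosis (mean confidence {mean_conf:.2f}) with evidence trail"
    return f"confidence {mean_conf:.2f} below threshold: refine hypothesis and re-enter loop"

reports = verify_in_parallel("potential supercell within 30 km radius")
print(decide(reports))
```

In a production setting the structured reports, not just the final verdict, would be surfaced on the forecaster dashboard so the evidence trail remains auditable.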
What sets HVR‑Met apart from prior multi‑agent designs is the explicit “replanning” stage that treats verification as a feedback signal, not a one‑off filter. This creates a dynamic, self‑correcting reasoning cycle that can adapt to evolving data streams—a necessity when dealing with rapidly intensifying storms.
Evaluation & Results
To validate the system, the authors built a novel benchmark that decomposes extreme‑weather diagnosis into atomic subtasks. Each subtask isolates a single reasoning step (e.g., “detect mesoscale convective vortex”, “verify low‑level shear threshold”). The benchmark includes:
- 30 real‑world case studies from the past five years, covering hurricanes, derechos, flash floods, and winter blizzards.
- Ground‑truth annotations from the National Weather Service (NWS) and peer‑reviewed post‑event analyses.
- Metrics that capture both accuracy (correct diagnosis rate) and reasoning fidelity (percentage of verification steps that align with expert expectations).
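Once each case carries a predicted label and a trace of executed verification steps, the two headline metrics are straightforward to compute. The sketch below shows one plausible scoring scheme; the case schema and the step‑matching rule are assumptions, and the benchmark's exact definitions may differ.

```python
# One plausible way to score the two benchmark metrics over annotated cases.
# The case schema and step-matching rule are illustrative assumptions.

def correct_diagnosis_rate(cases: list[dict]) -> float:
    """Fraction of cases whose final diagnosis matches the ground-truth label."""
    hits = sum(1 for c in cases if c["predicted"] == c["ground_truth"])
    return hits / len(cases)

def reasoning_fidelity(cases: list[dict]) -> float:
    """Fraction of executed verification steps that match expert-expected steps."""
    matched = total = 0
    for c in cases:
        expected = set(c["expected_steps"])
        for step in c["executed_steps"]:
            total += 1
            matched += step in expected
    return matched / total

cases = [
    {"predicted": "derecho", "ground_truth": "derecho",
     "expected_steps": {"check_shear", "check_cold_pool", "check_bow_echo"},
     "executed_steps": ["check_shear", "check_bow_echo", "check_cape"]},
    {"predicted": "flash flood", "ground_truth": "tornado",
     "expected_steps": {"check_stp", "check_mesocyclone"},
     "executed_steps": ["check_stp"]},
]
print(f"diagnosis rate: {correct_diagnosis_rate(cases):.0%}")     # 50%
print(f"reasoning fidelity: {reasoning_fidelity(cases):.0%}")     # 75%
```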
Key findings:
| Metric | HVR‑Met | Baseline Deep‑Learning Forecast | Rule‑Based Expert System |
|---|---|---|---|
| Correct Diagnosis Rate | 87 % | 62 % | 71 % |
| Reasoning Fidelity | 92 % | 45 % | 78 % |
| Average Time to Final Diagnosis | 4.2 min | 2.8 min | 5.6 min |
While pure deep‑learning models are faster, they miss critical verification steps, leading to lower fidelity. The rule‑based system achieves higher fidelity but suffers from longer latency and lower overall accuracy. HVR‑Met strikes a balance: it maintains near‑real‑time performance while delivering the highest diagnostic accuracy and a transparent evidence trail.
Why This Matters for AI Systems and Agents
For practitioners building autonomous agents in high‑stakes domains, HVR‑Met offers three actionable takeaways:
- Iterative Reasoning as a Design Primitive: Embedding a verification‑feedback loop transforms agents from “single‑shot predictors” into “self‑correcting diagnosticians.” This pattern can be transplanted to other domains such as medical triage, cybersecurity incident response, or financial risk assessment.
- Expert Knowledge Integration at Scale: The paper demonstrates a practical recipe for marrying large language models with curated domain ontologies and rule‑based sanity checks, a blueprint that avoids the “black‑box” pitfalls of pure DL.
- Fine‑Grained Evaluation Frameworks: The atomic‑level benchmark highlights the importance of measuring not just end‑task accuracy but also the quality of intermediate reasoning steps. Teams can adopt similar evaluation pipelines using our agent evaluation framework to surface hidden failure modes early in development.
In practice, these insights enable AI teams to construct more trustworthy, auditable, and adaptable agentic pipelines—qualities that are increasingly demanded by regulators and end‑users alike.
What Comes Next
Despite its promising results, HVR‑Met leaves several avenues open for future work:
- Scalability to Global Operations: Current experiments focus on regional case studies. Extending the system to ingest and reason over global satellite constellations will require distributed orchestration, a challenge that can be tackled with modern agent orchestration platforms.
- Learning to Generate Hypotheses: The Hypothesis Agent currently relies on prompt templates. End‑to‑end reinforcement learning could enable the agent to discover novel diagnostic pathways beyond human‑crafted prompts.
- Robustness to Sensor Failures: Extreme events often coincide with sensor outages. Incorporating uncertainty‑aware models and fallback strategies will improve resilience.
- Human‑AI Collaboration Interfaces: While the system provides an evidence trail, designing intuitive visualizations that let forecasters interactively query the reasoning chain remains an open UX problem.
Addressing these challenges will not only sharpen HVR‑Met’s operational readiness but also push the broader field of agentic AI toward truly autonomous scientific reasoning.
Call to Action
For a deeper dive into the methodology, benchmark design, and experimental details, read the full paper on arXiv. If your organization is exploring AI‑driven climate or weather solutions, our team at ubos.tech can help you prototype, evaluate, and deploy agentic systems built on the principles outlined here.