- Updated: June 25, 2026
Generative Responsible AI Data Evaluation Schema (GRAIDES) for AI Assurance in Local Government
Direct Answer
The paper introduces GRAIDES (Generative Responsible AI Data Evaluation Schema), a lightweight, open‑source data model that centralises observability and evaluation data for generative AI systems used by local governments. By treating evaluations as a structured data‑modelling problem, GRAIDES makes benchmarking, tuning, and assurance activities reproducible across vendors, thereby turning fragmented metrics into a single, comparable evidence base.
Background: Why This Problem Is Hard
Local authorities are increasingly deploying generative AI for citizen services, document drafting, and decision support. Yet the trust required for public‑sector adoption hinges on measurable evidence of safety, fairness, and alignment with policy goals. In practice, three intertwined challenges prevent reliable assurance:
- Data fragmentation: Evaluation logs, human‑in‑the‑loop feedback, and performance metrics are stored in disparate systems—often proprietary dashboards from different AI vendors.
- Inconsistent schema: Each vendor defines its own fields (e.g., “prompt‑quality”, “toxicity score”), making cross‑vendor comparison a manual, error‑prone exercise.
- Lack of reproducibility: Without a common representation, reproducing a benchmark or auditing a model’s behaviour over time becomes nearly impossible, especially when policy auditors request evidence months after a deployment.
Existing governance frameworks (e.g., ISO/IEC 42001, the NIST AI Risk Management Framework) prescribe high‑level processes but stop short of providing a concrete data model that can be operationalised at the municipal level. Consequently, AI officers spend disproportionate effort stitching together spreadsheets, custom APIs, and ad‑hoc scripts, and that effort scales poorly as the number of models and vendors grows.
What the Researchers Propose
GRAIDES is positioned as a schema‑first solution. Rather than building a monolithic platform, the authors define a set of interoperable JSON‑LD entities that capture every relevant evaluation artifact:
- Evaluation Instance: A timestamped record linking a specific model version, input prompt, and the resulting output.
- Human Annotation: Structured feedback from evaluators, including rating dimensions (e.g., relevance, bias, factuality) and optional free‑text comments.
- Vendor Metadata: Vendor‑specific identifiers, API version, and cost‑per‑token information, enabling cost‑effectiveness analysis.
- Statistical Summary: Pre‑computed aggregates (mean, variance, confidence intervals) that can be queried without re‑processing raw logs.
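To make the entity list concrete, here is a minimal sketch of two of these entities as Python dataclasses serialised into a JSON‑LD style envelope. The field names, the `to_json_ld` helper, and the context URL are illustrative assumptions, not the published GRAIDES vocabulary.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional

# Hypothetical vocabulary URL; the published GRAIDES context may differ.
GRAIDES_CONTEXT = "https://example.org/graides/v1"


@dataclass
class EvaluationInstance:
    instance_id: str
    model_version: str
    prompt: str
    output: str
    timestamp: str  # ISO 8601 string


@dataclass
class HumanAnnotation:
    annotation_id: str
    instance_id: str  # links back to the EvaluationInstance
    annotator_id: str
    ratings: Dict[str, float] = field(default_factory=dict)
    comment: Optional[str] = None


def to_json_ld(entity) -> dict:
    """Wrap a dataclass instance in a minimal JSON-LD style envelope."""
    doc = {"@context": GRAIDES_CONTEXT, "@type": type(entity).__name__}
    doc.update(asdict(entity))
    return doc


instance = EvaluationInstance(
    instance_id="ev-001",
    model_version="vendor-a/model-x@1.2",
    prompt="How do I apply for a parking permit?",
    output="You can apply online through the council website.",
    timestamp="2026-01-15T09:30:00+00:00",
)
annotation = HumanAnnotation(
    annotation_id="an-001",
    instance_id=instance.instance_id,
    annotator_id="analyst-17",
    ratings={"relevance": 4, "bias": 1, "factuality": 3},
)
print(json.dumps(to_json_ld(annotation), indent=2))
```

The annotation carries the `instance_id` of the record it rates, which is what lets every rating be traced back to a specific prompt and model version.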
Key agents in the GRAIDES ecosystem are:
- Data Ingestor: A thin connector that normalises vendor‑specific logs into the GRAIDES schema.
- Evaluation Engine: A reusable service that runs automated tests (e.g., prompt‑robustness, hallucination detection) and writes results as Evaluation Instances.
- Governance Dashboard: A visual layer that queries the schema to surface compliance metrics, trend lines, and disagreement heatmaps.
By decoupling the schema from any particular vendor, GRAIDES can be adopted incrementally: organisations can start by mapping a single vendor’s logs and later expand to a multi‑vendor observability hub.
How It Works in Practice
In practice, the workflow is a four‑stage pipeline whose stages do not overlap and together cover the full evaluation lifecycle:
1. Ingestion & Normalisation
Each AI provider (OpenAI, Anthropic, Cohere, etc.) exports its evaluation logs via a REST endpoint or a cloud storage bucket. The Data Ingestor reads these raw artifacts, maps fields to the GRAIDES JSON‑LD model, and stores the result in a central document store (e.g., MongoDB or a graph database). Because the schema is versioned, each stored record can declare the version it conforms to, which supports backward compatibility as the schema evolves.
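The normalisation step can be sketched as a per‑vendor field mapping. The vendor names, raw field names, and `SCHEMA_VERSION` value below are hypothetical; real vendor exports and the actual GRAIDES field names will differ.

```python
# Hypothetical schema version tag stored with every normalised record.
SCHEMA_VERSION = "1.0"

# Hypothetical mappings: raw vendor field name -> common schema field name.
FIELD_MAPS = {
    "vendor_a": {
        "id": "instance_id",
        "model": "model_version",
        "input": "prompt",
        "completion": "output",
        "created": "timestamp",
    },
    "vendor_b": {
        "request_id": "instance_id",
        "engine": "model_version",
        "prompt_text": "prompt",
        "response_text": "output",
        "ts": "timestamp",
    },
}


def normalise(vendor: str, raw: dict) -> dict:
    """Map one raw vendor log entry onto the common field names."""
    if vendor not in FIELD_MAPS:
        raise KeyError(f"no field mapping registered for vendor {vendor!r}")
    mapping = FIELD_MAPS[vendor]
    missing = [src for src in mapping if src not in raw]
    if missing:
        raise ValueError(f"{vendor} record is missing fields: {missing}")
    record = {target: raw[src] for src, target in mapping.items()}
    record["vendor"] = vendor
    record["schema_version"] = SCHEMA_VERSION
    return record
```

Rejecting records with missing fields, rather than filling defaults, keeps gaps in vendor exports visible at ingestion time instead of surfacing later as silent holes in a compliance report.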
2. Human‑in‑the‑Loop Annotation
Policy analysts or citizen‑service agents review model outputs through a lightweight UI. Their ratings are captured as Human Annotation objects, automatically linked to the originating Evaluation Instance. The UI also flags systematic disagreement when multiple annotators diverge beyond a configurable threshold.
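One way to implement the disagreement flag is to measure the spread of annotator ratings per Evaluation Instance and compare it with the threshold. The sketch below assumes population standard deviation as the spread measure and a dictionary layout for annotations; the source does not specify either.

```python
from collections import defaultdict
from statistics import pstdev


def flag_disagreement(annotations, dimension, threshold=1.0):
    """Return instance ids whose ratings on `dimension` spread beyond `threshold`.

    `annotations` is a list of dicts with "instance_id" and "ratings" keys
    (an assumed layout). Instances with fewer than two ratings are skipped,
    because disagreement is undefined for a single annotator.
    """
    by_instance = defaultdict(list)
    for ann in annotations:
        if dimension in ann["ratings"]:
            by_instance[ann["instance_id"]].append(ann["ratings"][dimension])
    return sorted(
        instance_id
        for instance_id, scores in by_instance.items()
        if len(scores) >= 2 and pstdev(scores) > threshold
    )


annotations = [
    {"instance_id": "ev-001", "ratings": {"factuality": 1}},
    {"instance_id": "ev-001", "ratings": {"factuality": 5}},
    {"instance_id": "ev-002", "ratings": {"factuality": 4}},
    {"instance_id": "ev-002", "ratings": {"factuality": 4}},
    {"instance_id": "ev-002", "ratings": {"factuality": 5}},
]
print(flag_disagreement(annotations, "factuality"))  # prints ['ev-001']
```

Ratings of 1 and 5 have a population standard deviation of 2, which exceeds the default threshold of 1, while ratings of 4, 4 and 5 stay well under it.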
3. Automated Metric Computation
The Evaluation Engine periodically runs batch jobs that compute statistical summaries, detect outliers, and apply domain‑specific safety checks (e.g., GDPR‑sensitive data leakage). Results are persisted as Statistical Summary entities, enabling instant dashboard refreshes.
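A Statistical Summary for one rating dimension could be computed along these lines. This is a sketch using a normal‑approximation confidence interval; the function name and output keys are assumptions, and the source does not say which interval method GRAIDES uses.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def summarise(scores, confidence=0.95):
    """Compute mean, sample variance and a normal-approximation confidence interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are needed to estimate variance")
    m = mean(scores)
    var = variance(scores)  # sample variance (n - 1 denominator)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * sqrt(var / n)
    return {
        "n": n,
        "mean": m,
        "variance": var,
        "ci_low": m - half_width,
        "ci_high": m + half_width,
        "confidence": confidence,
    }
```

For small samples a Student's t interval would be wider and more appropriate; the normal approximation is used here only to keep the sketch within the standard library.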
4. Governance & Reporting
Stakeholders query the schema using GraphQL or a simple REST filter. The Governance Dashboard visualises:
- Model‑level compliance scores over time.
- Cost‑per‑evaluation trends across vendors.
- Heatmaps of annotator disagreement, highlighting prompts that consistently trigger bias or factual errors.
Because every datum is traceable to its source (prompt, model version, annotator), auditors can reproduce any reported metric with a single API call.
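The traceability claim can be illustrated with a small in‑memory query that recomputes a metric and returns the identifiers of the records it was derived from. The record layout and the `reproduce_metric` name are assumptions for illustration; a real deployment would issue the equivalent GraphQL or REST query against the store.

```python
from statistics import mean


def reproduce_metric(instances, annotations, model_version, dimension):
    """Recompute the mean rating for one model version, with its provenance.

    Returns None when no matching annotations exist, so that an empty result
    is never mistaken for a score of zero.
    """
    instance_ids = {
        inst["instance_id"]
        for inst in instances
        if inst["model_version"] == model_version
    }
    used = [
        ann
        for ann in annotations
        if ann["instance_id"] in instance_ids and dimension in ann["ratings"]
    ]
    if not used:
        return None
    return {
        "model_version": model_version,
        "dimension": dimension,
        "mean": mean(ann["ratings"][dimension] for ann in used),
        "source_annotations": sorted(ann["annotation_id"] for ann in used),
    }
```

Returning the contributing annotation identifiers alongside the value is what lets an auditor check a reported number against the underlying records rather than take it on trust.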
What Sets GRAIDES Apart
- Vendor‑agnostic data model: No lock‑in; the same schema works for OpenAI, Anthropic, or emerging local‑LLM providers.
- Lightweight implementation: The reference codebase is under 2 KB of JSON‑LD definitions and a handful of Python utilities, making it easy to embed in existing CI/CD pipelines.
- Built‑in disagreement detection: Systematic human‑model misalignment is surfaced automatically, a feature rarely found in generic observability tools.
Architecture Diagram (placeholder)

Evaluation & Results
The authors validated GRAIDES with a real‑world deployment at Westminster City Council, where the council maintains an AI catalogue of 12 generative models spanning three vendors. The evaluation focused on two core questions:
- Can GRAIDES reliably surface systematic disagreement between human evaluators and model outputs?
- Does a unified schema improve the speed and accuracy of compliance reporting?
Scenario 1 – Human‑Model Alignment
Over a four‑week pilot, 150 policy analysts reviewed 3,000 model responses to citizen‑service prompts (e.g., “How do I apply for a parking permit?”). Using GRAIDES, the team identified 27% of prompts where annotator scores diverged by more than one standard deviation. Further investigation revealed two root causes: (a) ambiguous prompt phrasing, and (b) model‑specific hallucination patterns. By feeding these insights back into prompt engineering guidelines, the council reduced disagreement to 12% in the subsequent iteration.
Scenario 2 – Reporting Efficiency
Prior to GRAIDES, the council’s compliance officer spent an average of 12 hours per month manually consolidating CSV exports from each vendor. After integrating the schema, the same officer generated a full compliance report with a single GraphQL query in under five minutes. The study reports this as a 96% reduction in manual effort, although 12 hours against five minutes for the consolidation step alone works out to more than 99%. Moreover, the audit trail provided by GRAIDES satisfied the council’s external regulator, who praised the “single source of truth” for AI performance data.
These findings demonstrate that GRAIDES not only surfaces hidden alignment issues but also delivers tangible productivity gains for AI governance teams.
Why This Matters for AI Systems and Agents
For AI practitioners building agents that interact with citizens, the ability to prove that a model behaves responsibly is no longer optional—it is a regulatory prerequisite. GRAIDES offers a concrete pathway to embed responsible‑AI checks directly into the development lifecycle:
- Continuous Evaluation: Agents can trigger the Evaluation Engine after each deployment, automatically logging results in the schema.
- Cost‑aware Orchestration: By attaching vendor cost metadata, orchestration platforms can route high‑risk queries to cheaper, more interpretable models while reserving expensive, high‑capacity models for complex tasks.
- Feedback Loops: Disagreement heatmaps feed directly into prompt‑refinement pipelines, enabling agents to self‑improve without human re‑annotation.
- Audit‑ready Artifacts: When a city council faces a public inquiry, the Governance Dashboard can export a complete, timestamped evidence package, reducing legal exposure.
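As an illustration of cost‑aware orchestration, the sketch below picks the cheapest model whose recorded compliance score meets the minimum required for a query's risk tier. The tier names, thresholds, and record fields are assumptions, and this is one possible routing policy rather than the one described by the authors.

```python
# Hypothetical minimum compliance score required per risk tier.
MIN_SCORE_BY_TIER = {"low": 0.70, "medium": 0.85, "high": 0.95}


def choose_model(models, risk_tier):
    """Return the cheapest model version that meets the tier's minimum score.

    `models` is a list of dicts with "model_version", "cost_per_token" and
    "compliance_score" keys (an assumed layout combining Vendor Metadata
    with a Statistical Summary).
    """
    min_score = MIN_SCORE_BY_TIER[risk_tier]
    eligible = [m for m in models if m["compliance_score"] >= min_score]
    if not eligible:
        raise LookupError(f"no model meets the {risk_tier!r} tier threshold")
    return min(eligible, key=lambda m: m["cost_per_token"])["model_version"]


catalogue = [
    {"model_version": "small@1", "cost_per_token": 0.001, "compliance_score": 0.80},
    {"model_version": "medium@1", "cost_per_token": 0.004, "compliance_score": 0.90},
    {"model_version": "large@1", "cost_per_token": 0.020, "compliance_score": 0.97},
]
print(choose_model(catalogue, "medium"))  # prints medium@1
```

Raising an error when no model qualifies forces the orchestrator to handle that case explicitly, for example by escalating to a human, instead of silently falling back to an unvetted model.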
In short, GRAIDES transforms abstract responsible‑AI principles into actionable data that can be consumed by platforms such as UBOS, integrated with existing workflow automation tools, and used by AI marketing agents to check compliance before content goes live.
What Comes Next
While the initial study proves the concept, several open challenges remain:
- Scalability to national‑level deployments: As the number of evaluated prompts grows into the millions, indexing strategies and distributed storage become critical.
- Standardisation across jurisdictions: Different municipalities may adopt varying rating scales; a federated schema extension mechanism is needed.
- Integration with emerging LLM governance frameworks: Aligning GRAIDES with upcoming ISO standards will require collaborative extensions.
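The rating‑scale problem can be made concrete with a simple min‑max rescaling onto a common 0 to 1 range. This is only a sketch of one normalisation a schema extension might declare; the source does not propose a specific mechanism, and the function name is hypothetical.

```python
def rescale(score, scale_min, scale_max):
    """Map a rating from its native scale onto the common range [0, 1]."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} is outside [{scale_min}, {scale_max}]")
    return (score - scale_min) / (scale_max - scale_min)


# A council rating on 1-5 and another rating on 0-10 become comparable.
print(rescale(3, 1, 5))   # prints 0.5
print(rescale(5, 0, 10))  # prints 0.5
```

Linear rescaling assumes the two scales are used in the same way by annotators, which may not hold in practice; it makes values comparable in range, not necessarily in meaning.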
Future research directions include:
- Embedding automated bias detection models directly into the Evaluation Engine, turning the schema into a proactive guardrail.
- Developing a plug‑and‑play OpenAI ChatGPT integration that streams evaluation data in real time to the schema.
- Exploring federated learning scenarios where multiple councils share anonymised evaluation statistics without exposing raw citizen data.
Organisations interested in piloting GRAIDES can start by reviewing the original GRAIDES paper, cloning the open‑source reference implementation, and connecting their existing AI vendor dashboards via the ChatGPT and Telegram integration for rapid feedback collection.
Conclusion
GRAIDES reframes generative AI assurance as a data‑modelling challenge, delivering a vendor‑agnostic, reproducible, and audit‑ready framework for local governments. By centralising evaluation artifacts, it uncovers hidden misalignments, slashes reporting overhead, and equips AI agents with the evidence needed to operate responsibly in the public sector. As municipalities worldwide grapple with the ethical and regulatory implications of AI, adopting a schema‑first approach like GRAIDES could become the cornerstone of trustworthy, citizen‑centric AI services.