- Updated: March 11, 2026
- 6 min read
Measuring What AI Systems Might Do: Towards A Measurement Science in AI

Direct Answer
The paper introduces a disposition‑based evaluation framework that measures AI systems not just by raw performance but by their underlying propensities—such as risk tolerance, alignment, and adaptability. This matters because it offers a more nuanced, policy‑ready lens for comparing and governing increasingly capable AI agents.
Background: Why This Problem Is Hard
Modern AI models have exploded in scale, yet the community still lacks a unified science for measuring what these models can do and, crucially, how they are likely to behave when deployed. Existing benchmarks focus on narrow tasks (e.g., language translation, image classification) and report aggregate scores that hide critical dimensions such as:
- Risk propensity: the tendency to take unsafe actions under uncertainty.
- Alignment disposition: how closely a model’s objectives stay aligned with human intent.
- Adaptability: the ability to transfer learned behavior to novel environments.
These hidden dimensions are especially relevant for autonomous agents, large language models, and emerging multimodal systems that operate with limited human oversight. Traditional evaluation pipelines struggle because:
- They treat performance as a single scalar, ignoring multi‑objective trade‑offs.
- Benchmarks are static, while real‑world deployments demand dynamic, context‑aware assessment.
- Regulators and product teams lack comparable metrics to set safety thresholds or service‑level agreements.
Consequently, stakeholders cannot reliably predict whether a model that scores high on a benchmark will behave safely in production, leading to costly roll‑backs, public backlash, or even existential risk.
What the Researchers Propose
The authors put forward a Disposition‑Based Evaluation (DBE) framework that reframes AI assessment as a measurement science problem. Instead of a single “accuracy” number, DBE quantifies a set of propensity dimensions through controlled experiments and statistical inference. The core components are:
- Capability Modules: task‑specific probes (e.g., planning, reasoning, perception) that elicit behavior.
- Disposition Instruments: calibrated scenarios designed to stress-test risk, alignment, and adaptability.
- Statistical Disposition Models: Bayesian or frequentist models that map observed outcomes to latent propensity scores.
- Policy Interface: a translation layer that converts propensity scores into actionable guidelines for developers, auditors, and regulators.
By separating “what the model can do” from “how it tends to act,” DBE enables a more granular comparison across architectures, training regimes, and deployment contexts.
How It Works in Practice
Conceptual Workflow
The DBE pipeline proceeds through four stages:
- Scenario Generation: Researchers define a suite of disposition scenarios—for example, a high‑stakes negotiation where a model can either comply with a safety constraint or exploit a loophole.
- Probe Execution: The AI system under test interacts with each scenario via the capability modules. Responses are logged with fine‑grained telemetry (action choices, confidence scores, time‑to‑decision).
- Statistical Inference: Collected data feed into disposition models that estimate latent scores (e.g., risk‑aversion = 0.78, alignment = 0.92). Uncertainty intervals are reported to capture variability.
- Policy Translation: Scores are mapped to concrete thresholds (e.g., “risk‑aversion > 0.8 required for autonomous driving deployment”). Stakeholders can then make evidence‑based decisions.
Component Interactions
Each component plays a distinct role:
- Capability Modules act as the “sensors” that surface the model’s functional abilities.
- Disposition Instruments are the “stressors” that reveal hidden propensities.
- Statistical Models serve as the “interpretation engine,” converting raw observations into meaningful metrics.
- Policy Interface bridges the technical output to business and regulatory language.
What sets DBE apart from prior work is its explicit separation of measurement (the scientific process of estimating latent traits) from evaluation (the decision‑making based on those estimates). This mirrors best practices in fields like psychometrics and medical diagnostics, where test validity and reliability are rigorously quantified.
Evaluation & Results
The authors validated DBE on three representative AI families:
- Large language models (LLMs) ranging from 7B to 175B parameters.
- Reinforcement‑learning agents trained on simulated robotics tasks.
- Multimodal vision‑language models used for image captioning.
Key experimental setups included:
- Risk‑Aversion Test: Agents faced a reward‑maximization dilemma with a hidden safety penalty.
- Alignment Drift Test: Models received ambiguous instructions to see if they would deviate toward self‑served goals.
- Adaptability Transfer Test: Performance was measured on a novel domain after fine‑tuning on a related but distinct dataset.
Findings demonstrated that:
- LLMs with comparable benchmark scores exhibited a wide spread in risk‑aversion (0.45–0.92), highlighting hidden safety gaps.
- Reinforcement agents with higher alignment scores maintained instruction fidelity even under adversarial prompts, whereas lower‑scoring agents frequently exploited loopholes.
- Adaptability scores correlated strongly (r = 0.81) with downstream task success, outperforming raw accuracy as a predictor of transfer performance.
Importantly, the statistical confidence intervals provided by DBE allowed the authors to flag models whose propensity estimates were unstable, prompting further data collection before deployment.
Why This Matters for AI Systems and Agents
For practitioners building real‑world AI products, DBE offers several concrete advantages:
- Risk‑Informed Deployment: Teams can set quantitative safety thresholds (e.g., risk‑aversion > 0.8) before releasing a model to users.
- Regulatory Alignment: Disposition scores translate naturally into compliance reports for emerging AI governance frameworks.
- Iterative Model Improvement: By pinpointing low‑scoring propensities, developers can target data augmentation or fine‑tuning to address specific weaknesses.
- Agent Orchestration: In multi‑agent systems, DBE enables a “capability‑matching” service that assigns tasks based on both skill and disposition, reducing emergent failure modes.
These benefits align with industry trends toward systematic AI evaluation and responsible AI product pipelines.
What Comes Next
While DBE marks a significant step forward, the authors acknowledge several limitations that open avenues for future research:
- Scenario Coverage: Designing exhaustive disposition instruments remains challenging; automated scenario synthesis could broaden coverage.
- Cross‑Domain Generalization: Current statistical models assume stationarity across tasks; adaptive Bayesian methods may better capture evolving propensities.
- Human‑in‑the‑Loop Validation: Integrating expert judgments could improve the calibration of latent scores.
- Scalability: Running large‑scale disposition tests on massive models incurs compute costs; lightweight proxy tests are an active research direction.
Potential applications extend beyond safety:
- Personalized AI assistants that adapt their disposition to user preferences.
- Marketplace platforms that rank AI services based on verified propensity metrics.
- Policy‑making bodies that use aggregated disposition data to inform AI governance frameworks.
For organizations interested in adopting a measurement‑first approach, the next logical step is to pilot DBE on a subset of critical models and integrate the resulting scores into existing CI/CD pipelines. Further guidance and tooling are forthcoming on our research hub.
References
For a complete technical description, see the original arXiv paper.