- Updated: June 26, 2026
AgentCAT: Simulating Computerized Adaptive Testing via Multi-Agent Large Language Models

Direct Answer
AgentCAT introduces a multi‑agent simulation framework built on large language models (LLMs) that faithfully reproduces the dynamics of Computerized Adaptive Testing (CAT). By modeling examinees, item selectors, and a supervisory overseer as interacting agents, the system enables researchers to evaluate adaptive testing strategies without relying on static offline logs.
This matters because it removes a long‑standing bottleneck, the lack of high‑fidelity, end‑to‑end test environments, and so allows rapid prototyping of CAT algorithms that can better personalize learning experiences at scale.
Background: Why This Problem Is Hard
Computerized Adaptive Testing aims to estimate a learner’s ability by presenting items whose difficulty matches the current proficiency estimate. In theory, CAT reduces test length while improving measurement precision. In practice, however, the research community faces three intertwined challenges:
- Static data dependency: Most studies rely on pre‑collected response logs that capture only a narrow slice of the adaptive process. These logs lack the full decision loop (item selection, response generation, and ability update), so a novel selection heuristic cannot be evaluated on them: the log holds no responses for items the original test never administered.
- Partial labeling: Offline datasets often contain only the final score or a limited set of item‑response pairs, forcing researchers to treat adaptive testing as a static sequence‑prediction problem rather than a dynamic interaction.
- Component isolation: Existing work typically optimizes a single CAT component (e.g., item selection) while assuming perfect ability estimation and response modeling. This siloed approach ignores the feedback loops that drive real‑world testing.
These constraints hinder the development of truly adaptive, pedagogically sound testing systems, especially as educational platforms move toward AI‑driven personalization.
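For reference, the matching principle that opens this section can be made concrete with a standard item response model. The block below assumes the one‑parameter (Rasch) model purely for illustration; the article does not state which response model AgentCAT uses, and the symbols $\theta$ (ability), $b$ (item difficulty), and $u$ (response) are introduced here.

```latex
% Rasch (1PL) model: probability of a correct response (u = 1)
P(u = 1 \mid \theta, b) = \frac{1}{1 + e^{-(\theta - b)}}

% Fisher information that one item contributes about \theta
I(\theta; b) = P(u = 1 \mid \theta, b)\,\bigl(1 - P(u = 1 \mid \theta, b)\bigr)

% I is largest when P = 1/2, i.e. when b = \theta, where I = 1/4.
% With n administered items, the standard error of the estimate is approximately
\mathrm{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{\sum_{j=1}^{n} I(\hat{\theta}; b_j)}}
```

Under these assumptions, 16 perfectly matched items would give a standard error of about $1/\sqrt{16 \cdot 0.25} = 0.5$, which is why well‑matched items shorten a test: each poorly matched item contributes less than $0.25$ and more of them are needed for the same precision.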
What the Researchers Propose
AgentCAT addresses the above gaps by constructing a high‑fidelity, LLM‑powered simulation environment that mirrors the full CAT workflow. The framework consists of three cooperating agents:
- Examinee Agent: Encodes a synthetic learner’s knowledge profile, retrieves relevant concepts from a memory store, and generates responses using Chain‑of‑Thought (CoT) reasoning. This agent mimics human cognitive processes, including misconceptions and partial knowledge.
- Selection Agent: Implements a coarse‑to‑fine bucketing strategy combined with knowledge‑graph exploration. It balances local difficulty (matching the current ability estimate) with global coverage (ensuring the test spans the curriculum).
- Supervisor Agent: Performs dual‑auditing of both examinee and selector actions, applying robust ability‑update rules to guarantee convergence and statistical validity.
By treating each component as an autonomous LLM‑driven entity, AgentCAT enables end‑to‑end experimentation where the impact of a new selection heuristic can be observed under realistic response behavior and ability updating.
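The Selection Agent's coarse‑to‑fine idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Item` fields, the `select_item` function, and the `bucket_width` parameter are hypothetical, and knowledge‑graph exploration is reduced here to a set of already covered concept labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    difficulty: float  # assumed to be on the same scale as the ability estimate
    concept: str       # hypothetical curriculum concept label


def select_item(pool, ability, covered_concepts, bucket_width=1.0):
    """Pick one item for the current ability estimate.

    Coarse step: keep only items in the difficulty bucket that contains `ability`.
    Fine step: prefer concepts not yet covered, then the closest difficulty.
    """
    def bucket(x):
        return int(x // bucket_width)

    coarse = [it for it in pool if bucket(it.difficulty) == bucket(ability)]
    candidates = coarse or list(pool)  # fall back to the full pool if the bucket is empty
    return min(
        candidates,
        key=lambda it: (it.concept in covered_concepts, abs(it.difficulty - ability)),
    )


pool = [
    Item("a", 0.2, "fractions"),
    Item("b", 0.4, "algebra"),
    Item("c", 2.5, "geometry"),
]
chosen = select_item(pool, ability=0.25, covered_concepts={"fractions"})
print(chosen.item_id)
```

The sort key puts coverage ahead of difficulty distance, so within a bucket the sketch trades a slightly worse difficulty match for curriculum breadth. A real selector would likely weight the two rather than order them strictly.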
How It Works in Practice
Conceptual Workflow
The simulation proceeds in iterative rounds, each comprising four steps:
- Ability Estimation: The Supervisor maintains a Bayesian estimate of the examinee’s latent ability based on prior responses.
- Item Selection: The Selection Agent queries a knowledge graph, filters items into difficulty buckets, and proposes a candidate set that aligns with the current ability estimate while preserving curriculum breadth.
- Response Generation: The Examinee Agent retrieves relevant concepts, constructs a CoT chain, and produces a binary (correct/incorrect) answer along with an explanatory rationale.
- Update & Audit: The Supervisor validates the response, updates the ability posterior, and logs the interaction for downstream analysis.
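The four steps above can be sketched as a small loop. This is a sketch under stated assumptions rather than the paper's method: it assumes a Rasch response model, a standard‑normal prior on a discrete ability grid, and a random draw standing in for the LLM‑driven Examinee Agent. All function names and the `GRID` constant are hypothetical.

```python
import math
import random

# Hypothetical ability grid from -4 to 4 in steps of 0.1.
GRID = [-4.0 + 0.1 * i for i in range(81)]


def p_correct(ability, difficulty):
    """Rasch (1PL) probability of a correct answer; an assumed response model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def normal_prior():
    """Standard-normal prior over GRID, normalized to sum to 1."""
    weights = [math.exp(-0.5 * t * t) for t in GRID]
    total = sum(weights)
    return [w / total for w in weights]


def update(posterior, difficulty, correct):
    """One Bayes step: multiply by the response likelihood, then renormalize."""
    unnorm = []
    for t, p in zip(GRID, posterior):
        like = p_correct(t, difficulty)
        unnorm.append(p * (like if correct else 1.0 - like))
    total = sum(unnorm)
    return [u / total for u in unnorm]


def posterior_mean(posterior):
    """Point estimate of ability: the mean of the grid posterior."""
    return sum(t * p for t, p in zip(GRID, posterior))


def simulate(true_ability, difficulties, n_items, seed=0):
    """Run rounds of estimate -> select -> respond -> update, logging each one."""
    rng = random.Random(seed)
    posterior = normal_prior()
    remaining = list(difficulties)
    log = []
    for _ in range(min(n_items, len(remaining))):
        estimate = posterior_mean(posterior)                    # 1. ability estimation
        item = min(remaining, key=lambda d: abs(d - estimate))  # 2. item selection
        remaining.remove(item)
        correct = rng.random() < p_correct(true_ability, item)  # 3. response (random stand-in)
        posterior = update(posterior, item, correct)            # 4. update
        log.append((item, correct, posterior_mean(posterior)))  #    and log the round
    return log


for item, correct, estimate in simulate(1.0, [-2, -1, 0, 1, 2], n_items=3):
    print(item, correct, round(estimate, 2))
```

In AgentCAT the response in step 3 would come from the Examinee Agent's reasoning and the item in step 2 from the Selection Agent; the random draw and nearest‑difficulty rule only keep the sketch self‑contained.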
Interaction Dynamics
What distinguishes AgentCAT from prior simulators is the bidirectional communication between agents:
- The Examinee Agent can request clarification or additional context, prompting the Selection Agent to adjust item difficulty on the fly.
- The Supervisor’s dual‑auditing mechanism cross‑checks the Selection Agent’s difficulty estimates against the Examinee’s performance, preventing drift and ensuring statistical soundness.
This feedback loop mirrors real classroom testing, where teachers (selectors) adapt questions based on student answers, and students (examinees) demonstrate evolving mastery.
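One way to picture the cross‑check between the selector's difficulty estimates and the examinee's performance is a residual test. The sketch below is hypothetical: `audit_drift`, its `threshold`, and the Rasch‑style expectation are assumptions made for illustration, and the article does not describe the Supervisor's actual auditing rule.

```python
import math


def expected_correct(ability, difficulty):
    """Assumed Rasch-style probability that the examinee answers correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def audit_drift(records, ability, threshold=0.25):
    """Compare observed outcomes with what the claimed difficulties predict.

    records: list of (claimed_difficulty, correct) pairs.
    Returns (mean_residual, flagged). A large positive residual suggests the
    selector overstates difficulty; a large negative one suggests the opposite.
    """
    if not records:
        return 0.0, False
    residuals = [float(correct) - expected_correct(ability, d) for d, correct in records]
    mean_residual = sum(residuals) / len(residuals)
    return mean_residual, abs(mean_residual) > threshold


# Four correct answers on items claimed to match the examinee's ability exactly.
print(audit_drift([(0.0, True)] * 4, ability=0.0))
```

A flagged result would be the point at which a supervisor could ask the selector to recalibrate, or treat the ability estimate itself as suspect.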
Evaluation & Results
Test Scenarios
Researchers validated AgentCAT on two publicly available educational datasets covering mathematics and language comprehension. They examined three evaluation dimensions:
- Macro‑level ability convergence: How quickly and accurately the system’s ability estimate stabilizes to the ground‑truth proficiency.
- Micro‑level interaction logic: Whether the sequence of items respects pedagogical principles such as scaffolding and concept continuity.
- Data sparsity resilience: Performance when the underlying item pool is limited or heavily imbalanced.
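The macro‑level dimension can be made measurable with a simple convergence count. The function below is one plausible definition, assumed here for illustration: the name `items_to_converge` and the tolerance `tol` are hypothetical, and the article does not say which convergence criterion the researchers used.

```python
def items_to_converge(estimates, true_ability, tol=0.3):
    """Return how many items were needed before the estimate settled.

    estimates[i] is the ability estimate after i + 1 items. The result is the
    1-based item count from which every later estimate stays within `tol` of
    `true_ability`, or None if the trajectory is empty or never settles.
    """
    last_bad = 0
    for count, estimate in enumerate(estimates, start=1):
        if abs(estimate - true_ability) > tol:
            last_bad = count
    return last_bad + 1 if last_bad < len(estimates) else None


trajectory = [0.0, 0.5, 0.9, 1.1, 1.0]
print(items_to_converge(trajectory, true_ability=1.0))  # prints 3
```

Requiring the estimate to stay within tolerance, rather than merely touch it once, avoids counting a lucky early estimate as convergence.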
Key Findings
Across both datasets, AgentCAT demonstrated:
- Ability estimates that converged within 10–12 items, matching or surpassing traditional IRT‑based CAT baselines.
- Selection patterns that naturally progressed from foundational to advanced concepts, aligning with expert instructional design.
- Robustness to sparse item pools, maintaining estimation accuracy even when only 30% of the curriculum was available for selection.
These results indicate that the multi‑agent simulation not only replicates human‑like testing dynamics but also provides a reliable sandbox for testing new adaptive algorithms.
For a deeper dive into the methodology and quantitative metrics, see the AgentCAT paper on arXiv.
Why This Matters for AI Systems and Agents
AgentCAT’s contribution extends beyond academic curiosity; it offers concrete value to practitioners building AI‑driven education platforms:
- Rapid prototyping: Developers can iterate on selection heuristics, ability‑update formulas, or even new item types without waiting for live user data.
- Safety testing: Simulated examinees expose edge cases, such as persistent misconceptions, that might cause real learners to receive inappropriate items.
- Orchestration insights: The multi‑agent architecture showcases how LLMs can be coordinated via supervisory oversight, a pattern applicable to broader AI workflow automation.
- Integration pathways: AgentCAT can be embedded into existing AI platforms, for example by using the UBOS platform (see the UBOS platform overview) to manage agent lifecycles and data pipelines.
In practice, an edtech startup could use AgentCAT to benchmark a new adaptive quiz engine before a public beta, reducing risk and accelerating time‑to‑market.
What Comes Next
While AgentCAT marks a significant step forward, several avenues remain open for exploration:
- Human‑in‑the‑loop validation: Aligning simulated examinee behavior with real student data to fine‑tune the CoT reasoning module.
- Cross‑domain generalization: Extending the framework to non‑academic assessments such as professional certification or health literacy.
- Scalable deployment: Integrating with cloud‑native orchestration tools, potentially via the Workflow automation studio, to run large‑scale simulations with millions of virtual learners.
- Personalized feedback generation: Enriching the Examinee Agent’s explanations to serve as immediate tutoring hints, a feature that could be paired with the ElevenLabs AI voice integration for spoken feedback.
Pursuing these directions will bring us closer to truly autonomous, AI‑powered assessment ecosystems that adapt in real time to each learner’s needs.
Conclusion
AgentCAT redefines how researchers and developers can experiment with Computerized Adaptive Testing by turning the entire testing loop into a controllable, LLM‑driven multi‑agent simulation. Its ability to generate realistic examinee responses, execute sophisticated item‑selection strategies, and maintain statistical rigor opens new pathways for personalized education at scale. As AI continues to permeate learning environments, frameworks like AgentCAT will be essential for building, validating, and deploying the next generation of adaptive assessment tools.