Carlos
  • Updated: March 11, 2026
  • 6 min read

MOSAIC: Unveiling the Moral, Social and Individual Dimensions of Large Language Models

MOSAIC Benchmark Overview

The paper introduces MOSAIC, the first large‑scale benchmark that simultaneously evaluates the moral, social, and individual dimensions of large language models (LLMs). By moving beyond the narrow focus on Moral Foundations Theory, MOSAIC provides a richer, multi‑faceted picture of how LLMs reason about ethics, making it a critical tool for developers who need trustworthy AI in high‑stakes contexts.

Background: Why This Problem Is Hard

LLMs are increasingly embedded in applications that touch on mental health counseling, medical triage, and policy advice. In these settings, a model’s ethical judgment can have real‑world consequences. Historically, researchers evaluating LLM ethics have relied on Moral Foundations Theory (MFT). MFT captures five innate moral axes—care, fairness, loyalty, authority, and purity—but it omits several layers that shape human ethical reasoning:

  • Social values: cultural norms, group identity, and collective welfare.
  • Personality traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism.
  • Individual characteristics: age, education, political ideology, and personal experiences.

Existing benchmarks that focus solely on MFT therefore provide an incomplete risk profile. They cannot surface biases that arise from, for example, a model’s implicit alignment with a particular political spectrum or its tendency to over‑generalize cultural stereotypes. Moreover, the lack of a unified evaluation suite forces researchers to stitch together disparate questionnaires, leading to inconsistent scoring, duplicated effort, and limited comparability across model families.

What the Researchers Propose

MOSAIC (Moral, Social, and Individual Assessment of Cognition) is a modular benchmark that aggregates nine validated questionnaires from moral philosophy, psychology, and social theory, plus four interactive platform games that place models in morally ambiguous scenarios. The core idea is to treat each questionnaire or game as a dimension probe that extracts a specific ethical signal from an LLM’s responses.

Key components include:

  • Questionnaire Suite: A curated set of 600+ items covering moral foundations, Schwartz’s value theory, the Big Five personality inventory, and demographic preference scales.
  • Scenario Games: Turn‑based simulations where the model must choose actions for a virtual agent facing dilemmas (e.g., resource allocation under uncertainty, privacy vs. safety trade‑offs).
  • Scoring Engine: A Python library that normalizes raw outputs, maps them onto standardized scales, and produces a composite MOSAIC profile.
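As a concrete illustration of the scoring step, normalization and profile synthesis could look like the sketch below. The names (`ProbeResult`, `normalize_likert`, `composite_profile`) are hypothetical stand-ins, not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    dimension: str   # e.g. "moral", "social", "individual"
    scale: str       # e.g. "care", "openness"
    score: float     # normalized to [0, 1]

def normalize_likert(answer: int, points: int = 5) -> float:
    """Map a 1..N Likert answer onto the [0, 1] interval."""
    return (answer - 1) / (points - 1)

def composite_profile(results):
    """Average normalized scores per (dimension, scale) pair into a profile vector."""
    buckets = {}
    for r in results:
        buckets.setdefault((r.dimension, r.scale), []).append(r.score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```

The composite profile is simply a dictionary keyed by dimension and scale, which matches the paper's framing of an ethical fingerprint as a vector rather than a single score.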

By exposing LLMs to this breadth of probes, MOSAIC captures a multi‑dimensional ethical fingerprint rather than a single moral score.

How It Works in Practice

The MOSAIC workflow can be broken down into four stages:

  1. Prompt Generation: For each questionnaire item, the benchmark constructs a zero‑shot or few‑shot prompt that asks the model to explain its reasoning or select an option.
  2. Model Invocation: The LLM is queried via its API (e.g., OpenAI, Anthropic, or a self‑hosted inference server). Responses are captured verbatim.
  3. Response Normalization: The scoring engine parses free‑form text, extracts categorical choices, and maps Likert‑scale answers to numeric values.
  4. Profile Synthesis: Normalized scores are aggregated across the nine questionnaires and four games, yielding a vector that represents moral, social, and individual dimensions.
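The four stages above can be sketched in a few lines of Python. Here `build_prompt`, `parse_likert`, and the injected `query_model` callable are illustrative assumptions, not the benchmark's real interface:

```python
def build_prompt(item: str) -> str:
    # Stage 1: wrap a questionnaire item in a zero-shot instruction.
    return ("Rate your agreement with the statement below on a 1-5 scale "
            f"and briefly explain your reasoning.\nStatement: {item}")

def parse_likert(response: str) -> int:
    # Stage 3: pull the first 1-5 choice out of free-form model text.
    for ch in response:
        if ch in "12345":
            return int(ch)
    raise ValueError("no Likert choice found in response")

def evaluate(items, query_model):
    # Stages 2 and 4: query the model per item, then aggregate into a mean score.
    raw = [parse_likert(query_model(build_prompt(it))) for it in items]
    return sum(raw) / len(raw)
```

Passing the model client in as a callable keeps the pipeline agnostic to whether the LLM sits behind the OpenAI API, Anthropic's API, or a self-hosted inference server.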

What sets MOSAIC apart is its extensibility. Researchers can plug in additional questionnaires or custom games without rewriting the core library, enabling continuous evolution as new ethical theories emerge. The benchmark also provides a Dockerized environment to guarantee reproducibility across hardware and software stacks.
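A common way to implement this kind of extensibility is a registry of probe functions that new questionnaires or games attach to via a decorator. The sketch below is an assumption about the design, not MOSAIC's actual plugin mechanism:

```python
PROBE_REGISTRY = {}

def register_probe(name):
    """Decorator that adds a custom questionnaire or game to the suite."""
    def wrap(fn):
        PROBE_REGISTRY[name] = fn
        return fn
    return wrap

@register_probe("privacy_vs_safety_game")
def privacy_game(query_model):
    # A hypothetical scenario probe: the model picks between two dilemma options.
    prompt = ("An agent must choose: share location data to speed a rescue (A) "
              "or protect the user's privacy (B)? Answer A or B.")
    return query_model(prompt)

def run_all(query_model):
    """Execute every registered probe against the given model client."""
    return {name: probe(query_model) for name, probe in PROBE_REGISTRY.items()}
```

Because registration happens at import time, dropping a new probe module into the project is enough to include it in the next evaluation run, without touching the core library.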

Evaluation & Results

The authors validated MOSAIC on three representative LLM families: a GPT‑style autoregressive model, a decoder‑only instruction‑tuned model, and a retrieval‑augmented conversational model. Each model was evaluated on the full suite of 600+ items, and the results were analyzed along three axes:

  • Moral Consistency: Alignment with established MFT scores. All models showed moderate agreement, confirming that MOSAIC does not discard the insights of prior benchmarks.
  • Social Divergence: Variation in responses to Schwartz’s value items. The retrieval‑augmented model displayed a pronounced bias toward collectivist values, likely inherited from its training corpus.
  • Individual Personality Signals: Correlations with Big Five traits. The instruction‑tuned model exhibited higher openness and lower conscientiousness, mirroring the flexibility of its fine‑tuning data.
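Correlation analyses like the one described for the Big Five signals are typically computed with a Pearson coefficient between a model's item scores and a trait scale. A self-contained version for reference (the paper does not specify its exact statistical tooling):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```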

Crucially, the four scenario games revealed failure modes that were invisible to questionnaire‑only assessments. For example, in a privacy‑vs‑security game, the GPT‑style model consistently prioritized security, even when the narrative context emphasized personal autonomy, a pattern suggestive of a latent authority bias.

These findings demonstrate that MOSAIC can surface nuanced ethical profiles, offering a more comprehensive risk assessment than any single‑dimension benchmark.

Why This Matters for AI Systems and Agents

For practitioners building AI‑driven agents, MOSAIC provides a diagnostic lens that directly informs design decisions:

  • Safety‑by‑Design: By quantifying a model’s propensity toward specific moral or social biases, engineers can select or fine‑tune models that align with product values before deployment.
  • Regulatory Compliance: Emerging AI governance frameworks (e.g., EU AI Act) require demonstrable ethical testing. MOSAIC’s standardized scores can serve as audit artifacts.
  • Orchestration Strategies: In multi‑model pipelines, MOSAIC profiles enable dynamic routing—sending privacy‑sensitive queries to models with higher autonomy scores, while delegating safety‑critical decisions to models with stronger authority alignment.
  • Continuous Monitoring: The extensible library can be integrated into CI/CD pipelines, automatically re‑evaluating models after each training iteration.
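A dynamic-routing policy of the kind described could be as simple as selecting the model whose MOSAIC profile maximizes the dimension relevant to the query. The tag names and profile keys below are hypothetical:

```python
def route(query_tag: str, profiles: dict) -> str:
    """Pick the model whose profile best fits the query's sensitivity tag.

    `profiles` maps model names to dicts of MOSAIC-style dimension scores,
    e.g. {"model-a": {"autonomy": 0.8, "authority": 0.2}}.
    """
    key = {"privacy": "autonomy", "safety": "authority"}.get(query_tag, "care")
    return max(profiles, key=lambda m: profiles[m].get(key, 0.0))
```

In a production pipeline this lookup would sit in front of the model clients, so each incoming request is dispatched based on pre-computed benchmark profiles rather than per-request evaluation.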

Developers looking for concrete tooling can explore the UBOS Agents platform for orchestrating ethically‑aware LLM workflows, or the UBOS Benchmarks repository for community‑maintained extensions to MOSAIC.

What Comes Next

While MOSAIC marks a significant step forward, several open challenges remain:

  • Cross‑Cultural Validation: The current questionnaires are primarily validated in Western contexts. Expanding to non‑Western value systems will improve global applicability.
  • Dynamic Ethical Reasoning: Future work should incorporate real‑time feedback loops where agents adapt their moral stance based on user preferences or situational cues.
  • Integration with RLHF: Aligning reinforcement‑learning‑from‑human‑feedback pipelines with MOSAIC scores could produce models that are both performant and ethically calibrated.
  • Scalable Human‑In‑The‑Loop Evaluation: Crowdsourcing the interpretation of open‑ended responses could reduce reliance on automated parsing, increasing reliability for nuanced scenarios.

Potential applications span from mental‑health chatbots that respect patient autonomy to autonomous vehicles that balance safety with passenger privacy. By providing a multi‑dimensional ethical baseline, MOSAIC equips researchers and product teams to iterate responsibly as LLM capabilities continue to expand.
