- Updated: June 30, 2026
Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures
Direct Answer
The paper introduces a systematic, Shapley‑value‑based framework for measuring how individual adjectives steer the behavior of large language models (LLMs) on the MMLU benchmark. By quantifying each adjective’s marginal contribution, the authors reveal a small set of “lever” adjectives, model‑family sensitivity patterns, and strong non‑additive interactions that grow with model size.

Background: Why This Problem Is Hard
Prompt engineering has become the de facto interface for extracting reliable performance from LLMs, yet the community still relies heavily on anecdotal heuristics such as “add a ‘careful’ tone” or “use ‘expert’ to boost accuracy.” These rules are fragile because:
- Lexical ambiguity: The same adjective can convey different intents depending on syntax, position, or surrounding context.
- Model heterogeneity: Architectures trained on divergent data pipelines develop idiosyncratic sensitivities, making a one‑size‑fits‑all prompt recipe unrealistic.
- Scale‑induced compositionality: Larger models exhibit emergent abilities to combine linguistic cues, but this also introduces unpredictable interaction effects.
Existing evaluation methods—manual ablations, prompt‑template benchmarks, or simple correlation analyses—cannot isolate the causal impact of a single adjective while accounting for interactions with other words. Without a principled attribution tool, AI alignment teams lack the granularity needed to design robust, model‑specific steering strategies.
What the Researchers Propose
The authors present a three‑component framework that treats each adjective as a “player” in a cooperative game whose payoff is the model’s performance on a downstream task (here, the MMLU benchmark). The key ideas are:
- Shapley value attribution: Each adjective’s score is the marginal gain in performance when it is added to a subset of the other adjectives, averaged over those subsets. This yields a fair, mathematically grounded contribution score for each word.
- Cross‑model profiling: The same adjective set is evaluated on a diverse portfolio of models—including o3, gpt‑4o‑mini, phi‑3, llama‑3‑70b, and deepseek‑r1—to expose lineage‑level sensitivity patterns.
- Interaction analysis: Beyond individual contributions, the framework quantifies pairwise and higher‑order synergy or antagonism, revealing whether adjectives amplify, dampen, or even reverse each other’s effects.
This approach moves away from “black‑box heuristics” toward a transparent, data‑driven map of linguistic steering levers.
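To make the cooperative-game framing concrete, here is a minimal sketch of exact Shapley attribution over a handful of adjectives. The `toy_payoff` function and its numbers are hypothetical stand-ins for a measured MMLU accuracy, and the function names are illustrative rather than taken from the paper. Exact enumeration is only practical for very small adjective sets.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, payoff):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Probability weight of a coalition of size k that excludes p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (payoff(s | {p}) - payoff(s))
        values[p] = total
    return values


def toy_payoff(coalition):
    """Hypothetical accuracy for a set of adjectives (illustrative numbers only)."""
    score = 0.60
    if "careful" in coalition:
        score += 0.04
    if "precise" in coalition:
        score += 0.02
    if "creative" in coalition:
        score -= 0.03
    if {"careful", "precise"} <= coalition:
        score += 0.02  # assumed synergy between the two adjectives
    return score


adjectives = ["careful", "precise", "creative"]
phi = exact_shapley(adjectives, toy_payoff)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

In this toy game the synergy bonus is split evenly between “careful” and “precise,” and the scores sum to the gap between the full prompt and the bare prompt, which is the efficiency property that makes Shapley values attractive for attribution.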
How It Works in Practice
The workflow can be broken down into four logical stages:
1. Prompt Construction
Researchers start with a base MMLU question prompt and a pool of 100 adjectives (e.g., “careful,” “creative,” “concise”). Each adjective is inserted in three syntactic roles (pre‑modifier, post‑modifier, and clause‑level) to capture positional effects.
2. Subset Sampling
Because enumerating all 2^100 subsets is infeasible, the authors employ a Monte‑Carlo sampling strategy that approximates Shapley values with provable error bounds. Each sampled subset is rendered into a concrete prompt and fed to the target LLM.
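The article does not specify which sampler the authors use, so the following is only a sketch of one standard option: permutation sampling, where each random ordering of the adjectives contributes one marginal gain per adjective. The `additive_payoff` function and its weights are hypothetical placeholders for a real model evaluation, and the sketch does not reproduce the paper's error bounds.

```python
import random


def monte_carlo_shapley(players, payoff, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        previous = payoff(coalition)
        for p in order:
            # Add the adjective and record how much the payoff changed.
            coalition = coalition | {p}
            current = payoff(coalition)
            totals[p] += current - previous
            previous = current
    return {p: t / num_permutations for p, t in totals.items()}


# Hypothetical per-adjective effects; a real payoff would query an LLM.
WEIGHTS = {"careful": 0.04, "precise": 0.02, "creative": -0.03}


def additive_payoff(coalition):
    """Toy accuracy with no interactions between adjectives."""
    return 0.60 + sum(WEIGHTS[a] for a in coalition)


estimates = monte_carlo_shapley(list(WEIGHTS), additive_payoff)
for name, value in estimates.items():
    print(f"{name}: {value:+.3f}")
```

Each permutation costs one payoff evaluation per adjective, so the budget scales with the number of permutations times the number of adjectives rather than with 2^100. In a purely additive game like this one every ordering gives the same marginal gains, so the estimate matches the true weights; with interactions, more permutations are needed to reduce variance.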
3. Performance Scoring
The model’s answer is automatically graded against the MMLU ground truth. The resulting accuracy serves as the coalition’s payoff, which is then used to compute marginal contributions for each adjective.
4. Cross‑Model Aggregation & Interaction Mapping
After obtaining per‑model Shapley scores, the researchers cluster models by architectural lineage (e.g., dense vs. mixture‑of‑experts transformers) and compare sensitivity profiles. Pairwise interaction terms are extracted by measuring the difference between the combined Shapley value of two adjectives and the sum of their individual values.
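The one-sentence description of the interaction term leaves the exact formula open. As a stand-in, the sketch below uses the Shapley interaction index, a standard way to measure pairwise synergy in cooperative games; it is not confirmed to be the paper's definition. The `toy_payoff` function and its numbers are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(players, payoff, a, b):
    """Shapley interaction index for the pair (a, b).

    Averages the second-order difference
    v(S + a + b) - v(S + a) - v(S + b) + v(S)
    over coalitions S of the remaining players.
    """
    others = [p for p in players if p not in (a, b)]
    n = len(players)
    total = 0.0
    for k in range(len(others) + 1):
        weight = factorial(k) * factorial(n - k - 2) / factorial(n - 1)
        for subset in combinations(others, k):
            s = frozenset(subset)
            delta = (
                payoff(s | {a, b})
                - payoff(s | {a})
                - payoff(s | {b})
                + payoff(s)
            )
            total += weight * delta
    return total


def toy_payoff(coalition):
    """Hypothetical accuracy with one built-in synergy (illustrative numbers only)."""
    score = 0.60
    if "careful" in coalition:
        score += 0.04
    if "detailed" in coalition:
        score += 0.02
    if "creative" in coalition:
        score -= 0.03
    if {"careful", "detailed"} <= coalition:
        score += 0.02  # assumed synergy
    return score


adjectives = ["careful", "detailed", "creative"]
synergy = shapley_interaction(adjectives, toy_payoff, "careful", "detailed")
neutral = shapley_interaction(adjectives, toy_payoff, "careful", "creative")
print(f"careful+detailed: {synergy:+.3f}")
print(f"careful+creative: {neutral:+.3f}")
```

A positive index means the pair reinforces each other, a negative one means they dampen each other, and a value near zero means their effects are roughly additive.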
What sets this pipeline apart is its explicit treatment of adjectives as players in a cooperative game, allowing practitioners to answer questions like “Which single word will most improve accuracy for gpt‑4o‑mini?” or “Do ‘careful’ and ‘detailed’ reinforce each other on llama‑3‑70b?”
Evaluation & Results
The authors evaluated the framework on five prominent LLM families, each tested on the 57‑category MMLU benchmark. The key observations are:
5.1 Powerful Adjective Levers
- Across all models, fewer than 10 % of the adjectives accounted for more than 60 % of the total performance variance.
- Adjectives such as “careful,” “precise,” and “methodical” consistently boosted scores, while “creative” and “informal” often caused drops, especially on factual categories.
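One way to check how concentrated the steering power is in a given set of scores is to measure the share held by the top few adjectives. The sketch below uses absolute Shapley mass as a simple proxy for the “performance variance” mentioned above, which is an assumption; the scores and adjective names are placeholders, not results from the paper.

```python
def top_share(scores, fraction=0.10):
    """Share of total absolute attribution held by the top `fraction` of adjectives."""
    magnitudes = sorted((abs(v) for v in scores.values()), reverse=True)
    k = max(1, int(len(magnitudes) * fraction))  # always keep at least one adjective
    total = sum(magnitudes)
    return sum(magnitudes[:k]) / total if total else 0.0


# Placeholder scores: one strong lever and nine weak adjectives.
scores = {"adj_0": 0.06}
scores.update({f"adj_{i}": 0.005 for i in range(1, 10)})

share = top_share(scores, fraction=0.10)
print(f"Top 10% of adjectives hold {share:.0%} of the attribution mass")
```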
5.2 Family Effect Across Model Lineages
Models sharing a training corpus or architectural backbone displayed highly correlated Shapley profiles. For example, o3 and llama‑3‑70b (both trained on extensive web data) reacted similarly to “concise,” whereas phi‑3 (a lightweight decoder‑only model) showed a distinct pattern, treating most adjectives almost literally.
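A simple way to compare two models' sensitivity profiles is to correlate their per-adjective scores. The sketch below uses a plain Pearson correlation; the article does not say which similarity measure the authors use, and the profiles for `model_x`, `model_y`, and `model_z` are hypothetical placeholders.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (undefined for constant input)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def profile_similarity(profile_a, profile_b):
    """Correlate two Shapley profiles over the adjectives they share."""
    shared = sorted(set(profile_a) & set(profile_b))
    return pearson([profile_a[w] for w in shared], [profile_b[w] for w in shared])


# Placeholder profiles (adjective -> Shapley score); not data from the paper.
model_x = {"careful": 0.05, "precise": 0.03, "concise": 0.01, "creative": -0.03}
model_y = {"careful": 0.04, "precise": 0.03, "concise": 0.02, "creative": -0.02}
model_z = {"careful": 0.00, "precise": 0.01, "concise": -0.01, "creative": 0.01}

sim_xy = profile_similarity(model_x, model_y)
sim_xz = profile_similarity(model_x, model_z)
print(f"x vs y: {sim_xy:.2f}")
print(f"x vs z: {sim_xz:.2f}")
```

Pairwise similarities like these could feed a standard clustering step to group models into families.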
5.3 Interaction Effects in Large Models
In gpt‑4o‑mini, certain adjective pairs produced non‑linear amplification: “careful” + “detailed” raised accuracy by up to 12 % beyond the sum of their individual effects. Conversely, “creative” + “formal” tended to cancel each other out, and in some cases the combined effect flipped direction. Smaller models like phi‑3 exhibited near‑additive behavior, suggesting limited compositional reasoning.
Overall, the experiments demonstrate that:
- Steering power is highly concentrated.
- Sensitivity is not universal; it clusters by model family.
- Scale introduces both richer compositionality and greater unpredictability.
Why This Matters for AI Systems and Agents
For practitioners building AI agents, orchestration pipelines, or enterprise‑grade assistants, the findings translate into concrete design guidelines:
- Targeted prompt tuning: Instead of generic “tone‑adjustment” heuristics, developers can select the most effective adjective levers for their specific model, reducing trial‑and‑error cycles.
- Model‑specific orchestration: When a system dynamically switches between LLM providers (e.g., using the UBOS platform to route workloads), the steering map informs which adjectives to preserve or replace for each provider.
- Safety and alignment: Knowing that certain adjectives can dramatically shift factual accuracy helps alignment teams enforce guardrails—e.g., suppressing “creative” in high‑stakes domains.
- Composable agent design: Interaction analysis enables the construction of prompt “recipes” where synergistic adjectives are combined deliberately, improving downstream task performance without increasing model size.
In short, the Shapley‑value framework equips AI engineers with a quantitative steering toolbox, turning vague linguistic intuition into reproducible, model‑aware prompt engineering.
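As a rough illustration of targeted, model-specific prompt tuning, the sketch below picks the highest-scoring adjectives for a given model from a precomputed steering map and folds them into a prompt. The map contents, the model names, the `avoid` guardrail, and the prompt wording are all hypothetical; a real system would load measured Shapley scores.

```python
def select_levers(steering_map, model, k=2, avoid=()):
    """Return up to k adjectives with the highest positive scores for `model`."""
    ranked = sorted(steering_map[model].items(), key=lambda kv: kv[1], reverse=True)
    return [adj for adj, score in ranked if score > 0 and adj not in avoid][:k]


def build_prompt(question, adjectives):
    """Prepend a style instruction built from the chosen adjectives."""
    if not adjectives:
        return question
    style = " and ".join(adjectives)
    return f"Give a {style} answer.\n{question}"


# Hypothetical steering map: model -> adjective -> Shapley score.
STEERING_MAP = {
    "model_a": {"careful": 0.05, "precise": 0.03, "creative": -0.03},
    "model_b": {"careful": 0.01, "concise": 0.04, "creative": 0.02},
}

question = "Which planet has the shortest orbital period?"
levers_a = select_levers(STEERING_MAP, "model_a")
levers_b = select_levers(STEERING_MAP, "model_b", avoid=("creative",))
print(build_prompt(question, levers_a))
print(build_prompt(question, levers_b))
```

The `avoid` parameter is one way to express a guardrail such as suppressing “creative” in high-stakes domains, regardless of its score for a particular model.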
What Comes Next
While the study marks a significant step toward interpretable prompt control, several open challenges remain:
- Scalability of attribution: Extending the method to thousands of lexical cues (nouns, verbs, idioms) will require more efficient sampling or approximation techniques.
- Dynamic contexts: Real‑world agents often operate in multi‑turn dialogues; future work should explore how adjective effects evolve across conversational turns.
- Cross‑modal steering: Integrating voice (e.g., ElevenLabs AI voice integration) or multimodal prompts may introduce new steering dimensions that interact with textual adjectives.
- Automated lever selection: Embedding the Shapley attribution engine into a prompt‑generation service could automatically recommend the optimal adjective set for a given task and model.
Addressing these gaps will pave the way for robust, compositional alignment techniques that scale with model size. Companies looking to embed such capabilities into their products can start by experimenting with the Workflow automation studio to prototype adjective‑steering pipelines, or explore the UBOS templates for quick start that already incorporate best‑practice prompt patterns.
References
L. Malmqvist, “Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures,” arXiv preprint arXiv:2606.20572, 2026.