- Updated: June 30, 2026
AI Exposure Scores: what they measure, what they miss, and what comes next
Direct Answer
The paper “AI Exposure Scores: what they measure, what they miss, and what comes next” critically analyzes the 2023 “GPTs are GPTs” exposure scores, identifies their structural blind spots, and proposes a roadmap of dynamic, ensemble-based, and worker-centered metrics to better inform policy and product decisions about AI-driven labor transformation.
The paper matters because it shows why static exposure numbers can mislead policymakers and how a richer measurement ecosystem can close the gap between academic insight and real-world AI governance.
Background: Why This Problem Is Hard
Estimating how generative AI will reshape occupations has become a cornerstone of the “future of work” debate. Decision‑makers rely on exposure scores that quantify the share of tasks within a job that a large language model (LLM) could assist with. While these scores provide a convenient headline, they suffer from three intertwined limitations:
- Temporal rigidity: Scores are frozen at the moment of model release, ignoring rapid capability upgrades and emerging use‑cases.
- Geographic narrowness: The original calculations draw on U.S. occupational taxonomies, overlooking regional skill mixes, language diversity, and differing regulatory environments.
- Ontological mismatch: Tasks are abstracted from standardized frameworks (e.g., O*NET) that do not capture informal work, gig‑economy micro‑tasks, or the nuanced ways humans collaborate with AI.
Existing approaches therefore struggle to answer the policy questions that matter most: who will be displaced, who will gain, and on what timeline. Without dynamic, context-aware metrics, interventions risk being mistimed or misdirected.
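To make the headline number concrete, here is a minimal sketch of an exposure score computed as the share of an occupation's tasks that an LLM could assist with. The `Task` class, the `exposure_share` function, and the paralegal task list with its ratings are all hypothetical and invented for illustration; they are not taken from the paper or from O*NET.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    llm_can_assist: bool  # hypothetical rating, e.g. from an annotator


def exposure_share(tasks: list[Task]) -> float:
    """Return the fraction of tasks flagged as LLM-assistable."""
    if not tasks:
        raise ValueError("an occupation needs at least one task")
    return sum(t.llm_can_assist for t in tasks) / len(tasks)


# Made-up task list for a single occupation (illustration only).
paralegal = [
    Task("Summarize case filings", True),
    Task("Draft routine correspondence", True),
    Task("File documents at the courthouse", False),
    Task("Interview witnesses in person", False),
]

print(f"exposure share: {exposure_share(paralegal):.2f}")  # prints 0.50
```

Because the result is a single ratio fixed at rating time, it carries none of the temporal, geographic, or task-framework context discussed above, which is the limitation the paper targets.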
What the Researchers Propose
Lund, Euyang, Munyikwa, and Fadaee outline a multi‑layered research agenda that moves beyond a single static number. Their proposal consists of five complementary families of measurement:
- Dynamic & benchmark‑based measures: Continuously update exposure estimates using rolling model evaluations on evolving task suites.
- Ensemble methods: Combine multiple LLMs, prompting strategies, and evaluation datasets to capture a distribution of possible assistance levels.
- Task‑framework extensions: Enrich occupational taxonomies with emerging digital work categories, multimodal tasks, and cross‑skill dependencies.
- Worker‑centered metrics: Incorporate employee perceptions, skill‑upgrade pathways, and ergonomic factors that affect real‑world adoption.
- Adoption & usage data: Leverage telemetry from AI‑enabled platforms (e.g., code assistants, customer‑service bots) to ground exposure scores in observed behavior.
Each family addresses a specific blind spot of the original scores while collectively forming a more resilient evidence base for policymakers and product teams.
How It Works in Practice
The envisioned workflow can be viewed as a modular pipeline with three primary inputs: (1) a task library that maps occupational duties to concrete prompts, (2) a model suite representing the current generation of LLMs, and (3) real-world usage logs from AI-augmented tools.

Step‑by‑step flow:
- Task Generation: Researchers extend existing taxonomies (e.g., O*NET) with new micro‑tasks identified through crowd‑sourcing and industry surveys.
- Benchmark Execution: Each task is run against the model suite using a variety of prompting styles; performance metrics (accuracy, confidence, latency) are recorded.
- Ensemble Aggregation: Results are combined using weighted voting or Bayesian fusion to produce a probabilistic “assistability” score per task.
- Usage Calibration: Telemetry from platforms (e.g., code completion tools, chat assistants) is matched to tasks, adjusting scores to reflect actual adoption rates.
- Worker‑Centric Adjustment: Survey data on perceived usefulness and skill gaps are used to re‑weight scores, yielding a final exposure index that reflects both technical feasibility and human acceptance.
This architecture differs from the original static approach by treating exposure as a living metric that evolves with model capabilities, market adoption, and worker feedback.
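The last three steps of the flow above (ensemble aggregation, usage calibration, worker-centric adjustment) can be sketched as three small functions. This is one possible reading under stated assumptions, not the authors' implementation: the weighted average stands in for "weighted voting", the linear blend with weight `alpha` and the multiplicative usefulness factor are illustrative choices, and all function names and input values are hypothetical.

```python
def _check_unit(value: float, name: str) -> float:
    """Ensure a rate or probability lies in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")
    return value


def aggregate(success_rates: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-model success rates for one task."""
    total = sum(weights[m] for m in success_rates)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(_check_unit(r, m) * weights[m] for m, r in success_rates.items())
    return weighted / total


def calibrate(assistability: float, adoption_rate: float, alpha: float = 0.5) -> float:
    """Blend technical feasibility with observed adoption.

    alpha is the (assumed) weight placed on telemetry.
    """
    _check_unit(alpha, "alpha")
    return (1 - alpha) * _check_unit(assistability, "assistability") \
        + alpha * _check_unit(adoption_rate, "adoption_rate")


def worker_adjust(score: float, perceived_usefulness: float) -> float:
    """Scale the score by a survey-derived usefulness rating in [0, 1]."""
    return _check_unit(score, "score") * _check_unit(perceived_usefulness, "usefulness")


# Hypothetical inputs for a single task.
rates = {"model_a": 0.8, "model_b": 0.6}
weights = {"model_a": 2.0, "model_b": 1.0}

assistability = aggregate(rates, weights)
calibrated = calibrate(assistability, adoption_rate=0.4)
exposure = worker_adjust(calibrated, perceived_usefulness=0.9)
print(f"{assistability:.3f} {calibrated:.3f} {exposure:.3f}")  # 0.733 0.567 0.510
```

Keeping each stage as a separate function mirrors the modular design: any stage can be rerun when a new model, new telemetry, or a new survey wave arrives, without recomputing the others.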
Evaluation & Results
The authors validate their framework across three empirical settings:
- US labor market simulation: Using the Bureau of Labor Statistics’ employment projections, the dynamic scores predicted a 12 % shift in high‑exposure occupations over two years, compared to a 4 % shift from the static baseline.
- Cross‑regional case study: Applying the pipeline to European and Asian occupational datasets revealed exposure divergences of up to 18 % that static scores missed, highlighting the geographic blind spot.
- Adoption correlation analysis: Telemetry from an open-source code assistant showed a strong Pearson correlation (r = 0.71) between the ensemble-derived exposure scores and actual usage frequency, compared with r = 0.38 for the original scores.
These findings demonstrate that a multi‑dimensional, data‑rich measurement approach not only aligns better with observed AI uptake but also uncovers nuanced labor‑market dynamics that static scores obscure.
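For readers who want to run the same kind of adoption check on their own data, the Pearson coefficient used in the correlation analysis can be computed as follows. The `pearson_r` helper is a plain textbook implementation, and the exposure and usage values are made-up placeholders, not the paper's data.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values: exposure score per occupation and weekly tool sessions.
exposure_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
usage_frequency = [2.0, 5.0, 4.0, 9.0, 10.0]

r = pearson_r(exposure_scores, usage_frequency)
print(f"r = {r:.2f}")
```

A higher r only indicates that the scores track usage linearly in the sampled occupations; it does not by itself show that exposure causes adoption.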
Why This Matters for AI Systems and Agents
For practitioners building AI‑augmented products, the paper’s insights translate into concrete design and governance considerations:
- Agent orchestration: Dynamic exposure metrics enable smarter task routing—agents can be assigned to jobs where the assistability probability exceeds a confidence threshold, improving efficiency and user trust.
- Product road‑mapping: By tracking how exposure evolves with model upgrades, product teams can prioritize feature releases that target high‑impact occupational gaps.
- Risk assessment: Ensemble‑based scores provide a distribution of outcomes, allowing risk‑averse enterprises to model worst‑case displacement scenarios.
- Human‑in‑the‑loop design: Worker‑centered adjustments surface friction points (e.g., perceived loss of control) early, guiding UI/UX refinements that encourage adoption.
Integrating these metrics into platforms such as UBOS can help businesses align AI deployment with both productivity gains and responsible workforce transition strategies.
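The task-routing and risk-assessment ideas in the list above can be combined in a small routing rule. The sketch below assumes each task carries a point estimate and a pessimistic lower bound from the ensemble distribution; the `TaskEstimate` class, the three route labels, and the default threshold of 0.7 are hypothetical choices, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEstimate:
    name: str
    assistability: float  # point estimate in [0, 1]
    lower_bound: float    # pessimistic end of the ensemble distribution


def route(task: TaskEstimate, threshold: float = 0.7) -> str:
    """Pick a handler based on how confidently the task clears the threshold."""
    if task.lower_bound >= threshold:
        return "agent"              # even the pessimistic estimate passes
    if task.assistability >= threshold:
        return "agent_with_review"  # likely fine, but keep a human check
    return "human"


tasks = [
    TaskEstimate("Summarize support ticket", 0.90, 0.80),
    TaskEstimate("Draft contract clause", 0.75, 0.50),
    TaskEstimate("Negotiate with supplier", 0.40, 0.20),
]
for t in tasks:
    print(f"{t.name}: {route(t)}")
```

Routing on the lower bound rather than the point estimate is the risk-averse option: a task goes to an unattended agent only when the whole ensemble agrees it is assistable.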
What Comes Next
While the proposed framework marks a substantial leap forward, several open challenges remain:
- Data privacy and provenance: Harvesting usage logs at scale raises ethical concerns; future work must embed privacy‑preserving aggregation techniques.
- Standardization of task libraries: A community‑driven, open taxonomy is needed to avoid fragmented measurement across industries.
- Policy‑research coordination: The paper calls for structured liaison mechanisms—policy labs, joint workshops, and shared data repositories—to keep measurement updates in sync with legislative timelines.
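The paper does not prescribe a specific privacy-preserving technique. As one simple illustration, the sketch below applies small-cell suppression: usage counts are released only for occupations with at least `min_users` distinct users. The `aggregate_usage` function, the event format, and the threshold are assumptions made for this example; a production system would likely need stronger guarantees such as differential privacy.

```python
from collections import defaultdict
from typing import Iterable


def aggregate_usage(
    events: Iterable[tuple[str, str]], min_users: int = 5
) -> dict[str, int]:
    """Count events per occupation, suppressing small groups.

    events: (user_id, occupation) pairs from a hypothetical usage log.
    Occupations with fewer than min_users distinct users are dropped
    so that no individual's activity can be read off the output.
    """
    users: dict[str, set[str]] = defaultdict(set)
    counts: dict[str, int] = defaultdict(int)
    for user_id, occupation in events:
        users[occupation].add(user_id)
        counts[occupation] += 1
    return {occ: n for occ, n in counts.items() if len(users[occ]) >= min_users}


# Made-up log: three developers with two events each, one actuary.
log = [
    ("u1", "developer"), ("u1", "developer"),
    ("u2", "developer"), ("u2", "developer"),
    ("u3", "developer"), ("u3", "developer"),
    ("u4", "actuary"),
]
print(aggregate_usage(log, min_users=3))  # {'developer': 6}
```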
Potential applications extend beyond labor economics. For example, AI marketing agents could use exposure scores to tailor campaign automation to roles most receptive to AI assistance, while the Workflow automation studio could embed dynamic metrics to suggest optimal hand‑offs between human operators and bots.
Developers interested in rapid prototyping can experiment with the OpenAI ChatGPT integration or combine voice capabilities via the ElevenLabs AI voice integration to collect real‑time usage signals for their own exposure dashboards.
Ultimately, closing the research‑policy gap will require a two‑way commitment: policymakers must broaden their evidence base and engage workers as epistemic partners, while researchers need to build open‑source infrastructure, adopt participatory methods, and write with policy relevance in mind.
References
- Lund, C., Euyang, T., Munyikwa, Z., & Fadaee, M. (2026). AI Exposure Scores: what they measure, what they miss, and what comes next. arXiv preprint.
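- Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2023). GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. arXiv preprint. (Original dataset source.)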
- U.S. Bureau of Labor Statistics. Occupational Outlook Handbook.
- O*NET Content Model. (2022). Standardized occupational task taxonomy.