- Updated: June 27, 2026
Text2DSL: LLM-Based Code Generation for Domain-Specific Languages

Direct Answer
The paper "Text2DSL: LLM-Based Code Generation for Domain-Specific Languages" introduces Text2DSL, a formal problem class and a data-driven approach that lets large language models (LLMs) translate natural-language policy descriptions into Polkit rules. By injecting the target DSL's formal grammar and API specification into the prompt, the authors report near-perfect syntactic validity and substantial gains in structural validity and CodeBLEU, without any model fine-tuning.
Background: Why This Problem Is Hard
Domain‑specific languages (DSLs) such as Polkit, SELinux, or custom configuration languages are the backbone of modern operating‑system security policies. Writing rules in these languages requires:
- Deep knowledge of the DSL’s syntax (often expressed in BNF or YACC).
- Understanding of the underlying security model and the exact API surface.
- Meticulous attention to detail—one misplaced token can open a critical vulnerability.
Current automation attempts fall into two camps:
- General‑purpose code generation (e.g., GitHub Copilot) that treats DSLs as just another programming language. These models lack the precise structural cues needed for DSLs, leading to low syntactic validity.
- Text‑to‑SQL pipelines that are tuned for relational query languages. While they excel at mapping natural language to a well‑defined grammar, they cannot be directly repurposed for security‑policy DSLs whose semantics differ dramatically from SQL.
Both approaches struggle because they ignore the formal specification that defines a DSL’s legal abstract syntax tree (AST). Without that scaffolding, LLMs generate “plausible‑looking” code that often fails static analysis or, worse, enforces unintended security rules.
What the Researchers Propose
The authors define a new problem class called Text2DSL. Instead of treating DSL generation as a generic code‑completion task, they frame it as a constrained translation problem where the target language is fully described by:
- A BNF grammar that captures the DSL’s syntactic structure.
- An API specification that lists permitted functions, identifiers, and their signatures.
- A curated identifier vocabulary that reflects the domain’s terminology (e.g., “network‑admin”, “read‑only”).
To operationalize this idea, they build PolkitBench, a benchmark of 4,204 natural‑language‑to‑Polkit‑rule pairs. Each pair is validated through a three‑stage AST‑based pipeline that checks:
- Syntactic validity (does the code parse?).
- Structural validity (does the AST conform to the policy’s intent?).
- Semantic fidelity (does the generated rule enforce the described security constraint?).
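The three stages above can be sketched as a short-circuiting chain of checks. This is a minimal illustration, not the authors' pipeline: the paper uses AST-based analysis, whereas the checks below are crude string-level stand-ins, and every function name plus the `expected_action_id` parameter is an assumption introduced here. The sample rule uses the standard `polkit.addRule` callback form with an action id chosen purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ValidationResult:
    passed: bool
    failed_stage: Optional[str] = None  # None when every stage passes


def is_syntactically_valid(code: str) -> bool:
    """Stand-in for a real parser: only checks that brackets are balanced."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def is_structurally_valid(code: str) -> bool:
    """Stand-in for an AST check: the rule registers a callback and returns a polkit.Result."""
    has_rule = "polkit.addRule(" in code
    has_result = re.search(r"return\s+polkit\.Result\.\w+", code) is not None
    return has_rule and has_result


def is_semantically_faithful(code: str, expected_action_id: str) -> bool:
    """Stand-in for a semantic check: the rule mentions the expected action id."""
    return expected_action_id in code


def validate(code: str, expected_action_id: str) -> ValidationResult:
    # Stages run in order; the first failing stage is reported and later ones are skipped.
    stages = [
        ("syntactic", lambda: is_syntactically_valid(code)),
        ("structural", lambda: is_structurally_valid(code)),
        ("semantic", lambda: is_semantically_faithful(code, expected_action_id)),
    ]
    for name, check in stages:
        if not check():
            return ValidationResult(False, name)
    return ValidationResult(True)


SAMPLE_RULE = """
polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.systemd1.manage-units" &&
        subject.isInGroup("admin")) {
        return polkit.Result.YES;
    }
});
"""
```

Ordering the stages from cheapest to most expensive means a snippet that does not even parse is never passed to the later checks. A real implementation would replace each stand-in with a proper parser and AST comparison.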
The core hypothesis is simple yet powerful: Providing the LLM with a structured prompt that embeds the DSL’s formal definition will dramatically improve generation quality, even for modest‑size models.
How It Works in Practice
The Text2DSL workflow can be broken down into four logical components:
| Component | Role | Key Interaction |
|---|---|---|
| Natural‑Language Encoder | Transforms the user’s policy description into a token sequence. | Feeds the encoded prompt to the LLM. |
| Structured Prompt Builder | Concatenates the BNF grammar, API spec, and identifier list with the encoded description. | Creates a single, self‑contained prompt that the LLM consumes. |
| LLM Generator (MoE) | Produces candidate DSL code conditioned on the full prompt. | Outputs one or more code snippets for downstream validation. |
| AST‑Based Validator | Parses each snippet, checks grammar compliance, and verifies structural constraints. | Filters out invalid candidates and returns the highest‑scoring rule. |
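The interaction between the generator and the validator in the table can be sketched as a small selection function. This is a hedged sketch: `generate`, `is_valid`, and `score` are hypothetical callables standing in for the LLM call, the AST-based validator, and whatever ranking the authors use, none of which are specified in this summary.

```python
from typing import Callable, List, Optional


def select_rule(
    prompt: str,
    generate: Callable[[str], List[str]],  # hypothetical LLM call returning candidates
    is_valid: Callable[[str], bool],       # hypothetical validator
    score: Callable[[str], float],         # hypothetical ranking function
) -> Optional[str]:
    """Generate candidates, drop invalid ones, and return the highest-scoring survivor."""
    candidates = generate(prompt)
    valid = [c for c in candidates if is_valid(c)]
    if not valid:
        return None  # the caller decides whether to retry or escalate to a human
    return max(valid, key=score)
```

Returning `None` instead of the least-bad invalid candidate keeps unvalidated code from reaching a deployment step.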
What sets this pipeline apart from generic code generation is the structured prompt. Instead of a free‑form description, the prompt looks roughly like:
[BNF Grammar]
[API Specification]
[Allowed Identifiers]
---User Request---
"Allow the admin group to restart the network service but deny all other users."
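A prompt builder for this layout might look like the following sketch. The section headers mirror the layout shown above; the `DslSpec` type, the `build_prompt` function, and the toy grammar and API fragments are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DslSpec:
    """Formal context for one target DSL (hypothetical container type)."""
    grammar_bnf: str
    api_spec: str
    identifiers: Tuple[str, ...]


def build_prompt(spec: DslSpec, request: str) -> str:
    """Concatenate the formal context and the user request into one prompt string."""
    sections = [
        "[BNF Grammar]\n" + spec.grammar_bnf.strip(),
        "[API Specification]\n" + spec.api_spec.strip(),
        "[Allowed Identifiers]\n" + "\n".join(spec.identifiers),
        "---User Request---\n" + '"' + request.strip() + '"',
    ]
    return "\n\n".join(sections)


# Toy fragments for demonstration only; a real spec would be far more complete.
TOY_SPEC = DslSpec(
    grammar_bnf='<rule> ::= "polkit.addRule(" <function> ");"',
    api_spec="subject.isInGroup(name: string) -> boolean",
    identifiers=("admin", "org.freedesktop.systemd1.manage-units"),
)

PROMPT = build_prompt(
    TOY_SPEC,
    "Allow the admin group to restart the network service but deny all other users.",
)
```

Keeping the formal context in a separate, immutable object means the same specification can be reused across requests, and only the final section changes per call.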
Two mixture-of-experts (MoE) models, GigaChat-10B-A1.8B (1.8 B active parameters) and Nemotron-3-Nano-30B-A3B (3 B active parameters), were tested with this prompt format. The models differ in scale and training provenance, yet the injection of formal context yields consistent gains for both.
Evaluation & Results
The authors evaluated the approach on three axes:
- Syntactic Validity – percentage of generated snippets that parse without errors.
- Structural Validity – proportion of snippets whose AST matches the intended policy structure.
- CodeBLEU: a composite metric that combines n-gram overlap, keyword-weighted n-gram overlap, AST match, and data-flow match against a reference.
Key findings include:
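To make the metrics concrete, the sketch below computes a validity percentage and combines four component scores the way CodeBLEU is commonly defined, as a weighted sum of n-gram match, weighted n-gram match, AST match, and data-flow match. The equal weights are an assumption; this summary does not state which weights the authors used, and the component scores are taken as inputs rather than computed here.

```python
from typing import Callable, Iterable, Sequence


def validity_rate(snippets: Iterable[str], is_valid: Callable[[str], bool]) -> float:
    """Percentage of snippets accepted by the given validity check."""
    items = list(snippets)
    if not items:
        return 0.0
    return 100.0 * sum(1 for s in items if is_valid(s)) / len(items)


def code_bleu(
    ngram: float,
    weighted_ngram: float,
    ast_match: float,
    dataflow_match: float,
    weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),  # assumed equal weighting
) -> float:
    """Weighted sum of the four CodeBLEU components, each expected in [0, 1]."""
    if len(weights) != 4:
        raise ValueError("expected exactly four weights")
    parts = (ngram, weighted_ngram, ast_match, dataflow_match)
    return sum(w * p for w, p in zip(weights, parts))
```

The same `validity_rate` helper serves both syntactic and structural validity; only the check passed as `is_valid` changes.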
- When the structured context is omitted, syntactic validity hovers around 70‑80 % for both models.
- Adding the BNF and API spec pushes syntactic validity to the high‑90s (≈98.6‑99.4 %).
- Structural validity improves by 10-35 percentage points, indicating that more of the generated rules not only parse but also have the intended policy structure.
- CodeBLEU scores jump by 60-95 %, suggesting that the output moves much closer to the reference rules once the LLM "knows" the language's shape.
Importantly, these gains are achieved without any fine-tuning. The same prompt template works for models with 1.8 B and 3 B active parameters, suggesting that the technique scales down to modest LLMs that are easier to host on-premise.
Why This Matters for AI Systems and Agents
From a systems‑builder’s perspective, Text2DSL unlocks several practical pathways:
- Rapid policy authoring: Security engineers can describe a rule in plain English and receive a validated Polkit snippet, which could substantially shorten the time needed to author a policy.
- Agent‑driven compliance automation: An AI agent can monitor configuration drift, generate corrective DSL code on the fly, and push it through the AST validator before deployment.
- Integration with orchestration platforms: The structured prompt can be wrapped in a microservice that other tools call via REST, enabling “policy‑as‑code” pipelines.
- Cost-effective deployment: Since the method works with models that activate only a few billion parameters per token, enterprises may be able to run the generator on their own hardware, avoiding expensive API calls to large-scale providers.
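The microservice idea from the list above could be prototyped with Python's standard library alone. This is a sketch under stated assumptions: the `/generate` path, the JSON field names `request` and `rule`, and the `translate` callable (which would wrap prompt building, generation, and validation) are all hypothetical and not part of the paper or of any UBOS API.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Tuple


def handle_generate(body: bytes, translate: Callable[[str], str]) -> Tuple[int, dict]:
    """Pure request logic: parse the JSON body and return (status code, JSON reply)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}
    request = payload.get("request")
    if not isinstance(request, str) or not request.strip():
        return 400, {"error": "field 'request' must be a non-empty string"}
    return 200, {"rule": translate(request)}


def make_handler(translate: Callable[[str], str]):
    """Build an HTTP handler class bound to a specific translate function."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # hypothetical endpoint path
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            status, reply = handle_generate(self.rfile.read(length), translate)
            data = json.dumps(reply).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return Handler
```

To serve it locally, one could run `HTTPServer(("127.0.0.1", 8080), make_handler(my_translate)).serve_forever()` with `HTTPServer` imported from `http.server`; the port and the `my_translate` name are placeholders. Keeping the request logic in a pure function makes it testable without opening a socket.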
These capabilities dovetail with existing UBOS offerings. For example, the Workflow automation studio can orchestrate the Text2DSL microservice as a step in a broader security‑automation workflow. Likewise, the ChatGPT and Telegram integration could be extended to let administrators request policy changes via a chat interface, with the LLM handling the translation behind the scenes.
What Comes Next
While the results are promising, several open challenges remain:
- Generalization to other DSLs: Polkit is a well‑structured language; extending the approach to more loosely defined DSLs (e.g., custom CI/CD pipelines) will test the limits of prompt‑based scaffolding.
- Dynamic context handling: Real‑world policies often depend on runtime state (current user groups, system load). Future work could feed live system metadata into the prompt.
- Human‑in‑the‑loop verification: Even with high structural validity, a security auditor should review generated rules before production deployment.
- Fine‑tuning vs. prompting trade‑offs: Investigating whether a small amount of domain‑specific fine‑tuning can further reduce the need for extensive prompt engineering.
Potential applications extend beyond security. Any domain that relies on a formal DSL—network configuration (e.g., Cisco IOS), data‑pipeline definitions (e.g., Apache Beam), or robotics command languages—could benefit from a Text2DSL‑style pipeline.
Developers interested in experimenting can start from the PolkitBench benchmark and integrate the prompt builder into the UBOS platform. From there, the Enterprise AI platform by UBOS can provide the compute backbone to host the MoE models securely.
In short, Text2DSL demonstrates that a well‑crafted, specification‑rich prompt can turn a generic LLM into a reliable DSL authoring assistant—opening the door to safer, faster, and more accessible policy automation across the enterprise.