- Updated: January 17, 2026
- 7 min read
LLM Poetry Showdown: Gwern vs Mercor – AI‑Generated Art and the Quest for Greatness
Gwern and Mercor have each demonstrated how large language models (LLMs) can be coaxed into writing poetry, but they differ fundamentally: Gwern treats the model as a collaborative workshop partner using iterative prompting, while Mercor builds a rubric‑driven evaluation pipeline to train models for broader creative and professional tasks.

Introduction
In the past year, two independent research streams have captured the imagination of the AI‑creative community: Gwern’s iterative prompting experiments and Mercor’s rubric‑based evaluation framework. Both aim to answer a single, provocative question: Can an LLM produce poetry that transcends technical competence and reaches the realm of “great” art? This article dissects their methodologies, compares outcomes, and extracts lessons for the future of AI‑generated art.
Background on LLM Capabilities
Modern LLMs such as GPT‑4 and Claude 3 excel at pattern replication, token‑level fluency, and basic stylistic control. Their training corpora include vast amounts of poetry, giving them a rich “vocabulary of verse,” yet they still lack:
- Cultural grounding: No lived experience to anchor a poem in a specific time, place, or community.
- Intentionality: They follow prompts but do not set artistic goals autonomously.
- Long‑range coherence: Maintaining a thematic arc across dozens of lines remains challenging.
These gaps motivate researchers to devise prompting tricks, evaluation rubrics, and feedback loops that push LLMs beyond “generic rhyme” toward nuanced expression.
Gwern’s Iterative Prompting Experiment
Gwern, a veteran essayist and AI enthusiast, treats LLMs as co‑authors. His workflow can be broken into four distinct stages:
- Analysis: The model receives the original poem (e.g., William Empson’s “This Last Pain”) and a concise brief of style, meter, and thematic intent.
- Brainstorming: The LLM generates ten divergent continuations, each annotated with a self‑assessed creativity score.
- Critique & Selection: A second model (often Claude) reviews each draft, assigning a 1‑5 star rating based on originality, adherence to form, and emotional resonance.
- Revision Loop: The top candidate undergoes line‑by‑line editing, with the model suggesting alternatives, until a polished version emerges.
Key innovations include:
- Explicit chain‑of‑thought prompting to avoid “mode collapse” (the bland, safe output typical of RLHF‑tuned chat models).
- Use of multiple specialized models for distinct tasks—one for generation, another for critique—mirroring a human writers’ workshop.
- Embedding a “pressure‑cooker” prompt that enforces strict Pindaric ode constraints (triadic structure, caesura, alliteration), paired with curated word lists for data‑driven lexical choices.
The result? Poems that often surprise readers with vivid, context‑rich imagery and a disciplined formal backbone, occasionally approaching the “particular‑and‑universal” quality that Gwern defines as greatness.
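The four stages above amount to a generate‑critique‑revise loop, which can be sketched as follows. This is an illustrative sketch only: `generate` and `critique` are hypothetical callables standing in for the drafting model and the separate critic model (one LLM producing continuations, another returning the 1–5 star rating); they are not an API Gwern published.

```python
def workshop(brief, generate, critique, n_drafts=10, revision_rounds=3):
    """Gwern-style loop: brainstorm many drafts, keep the critic's
    favorite, then revise until the critic stops preferring rewrites."""
    # Brainstorming: one model proposes divergent continuations.
    drafts = [generate(brief) for _ in range(n_drafts)]
    # Critique & selection: a second model scores each draft.
    best = max(drafts, key=critique)
    # Revision loop: accept a rewrite only if the critic rates it higher.
    for _ in range(revision_rounds):
        revised = generate("Revise, preserving form and theme: " + best)
        if critique(revised) > critique(best):
            best = revised
    return best
```

Keeping `generate` and `critique` as separate models mirrors the human writers’ workshop division of labor described above, and the acceptance test in the revision loop is what prevents the process from drifting back toward bland, mode‑collapsed output.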
Mercor’s Rubric‑Based Evaluation
Mercor, a startup focused on AI alignment, approaches poetry as a testbed for expert‑driven reinforcement learning. Their pipeline consists of three distinct phases:
- Rubric Design: Veteran poets craft a scoring guide covering structure, metaphor originality, cultural specificity, and emotional impact. Sample criteria: “Avoid clichéd imagery,” “Reward a reframed closing line,” and “Penalize mismatched meter.”
- Model Generation & Human Rating: The LLM produces a batch of poems for a given prompt. Human poets then rank each poem against the rubric, providing detailed comments that become training signals for the next model iteration.
- RLHF Loop: The aggregated rubric scores feed back into a reinforcement‑learning algorithm, nudging the model toward the “desired” style while pruning “edge‑case” errors.
Mercor’s philosophy is pragmatic: if a model can satisfy expert poets, the same reward framework can be repurposed for legal drafting, medical summarization, or marketing copy—domains where “taste” and “precision” matter equally.
Notably, Mercor’s system emphasizes scalability over the artisanal depth of Gwern’s approach. By abstracting poetic quality into a numeric rubric, they can train models on millions of examples, accelerating convergence toward “good enough” poetry that pleases the majority of users.
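Mercor’s rubric aggregation can be illustrated with a small weighted‑scoring sketch. The criterion names and weights below are hypothetical, chosen only to mirror the rubric dimensions described above; in the actual pipeline the per‑criterion ratings would come from human poets and the aggregate would feed the RLHF reward model.

```python
# Hypothetical rubric: criterion -> weight (weights sum to 1.0).
RUBRIC = {
    "structure": 0.20,
    "metaphor_originality": 0.30,
    "cultural_specificity": 0.25,
    "emotional_impact": 0.25,
}

def rubric_score(ratings, rubric=RUBRIC):
    """Collapse per-criterion 1-5 ratings into one weighted reward value."""
    total = sum(rubric.values())
    return sum(rubric[c] * ratings[c] for c in rubric) / total

def rank_batch(batch, rubric=RUBRIC):
    """Order (poem, ratings) pairs best-first, as a reward signal would."""
    return sorted(batch, key=lambda pr: rubric_score(pr[1], rubric), reverse=True)
```

Abstracting quality into a single number like this is exactly the trade‑off noted above: it scales to millions of outputs, but it is also where the nuance that resists quantification gets lost.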
Comparative Analysis & Conclusions
Both experiments illuminate complementary pathways for advancing AI creativity. The table below summarizes their core attributes:
| Aspect | Gwern’s Iterative Prompting | Mercor’s Rubric‑Based Evaluation |
|---|---|---|
| Goal | Create singular, high‑quality poems through human‑in‑the‑loop refinement. | Produce large volumes of “acceptable” poetry to train reward models for broader tasks. |
| Human Role | Curator‑editor, providing critique and direction at each iteration. | Rubric author and scorer, supplying quantitative feedback. |
| Scalability | Low – intensive manual oversight per poem. | High – rubric can be applied to millions of outputs. |
| Potential for Greatness | Higher, because the process preserves cultural particularity. | Limited, as the rubric abstracts away nuance. |
| Toolchain | Custom prompts and multi‑model orchestration for rapid feedback. | Standard LLM APIs plus an RLHF pipeline for large‑scale training. |
From a practical standpoint, developers seeking high‑impact, domain‑specific AI may favor Mercor’s approach, while poets and literary scholars who value depth over volume will likely gravitate toward Gwern’s methodology.
Implications for AI‑Generated Art
The divergent strategies raise broader questions for the AI‑creative ecosystem:
- Artistic Authenticity: When a poem is the product of a multi‑model “workshop,” does authorship belong to the human curator, the LLM, or the combined process?
- Scalability vs. Specificity: Mercor’s rubric can be repurposed for ad copy, legal briefs, or medical notes, suggesting a future where AI marketing agents are trained on poetic taste to improve stylistic fluency across domains.
- Culture Embedding: Gwern’s emphasis on cultural research (e.g., building a database of lab‑animal terminology) demonstrates that LLMs can approximate cultural depth when supplied with curated knowledge bases.
- Evaluation Standards: The community still lacks a universally accepted metric for “great” AI poetry. Mercor’s rubric is a step, but it may need to incorporate measures of particularity, as Gwern’s work suggests.
Ultimately, the coexistence of both pipelines enriches the field: one pushes the envelope of artistic excellence, the other builds the infrastructure for reliable, scalable creativity.
Read the Full Original Analysis
For a deeper dive, see the original article that first chronicled these experiments.
Closing Thoughts
The juxtaposition of Gwern’s artisanal, human‑centric approach and Mercor’s data‑driven, scalable pipeline illustrates the dual pathways AI creativity can travel. As LLMs continue to grow in size and capability, the community will likely see hybrid models that combine deep cultural grounding with robust evaluation frameworks—unlocking not just better poetry, but richer, more nuanced AI‑generated experiences across every sector.
Whether you are a poet seeking a digital muse, a developer building the next AI‑creative platform, or a business leader exploring generative AI for content creation, the lessons from these experiments provide a roadmap for navigating the evolving landscape of machine‑made art.