- Updated: March 15, 2026
- 6 min read
Zhipu AI Launches GLM‑OCR: A Compact 0.9B Multimodal OCR Model

GLM‑OCR, a 0.9 B‑parameter multimodal OCR model from Zhipu AI, delivers high‑accuracy document parsing, table and formula recognition, and key‑information extraction while keeping latency and compute costs low enough for edge‑device deployment.
Introduction: Why GLM‑OCR Matters
On March 14, 2026, Zhipu AI announced the release of GLM‑OCR, a compact 0.9 B‑parameter multimodal OCR model designed to bridge the gap between traditional OCR engines and heavyweight vision‑language models. The research, conducted jointly with Tsinghua University, targets real‑world documents that contain mixed layouts, tables, formulas, and structured fields—scenarios where classic OCR pipelines often stumble. By combining a lightweight visual encoder with a language decoder, GLM‑OCR promises enterprise‑grade accuracy with a fraction of the hardware footprint required by larger models.
Technical Overview: Architecture & Innovations
Model Size & Core Components
- Visual Encoder: 0.4 B‑parameter CogViT encoder optimized for document images.
- Language Decoder: 0.5 B‑parameter GLM decoder that generates structured text, JSON, or Markdown.
- Cross‑Modal Connector: A lightweight bridge that fuses visual embeddings with language tokens.
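To make the three components concrete, here is a minimal data-flow sketch. The function and parameter names are illustrative assumptions, not Zhipu AI's actual API; the point is only how an encoder, connector, and decoder compose.

```python
# Hypothetical sketch of how the three components compose; names are
# illustrative, not Zhipu AI's actual API.

def glm_ocr_forward(image_patches, prompt_tokens,
                    encode_vision, connect, decode):
    """encode_vision: CogViT-style encoder, image patches -> visual embeddings
    connect:       cross-modal connector, embeddings -> language-space tokens
    decode:        GLM decoder, fused token sequence -> output tokens
    """
    visual_embeddings = encode_vision(image_patches)
    fused = connect(visual_embeddings) + prompt_tokens
    return decode(fused)

# Tiny stand-ins to show the flow end to end.
out = glm_ocr_forward(
    image_patches=[0.1, 0.2],
    prompt_tokens=["<parse>"],
    encode_vision=lambda patches: [p * 10 for p in patches],
    connect=lambda emb: [f"v{e:.0f}" for e in emb],
    decode=lambda tokens: tokens,
)
print(out)  # ['v1', 'v2', '<parse>']
```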
Multi‑Token Prediction (MTP)
Unlike conventional autoregressive decoding that emits one token per step, GLM‑OCR predicts up to 10 tokens simultaneously. In practice, the model averages 5.2 tokens per step, delivering roughly a 50 % boost in throughput without sacrificing output quality. This innovation is crucial for OCR workloads where the output sequence is largely deterministic.
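A quick back-of-envelope calculation shows why this helps. Note that the reduction in decoder passes is an upper bound on the speedup: the article reports roughly 50 % higher end-to-end throughput, which is smaller than the raw step reduction, plausibly because each multi-token step does more work and draft tokens must be verified.

```python
import math

def decode_steps(output_tokens: int, tokens_per_step: float) -> int:
    """Decoder forward passes needed to emit `output_tokens` tokens."""
    return math.ceil(output_tokens / tokens_per_step)

baseline = decode_steps(1040, 1.0)  # classic autoregressive decoding
mtp = decode_steps(1040, 5.2)       # the article's average tokens per step
print(baseline, mtp)  # 1040 200
```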
Two‑Stage Layout Parsing
GLM‑OCR adopts a two‑stage pipeline:
- Stage 1 – Layout Detection: The PP‑DocLayout‑V3 module identifies regions such as paragraphs, tables, formulas, and seals.
- Stage 2 – Region‑Level Recognition: Each detected region is processed in parallel, allowing the model to focus on localized visual cues and dramatically reducing inference latency.
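The two stages can be sketched as a detect-then-recognize pipeline. The stubs below stand in for PP‑DocLayout‑V3 and the region recognizer; the function names and region format are assumptions for illustration only.

```python
# Illustrative two-stage pipeline: detect layout regions, then recognize
# each region in parallel. Both model calls are stubbed out here.
from concurrent.futures import ThreadPoolExecutor

def parse_document(image, detect_layout, recognize_region):
    """Stage 1: detect regions (e.g. PP-DocLayout-V3).
    Stage 2: recognize every region concurrently."""
    regions = detect_layout(image)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(recognize_region, regions))
    # Reassemble output; regions are assumed pre-sorted in reading order.
    return "\n\n".join(results)

doc = parse_document(
    image="invoice.png",
    detect_layout=lambda img: [("paragraph", "A"), ("table", "B")],
    recognize_region=lambda r: f"[{r[0]}] {r[1]}",
)
print(doc)
```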
Task‑Specific Output Paths
GLM‑OCR separates document parsing from key‑information extraction (KIE). For parsing, the model emits structured Markdown or JSON after region‑level processing. For KIE, a single‑image prompt drives the model to generate a JSON payload containing extracted fields directly, bypassing the layout stage when the task is simple.
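The split between the two output paths amounts to building different prompts for the same model. The prompt strings and field names below are illustrative assumptions, not Zhipu AI's documented schema.

```python
# Hypothetical request builder for the two task-specific output paths.
# Prompt wording and field names are assumptions for illustration.

def build_request(image_url: str, task: str, fields=None) -> dict:
    if task == "parse":
        prompt = "Parse this document into Markdown."
    elif task == "kie":
        prompt = f"Extract {', '.join(fields)} as JSON."
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": image_url, "prompt": prompt}

parse_req = build_request("s3://docs/contract.png", "parse")
kie_req = build_request("s3://docs/invoice.png", "kie",
                        fields=["invoice_no", "total", "tax_id"])
print(kie_req["prompt"])  # Extract invoice_no, total, tax_id as JSON.
```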
Training Process and Benchmark Results
Four‑Stage Training Pipeline
| Stage | Focus | Key Techniques |
|---|---|---|
| 1 | Vision encoder pre‑training | Image‑text pairs, grounding, retrieval |
| 2.1 | Multimodal pre‑training | Doc parsing, VQA, image‑text |
| 2.2 | Multi‑Token Prediction | MTP objective, token‑sharing draft models |
| 3 | Supervised fine‑tuning | OCR, formula transcription, table recovery, KIE |
| 4 | Reinforcement learning | GRPO with task‑specific rewards (Edit Distance, CDM, TEDS, F1) |
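One of the Stage 4 reward signals, Edit Distance, is easy to turn into a normalized reward in [0, 1]. The sketch below uses plain Levenshtein distance; the exact normalization Zhipu AI used in training is an assumption here.

```python
# Normalized edit-distance reward: 1.0 for an exact match, lower as the
# prediction drifts from the reference transcription.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_distance_reward(pred: str, ref: str) -> float:
    if not ref:
        return 1.0 if not pred else 0.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

print(edit_distance_reward("Total: 42", "Total: 42"))            # 1.0
print(round(edit_distance_reward("Tota1: 42", "Total: 42"), 2))  # 0.89
```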
Benchmark Performance
GLM‑OCR was evaluated on a suite of public OCR benchmarks. The model achieved top‑tier scores on most non‑reference datasets, confirming its competitive edge while maintaining a modest footprint.
- OmniDocBench v1.5: 94.6
- OCRBench (Text): 94.0
- UniMERNet: 96.5
- TEDS_TEST: 86.0
- PubTabNet (tables): 85.2 (behind MinerU 2.5’s 88.4)
- Nanonets‑KIE: 93.7 (Gemini‑3‑Pro scores higher but is a reference model)
The results demonstrate that a sub‑billion‑parameter model can rival much larger systems on diverse document‑understanding tasks, a claim supported by the accompanying research paper.
Deployment Options and Pricing
Zhipu AI packages GLM‑OCR as a Model‑as‑a‑Service (MaaS) offering, compatible with popular inference runtimes such as vLLM, SGLang, and Ollama. Fine‑tuning is supported via LLaMA‑Factory, enabling enterprises to adapt the model to domain‑specific vocabularies.
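Since vLLM exposes an OpenAI-compatible server, calling the model could look roughly like the sketch below. The served model name (`glm-ocr`), endpoint URL, and prompt are assumptions; check the model card for the actual identifiers and request format.

```python
# Sketch of querying GLM-OCR via a vLLM OpenAI-compatible endpoint.
# Model name, URL, and prompt wording are assumptions, not documented values.
import base64
import json
import urllib.request

def build_ocr_payload(image_bytes: bytes, prompt: str,
                      model: str = "glm-ocr") -> dict:
    """OpenAI-style chat payload with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

def ocr(image_bytes: bytes,
        endpoint: str = "http://localhost:8000/v1/chat/completions") -> str:
    payload = build_ocr_payload(image_bytes, "Parse this document into Markdown.")
    req = urllib.request.Request(
        endpoint, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```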
The pricing model is straightforward: 0.2 RMB per million tokens. For a typical scanned invoice (≈ 2 KB), the cost translates to less than a cent per document, making large‑scale batch processing economically viable.
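The per-document cost claim is easy to sanity-check from the quoted rate. The 2,000-token invoice size below is an assumption for illustration, not a published figure.

```python
# Sanity check on the article's pricing: 0.2 RMB per million tokens.
PRICE_RMB_PER_MTOKEN = 0.2

def cost_rmb(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_RMB_PER_MTOKEN

# Assume a generous 2,000 tokens for one scanned invoice:
print(f"{cost_rmb(2_000):.4f} RMB per invoice")          # 0.0004 RMB
print(f"{cost_rmb(2_000_000_000):.0f} RMB per 1M invoices")  # 400 RMB
```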
Throughput benchmarks report 0.67 images/s and 1.86 PDF pages/s on a single V100 GPU, suggesting the model can serve near‑real‑time workloads on modest hardware.
Comparison with Competing OCR Models
| Model | Parameters | Key Strengths | Typical Use‑Case |
|---|---|---|---|
| GLM‑OCR (Zhipu AI) | 0.9 B | MTP decoding, two‑stage layout, low latency | Enterprise document pipelines, edge devices |
| Gemini‑3‑Pro (Google) | ≈ 30 B | State‑of‑the‑art vision‑language, high accuracy | Research‑grade, cloud‑only OCR |
| MinerU 2.5 | 1.2 B | Superior table extraction (PubTabNet) | Financial statement analysis |
| Traditional Tesseract | N/A (rule‑based) | Open‑source, lightweight | Simple text PDFs, low‑cost projects |
The table highlights that GLM‑OCR occupies a sweet spot: far smaller than giant LLM‑based OCR services yet more capable than classic rule‑based engines. Its unique MTP and layout‑aware design give it an edge in latency‑sensitive environments such as enterprise AI platforms.
Use‑Case Scenarios Where GLM‑OCR Shines
- Invoice & Receipt Automation: Extract line items, totals, and tax IDs in milliseconds, feeding directly into ERP systems.
- Legal Document Review: Parse contracts with mixed clauses, tables, and signatures, then generate JSON summaries for compliance checks.
- Scientific Paper Indexing: Recognize formulas and tables, enabling searchable knowledge bases for R&D teams.
- Healthcare Records Digitization: Convert scanned patient forms into structured HL7‑compatible JSON without manual data entry.
- Multilingual KIE: Combined with an OpenAI ChatGPT integration, GLM‑OCR can feed extracted fields into a language model for translation or summarization.
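For the invoice and receipt scenario above, a thin validation layer between the model's JSON output and the ERP system catches malformed extractions early. The field names below are assumptions, not a fixed GLM‑OCR schema.

```python
# Illustrative post-processing: validate KIE output before ERP ingestion.
# Field names and types are assumptions for the invoice use case.
import json

REQUIRED = {"invoice_no": str, "total": (int, float), "tax_id": str}

def validate_invoice(payload: str) -> dict:
    record = json.loads(payload)
    for field, typ in REQUIRED.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], typ):
            raise TypeError(f"{field} has wrong type")
    return record

ok = validate_invoice('{"invoice_no": "INV-001", "total": 129.5, "tax_id": "91-XYZ"}')
print(ok["total"])  # 129.5
```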
Companies looking to embed OCR into their existing workflows can leverage the UBOS platform overview to orchestrate GLM‑OCR alongside other AI services such as ElevenLabs AI voice integration for end‑to‑end document‑to‑speech pipelines.
Read the Original Announcement
For a complete technical deep‑dive, see the original MarkTechPost article: Zhipu AI Introduces GLM‑OCR.
Explore Related UBOS Solutions
If you’re evaluating OCR as part of a broader AI strategy, UBOS offers a suite of complementary tools:
- AI OCR solutions – pre‑built pipelines that can ingest GLM‑OCR outputs.
- Multimodal OCR – combines vision and language models for richer extraction.
- OCR technology trends – stay ahead of emerging standards.
- UBOS templates for quick start – jump‑start your OCR workflow with ready‑made templates.
- UBOS portfolio examples – see real‑world deployments of document AI.
- UBOS for startups – affordable plans for early‑stage innovators.
- UBOS solutions for SMBs – scale OCR without breaking the bank.
- UBOS pricing plans – transparent cost structures for AI services.
- UBOS partner program – collaborate on AI solutions and co‑sell.
- AI marketing agents – automate content generation from extracted data.
- Web app editor on UBOS – build custom front‑ends for OCR results.
- Telegram integration on UBOS – push OCR alerts to chat channels.
- ChatGPT and Telegram integration – enable conversational query of extracted data.
Conclusion: GLM‑OCR Sets a New Baseline for Efficient Document Understanding
Zhipu AI’s GLM‑OCR demonstrates that high‑quality OCR no longer requires massive compute budgets. By marrying a compact visual encoder with a language‑centric decoder and introducing Multi‑Token Prediction, the model delivers fast, accurate, and structured extraction suitable for everything from invoice processing to scientific literature mining. Enterprises seeking a scalable, cost‑effective OCR engine should evaluate GLM‑OCR alongside UBOS’s Enterprise AI platform to create end‑to‑end pipelines that turn paper into actionable intelligence.
Ready to modernize your document workflows? Visit the UBOS homepage today, explore the AI OCR solutions, and start a free trial to experience GLM‑OCR’s performance first‑hand.