- Updated: March 15, 2026
- 6 min read
Zhipu AI Launches GLM‑OCR: A Compact 0.9B Multimodal OCR Model

GLM‑OCR, a 0.9 B‑parameter multimodal OCR model from Zhipu AI, delivers high‑accuracy document parsing, table and formula recognition, and key‑information extraction while keeping latency and compute costs low enough for edge‑device deployment.
Introduction: Why GLM‑OCR Matters
On March 14, 2026, Zhipu AI announced the release of GLM‑OCR, a compact 0.9 B‑parameter multimodal OCR model designed to bridge the gap between traditional OCR engines and heavyweight vision‑language models. The research, conducted jointly with Tsinghua University, targets real‑world documents that contain mixed layouts, tables, formulas, and structured fields—scenarios where classic OCR pipelines often stumble. By combining a lightweight visual encoder with a language decoder, GLM‑OCR promises enterprise‑grade accuracy with a fraction of the hardware footprint required by larger models.
Technical Overview: Architecture & Innovations
Model Size & Core Components
- Visual Encoder: 0.4 B‑parameter CogViT encoder optimized for document images.
- Language Decoder: 0.5 B‑parameter GLM decoder that generates structured text, JSON, or Markdown.
- Cross‑Modal Connector: A lightweight bridge that fuses visual embeddings with language tokens.
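To make the three components concrete, here is a minimal data-flow sketch. The function and parameter names are illustrative assumptions, not Zhipu AI's actual API; the point is only how an encoder, connector, and decoder compose.

```python
# Hypothetical sketch of how the three components compose; names are
# illustrative, not Zhipu AI's actual API.

def glm_ocr_forward(image_patches, prompt_tokens,
                    encode_vision, connect, decode):
    """encode_vision: CogViT-style encoder, image patches -> visual embeddings
    connect:       cross-modal connector, embeddings -> language-space tokens
    decode:        GLM decoder, fused token sequence -> output tokens
    """
    visual_embeddings = encode_vision(image_patches)
    fused = connect(visual_embeddings) + prompt_tokens
    return decode(fused)

# Tiny stand-ins to show the flow end to end.
out = glm_ocr_forward(
    image_patches=[0.1, 0.2],
    prompt_tokens=["<parse>"],
    encode_vision=lambda patches: [p * 10 for p in patches],
    connect=lambda emb: [f"v{e:.0f}" for e in emb],
    decode=lambda tokens: tokens,
)
print(out)  # ['v1', 'v2', '<parse>']
```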
Multi‑Token Prediction (MTP)
Unlike conventional autoregressive decoding that emits one token per step, GLM‑OCR predicts up to 10 tokens simultaneously. In practice, the model averages 5.2 tokens per step, delivering roughly a 50 % boost in throughput without sacrificing output quality. This innovation is crucial for OCR workloads where the output sequence is largely deterministic.
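A quick back-of-envelope calculation shows why this helps. Note that the reduction in decoder passes is an upper bound on the speedup: the article reports roughly 50 % higher end-to-end throughput, which is smaller than the raw step reduction, plausibly because each multi-token step does more work and draft tokens must be verified.

```python
import math

def decode_steps(output_tokens: int, tokens_per_step: float) -> int:
    """Decoder forward passes needed to emit `output_tokens` tokens."""
    return math.ceil(output_tokens / tokens_per_step)

baseline = decode_steps(1040, 1.0)  # classic autoregressive decoding
mtp = decode_steps(1040, 5.2)       # the article's average tokens per step
print(baseline, mtp)  # 1040 200
```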
Two‑Stage Layout Parsing
GLM‑OCR adopts a two‑stage pipeline:
- Stage 1 – Layout Detection: The PP‑DocLayout‑V3 module identifies regions such as paragraphs, tables, formulas, and seals.
- Stage 2 – Region‑Level Recognition: Each detected region is processed in parallel, allowing the model to focus on localized visual cues and dramatically reducing inference latency.
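The two stages can be sketched as a detect-then-recognize pipeline. The stubs below stand in for PP‑DocLayout‑V3 and the region recognizer; the function names and region format are assumptions for illustration only.

```python
# Illustrative two-stage pipeline: detect layout regions, then recognize
# each region in parallel. Both model calls are stubbed out here.
from concurrent.futures import ThreadPoolExecutor

def parse_document(image, detect_layout, recognize_region):
    """Stage 1: detect regions (e.g. PP-DocLayout-V3).
    Stage 2: recognize every region concurrently."""
    regions = detect_layout(image)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(recognize_region, regions))
    # Reassemble output; regions are assumed pre-sorted in reading order.
    return "\n\n".join(results)

doc = parse_document(
    image="invoice.png",
    detect_layout=lambda img: [("paragraph", "A"), ("table", "B")],
    recognize_region=lambda r: f"[{r[0]}] {r[1]}",
)
print(doc)
```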
Task‑Specific Output Paths
GLM‑OCR separates document parsing from key‑information extraction (KIE). For parsing, the model emits structured Markdown or JSON after region‑level processing. For KIE, a single‑image prompt drives the model to generate a JSON payload containing extracted fields directly, bypassing the layout stage when the task is simple.
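The split between the two output paths amounts to building different prompts for the same model. The prompt strings and field names below are illustrative assumptions, not Zhipu AI's documented schema.

```python
# Hypothetical request builder for the two task-specific output paths.
# Prompt wording and field names are assumptions for illustration.

def build_request(image_url: str, task: str, fields=None) -> dict:
    if task == "parse":
        prompt = "Parse this document into Markdown."
    elif task == "kie":
        prompt = f"Extract {', '.join(fields)} as JSON."
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": image_url, "prompt": prompt}

parse_req = build_request("s3://docs/contract.png", "parse")
kie_req = build_request("s3://docs/invoice.png", "kie",
                        fields=["invoice_no", "total", "tax_id"])
print(kie_req["prompt"])  # Extract invoice_no, total, tax_id as JSON.
```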
Training Process and Benchmark Results
Four‑Stage Training Pipeline
| Stage | Focus | Key Techniques |
|---|---|---|
| 1 | Vision encoder pre‑training | Image‑text pairs, grounding, retrieval |
| 2.1 | Multimodal pre‑training | Doc parsing, VQA, image‑text |
| 2.2 | Multi‑Token Prediction | MTP objective, token‑sharing draft models |
| 3 | Supervised fine‑tuning | OCR, formula transcription, table recovery, KIE |
| 4 | Reinforcement learning | GRPO with task‑specific rewards (Edit Distance, CDM, TEDS, F1) |
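One of the Stage 4 reward signals, Edit Distance, is easy to turn into a normalized reward in [0, 1]. The sketch below uses plain Levenshtein distance; the exact normalization Zhipu AI used in training is an assumption here.

```python
# Normalized edit-distance reward: 1.0 for an exact match, lower as the
# prediction drifts from the reference transcription.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_distance_reward(pred: str, ref: str) -> float:
    if not ref:
        return 1.0 if not pred else 0.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

print(edit_distance_reward("Total: 42", "Total: 42"))            # 1.0
print(round(edit_distance_reward("Tota1: 42", "Total: 42"), 2))  # 0.89
```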
Benchmark Performance
GLM‑OCR was evaluated on a suite of public OCR benchmarks. The model achieved top‑tier scores on most non‑reference datasets, confirming its competitive edge while maintaining a modest footprint.
- OmniDocBench v1.5: 94.6
- OCRBench (Text): 94.0
- UniMERNet: 96.5
- TEDS_TEST: 86.0
- PubTabNet (tables): 85.2 (behind MinerU 2.5’s 88.4)
- Nanonets‑KIE: 93.7 (Gemini‑3‑Pro scores higher but is a reference model)
The results demonstrate that a sub‑billion‑parameter model can rival much larger systems on diverse document‑understanding tasks, a claim supported by the accompanying research paper.
Deployment Options and Pricing
Zhipu AI packages GLM‑OCR as a Model‑as‑a‑Service (MaaS) offering, compatible with popular inference runtimes such as vLLM, SGLang, and Ollama. Fine‑tuning is supported via LLaMA‑Factory, enabling enterprises to adapt the model to domain‑specific vocabularies.
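Since vLLM exposes an OpenAI-compatible server, calling the model could look roughly like the sketch below. The served model name (`glm-ocr`), endpoint URL, and prompt are assumptions; check the model card for the actual identifiers and request format.

```python
# Sketch of querying GLM-OCR via a vLLM OpenAI-compatible endpoint.
# Model name, URL, and prompt wording are assumptions, not documented values.
import base64
import json
import urllib.request

def build_ocr_payload(image_bytes: bytes, prompt: str,
                      model: str = "glm-ocr") -> dict:
    """OpenAI-style chat payload with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

def ocr(image_bytes: bytes,
        endpoint: str = "http://localhost:8000/v1/chat/completions") -> str:
    payload = build_ocr_payload(image_bytes, "Parse this document into Markdown.")
    req = urllib.request.Request(
        endpoint, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```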
The pricing model is straightforward: 0.2 RMB per million tokens. For a typical scanned invoice (≈ 2 KB), the cost translates to less than a cent per document, making large‑scale batch processing economically viable.
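The per-document cost claim is easy to sanity-check from the quoted rate. The 2,000-token invoice size below is an assumption for illustration, not a published figure.

```python
# Sanity check on the article's pricing: 0.2 RMB per million tokens.
PRICE_RMB_PER_MTOKEN = 0.2

def cost_rmb(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_RMB_PER_MTOKEN

# Assume a generous 2,000 tokens for one scanned invoice:
print(f"{cost_rmb(2_000):.4f} RMB per invoice")          # 0.0004 RMB
print(f"{cost_rmb(2_000_000_000):.0f} RMB per 1M invoices")  # 400 RMB
```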
Throughput benchmarks report 0.67 images/s and 1.86 PDF pages/s on a single V100 GPU, suggesting the model can serve near‑real‑time workloads on modest hardware.
Comparison with Competing OCR Models
| Model | Parameters | Key Strengths | Typical Use‑Case |
|---|---|---|---|
| GLM‑OCR (Zhipu AI) | 0.9 B | MTP decoding, two‑stage layout, low latency | Enterprise document pipelines, edge devices |
| Gemini‑3‑Pro (Google) | ≈ 30 B | State‑of‑the‑art vision‑language, high accuracy | Research‑grade, cloud‑only OCR |
| MinerU 2.5 | 1.2 B | Superior table extraction (PubTabNet) | Financial statement analysis |
| Traditional Tesseract | N/A (rule‑based) | Open‑source, lightweight | Simple text PDFs, low‑cost projects |
The table highlights that GLM‑OCR occupies a sweet spot: far smaller than giant LLM‑based OCR services yet more capable than classic rule‑based engines. Its unique MTP and layout‑aware design give it an edge in latency‑sensitive environments such as enterprise AI platforms.
Use‑Case Scenarios Where GLM‑OCR Shines
- Invoice & Receipt Automation: Extract line items, totals, and tax IDs in milliseconds, feeding directly into ERP systems.
- Legal Document Review: Parse contracts with mixed clauses, tables, and signatures, then generate JSON summaries for compliance checks.
- Scientific Paper Indexing: Recognize formulas and tables, enabling searchable knowledge bases for R&D teams.
- Healthcare Records Digitization: Convert scanned patient forms into structured HL7‑compatible JSON without manual data entry.
- Multilingual KIE: Combined with an OpenAI ChatGPT integration, GLM‑OCR can feed extracted fields into a language model for translation or summarization.
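For the invoice and receipt scenario above, a thin validation layer between the model's JSON output and the ERP system catches malformed extractions early. The field names below are assumptions, not a fixed GLM‑OCR schema.

```python
# Illustrative post-processing: validate KIE output before ERP ingestion.
# Field names and types are assumptions for the invoice use case.
import json

REQUIRED = {"invoice_no": str, "total": (int, float), "tax_id": str}

def validate_invoice(payload: str) -> dict:
    record = json.loads(payload)
    for field, typ in REQUIRED.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], typ):
            raise TypeError(f"{field} has wrong type")
    return record

ok = validate_invoice('{"invoice_no": "INV-001", "total": 129.5, "tax_id": "91-XYZ"}')
print(ok["total"])  # 129.5
```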
Companies looking to embed OCR into their existing workflows can leverage the UBOS platform overview to orchestrate GLM‑OCR alongside other AI services such as ElevenLabs AI voice integration for end‑to‑end document‑to‑speech pipelines.
Read the Original Announcement
For a complete technical deep‑dive, see the original MarkTechPost article: Zhipu AI Introduces GLM‑OCR.
Explore Related UBOS Solutions
If you’re evaluating OCR as part of a broader AI strategy, UBOS offers a suite of complementary tools:
- AI OCR solutions – pre‑built pipelines that can ingest GLM‑OCR outputs.
- Multimodal OCR – combines vision and language models for richer extraction.
- OCR technology trends – stay ahead of emerging standards.
- UBOS templates for quick start – jump‑start your OCR workflow with ready‑made templates.
- UBOS portfolio examples – see real‑world deployments of document AI.
- UBOS for startups – affordable plans for early‑stage innovators.
- UBOS solutions for SMBs – scale OCR without breaking the bank.
- UBOS pricing plans – transparent cost structures for AI services.
- UBOS partner program – collaborate on AI solutions and co‑sell.
- AI marketing agents – automate content generation from extracted data.
- Web app editor on UBOS – build custom front‑ends for OCR results.
- Telegram integration on UBOS – push OCR alerts to chat channels.
- ChatGPT and Telegram integration – enable conversational query of extracted data.
Conclusion: GLM‑OCR Sets a New Baseline for Efficient Document Understanding
Zhipu AI’s GLM‑OCR demonstrates that high‑quality OCR no longer requires massive compute budgets. By marrying a compact visual encoder with a language‑centric decoder and introducing Multi‑Token Prediction, the model delivers fast, accurate, and structured extraction suitable for everything from invoice processing to scientific literature mining. Enterprises seeking a scalable, cost‑effective OCR engine should evaluate GLM‑OCR alongside UBOS’s Enterprise AI platform to create end‑to‑end pipelines that turn paper into actionable intelligence.
Ready to modernize your document workflows? Visit the UBOS homepage today, explore the AI OCR solutions, and start a free trial to experience GLM‑OCR’s performance first‑hand.