Baidu Qianfan Team Launches Qianfan-OCR: Revolutionizing Document Intelligence with a 4B-Parameter Unified Model
Introduction
In a significant leap forward for AI automation and business efficiency, the Baidu Qianfan Team has unveiled Qianfan-OCR, a groundbreaking 4-billion-parameter end-to-end document intelligence model. Unlike traditional OCR systems that rely on multi-stage pipelines, Qianfan-OCR integrates document parsing, layout analysis, and understanding within a unified vision-language framework. This innovation not only enhances accuracy and speed but also unlocks advanced functionalities like direct image-to-Markdown conversion, table extraction, and prompt-driven document question answering.
Architectural Innovation Behind Qianfan-OCR
Core Components of the Model
- Vision Encoder (Qianfan-ViT): Utilizes an Any Resolution design, tiling images into 448 x 448 patches, supporting resolutions up to 4K. This approach generates up to 4,096 visual tokens per image, preserving spatial detail vital for small fonts and dense text.
- Cross-Modal Adapter: A lightweight two-layer MLP with GELU activation projects visual embeddings into the language model’s latent space.
- Language Model Backbone (Qwen3-4B): A 4.0B-parameter transformer with 36 layers and a native 32K-token context window, using Grouped-Query Attention (GQA) to reduce memory usage during inference.
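To make the Any Resolution tiling arithmetic concrete, here is a minimal sketch of how such a scheme might budget visual tokens. The 448-pixel tile size and 4,096-token cap come from the description above; the tokens-per-tile value is an illustrative assumption, not a published constant.

```python
import math

TILE = 448          # tile edge length from the Qianfan-ViT description
MAX_TOKENS = 4096   # stated per-image visual-token budget

def tile_grid(width: int, height: int) -> tuple[int, int]:
    """Number of 448x448 tiles (columns, rows) needed to cover the image."""
    return math.ceil(width / TILE), math.ceil(height / TILE)

def visual_tokens(width: int, height: int, tokens_per_tile: int = 256) -> int:
    """Visual tokens for an image, capped at the model's token budget.

    tokens_per_tile is an assumed value for illustration only.
    """
    cols, rows = tile_grid(width, height)
    return min(cols * rows * tokens_per_tile, MAX_TOKENS)

# A 4K-class page (3840x2160) needs a 9x5 tile grid, so the raw
# token count exceeds the budget and is clipped to 4,096.
print(tile_grid(3840, 2160))      # (9, 5)
print(visual_tokens(3840, 2160))  # 4096
```

The cap is why dense 4K pages still fit in a fixed visual-token budget while small fonts keep their spatial detail at tile level.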
Introducing the ‘Layout-as-Thought’ Mechanism
This mechanism lets the model perform an optional "thinking" phase, triggered by <think> tokens, in which it generates a structured layout representation (bounding box coordinates, element classifications, and reading order) before producing the final output.
| Feature | Benefits |
|---|---|
| Explicit Layout Recovery | Restores traditional layout analysis capabilities that end-to-end models often lose, improving element localization and classification. |
| Efficient Representation | Bounding box coordinates encoded as special tokens reduce sequence length by ~50%, enhancing inference speed. |
| Performance Boost | Demonstrates marked improvements on complex documents with heterogeneous layouts such as mixed text, formulas, and graphics. |
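To make the layout-trace idea concrete, the sketch below parses a hypothetical "Layout-as-Thought" trace into elements in reading order. The tag format shown (`<class><box>x1,y1,x2,y2</box>`) is invented for illustration; the actual model encodes coordinates as compact special tokens, which is what yields the ~50% sequence-length reduction noted above.

```python
import re
from dataclasses import dataclass

@dataclass
class LayoutElement:
    cls: str                         # element class, e.g. "title", "table"
    box: tuple[int, int, int, int]   # (x1, y1, x2, y2) in page coordinates

# Hypothetical plain-text trace format, for illustration only.
_PAT = re.compile(r"<(\w+)><box>(\d+),(\d+),(\d+),(\d+)</box>")

def parse_layout(trace: str) -> list[LayoutElement]:
    """Recover element classes and boxes in reading order."""
    return [
        LayoutElement(m.group(1), tuple(map(int, m.groups()[1:])))
        for m in _PAT.finditer(trace)
    ]

trace = "<title><box>40,30,800,90</box><paragraph><box>40,120,800,560</box>"
for el in parse_layout(trace):
    print(el.cls, el.box)
```

A downstream consumer could use such a trace to re-anchor extracted text to its position on the page, restoring the localization ability that pure end-to-end decoding discards.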
Benchmark Performance: Setting New Standards in OCR and Document Understanding
Qianfan-OCR was evaluated on diverse benchmarks against both specialized OCR systems and general vision-language models, posting strong results across document parsing, key information extraction, and document understanding tasks.
Document Parsing and OCR Benchmarks
- OmniDocBench v1.5: Achieved the top score of 93.12, surpassing DeepSeek-OCR-v2 (91.09) and Gemini-3 Pro (90.33).
- OlmOCR Bench: Led the end-to-end model category with a result of 79.8.
- OCRBench: Scored an industry-leading 880, outpacing all tested counterparts.
Key Information Extraction (KIE) Metrics
| Model | Overall Mean (KIE) | OCRBench KIE | Nanonets KIE (F1) |
|---|---|---|---|
| Qianfan-OCR (4B) | 87.9 | 95.0 | 86.5 |
| Qwen3-4B-VL | 83.5 | 89.0 | 83.3 |
| Qwen3-VL-235B-A22B | 84.2 | 94.0 | 83.8 |
| Gemini-3.1-Pro | 79.2 | 96.0 | 76.1 |
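The KIE numbers above are F1-based. As a reminder of what field-level F1 measures, here is a minimal sketch of exact-match scoring over key-value pairs; the actual OCRBench and Nanonets scoring scripts are more involved (e.g. normalization and fuzzy value matching).

```python
def field_f1(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Micro F1 over key-value pairs: a prediction counts as a true
    positive only if both the field name and its value match exactly."""
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"invoice_no": "INV-001", "total": "42.00", "date": "2024-05-01"}
pred = {"invoice_no": "INV-001", "total": "42.00", "date": "2024-05-02"}
print(round(field_f1(pred, gold), 3))  # 0.667: two of three fields correct
```

Two correct fields out of three predicted and three gold gives precision = recall = 2/3, hence F1 of about 0.667.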
Document Understanding Edge
Conventional two-stage OCR plus large language model (LLM) pipelines struggle with spatial reasoning tasks—for example, they falter on CharXiv benchmarks due to discarded visual context critical for understanding charts and graphs. Qianfan-OCR, with its integrated vision-language design, excels in these scenarios, offering more accurate and context-aware document interpretation.
Deployment and Inference Efficiency
- Inference Speed: On a single NVIDIA A100 GPU, Qianfan-OCR reaches 1.024 pages per second (PPS) with W8A8 quantization, roughly double the throughput of the W16A16 baseline.
- GPU-Centric Architecture: Avoids CPU layout processing bottlenecks inherent to pipeline systems, enabling large-batch, low-latency inference ideal for high-volume document processing.
Why Qianfan-OCR Matters for AI Automation and Business Efficiency
By consolidating multiple steps of traditional OCR workflows into a single powerful model, Qianfan-OCR significantly streamlines document processing pipelines. This not only reduces operational complexity and costs but also enhances accuracy, speed, and scalability—essential qualities for enterprises embracing AI automation to drive smarter workflows and increased business efficiency.
Organizations can now extract rich semantic information and maintain spatial document context, enabling more informed decision-making from diverse document types such as legal forms, scientific papers, and financial reports.
Conclusion
Baidu’s Qianfan-OCR presents a compelling advancement in unified document intelligence, pushing the boundaries of current OCR technologies through its innovative architecture and state-of-the-art performance. For businesses seeking to harness AI-powered automation for document workflows, Qianfan-OCR offers a scalable, efficient, and highly accurate solution that integrates seamlessly into modern enterprise systems.
For more technical insights and to explore Qianfan-OCR, visit the official paper and their Hugging Face repository.
Looking for custom AI automation for your business? Connect with me at https://amr-abdeldaym.netlify.app/