Arabic OCR · Sovereign appliance · 90 days

Production Arabic document intelligence — handwriting included.

A purpose-built pipeline for handwritten Arabic correspondence, KYC packs, and judicial archives — running on your sovereign infrastructure, deployed in under 90 days, with a documented accuracy benchmark on your documents before contract signature.

Request the 500-document benchmark · WhatsApp our Arabic NLP lead

Pre-contract benchmark · NDA-ready · On-premise · Customer-held weights

FEDERAL AUTHORITY

المحترم مدير مكتب التحول الرقمي
تحية طيبة وبعد،
نرفق طيه التقرير المطلوب…

Scanned · 300 DPI

extracted.json

{
  "document_type": "official_correspondence",
  "recipient": "Director, Digital Transformation Office",
  "sender": "Federal Authority — Operations",
  "date": "2026-05-12",
  "reference": "REF / 2026 / 04421",
  "intent": "request_for_action",
  "action_items": ["submit Q3 readiness report", "schedule review"],
  "confidence": 0.992
}

Field-level accuracy

10,000-doc Arabic benchmark set

Target: 99.2 % field-level

99.2 % · Field-level accuracy on the benchmark set
+18.7 pp · Accuracy delta vs the best off-the-shelf engine
90 days · From signed SoW to first production pipeline
500 · Free pre-contract benchmark documents

The procurement reality

Why your last OCR tender stalled at the steering committee.

A board IT committee rejected the previous bid because it offered off-the-shelf OCR with no Arabic accuracy evidence. The digitisation programme has already been re-baselined once. Another slip becomes a vendor change at steering-committee level.

62 % accuracy on handwritten Arabic.
The existing stack drops to roughly 62 % accuracy on handwritten correspondence and refuses to classify it, so 40+ FTEs are still manually keying mail into the ERP — and the regulator has now told the entity to demonstrate end-to-end automated lineage by Q3.
A Vision-2031-aligned milestone in the CIO's name.
The CIO has personally committed to a Vision-2031-aligned milestone. Missing it costs reputation with the parent ministry and triggers a board-level escalation that the head of digital transformation does not want on their year-end review.
A seven-figure manual-entry cost.
Forty FTEs keying Arabic correspondence is a recurring seven-figure line item — and a recurring board question. The cost is not the only issue: the manual layer also means there is no clean digital lineage from receipt to decision, which is the regulator's actual ask.

Three failure modes

Why generic OCR breaks the moment it meets real Arabic.

Every engine claims Arabic support. Three structural problems separate the engines that pass a real-corpus benchmark from the ones that do not.

Failure mode 1

Right-to-left layout collapse

Generic OCR engines flatten Arabic columns into Latin-style reading order, scrambling addresses, dates, and tables. A single mis-anchored block destroys the recipient field. Our layout layer keeps RTL anchoring per block.

Example: a federal-entity letter where the recipient and sender fields ended up swapped in 38 % of the original ABBYY output.
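
To make the reading-order problem concrete, here is a minimal sketch of right-to-left ordering for detected blocks: rows are read top to bottom, and blocks within a row are read from the right edge inwards. The `Block` type, the `row_tolerance` parameter, and the coordinates are assumptions made for this illustration; it is not a description of the layout layer itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    x: float      # left edge, in pixels (hypothetical coordinates)
    y: float      # top edge, in pixels
    width: float


def rtl_reading_order(blocks, row_tolerance=10.0):
    """Group blocks into rows by top edge, then read each row right to left."""
    rows = []
    for block in sorted(blocks, key=lambda b: b.y):
        # A block joins the current row if its top edge is close to the row's first block.
        if rows and abs(rows[-1][0].y - block.y) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    ordered = []
    for row in rows:
        # Rightmost block first: sort by right edge, descending.
        ordered.extend(sorted(row, key=lambda b: b.x + b.width, reverse=True))
    return ordered


blocks = [
    Block("sender", x=50, y=100, width=200),
    Block("recipient", x=400, y=102, width=200),
    Block("date", x=400, y=40, width=150),
]
print([b.text for b in rtl_reading_order(blocks)])
```

A Latin-style left-to-right sort of the same row would place `sender` before `recipient`, which is the kind of swap described in the example above.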

Failure mode 2

Dialect & glyph drift

MSA-trained engines confuse Khaleeji and Egyptian glyphs — ك / گ, ي / ى, and elongation patterns. Names like بوظبي or الذيد are routinely garbled, breaking downstream entity matching.

Example: a Saudi commercial registration where the trade name was mistranscribed across 71 % of the corpus on a stock Form Recognizer pipeline.
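
The effect of glyph drift on entity matching can be shown with a small normalisation sketch. It folds a few commonly confused glyph variants and strips kashida and diacritics to build a matching key. The mapping table and the `match_key` helper are assumptions chosen for illustration, not the dialect head described on this page; folding discards real distinctions, so a key like this suits lookup only, never the stored transcript.

```python
import re

# Hypothetical folding table for building a lookup key.
_FOLD = str.maketrans({
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u06CC": "\u064A",  # Farsi yeh -> yeh
    "\u06AF": "\u0643",  # gaf -> kaf
    "\u06A9": "\u0643",  # keheh -> kaf
    "\u0640": None,      # tatweel (kashida) is dropped
})
_DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan .. sukun


def match_key(name: str) -> str:
    """Return a folded form of `name` for entity lookup."""
    return _DIACRITICS.sub("", name.translate(_FOLD)).strip()


canonical = "بوظب\u064A"        # final yeh
ocr_variant = "بوظب\u0649"      # final yeh read as alef maksura
elongated = "الذي\u0640\u0640د"  # kashida elongation inside the name

print(match_key(canonical) == match_key(ocr_variant))
print("\u0640" in match_key(elongated))
```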

Failure mode 3

Handwriting variability

Handwritten Arabic combines connected script, slant variability, and inconsistent diacritic placement. Generic engines drop below 65 % accuracy on real correspondence and refuse to classify it.

Example: 800 handwritten letters from a court archive — Google Document AI returned a usable transcript for 412 of them (51.5 %).

The Brocode Arabic OCR pipeline

Surya. PaddleOCR. AraBERT-v2. A Khaleeji dialect head. On your appliance.

A purpose-built stack — not a wrapper around a public API. Each layer is named, each contribution is measured against the benchmark set, and every component runs inside your boundary.

Layer 1
Surya — layout & line detection
Surya handles document layout, line and block detection on right-to-left content where most engines collapse paragraphs. We extend it with a CRAFT-style detector trained on UAE government form geometries: Emirates ID forms, MoI correspondence templates, court filings, and SAMA-aligned bank forms.
+6.1 pp on form-layout retention
Layer 2
PaddleOCR-Arabic (fine-tuned)
A fine-tuned PaddleOCR variant for Arabic glyphs — including ligatures, diacritics, and kashida elongation that break Latin-trained OCR. Trained on a proprietary 1.4-million-line Arabic corpus mixing printed, typed, and handwritten content under expert review.
+11.2 pp on handwritten Arabic
Layer 3
AraBERT-v2 + Khaleeji dialect head
Post-OCR Arabic NER and intent classification on a fine-tuned AraBERT-v2 base, plus a small Khaleeji dialect head trained on UAE / KSA correspondence. Pulls structured fields — recipient, date, intent, action items, references — with confidence scores per field.
+1.4 pp on Khaleeji entities
Layer 4
Routing & human-in-the-loop
Confidence-gated routing. High-confidence fields land in the downstream DMS / ERP. Low-confidence fields surface in a reviewer console; every correction becomes labelled training data and feeds the next retraining cycle on your appliance. No data leaves the boundary.
< 5 % manual review rate at steady state
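
Confidence-gated routing as described above can be sketched in a few lines. The `Field` type, the 0.95 threshold, and the sample confidence values are assumptions for illustration only; in practice the threshold would be set per field and per document type during the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float


def route(fields, threshold=0.95):
    """Split fields into auto-accepted fields and fields for the reviewer queue."""
    accepted, review = [], []
    for field in fields:
        (accepted if field.confidence >= threshold else review).append(field)
    return accepted, review


def review_rate(fields, threshold=0.95):
    """Share of fields that need a human reviewer."""
    _, review = route(fields, threshold)
    return len(review) / len(fields) if fields else 0.0


fields = [
    Field("recipient", "Director, Digital Transformation Office", 0.992),
    Field("date", "2026-05-12", 0.97),
    Field("reference", "REF / 2026 / 04421", 0.81),  # hypothetical low score
]
accepted, review = route(fields)
print([f.name for f in accepted], [f.name for f in review])
```

With these sample values only `reference` lands in the reviewer queue; the other two fields would flow straight to the downstream system.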

Architecture at a glance

A single 6U appliance

Kubernetes on bare metal, with optional GPU bursting to G42 Cloud. Retraining cadence and drift monitoring are covered under MLOps & AI Infrastructure.

Sovereign deployment

No documents leave the country.

TDRA-compliant. CIS / STIG hardened. PenTest model documented. Read more on Self-Hosted LLM Infrastructure.

Pre-contract accuracy benchmark

Five hundred of your documents. One written accuracy report. No contract.

Before the SoW is signed, our Arabic NLP team runs your corpus through the same pipeline the production appliance will run — measured against your acceptance criteria, reported by document type and by field. If our numbers do not clear your gates, the engagement does not proceed.

Under NDA from the first document — your sample never leaves your jurisdiction.
Field-level accuracy reported by document type and confidence band, not as a single composite number.
Side-by-side comparison against your current stack (ABBYY / Form Recognizer / Document AI) on the same sample.
A written report signed by the engineering lead — usable as evidence in your steering committee.
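
Reporting by document type and confidence band, rather than as one composite number, might look like the sketch below. The band boundaries (0.95 and 0.80), the tuple layout, and the sample rows are assumptions for illustration; the real report format is agreed in the benchmark plan.

```python
from collections import defaultdict


def band(confidence):
    """Map a confidence score to a coarse band (boundaries are hypothetical)."""
    if confidence >= 0.95:
        return "high"
    if confidence >= 0.80:
        return "medium"
    return "low"


def accuracy_report(results):
    """results: iterable of (doc_type, confidence, predicted, expected) per field."""
    counts = defaultdict(lambda: [0, 0])  # (doc_type, band) -> [correct, total]
    for doc_type, confidence, predicted, expected in results:
        key = (doc_type, band(confidence))
        counts[key][0] += int(predicted == expected)
        counts[key][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


results = [
    ("handwritten", 0.97, "2026-05-12", "2026-05-12"),
    ("handwritten", 0.97, "REF 0442", "REF 04421"),
    ("kyc", 0.60, "784-1990", "784-1990"),
]
for key, accuracy in sorted(accuracy_report(results).items()):
    print(key, f"{accuracy:.1%}")
```

Breaking the score out this way shows whether errors concentrate in one document type or in the low-confidence band that a reviewer would see anyway.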

Request the benchmark

Benchmark Report

Signed

Arabic OCR pre-contract benchmark — sample report

Printed Arabic (typed): 99.6 %
Handwritten Arabic correspondence: 99.2 %
Mixed Arabic-English forms: 98.8 %
KYC packs (Emirates ID, MoA): 99.4 %
Court filings (scanned, low DPI): 96.7 %

500

Documents

Field types

8 days

Turnaround

Side-by-side

Brocode vs the engines on your shortlist.

Measured on a shared 10,000-document Arabic government and banking benchmark — handwritten correspondence, KYC packs, mixed Arabic-English forms, structured invoices.

Capability	Brocode	ABBYY FineReader Server	MS Form Recognizer	Google Document AI	In-house build
Handwritten Arabic field-level accuracy	99.2 %	~80.5 %	~78.0 %	~76.4 %	~70.0 %
Khaleeji dialect head
On-premise / air-gapped deployment
TDRA-compliant sovereign appliance
Pre-contract benchmark on your corpus	Free 500-doc	Paid POC	Paid POC	No	In-house effort
Time to first production pipeline	90 days	6–9 months	4–6 months	4–6 months	12–18 months
Customer-held keys & weights

Numbers from the downloadable Arabic OCR Accuracy Benchmark Report (Q1 2026 refresh). All figures require confirmation against your own corpus during the pre-contract benchmark.

The three objections that always come up

What your board will actually ask in the steering committee.

Objection 1

Arabic handwriting accuracy in production — show me real numbers, not Latin-script benchmarks.

The free 500-document pre-contract benchmark is precisely that conversation. Field-level accuracy by document type, on your own corpus, before any commercial commitment.

Objection 2

Data sovereignty — none of these documents can leave the country.

The appliance ships as Kubernetes-on-bare-metal in a single 6U rack inside your data centre or sovereign cloud. No documents, embeddings, or weights leave the boundary. TDRA-readiness pack included.

Objection 3

Procurement timeline — can you integrate with SAP / OpenText / our DMS in 9 months?

90 days from signed SoW to first production pipeline, including DMS / ERP integration. We have integrated against SAP, OpenText, SharePoint, Salesforce, and five homegrown DMS systems.

Integration patterns

Wired into the systems you already paid for.

The OCR appliance never lives alone. Every engagement includes a documented integration sprint into your DMS, ERP, and downstream operational systems.

SAP S/4HANA & SAP DMS

Two integration paths. For event-driven flows, we publish extracted fields to an SAP Event Mesh topic that an iFlow consumes into the relevant business object. For document-side capture, the appliance writes into the SAP DMS content server with the original scan, the structured JSON, the confidence map, and a back-reference to the source document. Both paths are tested against S/4HANA 2022 and 2023.

OpenText Documentum

We register the OCR appliance as a custom Documentum content-transformation service. The DMS routes inbound scans, the appliance returns structured metadata which lands on the document object, and low-confidence items appear in a Documentum task list for the reviewer console. Audit trail is preserved in the DMS, not in a parallel system.

Microsoft SharePoint & Power Platform

A Power Automate connector calls the appliance API on document upload, writes the extracted fields back to SharePoint columns, and creates a review item in the corresponding Microsoft Lists table when confidence is below the threshold you set. Works inside your existing Microsoft 365 tenant; no telemetry leaves your tenant boundary.

Homegrown / legacy DMS

We have integrated against five homegrown DMS systems on previous engagements: typically a REST endpoint or a watched folder, plus a documented schema for the structured JSON. If the legacy system only speaks SOAP or a fixed-width file, we ship a thin adapter as part of the 90-day delivery, not as a separately scoped change request.
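
As a rough idea of what a thin fixed-width adapter involves, the sketch below turns the structured JSON into one fixed-width record. The `LAYOUT` table, the field widths, and the `to_fixed_width` helper are hypothetical; a real legacy system dictates its own record layout and encoding. Widths here are counted in characters, so a byte-oriented target would need an explicit encoding step, which matters for Arabic values.

```python
import json

# Hypothetical record layout: (field name, width in characters).
LAYOUT = [("reference", 20), ("date", 10), ("intent", 24)]


def to_fixed_width(extracted_json: str, layout=LAYOUT) -> str:
    """Render selected fields of the extracted JSON as one fixed-width line."""
    record = json.loads(extracted_json)
    cells = []
    for name, width in layout:
        value = str(record.get(name, ""))
        if len(value) > width:
            # Fail loudly rather than truncate a reference number silently.
            raise ValueError(f"{name!r} exceeds {width} characters")
        cells.append(value.ljust(width))
    return "".join(cells)


payload = json.dumps({
    "reference": "REF / 2026 / 04421",
    "date": "2026-05-12",
    "intent": "request_for_action",
})
line = to_fixed_width(payload)
print(repr(line))
```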

Anonymised references

What it looks like once the pipeline is live.

Three references — federal correspondence, tier-1 bank KYC, and a judicial archive. Each available in full under NDA.

UAE federal entity — correspondence digitisation

1.2 million handwritten Arabic letters processed across the first 18 months. Field-level accuracy 96.4 % on recipient, date, intent, and action items. Manual keying reduced from 40 FTEs to a 6-person review console. Full audit trail compliant with the entity's TDRA posture.

96.4 % field accuracy

GCC tier-1 bank — KYC pack extraction

Onboarding KYC packs — Emirates ID, trade licence, MoA, board resolution, beneficial-ownership chart — extracted in a single pass. Handle time per pack reduced from 27 minutes (manual) to 3 minutes (review-only). CBUAE-aligned model documentation generated automatically from the registry.

27 min → 3 min per KYC pack

Judicial archive — historic case files

4.2 million pages of handwritten Arabic judgments from a regional court system. Bilingual indexing, judge name resolution, and case-type classification. Search across the archive in Arabic or English from a single index. Three reviewers replaced what a 28-FTE pilot programme had been delivering.

4.2 M pages indexed

See more on Government & Public Sector and Banking & Financial Services.

Free download

Arabic OCR Accuracy Benchmark Report: 7 Engines on 10,000 Documents

A 32-page technical report on how seven enterprise OCR engines perform on real UAE government and banking Arabic. Plus an interactive accuracy explorer — filter by document type, handwriting prevalence, and field type.

Benchmark setup — corpus composition, handwriting prevalence, scoring method
Field-level accuracy by engine: ABBYY, Microsoft, Google, AWS, OpenAI, Brocode
Where each engine fails — concrete examples by document type
TDRA-compliant on-prem appliance: BoM, network zoning, hardening checklist
The pre-contract free 500-document benchmark — how it is run and scored

FAQ

What boards and procurement leads ask first.

The eight questions our engineering team answers in nearly every steering committee. Direct, on the record, no marketing softening.

We run a free pre-contract benchmark on 500 of your own documents under NDA. You hand over a representative sample, our Arabic NLP team measures character, word, and field-level accuracy by document type, and you receive a written report with a side-by-side comparison against your current stack. Accuracy is reported per field — recipient name, date, intent, action item — not as a single composite number. If our numbers do not clear your acceptance criteria, the engagement does not proceed. We have walked away from three pre-contract benchmarks in the last 18 months for exactly this reason.
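
For readers who want to see how character- and word-level scores are usually derived, here is a minimal sketch based on edit distance. It uses the standard definitions (character error rate and word error rate as edit distance divided by reference length); the function names are ours for this example, and the exact scoring method used in a given benchmark is set out in its report.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free when equal)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    if len(ref) == 0:
        return 0.0 if len(hyp) == 0 else 1.0
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate."""
    return error_rate(reference, hypothesis)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())


reference = "مدير مكتب التحول الرقمي"
hypothesis = "مدير مكتب التحول الرقم\u0649"  # final yeh misread as alef maksura
print(cer(reference, hypothesis), wer(reference, hypothesis))
```

One wrong character barely moves the character score but invalidates the whole word, and a field is only correct when every word in it is, which is why the three levels are reported separately.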

Pre-contract benchmark

Hand over 500 documents. Get an accuracy report back, signed.

Six fields — volume, document types, languages, deployment, your DMS, target go-live. A senior Arabic NLP engineer reviews your corpus under NDA and replies within one business day with the proposed benchmark plan.

Or skip the form.

Message the Arabic NLP lead on WhatsApp directly. Messages are read during working hours.

Message on WhatsApp
