Arabic OCR · Sovereign appliance · 90 days
Production Arabic document intelligence — handwriting included.
A purpose-built pipeline for handwritten Arabic correspondence, KYC packs, and judicial archives — running on your sovereign infrastructure, deployed in under 90 days, with a documented accuracy benchmark on your documents before contract signature.
Pre-contract benchmark · NDA-ready · On-premise · Customer-held weights
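المحترم مدير مكتب التحول الرقمي (To the esteemed Director of the Digital Transformation Office)
تحية طيبة وبعد، (Greetings,)
نرفق طيه التقرير المطلوب… (Please find enclosed the requested report…)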
Scanned · 300 DPI
Field-level accuracy
10,000-doc Arabic benchmark set
Target: 99.2 % field-level
99.2 %
Field-level accuracy on the benchmark set
+18.7 pp
Accuracy delta vs the best off-the-shelf engine
90 days
From signed SoW to first production pipeline
500
Free pre-contract benchmark documents
The procurement reality
Why your last OCR tender stalled at the steering committee.
A board IT committee has rejected the previous bid because it was off-the-shelf OCR with no Arabic accuracy evidence. The digitisation programme has already been re-baselined once. Another slip becomes a vendor change at steering-committee level.
62 % accuracy on handwritten Arabic.
The existing stack drops to roughly 62 % accuracy on handwritten correspondence and refuses to classify it, so 40+ FTEs are still manually keying mail into the ERP — and the regulator has now told the entity to demonstrate end-to-end automated lineage by Q3.
A Vision-2031-aligned milestone in the CIO's name.
The CIO has personally committed to a Vision-2031-aligned milestone. Missing it costs reputation with the parent ministry and triggers a board-level escalation that the head of digital transformation does not want on their year-end review.
A seven-figure manual-entry cost.
Forty FTEs keying Arabic correspondence is a recurring seven-figure line item — and a recurring board question. The cost is not the only issue: the manual layer also means there is no clean digital lineage from receipt to decision, which is the regulator's actual ask.
Three failure modes
Why generic OCR breaks the moment it meets real Arabic.
Every engine claims Arabic support. Three structural problems separate the engines that pass a real-corpus benchmark from the ones that do not.
Failure mode 1
Right-to-left layout collapse
Generic OCR engines flatten Arabic columns into Latin-style reading order, scrambling addresses, dates, and tables. A single mis-anchored block destroys the recipient field. Our layout layer keeps RTL anchoring per block.
Example: a federal-entity letter where the recipient and sender fields ended up swapped in 38 % of the original ABBYY output.
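To make the reading-order problem concrete, here is a minimal sketch that orders detected text blocks top to bottom and, within each row, right to left. The `Block` type, the `row_tolerance` parameter, and the coordinates are hypothetical. This is not the Surya API or the production layout layer, only an illustration of why left-to-right ordering swaps fields on an Arabic page.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box
    w: float  # width of the bounding box

def rtl_reading_order(blocks, row_tolerance=10.0):
    """Group blocks into rows by their top edge, then read each row right to left."""
    rows = []
    for b in sorted(blocks, key=lambda b: b.y):
        # A block joins the current row if its top edge is close to the row's first block.
        if rows and abs(rows[-1][0].y - b.y) <= row_tolerance:
            rows[-1].append(b)
        else:
            rows.append([b])
    ordered = []
    for row in rows:
        # Rightmost block first: sort by right edge, descending.
        ordered.extend(sorted(row, key=lambda b: b.x + b.w, reverse=True))
    return ordered

# Made-up page: a date line on top, then recipient (right) and sender (left) on one row.
blocks = [
    Block("sender", x=50, y=100, w=200),
    Block("recipient", x=400, y=102, w=200),
    Block("date", x=400, y=40, w=150),
]
order = [b.text for b in rtl_reading_order(blocks)]
```

A Latin-style ordering would sort the second row by ascending x and emit the sender before the recipient, which is the kind of swap described in the example above.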
Failure mode 2
Dialect & glyph drift
MSA-trained engines confuse Khaleeji and Egyptian glyphs — ك / گ, ي / ى, and elongation patterns. Names like بوظبي or الذيد are routinely garbled, breaking downstream entity matching.
Example: a Saudi commercial registration where the trade name was mistranscribed across 71 % of the corpus on a stock Form Recognizer pipeline.
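One common way to keep downstream entity matching tolerant of such glyph variants is to fold them to a canonical form before comparison. The sketch below is an assumption for illustration, not the pipeline's documented method: the folding table and the `match_key` helper are hypothetical, and the key is meant for matching only while the stored transcript keeps the original glyphs.

```python
import re

# Hypothetical folding table, used only to build a comparison key.
FOLD = str.maketrans({
    "\u06AF": "\u0643",  # گ (gaf) -> ك (kaf)
    "\u06A9": "\u0643",  # ک (keheh) -> ك (kaf)
    "\u0649": "\u064A",  # ى (alef maksura) -> ي (yeh)
    "\u06CC": "\u064A",  # ی (Farsi yeh) -> ي (yeh)
    "\u0640": None,      # tatweel (kashida elongation) is dropped
})

# Short-vowel and related marks, U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")

def match_key(text):
    """Return a folded form of `text` suitable for fuzzy entity lookup."""
    return DIACRITICS.sub("", text.translate(FOLD)).strip()

# الذيد written with and without kashida elongation yields the same key.
plain = "\u0627\u0644\u0630\u064A\u062F"
elongated = "\u0627\u0644\u0630\u064A\u0640\u0640\u062F"
same = match_key(plain) == match_key(elongated)
```

Folding is lossy by design (ى and ي can distinguish different words), so it belongs in the matching layer rather than in the transcript itself.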
Failure mode 3
Handwriting variability
Handwritten Arabic combines connected script, slant variability, and inconsistent diacritic placement. Generic engines drop below 65 % accuracy on real correspondence and refuse to classify it.
Example: 800 handwritten letters from a court archive — Google Document AI returned a usable transcript for 412 of them.
The Brocode Arabic OCR pipeline
Surya. PaddleOCR. AraBERT-v2. A Khaleeji dialect head. On your appliance.
A purpose-built stack — not a wrapper around a public API. Each layer is named, each contribution is measured against the benchmark set, and every component runs inside your boundary.
Layer 1
Surya — layout & line detection
Surya handles document layout, line and block detection on right-to-left content where most engines collapse paragraphs. We extend it with a CRAFT-style detector trained on UAE government form geometries: Emirates ID forms, MoI correspondence templates, court filings, and SAMA-aligned bank forms.
+6.1 pp on form-layout retention
Layer 2
PaddleOCR-Arabic (fine-tuned)
A fine-tuned PaddleOCR variant for Arabic glyphs — including ligatures, diacritics, and kashida elongation that break Latin-trained OCR. Trained on a proprietary 1.4-million-line Arabic corpus mixing printed, typed, and handwritten content under expert review.
+11.2 pp on handwritten Arabic
Layer 3
AraBERT-v2 + Khaleeji dialect head
Post-OCR Arabic NER and intent classification on a fine-tuned AraBERT-v2 base, plus a small Khaleeji dialect head trained on UAE / KSA correspondence. Pulls structured fields — recipient, date, intent, action items, references — with confidence scores per field.
+1.4 pp on Khaleeji entities
Layer 4
Routing & human-in-the-loop
Confidence-gated routing. High-confidence fields land in the downstream DMS / ERP. Low-confidence fields surface in a reviewer console; every correction becomes labelled training data and feeds the next retraining cycle on your appliance. No data leaves the boundary.
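Confidence-gated routing can be sketched in a few lines. The `Field` type, the `route` function, the 0.90 threshold, and the sample values are all hypothetical; the real threshold is set per field and per document type during the engagement.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    confidence: float  # model confidence in the range 0.0 to 1.0

def route(fields, threshold=0.90):
    """Split fields into auto-accepted ones and ones queued for human review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review

# Made-up extraction result for one letter.
fields = [
    Field("recipient", "مدير مكتب التحول الرقمي", 0.97),
    Field("date", "2026-01-14", 0.99),
    Field("intent", "report_submission", 0.71),
]
accepted, review = route(fields)
```

Here `accepted` would go on to the DMS / ERP and `review` would surface in the reviewer console, where each correction can be stored alongside the original prediction as a labelled training pair.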
< 5 % manual review rate at steady state
Architecture at a glance
A single 6U appliance
Kubernetes on bare metal, with optional GPU bursting to G42 Cloud. Retraining cadence and drift monitoring are covered by MLOps & AI Infrastructure.
Sovereign deployment
No documents leave the country.
TDRA-compliant. CIS / STIG hardened. PenTest model documented. Read more on Self-Hosted LLM Infrastructure.
Pre-contract accuracy benchmark
Five hundred of your documents. One written accuracy report. No contract.
Before the SoW is signed, our Arabic NLP team runs your corpus through the same pipeline the production appliance will run — measured against your acceptance criteria, reported by document type and by field. If our numbers do not clear your gates, the engagement does not proceed.
- Under NDA from the first document — your sample never leaves your jurisdiction.
- Field-level accuracy reported by document type and confidence band, not as a single composite number.
- Side-by-side comparison against your current stack (ABBYY / Form Recognizer / Document AI) on the same sample.
- A written report signed by the engineering lead — usable as evidence in your steering committee.
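As an illustration of what "by document type and confidence band" means in the report, the sketch below aggregates exact-match field accuracy per (document type, field, band). The band cut-offs, record layout, and sample records are assumptions for the example, not the scoring method used in the actual benchmark.

```python
from collections import defaultdict

def band(confidence):
    """Map a confidence score to a coarse band (hypothetical cut-offs)."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.7:
        return "medium"
    return "low"

def accuracy_report(records):
    """records: iterable of (doc_type, field, predicted, truth, confidence)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, seen]
    for doc_type, field, predicted, truth, confidence in records:
        key = (doc_type, field, band(confidence))
        totals[key][0] += int(predicted == truth)
        totals[key][1] += 1
    return {key: correct / seen for key, (correct, seen) in totals.items()}

# Made-up sample: two letter dates (one wrong) and one KYC trade name.
records = [
    ("letter", "date", "2026-01-14", "2026-01-14", 0.98),
    ("letter", "date", "2026-01-15", "2026-01-16", 0.95),
    ("kyc", "trade_name", "ACME", "ACME", 0.75),
]
report = accuracy_report(records)
```

Reporting per key rather than as one composite number makes it visible when a high overall score hides a weak field or document type.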
Benchmark Report
Signed
Arabic OCR pre-contract benchmark: sample report
500
Documents
14
Field types
8 days
Turnaround
Side-by-side
Brocode vs the engines on your shortlist.
Measured on a shared 10,000-document Arabic government and banking benchmark — handwritten correspondence, KYC packs, mixed Arabic-English forms, structured invoices.
| Capability | Brocode | ABBYY FineReader Server | MS Form Recognizer | Google Document AI | In-house build |
|---|---|---|---|---|---|
| Handwritten Arabic field-level accuracy (measured on the shared 10,000-document Arabic government / banking benchmark) | 99.2 % | ~80.5 % | ~78.0 % | ~76.4 % | ~70.0 % |
| Khaleeji dialect head | | | | | |
| On-premise / air-gapped deployment | | | | | |
| TDRA-compliant sovereign appliance | | | | | |
| Pre-contract benchmark on your corpus | Free 500-doc | Paid POC | Paid POC | No | In-house effort |
| Time to first production pipeline | 90 days | 6–9 months | 4–6 months | 4–6 months | 12–18 months |
| Customer-held keys & weights | | | | | |
Numbers from the downloadable benchmark report (Q1 2026 refresh). All figures require confirmation against your own corpus during the pre-contract benchmark.
The three objections that always come up
What your board will actually ask in the steering committee.
Objection 1
Arabic handwriting accuracy in production — show me real numbers, not Latin-script benchmarks.
The free 500-document pre-contract benchmark is precisely that conversation. Field-level accuracy by document type, on your own corpus, before any commercial commitment.
Objection 2
Data sovereignty — none of these documents can leave the country.
The appliance ships as Kubernetes-on-bare-metal in a single 6U rack inside your data centre or sovereign cloud. No documents, embeddings, or weights leave the boundary. TDRA-readiness pack included.
Objection 3
Procurement timeline — can you integrate with SAP / OpenText / our DMS in 9 months?
90 days from signed SoW to first production pipeline, including DMS / ERP integration. We have integrated against SAP, OpenText, SharePoint, Salesforce, and five homegrown document management systems.
Integration patterns
Wired into the systems you already paid for.
The OCR appliance never lives alone. Every engagement includes a documented integration sprint into your DMS, ERP, and downstream operational systems.
SAP S/4HANA & SAP DMS
Two integration paths. For event-driven flows, we publish extracted fields to an SAP Event Mesh topic that an iFlow consumes into the relevant business object. For document-side capture, the appliance writes into the SAP DMS content server with the original scan, the structured JSON, the confidence map, and a back-reference to the source document. Both paths are tested against S/4HANA 2022 and 2023.
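For the document-side path, the record written next to each scan could look roughly like the sketch below. The key names, the `build_capture_payload` helper, the document ID, and the file path are hypothetical; the actual schema is agreed and documented during the integration sprint.

```python
import json

def build_capture_payload(source_doc_id, scan_path, fields):
    """fields: dict mapping field name -> (value, confidence)."""
    return {
        "source_document": source_doc_id,  # back-reference to the source document
        "scan": scan_path,                 # pointer to the original scan
        "fields": {name: value for name, (value, _) in fields.items()},
        "confidence": {name: conf for name, (_, conf) in fields.items()},
    }

# Made-up example values.
payload = build_capture_payload(
    "DOC-0001",
    "scans/DOC-0001.tiff",
    {
        "recipient": ("Digital Transformation Office", 0.97),
        "date": ("2026-01-14", 0.99),
    },
)
# ensure_ascii=False keeps Arabic field values readable in the stored JSON.
body = json.dumps(payload, ensure_ascii=False, indent=2)
```

Keeping the confidence map separate from the field values lets a consumer that only needs the values ignore it, while the reviewer console can read both.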
OpenText Documentum
We register the OCR appliance as a custom Documentum content-transformation service. The DMS routes inbound scans, the appliance returns structured metadata which lands on the document object, and low-confidence items appear in a Documentum task list for the reviewer console. Audit trail is preserved in the DMS, not in a parallel system.
Microsoft SharePoint & Power Platform
A Power Automate connector calls the appliance API on document upload, writes the extracted fields back to SharePoint columns, and creates a review item in the corresponding Microsoft Lists table when confidence is below the threshold you set. Works inside your existing Microsoft 365 tenant; no telemetry leaves your tenant boundary.
Homegrown / legacy DMS
We have integrated against five homegrown DMS systems on previous engagements: typically a REST endpoint or a watched folder, plus a documented schema for the structured JSON. If the legacy system only speaks SOAP or a fixed-width file, we ship a thin adapter as part of the 90-day delivery — not a separately-scoped change request.
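A thin adapter for a fixed-width target can be very small. The sketch below assumes a hypothetical three-column layout with widths counted in characters; a real layout would come from the legacy system's specification, including its encoding and whether widths are measured in characters or bytes (which matters for Arabic text).

```python
# Hypothetical record layout: (field name, width in characters).
LAYOUT = [("doc_id", 12), ("date", 10), ("recipient", 40)]

def to_fixed_width(record, layout=LAYOUT):
    """Render a dict as one fixed-width line: pad short values, truncate long ones."""
    parts = []
    for name, width in layout:
        value = str(record.get(name, ""))
        parts.append(value[:width].ljust(width))
    return "".join(parts)

# Made-up record taken from the structured JSON output.
line = to_fixed_width({
    "doc_id": "DOC-0001",
    "date": "2026-01-14",
    "recipient": "Digital Transformation Office",
})
```

Truncation is silent in this sketch; a production adapter would more likely flag over-long values for review than cut them.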
Anonymised references
What it looks like once the pipeline is live.
Three references — federal correspondence, tier-1 bank KYC, and a judicial archive. Each available in full under NDA.
UAE federal entity — correspondence digitisation
1.2 million handwritten Arabic letters processed across the first 18 months. Field-level accuracy 96.4 % on recipient, date, intent, and action items. Manual keying reduced from 40 FTEs to a 6-person review console. Full audit trail compliant with the entity's TDRA posture.
96.4 % field accuracy
GCC tier-1 bank — KYC pack extraction
Onboarding KYC packs — Emirates ID, trade licence, MoA, board resolution, beneficial-ownership chart — extracted in a single pass. Handle time per pack reduced from 27 minutes (manual) to 3 minutes (review-only). CBUAE-aligned model documentation generated automatically from the registry.
27 min → 3 min per KYC pack
Judicial archive — historic case files
4.2 million pages of handwritten Arabic judgments from a regional court system. Bilingual indexing, judge name resolution, and case-type classification. Search across the archive in Arabic or English from a single index. Three reviewers replaced what a 28-FTE pilot programme had been delivering.
4.2 M pages indexed
See more on Government & Public Sector and Banking & Financial Services.
Free download
Arabic OCR Accuracy Benchmark Report: 7 Engines on 10,000 Documents
A 32-page technical report on how seven enterprise OCR engines perform on real UAE government and banking Arabic. Plus an interactive accuracy explorer — filter by document type, handwriting prevalence, and field type.
- Benchmark setup — corpus composition, handwriting prevalence, scoring method
- Field-level accuracy by engine: ABBYY, Microsoft, Google, AWS, OpenAI, Brocode
- Where each engine fails — concrete examples by document type
- TDRA-compliant on-prem appliance: BoM, network zoning, hardening checklist
- The pre-contract free 500-document benchmark — how it is run and scored
FAQ
What boards and procurement leads ask first.
The eight questions our engineering team answers in nearly every steering committee. Direct, on the record, no marketing softening.
We run a free pre-contract benchmark on 500 of your own documents under NDA. You hand over a representative sample, our Arabic NLP team measures character, word, and field-level accuracy by document type, and you receive a written report with a side-by-side comparison against your current stack. Accuracy is reported per field — recipient name, date, intent, action item — not as a single composite number. If our numbers do not clear your acceptance criteria, the engagement does not proceed. We have walked away from three pre-contract benchmarks in the last 18 months for exactly this reason.
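Character- and word-level accuracy are commonly derived from edit distance between the transcript and a ground-truth reference. The sketch below shows that standard approach under stated assumptions; the function names are hypothetical and the benchmark report's exact scoring method may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution, or free if equal
            ))
        prev = curr
    return prev[-1]

def char_accuracy(predicted, truth):
    """1 minus character error rate, measured against the reference length."""
    return 1 - edit_distance(predicted, truth) / max(len(truth), 1)

def word_accuracy(predicted, truth):
    """1 minus word error rate, on whitespace-separated tokens."""
    p, t = predicted.split(), truth.split()
    return 1 - edit_distance(p, t) / max(len(t), 1)
```

Both scores can fall below zero when the transcript is much longer than the reference, so reports often clamp them at 0. Field-level accuracy is stricter still: a field typically counts as correct only if the whole value matches.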
Pre-contract benchmark
Hand over 500 documents. Get an accuracy report back, signed.
Six fields — volume, document types, languages, deployment, your DMS, target go-live. A senior Arabic NLP engineer reviews your corpus under NDA and replies within one business day with the proposed benchmark plan.
Or skip the form.
Message the Arabic NLP lead on WhatsApp directly. We see it within working hours.
Message on WhatsApp
Continue exploring
Related capabilities and stories
Natural Language Processing
Arabic intent, entities, and conduct flags downstream of OCR.
Read more
MLOps & AI Infrastructure
Retraining, drift monitoring, and lineage for the OCR estate.
Read more
Self-hosted LLM Infrastructure
The sovereign deployment story for the wider GenAI stack.
Read more
Government & Public Sector
Correspondence digitisation, case files, citizen services.
Read more
Banking & Financial Services
KYC packs, trade finance, claims, and customer correspondence.
Read more
