The 1.4 million-subscriber contact-centre pilot was promised by next quarter. A second delay forces a re-baselining, a write-off of GPU spend already committed, and an answer to the CIO about whether the in-house team can deliver alone. A peer sovereign-tech holding is already shipping a fine-tuned Arabic model in production. Falling behind has political weight.
The technical work is a known shape: an open-weights base, a Khaleeji corpus large enough to move the eval needle, a preference set that codifies the refusal policy, a safety regression that survives Llama Guard 3 and ALERT, and a serving stack that makes per-token economics defensible to the CFO. The missing piece is usually evaluation discipline — a single benchmark number that a risk committee can tear apart. The Joint Eval Charter is the structural answer.
Off-the-shelf large language models are remarkable generalists and unreliable specialists. They miss regional regulatory terminology, mishandle Gulf-dialect Arabic, and produce inconsistent outputs in the structured formats your downstream systems expect. We close those gaps with dataset curation, parameter-efficient training, rigorous evaluation, and operational integration that lets the resulting model earn its place in production.