Chapter 3 · pages 24–38
Why MSA-trained models fail on Khaleeji calls.
Modern Standard Arabic is a literary register; almost nobody speaks it on a contact-centre call. A customer in Sharjah complaining about an inflated du bill will switch between Khaleeji vocabulary, an English brand name, and a Bahraini turn of phrase three times in a sentence. The MSA-tuned ASR transcribes the first half competently, mis-renders the brand name, and emits a confidence score the application layer treats as gospel.
The remedy is not a bigger model. It is a calibration set that contains the dialect, the code-switching, and the brand-named entities the deployment actually encounters. We construct one in three weeks per dialect: a 1,200-utterance reference set across MSA, Khaleeji, Levantine, and Egyptian, balanced for gender, age band, channel quality, and noise profile. The set lives in the client's environment, refreshes quarterly, and runs as a CI step on every model push.
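One way to picture the CI step is as a per-dialect calibration gate: score the reference set, compare reported confidence against actual correctness, and fail the push when any dialect drifts too far. The sketch below is illustrative only. The `Utterance` record, the dialect labels, the use of expected calibration error (ECE) as the metric, and the 0.10 threshold are all assumptions made for the example, not the pipeline described in this chapter.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One scored item from the reference set (hypothetical schema)."""
    dialect: str        # e.g. "msa", "khaleeji"; labels are illustrative
    confidence: float   # model-reported confidence in [0, 1]
    correct: bool       # did the transcript match the reference?


def expected_calibration_error(items, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if not items:
        return 0.0
    bins = defaultdict(list)
    for u in items:
        # Clamp so that confidence == 1.0 falls into the last bin.
        idx = min(int(u.confidence * n_bins), n_bins - 1)
        bins[idx].append(u)
    ece = 0.0
    for members in bins.values():
        accuracy = sum(u.correct for u in members) / len(members)
        mean_conf = sum(u.confidence for u in members) / len(members)
        ece += (len(members) / len(items)) * abs(accuracy - mean_conf)
    return ece


def calibration_gate(items, max_ece=0.10):
    """Return per-dialect ECE and the dialects that exceed the threshold."""
    by_dialect = defaultdict(list)
    for u in items:
        by_dialect[u.dialect].append(u)
    report = {d: expected_calibration_error(us) for d, us in by_dialect.items()}
    failures = sorted(d for d, ece in report.items() if ece > max_ece)
    return report, failures


if __name__ == "__main__":
    # Toy data: the model is equally confident on both dialects,
    # but right far less often on Khaleeji.
    sample = (
        [Utterance("msa", 0.9, i < 9) for i in range(10)]
        + [Utterance("khaleeji", 0.9, i < 5) for i in range(10)]
    )
    report, failures = calibration_gate(sample)
    print(report, failures)
    raise SystemExit(1 if failures else 0)  # non-zero exit fails the CI job
```

Gating per dialect rather than on the pooled set is the point of the design: a pooled score can look healthy while one dialect is badly over-confident.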
The detail that matters: Khaleeji morphology produces dozens of valid surface forms for the same lemma. NER trained on MSA collapses them into a single noisy class; the Khaleeji Benchmark v2 splits them. The benchmark scores the same models 9–14 F1 points lower on Khaleeji than on MSA, a gap that is invisible in the published model cards because those cards report MSA results.
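The gap in F1 points can be made concrete with a small per-dialect comparison. This is a minimal sketch, assuming entity-level exact-match scoring over `(utterance_id, start, end, label)` tuples. The tuple layout, the function names, and the toy data are hypothetical and are not taken from the Khaleeji Benchmark v2.

```python
def f1_score(gold, predicted):
    """Entity-level F1 with exact match on (utterance_id, start, end, label)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def f1_gap_points(results, baseline="msa"):
    """Gap to the baseline dialect, in F1 points (0-100 scale).

    `results` maps dialect -> (gold_entities, predicted_entities).
    A positive value means the dialect scores lower than the baseline.
    """
    scores = {d: 100 * f1_score(g, p) for d, (g, p) in results.items()}
    return {d: scores[baseline] - s for d, s in scores.items() if d != baseline}


if __name__ == "__main__":
    # Toy data only: the same model, scored separately per dialect.
    msa_gold = {(1, 0, 2, "ORG"), (1, 5, 6, "LOC"), (2, 0, 1, "PER"), (3, 3, 4, "ORG")}
    khaleeji_gold = {(7, 0, 2, "ORG"), (7, 4, 5, "LOC"), (8, 1, 2, "PER"), (9, 0, 1, "ORG")}
    # One Khaleeji entity is mislabelled, so precision and recall both drop.
    khaleeji_pred = (khaleeji_gold - {(9, 0, 1, "ORG")}) | {(9, 0, 1, "LOC")}

    gaps = f1_gap_points({
        "msa": (msa_gold, msa_gold),
        "khaleeji": (khaleeji_gold, khaleeji_pred),
    })
    print(gaps)
```

Reporting the gap next to the headline score is what makes it visible: a single MSA number on a model card hides it.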
Chapter continues, pages 28–38: Khaleeji NER feature engineering · dialect classifier · evaluation gates.
