
Model adaptation · for AI-mature teams

A fine-tuned open-weights LLM, on your GPUs, in your VPC.

Documented gains on your evaluation harness, your dialect, your refusal policy and your safety classifier — delivered in eight weeks under a co-build agreement that leaves your team owning every artefact.

Co-write a Joint Eval Charter (90 min, no commitment) · Message a principal scientist on WhatsApp · Download GulfBench v2 + the reproducer

live training run · h100 sxm5 × 8

run: llama-3.3-70b-qlora-r128-khaleeji-care-007

step: 14,200 / 32,000

loss: 0.412

eval/gulfbench: 7.8 (+4.7 vs base)

dpo_margin: 0.31

grad_norm: 0.84

tokens/sec: 11,840

vram: 74.2 / 80 GiB

tokeniser · khaleeji sample

وين قسط السيارة الجديد؟ (Khaleeji: "Where is the new car instalment?")

+21.6: GulfBench points vs GPT-4o-mini fine-tune
−38%: Inference cost on vLLM + AWQ-INT4
8 weeks: To production-cleared model
0.4%: Llama Guard 3 violation rate, telco ref

The problem at the staging gate

Your fine-tune scores 8.4 in English and 3.1 in Khaleeji.

And Llama Guard flags 4.7% of generations as unsafe. The risk committee has frozen the launch. We have seen this exact failure mode in three GCC programmes this year.

The 1.4-million-subscriber contact-centre pilot was promised by next quarter. A second delay forces a re-baselining, a write-off of GPU spend already committed, and an answer to the CIO about whether the in-house team can deliver alone. A peer sovereign-tech holding is already shipping a fine-tuned Arabic model in production. Falling behind has political weight.

The technical work is a known shape: an open-weights base, a Khaleeji corpus large enough to move the eval needle, a preference set that codifies the refusal policy, a safety regression that survives Llama Guard 3 and ALERT, and a serving stack that makes per-token economics defensible to the CFO. The missing piece is usually evaluation discipline: a single benchmark number is exactly what a risk committee can tear apart. The Joint Eval Charter is the structural answer.

Off-the-shelf large language models are remarkable generalists and unreliable specialists. They miss regional regulatory terminology, mishandle Gulf-dialect Arabic, and produce inconsistent outputs in the structured formats your downstream systems expect. We close those gaps with dataset curation, parameter-efficient training, rigorous evaluation, and operational integration that lets the resulting model earn its place in production.

The 8-week co-build

The Joint Eval Charter signs in week one. GPUs spin up in week two.

No fine-tuning starts until both sides have signed off on the metrics, the held-out task set, the judge ensemble, and the contamination controls. The eval is what the model is being shaped against; it cannot be reverse-engineered after the fact.
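One way to make "locked before training" checkable is to fingerprint the held-out task set at signing and refuse to score against anything that does not match. The sketch below is illustrative only: `Charter`, `fingerprint` and the field names are hypothetical and are not taken from the Brocode tooling.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(tasks: list[dict]) -> str:
    """SHA-256 over a canonical JSON dump of the held-out tasks.

    Dict keys are sorted, but the order of the task list itself is
    significant, so reordering tasks changes the fingerprint.
    """
    canonical = json.dumps(tasks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after signing
class Charter:
    metrics: tuple[str, ...]
    judges: tuple[str, ...]
    heldout_sha256: str

    def verify(self, tasks: list[dict]) -> bool:
        """True only if the task set is byte-for-byte the one that was signed."""
        return fingerprint(tasks) == self.heldout_sha256
```

An eval run would call `charter.verify(tasks)` first and abort on `False`, so any later edit to the held-out set is visible rather than silent.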

Charter signatories

· Customer Head of AI
· Customer Principal Scientist
· Brocode Principal Scientist
· Customer Risk / Compliance representative
· Brocode Programme Lead

The Charter is reversible. Any signatory can pause the programme between gates with no commercial penalty.

Week 1
Joint Eval Charter signed
Customer and Brocode co-write the eval — task set, judge ensemble, contamination controls, success thresholds. No GPU is provisioned until both sides have signed off on the metrics that define a passing model.
Signed by Head of AI + Principal Scientist
Weeks 2–3
Base-model bake-off
Three candidate bases scored against the Charter on a 5K-example dry-run. The customer sees the trade-off matrix in their own terms before committing GPU time to a 70B-class run.
Trade-off matrix delivered
Weeks 3–5
Data pipeline + first SFT pass
Khaleeji corpus assembled from customer transcripts plus Brocode's 2.3M-utterance care set; AraDPO preference set prepared in Argilla; QLoRA rank 64–128 on the chosen base.
GulfBench v0 reported
Weeks 5–6
Preference alignment
DPO or ORPO on the customer's refusal-policy preference set. Llama Guard 3 and ALERT regression suites run on every checkpoint; pairwise judge ensembles triangulate against a fine-tuned in-house judge.
Safety violations ≤ 1% target
Weeks 6–7
Serving readiness
AWQ-INT4 quantisation, Marlin kernels, EAGLE-2 speculative decoding. Run-rate calculator handed to the FinOps lead — per-million-token cost on vLLM versus the closed API baseline.
Run-rate delta documented
Week 8
Production handover
Adapters, eval harness, training configs and reproducer code transferred under the co-build agreement. The customer's MLOps team operates the model from day one; Brocode stays on for a run-phase but never owns the artefacts.
All artefacts customer-owned
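For readers who want the preference-alignment step in weeks 5–6 spelled out, the standard DPO objective for a single preference pair can be sketched in a few lines. This is the textbook formulation, not a description of the customer-specific recipe; the default `beta` is an illustrative assumption, and whether a dashboard metric such as `dpo_margin` uses this exact unscaled definition is not stated in the source.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> tuple[float, float]:
    """Return (loss, margin) for one preference pair.

    Inputs are summed sequence log-probabilities of the chosen and
    rejected completions under the policy and the frozen reference model.
    """
    # How much more the policy prefers "chosen" over "rejected"
    # than the reference model does (unscaled log-ratio difference).
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # loss = -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        loss = math.log1p(math.exp(-x))
    else:
        loss = -x + math.log1p(math.exp(x))
    return loss, margin
```

A margin of zero gives a loss of log 2; the loss falls as the policy separates chosen from rejected completions more strongly than the reference does.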

Section 3 · Base-model trade-offs

Seven bases we routinely fine-tune. None of them is the right answer for every customer.

The bake-off in weeks two and three is the cheapest insurance you can buy against the wrong base. We score the candidates against your Charter on a 5K-example dry-run before committing GPU time to a 70B-class run.

Base · 01

Llama 3.3 70B

Strong English baseline, broad community tooling, clean QLoRA path on 8× H100.

Trade-off: Khaleeji-weak out of the box; needs continued pre-training on Arabic corpus before SFT.

Base · 02

Llama 3.1 405B (quantised)

Frontier-grade reasoning when AWQ-INT4 is acceptable; survives multi-step policy QA.

Trade-off: Serving cost; requires Marlin kernels and EAGLE-2 speculative decoding to be economic.

Base · 03

Mistral-Large-2

Disciplined instruction-following and structured-output behaviour for compliance Q&A.

Trade-off: Weights are published under the Mistral Research License, so commercial deployment needs a separate commercial licence; smaller open community than Llama; fewer ready-made Arabic recipes.

Base · 04

Qwen 2.5 72B

Excellent multilingual baseline, handles Arabic-English code-switching cleanly.

Trade-off: Tokeniser packs Arabic less efficiently than Jais; latency at long context is harder.

Base · 05

Falcon Mamba 7B (TII)

State-space backbone, low memory at long context; UAE-origin lineage matters to some buyers.

Trade-off: Capacity ceiling on complex reasoning tasks; treat as a fast first-tier classifier.

Base · 06

Jais 70B (Inception / G42)

Native Khaleeji and MSA pre-training; smallest gap to close on dialect.

Trade-off: Licensing terms require careful review; English performance trails Llama on STEM tasks.

Base · 07

Gemma 3 27B

Tight latency profile, friendly licence for hosted inference, strong on safety classifiers.

Trade-off: Capacity below 70B class; choose only when the task is narrow and latency dominates.
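Several of the trade-offs above come down to how efficiently a tokeniser packs Arabic. A common proxy is fertility, the average number of tokens per whitespace-separated word. The sketch below is a minimal illustration: `fertility` and the two stand-in tokenisers are hypothetical helpers, and in a real bake-off each candidate's own tokeniser would be passed in instead.

```python
from typing import Callable


def fertility(texts: list[str], tokenize: Callable[[str], list]) -> float:
    """Average tokens per whitespace-separated word (lower packs tighter)."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    if words == 0:
        raise ValueError("sample contains no words")
    return tokens / words


# Stand-in tokenisers for illustration only. They bracket the range:
# one token per word (best case) and one token per character (worst case).
def word_tokenize(text: str) -> list[str]:
    return text.split()


def char_tokenize(text: str) -> list[str]:
    return [c for c in text if not c.isspace()]


sample = ["وين قسط السيارة الجديد؟"]
best_case = fertility(sample, word_tokenize)
worst_case = fertility(sample, char_tokenize)
```

Running the same Khaleeji sample through each candidate base gives a like-for-like number that feeds directly into latency and per-token cost estimates.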

The training stack

Axolotl, DeepSpeed ZeRO-3, FlashAttention-3 — and Argilla for the annotators.

The same stack we run in the Brocode lab, named tool by tool, so your principal scientist can challenge any choice on the call.

Tool · 01

Axolotl + DeepSpeed ZeRO-3

Primary training driver — config-driven, reproducible, multi-node ready.

Tool · 02

FlashAttention-3

Memory and throughput wins on H100/H200; required for 128K context tuning.

Tool · 03

Megatron-Core

Multi-node 405B SFT and continued pre-training when ZeRO-3 hits the wall.

Tool · 04

Unsloth

Fast LoRA iteration on 7–13B bases for hypothesis testing before a 70B commit.

Tool · 05

Argilla

Khaleeji annotator workflow — native reviewers, audit trail, preference set quality control.

Tool · 06

vLLM + EAGLE-2 + AWQ-INT4

Production serving with speculative decoding, quantisation and Marlin kernels.
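As a back-of-envelope check on why a 70B QLoRA run is feasible on this stack, the arithmetic below estimates 4-bit weight memory and the number of trainable adapter parameters at a given rank. It assumes the published Llama 3 70B shape and LoRA applied to every attention and MLP projection, which is an assumption rather than the documented recipe; it ignores quantisation constants, activations and optimiser state, so real usage will be higher.

```python
GIB = 1024 ** 3

# Published Llama 3 70B shape: 80 layers, hidden size 8192,
# MLP size 28672, 8 KV heads of dimension 128 (KV width 1024).
LAYERS, HIDDEN, MLP, KV_DIM = 80, 8192, 28672, 1024


def lora_params(rank: int) -> int:
    """Trainable parameters when LoRA wraps all seven linear projections."""
    shapes = [
        (HIDDEN, HIDDEN),  # q_proj
        (HIDDEN, KV_DIM),  # k_proj
        (HIDDEN, KV_DIM),  # v_proj
        (HIDDEN, HIDDEN),  # o_proj
        (HIDDEN, MLP),     # gate_proj
        (HIDDEN, MLP),     # up_proj
        (MLP, HIDDEN),     # down_proj
    ]
    # Each adapter adds two matrices: (d_in x rank) and (rank x d_out).
    return LAYERS * sum(rank * (d_in + d_out) for d_in, d_out in shapes)


def weight_gib(n_params: float, bits: int) -> float:
    """Memory for n_params weights stored at the given bit width, in GiB."""
    return n_params * bits / 8 / GIB


base_4bit_gib = weight_gib(70e9, 4)    # nominal 70B base in 4-bit
adapter_r128 = lora_params(128)        # 1,656,750,080 trainable parameters
adapter_r64 = lora_params(64)          # 828,375,040 trainable parameters
```

Under these assumptions the 4-bit base weights come to roughly 32.6 GiB in total, and a rank-128 adapter trains about 1.66B parameters, twice the rank-64 figure.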

GulfBench v2 — judge ensemble

Every checkpoint is scored by three judges in pairwise comparison: Claude Sonnet 4.5, GPT-5, and a fine-tuned in-house judge trained on customer-reviewed preferences. Disagreement between judges is logged as a quality signal, not averaged away. Contamination probes run on every Charter task.
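Logging disagreement instead of averaging it away can be as simple as recording, per comparison, the majority verdict and which judges dissented. The sketch below assumes each judge returns `"A"`, `"B"` or `"tie"` for a pairwise comparison; the function names and judge labels are hypothetical placeholders, not the GulfBench v2 code.

```python
from collections import Counter


def aggregate(verdicts: dict[str, str]) -> dict:
    """Combine one pairwise comparison across judges.

    verdicts maps judge name -> "A", "B" or "tie".
    """
    counts = Counter(verdicts.values())
    top_label, top_count = counts.most_common(1)[0]
    has_majority = top_count > len(verdicts) / 2
    winner = top_label if has_majority else None
    return {
        "winner": winner,
        "unanimous": top_count == len(verdicts),
        # With no majority, every judge counts as a dissenter.
        "dissenters": sorted(j for j, v in verdicts.items() if v != winner),
    }


def disagreement_rate(items: list[dict[str, str]]) -> float:
    """Share of comparisons where the judges were not unanimous."""
    if not items:
        return 0.0
    split = sum(1 for v in items if not aggregate(v)["unanimous"])
    return split / len(items)
```

Keeping the dissenting judge's name on each record makes it possible to spot a judge that systematically diverges on one dialect or domain.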

How we compare

Azure managed, AWS Bedrock, DIY, or the MBZUAI / TII collab — and where each one ends.

The structural differences that survive a procurement-committee read. The full per-task scorecard is in the GulfBench v2 lead magnet below.

Capability	Brocode	Azure OpenAI fine-tune	AWS Bedrock custom	DIY on bare-metal	MBZUAI / TII research
Open-weights ownership of the resulting model: adapters and base weights stay portable to vLLM, TGI, Triton, sovereign clouds.
Khaleeji Arabic data pipeline: the 3.1 Khaleeji care score is closed by data, not just method.	Native, 2.3M-utterance corpus + Argilla annotators	Not provided	Not provided	DIY	Research-stage
GulfBench v2 evaluation harness: contamination controls and pairwise judge ensemble triangulation.	Open, reproducible, 18 dialect+domain tasks	Closed metric	Closed metric	DIY	Research benchmarks
In-region inference (UAE / KSA): no cross-border data movement when the customer requires it.	Customer VPC, H100/H200 or sovereign cloud	UAE region with constraints	Bahrain region (data residency limits)	Customer choice	Research collab terms
Time to a production-cleared model: includes safety regression and run-rate optimisation.	8 weeks under co-build agreement	8–12 weeks	8–12 weeks	6–9 months without partner	12–24 months
Per-token inference cost vs closed baseline: 38% inference-cost reduction observed on the telco reference.	~1/12× via vLLM + EAGLE-2 + AWQ-INT4	Closed API pricing	Closed API pricing	DIY depends on stack	Not commercial
Joint Eval Charter signed before training starts: no GPU spend until the customer has signed off on the success metric.
Customer team trained to retune: reproducible configs and a co-build agreement that ends with the customer running the model.
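The per-token cost row can be reasoned about with a very small run-rate model: GPU cost per hour divided by sustained serving throughput. The sketch below is a simplified illustration, not the run-rate calculator handed to FinOps; the function names are hypothetical, and it deliberately ignores utilisation, batching effects, networking and staffing.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, n_gpus: int,
                            tokens_per_sec: float) -> float:
    """Self-hosted cost in USD per one million generated tokens."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1_000_000


def relative_saving(self_hosted_usd: float, api_usd: float) -> float:
    """Fractional saving versus a closed-API price (0.38 means 38% cheaper)."""
    if api_usd <= 0:
        raise ValueError("api_usd must be positive")
    return 1 - self_hosted_usd / api_usd
```

Feeding in the customer's own GPU rate, measured serving throughput and current API price gives a figure the FinOps lead can audit line by line.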

Three objections worth airing

The questions your team will ask in the second meeting.

Objection 01

“Azure already gives us managed fine-tuning on GPT-4o-mini and o4-mini. Why bring in an open-weights specialist?”

Open-weights ownership is the answer to a vendor-lock question your CISO will eventually ask. Our anonymised telco reference moved from Azure GPT-4o-mini at managed cost to a Llama 3.3 70B fine-tune on their own H100s, with 38% lower per-token inference cost and a GulfBench delta of +21.6 points on Khaleeji care intents. The model is portable to vLLM, TGI, or any sovereign-cloud runtime the customer chooses next.

Objection 02

“Our team can do this — we have a Llama run on bare-metal. What specifically are you bringing that we don't have?”

Three concrete artefacts: GulfBench v2 (an open, reproducible Khaleeji-aware eval suite with contamination controls), a 2.3M-utterance Khaleeji care corpus with annotation workflow in Argilla, and a productionisation toolchain — vLLM + EAGLE-2 speculative decoding + AWQ-INT4 with Marlin kernels — that DIY teams typically rebuild over 6–9 months. We co-build, then leave; the customer owns every artefact.

Objection 03

“How do you prove the lift is real and not eval contamination? Our risk committee will tear apart any single-benchmark number.”

The Joint Eval Charter is the structural answer. We co-author the held-out task set in week 1, lock it before any training begins, and triangulate with a pairwise judge ensemble that combines Claude Sonnet 4.5, GPT-5 and a fine-tuned in-house judge. The methodology is published in the GulfBench v2 lead magnet with full reproducer code on GitHub — a risk committee can audit the design before it sees a number.
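One simple contamination probe a risk committee can reproduce is an n-gram overlap check between each held-out item and the training corpus. The sketch below shows that generic technique under stated assumptions (whitespace tokenisation, exact matching, a default of 8-grams); it is not a description of the specific probes in GulfBench v2, and the function names are hypothetical.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams of whitespace-separated tokens."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlap_ratio(eval_item: str, train_docs: list[str], n: int = 8) -> float:
    """Share of the eval item's n-grams that also appear in training data.

    Returns 0.0 when the item is shorter than n tokens.
    """
    target = ngrams(eval_item, n)
    if not target:
        return 0.0
    seen: set[tuple[str, ...]] = set()
    for doc in train_docs:
        seen |= ngrams(doc, n) & target
    return len(seen) / len(target)
```

Items whose ratio exceeds an agreed threshold would be pulled from the held-out set or flagged in the report; the threshold itself is something the Charter would need to fix in advance.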

Free download

GulfBench v2 — A Khaleeji-Aware LLM Evaluation Harness

A 38-page technical report, a downloadable JSONL of held-out tasks (200 examples redacted, full set under NDA), and a Python reproducer on GitHub. Includes 14 open and closed models scored on telco, banking and government tasks.

Benchmark methodology and contamination controls
Base model comparison: Mistral, Llama 3.x, Qwen 2.5, Jais, Falcon Mamba, Gemma 3
Fine-tuning recipes: QLoRA, DPO, ORPO with config files
Cost and latency on H100 SXM5 and H200 with vLLM + EAGLE-2 + AWQ-INT4
Headline figure: Llama 3.3 70B fine-tune outscores GPT-4o-mini fine-tune by 21.6 GulfBench points at ~1/12× inference cost
Public reproducer on GitHub with seed data and eval harness

Frequently asked

Ownership, sovereignty, contamination, deprecation.

Eight questions a Head of AI and a CISO usually ask in the first sixty minutes. Longer answers live in the procurement pack.

The customer owns the adapters, the merged checkpoints, the training configs, the eval harness and the reproducer code. The co-build agreement is explicit on this point. Brocode retains no rights to redeploy artefacts elsewhere, and the customer can fork the model and continue with another partner or in-house at any time.

Joint Eval Charter call

Ninety minutes with a principal scientist. No commitment.

Bring your current base, your dialect coverage, your refusal policy and the metric your risk committee will eventually defend. Leave with a signed Charter draft you can take to your CIO.

What you walk away with

· A Charter draft: tasks, judges, contamination controls, success thresholds
· A base-model trade-off matrix for your specific intents
· A run-rate envelope on vLLM + AWQ-INT4 against your closed-API baseline
· A pointer to the right reference (telco, bank, or sovereign holding)
