
eval-harness.run
The test suite behind every model in production.
Evaluation, regression, bias and fairness, drift, red-teaming, adversarial robustness and guardrails — packaged as an audit-evidence bundle your second and third lines of defence can attach to the regulator submission.
model.eval.run #4129: PASS
- accuracy: 0.913 (pass)
- calibration: ECE 0.041 (pass)
- robustness: ART evasion (pass)
- fairness: EO gap 0.018 (pass)
- faithfulness: RAGAS 0.86 (pass)
- drift: PSI 0.07 (pass)
- red-team: garak weekly (pass)
artefacts emitted
- ✓ Evaluation Report
- ✓ Bias Report
- ✓ Red-Team Report
- ✓ Drift Baseline
- ✓ Regression Diff
- ✓ Control Mapping
Why generic QA fails at AI
Functional testing without an eval harness is not AI QA.
A model can pass every traditional software test and still fail catastrophically in production. There are no fixed outputs; failure modes are statistical; the system can degrade silently for months. Traditional QA shops are strong on functional testing but have nothing for evaluation, bias, red-teaming or drift. Big-4 risk-assurance practices ship methodology overviews but seldom an actual eval-harness run.
Brocode treats AI quality as a continuous engineering discipline — seven test categories on every model, every release, every retrain. Each category names the framework (Fairlearn, AIF360, garak, Evidently, ART, NeMo Guardrails, Llama Guard 3) and emits an artefact that ends up in the audit-evidence bundle. The bundle is what your second and third lines of defence attach to the regulator submission.
Seven test categories on every model
Eval harness, regression, bias, drift, red-team, robustness, guardrails.
Each category names the frameworks, the thresholds, and the artefact it emits to the audit-evidence bundle.
category.01
Model-evaluation harness
An in-house extension of lm-evaluation-harness for classical ML and a custom LLM eval framework on top of OpenAI Evals + DeepEval + Promptfoo. Every evaluation run is versioned in MLflow and re-runnable on demand by the client risk team.
Thresholds
CI-blocking on accuracy, F1, calibration; LLM faithfulness and answer-relevance via RAGAS
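As an illustration of what a CI-blocking threshold check can look like, the sketch below computes expected calibration error (ECE) for a binary classifier and gates a run on accuracy, F1 and ECE. The threshold values, the `Threshold` and `gate` names and the F1 figure are assumptions made for the example, not Brocode's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool = True


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: size-weighted mean of |accuracy - confidence| over
    equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            avg_acc = sum(a for _, a in b) / len(b)
            ece += (len(b) / total) * abs(avg_acc - avg_conf)
    return ece


def gate(metrics, thresholds):
    """Return the names of metrics that breach their threshold.
    An empty list means the build may proceed."""
    failures = []
    for t in thresholds:
        value = metrics[t.metric]
        ok = value >= t.limit if t.higher_is_better else value <= t.limit
        if not ok:
            failures.append(t.metric)
    return failures


# Hypothetical thresholds; real values would be agreed per use case.
THRESHOLDS = [
    Threshold("accuracy", 0.90),
    Threshold("f1", 0.85),
    Threshold("ece", 0.05, higher_is_better=False),
]

# accuracy and ECE echo the sample run above; f1 is an invented toy value.
run_metrics = {"accuracy": 0.913, "f1": 0.88, "ece": 0.041}
failed = gate(run_metrics, THRESHOLDS)
print("BLOCK" if failed else "PASS", failed)
```

In a CI job, a non-empty `failed` list would typically translate into a non-zero exit code so the pipeline stops.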
category.02
Regression suite
Golden-dataset regression for every model release. Candidate-vs-incumbent comparison across the full evaluation pack — not just headline metrics — to catch silent regressions that average accuracy hides.
Thresholds
No metric may regress by more than 0.5 percentage points absolute or 1.5% relative without sign-off
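A minimal sketch of the candidate-vs-incumbent rule, under stated assumptions: metrics are on a 0 to 1 scale where higher is better, the absolute limit is read as 0.5 percentage points, and breaching either limit requires sign-off. The function and metric names are hypothetical.

```python
ABS_LIMIT = 0.005  # 0.5 percentage points on a 0-1 scale (assumed reading)
REL_LIMIT = 0.015  # 1.5% relative to the incumbent value


def regressions(incumbent, candidate, abs_limit=ABS_LIMIT, rel_limit=REL_LIMIT):
    """Return every metric whose drop breaches the absolute OR the
    relative limit. Improvements and unchanged metrics are ignored."""
    flagged = {}
    for name, old in incumbent.items():
        new = candidate[name]
        drop = old - new
        if drop <= 0:
            continue
        rel = drop / old if old else float("inf")
        if drop > abs_limit or rel > rel_limit:
            flagged[name] = {"absolute_drop": drop, "relative_drop": rel}
    return flagged


# Toy values: headline accuracy improves while one segment regresses.
incumbent = {"accuracy": 0.913, "f1_segment_a": 0.80}
candidate = {"accuracy": 0.915, "f1_segment_a": 0.79}
print(regressions(incumbent, candidate))
```

Comparing the full metric pack rather than the headline number is what surfaces the `f1_segment_a` drop here. The relative limit matters mostly for low-valued metrics: a fall from 0.20 to 0.196 is under 0.5 points but is a 2% relative regression.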
category.03
Bias and fairness pack
Demographic-parity, equalised-odds, equal-opportunity and predictive-parity tests using Fairlearn and AIF360, with a UAE-context demographic schema (nationality bands, language preference, residency status) — never US Census categories.
Thresholds
Demographic parity, equalised odds, equal opportunity, predictive parity — thresholds set per use case with the second line of defence
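To make the metric definitions concrete, here is a hand-rolled, standard-library sketch of three of the gaps for a binary classifier. Fairlearn ships equivalents (for example `demographic_parity_difference` and `equalized_odds_difference` in `fairlearn.metrics`); the version below only exists to keep the example self-contained, and the group labels and data are toy values.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true-positive rate and false-positive rate."""
    counts = defaultdict(lambda: {"n": 0, "sel": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for yt, yp, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["sel"] += yp
        if yt == 1:
            c["pos"] += 1
            c["tp"] += yp
        else:
            c["neg"] += 1
            c["fp"] += yp
    rates = {}
    for g, c in counts.items():
        if c["pos"] == 0 or c["neg"] == 0:
            raise ValueError(f"group {g!r} needs both positive and negative labels")
        rates[g] = {
            "selection_rate": c["sel"] / c["n"],
            "tpr": c["tp"] / c["pos"],
            "fpr": c["fp"] / c["neg"],
        }
    return rates


def gap(rates, key):
    """Largest minus smallest value of one rate across groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


def fairness_gaps(y_true, y_pred, groups):
    rates = group_rates(y_true, y_pred, groups)
    tpr_gap, fpr_gap = gap(rates, "tpr"), gap(rates, "fpr")
    return {
        "demographic_parity": gap(rates, "selection_rate"),
        "equal_opportunity": tpr_gap,
        # Equalised odds taken as the worse of the TPR and FPR gaps.
        "equalised_odds": max(tpr_gap, fpr_gap),
    }


# Toy data using two residency-status values as the sensitive feature.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["citizen"] * 4 + ["employment_visa"] * 4
print(fairness_gaps(y_true, y_pred, groups))
```

Predictive parity would follow the same pattern with per-group precision in place of the rates above.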
category.04
Drift detectors
Evidently AI plus a custom Brocode population-stability-index pack, with alerting wired to the client ticketing system. Covariate drift, concept drift and prediction drift are monitored independently.
Thresholds
Severity-tiered alerts on covariate, concept and prediction drift; SLA: drift-alert response within 24 hours
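For readers unfamiliar with the population stability index (PSI), the sketch below computes it for one numeric feature and maps the result to an alert tier. It is a generic textbook formulation, not the Brocode PSI pack; the 0.1 and 0.25 cut-offs are a common rule of thumb used here as an assumption, and the tier names are hypothetical.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample and a current sample, using
    equal-width bins over the baseline range. Values outside the
    baseline range are clipped into the first or last bin."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(idx, n_bins - 1))] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def severity(value, warn=0.1, critical=0.25):
    """Map a PSI value to an alert tier (cut-offs are assumptions)."""
    if value < warn:
        return "none"
    if value < critical:
        return "warning"
    return "critical"


baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]
print(severity(psi(baseline, baseline)), severity(psi(baseline, shifted)))
```

Running the same function on input features, on model outputs and, where labels arrive, on the error rate is one simple way to keep covariate, prediction and concept drift signals separate.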
category.05
LLM red-teaming
Weekly automated red-team using garak plus a Brocode-curated UAE adversarial prompt pack (Arabic jailbreaks, dialect-coded harm prompts, regulator-sensitive topic probes). Manual red-team passes by a named in-house team before every production release.
Thresholds
Weekly automated red-team; manual passes before every production release; zero unmitigated category-1 findings at G3
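The release criterion can be expressed as a small gate over a findings list. The sketch below assumes category 1 is the most severe tier and uses a hypothetical `Finding` record; it does not reflect garak's report format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    probe: str       # hypothetical identifier for the prompt or probe
    category: int    # 1 assumed to be the most severe tier
    mitigated: bool


def blocking_findings(findings):
    """Return the unmitigated category-1 findings. The gate passes
    only when this list is empty."""
    return [f for f in findings if f.category == 1 and not f.mitigated]


# Toy findings for illustration only.
findings = [
    Finding("arabic-jailbreak-017", category=1, mitigated=True),
    Finding("dialect-harm-003", category=2, mitigated=False),
]
print("release blocked" if blocking_findings(findings) else "gate passed")
```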
category.06
Adversarial robustness
Adversarial Robustness Toolbox (ART) evasion and poisoning tests on tabular and vision models. Findings flow into the regression suite so robustness becomes a CI-blocking criterion, not a pre-launch tickbox.
Thresholds
Evasion and poisoning tests on tabular and vision models; degradation bounds agreed at G2
category.07
Guardrails plane in production
NVIDIA NeMo Guardrails plus Llama Guard 3 plus an Arabic policy classifier, with prompt and response logging to a tamper-evident WORM store for the audit trail. Composition order documented in the relevant ADR.
Thresholds
WORM logging of every prompt and response; latency budget for the guardrails chain agreed at G2
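Write-once storage is normally enforced by the storage layer itself. One common complementary technique for making tampering detectable is to hash-chain the log entries, sketched below with the standard library. This is an illustrative assumption about how such an audit trail could be verified, not a description of Brocode's store; the field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    payload = json.dumps(body, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, prompt, response):
    """Append a prompt/response pair, chained to the previous entry's hash."""
    body = {
        "seq": len(log),
        "prompt": prompt,
        "response": response,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify(log):
    """Recompute every hash and check the chain links in order."""
    prev = GENESIS
    for i, entry in enumerate(log):
        body = {k: entry[k] for k in ("seq", "prompt", "response", "prev_hash")}
        if entry["seq"] != i or entry["prev_hash"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


log = []
append_entry(log, "ما هو رصيد حسابي؟", "I can't share account details in this channel.")
append_entry(log, "What is the transfer limit?", "Please check the limits in your banking app.")
print(verify(log))
```

Editing any stored prompt or response changes its digest, which breaks that entry and every link after it, so an auditor can detect the change by re-running `verify`.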
- 7 test categories on every model we deliver
- 11 days average approval cycle (was 9 weeks)
- 4 regulators mapped (CBUAE / FSRA / NCA / ISO 42001)
- 0 unmitigated category-1 red-team findings at G3
Bias and fairness with a UAE-context schema
Nationality bands, language preference, residency status — not US Census categories.
The schema is documented in the bundle, agreed with your second line of defence, and reviewed against the most recent regulator guidance.
UAE-context demographic schema
Nationality bands
Emirati / GCC / Arab expatriate / South Asian expatriate / Western expatriate / Other
Language preference
MSA / Khaleeji dialect / English / Other Arabic dialect
Residency status
Citizen / Golden visa / Employment visa / Family visa / Visit visa
Fairness metrics
Demographic parity, equalised odds, equal opportunity, predictive parity
Schema is editable per use case; thresholds are agreed with the second line of defence, never imposed.
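One way to carry the schema into a fairness pipeline is as typed enumerations, so that a misspelt or out-of-schema value fails early. The sketch below encodes the values listed above; the class and field names are hypothetical, and since the schema is editable per use case the members should be treated as a starting point.

```python
from dataclasses import dataclass
from enum import Enum


class NationalityBand(Enum):
    EMIRATI = "Emirati"
    GCC = "GCC"
    ARAB_EXPATRIATE = "Arab expatriate"
    SOUTH_ASIAN_EXPATRIATE = "South Asian expatriate"
    WESTERN_EXPATRIATE = "Western expatriate"
    OTHER = "Other"


class LanguagePreference(Enum):
    MSA = "MSA"
    KHALEEJI = "Khaleeji dialect"
    ENGLISH = "English"
    OTHER_ARABIC_DIALECT = "Other Arabic dialect"


class ResidencyStatus(Enum):
    CITIZEN = "Citizen"
    GOLDEN_VISA = "Golden visa"
    EMPLOYMENT_VISA = "Employment visa"
    FAMILY_VISA = "Family visa"
    VISIT_VISA = "Visit visa"


@dataclass(frozen=True)
class SensitiveAttributes:
    """Sensitive features attached to one evaluation record."""
    nationality_band: NationalityBand
    language_preference: LanguagePreference
    residency_status: ResidencyStatus


# Looking a value up by its label raises ValueError if it is not in the schema.
record = SensitiveAttributes(
    NationalityBand("GCC"),
    LanguagePreference("Khaleeji dialect"),
    ResidencyStatus("Golden visa"),
)
print(record.residency_status.name)
```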
Mapping to controls
Each QA category mapped to CBUAE, FSRA, NCA and ISO 42001.
The bundle ships with an editable annex per regulator. Below is the structural map.
| Capability | Brocode artefact | CBUAE Model Risk | FSRA AI Principles | NCA AI Ethics + ECC | ISO 42001 |
|---|---|---|---|---|---|
| Model-evaluation harness | lm-evaluation-harness + OpenAI Evals + DeepEval + Promptfoo + MLflow | Model Risk principle: documented validation pre-release | AI Principle: testing fitness for purpose | ECC AI-SP-01 testing controls | ISO 42001 §8.3 model performance |
| Regression suite | Golden-dataset + CI-blocking thresholds | Periodic validation triggers | Continuous monitoring principle | ECC change-management controls | ISO 42001 §8.4 change control |
| Bias and fairness pack | Fairlearn + AIF360 + UAE-context schema | Fair-treatment expectations | Fairness principle (named) | NCA AI ethics framework alignment | ISO 42001 §6.1.4 fairness |
| Drift detection | Evidently + Brocode PSI pack | Continuous monitoring principle | Operational monitoring principle | ECC operational monitoring | ISO 42001 §9.1 monitoring |
| LLM red-teaming | garak + UAE adversarial pack | Model-risk adversarial testing expectation | Adoption risk testing | ECC AI-SP-02 robustness testing | ISO 42001 §6.1.5 robustness |
| Adversarial robustness | ART evasion + poisoning | Security testing expectation | AI security principle | ECC AI security control | ISO 42001 §8.5 information security |
| Guardrails + WORM audit trail | NeMo + Llama Guard 3 + Arabic classifier; WORM log | Audit-trail expectation | Audit trail principle | ECC logging controls | ISO 42001 §7.5 documented information |
Case study
Three UAE tier-1 banks: 9 weeks to 11 days.
After adopting the audit-evidence bundle, the model-risk function reduced AI use-case approval cycle time from 9 weeks to 11 days — without dropping a single regulator control.
Second-line head of model risk, anonymised UAE tier-1 bank
"First vendor that gave us audit evidence before audit asked. The bundle arrived ready-mapped to CBUAE controls — we extended it for our DFSA workload in a week, not a quarter."
Audit-evidence review, Q4 2025
Versus the alternatives
Traditional QA, Big-4 risk assurance, in-house, model-vendor evals.
| Capability | Brocode | Traditional QA shop | Big-4 risk assurance | In-house QA team |
|---|---|---|---|---|
| Model-evaluation harness (not a slide) | Live, versioned, re-runnable in MLflow | Functional tests only | Methodology overview | Pre-launch checklist |
| Bias testing with a UAE-context demographic schema | US-Census categories by default | |||
| LLM red-teaming with a curated UAE adversarial pack | garak + Arabic / Khaleeji adversarial pack | No | Methodology only | No |
| Drift detection wired to client ticketing | Reporting only | In SLA-only | ||
| Audit-evidence bundle pre-mapped to CBUAE / FSRA / NCA / ISO 42001 | Per-engagement build | Internal-only | ||
| Approval cycle for an AI use case | 11 days (for three UAE tier-1 banks) | 9 weeks | 7–12 weeks | 4–9 weeks |
Free download
2026 Audit-Evidence Bundle for AI
A 64-page PDF combining one redacted full evidence bundle, a control-mapping appendix (CBUAE Model Risk, FSRA AI Principles, NCA AI Ethics, ISO 42001), and a fill-in template your risk team can adopt as a default.
- Evaluation harness — sample report with thresholds and pass/fail
- Regression suite — golden-dataset pattern and CI integration
- Bias and fairness pack — Fairlearn + AIF360 + UAE-context schema
- Drift detector pattern — Evidently + Brocode PSI pack
- Red-team and adversarial — garak + ART, by category
- Guardrails plane — NeMo + Llama Guard 3 + Arabic classifier
- Mapping to controls — CBUAE / FSRA / NCA / ISO 42001
Risk-function questions
What the second and third lines of defence ask first.
Yes. The sample bundle in the free download contains a redacted full evaluation report from a real engagement, including the per-segment metrics, calibration curves, faithfulness scores for RAG outputs, and the CI-blocking thresholds that the candidate model tripped. Under NDA we share live MLflow access during the review so your risk team can re-run the evaluation on a new dataset.
Book the audit-evidence review
The Head of AI Risk & QA walks the sample bundle with your team.
A senior AI Risk & QA lead responds within one business day. If your audit submission is inside two weeks, tell us in the form and we will prioritise.
Prefer chat? Message us on WhatsApp.
Continue exploring
Related capabilities and stories
- Delivery Methodology: The Harden phase produces the Audit-Evidence Bundle.
- Technology Stack: The QA stack is part of the published stack.
- MLOps & AI Infrastructure: Where drift monitoring lives in production.
- Responsible AI & Governance: The control-mapping work routes here.
- Banking & Financial Services: The dominant industry for QA-driven procurement.