eval-harness.run

The test suite behind every model in production.

Evaluation, regression, bias and fairness, drift, red-teaming, adversarial robustness and guardrails — packaged as an audit-evidence bundle your second and third lines of defence can attach to the regulator submission.

Book a 60-minute audit-evidence review

Download the 2026 Audit-Evidence Bundle

model.eval.run #4129

PASS

accuracy	0.913	pass
calibration	ECE 0.041	pass
robustness	ART evasion	pass
fairness	EO gap 0.018	pass
faithfulness	RAGAS 0.86	pass
drift	PSI 0.07	pass
red-team	garak weekly	pass

artefacts emitted

✓ Evaluation Report
✓ Bias Report
✓ Red-Team Report
✓ Drift Baseline
✓ Regression Diff
✓ Control Mapping

Why generic QA fails at AI

Functional testing without an eval harness is not AI QA.

A model can pass every traditional software test and still fail catastrophically in production. There are no fixed outputs; failure modes are statistical; the system can degrade silently for months. Traditional QA shops are strong on functional testing and have nothing for evaluation, bias, red-teaming or drift. Big-4 risk-assurance practices ship methodology overviews and seldom an actual eval-harness run.

Brocode treats AI quality as a continuous engineering discipline — seven test categories on every model, every release, every retrain. Each category names the framework (Fairlearn, AIF360, garak, Evidently, ART, NeMo Guardrails, Llama Guard 3) and emits an artefact that ends up in the audit-evidence bundle. The bundle is what your second and third lines of defence attach to the regulator submission.

Seven test categories on every model

Eval harness, regression, bias, drift, red-team, robustness, guardrails.

Each category names the frameworks, the thresholds, and the artefact it emits to the audit-evidence bundle.

category.01

Model-evaluation harness

lm-evaluation-harness, OpenAI Evals, DeepEval, Promptfoo, MLflow

An in-house extension of lm-evaluation-harness for classical ML and a custom LLM eval framework on top of OpenAI Evals + DeepEval + Promptfoo. Every evaluation run is versioned in MLflow and re-runnable on demand by the client risk team.

Thresholds

CI-blocking on accuracy, F1, calibration; LLM faithfulness and answer-relevance via RAGAS
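
As an illustration of a CI-blocking calibration and accuracy gate, here is a minimal sketch using only the Python standard library. It is not the Brocode harness: the function names `expected_calibration_error` and `ci_gate`, the equal-width binning, and the floor and ceiling values are all assumptions made for the example.

```python
from typing import Mapping, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean over bins of |accuracy - mean confidence|."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("confidences and correct must be non-empty and equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def ci_gate(
    metrics: Mapping[str, float],
    floors: Mapping[str, float],
    ceilings: Mapping[str, float],
) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, floor in floors.items():
        if metrics[name] < floor:
            failures.append(f"{name}={metrics[name]:.3f} is below the floor {floor:.3f}")
    for name, ceiling in ceilings.items():
        if metrics[name] > ceiling:
            failures.append(f"{name}={metrics[name]:.3f} is above the ceiling {ceiling:.3f}")
    return failures


# Hypothetical thresholds, chosen only for illustration.
failures = ci_gate(
    {"accuracy": 0.913, "ece": 0.041},
    floors={"accuracy": 0.90},
    ceilings={"ece": 0.05},
)
```

In a pipeline, a non-empty `failures` list would typically be turned into a non-zero exit code so the release job stops.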

category.02

Regression suite

Golden datasets, RAGAS, MLflow

Golden-dataset regression for every model release. Candidate-vs-incumbent comparison across the full evaluation pack — not just headline metrics — to catch silent regressions that average accuracy hides.

Thresholds

No metric may regress by more than 0.5 percentage points absolute or 1.5% relative without sign-off
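
The regression rule can be sketched as a candidate-versus-incumbent comparison. This is an illustrative reading, not the production suite: it assumes metrics sit on a 0 to 1 scale, that higher is better, and that a metric is flagged when either tolerance is exceeded. The name `regression_flags` is hypothetical.

```python
from typing import Mapping


def regression_flags(
    incumbent: Mapping[str, float],
    candidate: Mapping[str, float],
    abs_tol: float = 0.005,  # 0.5 percentage points on a 0-1 scale
    rel_tol: float = 0.015,  # 1.5% of the incumbent value
) -> dict[str, tuple[float, float]]:
    """Map each regressed metric to its (absolute drop, relative drop)."""
    flagged = {}
    for name, base in incumbent.items():
        drop = base - candidate[name]
        relative = drop / base if base > 0 else 0.0
        # Flag when either tolerance is exceeded (an assumed reading of the rule).
        if drop > abs_tol or relative > rel_tol:
            flagged[name] = (drop, relative)
    return flagged


# Illustrative numbers only.
flags = regression_flags(
    incumbent={"accuracy": 0.913, "f1": 0.880},
    candidate={"accuracy": 0.911, "f1": 0.870},
)
```

Comparing every metric in the pack, rather than only the headline score, is what lets a drop in one metric surface even when average accuracy barely moves.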

category.03

Bias and fairness pack

Fairlearn, AIF360, Brocode UAE-context schema

Demographic-parity, equalised-odds, equal-opportunity and predictive-parity tests using Fairlearn and AIF360, with a UAE-context demographic schema (nationality bands, language preference, residency status) — never US Census categories.

Thresholds

Demographic parity, equalised odds, equal opportunity, predictive parity — thresholds set per use case with the second line of defence
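
To make two of these metrics concrete, the sketch below computes a demographic-parity gap and an equalised-odds gap by hand for binary labels. It is a standard-library illustration, not the Fairlearn or AIF360 implementation; the group labels `band_a` and `band_b` and the toy data are placeholders.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def group_rates(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> dict:
    """Per-group selection rate, true-positive rate and false-positive rate."""
    counts = defaultdict(lambda: {"n": 0, "selected": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["selected"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred
        else:
            c["neg"] += 1
            c["fp"] += pred

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        g: {
            "selection_rate": ratio(c["selected"], c["n"]),
            "tpr": ratio(c["tp"], c["pos"]),
            "fpr": ratio(c["fp"], c["neg"]),
        }
        for g, c in counts.items()
    }


def _gap(rates: dict, key: str) -> float:
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


def demographic_parity_gap(rates: dict) -> float:
    """Largest difference in selection rate between any two groups."""
    return _gap(rates, "selection_rate")


def equalised_odds_gap(rates: dict) -> float:
    """Larger of the TPR gap and the FPR gap across groups."""
    return max(_gap(rates, "tpr"), _gap(rates, "fpr"))


# Toy data with placeholder group labels.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["band_a"] * 4 + ["band_b"] * 4
rates = group_rates(y_true, y_pred, groups)
```

The resulting gaps would then be compared against the per-use-case thresholds agreed with the second line of defence.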

category.04

Drift detectors

Evidently AI, Brocode PSI pack

Evidently AI plus a custom Brocode population-stability-index pack, with alerting wired to the client ticketing system. Covariate drift, concept drift and prediction drift are monitored independently.

Thresholds

Severity-tiered alerts on covariate, concept and prediction drift; SLA: drift-alert response within 24 hours
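
For readers unfamiliar with the population stability index, here is a minimal sketch of the calculation with a severity tier on top. It is not the Brocode PSI pack: the equal-width binning over the baseline range, the `eps` floor for empty bins, and the 0.10 and 0.25 cut-offs (a common rule of thumb) are assumptions.

```python
import math
from typing import Sequence


def psi(
    expected: Sequence[float], actual: Sequence[float], n_bins: int = 10, eps: float = 1e-6
) -> float:
    """Population stability index: sum of (a_i - e_i) * ln(a_i / e_i) over bins."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Values outside the baseline range are clamped to the edge bins.
            idx = max(0, min(int((v - lo) / width), n_bins - 1))
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def severity(psi_value: float, warn: float = 0.10, critical: float = 0.25) -> str:
    """Tier a PSI value; the cut-offs are a rule of thumb, not agreed thresholds."""
    if psi_value >= critical:
        return "critical"
    if psi_value >= warn:
        return "warning"
    return "ok"


baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
tier = severity(psi(baseline, shifted))
```

The same function can be applied per feature for covariate drift and to model scores for prediction drift; concept drift needs labelled outcomes and is not covered by this sketch.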

category.05

LLM red-teaming

garak, Brocode UAE adversarial pack

Weekly automated red-team using garak plus a Brocode-curated UAE adversarial prompt pack (Arabic jailbreaks, dialect-coded harm prompts, regulator-sensitive topic probes). Manual red-team passes by a named in-house team before every production release.

Thresholds

Weekly automated red-team; manual passes before every production release; zero unmitigated category-1 findings at G3

category.06

Adversarial robustness

Adversarial Robustness Toolbox (ART)

Adversarial Robustness Toolbox (ART) evasion and poisoning tests on tabular and vision models. Findings flow into the regression suite so robustness becomes a CI-blocking criterion, not a pre-launch tickbox.

Thresholds

Evasion and poisoning tests on tabular and vision models; degradation bounds agreed at G2

category.07

Guardrails plane in production

NeMo Guardrails, Llama Guard 3, Bespoke Arabic policy classifier (we train inside your engagement repo)

NVIDIA NeMo Guardrails plus Llama Guard 3 plus an Arabic policy classifier, with prompt and response logging to a tamper-evident WORM store for the audit trail. Composition order documented in the relevant ADR.

Thresholds

WORM logging of every prompt and response; latency budget for the guardrails chain agreed at G2
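
One common way to make a prompt and response log tamper-evident is a hash chain, sketched below. This is an assumption about technique, not a description of the store used in engagements: true WORM behaviour is a property of the storage layer, and the chain only makes later edits detectable. The names `append_record` and `verify_chain` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log: list, prompt: str, response: str) -> dict:
    """Append a record whose hash commits to every earlier record."""
    prev = log[-1]["hash"] if log else GENESIS
    payload = {"seq": len(log), "prompt": prompt, "response": response}
    record = {**payload, "prev_hash": prev, "hash": _digest(prev, payload)}
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited, dropped or reordered record breaks the chain."""
    prev = GENESIS
    for index, record in enumerate(log):
        payload = {
            "seq": record["seq"],
            "prompt": record["prompt"],
            "response": record["response"],
        }
        if (
            record["seq"] != index
            or record["prev_hash"] != prev
            or record["hash"] != _digest(prev, payload)
        ):
            return False
        prev = record["hash"]
    return True


audit_log: list = []
append_record(audit_log, "What is my balance?", "I can help with that after verification.")
append_record(audit_log, "مرحبا", "Hello, how can I help?")
```

An auditor can re-run `verify_chain` over an exported log without needing access to the guardrails chain itself.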

7
Test categories on every model we deliver
11d
Average approval cycle (was 9 weeks)
4
Regulators and standards mapped (CBUAE / FSRA / NCA / ISO 42001)
0
Unmitigated category-1 red-team findings at G3

Bias and fairness with a UAE-context schema

Nationality bands, language preference, residency status — not US Census categories.

The schema is documented in the bundle, agreed with your second line of defence, and reviewed against the most recent regulator guidance.

See responsible AI & governance

UAE-context demographic schema

Nationality bands
Emirati / GCC / Arab expatriate / South Asian expatriate / Western expatriate / Other
Language preference
MSA / Khaleeji dialect / English / Other Arabic dialect
Residency status
Citizen / Golden visa / Employment visa / Family visa / Visit visa
Fairness metrics
Demographic parity, equalised odds, equal opportunity, predictive parity

Schema is editable per use case; thresholds are agreed with the second line of defence, never imposed.
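
Because the schema is editable per use case, it can help to hold it as data rather than code. The sketch below is one possible representation, with the field names `nationality_band`, `language_preference` and `residency_status` and the helper `validate_record` chosen for illustration; the allowed values are copied from the schema above.

```python
from typing import Mapping

# Allowed values copied from the schema above; the key names are illustrative.
UAE_CONTEXT_SCHEMA: dict[str, tuple[str, ...]] = {
    "nationality_band": (
        "Emirati",
        "GCC",
        "Arab expatriate",
        "South Asian expatriate",
        "Western expatriate",
        "Other",
    ),
    "language_preference": ("MSA", "Khaleeji dialect", "English", "Other Arabic dialect"),
    "residency_status": (
        "Citizen",
        "Golden visa",
        "Employment visa",
        "Family visa",
        "Visit visa",
    ),
}


def validate_record(
    record: Mapping[str, str],
    schema: Mapping[str, tuple[str, ...]] = UAE_CONTEXT_SCHEMA,
) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for field, allowed in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] not in allowed:
            problems.append(f"unexpected value for {field}: {record[field]!r}")
    return problems


sample = {
    "nationality_band": "GCC",
    "language_preference": "Khaleeji dialect",
    "residency_status": "Employment visa",
}
problems = validate_record(sample)
```

A per-use-case variant can be passed through the `schema` argument, which keeps the default untouched.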

Mapping to controls

Each QA category mapped to CBUAE, FSRA, NCA and ISO 42001.

The bundle ships with an editable annex per regulator. Below is the structural map.

Capability	Brocode artefact	CBUAE Model Risk	FSRA AI Principles	NCA AI Ethics + ECC	ISO 42001
Model-evaluation harness	lm-evaluation-harness + OpenAI Evals + DeepEval + Promptfoo + MLflow	Model Risk principle: documented validation pre-release	AI Principle: testing fitness for purpose	ECC AI-SP-01 testing controls	ISO 42001 §8.3 model performance
Regression suite	Golden-dataset + CI-blocking thresholds	Periodic validation triggers	Continuous monitoring principle	ECC change-management controls	ISO 42001 §8.4 change control
Bias and fairness pack	Fairlearn + AIF360 + UAE-context schema	Fair-treatment expectations	Fairness principle (named)	NCA AI ethics framework alignment	ISO 42001 §6.1.4 fairness
Drift detection	Evidently + Brocode PSI pack	Continuous monitoring principle	Operational monitoring principle	ECC operational monitoring	ISO 42001 §9.1 monitoring
LLM red-teaming	garak + UAE adversarial pack	Model-risk adversarial testing expectation	Adoption risk testing	ECC AI-SP-02 robustness testing	ISO 42001 §6.1.5 robustness
Adversarial robustness	ART evasion + poisoning	Security testing expectation	AI security principle	ECC AI security control	ISO 42001 §8.5 information security
Guardrails + WORM audit trail	NeMo + Llama Guard 3 + Arabic classifier; WORM log	Audit-trail expectation	Audit trail principle	ECC logging controls	ISO 42001 §7.5 documented information

Case study

Three UAE tier-1 banks: 9 weeks to 11 days.

After adopting the audit-evidence bundle, the model-risk functions at three UAE tier-1 banks cut their average AI use-case approval cycle from 9 weeks to 11 days, without dropping a single regulator control.

Second-line head of model risk, anonymised UAE tier-1 bank

"First vendor that gave us audit evidence before audit asked. The bundle arrived ready-mapped to CBUAE controls — we extended it for our DFSA workload in a week, not a quarter."

Audit-evidence review, Q4 2025

Versus the alternatives

Traditional QA, Big-4 risk assurance, in-house, model-vendor evals.

Capability	Brocode	Traditional QA shop	Big-4 risk assurance	In-house QA team
Model-evaluation harness (not a slide)	Live, versioned, re-runnable in MLflow	Functional tests only	Methodology overview	Pre-launch checklist
Bias testing with a UAE-context demographic schema			US-Census categories by default
LLM red-teaming with a curated UAE adversarial pack	garak + Arabic / Khaleeji adversarial pack	No	Methodology only	No
Drift detection wired to client ticketing			Reporting only	In SLA-only
Audit-evidence bundle pre-mapped to CBUAE / FSRA / NCA / ISO 42001			Per-engagement build	Internal-only
Approval cycle for an AI use case	11 days (for three UAE tier-1 banks)	9 weeks	7–12 weeks	4–9 weeks

Free download

2026 Audit-Evidence Bundle for AI

A 64-page PDF combining one redacted full evidence bundle, a control-mapping appendix (CBUAE Model Risk, FSRA AI Principles, NCA AI Ethics, ISO 42001), and a fill-in template your risk team can adopt as a default.

Evaluation harness — sample report with thresholds and pass/fail
Regression suite — golden-dataset pattern and CI integration
Bias and fairness pack — Fairlearn + AIF360 + UAE-context schema
Drift detector pattern — Evidently + Brocode PSI pack
Red-team and adversarial — garak + ART, by category
Guardrails plane — NeMo + Llama Guard 3 + Arabic classifier
Mapping to controls — CBUAE / FSRA / NCA / ISO 42001

Risk-function questions

What the second and third lines of defence ask first.

Yes. The sample bundle in the free download contains a redacted full evaluation report from a real engagement, including the per-segment metrics, calibration curves, faithfulness scores for RAG outputs, and the CI-blocking thresholds that the candidate model tripped. Under NDA we share live MLflow access during the review so your risk team can re-run the evaluation on a new dataset.

Book the audit-evidence review

The Head of AI Risk & QA walks the sample bundle with your team.

A senior AI Risk & QA lead responds within one business day. If your audit submission is inside two weeks, tell us in the form and we will prioritise.

Prefer chat? Message us on WhatsApp.
