Brocode Solutions | AI Software Development

eval-harness.run

The test suite behind every model in production.

Evaluation, regression, bias and fairness, drift, red-teaming, adversarial robustness and guardrails — packaged as an audit-evidence bundle your second and third lines of defence can attach to the regulator submission.

model.eval.run #4129

PASS
  • accuracy
    0.913 (pass)
  • calibration
    ECE 0.041 (pass)
  • robustness
    ART evasion (pass)
  • fairness
    EO gap 0.018 (pass)
  • faithfulness
    RAGAS 0.86 (pass)
  • drift
    PSI 0.07 (pass)
  • red-team
    garak weekly (pass)

artefacts emitted

  • Evaluation Report
  • Bias Report
  • Red-Team Report
  • Drift Baseline
  • Regression Diff
  • Control Mapping

Why generic QA fails at AI

Functional testing without an eval harness is not AI QA.

A model can pass every traditional software test and still fail catastrophically in production. There are no fixed outputs; failure modes are statistical; the system can degrade silently for months. Traditional QA shops are strong on functional testing and have nothing for evaluation, bias, red-teaming or drift. Big-4 risk-assurance practices ship methodology overviews and seldom an actual eval-harness run.

Brocode treats AI quality as a continuous engineering discipline — seven test categories on every model, every release, every retrain. Each category names the framework (Fairlearn, AIF360, garak, Evidently, ART, NeMo Guardrails, Llama Guard 3) and emits an artefact that ends up in the audit-evidence bundle. The bundle is what your second and third lines of defence attach to the regulator submission.

Seven test categories on every model

Eval harness, regression, bias, drift, red-team, robustness, guardrails.

Each category names the frameworks, the thresholds, and the artefact it emits to the audit-evidence bundle.

category.01

Model-evaluation harness

lm-evaluation-harness, OpenAI Evals, DeepEval, Promptfoo, MLflow

An in-house extension of lm-evaluation-harness for classical ML and a custom LLM eval framework on top of OpenAI Evals + DeepEval + Promptfoo. Every evaluation run is versioned in MLflow and re-runnable on demand by the client risk team.

Thresholds

CI-blocking on accuracy, F1, calibration; LLM faithfulness and answer-relevance via RAGAS
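As a rough illustration of what a CI-blocking calibration check might look like, the sketch below computes a binned expected calibration error (ECE) and compares a metrics dictionary against thresholds. The threshold values, the `THRESHOLDS` table and the `gate` function are hypothetical placeholders, not Brocode's actual harness or limits.

```python
from typing import Sequence

def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(acc - avg_conf)
    return ece

# Hypothetical thresholds: ("min", x) means value >= x, ("max", x) means value <= x.
THRESHOLDS: dict[str, tuple[str, float]] = {
    "accuracy": ("min", 0.90),
    "f1": ("min", 0.85),
    "ece": ("max", 0.05),
}

def gate(metrics: dict[str, float],
         thresholds: dict[str, tuple[str, float]] = THRESHOLDS) -> list[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    failures = []
    for name, (kind, bound) in thresholds.items():
        value = metrics[name]
        ok = value >= bound if kind == "min" else value <= bound
        if not ok:
            failures.append(f"{name}={value:.3f} violates {kind} {bound}")
    return failures
```

In a CI job, a non-empty list from `gate` would typically be turned into a non-zero exit code so that the release is blocked.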

category.02

Regression suite

Golden datasets, RAGAS, MLflow

Golden-dataset regression for every model release. Candidate-vs-incumbent comparison across the full evaluation pack — not just headline metrics — to catch silent regressions that average accuracy hides.

Thresholds

No metric may regress by more than 0.5 percentage points (absolute) or 1.5% (relative) without sign-off
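One way to read this rule is sketched below: compare candidate and incumbent per metric and per segment, and flag anything that drops by more than either limit. The sketch assumes metrics on a 0 to 1 scale and treats the two limits as an "either one triggers" rule; the `Regression` and `find_regressions` names and the segment keys are illustrative, not part of the actual suite.

```python
from dataclasses import dataclass

ABS_LIMIT = 0.005  # 0.5 percentage points on a 0-1 metric scale
REL_LIMIT = 0.015  # 1.5% of the incumbent value

@dataclass(frozen=True)
class Regression:
    metric: str
    segment: str
    abs_drop: float
    rel_drop: float

def find_regressions(
    incumbent: dict[tuple[str, str], float],
    candidate: dict[tuple[str, str], float],
    lower_is_better: frozenset[str] = frozenset(),
) -> list[Regression]:
    """Flag every (metric, segment) pair that worsens beyond either limit."""
    flagged = []
    for (metric, segment), old in incumbent.items():
        new = candidate[(metric, segment)]
        # For error-style metrics (e.g. ECE) an increase is the regression.
        drop = new - old if metric in lower_is_better else old - new
        if drop <= 0:
            continue  # unchanged or improved
        rel = drop / abs(old) if old != 0 else float("inf")
        if drop > ABS_LIMIT or rel > REL_LIMIT:
            flagged.append(Regression(metric, segment, drop, rel))
    return flagged
```

Keying the comparison by segment as well as metric is what lets the check surface a segment-level drop even when the overall figure improves.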

category.03

Bias and fairness pack

Fairlearn, AIF360, Brocode UAE-context schema

Demographic-parity, equalised-odds, equal-opportunity and predictive-parity tests using Fairlearn and AIF360, with a UAE-context demographic schema (nationality bands, language preference, residency status) — never US Census categories.

Thresholds

Demographic parity, equalised odds, equal opportunity, predictive parity — thresholds set per use case with the second line of defence
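To make three of these metrics concrete, here is a minimal, dependency-free sketch that computes per-group selection rate, true-positive rate and false-positive rate, then reports the largest between-group gaps. It is a hand-rolled stand-in for illustration only; Fairlearn ships equivalents such as `demographic_parity_difference` and `equalized_odds_difference`. Predictive parity is omitted for brevity, and the function names here are assumptions.

```python
from collections import defaultdict
from typing import Hashable, Sequence

def group_rates(y_true: Sequence[int], y_pred: Sequence[int],
                groups: Sequence[Hashable]) -> dict:
    """Per-group selection rate, TPR and FPR for binary labels and predictions."""
    counts = defaultdict(lambda: {"n": 0, "sel": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["sel"] += p
        if t == 1:
            c["pos"] += 1
            c["tp"] += p
        else:
            c["neg"] += 1
            c["fp"] += p
    rates = {}
    for g, c in counts.items():
        if c["pos"] == 0 or c["neg"] == 0:
            raise ValueError(f"group {g!r} needs both positive and negative labels")
        rates[g] = {
            "selection_rate": c["sel"] / c["n"],
            "tpr": c["tp"] / c["pos"],
            "fpr": c["fp"] / c["neg"],
        }
    return rates

def _gap(rates: dict, key: str) -> float:
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

def fairness_gaps(y_true, y_pred, groups) -> dict[str, float]:
    """Largest between-group differences for three of the four named metrics."""
    rates = group_rates(y_true, y_pred, groups)
    tpr_gap = _gap(rates, "tpr")
    fpr_gap = _gap(rates, "fpr")
    return {
        "demographic_parity_gap": _gap(rates, "selection_rate"),
        "equal_opportunity_gap": tpr_gap,
        "equalised_odds_gap": max(tpr_gap, fpr_gap),
    }
```

Two groups can have identical selection rates and still differ sharply in error rates, which is one reason the pack reports several metrics rather than a single number.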

category.04

Drift detectors

Evidently AI, Brocode PSI pack

Evidently AI plus a custom Brocode population-stability-index pack, with alerting wired to the client ticketing system. Covariate drift, concept drift and prediction drift are monitored independently.

Thresholds

Severity-tiered alerts on covariate, concept and prediction drift; SLA: drift-alert response within 24 hours
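For readers unfamiliar with the population stability index, the sketch below shows the usual formula, the sum over bins of (actual share minus expected share) times ln(actual share / expected share), with a severity tier on top. The equal-width binning, the epsilon floor and the 0.10 / 0.25 cut-offs are common rules of thumb assumed for illustration; they are not the contents of the Brocode PSI pack.

```python
import math
from typing import Sequence

def psi(baseline: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index over equal-width bins fitted on the baseline."""
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; cannot build bins")
    width = (hi - lo) / n_bins

    def shares(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = max(0, min(idx, n_bins - 1))  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = shares(baseline), shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

def severity(psi_value: float) -> str:
    """Hypothetical tiers: below 0.10 ok, below 0.25 warn, otherwise page."""
    if psi_value < 0.10:
        return "ok"
    if psi_value < 0.25:
        return "warn"
    return "page"
```

The same function can be applied separately to input features, model scores and, where labels arrive later, residuals, so that covariate, prediction and concept drift each get their own alert stream.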

category.05

LLM red-teaming

garak, Brocode UAE adversarial pack

Weekly automated red-team using garak plus a Brocode-curated UAE adversarial prompt pack (Arabic jailbreaks, dialect-coded harm prompts, regulator-sensitive topic probes). Manual red-team passes by a named in-house team before every production release.

Thresholds

Weekly automated red-team; manual passes before every production release; zero unmitigated category-1 findings at G3

category.06

Adversarial robustness

Adversarial Robustness Toolbox (ART)

Adversarial Robustness Toolbox (ART) evasion and poisoning tests on tabular and vision models. Findings flow into the regression suite so robustness becomes a CI-blocking criterion, not a pre-launch tickbox.

Thresholds

Evasion and poisoning tests on tabular and vision models; degradation bounds agreed at G2
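The idea of a degradation bound can be illustrated without ART. The sketch below attacks a linear binary classifier with the worst-case L-infinity perturbation of size `eps` (for a linear model this is the sign-of-weight step), then checks whether the accuracy drop stays inside an agreed limit. The model, the `eps` value and the `max_drop` bound are all hypothetical, and this is a stand-in for an ART evasion run rather than ART's API.

```python
from typing import Sequence

def sign(v: float) -> int:
    return (v > 0) - (v < 0)

def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Linear binary classifier: class 1 when w.x + b is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def evade(w: Sequence[float], x: Sequence[float], label: int, eps: float) -> list[float]:
    """Shift every feature by eps in the direction that hurts the true class most."""
    direction = -1 if label == 1 else 1
    return [xi + direction * eps * sign(wi) for wi, xi in zip(w, x)]

def accuracy(w, b, rows, labels) -> float:
    hits = sum(predict(w, b, x) == y for x, y in zip(rows, labels))
    return hits / len(labels)

def robustness_check(w, b, rows, labels, eps: float, max_drop: float) -> dict:
    """Compare clean and adversarial accuracy against an agreed degradation bound."""
    clean = accuracy(w, b, rows, labels)
    attacked = [evade(w, x, y, eps) for x, y in zip(rows, labels)]
    adversarial = accuracy(w, b, attacked, labels)
    drop = clean - adversarial
    return {"clean": clean, "adversarial": adversarial,
            "drop": drop, "passed": drop <= max_drop}
```

Feeding the `passed` flag into the same CI gate as the accuracy metrics is what turns robustness into a blocking criterion.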

category.07

Guardrails plane in production

NeMo Guardrails, Llama Guard 3, Bespoke Arabic policy classifier (we train inside your engagement repo)

NVIDIA NeMo Guardrails plus Llama Guard 3 plus an Arabic policy classifier, with prompt and response logging to a tamper-evident WORM store for the audit trail. Composition order documented in the relevant ADR.

Thresholds

WORM logging of every prompt and response; latency budget for the guardrails chain agreed at G2
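WORM is a property of the storage layer, but the tamper-evidence side can be sketched in software with a hash chain: each record stores the hash of its predecessor, so any later edit breaks verification. The record fields and function names below are assumptions for illustration and do not describe the production log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash for the first record

def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same record always hashes identically.
    payload = json.dumps(body, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def append_record(log: list[dict], prompt: str, response: str) -> dict:
    """Append a prompt/response pair that is chained to the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"seq": len(log), "prompt": prompt, "response": response, "prev_hash": prev_hash}
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record

def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash and link; False means some record was altered or reordered."""
    prev_hash = GENESIS
    for seq, record in enumerate(log):
        body = {k: v for k, v in record.items() if k != "hash"}
        if (record["seq"] != seq
                or record["prev_hash"] != prev_hash
                or record["hash"] != _digest(body)):
            return False
        prev_hash = record["hash"]
    return True
```

In practice the chain head would be anchored somewhere the writer cannot modify, such as the WORM store itself, since a hash chain alone does not stop an attacker from rewriting the whole log.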

  • 7

    Test categories on every model we deliver

  • 11d

    Average approval cycle (was 9 weeks)

  • 4

Regulatory frameworks and standards mapped (CBUAE / FSRA / NCA / ISO 42001)

  • 0

    Unmitigated category-1 red-team findings at G3

Bias and fairness with a UAE-context schema

Nationality bands, language preference, residency status — not US Census categories.

The schema is documented in the bundle, agreed with your second line of defence, and reviewed against the most recent regulator guidance.

See responsible AI & governance

UAE-context demographic schema

  • Nationality bands

    Emirati / GCC / Arab expatriate / South Asian expatriate / Western expatriate / Other

  • Language preference

    MSA / Khaleeji dialect / English / Other Arabic dialect

  • Residency status

    Citizen / Golden visa / Employment visa / Family visa / Visit visa

  • Fairness metrics

    Demographic parity, equalised odds, equal opportunity, predictive parity

Schema is editable per use case; thresholds are agreed with the second line of defence, never imposed.
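An editable schema of this kind could be represented as plain data with a validation step, roughly as sketched below. The category values are taken from the list above; the attribute keys, the `with_overrides` helper and the `validate_record` function are hypothetical names for illustration.

```python
from typing import Mapping, Sequence

DEFAULT_SCHEMA: dict[str, tuple[str, ...]] = {
    "nationality_band": ("Emirati", "GCC", "Arab expatriate", "South Asian expatriate",
                         "Western expatriate", "Other"),
    "language_preference": ("MSA", "Khaleeji dialect", "English", "Other Arabic dialect"),
    "residency_status": ("Citizen", "Golden visa", "Employment visa", "Family visa",
                         "Visit visa"),
}

def with_overrides(schema: Mapping[str, Sequence[str]],
                   overrides: Mapping[str, Sequence[str]]) -> dict[str, tuple[str, ...]]:
    """Return a per-use-case copy of the schema with selected attributes replaced."""
    merged = {name: tuple(values) for name, values in schema.items()}
    merged.update({name: tuple(values) for name, values in overrides.items()})
    return merged

def validate_record(record: Mapping[str, str],
                    schema: Mapping[str, Sequence[str]]) -> list[str]:
    """List every attribute whose value is missing or outside the schema."""
    errors = []
    for attribute, allowed in schema.items():
        value = record.get(attribute)
        if value not in allowed:
            errors.append(f"{attribute}: {value!r} is not an allowed category")
    return errors
```

Validating evaluation data against the agreed schema before computing fairness metrics helps keep a stray or misspelled category from silently forming its own group.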

Mapping to controls

Each QA category mapped to CBUAE, FSRA, NCA and ISO 42001.

The bundle ships with an editable annex per regulator. Below is the structural map.

Capability | Brocode artefact | CBUAE Model Risk | FSRA AI Principles | NCA AI Ethics + ECC | ISO 42001
Model-evaluation harness | lm-evaluation-harness + OpenAI Evals + DeepEval + Promptfoo + MLflow | Model Risk principle: documented validation pre-release | AI Principle: testing fitness for purpose | ECC AI-SP-01 testing controls | ISO 42001 §8.3 model performance
Regression suite | Golden-dataset + CI-blocking thresholds | Periodic validation triggers | Continuous monitoring principle | ECC change-management controls | ISO 42001 §8.4 change control
Bias and fairness pack | Fairlearn + AIF360 + UAE-context schema | Fair-treatment expectations | Fairness principle (named) | NCA AI ethics framework alignment | ISO 42001 §6.1.4 fairness
Drift detection | Evidently + Brocode PSI pack | Continuous monitoring principle | Operational monitoring principle | ECC operational monitoring | ISO 42001 §9.1 monitoring
LLM red-teaming | garak + UAE adversarial pack | Model-risk adversarial testing expectation | Adoption risk testing | ECC AI-SP-02 robustness testing | ISO 42001 §6.1.5 robustness
Adversarial robustness | ART evasion + poisoning | Security testing expectation | AI security principle | ECC AI security control | ISO 42001 §8.5 information security
Guardrails + WORM audit trail | NeMo + Llama Guard 3 + Arabic classifier; WORM log | Audit-trail expectation | Audit trail principle | ECC logging controls | ISO 42001 §7.5 documented information

Case study

Three UAE tier-1 banks: 9 weeks to 11 days.

After adopting the audit-evidence bundle, the model-risk functions at three UAE tier-1 banks cut AI use-case approval cycle time from 9 weeks to 11 days without dropping a single regulator control.

Second-line head of model risk, anonymised UAE tier-1 bank

"First vendor that gave us audit evidence before audit asked. The bundle arrived ready-mapped to CBUAE controls — we extended it for our DFSA workload in a week, not a quarter."

Audit-evidence review, Q4 2025

Versus the alternatives

Traditional QA, Big-4 risk assurance, in-house, model-vendor evals.

Capability | Brocode | Traditional QA shop | Big-4 risk assurance | In-house QA team
Model-evaluation harness (not a slide) | Live, versioned, re-runnable in MLflow | Functional tests only | Methodology overview | Pre-launch checklist
Bias testing with a UAE-context demographic schema | US-Census categories by default
LLM red-teaming with a curated UAE adversarial pack | garak + Arabic / Khaleeji adversarial pack | No | Methodology only | No
Drift detection wired to client ticketing | Reporting only | In SLA-only
Audit-evidence bundle pre-mapped to CBUAE / FSRA / NCA / ISO 42001 | Per-engagement build | Internal-only
Approval cycle for an AI use case | 11 days (for three UAE tier-1 banks) | 9 weeks | 7–12 weeks | 4–9 weeks

Free download

2026 Audit-Evidence Bundle for AI

A 64-page PDF combining one redacted full evidence bundle, a control-mapping appendix (CBUAE Model Risk, FSRA AI Principles, NCA AI Ethics, ISO 42001), and a fill-in template your risk team can adopt as a default.

  • Evaluation harness — sample report with thresholds and pass/fail
  • Regression suite — golden-dataset pattern and CI integration
  • Bias and fairness pack — Fairlearn + AIF360 + UAE-context schema
  • Drift detector pattern — Evidently + Brocode PSI pack
  • Red-team and adversarial — garak + ART, by category
  • Guardrails plane — NeMo + Llama Guard 3 + Arabic classifier
  • Mapping to controls — CBUAE / FSRA / NCA / ISO 42001

Instant download. No spam. Unsubscribe any time.

Risk-function questions

What the second and third lines of defence ask first.

  • Yes. The sample bundle in the free download contains a redacted full evaluation report from a real engagement, including the per-segment metrics, calibration curves, faithfulness scores for RAG outputs, and the CI-blocking thresholds that flipped on the candidate model. Under NDA we share live MLflow access during the review so your risk team can re-run the evaluation on a new dataset.

Book the audit-evidence review

The Head of AI Risk & QA walks the sample bundle with your team.

A senior AI Risk & QA lead responds within one business day. If your audit submission is inside two weeks, tell us in the form and we will prioritise.

Prefer chat? Message us on WhatsApp.

Quote request

Book a 60-minute audit-evidence review

A senior Brocode AI Risk & QA lead walks the sample bundle on screen with your second or third line of defence.

Prefer chat? Message us on WhatsApp — we'll see it within working hours.
