Brocode Solutions: AI Software Development

github.com/brocode

Every commit linked. Every dataset cited. Every benchmark reproducible.

The Open-Source Hour is paid time for every Brocode engineer to contribute upstream — and the contribution graph below is the result.

brocode org — last 12 months

  • 312 merged PRs
  • 14 upstream projects
  • 7 public datasets
  • 1 Khaleeji Benchmark

Open-Source Hour: every engineer, every Friday afternoon, since 2023.

Why we publish

Signed by the CTO and the head of Arabic NLP.

Open source is the one signal a vendor cannot fake. Code that runs in public gets reviewed by people with no commercial interest in agreeing with you. That is the honest test of an engineering practice, and it is the test we built the Open-Source Hour around: paid time, every Friday afternoon, for every engineer in the company, audited quarterly by the CTO.

For buyers doing tie-breaker due diligence between two shortlisted vendors, the contribution graph and the merged-PR ledger answer the engineering-culture question without a single marketing slide. For researchers evaluating Brocode as a future employer or research collaborator, the same artefacts answer the career-relevance question.

Repositories and contributions

The named projects, the upstream PRs, the dataset cards.

Benchmark

Python

brocode/khaleeji-benchmark

Open evaluation suite for Arabic LLMs covering UAE / KSA / Qatar / Kuwait / Bahrain dialect comprehension, MSA reasoning, and Arabic-English code-switching. Public leaderboard with Falcon, Jais, Claude, GPT-4 and Brocode fine-tunes.

1.8k

Dataset

brocode/khaleeji-dialect-corpus-v2

1.4M utterances across UAE, KSA, Qatar, Kuwait and Bahrain dialect markers. Apache-2.0 licensed, dataset card with provenance, statistics and ethics review.

3.2k

Dataset

brocode/uae-government-correspondence-ner

Anonymised NER tags on Arabic correspondence templates, designed for the long-form formal Arabic that dominates UAE government workflows.

0.9k

Dataset

brocode/arabic-financial-extraction-eval

1,800 labelled Arabic invoices and KYC documents for extraction benchmarking. Released with category breakdowns and a reproducer notebook.

0.7k

Upstream PR

Rust

huggingface/tokenizers — Arabic normalisation

Upstream PR fixing ZWNJ, tatweel, and alef-variant handling in the Hugging Face tokenizers package. Merged into main; ships from 0.20.0.

merged
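
For readers unfamiliar with the three character classes named in the tokenizers entry above, the following is a minimal Python sketch of what Arabic normalisation of this kind typically involves. It is not the code from the merged PR (which targets the Rust tokenizers package); the function name `normalise_arabic`, the `strip_zwnj` flag, and the choice to fold every alef variant to bare alef are illustrative assumptions.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida): decorative elongation, no lexical meaning
ZWNJ = "\u200c"     # ZERO WIDTH NON-JOINER: invisible, often pasted in by accident

# Assumption: fold all of these to bare ALEF (U+0627). Real pipelines may keep some.
ALEF_VARIANTS = {
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW
    "\u0671": "\u0627",  # ALEF WASLA
}


def normalise_arabic(text: str, strip_zwnj: bool = True) -> str:
    """Hypothetical helper: drop tatweel, fold alef variants, optionally drop ZWNJ."""
    # Compose first so decomposed sequences are seen in their precomposed form.
    text = unicodedata.normalize("NFC", text)
    table = {ord(TATWEEL): None}
    table.update({ord(src): dst for src, dst in ALEF_VARIANTS.items()})
    if strip_zwnj:
        table[ord(ZWNJ)] = None
    return text.translate(table)


if __name__ == "__main__":
    stretched = "\u0623\u062d\u0645\u0640\u0640\u062f"  # "Ahmad" with hamza-alef and two tatweels
    print(normalise_arabic(stretched) == "\u0627\u062d\u0645\u062f")
```

Stripping ZWNJ is kept behind a flag because the character is meaningful in some Arabic-script languages such as Persian, so removing it unconditionally would be a lossy choice.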

Upstream PR

Python

vllm-project/vllm — RTL paged-attention fixes

Paged-attention edge-case fixes for right-to-left scripts, Arabic-locale tokenisation pathway, and sample serving configs for Falcon-family and Jais models.

merged

Upstream PR

Python

EleutherAI/lm-evaluation-harness — Arabic task pack

Public PR adding seven Arabic MCQA and generation tasks aligned to the Khaleeji Benchmark, with reproducer scripts and dataset cards.

merged

Upstream PR

C

pgvector/pgvector — high-dim build perf

Index-build performance fix on high-dimension Arabic-embedding workloads. Ships from 0.7.4 onwards.

merged

  • 312 merged PRs across 14 upstream projects
  • 7 public Arabic datasets (4.6M records)
  • 1 open Arabic-LLM benchmark with leaderboard
  • Fri: Open-Source Hour for every engineer, every week

The Khaleeji Benchmark

An open Arabic LLM benchmark with a public leaderboard.

UAE / KSA / Qatar / Kuwait / Bahrain dialect comprehension, MSA reasoning, Arabic-English code-switching. Refreshed within 14 days of any major frontier-model release.

View the leaderboard on GitHub
Model | Dialect | MSA reasoning | Code-switch
Claude 3.5 Sonnet | 74.1 | 82.4 | 79.6
GPT-4o | 71.3 | 83.9 | 78.2
Brocode-Jais-FT-v3 | 78.6 | 76.1 | 74.4
Falcon-3-7B-instruct | 64.5 | 69.7 | 66.0
Jais-30b-chat-v3 | 67.2 | 71.8 | 68.9

Illustrative snapshot from the public leaderboard; live numbers update on each release.
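
As a reading aid for the snapshot, here is a small Python sketch that finds the top-scoring model on each track. The scores are copied from the illustrative snapshot on this page, not from the live leaderboard, and the names `SNAPSHOT` and `best_per_track` are hypothetical. The benchmark's own aggregation method is not described here, so the sketch compares tracks one at a time and does not compute an overall score.

```python
# Scores copied from the illustrative snapshot; live leaderboard numbers may differ.
SNAPSHOT = {
    "Claude 3.5 Sonnet": {"Dialect": 74.1, "MSA reasoning": 82.4, "Code-switch": 79.6},
    "GPT-4o": {"Dialect": 71.3, "MSA reasoning": 83.9, "Code-switch": 78.2},
    "Brocode-Jais-FT-v3": {"Dialect": 78.6, "MSA reasoning": 76.1, "Code-switch": 74.4},
    "Falcon-3-7B-instruct": {"Dialect": 64.5, "MSA reasoning": 69.7, "Code-switch": 66.0},
    "Jais-30b-chat-v3": {"Dialect": 67.2, "MSA reasoning": 71.8, "Code-switch": 68.9},
}


def best_per_track(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Return, for each track, the name of the model with the highest score."""
    tracks = next(iter(scores.values())).keys()
    return {
        track: max(scores, key=lambda model: scores[model][track])
        for track in tracks
    }


if __name__ == "__main__":
    for track, model in best_per_track(SNAPSHOT).items():
        print(f"{track}: {model} ({SNAPSHOT[model][track]})")
```

In this snapshot the dialect track is led by the Brocode fine-tune, while the two general-purpose frontier models lead MSA reasoning and code-switching.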

Versus the alternatives

What an engineering-culture due-diligence comparison actually looks like.

Capability | Brocode | Offshore consultancies | Big-4 AI practice | Sovereign-only integrator
Public commit graph for the consultancy org | Yes: Brocode org + named maintainer handles | No | No | Limited internal repos
Arabic NLP datasets on Hugging Face | 7 datasets, 4.6M labelled records | 0 | 0 | 1–2, internal-only
Merged PRs to vLLM / Transformers / tokenizers / pgvector | 312 merged across 14 projects | 0–5 | 0 | <20
Open Arabic LLM benchmark with leaderboard | Khaleeji Benchmark, public leaderboard | No | No | Internal moat
Paid Open-Source Hour for every engineer | Yes: every Friday afternoon since 2023 | No | No | No
Named maintainer attribution | Yes: engineers commit under their own GitHub handles | Anonymised | Anonymised | Anonymised

The Open-Source Hour policy

Paid time, every Friday afternoon, audited by the CTO.

It is a clause in our employment contract. It is a line in our delivery rate card. It is the reason the contribution graph above is real.

  1. Every engineer, every Friday afternoon

    No exceptions for delivery pressure. Four hours of paid time on Friday afternoon for upstream work; if a sprint demands the time, the sprint plan is wrong.

  2. Audited by the CTO every quarter

    Per-engineer contribution counts are reviewed by the CTO; zero-contribution quarters trigger a conversation, not a sanction.

  3. Public attribution by consent

    Engineers commit under their own GitHub handles with their consent — never aggregated into a single corporate handle that obscures who did the work.

Free download

Brocode AI Open-Source Report 2026

The 36-page PDF with the full commit ledger by repository, every dataset card with provenance and statistics, the Khaleeji Benchmark methodology, and a public bibliography.

  • Khaleeji Benchmark (open dataset + leaderboard)
  • Arabic NLP datasets — 7 datasets, 4.6M labelled records
  • vLLM contributions — RTL paged-attention, Arabic-locale path
  • tokenizers contributions — ZWNJ, tatweel, alef-variant fixes
  • pgvector contributions — high-dim index-build perf fix
  • Internal tools we open-sourced
  • Open-Source Hour audit ledger

Instant download. No spam. Unsubscribe any time.

Engineering-culture questions

What buyers and researchers ask before a maintainer call.

  • The contribution graph spans 18 months of activity with dense weeks across every quarter, driven by the Open-Source Hour policy that pays every engineer for one Friday afternoon a week to contribute upstream. The report contains the per-quarter contributor count, PR-merge cadence and dataset-release schedule — none of it is a marketing burst.

Talk to a maintainer

Arabic NLP, serving infra, data engineering, or MLOps.

The form routes directly to the maintainer specialism you choose. One business day to first response.

Researcher or engineer interested in joining? See Brocode engineering careers.

Prefer chat? Message us on WhatsApp.

Quote request

Talk to a Brocode open-source maintainer

A maintainer on the Khaleeji Benchmark, the Arabic NLP datasets, the vLLM PRs or the tokenizers PRs replies within one business day.

Prefer chat? Message us on WhatsApp — we'll see it within working hours.
