github.com/brocode

Every commit linked. Every dataset cited. Every benchmark reproducible.

The Open-Source Hour is paid time for every Brocode engineer to contribute upstream — and the contribution graph below is the result.

Talk to an open-source maintainer

Download the 2026 Open-Source Report

Contribution graph: brocode org, last 12 months

312 merged PRs · 14 upstream projects · 7 public datasets · 1 Khaleeji Benchmark

Open-Source Hour: every engineer, every Friday afternoon, since 2023.

Why we publish

Signed by the CTO and the head of Arabic NLP.

Open source is the one signal a vendor cannot fake for a buyer or a researcher. Code that runs in public gets reviewed by people with no commercial interest in agreeing with you. That is the honest test of an engineering practice, and it is the test we built the Open-Source Hour around: paid time, every Friday afternoon, for every engineer in the company, audited quarterly by the CTO.

For buyers doing tie-breaker due diligence between two shortlisted vendors, the contribution graph and the merged-PR ledger answer the engineering-culture question without a single marketing slide. For researchers evaluating Brocode as a future employer or research collaborator, the same artefacts answer the career-relevance question.

Repositories and contributions

The named projects, the upstream PRs, the dataset cards.

Benchmark

Python

brocode/khaleeji-benchmark

Open evaluation suite for Arabic LLMs covering UAE / KSA / Qatar / Kuwait / Bahrain dialect comprehension, MSA reasoning, and Arabic-English code-switching. Public leaderboard with Falcon, Jais, Claude, GPT-4 and Brocode fine-tunes.

★ 1.8k

Dataset

brocode/khaleeji-dialect-corpus-v2

1.4M utterances across UAE, KSA, Qatar, Kuwait and Bahrain dialect markers. Apache-2.0 licensed, dataset card with provenance, statistics and ethics review.

★ 3.2k

Dataset

brocode/uae-government-correspondence-ner

Anonymised NER tags on Arabic correspondence templates, designed for the long-form formal Arabic that dominates UAE government workflows.

★ 0.9k

Dataset

brocode/arabic-financial-extraction-eval

1,800 labelled Arabic invoices and KYC documents for extraction benchmarking. Released with category breakdowns and a reproducer notebook.

★ 0.7k

Upstream PR

Rust

huggingface/tokenizers — Arabic normalisation

Upstream PR fixing ZWNJ, tatweel, and alef-variant handling in the Hugging Face tokenizers package. Merged into main; ships from 0.20.0.

★ merged
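
To make the three normalisation concerns concrete, here is a minimal Python sketch of what handling tatweel, ZWNJ and alef variants can look like. It is an illustration only, not the code from the upstream PR (which lives in the Rust tokenizers package); the function name `normalise_arabic` and the choice to drop ZWNJ rather than preserve it are assumptions made for this example.

```python
import unicodedata

TATWEEL = "\u0640"  # elongation character, purely typographic
ZWNJ = "\u200c"     # zero-width non-joiner

# Alef variants folded to bare alef (U+0627). Which variants to fold is an
# assumption of this sketch; real pipelines may keep some of them distinct.
ALEF_VARIANTS = {
    "\u0622": "\u0627",  # alef with madda above
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0671": "\u0627",  # alef wasla
}

# One translation table: fold alef variants, delete tatweel and ZWNJ.
_TABLE = str.maketrans({**ALEF_VARIANTS, TATWEEL: None, ZWNJ: None})


def normalise_arabic(text: str) -> str:
    """Hypothetical helper: compose to NFC, then apply the folding table."""
    # NFC first, so a decomposed alef + combining madda becomes U+0622
    # and is then caught by the table.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_TABLE)
```

Dropping ZWNJ is a lossy choice: in some orthographies it carries meaning, so a production normaliser might keep it or make the behaviour configurable.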

Upstream PR

Python

vllm-project/vllm — RTL paged-attention fixes

Paged-attention edge-case fixes for right-to-left scripts, an Arabic-locale tokenisation pathway, and sample serving configs for Falcon-family and Jais models.

★ merged

Upstream PR

Python

EleutherAI/lm-evaluation-harness — Arabic task pack

Public PR adding seven Arabic MCQA and generation tasks aligned to the Khaleeji Benchmark, with reproducer scripts and dataset cards.

★ merged

Upstream PR

pgvector/pgvector — high-dim build perf

Index-build performance fix on high-dimension Arabic-embedding workloads. Ships from 0.7.4 onwards.

★ merged

312 merged PRs across 14 upstream projects
7 public Arabic datasets (4.6M records)
1 open Arabic-LLM benchmark with leaderboard
Every Friday: Open-Source Hour for every engineer, every week

The Khaleeji Benchmark

An open Arabic LLM benchmark with a public leaderboard.

UAE / KSA / Qatar / Kuwait / Bahrain dialect comprehension, MSA reasoning, Arabic-English code-switching. Refreshed within 14 days of any major frontier-model release.

View the leaderboard on GitHub

Model	Dialect	MSA reasoning	Code-switch
Claude 3.5 Sonnet	74.1	82.4	79.6
GPT-4o	71.3	83.9	78.2
Brocode-Jais-FT-v3	78.6	76.1	74.4
Falcon-3-7B-instruct	64.5	69.7	66.0
Jais-30b-chat-v3	67.2	71.8	68.9

Illustrative snapshot from the public leaderboard; live numbers update on each release.
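
One way to read the snapshot is per track rather than as a single ranking. The Python sketch below uses the illustrative numbers from the table to find the leader on each track and to sort models by an unweighted mean. The unweighted mean is an assumption of this example: the page does not say how, or whether, the leaderboard aggregates the three tracks, and the names `SNAPSHOT`, `best_per_track` and `rank_by_mean` are hypothetical.

```python
from statistics import mean

# Scores copied from the illustrative snapshot above.
SNAPSHOT = {
    "Claude 3.5 Sonnet": {"dialect": 74.1, "msa_reasoning": 82.4, "code_switch": 79.6},
    "GPT-4o": {"dialect": 71.3, "msa_reasoning": 83.9, "code_switch": 78.2},
    "Brocode-Jais-FT-v3": {"dialect": 78.6, "msa_reasoning": 76.1, "code_switch": 74.4},
    "Falcon-3-7B-instruct": {"dialect": 64.5, "msa_reasoning": 69.7, "code_switch": 66.0},
    "Jais-30b-chat-v3": {"dialect": 67.2, "msa_reasoning": 71.8, "code_switch": 68.9},
}


def best_per_track(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Return the top-scoring model name for each track."""
    tracks = next(iter(scores.values())).keys()
    return {t: max(scores, key=lambda m: scores[m][t]) for t in tracks}


def rank_by_mean(scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Sort models by the unweighted mean of their track scores (assumed aggregate)."""
    means = ((model, round(mean(s.values()), 2)) for model, s in scores.items())
    return sorted(means, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for track, model in best_per_track(SNAPSHOT).items():
        print(f"{track}: {model}")
    for model, avg in rank_by_mean(SNAPSHOT):
        print(f"{model}: {avg}")
```

On these numbers a different model leads each track, which is why a single averaged score can hide the dialect result the benchmark is built to surface.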

Versus the alternatives

What an engineering-culture due-diligence comparison actually looks like.

Capability	Brocode	Offshore consultancies	Big-4 AI practice	Sovereign-only integrator
Public commit graph for the consultancy org	Yes — Brocode org + named maintainer handles	No	No	Limited internal repos
Arabic NLP datasets on Hugging Face	7 datasets, 4.6M labelled records	0	0	1–2, internal-only
Merged PRs to vLLM / Transformers / tokenizers / pgvector	312 merged across 14 projects	0–5	0	<20
Open Arabic LLM benchmark with leaderboard	Khaleeji Benchmark — public leaderboard	No	No	Internal moat
Paid Open-Source Hour for every engineer	Yes — every Friday afternoon since 2023	No	No	No
Named maintainer attribution	Yes — engineers commit under their own GitHub handles	Anonymised	Anonymised	Anonymised

The Open-Source Hour policy

Paid time, every Friday afternoon, audited by the CTO.

It is a clause in our employment contract. It is a line in our delivery rate card. It is the reason the contribution graph above is real.

policy.1: Every engineer, every Friday afternoon
No exceptions for delivery pressure. Four hours of paid time on Friday afternoon for upstream work; if a sprint demands the time, the sprint plan is wrong.

policy.2: Audited by the CTO every quarter
Per-engineer contribution counts are reviewed by the CTO; zero-contribution quarters trigger a conversation, not a sanction.

policy.3: Public attribution by consent
Engineers commit under their own GitHub handles with their consent, never aggregated into a single corporate handle that obscures who did the work.
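
The quarterly audit described in policy.2 amounts to a simple count per engineer per quarter. The Python sketch below shows one way such a check could be expressed. It is a hypothetical illustration, not Brocode's audit tooling: the roster, the handles, the `(handle, merged_on)` record shape and the function names are all assumptions.

```python
from collections import Counter
from datetime import date


def quarter_of(d: date) -> tuple[int, int]:
    """Map a date to (year, quarter), with quarters numbered 1 to 4."""
    return d.year, (d.month - 1) // 3 + 1


def zero_contribution_engineers(
    roster: list[str],
    merged_prs: list[tuple[str, date]],
    quarter: tuple[int, int],
) -> list[str]:
    """List roster handles with no merged PR in the given quarter.

    merged_prs holds (handle, merged_on) pairs; this record shape is assumed.
    """
    counts = Counter(
        handle for handle, merged_on in merged_prs if quarter_of(merged_on) == quarter
    )
    return sorted(handle for handle in roster if counts[handle] == 0)


if __name__ == "__main__":
    # Hypothetical handles and dates, for illustration only.
    roster = ["amal", "omar", "sara"]
    prs = [
        ("amal", date(2025, 1, 10)),
        ("amal", date(2025, 3, 31)),
        ("omar", date(2025, 4, 2)),
    ]
    print(zero_contribution_engineers(roster, prs, (2025, 1)))
```

The function only produces a list of names; per the policy, what follows is a conversation rather than a sanction.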

Free download

Brocode AI Open-Source Report 2026

The 36-page PDF with the full commit ledger by repository, every dataset card with provenance and statistics, the Khaleeji Benchmark methodology, and a public bibliography.

Khaleeji Benchmark (open dataset + leaderboard)
Arabic NLP datasets — 7 datasets, 4.6M labelled records
vLLM contributions — RTL paged-attention, Arabic-locale path
tokenizers contributions — ZWNJ, tatweel, alef-variant fixes
pgvector contributions — high-dim index-build perf fix
Internal tools we open-sourced
Open-Source Hour audit ledger

Engineering-culture questions

What buyers and researchers ask before a maintainer call.

The contribution graph spans 18 months of activity, with dense weeks in every quarter. That cadence comes from the Open-Source Hour policy, which pays every engineer for one Friday afternoon a week of upstream work, not from a one-off marketing burst. The report lists the per-quarter contributor count, PR-merge cadence and dataset-release schedule.

Talk to a maintainer

Arabic NLP, serving infra, data engineering, or MLOps.

The form routes directly to the maintainer specialism you choose. One business day to first response.

Researcher or engineer interested in joining? See Brocode engineering careers.

Prefer chat? Message us on WhatsApp.

