Brocode Solutions · AI Software Development

Inference engineering · edge · low-latency cloud

Production-grade model optimisation. 4.2× faster on your target hardware.

Quantisation, pruning, distillation, kernel-level acceleration — hitting your latency, memory and accuracy budget at the same time. The optimised model is handed back as code you own and can rebuild.

brocode-bench@jetson-orin-nx
{
  "baseline_ms":     84.2,
  "optimised_ms":    18.9,
  "accuracy_baseline": 0.683,
  "accuracy_optimised": 0.677,
  "size_mb_before":  312,
  "size_mb_after":   41,
  "hardware":        "Jetson Orin NX 16GB",
  "stack":           "TensorRT 10 / INT8 PTQ"
}
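
The headline ratios for this one record can be recomputed from its fields. The following is a minimal sketch that copies the figures above into a dictionary; the function name `summarise` is illustrative and not part of any published harness.

```python
# Figures copied from the benchmark record above.
bench = {
    "baseline_ms": 84.2,
    "optimised_ms": 18.9,
    "accuracy_baseline": 0.683,
    "accuracy_optimised": 0.677,
    "size_mb_before": 312,
    "size_mb_after": 41,
}

def summarise(b):
    """Return (speedup, accuracy delta in percentage points, size reduction in %)."""
    speedup = b["baseline_ms"] / b["optimised_ms"]
    delta_pp = (b["accuracy_optimised"] - b["accuracy_baseline"]) * 100
    size_cut = (1 - b["size_mb_after"] / b["size_mb_before"]) * 100
    return speedup, delta_pp, size_cut

speedup, delta_pp, size_cut = summarise(bench)
print(f"{speedup:.2f}x faster, {delta_pp:+.1f} pp accuracy, {size_cut:.1f}% smaller")
# prints: 4.46x faster, -0.6 pp accuracy, 86.9% smaller
```

These are the numbers for a single run, so they differ from the medians quoted for the full set of cases.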

Latency–accuracy Pareto

your data, your hardware

  • 4.2× median latency reduction (11 cases)
  • −0.4 pp median accuracy delta on client data
  • 78% median model size reduction
  • 11 public benchmark cases on GitHub

Three optimisation problems we solve

Edge. Mobile. Low-latency cloud.

The model works in Colab. It does not work on the Jetson, on the phone, or under a 50 ms SLA. Every internal attempt to "just try TensorRT" or "just try INT8" has crashed, dropped accuracy by six points, or saved twelve percent when seventy was needed. We have shipped against all three failure modes.

Edge / embedded

Drones, robotics, plant-floor vision.

YOLOv8 on Jetson Orin NX. Latency 84 ms → 19 ms. mAP held within 0.6 points. Real silicon, not simulator. The benchmark harness is on GitHub.

Mobile / on-device

Fintech super-apps, insurance claims apps.

Mobile CV model on Snapdragon 8 Gen 3. 312 MB → 41 MB. On-device latency 38 ms. Shipping in production iOS and Android. Core ML and LiteRT artefacts handed back to the customer.

Low-latency cloud

Fraud scoring, ASR, LLM serving at scale.

7B-parameter LLM on a single L40S. TTFT 480 ms → 110 ms. Tokens-per-second 3.4×. Speculative decoding, paged attention, KV-cache quantisation. vLLM + custom kernels.

The optimisation toolchain

Eleven tools. When we use each one.

Pinned versions, production-tested. No vendor-loyalty bias — we use TensorRT when NVIDIA is the target, AIMET when Snapdragon is the target, and Core ML when iOS is the target. The right tool for the silicon.

NVIDIA TensorRT 10 + TensorRT-LLM

Datacentre GPU and Jetson — peak throughput on NVIDIA silicon

NVIDIA Triton Inference Server

Multi-model serving with dynamic batching and ensemble pipelines

vLLM + TGI

LLM serving with paged attention, continuous batching, speculative decoding

ONNX Runtime

Cross-target portable inference — desktop, mobile, web execution providers

PyTorch 2.x + torch.compile + TorchAO

First-pass profiling, QAT, and the optimisation baseline

OpenVINO

Intel CPU and integrated graphics on edge industrial gear

Qualcomm AIMET + SNPE

Snapdragon mobile and automotive inference

Apple Core ML + coremltools

iOS on-device CV and audio with Neural Engine acceleration

Google AI Edge / LiteRT + MediaPipe

Android on-device pipelines and TFLite Micro

Hailo SDK

Hailo-8 and Hailo-15 NPUs in edge cameras and embedded boards

AWS Neuron SDK

Inferentia-2 and Trainium for high-volume cloud inference

Techniques in plain engineer-speak

No marketing. Written like an engineering README.

Post-training quantisation (PTQ)

INT8, INT4 and FP4 schemes. Weight-only and activation-aware. SmoothQuant and GPTQ for LLM weight calibration. Per-channel quantisation and outlier rescue for accuracy-sensitive layers. Usually the first thing we try; sometimes the only thing needed.
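
To make the per-channel point concrete, here is a minimal pure-Python sketch of symmetric INT8 weight quantisation. It is an illustration under stated assumptions, not the calibration code of any particular toolkit: the weight matrix is hypothetical, and every helper name is invented for this example.

```python
def quantise_int8(values, scale):
    """Symmetric INT8: round to the nearest step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantise(q, scale):
    """Map integer codes back to floats."""
    return [x * scale for x in q]

def scale_for(values):
    """Symmetric scale so that the largest magnitude maps to 127."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak > 0 else 1.0

def mean_abs_error(weights, scales):
    """Round-trip error; `scales` holds one entry per row (output channel)."""
    total, count = 0.0, 0
    for row, s in zip(weights, scales):
        restored = dequantise(quantise_int8(row, s), s)
        total += sum(abs(a - b) for a, b in zip(row, restored))
        count += len(row)
    return total / count

# Hypothetical 2x4 weight matrix: the second output channel has a much wider range.
weights = [
    [0.02, -0.01, 0.03, -0.04],
    [1.50, -2.00, 0.75, 4.00],
]

# Per-tensor: one scale for everything, dictated by the widest channel.
shared = scale_for([v for row in weights for v in row])
per_tensor = mean_abs_error(weights, [shared] * len(weights))

# Per-channel: each row gets its own scale, so the narrow row keeps its resolution.
per_channel = mean_abs_error(weights, [scale_for(row) for row in weights])

print(f"per-tensor error:  {per_tensor:.5f}")
print(f"per-channel error: {per_channel:.5f}")
```

With a shared scale the small-valued channel collapses onto one or two integer codes; giving it its own scale removes most of that error at the cost of storing one scale per channel.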

Quantisation-aware training (QAT)

When PTQ drops accuracy beyond the floor, we retrain with fake-quant ops in the graph. Recovers most of the gap at the cost of a training cycle on your data. We do this on-site or air-gapped when IP requires it.
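
A fake-quant op can be sketched in a few lines. The version below is a toy, assuming a fixed scale and a single scalar weight; real QAT frameworks insert equivalent ops per tensor and learn or calibrate the scales. The names `fake_quant` and `fake_quant_grad` are illustrative.

```python
def fake_quant(x, scale, qmin=-127, qmax=127):
    """Forward pass: quantise, clamp, then dequantise, staying in float."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

def fake_quant_grad(x, scale, upstream, qmin=-127, qmax=127):
    """Backward pass with a straight-through estimator (STE).

    round() has zero gradient almost everywhere, so the STE treats it as the
    identity inside the representable range and blocks gradient outside it.
    """
    inside = qmin * scale <= x <= qmax * scale
    return upstream if inside else 0.0

scale = 0.1            # hypothetical fixed step size
target, x = 0.37, 1.0  # hypothetical one-sample regression: want w * x == target
w, lr = 0.0, 0.1       # w is the latent float weight kept during training

for _ in range(50):
    w_q = fake_quant(w, scale)                    # the model sees the quantised weight
    error = w_q * x - target
    grad_wq = 2 * error * x                       # d(loss)/d(w_q) for squared error
    w -= lr * fake_quant_grad(w, scale, grad_wq)  # STE routes it to the float weight

print(f"latent weight {w:.3f} -> deployed weight {fake_quant(w, scale):.1f}")
```

The latent weight settles next to a quantisation boundary, and the deployed value is the nearest representable step to the target. That is the mechanism by which QAT lets the network adapt to the grid it will run on.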

Structured and unstructured pruning

Magnitude pruning, movement pruning, attention-head pruning. Structured pruning preserves the dense kernel path; unstructured needs sparse kernels to realise speedup. We choose per target hardware.
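
The structured versus unstructured distinction is easiest to see on a small matrix. This is a minimal sketch of magnitude pruning only, with a hypothetical 4x3 weight matrix and illustrative function names; movement pruning and attention-head pruning use different importance scores.

```python
def prune_unstructured(weights, sparsity):
    """Zero the smallest-magnitude individual weights; the shape is unchanged."""
    cells = sorted(
        (abs(v), r, c) for r, row in enumerate(weights) for c, v in enumerate(row)
    )
    drop = {(r, c) for _, r, c in cells[: int(len(cells) * sparsity)]}
    return [
        [0.0 if (r, c) in drop else v for c, v in enumerate(row)]
        for r, row in enumerate(weights)
    ]

def prune_structured(weights, sparsity):
    """Remove whole rows (output channels) with the smallest L1 norm."""
    norms = sorted((sum(abs(v) for v in row), r) for r, row in enumerate(weights))
    drop = {r for _, r in norms[: int(len(weights) * sparsity)]}
    return [row for r, row in enumerate(weights) if r not in drop]

# Hypothetical layer: 4 output channels, 3 inputs each.
weights = [
    [0.90, -0.10, 0.40],
    [0.05, 0.02, -0.03],
    [-0.70, 0.60, 0.20],
    [0.01, -0.30, 0.08],
]

sparse = prune_unstructured(weights, 0.5)  # still 4x3, half the entries are zero
dense = prune_structured(weights, 0.5)     # a smaller 2x3 dense matrix

print(sparse)
print(dense)
```

The unstructured result keeps its original shape, so a dense matmul does the same amount of work unless the runtime has sparse kernels. The structured result is simply a smaller dense layer, which is why it speeds up on any backend.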

Knowledge distillation

Teacher–student pipelines on your training data. Particularly effective for narrow domain models where the student can be 10–20× smaller without measurable accuracy loss on the slices that matter.
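
For reference, a common form of the teacher-student objective is the temperature-softened loss sketched below. This is the standard textbook formulation, not necessarily the exact loss used in any engagement; the logits, temperature, and `alpha` weighting are hypothetical values.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: how far the student's softened distribution is from the teacher's.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Hard-target term: ordinary cross-entropy against the ground-truth label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits, true class is 0
student_close = [3.5, 1.2, 0.4]  # student that roughly mimics the teacher
student_far = [0.5, 3.0, 1.0]    # student that disagrees with the teacher

loss_close = distillation_loss(student_close, teacher, label=0)
loss_far = distillation_loss(student_far, teacher, label=0)
print(f"close student: {loss_close:.3f}, far student: {loss_far:.3f}")
```

The `T^2` factor keeps the soft-target gradients on the same scale as the hard-label term when the temperature changes.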

Speculative decoding + KV-cache quantisation

For LLM serving: a small draft model proposes several tokens ahead and the larger model verifies them in a single pass. Quantising the KV cache to INT8 reclaims memory for longer context windows. FlashAttention-3 integration removes attention as the bottleneck.
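
The draft-and-verify loop can be sketched with two toy models. This is the greedy variant only, under loud assumptions: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the verification step is counted as one pass because a real server would score all proposed positions in a single batched forward. Stochastic variants instead accept or reject by comparing draft and target probabilities.

```python
def target_next(context):
    """Hypothetical 'large' model: deterministic greedy next token."""
    return (context[-1] * 3 + 1) % 10

def draft_next(context):
    """Hypothetical 'small' model: agrees with the target except after token 4."""
    return 0 if context[-1] == 4 else target_next(context)

def greedy_decode(prompt, new_tokens):
    """Baseline: one target pass per generated token."""
    tokens = list(prompt)
    for _ in range(new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, new_tokens, lookahead=4):
    tokens = list(prompt)
    goal = len(prompt) + new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes `lookahead` tokens autoregressively.
        proposal = []
        for _ in range(lookahead):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks every proposed position (counted as one pass).
        target_passes += 1
        accepted = []
        for i, guess in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if guess != expected:
                accepted.append(expected)  # first mismatch: take the target's token
                break
            accepted.append(guess)
        else:
            # Everything matched: the same pass also yields one extra token.
            accepted.append(target_next(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal], target_passes

baseline = greedy_decode([1], 12)
spec, passes = speculative_decode([1], 12)
print(spec == baseline, passes)  # prints: True 4
```

The output is identical to plain greedy decoding, but the large model runs 4 times instead of 12. The saving depends entirely on how often the draft agrees with the target.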

Custom CUDA / Triton kernels

When the stack ships nothing fast enough. We have published kernel work for fused attention variants and quantised matmul; our engagement engineers' contribution history is public on GitHub.

Objections answered with evidence

Three things every ML lead asks. Three production references.

Engineers can do it themselves

Anonymised UAE defence integrator: 30 FPS hard deadline.

Real-time vision pipeline on Jetson AGX hitting 30 FPS at full resolution. Custom Triton kernels for fused attention. The customer's engineering team had spent eight weeks on TensorRT and was at 22 FPS; the engagement closed at 31.4 FPS in four weeks.
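
It helps to read those frame rates as per-frame time budgets. The conversion below is simple arithmetic and assumes frames are processed one after another with no pipelining, which may not match the actual deployment.

```python
def frame_budget_ms(fps):
    """Per-frame time budget in milliseconds for a given frame rate."""
    return 1000.0 / fps

for label, fps in [("deadline", 30.0), ("before", 22.0), ("after", 31.4)]:
    print(f"{label:>8}: {fps:4.1f} FPS = {frame_budget_ms(fps):.1f} ms per frame")

saved = frame_budget_ms(22.0) - frame_budget_ms(31.4)
print(f"time removed per frame: {saved:.1f} ms")
```

Under that assumption the 30 FPS deadline is a 33.3 ms budget, and the move from 22 FPS to 31.4 FPS takes each frame from about 45.5 ms to about 31.8 ms.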

Accuracy will tank

Anonymised GCC super-app: on-device document OCR.

Model size under 25 MB shipping in production iOS and Android. Accuracy on the customer evaluation set unchanged within 0.3 percentage points. PTQ with selective FP16 layers for the attention head; QAT was not needed.

Data has IP

Air-gapped engagement option.

Lead engineer flies on-site. All training and benchmarking on the customer hardware. No data egress. Used on three defence-integrator engagements and one regulated-bank engagement in the last 18 months.

The hardware lab

Real silicon. Not simulators.

Benchmarks are run on the same boards your team will ship on. We are honest about what we do not yet have: ask us about a specific board and we will tell you whether it is in-house, on the procurement list, or on loan.

  • Jetson Orin Nano
  • Jetson Orin NX
  • Jetson AGX Orin
  • Hailo-8
  • Coral Edge TPU
  • Snapdragon X dev kit
  • NVIDIA A100
  • NVIDIA H100
  • NVIDIA L40S
  • AWS Inferentia-2
  • Apple M-series
  • Intel Meteor Lake CPU

How we compare

Hyperscaler optimisation services, in-house engineers, research labs.

Hyperscaler services optimise for their hardware. In-house sprints reach a depth ceiling. Research labs publish but do not ship. We are hardware-agnostic, kernel-level capable, and contractually accountable for a published Pareto.

Capability | Brocode | AWS SageMaker Neo | Azure ML optimisation | GCP Vertex | In-house TensorRT sprint
Hardware-agnostic (Jetson, Snapdragon, Hailo, Apple, Intel, GPU, Inferentia) |  | Inferentia / Trainium | Azure VMs / GPUs | TPU / Vertex | Whatever the customer owns
Ships optimised model as portable artefacts (ONNX, TensorRT plan, Core ML, LiteRT) | Customer owns and can rebuild | SageMaker-bound | Azure ML-bound | Vertex-bound | Code, not benchmark
Reproducible benchmark harness on GitHub | The result is verifiable forever, not a one-off claim.
Kernel-level work (custom CUDA, Triton kernels, FlashAttention-3) | Sometimes
Air-gapped engagement option | Engineer flies on-site, no data egress | Limited
Named engineers with public PyTorch / vLLM / ONNX Runtime PRs
Latency-accuracy Pareto contract in SoW | Accuracy floor + p50/p95/p99 ceilings | Vague | Vague | Vague | "It will be faster"

Free download

The Edge & Inference Optimisation Playbook

Thirty-eight pages. Eleven production case studies. The decision tree. The reproducible benchmark harness on GitHub for YOLOv8 / Jetson, Llama-3-8B / L40S, and a mobile CV model on Android and iOS. Median latency reduction 4.2×, median accuracy delta −0.4 percentage points.

  • Pareto frontier methodology — accuracy floor, latency ceiling, memory budget
  • Quantisation: INT8 / INT4 / FP4, when to use which
  • Distillation playbook with teacher–student pipelines
  • Hardware targets: Jetson, Snapdragon, Apple, Intel, GPU, Inferentia
  • Three GitHub reproducer repos with CI workflows
  • When NOT to optimise — the cases we turn away
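
The first topic in that list, a Pareto frontier constrained by an accuracy floor, can be sketched as follows. Only the INT8 PTQ figures and the FP32 baseline come from the benchmark record on this page; the other candidates, the floor value, and the function names are hypothetical and exist only to make the selection visible.

```python
def pareto_frontier(candidates):
    """Keep candidates that no other candidate beats on both axes.

    Each candidate is (name, latency_ms, accuracy); lower latency and
    higher accuracy are better.
    """
    frontier = []
    for name, lat, acc in candidates:
        dominated = any(
            o_lat <= lat and o_acc >= acc and (o_lat < lat or o_acc > acc)
            for _, o_lat, o_acc in candidates
        )
        if not dominated:
            frontier.append((name, lat, acc))
    return sorted(frontier, key=lambda c: c[1])

def fastest_above_floor(frontier, accuracy_floor):
    """Pick the lowest-latency frontier point that still meets the floor."""
    eligible = [c for c in frontier if c[2] >= accuracy_floor]
    return min(eligible, key=lambda c: c[1]) if eligible else None

candidates = [
    ("FP32 baseline", 84.2, 0.683),   # from the benchmark record
    ("FP16", 41.0, 0.683),            # hypothetical
    ("INT8 PTQ", 18.9, 0.677),        # from the benchmark record
    ("INT8 + pruning", 22.0, 0.670),  # hypothetical, dominated by INT8 PTQ
    ("INT4 PTQ", 15.0, 0.640),        # hypothetical
]

frontier = pareto_frontier(candidates)
choice = fastest_above_floor(frontier, accuracy_floor=0.675)
print([name for name, _, _ in frontier])
print(choice)
```

In this illustration the frontier is INT4 PTQ, INT8 PTQ and FP16. The floor of 0.675 rules out INT4, so INT8 PTQ is the fastest acceptable point.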

Instant download. No spam. Unsubscribe any time.

Frequently asked

What ML leads actually want to know.

  • We will measure first. Every engagement opens with a Pareto contract: accuracy floor, p50/p95/p99 latency ceilings, memory budget. We then run PTQ on your evaluation harness — your dataset, not COCO or MMLU — and report the actual accuracy delta. If PTQ misses the floor we move to QAT, distillation, or partial-quant strategies. Median accuracy delta across our published case studies is −0.4 percentage points on the client dataset.
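
A contract of that shape is straightforward to check mechanically. The sketch below is an illustration, not the actual statement-of-work tooling: the ceilings, the latency samples, and the names `percentile` and `check_contract` are hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_contract(contract, accuracy, latencies_ms, memory_mb):
    """Return the list of violated clauses; an empty list means the contract is met."""
    failures = []
    if accuracy < contract["accuracy_floor"]:
        failures.append("accuracy")
    for pct in (50, 95, 99):
        if percentile(latencies_ms, pct) > contract[f"p{pct}_ms"]:
            failures.append(f"p{pct}")
    if memory_mb > contract["memory_mb"]:
        failures.append("memory")
    return failures

# Hypothetical contract and measurements.
contract = {"accuracy_floor": 0.675, "p50_ms": 20.0, "p95_ms": 25.0,
            "p99_ms": 30.0, "memory_mb": 64}

good_run = [18.0] * 60 + [19.5] * 35 + [24.0] * 4 + [41.0] * 1
bad_run = [18.0] * 60 + [19.5] * 35 + [24.0] * 3 + [41.0] * 2

passed = check_contract(contract, 0.677, good_run, 41)
failed = check_contract(contract, 0.670, bad_run, 41)
print(passed)  # prints: []
print(failed)  # prints: ['accuracy', 'p99']
```

Checking the tail percentiles separately matters: the two runs have the same median, and only the p99 clause catches the second one.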

Talk to the principal ML engineer

A 45-minute technical review. No sales people on the call.

Bring the model, the target hardware, the bottleneck, and the SLA. We will tell you what is achievable with PTQ alone, what needs QAT or distillation, and what is genuinely impossible without a redesign. If your team is already at the level needed, we will say so on the call and pass on the engagement.

Quote request

Book a 45-minute technical review with our Principal ML Engineer

A Brocode principal engineer with public PyTorch, vLLM or ONNX Runtime contribution history reviews your bottleneck and target hardware, and replies within one business day.

Prefer chat? Message us on WhatsApp — we'll see it within working hours.
