
Inference engineering · edge · low-latency cloud
Production-grade model optimisation. 4.2× faster on your target hardware.
Quantisation, pruning, distillation, kernel-level acceleration — hitting your latency, memory and accuracy budget at the same time. The optimised model is handed back as code you own and can rebuild.
{
"baseline_ms": 84.2,
"optimised_ms": 18.9,
"accuracy_baseline": 0.683,
"accuracy_optimised": 0.677,
"size_mb_before": 312,
"size_mb_after": 41,
"hardware": "Jetson Orin NX 16GB",
"stack": "TensorRT 10 / INT8 PTQ"
}
Latency–accuracy Pareto
your data, your hardware
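As a rough illustration, the sample report above can be reduced to the three ratios the page quotes. The field names come from the report itself; the `summarise` helper is a hypothetical name used only for this sketch.

```python
import json

# Sample report copied from the page (numeric fields only).
report = json.loads("""
{
  "baseline_ms": 84.2,
  "optimised_ms": 18.9,
  "accuracy_baseline": 0.683,
  "accuracy_optimised": 0.677,
  "size_mb_before": 312,
  "size_mb_after": 41
}
""")


def summarise(r: dict) -> dict:
    """Derive speedup, accuracy delta (percentage points) and size reduction (%)."""
    return {
        "speedup": r["baseline_ms"] / r["optimised_ms"],
        "accuracy_delta_pp": (r["accuracy_optimised"] - r["accuracy_baseline"]) * 100,
        "size_reduction_pct": (1 - r["size_mb_after"] / r["size_mb_before"]) * 100,
    }


summary = summarise(report)
print({name: round(value, 2) for name, value in summary.items()})
```

For this single case the ratios work out to roughly 4.46× on latency, −0.6 pp on accuracy and 86.9% on size. The headline figures are medians across all eleven cases, so they are not expected to match one report exactly.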
4.2×
Median latency reduction (11 cases)
−0.4 pp
Median accuracy delta on client data
78%
Median model size reduction
11
Public benchmark cases on GitHub
Three optimisation problems we solve
Edge. Mobile. Low-latency cloud.
The model works in Colab. It does not work on the Jetson, on the phone, or under a 50 ms SLA. Every internal attempt to "just try TensorRT" or "just try INT8" has crashed, dropped accuracy by six points, or saved twelve percent when seventy was needed. We have shipped against all three failure modes.
Edge / embedded
Drones, robotics, plant-floor vision.
YOLOv8 on Jetson Orin NX. Latency 84 ms → 19 ms. mAP held within 0.6 points. Real silicon, not simulator. The benchmark harness is on GitHub.
Mobile / on-device
Fintech super-apps, insurance claims apps.
Mobile CV model on Snapdragon 8 Gen 3. 312 MB → 41 MB. On-device latency 38 ms. Shipping in production iOS and Android. Core ML and LiteRT artefacts handed back to the customer.
Low-latency cloud
Fraud scoring, ASR, LLM serving at scale.
7B-parameter LLM on a single L40S. TTFT 480 ms → 110 ms. Tokens-per-second 3.4×. Speculative decoding, paged attention, KV-cache quantisation. vLLM + custom kernels.
The optimisation toolchain
Eleven tools. When we use each one.
Pinned versions, production-tested. No vendor-loyalty bias — we use TensorRT when NVIDIA is the target, AIMET when Snapdragon is the target, and Core ML when iOS is the target. The right tool for the silicon.
NVIDIA TensorRT 10 + TensorRT-LLM
Datacentre GPU and Jetson — peak throughput on NVIDIA silicon
NVIDIA Triton Inference Server
Multi-model serving with dynamic batching and ensemble pipelines
vLLM + TGI
LLM serving with paged attention, continuous batching, speculative decoding
ONNX Runtime
Cross-target portable inference — desktop, mobile, web execution providers
PyTorch 2.x + torch.compile + TorchAO
First-pass profiling, QAT, and the optimisation baseline
OpenVINO
Intel CPU and integrated graphics on edge industrial gear
Qualcomm AIMET + SNPE
Snapdragon mobile and automotive inference
Apple Core ML + coremltools
iOS on-device CV and audio with Neural Engine acceleration
Google AI Edge / LiteRT + MediaPipe
Android on-device pipelines and TFLite Micro
Hailo SDK
Hailo-8 and Hailo-15 NPUs in edge cameras and embedded boards
AWS Neuron SDK
Inferentia-2 and Trainium for high-volume cloud inference
Techniques in plain engineer-speak
No marketing. Written like an engineering README.
Post-training quantisation (PTQ)
INT8, INT4 and FP4 schemes. Weight-only and activation-aware. GPTQ for weight-only LLM quantisation, SmoothQuant when activations are quantised as well. Per-channel quantisation and outlier rescue for accuracy-sensitive layers. Usually the first thing we try; sometimes the only thing needed.
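A minimal sketch of why per-channel scales matter, assuming symmetric INT8 weight quantisation on a toy two-channel matrix. The weights and helper names are made up for illustration; this is not output from TensorRT or any other toolchain named on this page.

```python
def quantise_int8(values, scale):
    """Symmetric INT8: round to the nearest step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantise(q_values, scale):
    """Map INT8 codes back to floating point."""
    return [q * scale for q in q_values]


def max_abs_error(rows, scales):
    """Worst-case round-trip error; scales[i] is the scale applied to rows[i]."""
    worst = 0.0
    for row, scale in zip(rows, scales):
        restored = dequantise(quantise_int8(row, scale), scale)
        worst = max(worst, max(abs(a - b) for a, b in zip(row, restored)))
    return worst


# Hypothetical weights: one output channel per row.
# The second channel is about 50x larger in magnitude than the first.
weights = [
    [0.02, -0.01, 0.03, -0.04],
    [1.50, -2.00, 0.75, 1.25],
]

# Per-tensor: a single scale taken from the global max magnitude.
tensor_scale = max(abs(v) for row in weights for v in row) / 127

# Per-channel: one scale per row (assumes no all-zero row).
channel_scales = [max(abs(v) for v in row) / 127 for row in weights]

# Compare the error on the small-magnitude channel under each scheme.
per_tensor_err = max_abs_error(weights[:1], [tensor_scale])
per_channel_err = max_abs_error(weights[:1], channel_scales[:1])
print(f"per-tensor error:  {per_tensor_err:.6f}")
print(f"per-channel error: {per_channel_err:.6f}")
```

With one shared scale, the large channel dictates the step size and the small channel loses most of its resolution. A per-channel scale keeps the round-trip error within half a quantisation step of that channel's own range.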
Quantisation-aware training (QAT)
When PTQ drops accuracy beyond the floor, we retrain with fake-quant ops in the graph. Recovers most of the gap at the cost of a training cycle on your data. We do this on-site or air-gapped when IP requires it.
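The fake-quant idea can be sketched in a few lines, assuming a symmetric INT8 grid and a straight-through estimator (STE) for the backward pass. The single-weight training loop, learning rate and target value are hypothetical and only show the mechanics; a real QAT run would use the framework's own fake-quant modules.

```python
def fake_quant(x: float, scale: float, qmax: int = 127) -> float:
    """Forward pass of a fake-quant op: quantise, then immediately dequantise."""
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


def ste_grad(x: float, scale: float, upstream: float, qmax: int = 127) -> float:
    """Straight-through estimator: pass the gradient inside the clip range, zero it outside."""
    return upstream if abs(x) <= qmax * scale else 0.0


def train_one_weight(target: float, scale: float = 0.1, lr: float = 0.1, steps: int = 50):
    """Fit one latent FP weight so that its quantised value approximates `target`.

    Loss is (fake_quant(w) - target)^2 with a constant input of 1.0.
    """
    w = 0.0  # latent full-precision weight that the optimiser updates
    for _ in range(steps):
        err = fake_quant(w, scale) - target
        # d(loss)/d(output) = 2 * err; the STE treats d(output)/dw as 1.
        w -= lr * ste_grad(w, scale, 2.0 * err)
    return w, fake_quant(w, scale)


latent, quantised = train_one_weight(target=0.73)
print(f"latent weight {latent:.3f} -> quantised weight {quantised:.1f}")
```

The optimiser updates the latent weight while the loss only ever sees the quantised value, so training adapts to the rounding error instead of discovering it after export. The quantised weight ends within one step of the target and can hover between two adjacent grid points, a known side effect of QAT.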
Structured and unstructured pruning
Magnitude pruning, movement pruning, attention-head pruning. Structured pruning preserves the dense kernel path; unstructured needs sparse kernels to realise speedup. We choose per target hardware.
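To make the structured-versus-unstructured distinction concrete, here is a small sketch on a hypothetical 4×4 weight matrix. Both helpers are illustrative names: unstructured magnitude pruning zeroes individual weights and keeps the shape, while structured pruning drops whole output channels and leaves a smaller dense matrix.

```python
def prune_unstructured(weights, sparsity):
    """Zero the smallest-magnitude weights; the matrix shape is unchanged.

    Ties at the threshold are all pruned, so the result can exceed `sparsity`.
    """
    magnitudes = sorted(abs(v) for row in weights for v in row)
    k = int(len(magnitudes) * sparsity)  # number of weights to drop
    if k == 0:
        return [list(row) for row in weights]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in weights]


def prune_structured(weights, keep_rows):
    """Keep the `keep_rows` output channels (rows) with the largest L1 norm."""
    ranked = sorted(
        range(len(weights)),
        key=lambda i: sum(abs(v) for v in weights[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep_rows])  # preserve the original channel order
    return [list(weights[i]) for i in kept]


# Hypothetical weights, one output channel per row.
W = [
    [0.90, -0.02, 0.70, 0.01],
    [0.05, 0.30, -0.04, 0.03],
    [-0.50, 0.06, 0.45, -0.35],
    [0.07, -0.60, 0.08, 0.20],
]

sparse = prune_unstructured(W, sparsity=0.5)
dense = prune_structured(W, keep_rows=2)

zeros = sum(v == 0.0 for row in sparse for v in row)
print(f"unstructured: {len(sparse)}x{len(sparse[0])} with {zeros} zeros")
print(f"structured:   {len(dense)}x{len(dense[0])} dense")
```

The unstructured result is still a 4×4 multiply unless the runtime has sparse kernels that can skip the scattered zeros. The structured result is a 2×4 dense multiply that any standard GEMM path runs faster with no special support.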
Knowledge distillation
Teacher–student pipelines on your training data. Particularly effective for narrow domain models where the student can be 10–20× smaller without measurable accuracy loss on the slices that matter.
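A teacher–student objective is commonly written as a temperature-softened KL term blended with the usual cross-entropy on hard labels. The sketch below follows that standard formulation; the logits, temperature and `alpha` weighting are assumptions chosen for illustration, not settings taken from any engagement described here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy on the hard label."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    cross_entropy = -math.log(softmax(student_logits)[label])
    # T^2 keeps the soft-target gradient magnitude comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * cross_entropy


# Hypothetical logits for a 3-class problem; the true class is index 0.
teacher = [4.0, 1.0, 0.2]
student_close = [3.5, 1.2, 0.1]
student_far = [0.2, 3.0, 1.0]

loss_close = distillation_loss(student_close, teacher, label=0)
loss_far = distillation_loss(student_far, teacher, label=0)
print(f"close student: {loss_close:.3f}, far student: {loss_far:.3f}")
```

The soft targets carry the teacher's relative confidence across wrong classes, which is the extra signal a much smaller student learns from on a narrow domain.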
Speculative decoding + KV-cache quantisation
For LLM serving: a small draft model proposes tokens, the larger model verifies them. Quantising the KV cache to INT8 roughly halves its footprint relative to FP16, which reclaims memory for longer context windows. FlashAttention-3 integration on Hopper-class GPUs removes attention as the bottleneck.
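The propose-and-verify loop can be sketched with two toy "models", assuming greedy decoding on both sides. The `draft` and `target` functions below are hypothetical stand-ins that map a token prefix to the next token. A serving engine such as vLLM implements this very differently, with batched verification and sampling-aware acceptance.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int = 4) -> List[int]:
    """One round: the draft proposes k tokens, the target keeps the agreeing run
    and then contributes one token of its own."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    accepted: List[int] = []
    for token in proposal:
        # In a real engine these checks are one batched forward pass of the target.
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # the target's correction replaces the bad guess
            return accepted
        accepted.append(token)
    accepted.append(target(prefix + accepted))  # bonus token when all k guesses match
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken, max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Decode max_new tokens; also return how many target rounds were needed."""
    out, rounds = list(prefix), 0
    while len(out) - len(prefix) < max_new:
        out += speculative_step(out, draft, target, k)
        rounds += 1
    return out[: len(prefix) + max_new], rounds


def target(seq: List[int]) -> int:
    """Toy target model: count upwards modulo 10."""
    return (seq[-1] + 1) % 10


def draft(seq: List[int]) -> int:
    """Toy draft model: agrees with the target except after a 7."""
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10


tokens, rounds = generate([0], draft, target, max_new=12)
print(tokens, "in", rounds, "target rounds")
```

With greedy decoding the output is identical to what the target model alone would produce. The gain is that each target round can emit several tokens instead of one, so the number of expensive passes falls as the draft's hit rate rises.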
Custom CUDA / Triton kernels
When the stack ships nothing fast enough. We have published kernel work for fused attention variants and quantised matmul; engagement engineers carry the contribution history publicly on GitHub.
Objections answered with evidence
Three things every ML lead asks. Three production references.
Engineers can do it themselves
Anonymised UAE defence integrator: 30 FPS hard deadline.
Real-time vision pipeline on Jetson AGX hitting 30 FPS at full resolution. Custom Triton kernels for fused attention. The customer engineering team had spent eight weeks on TensorRT and was at 22 FPS; the engagement closed at 31.4 FPS in four weeks.
Accuracy will tank
Anonymised GCC super-app: on-device document OCR.
Model size under 25 MB shipping in production iOS and Android. Accuracy on the customer evaluation set unchanged within 0.3 percentage points. PTQ with selective FP16 layers for the attention head; QAT was not needed.
Data has IP
Air-gapped engagement option.
Lead engineer flies on-site. All training and benchmarking on the customer hardware. No data egress. Used on three defence-integrator engagements and one regulated-bank engagement in the last 18 months.
The hardware lab
Real silicon. Not simulators.
Benchmarks are run on the same boards your team will ship on. We are honest about what we do not yet have: ask us about a specific board and we will tell you whether it is in-house, on the procurement list, or on loan.
- Jetson Orin Nano
- Jetson Orin NX
- Jetson AGX Orin
- Hailo-8
- Coral Edge TPU
- Snapdragon X dev kit
- NVIDIA A100
- NVIDIA H100
- NVIDIA L40S
- AWS Inferentia-2
- Apple M-series
- Intel Meteor Lake CPU
How we compare
Hyperscaler optimisation services, in-house engineers, research labs.
Hyperscaler services optimise for their hardware. In-house sprints reach a depth ceiling. Research labs publish but do not ship. We are hardware-agnostic, kernel-level capable, and contractually accountable for a published Pareto.
| Capability | Brocode | AWS SageMaker Neo | Azure ML optimisation | GCP Vertex | In-house TensorRT sprint |
|---|---|---|---|---|---|
| Hardware-agnostic | Jetson, Snapdragon, Hailo, Apple, Intel, GPU, Inferentia | Inferentia / Trainium | Azure VMs / GPUs | TPU / Vertex | Whatever the customer owns |
| Ships optimised model as portable artefacts (ONNX, TensorRT plan, Core ML, LiteRT) | Customer owns and can rebuild | SageMaker-bound | Azure ML-bound | Vertex-bound | Code, not benchmark |
| Reproducible benchmark harness on GitHub | The result is verifiable forever, not a one-off claim. | | | | |
| Kernel-level work | Custom CUDA, Triton kernels, FlashAttention-3 | Sometimes | | | |
| Air-gapped engagement option | Engineer flies on-site, no data egress | Limited | |||
| Named engineers with public PyTorch / vLLM / ONNX Runtime PRs | |||||
| Latency-accuracy Pareto contract in SoW | Accuracy floor + p50/p95/p99 ceilings | Vague | Vague | Vague | "It will be faster" |
Free download
The Edge & Inference Optimisation Playbook
Thirty-eight pages. Eleven production case studies. The decision tree. The reproducible benchmark harness on GitHub for YOLOv8 / Jetson, Llama-3-8B / L40S, and a mobile CV model on Android and iOS. Median latency reduction 4.2×, median accuracy delta −0.4 percentage points.
- Pareto frontier methodology — accuracy floor, latency ceiling, memory budget
- Quantisation: INT8 / INT4 / FP4, when to use which
- Distillation playbook with teacher–student pipelines
- Hardware targets: Jetson, Snapdragon, Apple, Intel, GPU, Inferentia
- Three GitHub reproducer repos with CI workflows
- When NOT to optimise — the cases we turn away
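The Pareto-frontier idea in the first bullet can be sketched as a dominance filter over candidate builds, followed by a pick under an accuracy floor and a latency ceiling. Only the FP32 baseline and INT8 PTQ rows reuse numbers from this page. The other candidates are hypothetical, and the memory budget is left out to keep the sketch two-dimensional.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    latency_ms: float  # lower is better
    accuracy: float    # higher is better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate matches or beats on both axes."""
    frontier = []
    for c in candidates:
        dominated = any(
            o.latency_ms <= c.latency_ms
            and o.accuracy >= c.accuracy
            and (o.latency_ms < c.latency_ms or o.accuracy > c.accuracy)
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.latency_ms)


def pick(candidates: List[Candidate], accuracy_floor: float, latency_ceiling_ms: float) -> Optional[Candidate]:
    """Fastest frontier point that satisfies both constraints, or None."""
    feasible = [
        c for c in pareto_frontier(candidates)
        if c.accuracy >= accuracy_floor and c.latency_ms <= latency_ceiling_ms
    ]
    return feasible[0] if feasible else None


candidates = [
    Candidate("FP32 baseline", 84.2, 0.683),      # from the sample report on this page
    Candidate("INT8 PTQ", 18.9, 0.677),           # from the sample report on this page
    Candidate("FP16", 41.0, 0.683),               # hypothetical
    Candidate("INT8, no calibration", 21.0, 0.640),  # hypothetical
    Candidate("INT4 weight-only", 14.0, 0.610),   # hypothetical
]

frontier = pareto_frontier(candidates)
choice = pick(candidates, accuracy_floor=0.67, latency_ceiling_ms=25.0)
print([c.name for c in frontier])
print("chosen:", choice.name if choice else None)
```

Setting the floor and ceiling before any optimisation work starts is what turns the frontier into a decision: points outside the box are discarded no matter how good they look on a single axis.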
Frequently asked
What ML leads actually want to know.
We will measure first. Every engagement opens with a Pareto contract: accuracy floor, p50/p95/p99 latency ceilings, memory budget. We then run PTQ on your evaluation harness — your dataset, not COCO or MMLU — and report the actual accuracy delta. If PTQ misses the floor we move to QAT, distillation, or partial-quant strategies. Median accuracy delta across our published case studies is −0.4 percentage points on the client dataset.
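One way to make such a contract checkable is to encode the floor, ceilings and budget as data and test each benchmark run against them. The sketch below is an assumption about how that could look: `ParetoContract`, `check`, the thresholds and the synthetic latencies are all hypothetical, not the actual SoW format.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List


@dataclass(frozen=True)
class ParetoContract:
    accuracy_floor: float
    p50_ms: float
    p95_ms: float
    p99_ms: float
    memory_mb: float


def percentile(samples: List[float], pct: int) -> float:
    """pct-th percentile using the 99 cut points from statistics.quantiles."""
    return quantiles(samples, n=100, method="inclusive")[pct - 1]


def check(contract: ParetoContract, latencies_ms: List[float], accuracy: float, memory_mb: float) -> List[str]:
    """Return human-readable violations; an empty list means the run passes."""
    measured = {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }
    violations = [
        f"{name} {value:.1f} exceeds ceiling {getattr(contract, name):.1f}"
        for name, value in measured.items()
        if value > getattr(contract, name)
    ]
    if accuracy < contract.accuracy_floor:
        violations.append(f"accuracy {accuracy:.3f} below floor {contract.accuracy_floor:.3f}")
    if memory_mb > contract.memory_mb:
        violations.append(f"memory {memory_mb:.0f} MB exceeds budget {contract.memory_mb:.0f} MB")
    return violations


# Hypothetical contract and synthetic latency samples between 18.0 and 22.95 ms.
contract = ParetoContract(accuracy_floor=0.67, p50_ms=21.0, p95_ms=23.0, p99_ms=25.0, memory_mb=64.0)
latencies = [18.0 + 0.05 * i for i in range(100)]

passing = check(contract, latencies, accuracy=0.677, memory_mb=41.0)
failing = check(contract, latencies, accuracy=0.620, memory_mb=41.0)
print("pass" if not passing else passing)
print(failing)
```

Checking p95 and p99 alongside the median matters because a build can improve typical latency while tail latency regresses, and it is usually the tail that breaks an SLA.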
Talk to the principal ML engineer
A 45-minute technical review. No sales people on the call.
Bring the model, the target hardware, the bottleneck, and the SLA. We will tell you what is achievable with PTQ alone, what needs QAT or distillation, and what is genuinely impossible without a redesign. If your team is already at the level needed, we will say so on the call and pass on the engagement.
Continue exploring
Related capabilities and stories
MLOps & AI Infrastructure
The optimised model still needs serving, monitoring, retraining.
Computer Vision
Most edge / embedded optimisation work originates here.
Generative AI Development
LLM inference optimisation cross-feeds the GenAI page.
Self-Hosted LLM Infrastructure
Clients who care about inference cost and data sensitivity also self-host.
Manufacturing
The dominant industry for edge inference and defect detection.