Your model, optimized to the limit of NVIDIA Blackwell.
Mach optimizes LLM inference with the full state-of-the-art stack — NVFP4 quantization, custom-trained speculative-decode drafters, and tuned stock NVIDIA TRT-LLM — delivered as a drop-in config for your B200s. Every number on this page is measured on hardware, never projected.

MACH 1
Two purpose-built recipes for the most-deployed Llama models. Measured on real B200 hardware, drop-in via stock NVIDIA TRT-LLM.
MACH 1 — Llama 3.1-8B-Instruct
meta-llama/Llama-3.1-8B-Instruct · 8B decoder-only
Quality preservation · Output-correctness audit vs no-spec baseline: 100% coherent, semantic equivalence preserved (greedy, temperature 0). Full quality receipts will be published with the next release.
MACH 1 — Llama 3.3-70B-Instruct
meta-llama/Llama-3.3-70B-Instruct · 70B decoder-only
Quality preservation · HumanEval −1.83pp / MMLU −0.07pp / GSM8K −0.76pp vs un-quantized BF16 baseline (TP=1, n=164/1500/1319). Within statistical noise.
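One way to read "within statistical noise" is to convert each reported delta into a count of benchmark items and compare it with the sampling margin of a single score. The sketch below does that using only the deltas and sample sizes quoted above. It assumes independent items and a worst-case accuracy of 0.5, which overstates the margin for high-scoring models; it is an illustration, not the audit procedure behind these receipts.

```python
import math

# (benchmark, reported delta in percentage points, number of items), from the line above
RESULTS = [
    ("HumanEval", -1.83, 164),
    ("MMLU", -0.07, 1500),
    ("GSM8K", -0.76, 1319),
]

def items_changed(delta_pp: float, n: int) -> int:
    """Number of benchmark items that a percentage-point delta corresponds to."""
    return round(abs(delta_pp) / 100 * n)

def worst_case_margin_pp(n: int, z: float = 1.96) -> float:
    """95% margin of error (in pp) for one accuracy score, assuming p = 0.5 (assumption)."""
    return z * math.sqrt(0.25 / n) * 100

for name, delta, n in RESULTS:
    print(f"{name}: {items_changed(delta, n)} items, "
          f"worst-case margin ±{worst_case_margin_pp(n):.2f} pp")
```

Under these assumptions the HumanEval delta corresponds to 3 of 164 problems, MMLU to 1 of 1500 questions, and GSM8K to 10 of 1319. A paired comparison on the same items would give a tighter test than this per-score margin.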
State of the art is a stack, not a trick.
Fast inference on Blackwell isn't one magic setting. It's four layers, each engineered and measured independently, compounding into throughput your hardware was actually built for.
NVFP4 quantization
Blackwell-native 4-bit floating point. About half the weight footprint of FP8 and roughly a quarter of BF16, so every decode step streams far fewer bytes from memory: the single biggest lever on a memory-bound decoder.
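The footprint claim can be sanity-checked with simple arithmetic. The sketch below assumes NVFP4 stores 4-bit values plus one 8-bit scale per 16-value block (about 4.5 bits per weight), uses nominal parameter counts of 8B and 70B, and treats every weight as quantized. Real checkpoints usually keep some layers in higher precision, so actual sizes will differ.

```python
# Assumed storage cost per weight, in bits. The NVFP4 figure assumes 4-bit values
# plus one 8-bit scale per 16-value block; per-tensor scales are ignored as negligible.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16,
}

def weight_gb(params_billions: float, fmt: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given format."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8

# Nominal parameter counts (assumption): 8B and 70B.
for model, params in [("Llama 3.1-8B", 8.0), ("Llama 3.3-70B", 70.0)]:
    sizes = ", ".join(f"{fmt} {weight_gb(params, fmt):.1f} GB" for fmt in BITS_PER_WEIGHT)
    print(f"{model}: {sizes}")
```

Because batch-1 decoding has to stream the weights on every step, fewer bytes per weight translates almost directly into more steps per second for the same memory bandwidth.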
Receipts · Llama 3.3-70B: −1.83pp HumanEval · −0.07pp MMLU · −0.76pp GSM8K vs un-quantized BF16. Within statistical noise.
Speculative decoding, done right
Custom drafter models trained per target — on the target's own outputs, not generic corpus text. That's the difference between acceptance rates that hold in production and drafters that look good in a README and stall on real traffic.
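To show why acceptance rate matters, here is a minimal sketch of textbook greedy speculative decoding with toy stand-in models. It is not Mach's drafter or the TRT-LLM implementation: `target`, `drafter`, and `speculative_generate` are hypothetical names, and the target's single batched verification pass is simulated by per-position calls that are counted as one pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next greedy token

def speculative_generate(target: NextToken, drafter: NextToken, prompt: List[int],
                         new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    out = list(prompt)
    passes = 0
    goal = len(prompt) + new_tokens
    while len(out) < goal:
        # 1. The cheap drafter proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2. One target pass checks every draft position (simulated per position here).
        passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, stop
                break
            accepted.append(draft[i])
        else:
            accepted.append(target(out + draft))  # all k accepted: bonus target token
        out.extend(accepted)
    return out[:goal], passes

def greedy_generate(target: NextToken, prompt: List[int], new_tokens: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(new_tokens):
        out.append(target(out))
    return out

# Toy models (illustrative only): the drafter agrees with the target
# except when the last token is 2.
def target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 7

def drafter(prefix: List[int]) -> int:
    guess = (prefix[-1] * 3 + 1) % 7
    return guess if prefix[-1] != 2 else (guess + 1) % 7

spec_out, spec_passes = speculative_generate(target, drafter, [1], new_tokens=12)
base_out = greedy_generate(target, [1], new_tokens=12)
print(spec_out == base_out, spec_passes)
```

With greedy decoding the speculative output matches the baseline token for token; the only thing the drafter changes is how many target passes are needed. Every rejected draft wastes the drafter's work for that round, which is why a drafter that disagrees with the target on real traffic yields little speedup.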
Receipts · Drafters trained, benchmarked, and shipped per model. Acceptance measured before anything is claimed.
Stock TRT-LLM, tuned to the metal
No fork, no patched runtime, no out-of-tree kernels. CUDA graphs, overlap scheduling, KV-cache and batching configuration tuned per model and per GPU topology — shipped as a single YAML your stack already understands.
Receipts · Drop-in on stock NVIDIA TRT-LLM. Your serving infrastructure doesn't change.
Measured, never projected
Every recipe is validated cross-machine on independent B200s before a number appears on this page. Quality receipts ship with every release. If we haven't measured it, we don't say it.
Receipts · 43,509–69,461 tok/s aggregate (Llama 8B, 1× B200, cross-machine) · 999 tok/s single-stream sustained (Llama 70B, 2× B200).
Methodology · Aggregate: NVIDIA trtllm-bench, ISL/OSL = 128/128, warm. Single-stream: 50 real prompts, batch 1, temperature 0, warm median; readings above the memory-bandwidth plausibility ceiling are discarded as measurement artifacts.
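The single-stream filter described above can be sketched as follows. The function names, the warm-up handling, and the ceiling formula are assumptions for illustration: the ceiling assumes each decode pass must stream all weights once, scaled by the tokens produced per pass (more than one with speculative decoding). The numbers in the usage lines are made up and are not measurements.

```python
from statistics import median
from typing import List

def bandwidth_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float,
                            tokens_per_pass: float = 1.0) -> float:
    """Upper bound on batch-1 decode speed if every pass streams all weights once."""
    return bandwidth_gb_s / weight_gb * tokens_per_pass

def warm_median(readings_tok_s: List[float], ceiling_tok_s: float,
                warmup: int = 1) -> float:
    """Median of post-warm-up readings, discarding any above the plausibility ceiling."""
    warm = readings_tok_s[warmup:]
    plausible = [r for r in warm if r <= ceiling_tok_s]
    if not plausible:
        raise ValueError("no plausible warm readings")
    return median(plausible)

# Illustrative numbers only: 10 GB of weights on a 1000 GB/s device.
ceiling = bandwidth_ceiling_tok_s(weight_gb=10.0, bandwidth_gb_s=1000.0)
result = warm_median([20.0, 80.0, 90.0, 85.0, 400.0], ceiling)
print(ceiling, result)
```

Here the first reading is dropped as warm-up and the 400 tok/s reading is discarded because it exceeds the 100 tok/s ceiling, leaving a median of 85 tok/s.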
Honesty policy · We publish measured numbers only — no projections, no extrapolations, no cherry-picked peaks presented as sustained throughput.
Engineering, not magic.
Months of inference-performance research on NVIDIA Blackwell B200, distilled into a single configuration file per model.
MACH recipes are purpose-built per model architecture for NVIDIA Blackwell B200. Drop-in deployment via stock NVIDIA TRT-LLM. No model retraining, no inference-stack changes, no custom kernels. Full configurations delivered under NDA upon licensing agreement.
No fork. No patched runtime. No out-of-tree kernels. Your inference stack stays exactly where it is — MACH ships as a configuration file.
Each recipe is purpose-built for a specific decoder-only transformer. Llama today; Mistral, Qwen, DeepSeek, and Llama 4 are on the roadmap.
One config file. Point TRT-LLM at it. No re-training, no re-quantization, no re-engineering your serving infrastructure.
Llama 3.3-70B: −1.83pp HumanEval, −0.07pp MMLU, −0.76pp GSM8K vs un-quantized BF16 baseline. 8B receipts will be published with the next release.
Performance terms — including any throughput floor, the benchmark protocol, hardware/precision/model variant, and refund mechanics — are negotiated and defined in the licensing agreement. No guarantees are made or implied on this page; all marketing numbers are reproducible benchmarks measured on the configurations cited.
What ships next.
Roadmap projections are based on memory-bandwidth scaling and architecture similarity and have not yet been measured, so none are shown here. Numbers go on this page only after we measure them.
| Release | Model | Status | Availability | Notes |
|---|---|---|---|---|
| MACH 1 | Llama 3.1-8B-Instruct + Llama 3.3-70B-Instruct | Available | Available now | 43.5K–69.5K tok/s aggregate (8B, 1× B200) · 999 tok/s single-stream (70B, 2× B200) · measured |
| MACH 2 | Llama 3.1 405B | Roadmap | Q1–Q2 2026 | Projection publishes after first measured run |
| MACH 3 | Mistral Large 2 | Roadmap | Q2 2026 | Projection publishes after first measured run |
| MACH 4 | Qwen 2.5 72B | Roadmap | Q2 2026 | Projection publishes after first measured run |
| MACH 5 | DeepSeek V3 | Roadmap | Q3 2026 | Projection publishes after first measured run |
| MACH 6 | Llama 4 | Roadmap | Day-1 on release | Projection TBD |
| MACH ∞ | Custom model engagements | Open | On request | Your model, your Blackwell hardware: full-stack optimization with a contract-defined throughput floor and refund mechanics |
Supported hardware.
Single-B200 and dual-B200 NVLink configurations are supported today. The table lists which model is validated on each; the remaining pairings are in active validation. PCIe-only multi-GPU configurations are not supported.
| Deployment | Model | Status |
|---|---|---|
| 1× B200 | MACH 1 — Llama 3.1-8B-Instruct | Validated |
| 2× B200 with NVLink | MACH 1 — Llama 3.3-70B-Instruct | Validated |
| 1× B200 | MACH 1 — Llama 3.3-70B-Instruct | Validation in progress |
| 2× B200 with NVLink | MACH 1 — Llama 3.1-8B-Instruct | Validation in progress |
Talk to us.
MACH 1 is delivered under NDA. Tell us what you're running and we'll send a recipe-fit assessment within one business day.
