Your model,
optimized to the limit
of NVIDIA Blackwell.

Mach optimizes LLM inference with the full state-of-the-art stack — NVFP4 quantization, custom-trained speculative-decode drafters, and tuned stock NVIDIA TRT-LLM — delivered as a drop-in config for your B200s. Every number on this page is measured on hardware, never projected.

43,509–69,461

Llama 3.1-8B · aggregate throughput

tok/s · 1× B200 · cross-machine validated · trtllm-bench · 128/128

19.2×

Single-user speedup · speculative decoding

Llama 8B · warm median vs no-spec · under 3% cross-run variance

Current product

MACH 1

Two purpose-built recipes for the most-deployed Llama models. Measured on real B200 hardware, drop-in via stock NVIDIA TRT-LLM.

NVFP4 (nvidia/Llama-3.1-8B-Instruct-NVFP4)

MACH 1 — Llama 3.1-8B-Instruct

meta-llama/Llama-3.1-8B-Instruct · 8B decoder-only

Available

43.5K–69.5K

tok/s · 1× B200 · cross-machine · 128/128

Aggregate throughput

1,570

tok/s · warm median · 19.2× vs no-spec · max 1,987

Single-stream, artifact-gated

Quality preservation · Output-correctness audit vs no-spec baseline: 100% coherent, semantic equivalence preserved (greedy, temperature 0). Full quality receipts publishing with the next release.

Validated cross-machine on two independent B200 pods, two datacenters, two driver versions

NVFP4 (nvidia/Llama-3.3-70B-Instruct-FP4)

MACH 1 — Llama 3.3-70B-Instruct

meta-llama/Llama-3.3-70B-Instruct · 70B decoder-only

Available

999

tok/s · 2× B200 TP2 · warm median · peak 1,265

Single-stream sustained

17,528

tok/s output · 2× B200 · 9.3 tok/s per watt

Aggregate @ 2,000 users

Quality preservation · HumanEval −1.83pp / MMLU −0.07pp / GSM8K −0.76pp vs un-quantized BF16 baseline (TP=1, n=164/1500/1319). Within statistical noise.

Locked runs on 2× B200 with NVLink (TP=2), warm, 50 real prompts, 4-run protocol

What's measured · Aggregate numbers are measured with NVIDIA's trtllm-bench at ISL/OSL = 128/128, warm. Single-stream numbers are measured on 50 real prompts, batch 1, temperature 0, median of warm runs — and artifact-gated: any per-prompt reading above the memory-bandwidth plausibility ceiling is discarded rather than marketed. Long-context throughput (1K/4K/8K/32K), TTFT curves, and full 8B quality receipts are publishing with the next release. We publish numbers only after we've measured them — no projections, no extrapolations.

The stack

State of the art is a stack, not a trick.

Fast inference on Blackwell isn't one magic setting. It's four layers, each engineered and measured independently, compounding into throughput your hardware was actually built for.

NVFP4 quantization

Blackwell-native 4-bit floating point. Half the memory footprint, roughly double the effective bandwidth headroom — the single biggest lever on a memory-bound decoder.

Receipts · Llama 3.3-70B: −1.83pp HumanEval · −0.07pp MMLU · −0.76pp GSM8K vs un-quantized BF16. Within statistical noise.

Speculative decoding, done right

Custom drafter models trained per target — on the target's own outputs, not generic corpus text. That's the difference between acceptance rates that hold in production and drafters that look good in a README and stall on real traffic.

Receipts · Drafters trained, benchmarked, and shipped per model. Acceptance measured before anything is claimed.

Stock TRT-LLM, tuned to the metal

No fork, no patched runtime, no out-of-tree kernels. CUDA graphs, overlap scheduling, KV-cache and batching configuration tuned per model and per GPU topology — shipped as a single YAML your stack already understands.

Receipts · Drop-in on stock NVIDIA TRT-LLM. Your serving infrastructure doesn't change.

Measured, never projected

Every recipe is validated cross-machine on independent B200s before a number appears on this page. Quality receipts ship with every release. If we haven't measured it, we don't say it.

Receipts · 43,509–69,461 tok/s aggregate (Llama 8B, 1× B200, cross-machine) · 999 tok/s single-stream sustained (Llama 70B, 2× B200).

Methodology · Aggregate: NVIDIA trtllm-bench, ISL/OSL = 128/128, warm. Single-stream: 50 real prompts, batch 1, temperature 0, warm median; readings above the memory-bandwidth plausibility ceiling are discarded as measurement artifacts.

Honesty policy · We publish measured numbers only — no projections, no extrapolations, no cherry-picked peaks presented as sustained throughput.

The recipe

Engineering, not magic.

Months of inference-performance research on NVIDIA Blackwell B200, distilled into a single configuration file per model.

MACH recipes are purpose-built per model architecture for NVIDIA Blackwell B200. Drop-in deployment via stock NVIDIA TRT-LLM. No model retraining, no inference-stack changes, no custom kernels. Full configurations delivered under NDA upon licensing agreement.

Stock TRT-LLM

No fork. No patched runtime. No out-of-tree kernels. Your inference stack stays exactly where it is — MACH ships as a configuration file.

Llama-architecture native

Recipes are purpose-built per decoder-only transformer. Llama today; Mistral, Qwen, DeepSeek, Llama 4 on the roadmap.

Drop-in deployment

One config file. Point TRT-LLM at it. No re-training, no re-quantization, no re-engineering your serving infrastructure.

Quality preserved

Llama 3.3-70B: −1.83pp HumanEval, −0.07pp MMLU, −0.76pp GSM8K vs un-quantized BF16 baseline. 8B receipts publishing with next release.

Performance terms — including any throughput floor, the benchmark protocol, hardware/precision/model variant, and refund mechanics — are negotiated and defined in the licensing agreement. No guarantees are made or implied on this page; all marketing numbers are reproducible benchmarks measured on the configurations cited.

Roadmap

What ships next.

Projections are based on memory-bandwidth scaling and architecture similarity — not yet measured. Numbers go on this page only after we measure them.

MACH 1Llama 3.1-8B-Instruct + Llama 3.3-70B-Instruct
Available
Available now·43.5K–69.5K tok/s aggregate (8B, 1× B200) · 999 tok/s single-stream (70B, 2× B200) · measured
MACH 2Llama 3.1 405B
Roadmap
Q1–Q2 2026·Projection publishes after first measured run
MACH 3Mistral Large 2
Roadmap
Q2 2026·Projection publishes after first measured run
MACH 4Qwen 2.5 72B
Roadmap
Q2 2026·Projection publishes after first measured run
MACH 5DeepSeek V3
Roadmap
Q3 2026·Projection publishes after first measured run
MACH 6Llama 4
Roadmap
Day-1 on release·Projection TBD
MACH ∞Custom model engagements
Open
On request·Your model, your Blackwell hardware — full-stack optimization with a contract-defined throughput floor and refund mechanics

Deployment matrix

Supported hardware.

Single B200 today. Dual-B200 NVLink configurations are in active validation. PCIe-only multi-GPU configurations are not supported.

Deployment	Model	Status
1× B200	MACH 1 — Llama 3.1-8B-Instruct	Validated
2× B200 with NVLink	MACH 1 — Llama 3.3-70B-Instruct	Validated
1× B200	MACH 1 — Llama 3.3-70B-Instruct	Validation in progress
2× B200 with NVLink	MACH 1 — Llama 3.1-8B-Instruct	Validation in progress

Hardware caveat · Deployment is supported on systems with proper NVLink interconnect (HGX B200 or DGX B200). PCIe-only 2-GPU configurations are not supported — memory-bandwidth scaling assumes NVLink-class interconnect between the GPUs.

Request access

Talk to us.

MACH 1 is delivered under NDA. Tell us what you're running and we'll send a recipe-fit assessment within one business day.

Your model,optimized to the limitof NVIDIA Blackwell.