Mach Inference

Your model,
optimized to the limit
of NVIDIA Blackwell.

Mach optimizes LLM inference with the full state-of-the-art stack — NVFP4 quantization, custom-trained speculative-decode drafters, and tuned stock NVIDIA TRT-LLM — delivered as a drop-in config for your B200s. Every number on this page is measured on hardware, never projected.

43,509–69,461
Llama 3.1-8B · aggregate throughput
tok/s · 1× B200 · cross-machine validated · trtllm-bench · 128/128
19.2×
Single-user speedup · speculative decoding
Llama 8B · warm median vs no-spec · under 3% cross-run variance
Current product

MACH 1

Two purpose-built recipes for the most-deployed Llama models. Measured on real B200 hardware, drop-in via stock NVIDIA TRT-LLM.

NVFP4 (nvidia/Llama-3.1-8B-Instruct-NVFP4)

MACH 1 — Llama 3.1-8B-Instruct

meta-llama/Llama-3.1-8B-Instruct · 8B decoder-only

Available
43.5K–69.5K
tok/s · 1× B200 · cross-machine · 128/128
Aggregate throughput
1,570
tok/s · warm median · 19.2× vs no-spec · max 1,987
Single-stream, artifact-gated

Quality preservation · Output-correctness audit vs no-spec baseline: 100% coherent, semantic equivalence preserved (greedy, temperature 0). Full quality receipts publishing with the next release.

Validated cross-machine on two independent B200 pods, two datacenters, two driver versions
NVFP4 (nvidia/Llama-3.3-70B-Instruct-FP4)

MACH 1 — Llama 3.3-70B-Instruct

meta-llama/Llama-3.3-70B-Instruct · 70B decoder-only

Available
999
tok/s · 2× B200 TP2 · warm median · peak 1,265
Single-stream sustained
17,528
tok/s output · 2× B200 · 9.3 tok/s per watt
Aggregate @ 2,000 users

Quality preservation · HumanEval −1.83pp / MMLU −0.07pp / GSM8K −0.76pp vs un-quantized BF16 baseline (TP=1, n=164/1500/1319). Within statistical noise.

Locked runs on 2× B200 with NVLink (TP=2), warm, 50 real prompts, 4-run protocol
What's measured · Aggregate numbers are measured with NVIDIA's trtllm-bench at ISL/OSL = 128/128, warm. Single-stream numbers are measured on 50 real prompts, batch 1, temperature 0, median of warm runs — and artifact-gated: any per-prompt reading above the memory-bandwidth plausibility ceiling is discarded rather than marketed. Long-context throughput (1K/4K/8K/32K), TTFT curves, and full 8B quality receipts are publishing with the next release. We publish numbers only after we've measured them — no projections, no extrapolations.
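The artifact gate described above can be sketched in a few lines. This is a minimal illustration, not Mach's actual harness: the function name `artifact_gated_median`, the sample readings, and the ceiling value are all hypothetical, and how the memory-bandwidth plausibility ceiling itself is derived is not specified on this page, so it is passed in as a parameter.

```python
from statistics import median


def artifact_gated_median(readings_tok_s, ceiling_tok_s):
    """Return (median of plausible readings, number of discarded readings).

    Any per-prompt reading above the ceiling is treated as a measurement
    artifact and dropped before the median is taken.
    """
    kept = [r for r in readings_tok_s if r <= ceiling_tok_s]
    discarded = len(readings_tok_s) - len(kept)
    if not kept:
        raise ValueError("every reading exceeded the plausibility ceiling")
    return median(kept), discarded


# Toy per-prompt readings (tok/s). These are made-up values, not measurements.
toy_readings = [100.0, 110.0, 105.0, 900.0, 95.0]
gated, dropped = artifact_gated_median(toy_readings, ceiling_tok_s=200.0)
print(f"gated median: {gated} tok/s, discarded readings: {dropped}")
```

Discarding rather than clipping keeps an implausible reading from pulling the reported figure upward at all, and reporting the discard count alongside the median makes the gate auditable.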
The stack

State of the art is a stack, not a trick.

Fast inference on Blackwell isn't one magic setting. It's four layers, each engineered and measured independently, compounding into throughput your hardware was actually built for.

01

NVFP4 quantization

Blackwell-native 4-bit floating point. Roughly half the weight footprint of an 8-bit format (about a quarter of BF16), so each decoded token moves about half as many weight bytes: the single biggest lever on a memory-bound decoder.

Receipts · Llama 3.3-70B: −1.83pp HumanEval · −0.07pp MMLU · −0.76pp GSM8K vs un-quantized BF16. Within statistical noise.
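To make the footprint claim concrete, here is a back-of-the-envelope sketch of weight memory at different precisions. It assumes round parameter counts (8e9 and 70e9), assumes every weight is quantized, and assumes the NVFP4 layout of 4-bit values plus one 8-bit scale per 16-value block (about 4.5 bits per weight). KV cache, activations, and any layers left in higher precision are ignored, so real footprints will differ.

```python
def weight_gib(params, bits_per_weight):
    """Approximate weight memory in GiB for a given parameter count and precision."""
    return params * bits_per_weight / 8 / 2**30


BF16_BITS = 16.0
FP8_BITS = 8.0
# Assumption: 4-bit values plus one 8-bit scale shared by each 16-value block.
NVFP4_BITS = 4.0 + 8.0 / 16

# Round parameter counts, used for illustration only.
for name, params in [("Llama 8B", 8e9), ("Llama 70B", 70e9)]:
    bf16 = weight_gib(params, BF16_BITS)
    fp8 = weight_gib(params, FP8_BITS)
    nvfp4 = weight_gib(params, NVFP4_BITS)
    print(f"{name}: BF16 {bf16:.1f} GiB, FP8 {fp8:.1f} GiB, NVFP4 {nvfp4:.1f} GiB")
```

Because batch-1 decoding streams the weights on every step, fewer weight bytes per token translates fairly directly into more tokens per second on a memory-bound decoder.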

02

Speculative decoding, done right

Custom drafter models trained per target — on the target's own outputs, not generic corpus text. That's the difference between acceptance rates that hold in production and drafters that look good in a README and stall on real traffic.

Receipts · Drafters trained, benchmarked, and shipped per model. Acceptance measured before anything is claimed.
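For readers unfamiliar with the mechanism, the following is a toy sketch of one speculative step under greedy decoding (temperature 0). It is not Mach's drafter or the TRT-LLM implementation: `verify_greedy` and `toy_target` are hypothetical names, and the target is called position by position for clarity, whereas a real runtime scores all draft positions in a single forward pass.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest draft prefix that matches the target's greedy choices.

    On the first mismatch, emit the target's own token instead and stop.
    If every draft token matches, the target contributes one extra token.
    """
    context = list(prefix)
    emitted: List[int] = []
    for proposed in draft:
        expected = target_next(context)
        if proposed != expected:
            emitted.append(expected)  # target's correction replaces the bad guess
            return emitted
        emitted.append(proposed)
        context.append(proposed)
    emitted.append(target_next(context))  # bonus token after full acceptance
    return emitted


def toy_target(context: List[int]) -> int:
    """Stand-in for a target model: the next token is (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


# The drafter guesses four tokens; the third guess is wrong.
step = verify_greedy(prefix=[1, 2, 3], draft=[4, 5, 9, 0], target_next=toy_target)
print(step)
```

Every emitted token is one the target would have chosen on its own, which is why greedy speculative decoding can preserve the target's output, and why the drafter's acceptance rate decides how many tokens each step yields.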

03

Stock TRT-LLM, tuned to the metal

No fork, no patched runtime, no out-of-tree kernels. CUDA graphs, overlap scheduling, KV-cache and batching configuration tuned per model and per GPU topology — shipped as a single YAML your stack already understands.

Receipts · Drop-in on stock NVIDIA TRT-LLM. Your serving infrastructure doesn't change.

04

Measured, never projected

Every recipe is validated cross-machine on independent B200s before a number appears on this page. Quality receipts ship with every release. If we haven't measured it, we don't say it.

Receipts · 43,509–69,461 tok/s aggregate (Llama 8B, 1× B200, cross-machine) · 999 tok/s single-stream sustained (Llama 70B, 2× B200).

Methodology · Aggregate: NVIDIA trtllm-bench, ISL/OSL = 128/128, warm. Single-stream: 50 real prompts, batch 1, temperature 0, warm median; readings above the memory-bandwidth plausibility ceiling are discarded as measurement artifacts.

Honesty policy · We publish measured numbers only — no projections, no extrapolations, no cherry-picked peaks presented as sustained throughput.

The recipe

Engineering, not magic.

Months of inference-performance research on NVIDIA Blackwell B200, distilled into a single configuration file per model.

MACH recipes are purpose-built per model architecture for NVIDIA Blackwell B200. Drop-in deployment via stock NVIDIA TRT-LLM. No model retraining, no inference-stack changes, no custom kernels. Full configurations delivered under NDA upon licensing agreement.
Stock TRT-LLM

No fork. No patched runtime. No out-of-tree kernels. Your inference stack stays exactly where it is — MACH ships as a configuration file.

Llama-architecture native

Recipes are purpose-built per decoder-only transformer. Llama today; Mistral, Qwen, DeepSeek, Llama 4 on the roadmap.

Drop-in deployment

One config file. Point TRT-LLM at it. No re-training, no re-quantization, no re-engineering your serving infrastructure.

Quality preserved

Llama 3.3-70B: −1.83pp HumanEval, −0.07pp MMLU, −0.76pp GSM8K vs un-quantized BF16 baseline. 8B receipts will be published with the next release.
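One rough way to read those deltas is to compare them with the sampling error of each benchmark at its stated sample size. The sketch below is an illustrative yardstick, not the statistical method used for the published receipts. It assumes the sample sizes quoted on this page (n = 164, 1500, 1319) map to HumanEval, MMLU, and GSM8K in that order, and it uses the worst-case binomial standard error at p = 0.5, because the baseline scores themselves are not stated here.

```python
from math import sqrt


def worst_case_se_pp(n):
    """Standard error of a pass rate, in percentage points, at p = 0.5 (widest case)."""
    return 100 * sqrt(0.25 / n)


# (sample size, reported delta in percentage points) as quoted on this page.
evals = {
    "HumanEval": (164, -1.83),
    "MMLU": (1500, -0.07),
    "GSM8K": (1319, -0.76),
}

for name, (n, delta_pp) in evals.items():
    se = worst_case_se_pp(n)
    items = abs(delta_pp) / 100 * n  # how many test items the delta corresponds to
    print(f"{name}: delta {delta_pp:+.2f}pp is about {items:.0f} item(s); worst-case SE {se:.2f}pp")
```

The worst-case figure overstates the error when accuracy is far from 50%, and a paired comparison on the same test items would be tighter, so treat this as an upper bound on the noise rather than a significance test.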

Performance terms — including any throughput floor, the benchmark protocol, hardware/precision/model variant, and refund mechanics — are negotiated and defined in the licensing agreement. No guarantees are made or implied on this page; all marketing numbers are reproducible benchmarks measured on the configurations cited.

Roadmap

What ships next.

Roadmap projections are based on memory-bandwidth scaling and architecture similarity and are not yet measured, so none are listed here. Numbers go on this page only after we measure them.

  1. MACH 1 · Llama 3.1-8B-Instruct + Llama 3.3-70B-Instruct
    Available
    Available now · 43.5K–69.5K tok/s aggregate (8B, 1× B200) · 999 tok/s single-stream (70B, 2× B200) · measured
  2. MACH 2 · Llama 3.1 405B
    Roadmap
    Q1–Q2 2026 · Projection publishes after first measured run
  3. MACH 3 · Mistral Large 2
    Roadmap
    Q2 2026 · Projection publishes after first measured run
  4. MACH 4 · Qwen 2.5 72B
    Roadmap
    Q2 2026 · Projection publishes after first measured run
  5. MACH 5 · DeepSeek V3
    Roadmap
    Q3 2026 · Projection publishes after first measured run
  6. MACH 6 · Llama 4
    Roadmap
    Day-1 on release · Projection TBD
  7. MACH ∞ · Custom model engagements
    Open
    On request · Your model, your Blackwell hardware: full-stack optimization with a contract-defined throughput floor and refund mechanics
Deployment matrix

Supported hardware.

Validated today: Llama 3.1-8B-Instruct on a single B200 and Llama 3.3-70B-Instruct on 2× B200 with NVLink. The other two combinations in the table are in active validation. PCIe-only multi-GPU configurations are not supported.

Deployment · Model · Status
1× B200 · MACH 1 (Llama 3.1-8B-Instruct) · Validated
2× B200 with NVLink · MACH 1 (Llama 3.3-70B-Instruct) · Validated
1× B200 · MACH 1 (Llama 3.3-70B-Instruct) · Validation in progress
2× B200 with NVLink · MACH 1 (Llama 3.1-8B-Instruct) · Validation in progress
Hardware caveat · Multi-GPU deployment is supported only on systems with NVLink interconnect (HGX B200 or DGX B200). PCIe-only 2-GPU configurations are not supported, because memory-bandwidth scaling assumes NVLink-class interconnect between the GPUs.
Request access

Talk to us.

MACH 1 is delivered under NDA. Tell us what you're running and we'll send a recipe-fit assessment within one business day.