ML System Bottleneck Analyzer

Pick a model and hardware to see the estimated decode rate and which resource is the bottleneck.
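Decode throughput on modern accelerators is usually memory-bandwidth-bound: every generated token must stream all model weights from memory. A minimal roofline sketch of that estimate (the function name, the efficiency factor, and the example numbers are illustrative assumptions, not the tool's exact formula):

```python
def decode_tokens_per_s(mem_bw_gb_s, params_billions, bytes_per_param, efficiency=0.6):
    """Rough memory-bandwidth roofline for single-request decode.

    tok/s ~= effective bandwidth / model size in bytes.
    efficiency: fraction of peak bandwidth actually achieved (assumed).
    """
    model_bytes = params_billions * 1e9 * bytes_per_param
    return mem_bw_gb_s * 1e9 * efficiency / model_bytes

# Example: 70B parameters, FP16 (2 bytes/param), 3350 GB/s HBM (H100-class)
print(round(decode_tokens_per_s(3350, 70, 2), 1))  # → 14.4
```

Batching raises throughput because the same weight traffic is amortized over many requests, which is why the benchmark table below reports batch and single-request rates separately.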

Model

Tuning & advanced options

Defaults work for most setups. Change these only if you need specific behavior.

Total prompt + generated tokens per request.
Cost assumptions (power pricing)
Model internals (override preset)
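The power-pricing inputs feed a simple electricity-cost estimate per generated token. A hedged sketch of how such a number can be derived (all parameters here are assumptions for illustration, not the tool's defaults):

```python
def cost_per_million_tokens(watts, tokens_per_s, price_per_kwh):
    """Electricity cost (same currency as price_per_kwh) to generate
    one million tokens at a steady power draw."""
    seconds = 1e6 / tokens_per_s          # wall-clock time for 1M tokens
    kwh = watts * seconds / 3_600_000     # W·s -> kWh
    return kwh * price_per_kwh

# Example: 700 W device, 50 tok/s, 0.15 per kWh
print(round(cost_per_million_tokens(700, 50, 0.15), 3))  # → 0.583
```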

Hardware

Pick a GPU or CPU per device. Use + Add device for multi-GPU setups.

Results — approximate

Resource utilization

System topology — connection diagram

NVLink (300+ GB/s)
PCIe 5.0 (32-64 GB/s)
PCIe 4.0 (8-32 GB/s)
PCIe 3.0/DDR5 (<16 GB/s)
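The legend above orders links by bandwidth. A toy calculation showing why link class matters for multi-device decode, e.g. moving a per-token activation payload across each link (the 16 MB payload size is an illustrative assumption):

```python
def transfer_ms(payload_mb, link_gb_s):
    """Bandwidth-only transfer time in milliseconds (latency ignored)."""
    return payload_mb / 1024 / link_gb_s * 1000

# Illustrative 16 MB per-token payload over each link class
for name, bw in [("NVLink", 300), ("PCIe 5.0 x16", 64),
                 ("PCIe 4.0 x16", 32), ("PCIe 3.0 x16", 16)]:
    print(f"{name}: {transfer_ms(16, bw):.3f} ms")
```

At a target of, say, 50 tok/s the per-token budget is 20 ms, so a slow interconnect can consume a meaningful slice of it even before compute starts.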
📚 Published benchmarks for reference — click to expand

Real-world token rates from vendor, community, and research sources. Use the filters to find a reference close to your configuration.

Model · Quantization · Framework · Hardware · Batch Size · Sequence Length · Token Rate (Batch) · Token Rate (Single) · Source