ML System Bottleneck Analyzer

Pick a model, pick the hardware, and see where the system will bottleneck and how many tokens/sec to expect.

Use a quick-start preset, or configure everything manually below.

Model & workload

Llama 3 8B @ Q4
Model
Quantization (weight precision). Lower = smaller memory, faster decode, slight accuracy loss.
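As a rough illustration of why lower precision shrinks memory, here is a back-of-the-envelope footprint sketch. The bytes-per-weight values are nominal assumptions; real quantized formats carry extra scale/zero-point metadata, so actual files run slightly larger.

```python
# Nominal bytes per weight for common precisions (assumed values;
# group-wise quantization metadata adds a few percent on top).
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes here)."""
    return params_billions * BYTES_PER_WEIGHT[quant]

# An 8B model: ~16 GB at FP16 vs ~4 GB at Q4 (KV cache and activations extra).
print(weight_memory_gb(8, "FP16"))  # 16.0
print(weight_memory_gb(8, "Q4"))    # 4.0
```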
Workload
Total prompt + generated tokens per request. Click a preset or type a custom value.
Advanced model internals
Batch size: concurrent requests processed together. Higher = better GPU utilization but more memory.
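Why batching helps can be sketched with a simple memory-bandwidth-bound model of decode (the weight size and bandwidth below are hypothetical example numbers, not measurements from this tool):

```python
def decode_tokens_per_sec(batch: int, weight_bytes: float, mem_bw_bytes_s: float) -> float:
    """Memory-bandwidth-bound decode: each step streams the weights once and
    serves the whole batch, so aggregate throughput scales ~linearly with
    batch size until compute or memory capacity becomes the limit."""
    step_time_s = weight_bytes / mem_bw_bytes_s  # seconds per decode step
    return batch / step_time_s                   # tokens/sec across the batch

# Hypothetical 8B model at Q4 (~4 GB of weights) on a 1 TB/s GPU:
print(decode_tokens_per_sec(1, 4e9, 1e12))  # 250.0
print(decode_tokens_per_sec(8, 4e9, 1e12))  # 2000.0
```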
Distribution & runtime
AUTO enumerates all valid strategies and picks the highest decode rate.
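The AUTO search can be sketched as a brute-force enumeration over parallelism splits. `estimate_decode_rate` here is a placeholder scorer, not the tool's actual model:

```python
from itertools import product

def best_strategy(n_gpus: int, estimate_decode_rate) -> tuple:
    """Enumerate (tensor_parallel, pipeline_parallel) splits whose product
    uses all GPUs, score each with the supplied estimator, keep the fastest."""
    candidates = [(tp, pp)
                  for tp, pp in product(range(1, n_gpus + 1), repeat=2)
                  if tp * pp == n_gpus]
    return max(candidates, key=lambda s: estimate_decode_rate(*s))

# With a toy estimator that simply favors tensor parallelism:
print(best_strategy(4, lambda tp, pp: tp))  # (4, 1)
```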
Cost assumptions (power pricing)
Used only for the daily/monthly cost estimate in the System Total panel.
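The cost estimate itself is simple arithmetic: average draw times hours times price per kWh. The 700 W draw and $0.15/kWh rate below are illustrative assumptions:

```python
def energy_cost_usd(avg_watts: float, hours: float, usd_per_kwh: float) -> float:
    """Electricity cost: watts -> kWh -> dollars."""
    return avg_watts / 1000.0 * hours * usd_per_kwh

daily = energy_cost_usd(700, 24, 0.15)  # e.g. one GPU at steady 700 W
print(round(daily, 2))       # 2.52  (per day)
print(round(daily * 30, 2))  # 75.6  (per 30-day month)
```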

Hardware

Pick a template per device, or edit any field to create a custom spec. The device library saves custom builds for reuse.

System analysis — token rates are approximations
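The reason the rates are approximations is that they come from a roofline-style bound: a decode step takes at least as long as the slower of its compute and its memory traffic. A minimal sketch, with illustrative numbers for an 8B Q4 model on a hypothetical 100 TFLOP/s, 1 TB/s GPU:

```python
def decode_rate_tok_s(flops_per_token: float, peak_flops: float,
                      bytes_per_token: float, mem_bw: float) -> float:
    """Roofline-style bound: step time is the max of compute time and
    memory-traffic time; the larger term is the bottleneck."""
    t_compute = flops_per_token / peak_flops
    t_memory = bytes_per_token / mem_bw
    return 1.0 / max(t_compute, t_memory)

# ~2 FLOPs per weight per token, ~4 GB weights streamed per step:
print(decode_rate_tok_s(1.6e10, 1e14, 4e9, 1e12))  # 250.0 (memory-bound)
```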

Resource utilization

System topology — connection diagram

NVLink (300+ GB/s)
PCIe 5.0 (32-64 GB/s)
PCIe 4.0 (8-32 GB/s)
PCIe 3.0/DDR5 (<16 GB/s)
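Link class matters because tensor parallelism moves activations between devices on every token. A very rough sketch of that traffic, using an assumed Llama-3-8B-like shape (hidden size 4096, 32 layers) and the ring all-reduce pattern of two reductions per layer:

```python
def tp_allreduce_us_per_token(hidden: int, layers: int, link_gb_s: float,
                              bytes_per_elem: int = 2) -> float:
    """Rough tensor-parallel communication per generated token: two
    all-reduces per transformer layer, each moving ~hidden activations."""
    payload_bytes = 2 * layers * hidden * bytes_per_elem
    return payload_bytes / (link_gb_s * 1e9) * 1e6  # microseconds

print(round(tp_allreduce_us_per_token(4096, 32, 300), 1))  # NVLink-class link
print(round(tp_allreduce_us_per_token(4096, 32, 16), 1))   # PCIe 3.0-class link
```

The same payload that is negligible over NVLink can dominate the per-token budget over a slow PCIe link, which is why the diagram color-codes connections.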
📚 Published benchmarks for reference — click to expand

Real-world token rates from vendor, community, and research sources. Use the filters to find a reference close to your configuration.

Model | Quantization | Framework | Hardware | Batch Size | Sequence Length | Token Rate (Batch) | Token Rate (Single) | Source