ML System Bottleneck Analyzer

Pick a model, pick the hardware, and see where the system will bottleneck and how many tokens/sec to expect.

Use a quick-start preset, or configure everything manually below.

Model & workload

Llama 3 8B @ Q4
Model
Quantization (weight precision). Lower = smaller memory, faster decode, slight accuracy loss.
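As a rough illustration of why lower precision shrinks memory, here is a back-of-the-envelope footprint sketch. The bytes-per-weight values are nominal assumptions; real quantized formats carry extra scale/zero-point metadata, so actual files run slightly larger.

```python
# Nominal bytes per weight for common precisions (assumed values;
# group-wise quantization metadata adds a few percent on top).
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes here)."""
    return params_billions * BYTES_PER_WEIGHT[quant]

# An 8B model: ~16 GB at FP16 vs ~4 GB at Q4 (KV cache and activations extra).
print(weight_memory_gb(8, "FP16"))  # 16.0
print(weight_memory_gb(8, "Q4"))    # 4.0
```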
Workload
Total prompt + generated tokens per request. Click a preset or type a custom value.
Advanced model internals
Batch size: concurrent requests processed together. Higher = better GPU utilization but more memory.
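Why batching helps can be sketched with a simple memory-bandwidth-bound model of decode (the weight size and bandwidth below are hypothetical example numbers, not measurements from this tool):

```python
def decode_tokens_per_sec(batch: int, weight_bytes: float, mem_bw_bytes_s: float) -> float:
    """Memory-bandwidth-bound decode: each step streams the weights once and
    serves the whole batch, so aggregate throughput scales ~linearly with
    batch size until compute or memory capacity becomes the limit."""
    step_time_s = weight_bytes / mem_bw_bytes_s  # seconds per decode step
    return batch / step_time_s                   # tokens/sec across the batch

# Hypothetical 8B model at Q4 (~4 GB of weights) on a 1 TB/s GPU:
print(decode_tokens_per_sec(1, 4e9, 1e12))  # 250.0
print(decode_tokens_per_sec(8, 4e9, 1e12))  # 2000.0
```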
Distribution & runtime
AUTO enumerates all valid strategies and picks the highest decode rate.
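The AUTO search can be sketched as a brute-force enumeration over parallelism splits. `estimate_decode_rate` here is a placeholder scorer, not the tool's actual model:

```python
from itertools import product

def best_strategy(n_gpus: int, estimate_decode_rate) -> tuple:
    """Enumerate (tensor_parallel, pipeline_parallel) splits whose product
    uses all GPUs, score each with the supplied estimator, keep the fastest."""
    candidates = [(tp, pp)
                  for tp, pp in product(range(1, n_gpus + 1), repeat=2)
                  if tp * pp == n_gpus]
    return max(candidates, key=lambda s: estimate_decode_rate(*s))

# With a toy estimator that simply favors tensor parallelism:
print(best_strategy(4, lambda tp, pp: tp))  # (4, 1)
```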
Cost assumptions (power pricing)
Used only for the daily/monthly cost estimate in the System Total panel.
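The cost estimate itself is simple arithmetic: average draw times hours times price per kWh. The 700 W draw and $0.15/kWh rate below are illustrative assumptions:

```python
def energy_cost_usd(avg_watts: float, hours: float, usd_per_kwh: float) -> float:
    """Electricity cost: watts -> kWh -> dollars."""
    return avg_watts / 1000.0 * hours * usd_per_kwh

daily = energy_cost_usd(700, 24, 0.15)  # e.g. one GPU at steady 700 W
print(round(daily, 2))       # 2.52  (per day)
print(round(daily * 30, 2))  # 75.6  (per 30-day month)
```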

Hardware

Pick a template per device, or edit any field to create a custom spec. The device library saves custom builds for reuse.

System analysis — token rates are approximations
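The reason the rates are approximations is that they come from a roofline-style bound: a decode step takes at least as long as the slower of its compute and its memory traffic. A minimal sketch, with illustrative numbers for an 8B Q4 model on a hypothetical 100 TFLOP/s, 1 TB/s GPU:

```python
def decode_rate_tok_s(flops_per_token: float, peak_flops: float,
                      bytes_per_token: float, mem_bw: float) -> float:
    """Roofline-style bound: step time is the max of compute time and
    memory-traffic time; the larger term is the bottleneck."""
    t_compute = flops_per_token / peak_flops
    t_memory = bytes_per_token / mem_bw
    return 1.0 / max(t_compute, t_memory)

# ~2 FLOPs per weight per token, ~4 GB weights streamed per step:
print(decode_rate_tok_s(1.6e10, 1e14, 4e9, 1e12))  # 250.0 (memory-bound)
```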

Resource utilization

System topology — connection diagram

NVLink (300+ GB/s)
PCIe 5.0 (32-64 GB/s)
PCIe 4.0 (8-32 GB/s)
PCIe 3.0/DDR5 (<16 GB/s)
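Link class matters because tensor parallelism moves activations between devices on every token. A very rough sketch of that traffic, using an assumed Llama-3-8B-like shape (hidden size 4096, 32 layers) and the ring all-reduce pattern of two reductions per layer:

```python
def tp_allreduce_us_per_token(hidden: int, layers: int, link_gb_s: float,
                              bytes_per_elem: int = 2) -> float:
    """Rough tensor-parallel communication per generated token: two
    all-reduces per transformer layer, each moving ~hidden activations."""
    payload_bytes = 2 * layers * hidden * bytes_per_elem
    return payload_bytes / (link_gb_s * 1e9) * 1e6  # microseconds

print(round(tp_allreduce_us_per_token(4096, 32, 300), 1))  # NVLink-class link
print(round(tp_allreduce_us_per_token(4096, 32, 16), 1))   # PCIe 3.0-class link
```

The same payload that is negligible over NVLink can dominate the per-token budget over a slow PCIe link, which is why the diagram color-codes connections.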
📚 Published benchmarks for reference — click to expand

Real-world token rates from vendor, community, and research sources. Use the filters to find a reference close to your configuration.

Model | Quantization | Framework | Hardware | Batch Size | Sequence Length | Token Rate (Batch) | Token Rate (Single) | Source