While everyone's geeking out over Grok4's insane physics sims and Kimi K2's 1T OS bombshell (crushing coding benchmarks for pennies), the real AI drama is in the pricing shadows. After my LLM Selector post blew up here, I kept getting DMs asking "but which provider should I actually use?" So I dove deep into 439 models across 63 providers.
What did I find? Some interesting insights:
1. Huge markups on identical models
Take DeepSeek R1 0528 (quality 68 on the Artificial Analysis benchmark; beats many flagships):
Completely free on Google Vertex and CentML (decent speeds too, 121 tok/s and 87 tok/s).
But it jumps to $0.91/1M on Deepinfra, $4.25/1M on Fireworks Fast, and a whopping $5.50/1M on SambaNova, for the exact same model (with speed differences, of course).
Arbitrage alert: Why pay infinite markup when free tiers deliver the goods for experimentation or bulk runs?
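To make the spread concrete, here's a quick back-of-envelope sketch using only the per-1M-token prices quoted above (the 100M-token job size is just an illustrative assumption):

```python
# Per-1M-token prices for DeepSeek R1 0528, as quoted above.
prices_per_1m = {
    "Google Vertex": 0.00,   # free tier
    "CentML": 0.00,          # free tier
    "Deepinfra": 0.91,
    "Fireworks Fast": 4.25,
    "SambaNova": 5.50,
}

# What a hypothetical 100M-token bulk run costs on each provider.
tokens = 100_000_000
for provider, price in sorted(prices_per_1m.items(), key=lambda kv: kv[1]):
    cost = price * tokens / 1_000_000
    print(f"{provider:15s} ${cost:8.2f}")
```

Same model, same run: $0 on the free tiers versus $550 on the priciest endpoint.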
2. Latency goldmines hiding in plain sight
Sub-millisecond responses aren't just for premium setups:
Nebius Base crushes it with DeepSeek R1 at 0.61ms latency for $1.00/1M (103 tok/s) and Qwen3 235B at 0.56ms for $0.30/1M (50 tok/s).
Groq takes it further with models like Qwen3 32B at 0.14ms for $0.36/1M (627 tok/s).
Arbitrage alert: These blow away slower "enterprise" options costing 10x more. Ideal for real-time apps.
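A rough way to reason about this: end-to-end response time is approximately first-token latency plus output tokens divided by throughput. A minimal sketch, using only the latency and tok/s figures quoted above (the 500-token reply length is an assumption):

```python
# Figures as quoted above: first-token latency and sustained throughput.
providers = {
    "Nebius Base (DeepSeek R1)": {"latency_s": 0.61e-3, "tok_s": 103},
    "Nebius Base (Qwen3 235B)":  {"latency_s": 0.56e-3, "tok_s": 50},
    "Groq (Qwen3 32B)":          {"latency_s": 0.14e-3, "tok_s": 627},
}

def total_time(p, out_tokens):
    # end-to-end time ~= time to first token + generation time
    return p["latency_s"] + out_tokens / p["tok_s"]

for name, p in providers.items():
    print(f"{name}: 500-token reply in ~{total_time(p, 500):.2f}s")
```

Note that at these latencies the generation throughput dominates for anything longer than a one-liner, so for chatty workloads the tok/s column matters more than the latency column.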
3. Speed demons with massive throughput gaps
Hardware optimization creates wild performance swings:
Cerebras with Qwen3 32B at 2,496 tok/s for $0.50/1M and Llama 4 Scout at 2,808 tok/s for $0.70/1M.
Compare to the same models elsewhere: Often stuck at 40-80 tok/s for similar or higher prices.
Arbitrage alert: 50x+ throughput boosts on the same model?
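The wall-clock impact is easy to underestimate. A sketch of what a single-stream batch job looks like at the tok/s figures quoted above (the 10M-token job size and the 60 tok/s "typical" figure are illustrative assumptions from the 40-80 range):

```python
# Wall-clock hours for a hypothetical 10M-token single-stream batch job.
tokens = 10_000_000
throughputs = {
    "Cerebras (Qwen3 32B)":   2496,  # tok/s, as quoted above
    "Cerebras (Llama 4 Scout)": 2808,
    "typical provider":       60,    # assumed midpoint of the 40-80 range
}
for name, tok_s in throughputs.items():
    hours = tokens / tok_s / 3600
    print(f"{name}: ~{hours:.1f} h")
```

Roughly an hour versus the better part of two days for the same tokens.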
4. Quality overpays that defy logic
High quality doesn't mean high price anymore:
Qwen3 235B (quality 62) at $0.10/1M on Fireworks (79 tok/s) outperforms Claude 4 Opus (quality 58), which costs $30/1M everywhere (19-65 tok/s).
Grok 3 mini (quality 67) at $0.35/1M on xAI (210 tok/s), edging out pricier closed source rivals.
Arbitrage alert: 300x cheaper for better quality? Open-source gems like these make "premium" models look like rip-offs lol
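"Quality points per dollar" is a crude but useful ranking metric. A sketch using the quality scores and per-1M prices quoted above (the metric itself is my own simplification, not something from the benchmark):

```python
# Quality score and $/1M price, as quoted above.
models = {
    "Qwen3 235B (Fireworks)": {"quality": 62, "price": 0.10},
    "Grok 3 mini (xAI)":      {"quality": 67, "price": 0.35},
    "Claude 4 Opus":          {"quality": 58, "price": 30.00},
}

# Rank by quality points per dollar, best first.
ranked = sorted(models.items(), key=lambda kv: kv[1]["quality"] / kv[1]["price"], reverse=True)
for name, m in ranked:
    print(f"{name}: ~{m['quality'] / m['price']:.1f} quality points per $")
```

The two open models land hundreds of quality points per dollar; the closed flagship lands about two.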
5. Provider flips on big-name models
Even giants like OpenAI show huge variances:
GPT-4.1 mini ($0.70/1M): Azure blasts 217 tok/s vs OpenAI's 73 tok/s.
o3 ($3.50/1M): OpenAI hits 199 tok/s vs Azure's slower 99 tok/s (with double the latency).
Arbitrage alert: Same price, but 3x throughput or half the latency? Picking the right endpoint saves thousands on production workloads.
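When price is identical across endpoints, the pick reduces to a one-liner. A toy selector over the figures quoted above (swap the value to latency if you care about interactivity instead of throughput):

```python
# tok/s per endpoint at the same price, as quoted above.
endpoints = {
    "GPT-4.1 mini": {"Azure": 217, "OpenAI": 73},   # both $0.70/1M
    "o3":           {"OpenAI": 199, "Azure": 99},   # both $3.50/1M
}

def fastest(model: str) -> str:
    """Return the endpoint with the highest throughput for this model."""
    provs = endpoints[model]
    return max(provs, key=provs.get)

for model in endpoints:
    print(f"{model}: route to {fastest(model)}")
```

Note the winner flips per model, which is exactly why a blanket "always use provider X" rule leaves money on the table.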
We're in the Wild West of pricing amid all the hype. Big names coast on reputation, but smaller providers like Nebius and Cerebras optimize like mad.
Open-source crushes closed-source on value: top 20 price-perf plays are ALL open.
What should you do?
Stop assuming expensive = better
Hunt latency and speed arbitrages (they're everywhere)
Test specialised providers for throughput wins
Grab sub-$0.50 open-source beasts (like Qwen3 or Grok mini)
Exploit these gaps now before "normalization" hits
I centralised all the data from Artificial Analysis on whatllm.com, and the insights are the real gold.
Found crazier arbitrages? Spill in comments!
Which hype are you actually buying, and why?
This rabbit hole hit harder than any benchmark!
Happy to geek out more!