2026-04-13 22:47:13 UTC
MachuPikacchu on Nostr:

Haven’t run Qwen in a minute, but it’s surprising you’re not getting higher throughput for gemma4 on your 3090s.

For what it’s worth, if you use llama.cpp and disable reasoning you should see a faster time to first token, at the cost of a slight degradation in quality. I haven’t used vLLM so I can’t comment there. For comparison, I get 70-75 tok/s on a MacBook M3, and that only has 40 GPU cores.
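Rough sketch of what I mean, if you go the llama.cpp route. The model path here is made up, and whether the reasoning flag exists depends on how recent your build is, so check `llama-server --help` first:

```shell
# Sketch only: the GGUF path is hypothetical, and flag availability
# depends on your llama.cpp build (verify with `llama-server --help`).
# In recent builds, --reasoning-budget 0 disables thinking tokens.
llama-server -m ./qwen3-8b-q4_k_m.gguf -ngl 99 --reasoning-budget 0

# Rough tok/s comparison using the bundled benchmark tool:
llama-bench -m ./qwen3-8b-q4_k_m.gguf -ngl 99
```

`-ngl 99` just offloads all layers to the GPU, which you'd want on a 3090 anyway.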