2025-01-27 10:13:20 UTC
Timo Zimmermann on Nostr:

I’m using Ollama to host LLMs (so llama.cpp underneath)

- system memory is usable
- if the model can’t be fully offloaded to the GPU, it offloads the maximum number of layers that fit in VRAM and runs the rest on the CPU
- GPU utilization then drops to 15–30%
- CPU utilization sits around 60%
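The split above can also be steered by hand. Assuming the default Ollama REST endpoint on localhost, the `num_gpu` option caps how many layers go to VRAM and `num_ctx` sets the context window (a sketch; the model tag and layer count here are made-up examples, not the author’s setup):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_request(model, prompt, num_gpu=None, num_ctx=None):
    """Build a /api/generate payload; num_gpu caps layers offloaded to VRAM."""
    options = {}
    if num_gpu is not None:
        options["num_gpu"] = num_gpu  # fewer layers -> less VRAM, more CPU work
    if num_ctx is not None:
        options["num_ctx"] = num_ctx  # context window in tokens
    return {"model": model, "prompt": prompt, "stream": False, "options": options}

# Hypothetical 32B Q8 model tag; 40 of its layers pinned to the GPU
payload = build_request("some-32b-model-q8_0", "Hello", num_gpu=40, num_ctx=131072)

def send(payload):
    """POST the payload to Ollama and return the parsed JSON response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Left alone, Ollama picks `num_gpu` automatically, which is the behaviour described in the bullets above.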

A 32B model at Q8 uses the 22 GB available on the 4090 plus 15 GB of system memory.
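The spill is easy to sanity-check from the weight sizes alone. Assuming llama.cpp’s Q8_0 (~8.5 bits/weight) and a Q4_K-style quant (~4.85 bits/weight), a back-of-envelope for a 32B model:

```python
PARAMS = 32e9  # 32B parameters
VRAM_GB = 22   # free VRAM on the 4090 per the post

def weight_gb(bits_per_weight):
    """Approximate on-disk/in-memory size of the weights in GB."""
    return PARAMS * bits_per_weight / 8 / 1e9

q8 = weight_gb(8.5)   # Q8_0  -> ~34 GB, doesn't fit in 22 GB
q4 = weight_gb(4.85)  # Q4_K  -> ~19 GB, fits

print(f"Q8 weights ~{q8:.0f} GB -> ~{q8 - VRAM_GB:.0f} GB spills to system RAM")
print(f"Q4 weights ~{q4:.0f} GB -> fits in {VRAM_GB} GB VRAM")
```

The ~12 GB of spilled Q8 weights plus KV cache and runtime overhead lines up roughly with the 15 GB of system memory reported above.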

A response at 132k context takes about 3 minutes.
The same model at Q4, fully offloaded to VRAM: 19 s.
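Those two timings put a number on the cost of spilling layers to the CPU:

```python
partial_offload_s = 3 * 60  # Q8, layers split across GPU and CPU
full_offload_s = 19         # Q4, everything in VRAM

print(f"~{partial_offload_s / full_offload_s:.1f}x slower with partial offload")
```

Not an apples-to-apples quant comparison, but it shows why fitting the whole model in VRAM dominates everything else.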