2025-01-27 10:13:20 UTC
Timo Zimmermann on Nostr:

I’m using Ollama to host LLMs (so llama.cpp underneath)

- system memory is usable
- if the model can’t be fully offloaded to the GPU, it offloads the maximum number of layers that fit in VRAM and runs the rest on the CPU
- GPU utilization then drops to 15–30%
- CPU utilization sits around 60%
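The split above can also be steered by hand. Assuming the default Ollama REST endpoint on localhost, the `num_gpu` option caps how many layers go to VRAM and `num_ctx` sets the context window (a sketch; the model tag and layer count here are made-up examples, not the author’s setup):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_request(model, prompt, num_gpu=None, num_ctx=None):
    """Build a /api/generate payload; num_gpu caps layers offloaded to VRAM."""
    options = {}
    if num_gpu is not None:
        options["num_gpu"] = num_gpu  # fewer layers -> less VRAM, more CPU work
    if num_ctx is not None:
        options["num_ctx"] = num_ctx  # context window in tokens
    return {"model": model, "prompt": prompt, "stream": False, "options": options}

# Hypothetical 32B Q8 model tag; 40 of its layers pinned to the GPU
payload = build_request("some-32b-model-q8_0", "Hello", num_gpu=40, num_ctx=131072)

def send(payload):
    """POST the payload to Ollama and return the parsed JSON response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Left alone, Ollama picks `num_gpu` automatically, which is the behaviour described in the bullets above.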

A 32B model at Q8 uses the 22 GB available on the 4090 plus 15 GB of system memory.
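The spill is easy to sanity-check from the weight sizes alone. Assuming llama.cpp’s Q8_0 (~8.5 bits/weight) and a Q4_K-style quant (~4.85 bits/weight), a back-of-envelope for a 32B model:

```python
PARAMS = 32e9  # 32B parameters
VRAM_GB = 22   # free VRAM on the 4090 per the post

def weight_gb(bits_per_weight):
    """Approximate on-disk/in-memory size of the weights in GB."""
    return PARAMS * bits_per_weight / 8 / 1e9

q8 = weight_gb(8.5)   # Q8_0  -> ~34 GB, doesn't fit in 22 GB
q4 = weight_gb(4.85)  # Q4_K  -> ~19 GB, fits

print(f"Q8 weights ~{q8:.0f} GB -> ~{q8 - VRAM_GB:.0f} GB spills to system RAM")
print(f"Q4 weights ~{q4:.0f} GB -> fits in {VRAM_GB} GB VRAM")
```

The ~12 GB of spilled Q8 weights plus KV cache and runtime overhead lines up roughly with the 15 GB of system memory reported above.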

A response at 132k context takes about 3 minutes.
The same model at Q4, fully offloaded to VRAM: 19 s.
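Those two timings put a number on the cost of spilling layers to the CPU:

```python
partial_offload_s = 3 * 60  # Q8, layers split across GPU and CPU
full_offload_s = 19         # Q4, everything in VRAM

print(f"~{partial_offload_s / full_offload_s:.1f}x slower with partial offload")
```

Not an apples-to-apples quant comparison, but it shows why fitting the whole model in VRAM dominates everything else.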