2026-03-29 17:37:43 UTC
in reply to

Giovanni Crisalfi on Nostr: nprofile1q…fdfxr

Yep, bandwidth-bound workloads benefit most (as I found while optimizing this: https://github.com/gicrisf/qwen-asr-rs/tree/bf16-gemm)

In this case, most of the gain comes from the weight matrices: consuming them directly as bf16 avoids allocating temporary f32 buffers and halves input/output memory traffic.
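To make that concrete, here is a minimal sketch of the idea, not the code from the linked branch. All names (`bf16_to_f32`, `matvec_via_f32_temp`, `matvec_bf16_direct`) are hypothetical, and bf16 values are assumed to be stored as raw `u16` bits so that no external crate is needed. The first version widens the whole weight matrix into a temporary f32 buffer; the second reads the 2-byte weights directly and widens each one on the fly, accumulating in f32.

```rust
/// Widen one bf16 value (raw bits) to f32.
/// bf16 is the upper 16 bits of an IEEE-754 f32, so a shift is exact.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Narrow f32 to bf16 by truncation (illustration only; real code
/// would normally round to nearest even).
fn f32_to_bf16(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Baseline: up-convert all weights into a temporary f32 buffer first.
/// This allocates 4 bytes per weight and writes, then re-reads, that buffer.
fn matvec_via_f32_temp(w: &[u16], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let tmp: Vec<f32> = w.iter().map(|&b| bf16_to_f32(b)).collect();
    (0..rows)
        .map(|r| {
            tmp[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// Direct version: read 2-byte bf16 weights, widen in registers,
/// accumulate in f32. No temporary buffer is allocated for the weights.
fn matvec_bf16_direct(w: &[u16], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(&a, b)| bf16_to_f32(a) * b)
                .sum::<f32>()
        })
        .collect()
}
```

Both functions perform the same multiplications in the same order, so they return identical results; the difference is only in how many bytes move through memory. A production GEMM would also block and vectorize the loops, which this sketch leaves out.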