nprofile1qy2hwumn8ghj7un9d3shjtnyd968gmewwp6kyqpqh659u7uz26ggxc9t534espqa3scnduw7fqhzcw86z74ejd0xsdpsxfdfxr (nprofile…dfxr) Yep, bandwidth-bound workloads benefit most (as I found while optimizing this: https://github.com/gicrisf/qwen-asr-rs/tree/bf16-gemm)
In this case, most of the gain comes from the weight matrices: running the GEMM directly on bf16 avoids allocating temp f32 buffers and halves input/output memory traffic.
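
To make the "no temp f32 buffers" point concrete, here is a minimal sketch in plain Rust. It is not the code from the linked branch. The names `bf16_to_f32`, `f32_to_bf16_trunc` and `gemv_bf16` are hypothetical, and bf16 values are assumed to be stored as raw `u16` bits. Each weight is widened in a register right before the multiply, so the weight matrix is only ever read as 2 bytes per element and no f32 copy of it is allocated.

```rust
/// Widen one bf16 value (raw u16 bits) to f32.
/// bf16 is the upper 16 bits of an IEEE-754 f32, so widening is a shift.
#[inline]
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Truncating f32 -> bf16 conversion (drops the low 16 bits, no rounding).
/// Hypothetical helper, only meant for building small test inputs.
#[inline]
fn f32_to_bf16_trunc(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// y = W * x, where W is a row-major `rows x cols` matrix stored as bf16.
/// Weights are widened one element at a time inside the dot product,
/// so the only allocation is the output vector.
fn gemv_bf16(w: &[u16], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert!(cols > 0, "cols must be non-zero");
    assert_eq!(w.len(), rows * cols, "weight buffer has the wrong size");
    assert_eq!(x.len(), cols, "input vector has the wrong size");

    w.chunks_exact(cols)
        .map(|row| {
            row.iter()
                .zip(x)
                .map(|(&wb, &xv)| bf16_to_f32(wb) * xv)
                .sum::<f32>()
        })
        .collect()
}
```

Called as `gemv_bf16(&w, &x, rows, cols)`, this accumulates in f32 while the weights stay in bf16 in memory. A real kernel would add SIMD and blocking on top; the sketch only shows where the temp buffer disappears.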