it might not just be the FPU performance. chucking four single-precision floats around per pixel (16 bytes) eats four times as much memory bandwidth and cache as one uint32 (4 bytes), and i believe both are fairly constrained resources on a Pi 400 (and, well, everything else). it all adds up... which is to say, if you don't feel up to rearchitecting the whole library, then you might get most of the way there by simply¹ changing it to store its buffers as RGBA8 and only convert to floats to manipulate them
1. "simply" might turn out to be a grossly inappropriate word here... sorry
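
to make that a bit more concrete, here's a rough sketch in C of what i mean. none of these names come from the library in question: `rgba_f32`, `unpack_rgba8`, `pack_rgba8` and `scale_brightness` are all made up for illustration, and i'm assuming red sits in the low byte and that the floats are normalised to 0..1. the buffer stays packed at 4 bytes per pixel, and each pixel only becomes floats for as long as it takes to do the maths on it

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical working type: four singles, 16 bytes per pixel */
typedef struct { float r, g, b, a; } rgba_f32;

/* expand one packed RGBA8 pixel to floats in 0..1
   (assumes red is in the low byte, alpha in the high byte) */
static inline rgba_f32 unpack_rgba8(uint32_t p) {
    rgba_f32 c;
    c.r = (float)(p & 0xFFu) / 255.0f;
    c.g = (float)((p >> 8) & 0xFFu) / 255.0f;
    c.b = (float)((p >> 16) & 0xFFu) / 255.0f;
    c.a = (float)((p >> 24) & 0xFFu) / 255.0f;
    return c;
}

/* clamp to 0..1, then round to the nearest 8-bit value */
static inline uint32_t to_u8(float v) {
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (uint32_t)(v * 255.0f + 0.5f);
}

/* squash the floats back into one packed RGBA8 pixel */
static inline uint32_t pack_rgba8(rgba_f32 c) {
    return to_u8(c.r) | (to_u8(c.g) << 8) |
           (to_u8(c.b) << 16) | (to_u8(c.a) << 24);
}

/* example operation: the buffer stays RGBA8 in memory, and floats
   only exist for the one pixel currently being worked on */
void scale_brightness(uint32_t *buf, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        rgba_f32 c = unpack_rgba8(buf[i]);
        c.r *= k;
        c.g *= k;
        c.b *= k;
        buf[i] = pack_rgba8(c);
    }
}
```

the catch is that every round trip through 8 bits quantises the result, so if the library chains a lot of operations on the same buffer you'd want to do them all while the pixel is still in float form rather than packing and unpacking between each step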