I could probably improve it further with a closer eye on cache usage, a buffer size larger than 1 sample, SIMD optimizations, and maybe some kind of GPU offloading node
Oh, and maybe lowering the worker thread wait time so that it's not spending 1+ ms just waiting around for the buffer to be filled
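
A rough sketch of the wait-time idea, assuming the worker currently sleeps or polls on a fixed interval (that's a guess about the cause, not something confirmed here). Instead of sleeping, the worker blocks on a condition variable and gets woken as soon as enough samples are in. `SampleBuffer`, `push`, `take`, and `run_demo` are made-up names for illustration, and it's in Python just to show the shape, not actual engine code:

```python
import threading
from collections import deque


class SampleBuffer:
    """Hypothetical buffer: the worker blocks until enough samples arrive."""

    def __init__(self):
        self._samples = deque()
        self._cond = threading.Condition()

    def push(self, sample):
        # Producer side: add a sample and wake the waiting worker immediately,
        # rather than letting it find out on its next polling tick.
        with self._cond:
            self._samples.append(sample)
            self._cond.notify()

    def take(self, needed, timeout=1.0):
        # Worker side: sleep until `needed` samples are available.
        # Returns None if the timeout expires first.
        with self._cond:
            ready = self._cond.wait_for(
                lambda: len(self._samples) >= needed, timeout
            )
            if not ready:
                return None
            return [self._samples.popleft() for _ in range(needed)]


def run_demo(block_size=4):
    """Push `block_size` samples from the main thread; a worker collects them."""
    buf = SampleBuffer()
    result = {}

    def worker():
        result["block"] = buf.take(block_size)

    t = threading.Thread(target=worker)
    t.start()
    for i in range(block_size):
        buf.push(float(i))
    t.join()
    return result["block"]
```

Calling `run_demo()` returns the block the worker collected. The point is that the worker's wake-up latency is tied to the producer's `notify()` instead of a fixed sleep duration; in a real-time audio thread you'd likely want a lock-free or spin-then-wait variant, which this sketch doesn't cover.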
