Anyway, a huge pipeline is only going to annihilate your Fmax. Instead, build a “programmable ALU” with a 128/256-bit datapath, RAM for the register file, and the operation microcode in ROM-ish blocks.
It’ll still be faster than a CPU; this is how most hardware accelerators for ECDSA work.
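
To make the idea concrete, here is a rough behavioural model in Python of what I mean by a microcoded ALU: one shared modular ALU, a small register RAM, and a ROM of micro-ops stepped through by a sequencer. Everything in it is made up for illustration: the opcode names, the `(opcode, dst, src_a, src_b)` micro-op layout, the `run` and `alu` helpers, and the choice of the secp256k1 field prime as the modulus. It also treats every micro-op as one step, whereas a real 256-bit modular multiplier would take many clock cycles, so don't read the step count as a timing estimate.

```python
# Behavioural sketch of a microcoded "programmable ALU".
# All names and the micro-op format are hypothetical, not from any real core.

# Example modulus only: the secp256k1 field prime (an assumption for this sketch).
P = 2**256 - 2**32 - 977

# Microcode "ROM": each micro-op is (opcode, dest_reg, src_a, src_b).
# This routine computes r3 = (r0 + r1) * r2 mod P.
MICROCODE_ROM = [
    ("ADD", 3, 0, 1),
    ("MUL", 3, 3, 2),
    ("HALT", 0, 0, 0),
]


def alu(op, a, b, p=P):
    """The single shared modular ALU that every micro-op reuses."""
    if op == "ADD":
        return (a + b) % p
    if op == "SUB":
        return (a - b) % p
    if op == "MUL":
        return (a * b) % p
    raise ValueError(f"unknown opcode: {op}")


def run(rom, regs):
    """Sequencer: fetch one micro-op at a time, read register RAM, write back.

    Returns the final register contents and the number of micro-ops executed.
    """
    regs = list(regs)  # copy so the caller's register image is untouched
    pc = 0
    steps = 0
    while True:
        op, dst, src_a, src_b = rom[pc]
        if op == "HALT":
            return regs, steps
        regs[dst] = alu(op, regs[src_a], regs[src_b])
        pc += 1
        steps += 1


if __name__ == "__main__":
    final_regs, steps = run(MICROCODE_ROM, [5, 7, 3, 0])
    print(final_regs[3], steps)
```

The point of the structure is that point add/double become longer microcode listings in the ROM rather than more logic, so the critical path stays one ALU wide no matter how complicated the operation gets.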
