npub1sld6cd53kj8pkz9svnnt54um57nvwrlx9sgya7xhvhl5a5a4f3sq8063kn (npub1sld…63kn) might have something to do with how things pack into op cache lines?
cmp-then-movapd-then-jne is 3 ops, since the cmp and jne aren't adjacent and so can't fuse
movapd-then-cmp-then-jne can potentially fuse the now-adjacent cmp+jne pair into 1 macro-op, so 2 ops
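
rough sketch of that counting, as a toy model only: it assumes a cmp fuses with a jne only when the jne comes immediately after it, and it ignores every other fusion rule real hardware has (operand forms, alignment, which cores support which pairs). `count_macro_ops` is a made-up helper name, not anything from a real tool.

```python
def count_macro_ops(seq):
    """Count ops in a toy model where cmp + an immediately following jne fuse."""
    ops = 0
    i = 0
    while i < len(seq):
        # Assumption: only an adjacent cmp/jne pair fuses into one macro-op.
        if seq[i] == "cmp" and i + 1 < len(seq) and seq[i + 1] == "jne":
            i += 2  # the pair is consumed as a single op
        else:
            i += 1
        ops += 1
    return ops


print(count_macro_ops(["cmp", "movapd", "jne"]))  # 3
print(count_macro_ops(["movapd", "cmp", "jne"]))  # 2
```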
that cmp is 9 bytes, so pretty big
maybe the extra nop pushes it over some boundary so it can't go into some op cache line, and that avoids other hiccups?
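
quick way to sanity-check the boundary idea, assuming a 64-byte window (the real granularity depends on the microarchitecture, so treat `window` as a guess) and made-up start offsets. `crosses_boundary` is a hypothetical helper; the actual offsets would have to come from a disassembly of the real binary.

```python
def crosses_boundary(start, size, window=64):
    """True if an instruction occupying [start, start + size) spans two windows."""
    return start // window != (start + size - 1) // window


CMP_SIZE = 9  # the 9-byte cmp mentioned above

# Hypothetical offsets: a 1-byte nop in front shifts the cmp from 55 to 56.
for start in (55, 56):
    print(start, crosses_boundary(start, CMP_SIZE))
```

with these invented offsets the cmp fits inside one window at 55 and straddles two at 56, which is the kind of shift a single nop could cause. whether straddling helps or hurts on the actual chip is still the open question.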