If I understand correctly how this works: there is a small always-on low-power core that records everything to a small buffer and does just enough signal processing to see if there's a reasonable chance that you've said the activation phrase. When it detects this trigger, it wakes up the main core, which grabs the buffer and does some more complex signal processing to see if you *really* (or, at least, with much higher probability) said the activation phrase. If so, the audio is then forwarded to the thing that processes the command.
If the code on the main core doesn't have microphone access, the core is still woken up, but then the process that tries to check if you really said the activation phrase fails because it can't access the microphone.
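
To make the two stages concrete, here is a toy sketch of the flow as I understand it. It is not how any real device implements this: words stand in for audio samples, a first-letter match stands in for the cheap detector, and `LowPowerCore`, `MainCore`, `run` and `has_mic_access` are all names I made up for illustration.

```python
from collections import deque


class LowPowerCore:
    """Hypothetical always-on core: small ring buffer plus a cheap, permissive check."""

    def __init__(self, phrase, buffer_size=8):
        self.phrase = phrase
        self.buffer = deque(maxlen=buffer_size)  # oldest entries fall off the end

    def feed(self, word):
        self.buffer.append(word)
        recent = list(self.buffer)[-len(self.phrase):]
        # Cheap check (an assumption): compare only the first letter of each word.
        return [w[:1] for w in recent] == [w[:1] for w in self.phrase]


class MainCore:
    """Hypothetical main core: woken on a trigger, then runs the precise check."""

    def __init__(self, phrase, has_mic_access=True):
        self.phrase = phrase
        self.has_mic_access = has_mic_access

    def confirm(self, buffer):
        if not self.has_mic_access:
            # The core was already woken; only the verification step fails.
            raise PermissionError("no microphone access")
        return list(buffer)[-len(self.phrase):] == self.phrase


def run(words, phrase, has_mic_access=True):
    low = LowPowerCore(phrase)
    main = MainCore(phrase, has_mic_access)
    events = []
    for word in words:
        if low.feed(word):  # stage 1: probable trigger, wake the main core
            try:
                confirmed = main.confirm(low.buffer)  # stage 2: precise check
                events.append("command" if confirmed else "false_trigger")
            except PermissionError:
                events.append("woken_but_denied")
    return events


phrase = ["okay", "google"]
print(run("please okay google play music".split(), phrase))
print(run("oh gosh".split(), phrase))
print(run("okay google".split(), phrase, has_mic_access=False))
```

The third call is the case described above: the wake-up still happens, and only the second-stage check fails.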
There's probably an interesting side channel here: assuming the low-power core doesn't hardcode 'Okay Google', a malicious version could rapidly program different activation phrases and watch which ones trigger a wake-up, inferring with reasonably high probability whether specific things were said.
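
A rough simulation of that side channel, purely as a thought experiment. I'm assuming the attacker can reprogram the phrase faster than people speak (so every candidate gets tested against every word) and that the only thing it observes is whether a wake-up fired. `coarse_match` and `probe` are hypothetical names, and the first-letter match is a stand-in for whatever fuzzy detector the low-power core really runs.

```python
def coarse_match(recent_words, phrase):
    """Hypothetical stand-in for the low-power detector: first-letter match only."""
    tail = recent_words[-len(phrase):]
    return len(tail) == len(phrase) and all(
        heard[:1] == wanted[:1] for heard, wanted in zip(tail, phrase)
    )


def probe(stream, candidates):
    """Count wake-ups per candidate phrase; the attacker never sees the words themselves."""
    hits = {tuple(c): 0 for c in candidates}
    heard = []
    for word in stream:
        heard.append(word)
        for candidate in candidates:
            # Assumption: reprogramming is fast enough to test each candidate here.
            if coarse_match(heard, candidate):
                hits[tuple(candidate)] += 1
    return hits


candidates = [["my", "password"], ["credit", "card"]]
print(probe("my password is hunter".split(), candidates))
# The cheap detector also fires on lookalikes, so a hit is only probabilistic:
print(probe("more pizza".split(), [["my", "password"]]))
```

The second call shows why this only gives a probability: the coarse detector can't tell "more pizza" from "my password", so a wake-up is evidence rather than proof.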