they're effectively the same thing; one just has more tunability.
a basic dot-product classifier can serve as a classification head: just set each output class's weight vector to that class's embedding.
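a minimal sketch of that idea in numpy (embeddings and names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_classes = 8, 3

# hypothetical class embeddings, one per label, L2-normalized
class_embeds = rng.normal(size=(n_classes, dim))
class_embeds /= np.linalg.norm(class_embeds, axis=1, keepdims=True)

def dot_product_head(x):
    # logits are just dot products against each class embedding,
    # i.e. a linear layer whose weight rows ARE the embeddings
    return x @ class_embeds.T

# an input identical to class 1's (unit-norm) embedding scores
# 1.0 on class 1 and strictly less on the others (Cauchy-Schwarz)
x = class_embeds[1]
logits = dot_product_head(x)
assert logits.argmax() == 1
```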
I think a gated MLP would work better though, and it would allow a lower learning rate on the main weights, which could reduce OOD shift. Compared to a direct linear head it adds some nonlinearity and a higher-dimensional intermediate representation.
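one way that head could look, sketched in numpy (the weight shapes and gating choice are my assumptions, not a fixed recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, n_classes = 8, 32, 3  # hidden > dim: wider intermediate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical head weights, small init
W_up = rng.normal(size=(dim, hidden)) * 0.1
W_gate = rng.normal(size=(dim, hidden)) * 0.1
W_out = rng.normal(size=(hidden, n_classes)) * 0.1

def gated_mlp_head(x):
    # up-projection elementwise-modulated by a sigmoid gate,
    # then projected down to class logits
    h = (x @ W_up) * sigmoid(x @ W_gate)
    return h @ W_out

x = rng.normal(size=(dim,))
logits = gated_mlp_head(x)
assert logits.shape == (n_classes,)
```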
Could also intentionally bottleneck training by using LoRA instead of full fine-tuning on the base model (though that's usually not worth it when these models are <1B params)
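for reference, the bottleneck LoRA imposes is just a low-rank additive update on a frozen weight; a toy numpy sketch (dimensions, rank, and scaling are arbitrary picks):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 16, 16, 4, 8

W = rng.normal(size=(d_in, d_out))  # frozen base weight
# trainable low-rank factors; B starts at zero so the
# effective weight initially equals the base weight
A = rng.normal(size=(d_in, rank)) * 0.01
B = np.zeros((rank, d_out))

def lora_forward(x):
    # full update would train all d_in * d_out entries of W;
    # LoRA trains only rank * (d_in + d_out) entries of A and B
    return x @ W + (x @ A @ B) * (alpha / rank)

x = rng.normal(size=(d_in,))
# with B == 0 the adapter is a no-op
assert np.allclose(lora_forward(x), x @ W)
```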
