they're effectively the same thing; one just has more tunability.
a basic dot-product classifier can serve as a classification head: just set each output class's weight vector to that class's embedding.
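a minimal sketch of that idea in numpy (embeddings and names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_classes = 8, 3

# hypothetical class embeddings, one per label, L2-normalized
class_embeds = rng.normal(size=(n_classes, dim))
class_embeds /= np.linalg.norm(class_embeds, axis=1, keepdims=True)

def dot_product_head(x):
    # logits are just dot products against each class embedding,
    # i.e. a linear layer whose weight rows ARE the embeddings
    return x @ class_embeds.T

# an input identical to class 1's (unit-norm) embedding scores
# 1.0 on class 1 and strictly less on the others (Cauchy-Schwarz)
x = class_embeds[1]
logits = dot_product_head(x)
assert logits.argmax() == 1
```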
I think a gated MLP would work better though, and it would allow a lower learning rate on the main weights, which could reduce OOD shift. Compared to a direct linear head it adds some nonlinearity and a higher-dimensional intermediate representation.
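one way that head could look, sketched in numpy (the weight shapes and gating choice are my assumptions, not a fixed recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, n_classes = 8, 32, 3  # hidden > dim: wider intermediate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical head weights, small init
W_up = rng.normal(size=(dim, hidden)) * 0.1
W_gate = rng.normal(size=(dim, hidden)) * 0.1
W_out = rng.normal(size=(hidden, n_classes)) * 0.1

def gated_mlp_head(x):
    # up-projection elementwise-modulated by a sigmoid gate,
    # then projected down to class logits
    h = (x @ W_up) * sigmoid(x @ W_gate)
    return h @ W_out

x = rng.normal(size=(dim,))
logits = gated_mlp_head(x)
assert logits.shape == (n_classes,)
```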
Could also intentionally bottleneck training by using LoRA instead of full fine-tuning on the base model (though that's usually not worth it when these models are <1B params)
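for reference, the bottleneck LoRA imposes is just a low-rank additive update on a frozen weight; a toy numpy sketch (dimensions, rank, and scaling are arbitrary picks):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 16, 16, 4, 8

W = rng.normal(size=(d_in, d_out))  # frozen base weight
# trainable low-rank factors; B starts at zero so the
# effective weight initially equals the base weight
A = rng.normal(size=(d_in, rank)) * 0.01
B = np.zeros((rank, d_out))

def lora_forward(x):
    # full update would train all d_in * d_out entries of W;
    # LoRA trains only rank * (d_in + d_out) entries of A and B
    return x @ W + (x @ A @ B) * (alpha / rank)

x = rng.normal(size=(d_in,))
# with B == 0 the adapter is a no-op
assert np.allclose(lora_forward(x), x @ W)
```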
