2026-04-23 20:50:16 UTC
semisol on Nostr:

they are effectively the same thing, one just has more tunability.

a basic dot-product classifier can be expressed as a classification head just by setting each output's weights to the class embedding.
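for illustration, a minimal numpy sketch of that equivalence (all names and sizes are made up): a bias-free linear head whose weight rows are the class embeddings produces exactly the dot-product similarity scores.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_classes = 8, 3

# Hypothetical class embeddings, one row per output class.
class_embeds = rng.normal(size=(n_classes, dim))

# A bias-free linear classification head whose weight matrix
# is set row-by-row to the class embeddings...
def head_logits(x, weights):
    return x @ weights.T

x = rng.normal(size=(dim,))
logits = head_logits(x, class_embeds)

# ...gives exactly the dot product of the input with each embedding.
dot_scores = np.array([x @ e for e in class_embeds])
assert np.allclose(logits, dot_scores)
```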

I think a gated MLP head may work better though, and would allow a lower LR on the main weights, which could reduce OOD shift. Compared to a direct linear head it adds some nonlinearity and a wider intermediate representation
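a minimal numpy sketch of what such a head could look like (SwiGLU-style gating is one assumption among several possible; dimensions and names are hypothetical): the gate branch supplies the nonlinearity, and the hidden size being larger than the input dim gives the wider intermediate representation.

```python
import numpy as np

rng = np.random.default_rng(1)
# hidden > dim: the wider intermediate representation.
dim, hidden, n_classes = 8, 32, 3

def silu(z):
    return z / (1.0 + np.exp(-z))

# Hypothetical gated-MLP head parameters (SwiGLU-style).
W_gate = rng.normal(size=(dim, hidden)) * 0.1  # gating branch
W_up   = rng.normal(size=(dim, hidden)) * 0.1  # up-projection branch
W_out  = rng.normal(size=(hidden, n_classes)) * 0.1

def gated_mlp_head(x):
    # Elementwise product of the nonlinear gate and the linear
    # up-projection, then project down to class logits.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_out

x = rng.normal(size=(dim,))
logits = gated_mlp_head(x)
assert logits.shape == (n_classes,)
```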

Could also intentionally bottleneck training by using LoRA instead of full FT on the base model (though usually not worth it considering these are <1B param models)
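for reference, the bottleneck in LoRA is just the rank of the weight update; a numpy sketch (sizes, scaling, and init conventions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
# rank << d_in is what bottlenecks the update.
d_out, d_in, rank = 16, 16, 4
alpha = 8.0

W = rng.normal(size=(d_out, d_in))        # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection, zero-init

# Effective weight: base plus a rank-limited, scaled update.
W_eff = W + (alpha / rank) * (B @ A)

# With B zero-initialized, training starts exactly at the base model,
# and the update can never exceed rank `rank`.
assert np.allclose(W_eff, W)
assert np.linalg.matrix_rank(B @ A) <= rank
```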