Join Nostr
2026-06-07 00:28:04 UTC
in reply to

semisol on Nostr: That is the result of RL, not the intention. If you encourage less tokens, less tool ...

That is the result of RL, not the intention.

If you encourage less tokens, less tool call rounds, and reward passing tests or similar, then that is what happens

The easiest way gets the reward first and the model rapidly converges to that