That is the result of RL, not the intention. If you encourage less tokens, less tool ...

2026-06-07 00:28:04 UTC

That is the result of RL, not the intention.

If you encourage less tokens, less tool call rounds, and reward passing tests or similar, then that is what happens

The easiest way gets the reward first and the model rapidly converges to that

Author Public Key

npub12262qa4uhw7u8gdwlgmntqtv7aye8vdcmvszkqwgs0zchel6mz7s6cgrkj

Show more details

semisol on Nostr: That is the result of RL, not the intention. If you encourage less tokens, less tool ...