I think that phrasing is ambiguous enough for anyone to read into it whatever they like.
It's likely referring to using reinforcement learning, rather than imitating some example data, to improve coherence. In particular, it probably means the step that went straight from the raw model to RL, with no intermediate adjustment to bring response strings more in line with human conversational expectations. (The raw model just computes conditional probabilities of strings based on internet data; we can sample from it, and talk to it by conditioning on a conversation prefix.)
I can see why the author would want to phrase it that way for a broader audience: the optimization phase has much less hand-holding than is typical. It is kind of a misleading way to put it, though.
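To make "talking to the raw model by conditioning on a conversation prefix" a bit more concrete, here is a toy sketch. It assumes a bigram counter as a stand-in for the real model, so it only conditions on the last token rather than the whole prefix. The corpus and the names `train_bigram`, `next_token_probs`, and `continue_prefix` are made up for illustration and have nothing to do with how the actual system is built.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count next-token frequencies per token (toy stand-in for a base model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Return P(next | prev) as a dict; empty if prev never had a successor."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts[prev].items()}


def continue_prefix(counts, prefix, max_new_tokens, rng):
    """Sample a continuation of `prefix`, one token at a time."""
    out = []
    prev = prefix[-1]  # a bigram model only sees the last token of the prefix
    for _ in range(max_new_tokens):
        probs = next_token_probs(counts, prev)
        if not probs:
            break  # nothing was ever observed after this token
        prev = rng.choices(list(probs), weights=list(probs.values()))[0]
        out.append(prev)
    return out


# Hypothetical "internet data": a tiny transcript-shaped corpus.
corpus = "User: hi Bot: hello User: bye Bot: goodbye".split()
model = train_bigram(corpus)

# "Talking" to the model: condition on a conversation prefix and sample.
reply = continue_prefix(model, "User: hi Bot:".split(), 3, random.Random(0))
print(reply)
```

Nothing here was trained to be a good conversational partner. The model just continues text that happens to look like a conversation, which is the starting point that RL is then applied to in the reading above.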