2026-04-01 02:17:34 UTC

Viss on Nostr:

so i've been working on a talk that i'm calling "claude is your insider threat now" and it initially began with Anthropic's "china paper" they released last year surrounding the use of llms to do bad guy stuff. I ended up talking about it at great length with Tony from Versprite, and even ended up on his podcast about it - the big discovery there was "claude lying about running a tool, and claude lying about tool output"

turns out that shit is hardcoded

https://neuromatch.social/@jonny/116326861737478342
OK i can't focus on work and keep looking at this repo.

So after every "subagent" runs, claude code creates *another* "agent" to check on whether the first "agent" did the thing it was supposed to. I don't know about you, but i smell a bit of a problem: if you can't trust whether one "agent" with a very big fancy model did something, how in the fuck are you supposed to trust another "agent" running on the smallest crappiest model?
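
here's roughly what that pattern looks like if you sketch it yourself - to be clear, this is my own guess at the shape of it in TypeScript, none of these names, types, or model strings are from the actual claude code source:

```typescript
// Hypothetical sketch of the pattern described above: after a subagent
// finishes, a second, cheaper model is asked to judge whether the first
// one actually did the task.
type AgentResult = { transcript: string; claimedDone: boolean };

async function runWithVerification(
  task: string,
  runAgent: (prompt: string, model: string) => Promise<AgentResult>,
): Promise<AgentResult> {
  // First agent: the big model that does the actual work.
  const result = await runAgent(task, "big-fancy-model");

  // Second agent: a small model prompted to grade the first one's output.
  const verdict = await runAgent(
    `Did the following transcript complete the task "${task}"? Answer yes or no.\n\n${result.transcript}`,
    "smallest-crappiest-model",
  );

  // The verdict is itself just more model output, so trusting it has the
  // same problem as trusting the original agent did.
  if (!verdict.transcript.toLowerCase().includes("yes")) {
    result.claimedDone = false;
  }
  return result;
}
```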

That's not the funny part, that's obvious and fundamental to the entire show here. HOWEVER, RECALL [the above JSON Schema Verification thing](https://neuromatch.social/@jonny/116325123136895805) that is unconditionally added onto the end of every round of LLM calls. the mechanism for adding that hook is... JUST FUCKING ASKING THE MODEL TO CALL THAT TOOL. the second pic is registering a hook s.t. "after some stop state happens, if there isn't a message indicating that we have successfully called the JSON validation thing, prompt the model saying 'you must call the json validation thing'"
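
if you sketched that hook yourself it'd be something like this - again, totally hypothetical names, just illustrating the mechanism described above, not the repo's actual code:

```typescript
// Rough sketch of the stop-hook behavior: when the model stops, scan the
// message history for a successful call to the structured-output tool,
// and if it never happened, push a plain-text reminder back at the model.
type Message = {
  role: "user" | "assistant" | "tool";
  toolName?: string;
  content: string;
};

function onStop(messages: Message[], structuredOutputTool: string): Message[] {
  const calledIt = messages.some(
    (m) => m.role === "tool" && m.toolName === structuredOutputTool,
  );
  if (!calledIt) {
    // Nothing here invokes the validation code directly; the "enforcement"
    // is literally another prompt asking the model to please call the tool.
    messages.push({
      role: "user",
      content: `You must call the ${structuredOutputTool} tool before finishing.`,
    });
  }
  return messages;
}
```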

this shit sucks so bad they can't even ***CALL THEIR OWN CODE FROM INSIDE THEIR OWN CODE.***

Look at the comment on pic 3 - "e.g. agent finished without calling structured output tool" - that's common enough that they have a whole goddamn error category for it, and the way it's handled is by just pretending the job was cancelled and nothing happened.
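
my guess at what that handling amounts to, purely illustrative (the error names here are made up, not lifted from the repo):

```typescript
// Guesswork reconstruction of the error handling the comment describes:
// "agent finished without calling structured output tool" gets its own
// error category, which then gets surfaced as if the job was cancelled.
enum SubagentError {
  Cancelled = "cancelled",
  NoStructuredOutput = "no_structured_output", // e.g. agent finished without calling structured output tool
}

function resolveSubagentFailure(err: SubagentError): { status: string } {
  if (err === SubagentError.NoStructuredOutput) {
    // Handled by pretending the job was cancelled and nothing happened.
    return { status: SubagentError.Cancelled };
  }
  return { status: err };
}
```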