Tim Kellogg on Nostr:
i’m very excited about the interpretability work that #anthropic has been doing with #LLMs.
in this paper, they used classical machine learning algorithms to discover concepts. if a concept like “golden gate bridge” is present in the text, then they discover the associated pattern of neuron activations.
this means that you can monitor LLM responses for concepts and behaviors, like “illicit behavior” or “fart jokes”
https://www.anthropic.com/research/mapping-mind-language-model
Published at 2024-05-24 12:34:05
Event JSON
{
"id": "5521a52d2bd09bfec5089dd4320ece305e2a63a39b5e4bf1acfa710b687c07eb",
"pubkey": "ad159d25c6d90f397ab2c21dca6492cb42079f31b8d80c9970d17c80802bd8a3",
"created_at": 1716554045,
"kind": 1,
"tags": [
[
"t",
"anthropic"
],
[
"t",
"LLMs"
],
[
"proxy",
"https://hachyderm.io/users/kellogh/statuses/112496085952640097",
"activitypub"
]
],
"content": "i’m very excited about the interpretability work that #anthropic has been doing with #LLMs. \n\nin this paper, they used classical machine learning algorithms to discover concepts. if a concept like “golden gate bridge” is present in the text, then they discover the associated pattern of neuron activations.\n\nthis means that you can monitor LLM responses for concepts and behaviors, like “illicit behavior” or “fart jokes”\n\nhttps://www.anthropic.com/research/mapping-mind-language-model",
"sig": "f5f651e4d9a58dcdce89e35f2bc0906e34c1f672f441b0ab1ee2384eb0246367c120fe9f7f34b02cc4f0c7deca739b231df19dcb6464aaa929b990c31f4a7220"
}
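
The monitoring idea in the post can be sketched roughly as follows. This is a toy illustration, not Anthropic's method or code: it assumes each concept has already been reduced to a direction in activation space (the linked research describes learning such features with dictionary learning), and it flags a concept when the projection of a response's activations onto that direction crosses a threshold. The feature vectors, the 4-dimensional activations, the threshold, and the names `feature_activation` and `detect_concepts` are all made up for the example.

```python
from typing import Dict, List

Vector = List[float]


def feature_activation(activations: Vector, direction: Vector, bias: float = 0.0) -> float:
    """Project activations onto a feature direction and clip negatives to zero."""
    score = sum(a * d for a, d in zip(activations, direction)) + bias
    return max(0.0, score)


def detect_concepts(
    activations: Vector,
    features: Dict[str, Vector],
    threshold: float = 1.0,
) -> Dict[str, float]:
    """Return every concept whose feature activation reaches the threshold."""
    hits: Dict[str, float] = {}
    for name, direction in features.items():
        strength = feature_activation(activations, direction)
        if strength >= threshold:
            hits[name] = strength
    return hits


# Hypothetical feature directions; real ones would be learned from a model.
FEATURES: Dict[str, Vector] = {
    "golden gate bridge": [1.0, 0.0, 0.0, 0.0],
    "fart jokes": [0.0, 1.0, 0.0, 0.0],
}

# Hypothetical activations captured while the model produced one token.
token_activations: Vector = [2.5, 0.1, -0.3, 0.4]

print(detect_concepts(token_activations, FEATURES))
```

In a real monitor, the activations would come from a specific layer of the model and the threshold would have to be tuned per feature; both are left as assumptions here.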