INDEX
Explanations
phrases indicating falsehood or deception
Negative Logits
luk       -0.16
mlin      -0.15
CHandle   -0.15
opic      -0.15
輪        -0.15
é¡Į       -0.14
_Null     -0.14
avor      -0.14
etz       -0.14
alnız     -0.14
Positive Logits
lie       0.97
lies      0.94
lying     0.87
Lie       0.82
Lies      0.78
Lie       0.74
lie       0.70
lied      0.66
lies      0.64
liar      0.64
Activations Density 0.101%
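
The density figure above can be read as the share of tokens on which the feature fires. The page does not say how it is computed, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    is active. The dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 tokens, 10 of which activate the feature.
sample = [0.0] * 9990 + [1.5] * 10
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.101% would mean the feature is active on roughly one token in a thousand, which fits a feature that responds to a narrow family of words.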