INDEX
Explanations
expressions of self-doubt or uncertainty
Negative Logits
apur          -0.16
asher         -0.15
ValuePair     -0.15
Ñģем          -0.15
deen          -0.15
bilt          -0.14
Clock         -0.14
.HTML         -0.13
Runner        -0.13
ÐĴÑĸд         -0.13
Positive Logits
correct       0.36
correctness   0.31
wrong         0.30
Correct       0.30
incorrect     0.30
Correct       0.29
correct       0.28
Wrong         0.25
WRONG         0.24
_correct      0.24
Activations Density: 0.152%