Explanations
themes of neutrality and balance in discourse
Negative Logits
Token      Logit
Bounding   -0.17
âĨĵ        -0.16
rete       -0.15
shortcut   -0.14
coni       -0.14
iena       -0.14
tail       -0.14
ottage     -0.14
oard       -0.14
mise       -0.14
Positive Logits
Token        Logit
neutral      0.55
neutrality   0.54
Neutral      0.51
neutral      0.49
-neutral     0.47
Neutral      0.46
impartial    0.41
neutr        0.34
neither      0.28
balanced     0.28
Activations Density: 0.197%