INDEX
Explanations
judgments regarding moral and ethical standards related to exploitation and human rights issues
Negative Logits
дописавши        -0.94
AndEndTag        -0.94
ModelExpression  -0.88
Wicidata         -0.87
kháu             -0.80
Попис            -0.77
سكانية           -0.76
<<<<<<<<<<<<<<   -0.75
+#+#             -0.71
wieś             -0.68
Positive Logits
unacceptable     0.68
👎               0.57
violates         0.57
outright         0.57
harmful          0.56
不应该           0.55
intolerable      0.54
violation        0.54
downright        0.54
❌               0.52
Activations Density: 0.391%