INDEX
    Explanations

    phrases indicating falsehood or deception

    Negative Logits
    Token        Logit
    "luk"        -0.16
    "mlin"       -0.15
    "CHandle"    -0.15
    "opic"       -0.15
    "輪"         -0.15
    "é¡Į"        -0.14
    "_Null"      -0.14
    "avor"       -0.14
    "etz"        -0.14
    "alnız"      -0.14
    Positive Logits
    Token        Logit
    " lie"       0.97
    " lies"      0.94
    " lying"     0.87
    " Lie"       0.82
    " Lies"      0.78
    "Lie"        0.74
    "lie"        0.70
    " lied"      0.66
    "lies"       0.64
    " liar"      0.64
    Activation Density: 0.101%

    No Known Activations