How is the Vader 'compound' polarity score calculated in Python NLTK?
Question
I'm using the Vader SentimentAnalyzer to obtain the polarity scores. I used the probability scores for positive/negative/neutral before, but I just realized the "compound" score, ranging from -1 (most negative) to 1 (most positive), would provide a single measure of polarity. I wonder how the "compound" score is computed. Is it calculated from the [pos, neu, neg] vector?
Answer
The VADER algorithm outputs sentiment scores for 4 classes of sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L441):
- neg: Negative
- neu: Neutral
- pos: Positive
- compound: Compound (i.e. aggregated score)
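As a quick illustration, here is a minimal usage sketch for obtaining all four scores through NLTK (it assumes the vader_lexicon resource is available; the example sentence and scores come from the VADER documentation and may vary slightly across lexicon versions):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# One-time download of the VADER lexicon resource.
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("VADER is smart, handsome, and funny.")
print(scores)
# Expected shape (exact values depend on the lexicon version):
# {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}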
Let's walk through the code. The first instance of compound is at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L421, where it computes:
compound = normalize(sum_s)
The normalize() function is defined at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L107:
import math

def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score / math.sqrt((score * score) + alpha)
    return norm_score
So there is a hyperparameter alpha.
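To get a feel for what alpha does, here is a small sketch (my own, not from the NLTK source) that evaluates normalize() on a few raw valence sums; with the default alpha=15, the output saturates toward 1 as the raw sum grows:

import math

def normalize(score, alpha=15):
    # Same formula as NLTK's normalize() above.
    return score / math.sqrt((score * score) + alpha)

for s in (1, 2, 4, 8, 16, 32):
    print(s, round(normalize(s), 4))
# The output approaches 1.0 as the raw score grows, e.g.
# normalize(4) ~= 0.718 and normalize(32) ~= 0.993.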
As for sum_s, it is the sum of the sentiment arguments passed to the score_valence() function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L413).
And if we trace back this sentiment argument, we see that it's computed when calling the polarity_scores() function at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L217:
def polarity_scores(self, text):
    """
    Return a float for sentiment strength based on the input text.
    Positive values are positive valence, negative value are negative
    valence.
    """
    sentitext = SentiText(text)
    #text, words_and_emoticons, is_cap_diff = self.preprocess(text)

    sentiments = []
    words_and_emoticons = sentitext.words_and_emoticons
    for item in words_and_emoticons:
        valence = 0
        i = words_and_emoticons.index(item)
        if (i < len(words_and_emoticons) - 1 and item.lower() == "kind" and \
            words_and_emoticons[i+1].lower() == "of") or \
            item.lower() in BOOSTER_DICT:
            sentiments.append(valence)
            continue
        sentiments = self.sentiment_valence(valence, sentitext, item, i, sentiments)

    sentiments = self._but_check(words_and_emoticons, sentiments)
Looking at the polarity_scores function, what it does is iterate through the whole SentiText lexicon and check with the rule-based sentiment_valence() function to assign a valence score to the sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L243); see Section 2.1.1 of http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
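To see the raw lexicon valences that sentiment_valence() starts from, you can inspect the analyzer's lexicon attribute directly (the attribute name matches the NLTK source; the exact valence values depend on the vader_lexicon.txt shipped with your install):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# Raw word-level valences from vader_lexicon.txt; heuristics such as
# boosters, negation and punctuation then adjust these values.
for word in ("good", "great", "horrible"):
    print(word, sia.lexicon.get(word))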
So going back to the compound score, we see that:

- the compound score is a normalized score of sum_s, and
- sum_s is the sum of the valences computed based on some heuristics and a sentiment lexicon (a.k.a. Sentiment Intensity), and
- the normalized score is simply sum_s divided by the square root of its square plus an alpha parameter that increases the denominator of the normalization function.
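Since normalize() is invertible for -1 < compound < 1, you can recover the raw valence sum from a reported compound score. This is a sketch derived by rearranging the formula above, not a function exposed by NLTK:

import math

def denormalize(compound, alpha=15):
    # Invert compound = s / sqrt(s*s + alpha) to recover s;
    # only valid for -1 < compound < 1.
    return compound * math.sqrt(alpha / (1 - compound * compound))

# e.g. a compound score of 0.8316 corresponds to a raw valence
# sum of roughly 5.8 under the default alpha of 15.
print(round(denormalize(0.8316), 2))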
Is it calculated from the [pos, neu, neg] vector?

Not really =)
If we take a look at the score_valence function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L411), we see that the compound score is computed from sum_s before the pos, neg and neu scores are computed using _sift_sentiment_scores(), which computes the individual pos, neg and neu scores from the raw scores of sentiment_valence() without the sum.
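In other words, pos, neu and neg are proportions of the token valences and should sum to 1 (up to rounding), while compound is normalized from sum_s independently. A quick sanity-check sketch (the example sentence is mine):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The movie was great, but the ending was terrible.")
# pos/neu/neg are proportions and sum to ~1 (up to rounding);
# compound is a separate normalization of sum_s.
print(scores)
print(round(scores['pos'] + scores['neu'] + scores['neg'], 2))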
If we take a look at this alpha mathemagic, the output of the normalization seems rather unstable (if left unconstrained), depending on the value of alpha:
(Plots of the normalization curve for alpha = 0, alpha = 15, alpha = 50000 and alpha = 0.001 were shown here.)
It gets funky when alpha is negative:
(Plots for alpha = -10, alpha = -1,000,000 and alpha = -1,000,000,000 were shown here.)