Estimating document polarity using R's qdap package without sentSplit


Question

I'd like to apply qdap's polarity function to a vector of documents, each of which may contain multiple sentences, and obtain one polarity score per document. For example:

library(qdap)
polarity(DATA$state)$all$polarity
# Results:
 [1] -0.8165 -0.4082  0.0000 -0.8944  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000
Warning message:
In polarity(DATA$state) :
  Some rows contain double punctuation.  Suggested use of `sentSplit` function.

This warning can't be ignored: polarity seems to add up the polarity scores of the individual sentences in each document, which can produce document-level polarity scores outside the [-1, 1] bounds.

I'm aware of the option to run sentSplit first and then average across the sentences, perhaps weighting polarity by word count. But this is (1) inefficient (it takes roughly 4x as long as running polarity on the full documents with the warning), and (2) it is unclear how the sentences should be weighted. This option would look something like this:

library(data.table) # For aggregation
DATA$id <- seq_len(nrow(DATA)) # For identifying and aggregating documents
sentences <- sentSplit(DATA, "state")
pol.dt <- data.table(polarity(sentences$state)$all)
pol.dt[, id := sentences$id]
# Word-count-weighted mean polarity per document
document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), by = "id"]
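
The word-count weighting in the last line can be checked on toy numbers without qdap. The sketch below uses base R only. The `sent` data frame and its values are made up for illustration and merely stand in for the `id`, `polarity`, and `wc` columns used above.

```r
# Hypothetical sentence-level scores: two documents, three sentences in total.
sent <- data.frame(
  id       = c(1, 1, 2),
  polarity = c(0.5, -0.5, 0.4),
  wc       = c(4, 2, 5)
)

# Word-count-weighted mean polarity for one document's sentences.
weighted_polarity <- function(d) sum(d$polarity * d$wc) / sum(d$wc)

# Apply it per document id.
doc_polarity <- sapply(split(sent, sent$id), weighted_polarity)
doc_polarity
```

Because this is a weighted mean with non-negative weights, the document score cannot fall outside the range of its sentence scores.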

I was hoping I could run polarity on a version of the vector with the periods removed, but sentSplit seems to do more than that. Removing periods works on DATA but not on other sets of text (I'm unsure of the full set of sentence breaks other than periods).

So I suspect the best approach is to make each element of the document vector look like one long sentence. How would I do this, or is there another way?

Answer

Max found a bug in this version of qdap (1.3.4): a placeholder was counted as a word, which affects the equation because the denominator is sqrt(n), where n is the word count. This was corrected in 1.3.5, which is why the two outputs did not match.
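
To illustrate why one extra counted word matters, here is a small base-R sketch of a score of the form (summed word weights) / sqrt(n). The function name `polarity_score` and the decomposition "summed weight of -2 over 4 words" are assumptions chosen for illustration, not values read from qdap's internals.

```r
# Score of the form described above: summed word weights over sqrt(word count).
polarity_score <- function(weight_sum, n_words) weight_sum / sqrt(n_words)

# Assumed decomposition for illustration: summed weight -2 over 4 real words.
buggy <- polarity_score(-2, 4 + 1)  # placeholder counted as a fifth word
fixed <- polarity_score(-2, 4)      # placeholder excluded from the count

round(c(buggy = buggy, fixed = fixed), 4)
```

Under these assumptions the two values come out as -0.8944 and -1, which is the same pair seen for the fourth document in the question's output and in the corrected output.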

Here is the output:

library(qdap)
counts(polarity(DATA$state))[, "polarity"]

## > counts(polarity(DATA$state))[, "polarity"]
##  [1] -0.8164966 -0.4472136  0.0000000 -1.0000000  0.0000000  0.0000000  0.0000000
##  [8] -0.5773503  0.0000000  0.4082483  0.0000000
## Warning message:
## In polarity(DATA$state) : 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.

In this case, using strip does not matter. It may matter in more complex situations involving amplifiers, negators, negatives, and commas. Here is an example:

## > counts(polarity("Really, I hate it"))[, "polarity"]
## [1] -0.5
## > counts(polarity(strip("Really, I hate it")))[, "polarity"]
## [1] -0.9
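
To see what removing punctuation does to the text before it is scored, here is a base-R sketch. It only approximates the idea of turning each document into one long sentence. The helper name `to_one_sentence` is hypothetical, and it is not a reimplementation of qdap's strip, which has more options.

```r
# Base-R approximation only; not qdap's strip().
to_one_sentence <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)  # drop commas, periods, and other punctuation
  gsub("\\s+", " ", trimws(x))     # collapse repeated whitespace
}

cleaned <- to_one_sentence(c("Really, I hate it", "Computer is fun. Not too fun."))
cleaned
```

Note that this also removes apostrophes, so "it's" becomes "its". Whether that is acceptable depends on how the polarity dictionary treats such words.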

See the documentation for more.
