R Text Mining with quanteda

Question

I have a data set of Facebook posts (exported via Netvizz) and I am using the quanteda package in R. Here is my R code.

# Load the quanteda package
library(quanteda)

# Load the relevant dictionary (relevant for the analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")

# Read file
# Facebook posts can be exported with FB Netvizz
# https://apps.facebook.com/netvizz
# Load the FB posts as a .csv file extracted from the .zip file
fbpost <- read.csv("D:/FB-com.csv", sep = ";")

# Define the relevant column (one column with 2700 entries)
fb_test <- as.character(fbpost$comment_message)
# Define as corpus
fb_corp <- corpus(fb_test)
class(fb_corp)

# LIWC application
fb_liwc <- dfm(fb_corp, dictionary = liwcdict)
View(fb_liwc)

Everything works fine until:

> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
   ... indexing 2,760 documents
   ... tokenizing texts, found 77,923 total tokens
   ... cleaning the tokens, 1584 removed entirely
   ... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1",  : 
  invalid 'dimnames' given for data frame

How would you interpret the error message? Are there any suggestions to solve the problem?

Answer

There was a bug in quanteda version 0.7.2 that caused dfm() to fail when using a dictionary if one of the documents contained no features. Your example fails because, in the cleaning stage, some of the Facebook post "documents" end up having all of their features removed.
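As a quick diagnostic, you can check which posts lose every token during cleaning. A minimal sketch, assuming a current quanteda release (where tokens() and ntoken() are available; the 0.7.x API differed) and the fb_corp corpus from the question:

library(quanteda)

# Tokenize with cleaning options similar to what dfm() applies,
# then count how many tokens each document has left
toks <- tokens(fb_corp, remove_punct = TRUE, remove_numbers = TRUE)
empty_docs <- which(ntoken(toks) == 0)

# Indices of the Facebook posts that end up with no features
empty_docs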

This is not only fixed in 0.8.0, but we also changed the underlying implementation of dictionaries in dfm(), resulting in a significant speed improvement. (The LIWC is still a large and complicated dictionary, and its regular expressions still mean that it is much slower to use than simply indexing tokens. We will work on optimising this further.)

# Install the development version containing the fix
devtools::install_github("kbenoit/quanteda")
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
##    ... indexing 57 documents
##    ... lowercasing
##    ... tokenizing
##    ... shaping tokens into data.table, found 134,024 total tokens
##    ... applying a dictionary consisting of 68 key entries
##    ... summing dictionary-matched features by document
##    ... indexing 68 feature types
##    ... building sparse matrix
##    ... created a 57 x 68 sparse dfm
##    ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
## Fillers   Nonfl   Swear      TV  Eating   Sleep   Groom   Death  Sports  Sexual 
##       0       0       0      42      47      49      53      76      81     100 

It will also work if a document contains zero features after tokenization and cleaning, which is probably what breaks the older dfm() you are using with your Facebook texts.

mytexts <- inaugTexts
mytexts[3] <- ""   # blank out one document so it has zero features
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm) == 0)
## 1797-Adams 
##          3 
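If you would rather exclude such empty documents from the analysis, a dfm can be subset like an ordinary sparse matrix; a minimal sketch, reusing the mydfm object from above:

# Keep only documents that matched at least one dictionary feature
mydfm_nonempty <- mydfm[rowSums(mydfm) > 0, ]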
