Computing cosine similarities on a large corpus in R using quanteda


Question

I am trying to work with a very large corpus of about 85,000 tweets that I want to compare to dialog from television commercials. However, due to the size of my corpus, I am unable to process the cosine similarity measure without getting the "Error: cannot allocate vector of size n" message (26 GB in my case).

I am already running 64-bit R on a server with lots of memory. I've also tried AWS on the instance with the most memory (244 GB), but to no avail (same error).

Is there a way to use a package like fread to get around this memory limitation, or do I just have to invent a way to break up my data? Thanks much for the help; I've appended the code below:

x <- NULL
y <- NULL
num <- NULL
z <- NULL
ad <- NULL
for (i in 1:nrow(ad.corp$documents)) {
  num <- i
  ad <- paste("ad.num", num, sep = "_")
  # select the i-th ad from the ad corpus ("yoad" in the original appears to be a typo for num)
  x <- subset(ad.corp, ad.corp$documents$num == num)
  z <- x + corp.all  # combine the single ad with the full tweet corpus
  z$documents$texts <- as.character(z$documents$texts)
  PolAdsDfm <- dfm(z, ignoredFeatures = stopwords("english"), groups = "num",
                   stem = TRUE, verbose = TRUE, removeTwitter = TRUE)
  PolAdsDfm <- tfidf(PolAdsDfm)
  y <- similarity(PolAdsDfm, ad, margin = "documents", n = 20,
                  method = "cosine", normalize = TRUE)
  y <- sort(y, decreasing = TRUE)
  if (y[1] > .7) {
    assign(paste(ad, x$documents$texts, sep = "--"), y)
  } else {
    print(paste(ad, "didn't make the cut", sep = "****"))
  }
}

Answer

The error was most likely caused by previous versions of quanteda (before 0.9.1-8, on GitHub as of 2016-01-01), which coerced dfm objects into dense matrices in order to call proxy::simil(). The newer version now works on sparse dfm objects without coercion for method = "correlation" and method = "cosine". (More sparse methods are coming soon.)

I can't really follow what you are doing in the code, but it looks like you are getting pairwise similarities between documents aggregated as groups. I would suggest the following workflow:

  1. Create your dfm with the groups option for all groups of texts you want to compare.

  2. Weight this dfm using tfidf().

  3. Use y <- textstat_simil(PolAdsDfm, margin = "documents", method = "cosine") and then coerce this to a full, symmetric matrix using as.matrix(y). All of your pairwise documents are then in that matrix, and you can select on the condition of being greater than your threshold of 0.7 directly from that object.
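The three steps above might be sketched roughly as follows, using the function names given in this answer (dfm(), tfidf(), textstat_simil()); note that ad.corp, corp.all, and the "num" docvar are the asker's objects and are assumed here, and that argument names have shifted across quanteda versions:

```r
library(quanteda)

# 1. One dfm over the combined corpus, grouped by the "num" docvar
#    (ad.corp + corp.all follows the asker's setup and is assumed here)
PolAdsDfm <- dfm(ad.corp + corp.all, groups = "num",
                 remove = stopwords("english"), stem = TRUE)

# 2. Weight the dfm *before* computing similarities
PolAdsDfm <- tfidf(PolAdsDfm)

# 3. Pairwise cosine similarities on the sparse dfm, then threshold
y <- textstat_simil(PolAdsDfm, margin = "documents", method = "cosine")
simmat <- as.matrix(y)                          # full symmetric matrix
diag(simmat) <- NA                              # ignore self-similarity
matches <- which(simmat > 0.7, arr.ind = TRUE)  # all pairs above the 0.7 cutoff
```

Because textstat_simil() operates on the sparse dfm directly, this avoids the dense-matrix allocation that caused the original "cannot allocate vector" error; only the final as.matrix() coercion is dense, and it is documents-by-documents rather than documents-by-features.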

Note that there is no need to normalise term frequencies with method = "cosine". In newer versions of quanteda, the normalize argument has been removed anyway, since I think it's better workflow practice to weight the dfm prior to any computation of similarities, rather than building weightings into textstat_simil().

Final note: I strongly suggest not accessing the internals of a corpus object using the method you have here, since those internals may change and then break your code. Use texts(z) instead of z$documents$texts, for instance, and docvars(ad.corp, "num") instead of ad.corp$documents$num.
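For instance, the accessor-based equivalents of the internal lookups in the question's code would look roughly like this (z and ad.corp are the asker's corpus objects, assumed here):

```r
# Accessor functions insulate your code from changes to corpus internals
txts <- texts(z)                 # instead of z$documents$texts
nums <- docvars(ad.corp, "num")  # instead of ad.corp$documents$num
n    <- ndoc(ad.corp)            # instead of nrow(ad.corp$documents)
```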
