Computing cosine similarities on a large corpus in R using quanteda
Question
I am trying to work with a very large corpus of about 85,000 tweets that I'm trying to compare to dialog from television commercials. However, due to the size of my corpus, I am unable to process the cosine similarity measure without getting the "Error: cannot allocate vector of size n" message (26 GB in my case).
I am already running R 64 bit on a server with lots of memory. I've also tried using AWS with the largest-memory instance available (244 GB), but to no avail (same error).
Is there a way to use a package like fread to get around this memory limitation, or do I just have to invent a way to break up my data? Thanks much for the help; I've appended the code below:
x <- NULL
y <- NULL
num <- NULL
z <- NULL
ad <- NULL
for (i in 1:nrow(ad.corp$documents)) {
  num <- i
  ad <- paste("ad.num", num, sep = "_")
  x <- subset(ad.corp, ad.corp$documents$num == num)  # original had "yoad", an undefined variable
  z <- x + corp.all
  z$documents$texts <- as.character(z$documents$texts)
  PolAdsDfm <- dfm(z, ignoredFeatures = stopwords("english"), groups = "num",
                   stem = TRUE, verbose = TRUE, removeTwitter = TRUE)
  PolAdsDfm <- tfidf(PolAdsDfm)
  y <- similarity(PolAdsDfm, ad, margin = "documents", n = 20,
                  method = "cosine", normalize = TRUE)
  y <- sort(y, decreasing = TRUE)
  if (y[1] > .7) {
    assign(paste(ad, x$documents$texts, sep = "--"), y)
  } else {
    print(paste(ad, "didn't make the cut", sep = "****"))
  }
}
Answer
The error was most likely caused by previous versions of quanteda (before 0.9.1-8, on GitHub as of 2016-01-01), which coerced the dfm object into a dense matrix in order to call proxy::simil(). The newer version now works on sparse dfm objects without coercion for method = "correlation" and method = "cosine". (More sparse methods are coming soon.)
I can't really follow what you are doing in the code, but it looks like you are getting pairwise similarities between documents aggregated as groups. I would suggest the following workflow:
1. Create your dfm with the groups option for all groups of texts you want to compare.
2. Weight this dfm using tfidf().
3. Use y <- textstat_simil(PolAdsDfm, margin = "documents", method = "cosine") and then coerce this to a full, symmetric matrix using as.matrix(y). All of your pairwise documents are then in that matrix, and you can select on the condition of being greater than your threshold of 0.7 directly from that object.
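The three steps above can be sketched as follows. This is a sketch, not a drop-in script: it reuses the ad.corp, corp.all, and "num" objects from the question, and it assumes a quanteda version new enough to have textstat_simil() (argument names such as ignoredFeatures vs. remove have shifted across versions):

```r
library(quanteda)

# Step 1: one combined corpus, one dfm grouped by the "num" docvar
z <- ad.corp + corp.all
PolAdsDfm <- dfm(z, ignoredFeatures = stopwords("english"),
                 stem = TRUE, groups = "num")

# Step 2: tf-idf weighting before any similarity computation
PolAdsDfm <- tfidf(PolAdsDfm)

# Step 3: cosine similarities on the sparse dfm (no dense coercion),
# then a full symmetric matrix of all pairwise values
y <- textstat_simil(PolAdsDfm, margin = "documents", method = "cosine")
ymat <- as.matrix(y)

# Select pairs above the 0.7 threshold, ignoring the diagonal
# (every document has similarity 1 with itself)
diag(ymat) <- NA
which(ymat > 0.7, arr.ind = TRUE)
```

This replaces the per-ad loop entirely: one dfm and one similarity call produce every pairwise comparison at once.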
Note that there is no need to normalise term frequencies with method = "cosine". In newer versions of quanteda, the normalize argument has been removed anyway, since I think it's a better workflow practice to weight the dfm prior to any computation of similarities, rather than building weightings into textstat_simil().
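The reason normalisation is redundant for cosine: scaling either document vector by a positive constant cancels out of the cosine formula, so raw counts and relative frequencies give the same result. A quick check in base R with toy vectors (no quanteda needed):

```r
# Cosine similarity of two term-frequency vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

doc1 <- c(3, 0, 2, 5)  # raw term counts for a toy document
doc2 <- c(1, 4, 0, 2)

# Normalising to relative frequencies (dividing by the row sum)
# leaves the cosine unchanged
raw  <- cosine(doc1, doc2)
norm <- cosine(doc1 / sum(doc1), doc2 / sum(doc2))

all.equal(raw, norm)  # TRUE
```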
Final note: I strongly suggest not accessing the internals of a corpus object using the method you have here, since those internals may change and then break your code. Use texts(z) instead of z$documents$texts, for instance, and docvars(ad.corp, "num") instead of ad.corp$documents$num.