R: tm Textmining package: Doc-Level metadata generation is slow
Problem description
I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files).
This for-loop works, but it is very slow. Performance seems to degrade as a function f ~ 1/n_docs.
for (i in seq(from = 1, to = length(corpus), by = 1)) {
  if (opts$options$verbose == TRUE || i %% 50 == 0) {
    print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " "))
  }
  DublinCore(corpus[[i]], "title") = csv[[i, 10]]
  DublinCore(corpus[[i]], "Publisher") = csv[[i, 16]]  # institutions
}
This may do something to the corpus variable, but I don't know what. When I put the same logic inside a tm_map() call (similar to the lapply() function), it runs much faster, but the changes do not persist:
i = 0
corpus = tm_map(corpus, function(x) {
  i <<- i + 1
  if (opts$options$verbose == TRUE) {
    print(paste(i, " ", substr(x, 1, 140), sep = " "))
  }
  meta(x, tag = "Heading") = csv[[i, 10]]
  meta(x, tag = "publisher") = csv[[i, 16]]
})
After tm_map returns, the variable corpus has empty metadata fields. They should be filled. I have a few other things to do with the collection.
The R documentation for the meta() function says this:
Examples:
data("crude")
meta(crude[[1]])
DublinCore(crude[[1]])
meta(crude[[1]], tag = "Topics")
meta(crude[[1]], tag = "Comment") <- "A short comment."
meta(crude[[1]], tag = "Topics") <- NULL
DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous"
DublinCore(crude[[1]], tag = "Format") <- "XML"
DublinCore(crude[[1]])
meta(crude[[1]])
meta(crude)
meta(crude, type = "corpus")
meta(crude, "labels") <- 21:40
meta(crude)
I tried many of these calls (with the variable "corpus" instead of "crude"), but they do not seem to work. Someone else seems to have had the same problem with a similar data set (a forum post from 2009, with no response).
Recommended answer
Here's a bit of benchmarking...
Using a for loop:
library(tm)
library(microbenchmark)
data("crude")
corpus <- crude  # sample corpus shipped with tm
expr.for <- function() {
  for (i in seq_along(corpus)) {
    DublinCore(corpus[[i]], "title") = LETTERS[round(runif(26))]  # dummy values
    DublinCore(corpus[[i]], "Publisher") = LETTERS[round(runif(26))]
  }
}
microbenchmark(expr.for())
# Unit: milliseconds
# expr min lq median uq max
# 1 expr.for() 21.50504 22.40111 23.56246 23.90446 70.12398
Using tm_map:
corpus <- crude
expr.map <- function() {
  tm_map(corpus, function(x) {
    meta(x, "title") = LETTERS[round(runif(26))]
    meta(x, "Publisher") = LETTERS[round(runif(26))]
    x
  })
}
microbenchmark(expr.map())
# Unit: milliseconds
# expr min lq median uq max
# 1 expr.map() 5.575842 5.700616 5.796284 5.886589 8.753482
So the tm_map version, as you noticed, seems to be about 4 times faster (median of roughly 5.8 ms versus 23.6 ms).
In your question you say that the changes in the tm_map version are not persistent. That is because you don't return x at the end of your anonymous function. The body should end with:
meta(x, tag = "Heading") = csv[[i,10]]
meta(x, tag = "publisher" ) = csv[[i,16]]
x
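To see why the trailing x matters, here is a minimal base-R sketch that does not need tm at all. It is only an analogy: attr<- stands in for tm's meta<-, lapply stands in for tm_map, and the names docs, titles, without_return and with_return are made up for illustration. In R, the value of an assignment (including a replacement call like this one) is the assigned value, so a function whose last expression is such an assignment returns that value rather than the modified object.

```r
# Base-R analogue (assumption: attr<- plays the role of meta<-, lapply of tm_map).
docs <- list("first document", "second document")
titles <- c("Title A", "Title B")

# The last expression is an assignment, so each call returns the assigned
# value (the title string) instead of the modified document.
without_return <- lapply(seq_along(docs), function(i) {
  x <- docs[[i]]
  attr(x, "title") <- titles[i]
})

# Ending with x hands the modified object back to lapply.
with_return <- lapply(seq_along(docs), function(i) {
  x <- docs[[i]]
  attr(x, "title") <- titles[i]
  x
})

print(without_return[[1]])              # the title, not the document
print(attr(with_return[[1]], "title"))  # the metadata survived
```

The same thing most likely happens inside your tm_map call: each document gets replaced by whatever the last meta(...) = ... line evaluated to, which is why the metadata looks empty afterwards.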