R - slowly working lapply with sort on ordered factor

Views: 99 Published: 2020/4/27 5:13:53 Tags: r text-mining lapply corpus term-document-matrix


Problem description

Based on the question "More efficient means of creating a corpus and DTM", I've prepared my own method for building a Term Document Matrix from a large corpus which (I hope) does not require Terms x Documents memory.

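library(tm)     # as.TermDocumentMatrix(), weightTf
library(slam)   # simple_triplet_matrix()
library(dplyr)  # %>%, group_by(), tally(), ungroup()

sparseTDM <- function(vc){
  # Collect document ids and raw text from the corpus
  id = unlist(lapply(vc, function(x){x$meta$id}))
  content = unlist(lapply(vc, function(x){x$content}))
  # Split each document into tokens on whitespace
  out = strsplit(content, "\\s", perl = TRUE)
  names(out) = id
  # Row (term) and column (document) labels of the matrix
  lev.terms = sort(unique(unlist(out)))
  lev.docs = id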

  # Term index of every token, computed per document (this is the slow step)
  v1 = lapply(
    out,
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = lev.terms
  )

  # Document index, repeated once per token in that document
  v2 = lapply(
    seq_along(v1),
    function(i, x, n){
      rep(i,length(x[[i]]))
    },
    x = v1,
    n = names(v1)
  )

  # Count occurrences of each (term, document) pair
  stm = data.frame(i = unlist(v1), j = unlist(v2)) %>%
    group_by(i, j) %>%
    tally() %>%
    ungroup()

  tmp = simple_triplet_matrix(
    i = stm$i,
    j = stm$j,
    v = stm$n,
    nrow = length(lev.terms),
    ncol = length(lev.docs),
    dimnames = list(Terms = lev.terms, Docs = lev.docs)
  )

  as.TermDocumentMatrix(tmp, weighting = weightTf)
}
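
To make the triplet representation used by the function concrete, here is a minimal base-R sketch on toy data. The index vectors and the names `pairs` and `dense` are illustrative assumptions, not part of the original code, and `aggregate()` stands in for the dplyr `group_by()`/`tally()` step.

```r
# Toy (term, document) index pairs: 3 terms, 2 documents (made-up data)
i <- c(2L, 1L, 2L, 3L, 1L)  # term index of each token
j <- c(1L, 1L, 1L, 2L, 2L)  # document index of each token

# Count how often each (i, j) pair occurs
pairs <- aggregate(n ~ i + j,
                   data = data.frame(i = i, j = j, n = 1L),
                   FUN = sum)

# Dense equivalent of the sparse triplets, for illustration only
dense <- matrix(0L, nrow = 3, ncol = 2)
dense[cbind(pairs$i, pairs$j)] <- pairs$n
```

Only the non-zero cells are stored as triplets, which is why the sparse form should need memory proportional to the number of distinct (term, document) pairs rather than Terms x Documents.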

It slows down at the calculation of v1. It had been running for 30 minutes when I stopped it.
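
One way to check where the time goes is to time `factor()` on the same short vector while only the number of levels grows. This is a sketch with made-up sizes; `time_factor` is a hypothetical helper, and timings vary by machine, so no expected output is shown.

```r
set.seed(1)
# One short "document": 80 tokens drawn from the first 1000 terms
x <- paste0("string", sample(1000, 80))

# Hypothetical helper: elapsed seconds for `reps` factor() calls
# against a level vector of length n_levels
time_factor <- function(n_levels, reps = 20) {
  lev <- paste0("string", seq_len(n_levels))
  system.time(
    for (k in seq_len(reps)) factor(x, levels = lev, ordered = TRUE)
  )[["elapsed"]]
}

# Same 80 tokens each time; only the number of levels changes
timings <- sapply(c(1e3, 1e4, 1e5), time_factor)
print(timings)
```

If the elapsed time grows with the number of levels even though `x` stays at 80 tokens, the per-document cost is driven by the size of the level vector rather than by the document itself.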

I've prepared a small example:

library(microbenchmark)

b = paste0("string", 1:200000)  # 200000 candidate levels
a = sample(b,80)                # one document of 80 tokens
microbenchmark(
  lapply(
    list(a=a),
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = b
  )
)

The results are:

Unit: milliseconds
expr      min       lq      mean   median       uq      max neval
...  25.80961 28.79981  31.59974 30.79836 33.02461 98.02512   100

id and content have 126522 elements and lev.terms has 155591 elements. At roughly 30 ms per document that is about an hour in total, so it looks like I stopped the processing too early. Since I'll ultimately be working on ~6M documents, I need to ask: is there any way to speed up this fragment of code?
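
One possible direction, given as an unverified sketch rather than a confirmed answer: look up all tokens with a single `match()` call instead of calling `factor()` once per document, so that the level vector is processed only once. The toy corpus below is made up, and `tokens`, `i`, and `j` are illustrative names.

```r
# Toy tokenized corpus: a named list of character vectors (made-up data)
out <- list(
  doc1 = c("b", "a", "b"),
  doc2 = c("c", "a"),
  doc3 = character(0)
)
lev.terms <- sort(unique(unlist(out)))

# Approach from the question: one factor() call per document
v1_slow <- lapply(out, function(x, lev) {
  sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
}, lev = lev.terms)

# Sketch: one match() over all tokens at once
tokens <- unlist(out, use.names = FALSE)
i <- match(tokens, lev.terms)           # term index of each token
j <- rep(seq_along(out), lengths(out))  # document index of each token
```

The vectors `i` and `j` hold the same (term, document) pairs as `unlist(v1)` and `unlist(v2)` in the original function, only in a different order within each document. Since the counting step groups by pair, the per-document `sort()` should not be needed.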
