Bigrams instead of single words in a term-document matrix using R and RWeka

Views: 28 | Published: 2021/9/6 19:03:45 | Tags: r, text, text-mining


Question

I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R

The idea goes like this:

library(tm)
library(RWeka)
data(crude)

# Tokenizer for bigrams, passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

However, the final line gives me the error:

Error in rep(seq_along(x), sapply(tflist, length)) : 
  invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

If I remove the tokenizer from the last line, it creates a regular TDM, so I guess the problem is somewhere in the BigramTokenizer function, although this is the same example that the tm FAQ gives here: http://tm.r-forge.r-project.org/faq.html#Bigrams.

Recommended answer

Inspired by Anthony's comment, I found out that you can specify the number of cores that the parallel package uses by default (set it before you call the NGramTokenizer):

# Sets the default number of threads to use
options(mc.cores=1)
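
Put together with the code from the question, the workaround might look like the sketch below. It assumes that tm and RWeka (which needs a working Java installation) are installed; whether the option is still required probably depends on the tm version and the platform, so treat it as an illustration rather than a guaranteed fix. Saving and restoring the previous option value is an addition for tidiness and is not part of the original answer.

```r
library(tm)
library(RWeka)
data(crude)

# Limit the default core count BEFORE the tokenizer is called;
# options() returns the previous values so they can be restored later
old <- options(mc.cores = 1)

# Tokenizer that emits two-word n-grams only
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Build the term-document matrix with bigrams as terms
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

# Look at a small corner of the matrix
inspect(txtTdmBi[1:5, 1:3])

# Restore the earlier setting (optional)
options(old)
```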

Since the NGramTokenizer seems to hang on the parallel::mclapply call, changing the number of cores seems to work around it.
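
To see why a global option can influence that call, note that mclapply takes its default core count from the mc.cores option (its mc.cores argument defaults to getOption("mc.cores", 2L)). The following base-R sketch only demonstrates that mechanism; the variable names are illustrative, and it does not reproduce the hang itself.

```r
library(parallel)

# Set the option and keep the previous value for later
old <- options(mc.cores = 1)

# This is the value mclapply picks up when mc.cores is not passed explicitly
cores <- getOption("mc.cores", 2L)

# With a single core, no parallel worker processes are needed
squares <- mclapply(1:3, function(i) i^2)

# Restore the earlier setting
options(old)

print(cores)
print(squares)
```

Because the option is read at call time, it has to be set before any code that ends up calling mclapply, which matches the advice to set it before the tokenizer is used.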
