Error in simple_triplet_matrix -- unable to use RWeka to count phrases


Problem description

Using tm, I'm comparing a DocumentTermMatrix against a dictionary list to count totals:

totals <- inspect(DocumentTermMatrix(x, list(dictionary = d)))

This works great for single words, but I also want to count two-word phrases and can't figure out how to do this.

I tried RWeka:

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                               Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(v.corpus, 
                          control = list(tokenize = TrigramTokenizer))

But I received the following error message:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  : 
  'i, j, v' different lengths
In addition: Warning messages:
1: In parallel::mclapply(x, termFreq, control) :
  all scheduled cores encountered errors in user code
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  NAs introduced by coercion.

Can you help with the error message?

Thanks!!

Recommended answer

See my answer here.

There seem to be problems using RWeka together with the parallel package. I found a workaround here [1].

1: http://r.789695.n4.nabble.com/RWeka-and-multicore-package-td4678473.html#a4678948

The most important point is to not load the RWeka package, and instead to call its functions through the RWeka:: namespace inside an encapsulated function.

So your tokenizer should look like this:

BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
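
To tie this back to the dictionary count in the question, here is a minimal sketch of how the tokenizer might be used. The two sample documents, the dictionary `d`, and the use of `VCorpus` are assumptions made for illustration and are not part of the original question; the sketch also assumes that tm, RWeka, and a working Java installation are available.

```r
library(tm)   # RWeka is deliberately NOT attached with library(RWeka)

# Tokenizer from the answer: RWeka is reached only through its namespace.
BigramTokenizer <- function(x) {
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
}

# Hypothetical sample data (not from the original post).
docs   <- c("the quick brown fox", "the quick red fox")
corpus <- VCorpus(VectorSource(docs))

# Hypothetical dictionary of two-word phrases to count.
d <- c("quick brown", "red fox")

# Build the matrix with both the custom tokenizer and the dictionary.
dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize   = BigramTokenizer,
                                         dictionary = d))

# Plain numeric matrix of phrase counts per document.
totals <- as.matrix(dtm)
print(totals)
```

If the error persists, another commonly reported workaround is to call `options(mc.cores = 1)` before building the matrix, so that `parallel::mclapply` runs on a single core; whether this is needed depends on the tm version, so treat it as something to verify rather than a guaranteed fix.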
