Document-term matrix in R - bigram tokenizer not working

Views: 29 | Published: 2021/9/8 20:08:28 | Tags: r, tokenize, tm, n-gram, rweka


Problem description

I am trying to build two document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix currently comes out identical to the unigram matrix, and I'm not sure why.

Code:

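library(tm)     # Corpus, DirSource, DocumentTermMatrix, stopwords
library(RWeka)  # NGramTokenizer, Weka_control

# Read every file under "data" (including subdirectories) into a corpus
docs <- Corpus(DirSource("data", recursive = TRUE))

# Get the document-term matrices
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize = "words",
    removePunctuation = TRUE,
    stopwords = stopwords("english"),
    stemming = TRUE))
# Intended to contain bigrams, but currently yields the same terms as dtm_unigram
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
    removePunctuation = TRUE,
    stopwords = stopwords("english"),
    stemming = TRUE))

inspect(dtm_unigram)
inspect(dtm_bigram)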

I also tried using ngram(x, n=2) from the ngram package as the tokenizer, but that doesn't work either. How do I fix the bigram tokenization?
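One possible cause, offered as a hedged sketch and not a confirmed diagnosis: in recent versions of tm, Corpus() can return a SimpleCorpus for sources such as DirSource, and DocumentTermMatrix() appears to ignore a custom tokenize function for that class. Building the corpus with VCorpus() is a common way around this. The example below uses hypothetical in-memory texts in place of the "data" directory, and swaps the RWeka tokenizer for one built on NLP::ngrams so that it runs without Java. The assumption is that the question's BigramTokenizer would behave the same way once the corpus is a VCorpus.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical documents standing in for the files under "data"
texts <- c("the quick brown fox jumps", "the lazy dog sleeps soundly")

# Check which class Corpus() actually produced; on recent tm versions
# this is expected (assumption) to include "SimpleCorpus"
class(Corpus(VectorSource(texts)))

# Build a VCorpus explicitly so that custom tokenizers are honoured
docs <- VCorpus(VectorSource(texts))

# Bigram tokenizer: split the document into words, form consecutive
# pairs, and join each pair with a single space
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)
}

dtm_bigram <- DocumentTermMatrix(docs,
                                 control = list(tokenize = BigramTokenizer))

# Every term should now be a two-word phrase
Terms(dtm_bigram)
```

With the original setup, the equivalent change would be docs <- VCorpus(DirSource("data", recursive = TRUE)), leaving the rest of the code as it is.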
