R and tm package: create a term-document matrix with a dictionary of one or two words?


Problem Description


Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.


Web Search: Being new to text-mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:


Background: Of these, I preferred the solution that uses NGramTokenizer in the RWeka package in R, but I ran into a problem. In the example code below, I create three documents and place them in a corpus. Note that Docs 1 and 2 each contain two words. Doc 3 only contains one word. My dictionary keywords are two bigrams and a unigram.


Problem: The NGramTokenizer solution in the above links does not correctly count the unigram keyword in Doc 3.

library(tm)
library(RWeka)

my.docs = c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus = Corpus(VectorSource(my.docs))
my.dict = c('jedi master', 'jedi grandmaster', 'jedi')

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

inspect(DocumentTermMatrix(my.corpus, control=list(tokenize=BigramTokenizer,
                                                  dictionary=my.dict)))

# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs  jedi  jedi grandmaster  jedi master
#    1     1                 0            1
#    2     1                 1            0
#    3     0                 0            0


I was expecting the row for Doc 3 to give 1 for jedi and 0 for the other two. Is there something I am misunderstanding?

Recommended Answer


I ran into the same problem and found that token counting functions from the TM package rely on an option called wordLengths, which is a vector of two numbers -- the minimum and the maximum length of tokens to keep track of. By default, TM uses a minimum word length of 3 characters (wordLengths = c(3, Inf)). You can override this option by adding it to the control list in a call to DocumentTermMatrix like this:

DocumentTermMatrix(my.corpus,
                   control=list(
                       tokenize=BigramTokenizer,
                       wordLengths = c(1, Inf)))


However, your 'jedi' word is already more than 3 characters long, so the default minimum should not filter it out on its own. Still, you may have tweaked this option's value earlier while trying to figure out how to count ngrams, so it is worth trying anyway. Also, look at the bounds option, which tells tm to discard words that are less or more frequent than the specified limits.
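Putting the question's setup together with the wordLengths fix, a minimal sketch of the corrected call might look like this (it reuses the BigramTokenizer from the question; the bounds value shown is only an illustration of the option's shape, not something the answer requires):

```r
library(tm)
library(RWeka)

my.docs   <- c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus <- Corpus(VectorSource(my.docs))
my.dict   <- c('jedi master', 'jedi grandmaster', 'jedi')

# Tokenize each document into unigrams and bigrams, as in the question
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

# wordLengths = c(1, Inf) relaxes tm's default minimum token length of 3,
# and bounds = list(global = c(1, Inf)) keeps terms regardless of how many
# documents they appear in (the default behavior, shown here for reference)
dtm <- DocumentTermMatrix(my.corpus,
                          control = list(tokenize    = BigramTokenizer,
                                         dictionary  = my.dict,
                                         wordLengths = c(1, Inf),
                                         bounds      = list(global = c(1, Inf))))
inspect(dtm)
```

With the minimum-length filter relaxed, short tokens are no longer silently dropped before the dictionary is applied.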

