Trying to get tf-idf weighting working in R

Question

I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). My working directory contains a couple of text files (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know).

R version: 2.15.1. sessionInfo() reports this about tm: [1] tm_0.5-8.3

The relevant bit of code:

library('tm')
corpus <- Corpus(DirSource('.'))
dtm <- DocumentTermMatrix(corpus,control=list(weight=weightTfIdf))

str(dtm)
List of 6
 $ i       : int [1:12456] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:12456] 2 10 12 17 20 24 29 30 32 34 ...
 $ v       : num [1:12456] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 2
 $ ncol    : int 10646
 $ dimnames:List of 2
  ..$ Docs : chr [1:2] "bloom.txt" "telemachiad.txt"
  ..$ Terms: chr [1:10646] "_--c'est" "_--et" "_--for" "_--goodbye," ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

You will note that the weighting still appears to be the default term frequency (tf) rather than the tf-idf scores that I'd like.

Apologies if I'm missing something obvious, but based on the documentation I've read, this should work. The fault, no doubt, lies not in the stars...

Recommended answer

If you look at the DocumentTermMatrix help page, and at the example in it, you will see that the control argument is specified this way:

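library(tm)   # provides DocumentTermMatrix, weightTfIdf and the crude data set
data(crude)   # sample corpus shipped with tm
dtm <- DocumentTermMatrix(crude,
           control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                          stopwords = TRUE))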

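So the weighting is specified with the list element named weighting, not weight. Since the call in the question ran without error and kept the default term frequency, an element named weight appears to be simply ignored. You can specify the weighting by passing a function name or a custom function, as in the example. The following works too: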

data(crude)
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf))
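Applied to the setup in the question, the fix might look like the sketch below. It is a minimal illustration rather than a tested recipe for tm 0.5-8.3: the two toy documents are hypothetical stand-ins for the two text files, and VectorSource is used instead of DirSource('.') only so the snippet runs without any files on disk.

```r
library(tm)

# Two toy documents standing in for the two text files (hypothetical content).
docs <- c("buck mulligan came from the stairhead bearing a bowl",
          "mulligan came down the dark winding stairs")
corpus <- Corpus(VectorSource(docs))

# Default control: plain term frequency.
dtm_tf <- DocumentTermMatrix(corpus)

# The list element is named `weighting`, as on the help page.
dtm_tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightTfIdf))

# Printing each matrix reports the weighting that was applied.
print(dtm_tf)
print(dtm_tfidf)

# Dense view of the tf-idf scores (fine for a corpus this small).
m <- as.matrix(dtm_tfidf)
print(m)
```

One thing to expect with only two documents: weightTfIdf uses an inverse document frequency of log2(number of documents / number of documents containing the term), so any term that occurs in both files gets log2(2/2) = 0 and only terms unique to one file receive a non-zero score.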
