Different tf-idf values in R and hand calculation


Question

I am experimenting in R to compute tf-idf values.

I have a set of documents, for example:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

I want to create a matrix like this:

   Docs      blue    bright       sky       sun
   D1      tf-idf 0.0000000    tf-idf 0.0000000
   D2   0.0000000    tf-idf 0.0000000    tf-idf
   D3   0.0000000    tf-idf    tf-idf    tf-idf

So, my code in R:

library(tm)
docs <- c(D1 = "The sky is blue.", D2 = "The sun is bright.", D3 = "The sun in the sky is bright.")

dd <- Corpus(VectorSource(docs)) #Make a corpus object from a text vector
#Clean the text
dd <- tm_map(dd, stripWhitespace)
dd <- tm_map(dd, tolower)
dd <- tm_map(dd, removePunctuation)
dd <- tm_map(dd, removeWords, stopwords("english"))
dd <- tm_map(dd, stemDocument)
dd <- tm_map(dd, removeNumbers)
 inspect(dd)
    A corpus with 3 text documents

    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
    create_date creator 
    Available variables in the data frame are:
    MetaID 

    $D1
    sky blue

    $D2
     sun bright

    $D3
      sun sky bright

    > dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
    > as.matrix(dtm)
      Terms
            Docs      blue    bright       sky       sun
            D1 0.7924813 0.0000000 0.2924813 0.0000000
            D2 0.0000000 0.2924813 0.0000000 0.2924813
            D3 0.0000000 0.1949875 0.1949875 0.1949875

If I do a hand calculation, the matrix should be:

            Docs  blue      bright       sky       sun
            D1    0.2385     0.0000000 0.3521    0.0000000
            D2    0.0000000 0.3521    0.0000000 0.3521
            D3    0.0000000 0.1949875 0.058     0.058 

For example, for blue I calculate tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255, so tf-idf = tf*idf = 0.5*0.477 = 0.2385. I calculate the other tf-idf values the same way. Now I am wondering why the hand-calculated matrix and the R matrix differ. Which one gives the correct results? Am I doing something wrong in the hand calculation, or is there something wrong with my R code?

Recommended answer

The reason your hand calculation doesn't agree with the DocumentTermMatrix calculation is that you are using a different log base. When you say log(3/1) = 0.477121255, you must be using log base 10; in R, that would be log10(3). The default log in R is the natural log, so if you type log(3) in R you get ~1.10. But weightTfIdf uses log base 2 for its calculations. Thus, when calculating tf-idf for "blue", you get

(1/2)*log2(3/1) = 0.7924813
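
To make the effect of the log base concrete, here is a small sketch that computes the tf-idf of "blue" in D1 under each of the three bases mentioned above. The variable names (tf, ratio, tfidf_by_base) are made up for illustration; only log10, log and log2 are base R functions.

```r
# tf for "blue" in D1: 1 occurrence among the 2 terms left after cleaning
tf <- 1 / 2
# 3 documents in total, "blue" appears in 1 of them
ratio <- 3 / 1

tfidf_by_base <- c(
  log10 = tf * log10(ratio),  # the hand calculation from the question
  ln    = tf * log(ratio),    # R's default log() is the natural log
  log2  = tf * log2(ratio)    # the base that matches the weightTfIdf output
)
print(round(tfidf_by_base, 7))
```

Only the log2 entry matches the 0.7924813 that R reports for "blue" in D1; the log10 entry matches the hand-calculated value.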

I hope that clears things up.
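
As a further check, the whole matrix can be rebuilt in base R without tm. This is only a sketch under two assumptions: the term lists are typed in by hand from the cleaned corpus shown in the question, and tf is the term count divided by the number of terms in the document, as in the hand calculation. It is not tm's own implementation, and the names docs, counts, tf, idf and tfidf are illustrative.

```r
# Terms left in each document after the cleaning steps in the question
# (entered by hand here; an assumption, not produced by tm)
docs <- list(
  D1 = c("sky", "blue"),
  D2 = c("sun", "bright"),
  D3 = c("sun", "sky", "bright")
)
terms <- sort(unique(unlist(docs)))  # "blue" "bright" "sky" "sun"

# Raw term counts: one row per document, one column per term
counts <- t(sapply(docs, function(d) tabulate(match(d, terms), nbins = length(terms))))
colnames(counts) <- terms

tf    <- counts / rowSums(counts)                  # count / terms in the document
idf   <- log2(nrow(counts) / colSums(counts > 0))  # log base 2 of N / document frequency
tfidf <- sweep(tf, 2, idf, "*")                    # scale each term column by its idf

print(round(tfidf, 7))
```

With log2 the result agrees with the as.matrix(dtm) output in the question; swapping log2 for log10 in the idf line gives the base-10 variant used in the hand calculation.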
