Math of tm::findAssocs: how does this function work?


Question

I have been using findAssocs() for text mining (tm package), but realized that something doesn't seem right with my dataset.

My dataset is 1500 open-ended answers saved in one column of a CSV file. I loaded the dataset like this and used the typical tm_map calls to turn it into a corpus.

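library(tm)

# Read the answers as character strings rather than factors.
Q29 <- read.csv("favoritegame2.csv", stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(Q29$Q29))

# Wrap base functions such as tolower in content_transformer() so that
# newer versions of tm keep the documents intact.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)

# Terms whose correlation with "like" exceeds 0.2
findAssocs(dtm, "like", .2)
# cousin  fill  ....
#   0.28  0.20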

Q1. When I find terms associated with like, I don't see like = 1 as part of the output. However,

dtm.df <- as.data.frame(as.matrix(dtm))

this data frame consists of 1500 obs. of 1689 variables. (Or is it because the data is saved in a row of the CSV file?)

Q2. Even though cousin and fill each showed up once when the target term like showed up once, their scores differ as shown above. Shouldn't they be the same?

I'm trying to find the math behind findAssocs() but have had no success yet. Any advice is highly appreciated!

Recommended answer

 findAssocs
#function (x, term, corlimit) 
#UseMethod("findAssocs", x)
#<environment: namespace:tm>

methods(findAssocs)
#[1] findAssocs.DocumentTermMatrix* findAssocs.matrix*   findAssocs.TermDocumentMatrix*

 getAnywhere(findAssocs.DocumentTermMatrix)
#-------------
A single object matching ‘findAssocs.DocumentTermMatrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
{
    ind <- term == Terms(x)
    suppressWarnings(x.cor <- cor(as.matrix(x[, ind]), as.matrix(x[, 
        !ind])))

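That is where the self-reference is removed: the !ind index drops the column for the term itself before cor() is called, so the term never shows up in its own output with a correlation of 1.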

    findAssocs(x.cor, term, corlimit)
}
<environment: namespace:tm>
#-------------
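Because cor() uses Pearson's method by default, the score reported for each term appears to be the Pearson correlation between two columns of the document-term matrix. Written out, with x_i and y_i as the counts of the target term and of another term in document i (notation introduced here for illustration, not taken from the tm source) and n as the number of documents:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The sums run over every document, not only the ones that contain the target term, so occurrences of a term in documents without the target term also affect its score.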
 getAnywhere(findAssocs.matrix)
#-------------
A single object matching ‘findAssocs.matrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
sort(round(x[term, which(x[term, ] > corlimit)], 2), decreasing = TRUE)
<environment: namespace:tm>
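Putting the two methods together, the logic can be sketched in base R on a toy matrix. This is a minimal sketch rather than tm's own code: the helper name find_assocs_sketch and the six-document matrix are made up for illustration, and the NA filter and explicit naming are additions for robustness. It shows one possible reason for the behaviour in Q2: cousin and fill both co-occur with like exactly once, but fill also appears in a document without like, which lowers its correlation.

```r
# Toy document-term matrix: rows are documents, columns are terms.
# (Made-up data, not the asker's dataset.)
m <- cbind(
  like   = c(1, 1, 0, 0, 0, 0),
  cousin = c(1, 0, 0, 0, 0, 0),  # appears only together with "like"
  fill   = c(1, 0, 1, 0, 0, 0)   # also appears once without "like"
)

# Hypothetical helper that mirrors the logic of the printed tm methods.
find_assocs_sketch <- function(m, term, corlimit) {
  ind <- colnames(m) == term
  # Pearson correlation of the target column with every other column.
  # The target column itself is excluded, so it never shows up as 1.
  x_cor <- suppressWarnings(
    cor(m[, ind, drop = FALSE], m[, !ind, drop = FALSE])
  )
  v <- setNames(as.vector(x_cor), colnames(x_cor))
  # Keep scores above the threshold, dropping NAs from constant columns.
  v <- v[!is.na(v) & v > corlimit]
  sort(round(v, 2), decreasing = TRUE)
}

res <- find_assocs_sketch(m, "like", 0.2)
print(res)
# cousin   fill
#   0.63   0.25
```

Because the threshold is applied before rounding, a score slightly above the limit can still print as exactly the limit, which would be consistent with fill appearing as 0.20 in the question.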
