How does removeSparseTerms in R work?

Question

I am using the removeSparseTerms method in R, and it requires a threshold value as input. I have also read that the higher the value, the more terms are retained in the returned matrix.

How does this method work, and what is the logic behind it? I understand the concept of sparseness, but does this threshold indicate how many documents a term should be present in, or some other ratio?

Recommended answer

In the sense of the sparse argument to removeSparseTerms(), sparsity refers to a threshold on the proportion of documents in which a term does not appear, that is, one minus the term's relative document frequency. Terms whose sparsity is at or above the threshold are removed. Relative document frequency here means a proportion. As the help page for the command states (although not very clearly), fewer terms are removed as sparse approaches 1.0. (Note that sparse cannot take values of 0 or 1.0, only values in between.)

For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), then this will remove only terms whose sparsity is 0.99 or higher. The exact interpretation for sparse = 0.99 is that for term $j$, you will retain all terms for which $df_j > N \times (1 - 0.99)$, where $N$ is the number of documents and $df_j$ is the number of documents containing term $j$. In this case probably all terms will be retained (see example below).
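The retention rule can be sketched in base R without tm. This is only a minimal illustration of the inequality, not tm's actual source; the function name `retained_terms` and the toy matrix are hypothetical.

```r
# Sketch of the retention rule df_j > N * (1 - sparse), using base R only.
# `retained_terms` and the toy matrix are hypothetical, for illustration.
retained_terms <- function(m, sparse) {
  stopifnot(is.matrix(m), sparse > 0, sparse < 1)
  n_docs <- nrow(m)             # N: documents are rows
  doc_freq <- colSums(m > 0)    # df_j: number of documents containing term j
  colnames(m)[doc_freq > n_docs * (1 - sparse)]
}

# Four documents; "alpha" occurs in 4, "beta" in 2, "gamma" in 1.
toy <- matrix(c(1, 1, 1, 1,
                1, 1, 0, 0,
                1, 0, 0, 0),
              ncol = 3,
              dimnames = list(Docs = as.character(1:4),
                              Terms = c("alpha", "beta", "gamma")))

retained_terms(toy, 0.50)  # threshold 2: only "alpha" (df = 4) exceeds it
retained_terms(toy, 0.75)  # threshold 1: "alpha" and "beta"
retained_terms(toy, 0.99)  # threshold 0.04: all three terms
```

Because the comparison is strict, a term whose document frequency equals the threshold exactly (here "beta" at sparse = 0.50) is dropped.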

Near the other extreme, if sparse = .01, then only terms that appear in (nearly) every document will be retained. (Of course this depends on the number of terms and the number of documents, and in natural language, common words like "the" are likely to occur in every document and hence never be "sparse".)

Here is an example with a sparsity threshold of 0.99, where the second term occurs in (first example) slightly fewer than 0.01 of the documents and (second example) just over 0.01 of the documents:

> library(tm)  # provides as.DocumentTermMatrix(), weightTf and removeSparseTerms()
> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity           : 0%
Maximal term length: 2
Weighting          : term frequency (tf)
> 
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity           : 49%
Maximal term length: 2
Weighting          : term frequency (tf)

Here are a few additional examples with actual text and terms:

> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
              "the sparse brown furry matrix",
              "the quick matrix")

> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .01))
    Terms
Docs the
   1   1
   2   1
   3   1
> as.matrix(removeSparseTerms(myTdm, .99))
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .5))
    Terms
Docs brown furry matrix quick the
   1     2     2      0     1   1
   2     1     1      1     0   1
   3     0     0      1     1   1

In the last example with sparse = 0.5, only terms occurring in at least two of the three documents were retained, because the document frequency must exceed 3 × (1 - 0.5) = 1.5.
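Equivalently, each term's own sparsity can be computed and compared with the threshold. The sketch below rebuilds the count matrix shown above in base R, so it does not depend on tm. It is meant only as an illustration, and the variable names are made up.

```r
# Rebuild the 3-document count matrix from the example above (base R only).
terms <- c("brown", "fox", "furry", "jumped", "matrix",
           "over", "quick", "second", "sparse", "the")
counts <- matrix(c(2, 2, 2, 1, 0, 1, 1, 1, 0, 1,
                   1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
                   0, 0, 0, 0, 1, 0, 1, 0, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs = c("1", "2", "3"), Terms = terms))

doc_freq <- colSums(counts > 0)               # documents containing each term
term_sparsity <- 1 - doc_freq / nrow(counts)  # share of documents lacking it

round(term_sparsity, 2)

# Terms whose sparsity is below the threshold are the ones that are kept.
names(term_sparsity)[term_sparsity < 0.5]
names(term_sparsity)[term_sparsity < 0.01]
```

With a threshold of 0.5 this keeps brown, furry, matrix, quick and the. With 0.01 only "the" survives, since it is the only term with a sparsity of zero.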

An alternative for trimming terms from document-term matrices based on document frequency is the text analysis package quanteda. The equivalent functionality there refers not to sparsity but directly to the document frequency of terms (as in tf-idf).

> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
     a  brown    fox  furry jumped matrix   over  quick second sparse    the 
     1      2      1      2      1      2      1      2      1      1      3 
> dfm_trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6 
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
       features
docs    brown furry the matrix quick
  text1     2     2   1      0     1
  text2     1     1   1      1     0
  text3     0     0   1      1     1

This usage seems much more straightforward to me.
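To relate the two approaches, a sparse threshold can be converted into the smallest document count that survives the filter. The helper below is a hypothetical sketch in base R, assuming the strict inequality $df_j > N \times (1 - sparse)$ described earlier. It is not a function from tm or quanteda.

```r
# Smallest document count that passes a removeSparseTerms()-style filter:
# the smallest integer strictly greater than N * (1 - sparse).
# `min_docs_for_sparse` is a hypothetical helper, not part of tm or quanteda.
min_docs_for_sparse <- function(n_docs, sparse) {
  stopifnot(n_docs >= 1, sparse > 0, sparse < 1)
  floor(n_docs * (1 - sparse)) + 1
}

min_docs_for_sparse(3, 0.50)    # 2: matches the minimum of 2 documents used above
min_docs_for_sparse(3, 0.01)    # 3: the term must appear in every document
min_docs_for_sparse(101, 0.99)  # 2: a term found in 1 of 101 documents is dropped
```

Seen this way, sparse = 0.5 on the three-document corpus and a minimum document frequency of 2 select the same five terms.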
