R tm package: create a matrix of the N most frequent terms


Problem description

I have a TermDocumentMatrix created using the tm package in R.

I'm trying to create a matrix or data frame that holds the 50 most frequently occurring terms.

When I try to convert it to a matrix, I get this error:

> ap.m <- as.matrix(mydata.dtm)
Error: cannot allocate vector of size 2.0 Gb

So I tried converting it to a sparse matrix using the Matrix package:

> A <- as(mydata.dtm, "sparseMatrix") 
Error in as(from, "CsparseMatrix") : 
  no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix"
> B <- Matrix(mydata.dtm, sparse = TRUE)
Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix

I've also tried accessing the different parts of the tdm using:

> freqy1 <- data.frame(term1 = findFreqTerms(mydata.dtm, lowfreq=165))
> mydata.dtm[mydata.dtm$ Terms %in% freqy1$term1,]
Error in seq_len(nr) : argument must be coercible to non-negative integer

Here is some additional information:

> str(mydata.dtm)
List of 6
 $ i       : int [1:430206] 377 468 725 3067 3906 4150 4393 5188 5793 6665 ...
 $ j       : int [1:430206] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:430206] 1 1 1 1 1 1 1 1 2 3 ...
 $ nrow    : int 15643
 $ ncol    : int 17207
 $ dimnames:List of 2
  ..$ Terms: chr [1:15643] "000" "0mm" "100" "1000" ...
  ..$ Docs : chr [1:17207] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
> mydata.dtm
A term-document matrix (15643 terms, 17207 documents)

Non-/sparse entries: 430206/268738895
Sparsity           : 100%
Maximal term length: 54 
Weighting          : term frequency (tf)

My ideal output is something like this:

term      frequency
the         2123
and         2095
able         883
...          ...

Any suggestions?

Recommended answer

The term-document matrices in tm are already stored as sparse matrices, in triplet form (note the simple_triplet_matrix class in the str() output). Here, mydata.tdm$i and mydata.tdm$j are the row and column index vectors of the non-zero entries, and mydata.tdm$v is the corresponding vector of frequencies (the object is called mydata.dtm in the question). So you can create a sparse matrix with the Matrix package by writing:

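library(Matrix)  # provides sparseMatrix(); dims keeps the original shape even if trailing rows/columns are empty
sparseMatrix(i=mydata.tdm$i, j=mydata.tdm$j, x=mydata.tdm$v, dims=c(mydata.tdm$nrow, mydata.tdm$ncol))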

Then you can use rowSums to get the total frequency of each term, and map the rows you are interested in back to the terms they represent via mydata.tdm$dimnames$Terms.
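
The steps in this answer could be put together roughly as follows. This is only a sketch under stated assumptions: toy.tdm is a hypothetical hand-built list that mimics the fields shown in the str() output (i, j, v, nrow, ncol, dimnames), its values are made up, and n is set to 2 rather than 50 so the toy result stays readable. With a real TermDocumentMatrix you would use your own object in place of toy.tdm.

```r
library(Matrix)  # recommended package shipped with R; provides sparseMatrix()

# Hypothetical stand-in for a TermDocumentMatrix (values invented for illustration).
toy.tdm <- list(
  i = c(1L, 2L, 3L, 1L, 2L, 1L),   # row (term) index of each non-zero entry
  j = c(1L, 1L, 1L, 2L, 2L, 3L),   # column (document) index of each entry
  v = c(3, 2, 1, 4, 1, 5),         # count of that term in that document
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Terms = c("the", "and", "able"),
                  Docs  = c("1", "2", "3"))
)

# Build the sparse matrix from the triplets; dims preserves the original shape.
m <- sparseMatrix(i = toy.tdm$i, j = toy.tdm$j, x = toy.tdm$v,
                  dims = c(toy.tdm$nrow, toy.tdm$ncol))

# Total frequency of each term across all documents, labelled with the terms.
term.freq <- rowSums(m)
names(term.freq) <- toy.tdm$dimnames$Terms

# Keep the n most frequent terms as a two-column data frame.
n <- 2
top <- head(sort(term.freq, decreasing = TRUE), n)
result <- data.frame(term = names(top), frequency = unname(top),
                     stringsAsFactors = FALSE)
print(result)
```

Only the non-zero entries are ever stored, so this avoids the dense as.matrix() allocation that failed in the question.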
