Row sum for large term-document matrix / simple_triplet_matrix ?? {tm package}


Question

So I have a very large term-document matrix:

> class(ph.DTM)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"

> ph.DTM
A term-document matrix (109996 terms, 262811 documents)

Non-/sparse entries: 3705693/28904453063
Sparsity           : 100%
Maximal term length: 191 
Weighting          : term frequency (tf)

How do I get the row sum (frequency) of each term? I tried:

> apply(ph.DTM, 1, sum)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
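
The warning points at the product `nr * nc`: a dense conversion needs one cell per term-document pair, and that count does not fit in a 32-bit R integer. The arithmetic below is a small sketch using only the dimensions printed in this question; the variable names are illustrative, not part of tm or slam.

```r
# Dimensions reported in the question (full matrix and the reduced one)
n_terms <- c(full = 109996L, reduced = 28842L)
n_docs  <- 262811L

# Integer multiplication overflows and yields NA (normally with a warning),
# which is the same failure the dense conversion reports
int_cells <- suppressWarnings(n_terms * n_docs)

# Double arithmetic shows how many cells a dense matrix would need
dbl_cells <- n_terms * as.numeric(n_docs)

print(int_cells)
print(dbl_cells)
print(dbl_cells > .Machine$integer.max)
```

Both cell counts exceed `.Machine$integer.max` (2147483647), so any function that first expands the object to a dense matrix would be expected to fail in the same way.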

Obviously, I am aware of removeSparseTerms:

ph.DTM2 <- removeSparseTerms(ph.DTM, 0.99999)

which reduces the size:

> ph.DTM2
A term-document matrix (28842 terms, 262811 documents)

Non-/sparse entries: 3612620/7576382242
Sparsity           : 100%
Maximal term length: 24 
Weighting          : term frequency (tf)

But I still cannot apply any matrix-related functions to it:

> as.matrix(ph.DTM2)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

How can I get a simple row sum on this object? Thanks!

Recommended answer

OK, after some more Googling, I came across the slam package, which enables:

library(slam)  # provides rollup() for simple_triplet_matrix objects
ph.DTM3 <- rollup(ph.DTM, 2, na.rm = TRUE, FUN = sum)

which works.
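
As a hedged illustration of why this works, here is a minimal sketch on a toy matrix (the data and the names `toy`, `freq`, and `rolled` are made up for the example). It also uses slam's `row_sums()`, which is not mentioned in the answer above but appears to be a more direct way to get per-term totals without building a dense matrix.

```r
library(slam)

# Toy 3-term x 4-document matrix in triplet form (illustrative data only)
toy <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L),   # row (term) index of each non-zero entry
  j = c(1L, 3L, 2L, 1L, 4L),   # column (document) index of each entry
  v = c(2, 1, 5, 1, 3),        # term counts
  nrow = 3L, ncol = 4L,
  dimnames = list(Terms = c("alpha", "beta", "gamma"),
                  Docs  = as.character(1:4))
)

# Per-term totals computed directly on the sparse representation
freq <- row_sums(toy, na.rm = TRUE)

# The rollup() call from the answer collapses all documents into one column,
# giving a terms x 1 sparse matrix
rolled <- rollup(toy, 2, na.rm = TRUE, FUN = sum)

print(freq)
print(as.matrix(rolled)[, 1])  # dense conversion is safe here: only 3 cells
```

Since `class(ph.DTM)` includes `"simple_triplet_matrix"`, the same `row_sums(ph.DTM)` call should apply to the matrix in the question and return an ordinary numeric vector of 109996 term frequencies, though that has not been run against the original data.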
