Arrange the words of the Document Term Matrix by frequency in R
Question
I'm sorry for the new question, but I'm a newbie in text mining and need advice from the pros.
Now, after long torments with content_transformer, I have a clean corpus.
The next question:
1. How do I select from `dtm` the words with small frequencies, so that the sum of their frequencies is not more than 1%?
For example, I need this format:
x 0,5% of all words in the dataset
y 0,2%
z 0,3%
So here the frequencies sum to 1% in total. How do I do this?
Recommended answer
You can take a look at the `TermDocumentMatrix` function of the tm package (or its transpose, `DocumentTermMatrix`, used below). It counts the occurrences of each word per document. Summing these counts over the whole corpus, and dividing by the total number of word occurrences, should get you where you want to be.
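library(tm)
dtm <- DocumentTermMatrix(corpus)
# word counts for the complete corpus (one entry per term)
counts <- colSums(as.matrix(dtm))
# total number of word occurrences in the corpus
# (length(counts) would give the number of distinct terms, not documents)
total <- sum(counts)
# relative frequency of each word as a share of all words
freqs <- counts / total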
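To get from per-word counts to the selection asked about in the question, one possible approach is to sort the words from rarest to most frequent and keep those whose cumulative share of all word occurrences stays within 1%. The sketch below is not part of the original answer. It uses a made-up `counts` vector standing in for `colSums(as.matrix(dtm))`, and the names `counts_sorted`, `cum_share`, `rare`, and `rare_pct` are illustrative.

```r
# Toy word counts standing in for colSums(as.matrix(dtm)); the values are made up.
counts <- c(the = 500, data = 300, model = 190, x = 5, y = 2, z = 3)

# Total number of word occurrences in the (toy) corpus.
total <- sum(counts)

# Sort the words from rarest to most frequent (names are preserved).
counts_sorted <- sort(counts)

# Cumulative share of all word occurrences covered so far.
cum_share <- cumsum(counts_sorted) / total

# Keep the rarest words whose combined share is at most 1%.
rare <- counts_sorted[cum_share <= 0.01]

# Express each selected word's share as a percentage of all words.
rare_pct <- 100 * rare / total
print(rare_pct)
```

Accumulating the integer counts before dividing keeps the comparison with the 1% threshold less sensitive to floating-point rounding than summing the fractional frequencies directly. Use `sort(counts, decreasing = TRUE)` instead if you want the most frequent words listed first.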