Arrange the words of the Document Term Matrix by frequency in R

Question

I'm sorry for asking a new question, but I am a newbie in text mining and need advice from professionals. After a long struggle with content_transformer, I now have a clean corpus. The next question is:

1. How can I select from `dtm` the words with small frequencies, so that the sum of their frequencies is not more than 1%?

For example, I need this format:

x 0,5% of all words in the dataset
y 0,2%
z 0,3%

So here the frequencies sum to 1%. How can this be done?

Recommended answer

You can take a look at the TermDocumentMatrix function (or its transpose, DocumentTermMatrix) of the tm package. It counts the occurrences of each word per document. Summing these counts over the whole corpus should get you where you want to be.

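library(tm)

dtm <- DocumentTermMatrix(corpus)
# word counts for the complete corpus (one column per term)
counts <- colSums(as.matrix(dtm))

# total number of word occurrences in the corpus
total <- sum(counts)
# relative frequency of each word, as a share of all words
freqs <- counts / total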
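
To get from per-word frequencies to the "rarest words that together make up no more than 1%" selection asked about in the question, one possible approach is to sort the frequencies in ascending order and keep terms while the running total stays within the threshold. The following is a minimal sketch in base R, not a definitive implementation. The `counts` vector here is made-up illustrative data standing in for `colSums(as.matrix(dtm))`, and the names `threshold`, `rare_first` and `selected` are assumptions introduced for this example.

```r
# Hypothetical term counts, standing in for colSums(as.matrix(dtm)).
counts <- c(the = 500, data = 300, model = 190, x = 5, y = 2, z = 3)

# Relative frequency of each term as a share of all word occurrences.
freqs <- counts / sum(counts)

# Arrange terms from rarest to most frequent (names are preserved).
rare_first <- sort(freqs)

# Keep the rarest terms while their cumulative share stays within 1%.
# The small tolerance guards against floating-point rounding at the boundary.
threshold <- 0.01
selected <- rare_first[cumsum(rare_first) <= threshold + 1e-12]

# Show the selected terms as percentages of all words.
print(round(100 * selected, 2))
```

Using `sort(freqs, decreasing = TRUE)` instead would arrange the words from most to least frequent. For a large corpus, `as.matrix(dtm)` may use a lot of memory, so a sparse column sum could be preferable there.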
