Find the frequency of a custom word in an R TermDocumentMatrix using the tm package


Question

I turned about 50,000 rows of varchar data into a corpus, and then cleaned that corpus using the tm package, getting rid of stopwords, punctuation, and numbers.

I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms returns the most common words and the number of times each one shows up in the data.

However, I want a function that lets me search for "word" and returns how many times "word" appears in the TermDocumentMatrix.

Is there a function in tm that achieves this? Or do I have to convert my data to a data.frame and use a different package and function?

Recommended answer

Since you have not given a reproducible example, I will give one using the crude dataset available in the tm package.

You can do it in (at least) two different ways. Keep in mind that anything that turns a sparse matrix into a dense matrix can use a lot of memory, so I will give you two options. The first is more memory-friendly because it works directly on the sparse tdm. The second first transforms the tdm into a dense matrix and then creates a frequency vector.

library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))


tdm <- TermDocumentMatrix(crude)

# Making use of the fact that a tdm or dtm is a simple_triplet_matrix from slam
my_func <- function(data, word){
  slam::row_sums(data[data$dimnames$Terms == word, ])
}

my_func(tdm, "crude")
# crude 
#    21 
my_func(tdm, "oil")
# oil 
#  85

# turn tdm into dense matrix and create frequency vector. 
freq <- rowSums(as.matrix(tdm))
freq["crude"]
# crude 
#    21 
freq["oil"]
# oil 
#  85 
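To make the memory trade-off between the two options concrete, here is a minimal sketch that computes the same frequency vector both ways and compares the size of the sparse and dense objects. It assumes the uncleaned crude corpus for brevity, so the counts can differ from those shown above. The variable names sparse_freq and dense_freq are only illustrative.

```r
library(tm)
data("crude")

# Build a term-document matrix from the raw example corpus (no cleaning here)
tdm <- TermDocumentMatrix(crude)

# Sparse route: sum each row without expanding the matrix
sparse_freq <- slam::row_sums(tdm)

# Dense route: expand to a regular matrix first, then sum each row
dense_freq <- rowSums(as.matrix(tdm))

# Both routes should give the same counts per term
all.equal(unname(sparse_freq), unname(dense_freq))

# Compare the memory footprint of the two representations
print(object.size(tdm), units = "Kb")
print(object.size(as.matrix(tdm)), units = "Kb")
```

On a 20-document corpus the difference is small, but with about 50,000 documents the dense matrix grows with the number of terms times the number of documents, so the sparse route is likely the safer default.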

As requested in the comments:

# all words starting with cru. Adjust regex to find what you need.
freq[grep("^cru", names(freq))]
# crucial   crude 
#       2      21 

# separate words
freq[c("crude", "oil")]
# crude   oil 
#    21    85 
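The multi-word lookup above goes through the dense freq vector, and indexing it with a word that is not in the matrix yields NA. A possible sparse variant is sketched below. word_freq is a hypothetical helper (it is not part of tm) that assumes you want 0 for words that do not occur, and it again uses the uncleaned crude corpus for brevity.

```r
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)

# word_freq() is a hypothetical helper, not a tm function.
# It returns the total count for each requested word and 0 for absent words.
word_freq <- function(tdm, words) {
  words <- unique(words)
  counts <- setNames(numeric(length(words)), words)

  # Logical row index: which terms of the matrix were asked for
  hit <- Terms(tdm) %in% words
  if (any(hit)) {
    # Sum only the matching rows; the matrix stays sparse throughout
    found <- slam::row_sums(tdm[hit, ])
    counts[names(found)] <- found
  }
  counts
}

word_freq(tdm, c("oil", "crude", "notaword"))
```

Matching by name when filling in counts keeps the result in the order the words were requested, regardless of how the terms are ordered inside the matrix.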
