List of word frequencies using R
Question
I have been using the tm package to run some text analysis. My problem is creating a list of words together with the frequency of each word.
library(tm)
library(RWeka)
txt <- read.csv("HW.csv",header=T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'),"originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
#building the TDM
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
I typically use the following code to generate the list of words within a frequency range:
frq1 <- findFreqTerms(myTdm, lowfreq=50)
Is there any way to automate this so that we get a data frame with all words and their frequencies?
The other problem I face is converting the term-document matrix into a data frame. Because I am working with large samples of data, I run into memory errors. Is there a simple solution for this?
Recommended answer
Try this:
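library(tm)
# "crude" is an example corpus shipped with the tm package
data("crude")
# Convert the term-document matrix to an ordinary (dense) matrix
myTdm <- as.matrix(TermDocumentMatrix(crude))
# One row per term: the term itself and its total count across all documents
FreqMat <- data.frame(ST = rownames(myTdm),
                      Freq = rowSums(myTdm),
                      row.names = NULL)
head(FreqMat, 10)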
#            ST Freq
# 1       "(it)    1
# 2     "demand    1
# 3  "expansion    1
# 4        "for    1
# 5     "growth    1
# 6         "if    1
# 7         "is    2
# 8        "may    1
# 9       "none    2
# 10      "opec    2
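The answer above calls as.matrix(), which expands the term-document matrix into a dense matrix and may be what triggers the memory errors mentioned in the question. A minimal sketch of one possible workaround follows, assuming the slam package (a dependency of tm) is available: its row_sums() function sums the sparse matrix directly, so no dense copy is created. The three sample sentences and the names docs, corpus, term_freq and freq_df are made up for illustration and are not from the original post.

```r
library(tm)
library(slam)  # sparse-matrix helpers; installed as a dependency of tm

# Hypothetical toy documents, used here only so the example is self-contained
docs <- c("oil prices rise",
          "oil demand falls",
          "prices fall as demand falls")

corpus <- VCorpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus)

# Sum each term's counts across documents without building a dense matrix
term_freq <- slam::row_sums(tdm)

# Assemble the data frame of terms and frequencies, most frequent first
freq_df <- data.frame(term = names(term_freq),
                      freq = as.integer(term_freq),
                      row.names = NULL,
                      stringsAsFactors = FALSE)
freq_df <- freq_df[order(-freq_df$freq), ]

print(freq_df)
```

The same pattern should apply to a matrix built with a custom tokenizer, such as the trigram tokenizer in the question, since only the final summation step changes.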