Grepl group of strings and count frequency of all using R

Problem description

I have a column of 50k rows of tweets, named text, read from a csv file (the tweets consist of sentences, phrases, etc.). I'm trying to count the frequency of several words in that column. Is there an easier way to do it than what I'm doing below?

# Reading my file
tweets <- read.csv('coffee.csv', header=TRUE)


# Doing a grepl per word (This is hard because I need to look for many words one by one)
coffee <- grepl("coffee", tweets$text, ignore.case=TRUE)
mugs   <- grepl("mugs", tweets$text, ignore.case=TRUE)


# Calculate the % of times among all tweets (This is hard because I need to calculate one by one)

sum(coffee) / nrow(tweets)
sum(mugs) / nrow(tweets)

Expected output (assuming I have more than 2 words up there)

Word   Freq
coffee  50
mugs    40
cup     64
pen     12

Recommended answer

You can create a vector of the words whose frequency/percentage you want to count and use sapply to calculate them.

words <- c('coffee', 'mugs')

data.frame(words, t(sapply(paste0('\\b', words, '\\b'), function(x) {
  tmp <- grepl(x, tweets$text)
  c(perc = mean(tmp) * 100, 
    Freq = sum(tmp))
})), row.names = NULL) -> result
result

#   words     perc Freq
#1 coffee 33.33333    1
#2   mugs 66.66667    2

sapply is similar to a for loop in that it iterates over each word defined in words. grepl returns TRUE/FALSE values indicating whether the word is present in each element of tweets$text; these values are stored in tmp. To count the frequency (the number of tweets containing the word) we use sum, and for the percentage we use mean. Word boundaries (\\b) are also added around the words so that they match only as whole words in the text, hence 'coffee' does not match 'coffees' etc.
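
The effect of the word boundary can be checked on a few toy strings. This is only a small sketch: the vector x is made up for illustration, and ignore.case = TRUE is borrowed from the question's code rather than from the answer above.

```r
# Hypothetical toy input, not taken from the original data
x <- c("coffee", "coffees", "Coffee break", "iced-coffee")

# Without boundaries, "coffee" also matches inside "coffees"
plain <- grepl("coffee", x)

# With \\b on both sides, only the whole word matches
bounded <- grepl("\\bcoffee\\b", x)

# ignore.case = TRUE additionally matches the capitalised "Coffee"
bounded_ci <- grepl("\\bcoffee\\b", x, ignore.case = TRUE)

# Show the three results side by side
data.frame(x, plain, bounded, bounded_ci)
```

Note that a hyphen is not a word character, so "iced-coffee" still counts as containing the whole word coffee.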

Data

tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs', 
                              'This has only mugs', 
                              'This has nothing'))
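
Because the question searches with ignore.case = TRUE and asks for a Word/Freq table, the same idea could be wrapped in a small helper. This is a sketch under assumptions: count_words is a hypothetical name (not part of any package), the word "cup" is added only to show a zero count, and Freq is the number of tweets containing the word, not the total number of occurrences.

```r
# Sample data repeated here so the block runs on its own
tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs',
                              'This has only mugs',
                              'This has nothing'))

# count_words() is a hypothetical helper name used for illustration
count_words <- function(text, words, ignore.case = TRUE) {
  # One pattern per word, wrapped in word boundaries
  patterns <- paste0("\\b", words, "\\b")
  # Number of elements of `text` that contain each word
  freq <- vapply(patterns,
                 function(p) sum(grepl(p, text, ignore.case = ignore.case)),
                 integer(1))
  data.frame(Word = words,
             Freq = unname(freq),
             Perc = unname(freq) / length(text) * 100)
}

res <- count_words(tweets$text, c("coffee", "mugs", "cup"))
res
```

On the sample data this reports one tweet for coffee, two for mugs, and none for cup.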
