In R, how can I count specific words in a corpus?


Question

I need to count the frequency of specific words. Lots of words. I know how to count all of the words as one group (see below), but I would like to get the count of each specific word.

Here is what I have so far:

library(quanteda)

# helper to count pattern matches in each element of x
strcount <- function(x, pattern, split) {
  unlist(lapply(strsplit(x, split), function(z) na.omit(length(grep(pattern, z)))))
}

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df <- data.frame(txt)
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
corp <- corpus(df, text_field = "txt")

# count terms and save output to "overview"
overview <- dfm(corp, dictionary = mydict)
overview <- convert(overview, to = "data.frame")
As you can see, the counts for "clouds" and "storms" are rolled into the "all_terms" category in the resulting data.frame. Is there an easy way to get the count of every term in "mydict" in its own column, without writing code for each individual term?

E.g.
clouds, storms
1, 1

Rather than 
all_terms
2

Answer

You want to use your dictionary values as the pattern in tokens_select(), rather than in a lookup function, which is what dfm(x, dictionary = ...) does. Here's how:

library("quanteda")
## Package version: 2.1.2

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."

mydict <- dictionary(list(all_terms = c("clouds", "storms")))

This creates the dfm where each column is a term, not the dictionary key:

dfmat <- tokens(txt) %>%
  tokens_select(mydict) %>%
  dfm()

dfmat
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    clouds storms
##   text1      1      1

You can turn this into a data.frame of the counts in two ways:

convert(dfmat, to = "data.frame")
##   doc_id clouds storms
## 1  text1      1      1

textstat_frequency(dfmat)
##   feature frequency rank docfreq group
## 1  clouds         1    1       1   all
## 2  storms         1    1       1   all
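Note that this output comes from quanteda 2.1.2 (the version shown above); in quanteda 3.x the textstat_*() helpers moved to the separate quanteda.textstats package, so on a newer installation you would attach that package first. A minimal sketch:

# quanteda >= 3.0: textstat_frequency() lives in quanteda.textstats
library("quanteda.textstats")
textstat_frequency(dfmat)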

While the dictionary is valid input for pattern (see ?pattern), you could also simply feed the character vector of its values to tokens_select():

# no need for dictionary
tokens(txt) %>%
  tokens_select(c("clouds", "storms")) %>%
  dfm()
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    clouds storms
##   text1      1      1
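To tie this back to the data.frame workflow in the question, here is a minimal sketch using a hypothetical df with several texts in a txt column (the three example strings are made up for illustration); it should yield one row per document and one column per term:

# hypothetical data.frame with several documents in a "txt" column
df <- data.frame(txt = c(
  "gathering clouds and raging storms",
  "clouds, clouds, and more clouds",
  "no matching terms in this one"
))

corpus(df, text_field = "txt") %>%
  tokens() %>%
  tokens_select(c("clouds", "storms")) %>%
  dfm() %>%
  convert(to = "data.frame")
##   doc_id clouds storms
## 1  text1      1      1
## 2  text2      3      0
## 3  text3      0      0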
