R Text mining - how to change texts in R data frame column into several columns with bigram frequencies?


Question

As a follow-up to the question "R Text mining - how to change texts in R data frame column into several columns with word frequencies?", I am wondering how to build columns of bigram frequencies instead of single-word frequencies. Again, many thanks in advance!

This is the example data frame (thanks to Tyler Rinker).

      person sex adult                                 state code
1         sam   m     0         Computer is fun. Not too fun.   K1
2        greg   m     0               No it's not, it's dumb.   K2
3     teacher   m     1                    What should we do?   K3
4         sam   m     0                  You liar, it stinks!   K4
5        greg   m     0               I am telling the truth!   K5
6       sally   f     0                How can we be certain?   K6
7        greg   m     0                      There is no way.   K7
8         sam   m     0                       I distrust you.   K8
9       sally   f     0           What are you talking about?   K9
10 researcher   f     1         Shall we move on?  Good then.  K10
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

The data set above is qdap's built-in DATA:

library(qdap); DATA

Recommended answer

The dev version of qdap (which should reach CRAN within the next few days) computes ngrams; for now you'll need to use the dev version. On the toy data set this is fast, but on a larger data set such as qdap's mraja1 it takes about 5 minutes to complete. You could:

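  1. Be more selective about the bigrams (i.e., don't use all of them, as there will be many)
  2. Wait it out
  3. Run it in parallel
  4. Come up with another way to do it
  5. Get a faster computer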
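
As one possible take on option 4, here is a minimal base R sketch that builds a person-by-bigram count table without qdap. It is not the qdap implementation: the helper `get_bigrams` and the small data frame `df` are hypothetical names introduced for illustration, and the sketch assumes that punctuation can simply be dropped and that bigrams may span sentence boundaries.

```r
# Toy data: a few rows modelled on the example data frame (hypothetical `df`)
df <- data.frame(
  person = c("sam", "greg", "sam", "greg"),
  state  = c("Computer is fun. Not too fun.",
             "No it's not, it's dumb.",
             "You liar, it stinks!",
             "There is no way."),
  stringsAsFactors = FALSE
)

# Split one text into lower-case words and pair each word with its successor.
# Assumption: sentence boundaries are ignored, so "fun not" counts as a bigram.
get_bigrams <- function(text) {
  cleaned <- gsub("[^a-z' ]", "", tolower(text))
  words <- strsplit(cleaned, " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigram_list <- lapply(df$state, get_bigrams)

# Long format: one row per bigram occurrence, then cross-tabulate by person
long <- data.frame(
  person = rep(df$person, lengths(bigram_list)),
  bigram = unlist(bigram_list),
  stringsAsFactors = FALSE
)
counts <- table(long$person, long$bigram)
```

Calling `as.data.frame.matrix(counts)` turns the table into a data frame with one row per person and one column per bigram. Results may differ from qdap's output because the tokenization rules here are deliberately simple.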

Here's the code to get the dev version of qdap and run the bigram search:

library(devtools)
## current devtools expects "user/repo"; the older form was install_github("qdap", "trinker")
install_github("trinker/qdap")
library(qdap)

## this gets the bigrams
bigrams <- sapply(ngrams(DATA$state)[[c("all_n", "n_2")]], paste, collapse=" ")

## This searches by grouping variable for bigram use
termco(DATA$state, DATA$person, bigrams)


## To get raw values
termco(DATA$state, DATA$person, bigrams)[["raw"]]
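
The advice about being more selective can be sketched in base R by keeping only bigrams that occur at least a given number of times before searching for them. This is an illustration under assumptions: `all_bigrams` is a hypothetical vector with one element per bigram occurrence (duplicates kept), and `min_count` is an arbitrary threshold, not something qdap prescribes.

```r
# Hypothetical vector of bigram occurrences, duplicates kept
all_bigrams <- c("it is", "is fun", "it is", "no way", "is fun", "it is")

# Count how often each bigram occurs
freq <- table(all_bigrams)

# Keep only bigrams seen at least `min_count` times (assumed threshold)
min_count <- 2
frequent <- names(freq)[freq >= min_count]
```

The shorter `frequent` vector could then be passed as the search terms in place of the full bigram list, which should reduce the work on larger data sets.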
