R text mining: how to turn the texts in a data frame column into several columns of word frequencies?

Question

I have a data frame with 4 columns. Column 1 contains IDs, column 2 contains texts (about 100 words each), and columns 3 and 4 contain labels.

Now I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. The column names should be the words themselves, and each column should hold that word's frequency in each text (ranging from 0 upward).

I tried some functions from the tm package, but so far without satisfactory results. Does anyone have an idea how to approach this problem or where to start? Is there a package that can do the job?

id  texts   label1    label2

Recommended answer

Well, let's work through the issues then...

I'm guessing you have a data.frame that looks like this:

       person sex adult                                 state code
1         sam   m     0         Computer is fun. Not too fun.   K1
2        greg   m     0               No it's not, it's dumb.   K2
3     teacher   m     1                    What should we do?   K3
4         sam   m     0                  You liar, it stinks!   K4
5        greg   m     0               I am telling the truth!   K5
6       sally   f     0                How can we be certain?   K6
7        greg   m     0                      There is no way.   K7
8         sam   m     0                       I distrust you.   K8
9       sally   f     0           What are you talking about?   K9
10 researcher   f     1         Shall we move on?  Good then.  K10
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

This data set comes from the qdap package. To get qdap, use install.packages("qdap").

Now, to make the reproducible example I was talking about from your own data set, do with it what I do here with the DATA data set from qdap:

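library(qdap)       # provides the DATA example data set and wfm()
DATA                # print the full data set
dput(head(DATA))    # text representation of the first rows, suitable for pasting into a question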

OK, now for your original problem, I think wfm (qdap's word frequency matrix function) will do what you want:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
data.frame(DATA, freqs, check.names = FALSE)
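For readers without qdap installed, the same idea (one count column per word, one row per text) can be sketched in base R. This is only an illustration under stated assumptions, not how wfm is implemented: the toy data frame `df`, its column names, and the simple tokenization rule (lowercase, keep only letters and apostrophes) are all hypothetical choices, and wfm may tokenize differently.

```r
# Hypothetical toy data; the column names are illustrative only.
df <- data.frame(
  id    = 1:3,
  texts = c("Computer is fun. Not too fun.",
            "No it's not, it's dumb.",
            "What should we do?"),
  stringsAsFactors = FALSE
)

# Assumed tokenization: lowercase, replace anything that is not a letter,
# apostrophe, or space with a space, then split on runs of whitespace.
clean  <- gsub("[^a-z' ]", " ", tolower(df$texts))
tokens <- strsplit(trimws(clean), "\\s+")

# Shared vocabulary across all texts.
vocab <- sort(unique(unlist(tokens)))

# One row of counts per text, one column per vocabulary word.
freqs <- do.call(rbind, lapply(tokens, function(w) {
  tabulate(match(w, vocab), nbins = length(vocab))
}))
colnames(freqs) <- vocab

# check.names = FALSE keeps column names such as "it's" unchanged.
result <- data.frame(df, freqs, check.names = FALSE)
```

Here `result` has the original columns followed by one integer column per word, with 0 wherever a word does not occur in a text.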

If you only want the top N words, use an ordering technique like the one I use here:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
ords <- rev(sort(colSums(freqs)))[1:9]      #top 9 words
top9 <- freqs[, names(ords)]                #grab those columns from freqs  
data.frame(DATA, top9, check.names = FALSE) #put it together

The result looks like this:

> data.frame(DATA, top9, check.names = FALSE)
       person sex adult                                 state code you we what not no it's is i fun
1         sam   m     0         Computer is fun. Not too fun.   K1   0  0    0   1  0    0  1 0   2
2        greg   m     0               No it's not, it's dumb.   K2   0  0    0   1  1    2  0 0   0
3     teacher   m     1                    What should we do?   K3   0  1    1   0  0    0  0 0   0
4         sam   m     0                  You liar, it stinks!   K4   1  0    0   0  0    0  0 0   0
5        greg   m     0               I am telling the truth!   K5   0  0    0   0  0    0  0 1   0
6       sally   f     0                How can we be certain?   K6   0  1    0   0  0    0  0 0   0
7        greg   m     0                      There is no way.   K7   0  0    0   0  1    0  1 0   0
8         sam   m     0                       I distrust you.   K8   1  0    0   0  0    0  0 1   0
9       sally   f     0           What are you talking about?   K9   1  0    1   0  0    0  0 0   0
10 researcher   f     1         Shall we move on?  Good then.  K10   0  1    0   0  0    0  0 0   0
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11   1  0    0   0  0    0  0 0   0
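The top-N selection used above can also be sketched on its own with a small, made-up count matrix. The matrix values and the names `totals`, `top_n`, `keep` and `top` are hypothetical and only serve to show the ordering step.

```r
# Hypothetical count matrix: 3 texts (rows) by 3 words (columns).
freqs <- matrix(c(2, 0, 1,
                  0, 1, 0,
                  0, 3, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("fun", "not", "we")))

# Total occurrences of each word across all texts.
totals <- colSums(freqs)

# Names of the top_n most frequent words; min() guards against asking
# for more words than there are columns.
top_n <- 2
keep  <- names(sort(totals, decreasing = TRUE))[seq_len(min(top_n, length(totals)))]

# drop = FALSE keeps the result a matrix even when only one word is kept.
top <- freqs[, keep, drop = FALSE]
```

Compared with indexing by `[1:9]` directly, capping the index at the number of available columns avoids NA column names when the vocabulary has fewer words than requested.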
