R text mining: how to change texts in an R data frame column into several columns with word frequencies?
Question
I have a data frame with 4 columns. Column 1 contains IDs, column 2 contains texts (about 100 words each), and columns 3 and 4 contain labels.
Now I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. The column names should be the words themselves, and each column should hold that word's count in each text (0 or more per text).
I tried some functions from the tm package, but so far the results have been unsatisfactory. Does anyone have an idea how to deal with this problem or where to start? Is there a package that can do the job? The data frame has these columns:
id texts label1 label2
Recommended answer
Well, let's work through the issues then...
I'm guessing you have a data.frame that looks like this:
person sex adult state code
1 sam m 0 Computer is fun. Not too fun. K1
2 greg m 0 No it's not, it's dumb. K2
3 teacher m 1 What should we do? K3
4 sam m 0 You liar, it stinks! K4
5 greg m 0 I am telling the truth! K5
6 sally f 0 How can we be certain? K6
7 greg m 0 There is no way. K7
8 sam m 0 I distrust you. K8
9 sally f 0 What are you talking about? K9
10 researcher f 1 Shall we move on? Good then. K10
11 greg m 0 I'm hungry. Let's eat. You already? K11
This data set comes from the qdap package. To get qdap, use install.packages("qdap").
Now, to make the reproducible example I was talking about, do with your own data set what I'm doing here with the DATA data set from qdap:
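library(qdap)      # provides the DATA example set and wfm()
DATA               # print the whole data set
dput(head(DATA))   # emit R code that recreates the first six rows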
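To illustrate what dput() gives you, here is a minimal sketch on a small stand-in data frame. The data frame `df` and its contents are assumptions made up for illustration; only the column names come from the question. The printed output of dput() is plain R code, so anyone can paste it into their session to rebuild the same object.

```r
# Hypothetical stand-in for the asker's data (column names taken from the question)
df <- data.frame(
  id = 1:2,
  texts = c("Computer is fun. Not too fun.", "No it's not, it's dumb."),
  label1 = c("a", "b"),
  label2 = c("x", "y"),
  stringsAsFactors = FALSE
)

# dput() prints R code that rebuilds the object; paste that output into a question
dput(head(df))

# Round trip: capture the printed code as text and evaluate it to get the object back
code <- paste(capture.output(dput(df)), collapse = "\n")
rebuilt <- eval(parse(text = code))
```

Comparing `rebuilt` with `df` (for example with identical()) is a quick way to check that the pasted snippet really reproduces your data.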
OK, now for your original problem, I think wfm will do what you want:
freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
data.frame(DATA, freqs, check.names = FALSE)
If you only want the top N words, use an ordering technique like the one I use here:
freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
ords <- rev(sort(colSums(freqs)))[1:9] #top 9 words
top9 <- freqs[, names(ords)] #grab those columns from freqs
data.frame(DATA, top9, check.names = FALSE) #put it together
The result looks like this:
> data.frame(DATA, top9, check.names = FALSE)
person sex adult state code you we what not no it's is i fun
1 sam m 0 Computer is fun. Not too fun. K1 0 0 0 1 0 0 1 0 2
2 greg m 0 No it's not, it's dumb. K2 0 0 0 1 1 2 0 0 0
3 teacher m 1 What should we do? K3 0 1 1 0 0 0 0 0 0
4 sam m 0 You liar, it stinks! K4 1 0 0 0 0 0 0 0 0
5 greg m 0 I am telling the truth! K5 0 0 0 0 0 0 0 1 0
6 sally f 0 How can we be certain? K6 0 1 0 0 0 0 0 0 0
7 greg m 0 There is no way. K7 0 0 0 0 1 0 1 0 0
8 sam m 0 I distrust you. K8 1 0 0 0 0 0 0 1 0
9 sally f 0 What are you talking about? K9 1 0 1 0 0 0 0 0 0
10 researcher f 1 Shall we move on? Good then. K10 0 1 0 0 0 0 0 0 0
11 greg m 0 I'm hungry. Let's eat. You already? K11 1 0 0 0 0 0 0 0 0
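If installing qdap is not an option, the same idea can be sketched in base R. This is not what wfm does internally; it is a simplified stand-in that lower-cases the text, splits on anything other than letters and apostrophes, and counts tokens per row. The data frame `df`, the tokenising regex, and the choice of `n` are all assumptions for illustration, so counts may differ from qdap's on real text.

```r
# Hypothetical stand-in data; column names follow the question
df <- data.frame(
  id = 1:3,
  texts = c("Computer is fun. Not too fun.",
            "No it's not, it's dumb.",
            "There is no way."),
  stringsAsFactors = FALSE
)

# Lower-case each text and split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(df$texts), "[^a-z']+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])  # drop empty strings

# Fixed vocabulary so every row gets the same set of columns
vocab <- sort(unique(unlist(tokens)))

# One row per text, one column per word, counts as cell values
freqs <- t(vapply(tokens,
                  function(w) as.integer(table(factor(w, levels = vocab))),
                  integer(length(vocab))))
colnames(freqs) <- vocab

# Keep the n most frequent words overall and bind them to the original columns
n <- 3
top <- names(sort(colSums(freqs), decreasing = TRUE))[seq_len(min(n, length(vocab)))]
result <- data.frame(df, freqs[, top, drop = FALSE], check.names = FALSE)
```

As in the qdap version, check.names = FALSE keeps column names such as it's intact instead of letting data.frame() rewrite them into syntactically valid names.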