Create a term frequency matrix using 2 columns from a csv file, in R?
Question
I'm new to R. I'm mining data stored in a CSV file: report summaries in one column, the report date in a second column, and the reporting agency in a third. I need to investigate how terms associated with 'fraud' have changed over time or vary by agency. I've filtered the rows containing the term 'fraud' and created a new CSV file.
How can I create a term frequency matrix with years as rows and terms as columns, so that I can find the most frequent terms and do some clustering?
Basically, I need a term frequency matrix of terms against year.
Input data: (csv)
**Year** **Summary** (around 300 words each)
1945 <text>
1985 <text>
2011 <text>
Desired output: (term frequency matrix)
term1 term2 term3 term4 .......
1945 3 5 7 8 .....
1985 1 2 0 7 .....
2011 . . .
Any help would be greatly appreciated.
Recommended answer
This doesn't use tm exactly; it uses qdap instead, since it fits your data type better:
library(qdap)
#create a fake data set (please do this in the future yourself)
dat <- data.frame(year=1945:(1945+10), summary=DATA$state)
## year summary
## 1 1945 Computer is fun. Not too fun.
## 2 1946 No it's not, it's dumb.
## 3 1947 What should we do?
## 4 1948 You liar, it stinks!
## 5 1949 I am telling the truth!
## 6 1950 How can we be certain?
## 7 1951 There is no way.
## 8 1952 I distrust you.
## 9 1953 What are you talking about?
## 10 1954 Shall we move on? Good then.
## 11 1955 I'm hungry. Let's eat. You already?
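With real data, the two-column data frame would come from the filtered CSV rather than from qdap's built-in DATA set. A minimal sketch, assuming the file has a header row with columns named Year and Summary (the temporary file and its contents below are made up purely so the example runs on its own):

```r
# Hypothetical stand-in for the filtered CSV; point csv_path at the real file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Year,Summary",
  "1945,\"Fraud was reported in the audit.\"",
  "1985,\"The agency found fraud, waste and abuse.\"",
  "1985,\"Fraud charges were filed.\"",
  "2011,\"Online fraud cases rose sharply.\""
), csv_path)

# Keep the summaries as character strings rather than factors
dat2 <- read.csv(csv_path, stringsAsFactors = FALSE)

# Lowercase the column names so they match the `year` / `summary` names used here
names(dat2) <- tolower(names(dat2))

str(dat2)
```

If the column names in the actual file differ, rename them (or adjust the later calls) accordingly.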
Now create the word frequency matrix (similar to a term document matrix):
t(with(dat, wfm(summary, year)))
## about already am are be ... you
## 1945 0 0 0 0 0 0
## 1946 0 0 0 0 0 0
## 1947 0 0 0 0 0 0
## 1948 0 0 0 0 0 1
## 1949 0 0 1 0 0 0
## 1950 0 0 0 0 1 0
## 1951 0 0 0 0 0 0
## 1952 0 0 0 0 0 1
## 1953 1 0 0 1 0 1
## 1954 0 0 0 0 0 0
## 1955 0 1 0 0 0 1
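To make the structure of that matrix concrete, here is a rough base R sketch of the same idea: split each summary into words, then cross-tabulate year against word. This is not qdap's implementation; the `tokenize` helper and the toy data are assumptions for illustration, and the tokenizer is deliberately crude (lowercase, letters and apostrophes only).

```r
# Toy data in the same shape as `dat` (year, summary); values are illustrative only
toy <- data.frame(
  year    = c(1945, 1945, 1985),
  summary = c("Fraud is bad.", "No fraud here.", "Fraud, fraud everywhere!"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase, turn every run of characters that are not
# letters or apostrophes into a space, then split on whitespace
tokenize <- function(x) {
  x <- gsub("[^a-z']+", " ", tolower(x))
  words <- strsplit(x, "\\s+")[[1]]
  words[nzchar(words)]
}

tokens <- lapply(toy$summary, tokenize)

# One row per (year, term) occurrence
long <- data.frame(
  year = rep(toy$year, lengths(tokens)),
  term = unlist(tokens),
  stringsAsFactors = FALSE
)

# Cross-tabulate: years as rows, terms as columns, counts as cells
tf <- unclass(table(long$year, long$term))
tf
```

Rows that share a year are counted together, so each year ends up as a single row regardless of how many reports it has.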
Or, as of qdap version 1.1.0, you can create a true DocumentTermMatrix:
with(dat, dtm(summary, year))
## A document-term matrix (11 documents, 41 terms)
##
## Non-/sparse entries: 51/400
## Sparsity : 89%
## Maximal term length: 8
## Weighting : term frequency (tf)
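For the two follow-up goals in the question (top-frequency terms and clustering), one possible approach with base R's stats functions is sketched below. It assumes the year-by-term counts are available as a plain numeric matrix; the small matrix here uses made-up numbers, and the average-linkage method and k = 2 are arbitrary choices for illustration.

```r
# Small illustrative year-by-term count matrix (made-up numbers)
tf <- matrix(
  c(3, 5, 0, 1,
    1, 2, 0, 7,
    0, 4, 6, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1945", "1985", "2011"),
                  c("audit", "fraud", "online", "waste"))
)

# Total count of each term over all years, most frequent first
top_terms <- sort(colSums(tf), decreasing = TRUE)
head(top_terms, 3)

# Hierarchical clustering of years by their term profiles
# (Euclidean distance between rows, average linkage)
hc <- hclust(dist(tf), method = "average")

# plot(hc) would draw the dendrogram; cutree assigns each year to a cluster
groups <- cutree(hc, k = 2)
groups
```

To cluster terms instead of years, the same calls can be applied to the transposed matrix, `t(tf)`. With 300-word summaries, raw counts may be dominated by common words, so removing stop words or normalizing the rows before clustering is worth considering.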