R: Finding the top 10 terms associated with the term 'fraud' across documents in a Document Term Matrix
Question
I have a corpus of 39 text files named by year: 1945.txt, 1978.txt, ..., 2013.txt.
I've imported them into R and created a Document Term Matrix using the tm package. I'm trying to investigate how the words associated with the term 'fraud' have changed over the years from 1945 to 2013. The desired output would be a 39 x 10 (or 39 x 5) matrix with years as row names and the top 10 or 5 terms as columns.
Any help would be greatly appreciated.
Thanks.
The structure of my TDM:
> str(ytdm)
List of 6
$ i : int [1:6791] 5 7 8 17 32 41 42 55 58 71 ...
$ j : int [1:6791] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:6791] 2 4 2 2 2 8 4 3 2 2 ...
$ nrow : int 193
$ ncol : int 39
$ dimnames:List of 2
..$ Terms: chr [1:193] "abus" "access" "account" "accur" ...
..$ Docs : chr [1:39] "1947" "1976" "1977" "1978" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
My ideal output is like this:
1947 account accur gao medicine fed ......
1948 access .............
.
.
.
.
.
.
Recommended answer
Your example isn't reproducible, but findAssocs() is probably what you're looking for. Since you only want to look at associations on a yearly basis, you'll need a separate DTM for each year.
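For context, findAssocs() scores terms by how strongly their counts correlate with the search term's counts across the documents of a matrix, and corlimit is the lower bound on that correlation. The base R sketch below imitates the idea on a small dense matrix; the counts and the terms "audit", "budget" and "account" are made up for illustration, and it is not tm's actual implementation.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# All counts are invented purely for illustration.
tdm <- matrix(
  c(2, 0, 4, 1,   # fraud
    1, 0, 2, 0,   # audit
    0, 3, 0, 2,   # budget
    5, 1, 9, 3),  # account
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("fraud", "audit", "budget", "account"),
                  Docs  = c("d1", "d2", "d3", "d4"))
)

# Correlate the "fraud" row with every other term's row across documents.
target <- "fraud"
others <- setdiff(rownames(tdm), target)
assoc <- sapply(others, function(term) cor(tdm[target, ], tdm[term, ]))

# Keep terms at or above the limit, strongest first (plays the role of corlimit).
corlimit <- 0.7
top <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
print(round(top, 2))
```

Because the score is a correlation across documents, a matrix needs several documents for it to be defined. If each year is a single file, one possible workaround (my assumption, not part of the answer) is to split each year's text into smaller units such as paragraphs before building the per-year matrices.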
> library(tm)
> data(crude)
> # I don't have your data, so pretend this is a corpus with several documents per year
> names(crude) <- rep(c("1999", "2000"), 10)
> # create a term-document matrix for each year
> dtm.list <- lapply(unique(names(crude)), function(x) TermDocumentMatrix(crude[names(crude) == x]))
> # get the terms whose correlation with "oil" is at least 0.7, per year
> assoc.list <- lapply(dtm.list, findAssocs, term = "oil", corlimit = 0.7)
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
 prices barrel.
   0.79    0.70

$`2000`
     15.8      opec       and      said   prices,      sell       the  analysts   clearly     fixed
     0.94      0.94      0.92      0.92      0.91      0.91      0.88      0.85      0.85      0.85
     late   meeting     never      that    trying       who    winter emergency     above       but
     0.85      0.85      0.85      0.85      0.85      0.85      0.85      0.84      0.83      0.83
    world      they       mln    market agreement    before       bpd    buyers    energy    prices
     0.82      0.80      0.79      0.78      0.75      0.75      0.75      0.75      0.75      0.75
      set   through     under      will       not       its
     0.75      0.75      0.75      0.74      0.72      0.70
> # or, if you only want the top 5 terms for each year
> assoc.list <- lapply(dtm.list, function(x) names(findAssocs(x, "oil", 0)[1:5]))
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
[1] "prices" "barrel." "said." "minister" "arabian"
$`2000`
[1] "15.8" "opec" "and" "said" "prices,"
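To get from that list to the years-by-terms matrix the question asks for, the per-year vectors can be stacked row by row. This is a minimal base R sketch; the assoc.list literal below just re-types the output printed above so the snippet runs on its own, and the names n_top, padded and top_terms are hypothetical.

```r
# Hypothetical per-year result, shaped like the top-5 list printed above:
# a named list with one character vector of terms per year.
assoc.list <- list(
  "1999" = c("prices", "barrel.", "said.", "minister", "arabian"),
  "2000" = c("15.8", "opec", "and", "said", "prices,")
)

n_top <- 5

# Index each vector to exactly n_top entries; years with fewer
# associated terms are padded with NA so the rows line up.
padded <- lapply(assoc.list, function(terms) terms[seq_len(n_top)])

# Stack into a years x n_top character matrix; the list names become row names.
top_terms <- do.call(rbind, padded)
colnames(top_terms) <- paste0("term", seq_len(n_top))
print(top_terms)
```

With 39 years and n_top set to 10, the same steps would give a 39 x 10 matrix. One caveat, stated as an assumption about your installed version: I believe recent tm releases return a list keyed by the search term from findAssocs(), in which case you would extract that element first, for example findAssocs(x, "oil", 0)[["oil"]], before taking names().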