R:在R中的文档术语矩阵中查找与文档中的术语“欺诈"相关的前10个术语 [英] R : Finding the top 10 terms associated with the term 'fraud' across documents in a Document Term Matrix in R

查看:170
本文介绍了R:在R中的文档术语矩阵中查找与文档中的术语“欺诈"相关的前10个术语的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个以年份命名的39个文本文件的语料库-1945.txt,1978.txt .... 2013.txt.

I have a corpus of 39 text files named by the year - 1945.txt, 1978.txt.... 2013.txt.

我已将它们导入R并使用TM包创建了文档术语矩阵. 我正在尝试调查从1945年到2013年,与欺诈"一词相关的字词是如何变化的. 所需的输出将是一个39 x 10/5的矩阵,其中以年作为行标题,将前10或5个词作为列.

I've imported them into R and created a Document Term Matrix using TM package. I'm trying to investigate how words associated with term'fraud' have changed over years from 1945 to 2013. The desired output would be a 39 by 10/5 matrix with years as row titles and top 10 or 5 terms as columns.

任何帮助将不胜感激.

谢谢.

我的TDM的结构:

> str(ytdm)
List of 6
 $ i       : int [1:6791] 5 7 8 17 32 41 42 55 58 71 ...
 $ j       : int [1:6791] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:6791] 2 4 2 2 2 8 4 3 2 2 ...
 $ nrow    : int 193
 $ ncol    : int 39
 $ dimnames:List of 2
  ..$ Terms: chr [1:193] "abus" "access" "account" "accur" ...
  ..$ Docs : chr [1:39] "1947" "1976" "1977" "1978" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

My ideal output is like this:


1947   account accur gao medicine fed ......
1948   access  .............
.
.
.
.
.
.

推荐答案

您的示例无法复制,但findAssocs()可能正是您想要的.由于您只想每年查看一次合作伙伴,因此每年都需要一个dtm.

Your example can't be replicated but findAssocs() is probably what you're looking for. Since you want to only look at associates on a yearly basis you'll need a dtm for each year.

> library(tm)
> data(crude)
> # i don't have your data so pretend this is corpus of docs for each year
> names(crude) <- rep(c("1999","2000"),10)
> # create a dtm for each year
> dtm.list <- lapply(unique(names(crude)),function(x) TermDocumentMatrix(crude[names(crude)==x]))
> # get associations for each year
> assoc.list <- lapply(dtm.list,findAssocs,term="oil",corlimit=0.7)
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
 prices barrel. 
   0.79    0.70 

$`2000`
     15.8      opec       and      said   prices,      sell       the  analysts   clearly     fixed 
     0.94      0.94      0.92      0.92      0.91      0.91      0.88      0.85      0.85      0.85 
     late   meeting     never      that    trying       who    winter emergency     above       but 
     0.85      0.85      0.85      0.85      0.85      0.85      0.85      0.84      0.83      0.83 
    world      they       mln    market agreement    before       bpd    buyers    energy    prices 
     0.82      0.80      0.79      0.78      0.75      0.75      0.75      0.75      0.75      0.75 
      set   through     under      will       not       its 
     0.75      0.75      0.75      0.74      0.72      0.70 

> # or if you want the 5 top terms
> assoc.list <- lapply(dtm.list,function(x) names(findAssocs(x,"oil",0)[1:5]))
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
[1] "prices"   "barrel."  "said."    "minister" "arabian" 

$`2000`
[1] "15.8"    "opec"    "and"     "said"    "prices,"

这篇关于R:在R中的文档术语矩阵中查找与文档中的术语“欺诈"相关的前10个术语的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆