Frequency Per Term - R TM DocumentTermMatrix

Problem description

I'm very new to R and cannot quite wrap my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package. It contains the terms and their frequencies, but I cannot figure out how to access them.

Ideally, I would like:

    Term  # 
    "the" 200 
    "is"  400 
    "a"   200 

Currently, my code is:

    library(tm)
    common.words <- c("amp","@RT","I","http","https", stopwords("english"), "you")
    x <- Corpus(VectorSource(results)) 
    x <- tm_map(x, stripWhitespace) 
    x <- tm_map(x, removeNumbers) 
    x <- tm_map(x, removePunctuation) 
    x <- tm_map(x, stripWhitespace)

    dtm <- DocumentTermMatrix(x)
    for (i in seq_along(common.words)) {
        dtm <- dtm[, !colnames(dtm) %in% common.words[i]]
    }

This is the output from str(dtm):

    List of 6
     $ i       : int [1:9769] 1 1 1 1 1 1 1 1 2 2 ...
     $ j       : int [1:9769] 1596 1684 1858 2112 2175 2490 2714 2814 873 961 ...
     $ v       : num [1:9769] 1 1 2 1 1 2 1 1 1 1 ...
     $ nrow    : int 1477
     $ ncol    : int 3201
     $ dimnames:List of 2
      ..$ Docs : chr [1:1477] "1" "2" "3" "4" ...
      ..$ Terms: chr [1:3201] "\u0093\u0085a" "aardvark" "aaron" "abbie" ...
     - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
     - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Thanks,

-A

Recommended answer

The data appear to be stored in a sparse (triplet) matrix layout. The frequencies appear to be in the "v" element, and you can get the count for a term by looking up its position in the Terms dimnames, which is what the "j" indices refer to. Why not provide dput(head(results, 30)) so that your code (and your SO audience) has something to work on? After playing around with the examples in the package, I suspect you actually want something along the lines of:

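    tdm <- TermDocumentMatrix(x)
    # as.matrix() turns the selected rows into a dense terms-by-documents matrix;
    # inspect() is mainly meant for printing
    z <- as.matrix(tdm[c("the", "is", "a"), ])
    rowSums(z)  # total count of each term across all documents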
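
To make the sparse layout described above more concrete, here is a minimal base R sketch. The `toy` list is a hypothetical stand-in with made-up values that only mimics the `i`, `j`, `v` and `dimnames` fields shown by str(dtm); it is not a real DocumentTermMatrix, and the names `toy`, `totals` and `freq` are assumptions for illustration.

```r
# Hypothetical stand-in for the fields shown by str(dtm); values are made up.
toy <- list(
  i = c(1L, 1L, 2L, 2L, 3L),   # document (row) index of each non-zero cell
  j = c(1L, 3L, 1L, 2L, 3L),   # term (column) index of each non-zero cell
  v = c(2, 1, 1, 4, 3),        # count stored in that cell
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Docs  = c("1", "2", "3"),
                  Terms = c("aardvark", "aaron", "abbie"))
)

# Start every term at zero so terms with no entries still appear.
totals <- numeric(toy$ncol)
names(totals) <- toy$dimnames$Terms

# Sum the counts in v, grouped by the term index j.
sums <- tapply(toy$v, toy$j, sum)
totals[as.integer(names(sums))] <- sums

# Arrange the result as a Term / Count table, most frequent first.
freq <- data.frame(Term = names(totals), Count = unname(totals),
                   stringsAsFactors = FALSE)
freq <- freq[order(-freq$Count), ]
print(freq)
```

On a real DocumentTermMatrix, colSums(as.matrix(dtm)) should give the same per-term totals at the cost of building a dense matrix, and slam::col_sums(dtm) is an option that avoids that conversion.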
