R: removeCommonTerms with Quanteda package?

Question

The following removeCommonTerms function exists for the tm package:

removeCommonTerms <- function (x, pct) 
{
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
        is.numeric(pct), pct > 0, pct < 1)
    m <- if (inherits(x, "DocumentTermMatrix")) 
        t(x)
    else x
    t <- table(m$i) < m$ncol * (pct)
    termIndex <- as.numeric(names(t[t]))
    if (inherits(x, "DocumentTermMatrix")) 
        x[, termIndex]
    else x[termIndex, ]
}

Now I would like to remove overly common terms with the quanteda package. I could do this either before creating the document-feature matrix or on the document-feature matrix itself.

How can I remove overly common terms with the quanteda package in R?

Recommended answer

You want the dfm_trim function. From ?dfm_trim:

max_docfreq: maximum number or fraction of documents in which a feature appears, above which features will be removed. (Default is no upper limit.)
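
To make the cutoff concrete, here is a minimal base R sketch of what a fractional document-frequency threshold does. It assumes a toy count matrix; the matrix `m` and every name in it are made up for illustration, and this is not quanteda's implementation.

```r
# Toy document-term count matrix (hypothetical data):
# rows are documents, columns are terms.
m <- matrix(
  c(3, 2, 1, 4,   # "the" occurs in all 4 documents
    1, 0, 0, 2,   # "cat" occurs in 2 documents
    0, 1, 1, 1,   # "dog" occurs in 3 documents
    0, 0, 1, 0),  # "run" occurs in 1 document
  nrow = 4,
  dimnames = list(paste0("doc", 1:4), c("the", "cat", "dog", "run"))
)

# Document frequency: the number of documents in which each term appears.
docfreq <- colSums(m > 0)

# A fractional cutoff of 0.8 over 4 documents gives a threshold of 3.2.
max_docfreq <- 0.8
threshold <- max_docfreq * nrow(m)

# Keep only the terms whose document frequency does not exceed the threshold.
trimmed <- m[, docfreq <= threshold, drop = FALSE]
colnames(trimmed)
```

Only "the" appears in more than 3.2 documents, so it is the only column dropped. The removeCommonTerms function in the question applies the same idea to a tm matrix, keeping terms that occur in fewer than pct times the number of documents.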

This requires the newest version of quanteda (fresh on CRAN at the time of writing).

packageVersion("quanteda")
## [1] ‘0.9.9.3’

inaugdfm <- dfm(data_corpus_inaugural)

dfm_trim(inaugdfm, max_docfreq = .8)
## Removing features occurring: 
##   - in more than 0.8 * 57 = 45.6 documents: 93
##   Total features removed: 93 (1.01%).
## Document-feature matrix of: 57 documents, 9,081 features (92.4% sparse).
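
In later quanteda releases the call looks a little different. As far as I know, dfm() is meant to be applied to a tokens object rather than directly to a corpus, and dfm_trim() takes a docfreq_type argument, so a fractional threshold has to be requested with docfreq_type = "prop" (otherwise the value is read as a document count). A sketch under those assumptions, worth checking against ?dfm_trim for your installed version:

```r
library(quanteda)

# Tokenize the corpus first, then build the document-feature matrix.
toks <- tokens(data_corpus_inaugural)
inaugdfm <- dfm(toks)

# Drop features that occur in more than 80% of the documents.
# docfreq_type = "prop" makes max_docfreq a proportion instead of a count.
trimmed <- dfm_trim(inaugdfm, max_docfreq = 0.8, docfreq_type = "prop")

# Compare the number of features before and after trimming.
nfeat(inaugdfm)
nfeat(trimmed)
```

The exact feature counts depend on the quanteda version and on how many speeches the bundled corpus contains, so they will not necessarily match the output shown above.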
