Split up ngrams in (sparse) document-feature matrix


Question

This is a follow up question to this one. There, I asked if it's possible to split up ngram-features in a document-feature matrix (dfm-class from the quanteda-package) in such a way that e.g. bigrams result in two separate unigrams.

For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").

library(quanteda)

eg.txt <- c('increase in_the great plenary', 
            'great plenary emission_reduction', 
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)

There was a nice answer to this example, which works absolutely fine for relatively small matrices like the one above. However, as soon as the matrix gets bigger, I constantly run into the following memory error.

> # turn the dfm into a data frame
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) : 
  Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Hence, is there a more memory efficient way to solve this ngram-problem or to deal with large (sparse) matrices/data frames? Thank you in advance!

Recommended answer

The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since a typical document-feature matrix is often 90% sparse or more, this means you are creating something larger than your memory can handle. The solution: use dfm handling functions to maintain the sparsity.
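To get a feel for the size difference, here is a rough sketch that uses only the Matrix package (which dfm objects build on). The dimensions and the number of non-zero cells are made-up values for illustration, not figures from the question.

```r
library(Matrix)

# Hypothetical dimensions, chosen only for illustration
n_docs    <- 10000
n_feats   <- 50000
n_nonzero <- 200000

set.seed(1)
# Random sparse count matrix; repeated (i, j) pairs are summed
m <- sparseMatrix(
  i = sample(n_docs, n_nonzero, replace = TRUE),
  j = sample(n_feats, n_nonzero, replace = TRUE),
  x = rep(1, n_nonzero),
  dims = c(n_docs, n_feats)
)

# Memory actually used by the sparse representation
sparse_bytes <- as.numeric(object.size(m))

# Memory a dense double matrix of the same shape would need (8 bytes per cell)
dense_bytes <- prod(dim(m)) * 8

sparse_bytes / 1024^2   # size in megabytes
dense_bytes / 1024^3    # size in gigabytes
```

The sparse object stays in the megabyte range, while the dense equivalent would need several gigabytes, which is why anything that densifies the matrix should be avoided.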

Note that this is both a better solution than the one proposed in the linked question and one that should work efficiently for your much larger object.

Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.

# function to split and duplicate counts in features containing 
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
    # separate the unigrams
    x_unigrams <-  dfm_remove(x, concatenator, valuetype = "regex")

    # separate the ngrams
    x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
    # split into components
    split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
    # get a repeated index for the ngram feature names
    index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
    # subset the ngram matrix using the (repeated) ngram feature names
    x_split_ngrams <- x_ngrams[, index_split_ngrams]
    # assign the ngram dfm the feature names of the split ngrams
    colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)

    # return the column concatenation of unigrams and split ngrams
    suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}

So:

dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1

Here, splitting ngrams results in new "unigrams" of the same feature name. You can (re)combine them efficiently with dfm_compress():

dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1
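The same split-and-recombine idea can also be sketched with the Matrix package alone, which may help if you are working with a plain sparse matrix rather than a dfm. This is an illustration of the technique, not how quanteda implements dfm_compress(); the toy matrix below is typed in by hand to mirror the example, and the names map, targets, and source_col are made up for this sketch.

```r
library(Matrix)

# Toy sparse count matrix mirroring the example (3 documents, 6 features)
feats <- c("increase", "in_the", "great", "plenary",
           "emission_reduction", "emission_increase")
m <- sparseMatrix(
  i = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  j = c(1, 2, 3, 4, 3, 4, 5, 1, 2, 5, 6),
  x = rep(1, 11),
  dims = c(3, 6),
  dimnames = list(paste0("text", 1:3), feats)
)

# Split every feature name on the concatenator; unigrams yield a single part
parts <- strsplit(feats, "_", fixed = TRUE)
new_names <- unlist(parts)
# Original column that each part inherits its counts from
source_col <- rep(seq_along(feats), lengths(parts))

# Sparse 0/1 mapping from original columns to the unique split features
targets <- unique(new_names)
map <- sparseMatrix(
  i = source_col,
  j = match(new_names, targets),
  x = rep(1, length(new_names)),
  dims = c(length(feats), length(targets)),
  dimnames = list(feats, targets)
)

# A single sparse product duplicates the ngram counts and sums
# columns that end up with the same name
combined <- m %*% map
combined
```

The result has the same counts as the dfm_compress() output above (the column order may differ), and nothing is converted to a dense object along the way.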
