In the R tm package, build a corpus from a document-term matrix

Question

It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to build a corpus from a document-term-matrix.

Let M be the number of documents in a document set, and let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M × V matrix.

I also have a vocabulary vector of length V. It holds the words that the column indices of the document-term matrix refer to.
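To make the setup concrete, here is a minimal sketch with made-up data. The names `vocab` and `dtm`, the words, and the counts are illustrative assumptions, not the asker's actual data.

```r
# Hypothetical toy data: M = 3 documents, V = 4 vocabulary terms.
vocab <- c("oil", "price", "prices", "market")

# Row i holds the term counts for document i; column j refers to vocab[j].
dtm <- matrix(c(2, 1, 0, 1,
                0, 0, 3, 1,
                1, 1, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), vocab))

dim(dtm)               # M rows, V columns
length(vocab)          # V
dtm["doc2", "prices"]  # count of "prices" in the second document
```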

From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab manually - i.e. there never was a tm "corpus" object representing my dataset, so I can't use the function,

tm_map(corpus, stemDocument, language="english")

I've been trying to build a workaround where I stem the vocabulary and only keep unique words, but then it gets somewhat complicated trying to maintain the correspondence between the dtm and the vocabulary vector.

Ideally, the end result would be that my vocabulary vector is stemmed and only contains unique entries, and the dtm indices correspond to the stemmed vocabulary vector. If you can think of some other way to do that, I would appreciate that as well.
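One way to get that end result without a corpus is to stem the vocabulary and sum the dtm columns that collapse to the same stem. The following is only a sketch under assumptions: it uses `wordStem()` from the SnowballC package (not mentioned in the question), base R's `rowsum()`, and a made-up dense count matrix; a large sparse dtm would need a different representation.

```r
library(SnowballC)  # provides wordStem()

# Hypothetical toy inputs: a 2 x 4 count matrix and its vocabulary vector.
vocab <- c("price", "prices", "market", "markets")
dtm <- matrix(c(1, 2, 0, 1,
                0, 1, 3, 0),
              nrow = 2, byrow = TRUE)

# Stem each vocabulary entry; several entries may map to the same stem.
stems <- wordStem(vocab, language = "english")

# rowsum() adds up rows that share a group label, so transpose, sum the
# columns that collapse to the same stem, and transpose back.
dtm_stemmed <- t(rowsum(t(dtm), group = stems))

# The new vocabulary is the set of unique stems, in the column order
# of dtm_stemmed, so indices and vocabulary stay aligned.
vocab_stemmed <- colnames(dtm_stemmed)
```

Because the column names of `dtm_stemmed` are taken from the stems themselves, the correspondence between the matrix and the vocabulary vector is maintained automatically, and each document's total word count is unchanged.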

My troubles would be fixed if I could simply build a tm "corpus" from my dtm and vocabulary vector, stem the corpus, and then convert back to a dtm and vocabulary vector (I already know how to make those conversions).

Let me know if I can clarify the problem any further.

Recommended answer

Here's one approach, using my own minimal reproducible example built from the crude data set in the tm package (as a new user you may not be aware that providing one is your responsibility):

## Minimal Reproducible Example
library(tm)
data("crude")
## Keep the default term-frequency weighting: rep() below needs
## integer counts, and tf-idf weights would be truncated
dtm <- DocumentTermMatrix(crude,
    control = list(stopwords = TRUE))

## Convert dtm to a list of text
dtm2list <- apply(dtm, 1, function(x) {
    paste(rep(names(x), x), collapse=" ")
})

## convert to a Corpus
myCorp <- VCorpus(VectorSource(dtm2list))
inspect(myCorp)

## Stemming
myCorp <- tm_map(myCorp, stemDocument)
inspect(myCorp)
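For a hand-built matrix with a separate vocabulary vector, as in the question, the same idea might look like the sketch below, including the conversion back to a dtm. The toy `counts` and `vocab` are made-up assumptions, and `stemDocument` needs the SnowballC package installed. Note that the rebuilt pseudo-documents lose the original word order, which does not matter for a bag-of-words dtm.

```r
library(tm)

# Hypothetical toy inputs standing in for a hand-built dtm and vocabulary.
vocab <- c("price", "prices", "market", "markets")
counts <- matrix(c(1, 2, 0, 1,
                   0, 1, 3, 0),
                 nrow = 2, byrow = TRUE)

# Rebuild one pseudo-document per row by repeating each word by its count.
docs <- apply(counts, 1, function(x) paste(rep(vocab, x), collapse = " "))

# Build the corpus and stem it.
corp <- VCorpus(VectorSource(docs))
corp <- tm_map(corp, stemDocument, language = "english")

# Convert back; wordLengths = c(1, Inf) keeps short stems that the
# default minimum length of 3 characters would drop.
dtm_stemmed <- DocumentTermMatrix(corp,
                                  control = list(wordLengths = c(1, Inf)))
vocab_stemmed <- Terms(dtm_stemmed)
as.matrix(dtm_stemmed)
```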
