tm loses the metadata when applying tm_map
Question
I have a (small) problem with the tm R library. Say I have a corpus:
# boilerplate
library(tm)
bcorp <- c("one","two","three","four","five")
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)
Result:
[1] "1" "2" "3" "4" "5"
This works. But when I try to apply a transformation with tm_map():
# this does not work
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
tdm <- TermDocumentMatrix(myCorpus)
gives:
Error: inherits(doc, "TextDocument") is not TRUE
The solution proposed for this case was to convert the documents to PlainTextDocument.
# this works but erases the metadata
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)
Result:
[1] "character(0)" "character(0)" "character(0)" "character(0)" "character(0)"
Now it works, but it erases all the metadata (in this case the document names). Is there a way to maintain the metadata, or to save it and then restore it?
Recommended answer
I found it.
The line:
myCorpus <- tm_map(myCorpus, PlainTextDocument)
solves the problem but erases the metadata.
I found this answer, which explains a better way to use tm_map(). I just have to replace:
myCorpus <- tm_map(myCorpus, tolower)
with:
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
Everything works!
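For completeness, here is a minimal end-to-end sketch of the fix. It assumes a tm version that provides content_transformer() (0.6 or later), and it uses VCorpus() instead of Corpus() so that the documents are ordinary TextDocument objects; that substitution and the capitalised input strings are choices made for this illustration, not part of the original question. content_transformer() wraps a plain character function so that it is applied only to the content of each document, which is why the metadata survives.

```r
library(tm)

# Mixed-case input (an assumption for this sketch) so the effect of tolower is visible
bcorp <- c("One", "Two", "Three", "Four", "Five")
myCorpus <- VCorpus(VectorSource(bcorp))

# Wrap tolower so that only the document content is modified;
# the document objects and their metadata are left in place
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

tdm <- TermDocumentMatrix(myCorpus)

# Document names should still be the original ids rather than "character(0)"
print(Docs(tdm))

# The per-document metadata can also be inspected directly
print(meta(myCorpus[[1]]))
```

If other plain character functions are needed in the pipeline, wrapping each one in content_transformer() the same way should avoid the PlainTextDocument conversion altogether.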