tm loses the metadata when applying tm_map

Problem description

I have a (small) problem with the tm R library. Say I have a corpus:

# boilerplate
library(tm)
bcorp <- c("one","two","three","four","five")
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "1" "2" "3" "4" "5"

This works. But when I try to apply a transformation with tm_map():

# this does not work
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
tdm <- TermDocumentMatrix(myCorpus)

it gives:

Error: inherits(doc, "TextDocument") is not TRUE

The solution proposed for this case was to convert the documents to PlainTextDocument.

# this works but erases the metadata
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "character(0)" "character(0)" "character(0)" "character(0)" "character(0)"

Now it works, but it erases all the metadata (in this case the document names). Is there a way to maintain the metadata, or to save it and restore it afterwards?

Recommended answer

I found it.

The line:

myCorpus <- tm_map(myCorpus, PlainTextDocument)

solves the problem but erases the metadata.

I found this answer, which explains a better way to use tm_map(). I just have to replace:

myCorpus <- tm_map(myCorpus, tolower)

with:

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

and everything works!
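
To illustrate why the wrapper seems to help, here is a minimal sketch that compares calling tolower directly on a document with calling it through content_transformer(). It assumes a recent tm version and uses VCorpus() explicitly (the question uses Corpus()) so that each element is a PlainTextDocument. The variable names are only illustrative.

```r
library(tm)

bcorp <- c("one", "two", "three", "four", "five")

# Assumption: VCorpus() is used so every element is a PlainTextDocument.
myCorpus <- VCorpus(VectorSource(bcorp))
doc <- myCorpus[[1]]

# Calling tolower directly returns a bare character vector,
# so the document class and its metadata are gone.
plain_result <- tolower(doc)
is_doc_plain <- inherits(plain_result, "TextDocument")

# content_transformer() applies the function to the content only
# and returns the document object itself.
wrapped_result <- content_transformer(tolower)(doc)
is_doc_wrapped <- inherits(wrapped_result, "TextDocument")

# Compare the document id before and after the wrapped transformation.
id_before <- meta(doc, "id")
id_after <- meta(wrapped_result, "id")

print(c(plain = is_doc_plain, wrapped = is_doc_wrapped))
print(identical(id_before, id_after))
```

If the wrapped document keeps its class and id, TermDocumentMatrix() should be able to read the document names from the corpus without the extra PlainTextDocument step.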
