How do I use stemDocument correctly?


Question

I have already read this and this question, but I still don't understand how to use stemDocument inside tm_map. Consider this example:

q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt",
                                    load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"

$`character(0)`
[1] "pode"

If I use:

> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"

it works. But if I use:

> q17 <- tm_map(q17, FUN = stemDocument, language = "portuguese")
> lapply(q17, content)
$`character(0)`
[1] "poder"

$`character(0)`
[1] "pode"

it doesn't work. Why?

Recommended answer

Unfortunately, you have stumbled on a bug. stemDocument works when you pass the language to it directly:

stemDocument(x = c("poder", "pode"), language = "pt")
[1] "pod" "pod"

When stemDocument is used in tm_map, however, the call is dispatched to stemDocument.PlainTextDocument. That function checks the language of the corpus against the language you supply, and this part works correctly. At the end of the function, though, everything is passed on to stemDocument.character without the language argument, and stemDocument.character defaults to English. So inside a tm_map call (or a DocumentTermMatrix call) the language you supply reverts to English, and the Portuguese words are not stemmed correctly.

A workaround is to use the quanteda package:

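library(quanteda)
# Tokenize first: newer quanteda versions expect tokens rather than raw text in dfm()
my_dfm <- dfm(tokens(c("poder", "pode")))
# Stem the features with the Portuguese stemmer
my_dfm <- dfm_wordstem(my_dfm, language = "pt")

my_dfm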

Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
2 x 1 sparse Matrix of class "dfm"
       features
docs    pod
  text1   1
  text2   1
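
If staying within tm is preferable, one possible way around the lost language argument is to wrap the stemmer in content_transformer so that the language is fixed inside the wrapper rather than forwarded by tm_map. This is only a sketch, not part of the original answer: stem_pt is a hypothetical helper name, and it assumes the SnowballC package (the stemmer tm relies on) is installed and that splitting on whitespace is an adequate tokenization for your documents.

```r
library(tm)
library(SnowballC)

q17 <- VCorpus(VectorSource(c("poder", "pode")))

# Hypothetical helper (not part of tm): stems every whitespace-separated
# word of each document with an explicitly chosen language.
stem_pt <- content_transformer(function(x) {
  vapply(x, function(doc) {
    words <- unlist(strsplit(doc, "[[:space:]]+"))
    paste(wordStem(words, language = "portuguese"), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
})

# The language is set inside stem_pt, so nothing has to be passed through tm_map.
q17 <- tm_map(q17, stem_pt)
lapply(q17, content)
```

Because the language is hard-coded in the wrapper, the behaviour no longer depends on how tm forwards extra arguments to stemDocument.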

Since you are working with Portuguese, I suggest using quanteda, udpipe, or both. Both packages handle non-English languages much better than tm.
