R tm stemCompletion generates NA values


Problem description

When I try to apply stemCompletion to a corpus, the function generates NA values.

This is my code:

my.corpus <- tm_map(my.corpus, removePunctuation) 
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) 

(One result of this is: [[2584]] zoning plan)

The next step is stemming the corpus:

my.corpus <- tm_map(my.corpus, stemDocument, language="english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")

But the result is this:

[[2584]] NA plant

The next step should be creating an incidence matrix with transactions and then mining apriori rules, but if I go on and try to get the rules, the inspect(rules) function gives me this error:

> inspect(rules)
Errore in UseMethod("inspect", x) : 
no applicable method for 'inspect' applied to an object of class "c('rules','associations')"

What is the problem? I suppose that the NA values prevent the incidence matrix, and therefore the rules, from being generated correctly. Is that the problem? If so, how can I solve it?

Here is a reduced example:

my.words = c("β cell","zoning policy regional index brazil","zoning plan","zolpidem  adult","zizyphus spinosa hu")
my.corpus = Corpus(VectorSource(my.words))
my.corpus_copy = my.corpus
my.corpus = tm_map(my.corpus, removePunctuation)
my.corpus = tm_map(my.corpus, removeWords, c("the", stopwords("english"))) 
my.corpus = tm_map(my.corpus, stemDocument, language="english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")
inspect(my.corpus)

Recommended answer

At the moment, stemCompletion() is only an approximate reversal of the stemming process when the original corpus is used as the dictionary parameter. Using grep(), it searches the dictionary for all words that start with the current stem, and then picks one of them for completion according to type.

It therefore fails whenever stemming returns a word that is not a prefix of the un-stemmed word. For example, the stems of c('delivery', 'zoning') are c('deliveri', 'zone'), as returned by wordStem(), which stemDocument() uses. In both cases the stem is not a leading substring of the original word, so stemCompletion() finds no replacement and returns NA.
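The lookup described above can be reproduced in base R without loading tm. This is only a sketch: the stems are hard-coded from the values reported above, and the dictionary (including the extra words "plan" and "plant") is a hypothetical one chosen for illustration.

```r
# Stems as reported above for wordStem(); "plan" is added as a stem that does match.
stems <- c("deliveri", "zone", "plan")
# Hypothetical dictionary of un-stemmed words (assumption for this example).
dictionary <- c("delivery", "zoning", "plan", "plant")

# Same prefix lookup that stemCompletion() runs for each stem.
possible <- lapply(stems, function(w) {
  grep(sprintf("^%s", w), dictionary, value = TRUE)
})
names(possible) <- stems

# Keeping the first candidate: indexing an empty match vector yields NA.
first_match <- vapply(possible, function(p) p[1], character(1))
print(first_match)
```

Neither "deliveri" nor "zone" is a prefix of any dictionary entry, so their candidate lists are empty and the first element is NA, while "plan" finds a match.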

There are several ways to work around this, including replacing the NAs with the stems after stemCompletion() returns or, better, modifying stemCompletion() itself. A simple way to make it keep the stem instead of returning NA is to define your own version, stemCompletion_modified() (replace ... with the original code of the stemCompletion() function in the tm package):

stemCompletion_modified <- function (x, dictionary, type = ...) 
{
  ...
  # Original line in tm's stemCompletion():
  # possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
  # Modified: fall back to the stem itself when the dictionary has no match.
  possibleCompletions <- lapply(x, function(w) {
    hits <- grep(sprintf("^%s", w), dictionary, value = TRUE)
    if (length(hits) == 0) w else hits
  })
  ...
}
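The other alternative mentioned above, replacing the NAs with the stems after stemCompletion() has returned, might look roughly like the sketch below. The helper name fill_na_with_stem() is hypothetical, and the completed vector is written out by hand to stand in for a stemCompletion() result (completions named by stem, NA where nothing matched), so the example runs without tm.

```r
# Hypothetical helper: keep the completion where one exists, otherwise keep the stem.
fill_na_with_stem <- function(stems, completed) {
  stopifnot(length(stems) == length(completed))
  is_missing <- is.na(completed)
  completed[is_missing] <- stems[is_missing]
  completed
}

stems <- c("deliveri", "zone", "plan")
# Stand-in for the output of stemCompletion(stems, dictionary, type = "first").
completed <- c(deliveri = NA, zone = NA, plan = "plan")

result <- fill_na_with_stem(stems, completed)
print(result)
```

This keeps stemCompletion() untouched, at the cost of a second pass over the result; the stems that could not be completed simply stay in their stemmed form.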
