stemCompletion is not working

Question

I am using the tm package for text analysis of repair data. I read the data into a data frame, converted it to a Corpus object, and cleaned it with various transformations such as tolower, stripWhitespace, stop-word removal, and so on.

I kept a backup copy of the Corpus object to use as the dictionary for stemCompletion.

I ran stemDocument with tm_map, and the words in my corpus were stemmed as expected.

When I run stemCompletion through tm_map, it fails with the error below:

Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"

I ran traceback() and got the following call stack:

> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)

How can I resolve this error?

Recommended answer

I received the same error when using tm v0.6. I suspect this occurs because stemCompletion is not among the default transformations in this version of the tm package:

>  getTransformations
function () 
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument", 
    "stripWhitespace")
<environment: namespace:tm>

The tolower function has the same problem, but it can be made to work by wrapping it in content_transformer. I tried a similar approach for stemCompletion but was not successful.

Note that even though stemCompletion isn't a default transformation, it still works when it is fed stemmed words directly:

> stemCompletion("compani",dictCorpus)
    compani 
"companies" 

So that I could continue with my work, I split each document in the corpus on single spaces, fed the resulting words through stemCompletion, and concatenated them back together with the following (clunky and not graceful!) function:

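stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the document into words, complete each stem, then rejoin the words.
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}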

Here dictCorpus is just a copy of the cleaned corpus, taken before it is stemmed. The extra stripWhitespace is specific to my corpus, but is likely benign for a general corpus. You may want to change the type option from "shortest" as needed.
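To illustrate what the type option changes, here is a minimal sketch. It assumes the tm package is installed and uses a small made-up character vector as the dictionary (stemCompletion accepts either a corpus or a character vector); the words and stems are illustrative, not taken from the crude data.

```r
library(tm)

# Toy dictionary and stems, assumed purely for illustration.
dict  <- c("reduction", "reductions", "market", "markets", "marketing")
stems <- c("reduct", "market")

# Each stem is matched against dictionary words that start with it;
# `type` decides which of the candidate completions is returned.
shortest <- stemCompletion(stems, dictionary = dict, type = "shortest")
longest  <- stemCompletion(stems, dictionary = dict, type = "longest")
first    <- stemCompletion(stems, dictionary = dict, type = "first")

print(rbind(shortest, longest, first))
```

With "shortest" a stem that is itself a dictionary word stays as it is, while "longest" prefers the most inflected form, so the right choice depends on how the completed text will be used.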

For a full example, let's set up a dummy corpus using the crude data in the tm package:

> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)

> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}

> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter

> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today 
made light fall oil product price weak crude oil market compani spokeswoman said diamond 
latest line us oil compani cut contract post price last two day cite weak oil market reuter

> # Stem-completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today 
made light fall oil product price weak crude oil market companies spokeswoman said diamond 
latest line us oil companies cut contract posted price last two day cited weak oil market reuter

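Note: this example is odd in that the misspelled word "copany" gets mapped "copany" -> "copani" -> NA along the way, since no dictionary word starts with "copani". I'm not sure how to correct this...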
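One possible way to deal with the NA, if keeping the uncompleted stem is acceptable, is to fall back to the stem whenever no completion is found. The sketch below assumes the tm package is installed; complete_or_keep is a hypothetical helper, not part of tm, and the dictionary is a made-up character vector.

```r
library(tm)

# Hypothetical helper (not part of tm): complete each stem, but keep the
# stem itself when the dictionary offers no completion.
complete_or_keep <- function(stems, dict, type = "shortest") {
  completed <- stemCompletion(stems, dictionary = dict, type = type)
  # Treat both NA and empty results as "no completion found".
  unmatched <- is.na(completed) | completed == ""
  completed[unmatched] <- stems[unmatched]
  unname(completed)
}

# Toy dictionary, assumed for illustration only.
dict   <- c("companies", "company", "reduction", "prices")
result <- complete_or_keep(c("compani", "reduct", "copani"), dict)
print(result)
```

This leaves the misspelling in the text as "copani" instead of the literal string "NA". It does not correct the typo, which would need a separate spell-checking step.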

To run stemCompletion_mod over the entire corpus, I just use sapply (or parSapply from the snow package).
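A rough sketch of that per-document loop is shown below. It assumes the tm package is installed and uses two tiny made-up corpora in place of the crude data: dictCorpus holds the unstemmed text, and docs holds hand-written stems standing in for the output of stemDocument. complete_doc is a hypothetical helper that returns plain text, not a PlainTextDocument.

```r
library(tm)

# Toy corpora, assumed for illustration only.
dictCorpus <- VCorpus(VectorSource(c("company cut prices",
                                     "companies cite weak markets")))
docs <- VCorpus(VectorSource(c("compani cut price",
                               "compani cite weak market")))

# Hypothetical helper: complete every space-separated stem in one document
# and return the result as a single string.
complete_doc <- function(doc, dict) {
  stems <- unlist(strsplit(as.character(doc), " ", fixed = TRUE))
  stems <- stems[nzchar(stems)]  # drop empty tokens left by repeated spaces
  paste(stemCompletion(stems, dictionary = dict, type = "shortest"),
        collapse = " ")
}

# Apply the helper to each document, then rebuild a corpus from the text.
completed <- vapply(docs, complete_doc, character(1), dict = dictCorpus)
completed_corpus <- VCorpus(VectorSource(unname(completed)))
print(completed)
```

Rebuilding the corpus from plain strings discards per-document metadata, so this only suits cases where the metadata is not needed afterwards.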

Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.
