R error in lemmatization of a corpus of documents with wordnet


Problem description

I'm trying to lemmatize a corpus of documents in R with the wordnet library. This is the code:

library(tm)
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)

library(wordnet)
lapply(corpus.documents,function(x){
  x.filter <- getTermFilter("ContainsFilter", x, TRUE)
  terms <- getIndexTerms("NOUN", 1, x.filter)
  sapply(terms, getLemma)
})

But when running this, I get this error:

Errore in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word,  :
java.lang.NoSuchMethodError: <init> 

These are the stack calls:

5 stop(structure(list(message = "java.lang.NoSuchMethodError: <init>", 
call = .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), 
    word, ignoreCase), jobj = <S4 object of class structure("jobjRef", package 
="rJava")>), .Names = c("message", 
"call", "jobj"), class = c("NoSuchMethodError", "IncompatibleClassChangeError",  ... 
4 .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, 
ignoreCase) 
3 getTermFilter("ContainsFilter", x, TRUE) 
2 FUN(X[[1L]], ...) 
1 lapply(corpus.documents, function(x) {
x.filter <- getTermFilter("ContainsFilter", x, TRUE)
terms <- getIndexTerms("NOUN", 1, x.filter)
sapply(terms, getLemma) ... 

What's wrong?

Recommended answer

So this does not address your use of wordnet, but it does provide an option for lemmatizing that might work for you (and is better, IMO...). This uses the MorphAdorner API developed at Northwestern University. You can find detailed documentation here. In the code below I'm using their Adorner for Plain Text API.

# MorphAdorner (Northwestern University) web service
adorn <- function(text) {
  require(httr)
  require(XML)
  url <- "http://devadorner.northwestern.edu/maserver/partofspeechtagger"
  response <- GET(url,query=list(text=text, media="xml", 
                                 xmlOutputType="outputPlainXML",
                                 corpusConfig="ncf", # Nineteenth Century Fiction
                                 includeInputText="false", outputReg="true"))
  doc <- content(response,type="text/xml")
  words <- doc["//adornedWord"]
  xmlToDataFrame(doc,nodes=words)
}

library(tm)
vector.documents <- c("Here is some text.", 
                      "This might possibly be some additional text, but then again, maybe not...",
                      "This is an abstruse grammatical construction having as it's sole intention the demonstration of MorhAdorner's capability.")
corpus.documents <- Corpus(VectorSource(vector.documents))
lapply(corpus.documents,function(x) adorn(as.character(x)))
# [[1]]
#   token spelling standardSpelling lemmata partsOfSpeech
# 1  Here     Here             Here    here            av
# 2    is       is               is      be           vbz
# 3  some     some             some    some             d
# 4  text     text             text    text            n1
# 5     .        .                .       .             .
# ...

I'm just showing the lemmatization of the first "document". partsOfSpeech follows the NUPOS convention.
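If all you need is the lemma sequence rather than the full table, you can pull just the `lemmata` column out of the data frames that `adorn()` returns. A small sketch building on the code above (the `lemmatized` name is just for illustration):

```r
# Collect only the lemmas for each document, reusing adorn() from above.
# paste() joins each document's lemma vector back into a single string.
lemmatized <- lapply(corpus.documents, function(x) {
  adorned <- adorn(as.character(x))
  paste(as.character(adorned$lemmata), collapse = " ")
})
```

Note this makes one web-service call per document, so for a large corpus you may want to batch or cache the requests.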

