R error when lemmatizing a corpus of documents with wordnet
Question
I'm trying to lemmatize a corpus of documents in R with the wordnet library. This is the code:
library(tm)
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)
library(wordnet)
lapply(corpus.documents,function(x){
x.filter <- getTermFilter("ContainsFilter", x, TRUE)
terms <- getIndexTerms("NOUN", 1, x.filter)
sapply(terms, getLemma)
})
But when running this, I get this error:
Errore in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, :
java.lang.NoSuchMethodError: <init>
This is the stack trace:
5 stop(structure(list(message = "java.lang.NoSuchMethodError: <init>",
call = .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."),
word, ignoreCase), jobj = <S4 object of class structure("jobjRef", package
="rJava")>), .Names = c("message",
"call", "jobj"), class = c("NoSuchMethodError", "IncompatibleClassChangeError", ...
4 .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word,
ignoreCase)
3 getTermFilter("ContainsFilter", x, TRUE)
2 FUN(X[[1L]], ...)
1 lapply(corpus.documents, function(x) {
x.filter <- getTermFilter("ContainsFilter", x, TRUE)
terms <- getIndexTerms("NOUN", 1, x.filter)
sapply(terms, getLemma) ...
What is wrong?
Answer
So this does not address your use of wordnet, but it does provide an option for lemmatizing that might work for you (and is better, IMO...). It uses the MorphAdorner API developed at Northwestern University. You can find detailed documentation here. In the code below I'm using their Adorner for Plain Text API.
# MorphAdorner (Northwestern University) web service
adorn <- function(text) {
  require(httr)
  require(XML)
  url <- "http://devadorner.northwestern.edu/maserver/partofspeechtagger"
  response <- GET(url, query = list(text = text, media = "xml",
                                    xmlOutputType = "outputPlainXML",
                                    corpusConfig = "ncf",   # Nineteenth Century Fiction
                                    includeInputText = "false", outputReg = "true"))
  doc   <- content(response, type = "text/xml")
  words <- doc["//adornedWord"]   # XPath query via the XML package
  xmlToDataFrame(doc, nodes = words)
}
library(tm)
vector.documents <- c("Here is some text.",
                      "This might possibly be some additional text, but then again, maybe not...",
                      "This is an abstruse grammatical construction having as it's sole intention the demonstration of MorphAdorner's capability.")
corpus.documents <- Corpus(VectorSource(vector.documents))
lapply(corpus.documents, function(x) adorn(as.character(x)))
# [[1]]
# token spelling standardSpelling lemmata partsOfSpeech
# 1 Here Here Here here av
# 2 is is is be vbz
# 3 some some some some d
# 4 text text text text n1
# 5 . . . . .
# ...
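If all you need are the lemmas as a plain character vector (e.g. to feed back into a tm pipeline), you can reduce the adorned data frame to its `lemmata` column. A minimal sketch, using a hand-built stand-in data frame shaped like the output above rather than a live call to `adorn()` (the web service may not be reachable), and dropping punctuation-only tokens:

```r
# Stand-in for one data frame returned by adorn(); the column names
# (token, lemmata) mirror the output shown above.
adorned <- data.frame(
  token   = c("Here", "is", "some", "text", "."),
  lemmata = c("here", "be", "some", "text", "."),
  stringsAsFactors = FALSE
)

# Keep only rows whose lemma contains at least one alphanumeric
# character, i.e. discard punctuation-only tokens.
lemmas <- adorned$lemmata[grepl("[[:alnum:]]", adorned$lemmata)]
lemmas
# [1] "here" "be"   "some" "text"
```

The same filter applied inside the `lapply` over the corpus would give you one lemma vector per document.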
I'm just showing the lemmatization of the first "document". partsOfSpeech follows the NUPOS convention.