Lemmatizer in R or python (am, are, is -> be?)

Views: 131 Published: 2020/5/18 0:39:23 python r nlp nltk lemmatization

Question

I'm not a [computational] linguist, so please excuse my super dummy-ness on this topic.

According to Wikipedia, lemmatisation is defined as:

Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

Now my question is: is the lemmatised version of any member of the set {am, is, are} supposed to be "be"? If not, why not?

Second question: how do I get that in R or Python? I've tried methods like the ones in this link, but none of them gives "be" for "are". I'd guess that, at least for the purpose of classifying text documents, mapping these forms to "be" makes sense.

I also couldn't get that result with any of the demos given here.

What am I doing or assuming wrong?

Recommended answer

So here is a way to do it in R, using the Northwestern University lemmatizer, MorphAdorner.

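library(httr)   # GET(), content()
library(XML)    # xmlInternalTreeParse(), getNodeSet(), xmlValue()

# Look up the lemma of each word via the MorphAdorner lemmatizer web service.
lemmatize <- function(wordlist) {
  url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"
  get.lemma <- function(word, url) {
    response <- GET(url, query = list(spelling = word, standardize = "",
                                      wordClass = "", wordClass2 = "",
                                      corpusConfig = "ncf",    # Nineteenth Century Fiction
                                      media = "xml"))
    # Read the body as text; a distinct name avoids shadowing httr::content()
    txt <- content(response, as = "text", encoding = "UTF-8")
    doc <- xmlInternalTreeParse(txt, asText = TRUE)
    # Return the text of the first <lemma> element in the XML reply
    xmlValue(getNodeSet(doc, "//lemma")[[1]])
  }
  # Named character vector: one lemma per input word
  sapply(wordlist, get.lemma, url = url)
}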

words <- c("is","am","was","are")
lemmatize(words)
#   is   am  was  are 
# "be" "be" "be" "be"

As I suspect you are aware, correct lemmatization requires knowledge of the word class (part of speech) and contextually correct spelling, and it also depends on which corpus is being used.
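To make the point about word class concrete, here is a minimal Python sketch of a dictionary-based lemmatizer keyed by (word, part of speech). The table `LEMMA_TABLE`, the tag names `"VERB"` and `"NOUN"`, and the `lemmatize` function are all hypothetical and hand-built for illustration; this is not MorphAdorner's method or any library's API. It only shows why the same spelling can map to different lemmas depending on the word class supplied.

```python
# Toy lookup table, hand-built for illustration only (hypothetical data).
# Keys are (lowercased word, part-of-speech tag); values are lemmas.
LEMMA_TABLE = {
    ("am", "VERB"): "be",
    ("is", "VERB"): "be",
    ("are", "VERB"): "be",
    ("was", "VERB"): "be",
    ("saw", "VERB"): "see",   # "I saw it"  -> see
    ("saw", "NOUN"): "saw",   # "a hand saw" -> saw
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


if __name__ == "__main__":
    for word, pos in [("are", "VERB"), ("saw", "VERB"), ("saw", "NOUN")]:
        print(word, pos, "->", lemmatize(word, pos))
```

A lemmatizer that is not told the word class has to guess one, so a verb form such as "are" may come back unchanged if it is looked up under the wrong class.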
