How to perform lemmatization in R?


Question

This question is a possible duplicate of "Lemmatizer in R or python (am, are, is -> be?)", but I'm asking again because the previous one was closed as too broad, and its only answer is not efficient: it accesses an external website, which is too slow for the very large corpus I need to find lemmas for. So part of this question will be similar to the one mentioned above.

According to Wikipedia, lemmatization is defined as:

Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

A simple Google search for lemmatization in R points only to the R package wordnet. I tried this package expecting that passing the character vector c("run", "ran", "running") to its lemmatization function would return c("run", "run", "run"), but I found that it only provides functionality similar to the grepl function, through various filter names and a dictionary.

Example code from the wordnet package, which returns at most 5 nouns starting with "car", as the filter name suggests:

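library(wordnet)
# Match index terms starting with "car"; TRUE means ignore case
filter <- getTermFilter("StartsWithFilter", "car", TRUE)
# Retrieve at most 5 matching noun entries
terms <- getIndexTerms("NOUN", 5, filter)
sapply(terms, getLemma)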

The above is NOT the lemmatization that I'm looking for. What I want is to find the true roots of words using R, e.g. to go from c("run", "ran", "running") to c("run", "run", "run").
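To make the target behaviour concrete, here is a minimal base R sketch of the input/output the question describes. It is only a toy: `lemma_table` and `lemmatize_lookup` are hypothetical names invented for this illustration, and a hand-written lookup table would not scale to a real corpus.

```r
# Toy lookup table mapping inflected forms to a lemma
# (hypothetical, hand-written for illustration only).
lemma_table <- c(run = "run", ran = "run", running = "run",
                 am = "be", are = "be", is = "be")

# Return the lemma for each word; words missing from the table
# are returned unchanged.
lemmatize_lookup <- function(words) {
  hits <- lemma_table[words]
  unname(ifelse(is.na(hits), words, hits))
}

lemmatize_lookup(c("run", "ran", "running"))
# [1] "run" "run" "run"
```

A real lemmatizer replaces the hand-written table with a dictionary or a tagger, which is what the question is asking for.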

Recommended answer

Hello, you can try the koRpus package, which lets you use TreeTagger:

library(koRpus)
# path must point to a local TreeTagger installation
tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
                      TT.tknz=FALSE , lang="en",
                      TT.options=list(path="./TreeTagger", preset="en"))
tagged.results@TT.res

##     token tag lemma lttr wclass                               desc stop stem
## 1     run  NN   run    3   noun             Noun, singular or mass   NA   NA
## 2     ran VVD   run    3   verb                   Verb, past tense   NA   NA
## 3 running VVG   run    7   verb Verb, gerund or present participle   NA   NA

See the lemma column for the result you're asking for.
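If only the lemmas are needed as a character vector, the column can be pulled out of that data frame. The sketch below does not call TreeTagger: `res` is a hand-built stand-in for `tagged.results@TT.res`, reconstructed from the output printed above, so that the snippet runs on its own.

```r
# Stand-in for tagged.results@TT.res, rebuilt by hand from the
# output shown above (only the first three columns are reproduced).
res <- data.frame(
  token = c("run", "ran", "running"),
  tag   = c("NN", "VVD", "VVG"),
  lemma = c("run", "run", "run"),
  stringsAsFactors = FALSE
)

# Lemmas as a plain character vector, in the same order as the tokens
lemmas <- res$lemma
lemmas

# Optionally label each lemma with the token it came from
setNames(lemmas, res$token)
```

With a real tagging result, the same `$lemma` access should apply to `tagged.results@TT.res`, assuming the slot has the column layout shown in the answer's output.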
