Base word stemming instead of root word stemming in R


Question

Is there any way to get the base word instead of the root word when stemming with NLP tools in R?

Code:

> #Loading libraries
> library(tm)
> library(slam)
> 
> #Vector
> Vec=c("happyness happies happys","sky skies")
> 
> #Creating Corpus
> Txt=Corpus(VectorSource(Vec))
> 
> #Stemming
> Txt=tm_map(Txt, stemDocument)
> 
> #Checking result
> inspect(Txt)
A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
happi happi happi

[[2]]
sky sky

> 

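Can I use R to get the base word "happy" instead of the root word "happi" for "happyness happies happys"?

Answer

You're probably looking for a stemmer. Here are some stemmers from the CRAN Task View: Natural Language Processing:

  • RWeka is an interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java. Its tokenization and stemming functionality is especially useful in the context of natural language processing.

  • Snowball provides the Snowball stemmers, which include the Porter stemmer and several other stemmers for different languages. See the Snowball webpage for details.

  • Rstem is an alternative interface to a C version of Porter's word stemming algorithm.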
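Stemmers such as those listed above return truncated stems by design, so a stem like "happi" still has to be mapped back to a dictionary form if a base word is wanted. The following is a minimal base-R sketch of one possible post-processing step, assuming a small hand-built lookup table. The names `base_words` and `to_base` are hypothetical and are not part of tm or of any package mentioned here; the stems are the ones shown in the question's output.

```r
# Hypothetical lookup table (an assumption for illustration):
# names are stems, values are the preferred base words.
base_words <- c(happi = "happy", sky = "sky")

# Replace each stem with its base word; stems missing from the
# table are returned unchanged.
to_base <- function(stems, lookup) {
  mapped <- unname(lookup[stems])        # NA where a stem has no entry
  ifelse(is.na(mapped), stems, mapped)
}

# Stems taken from the inspect() output in the question, plus one
# stem ("cloud") that is deliberately absent from the table.
stems <- c("happi", "happi", "happi", "sky", "sky", "cloud")
result <- to_base(stems, base_words)
print(result)
```

A table like this only covers the stems you list in it, so it suits small, known vocabularies; for open-ended text a dictionary-backed approach would be needed instead.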
