Stemming with R Text Analysis


Question

I am doing a lot of analysis with the tm package. One of my biggest problems is related to stemming and stemming-like transformations.

Let's say I have several accounting-related terms (I am aware of the spelling issues).
After stemming we have:

accounts   -> account  
account    -> account  
accounting -> account  
acounting  -> acount  
acount     -> acount  
acounts    -> acount  
accounnt   -> accounnt  

Result: 3 terms (account, acount, accounnt) where I would have liked 1 (account), as all of these relate to the same term.

1) Correcting the spelling is a possibility, but I have never attempted that in R. Is that even possible?

2) The other option is to make a reference list, i.e. account = (accounts, account, accounting, acounting, acount, acounts, accounnt), and then replace all occurrences with the master term. How would I do this in R?

Once again, any help or suggestions would be greatly appreciated.

Recommended answer

We could set up a list of synonyms and replace those values. For example:

synonyms <- list(
    list(word="account", syns=c("acount", "accounnt"))
)

This says we want to replace "acount" and "accounnt" with "account" (I'm assuming we're doing this after stemming). Now let's create test data.

raw<-c("accounts", "account", "accounting", "acounting", 
     "acount", "acounts", "accounnt")

Now let's define a transformation function that will replace the words in our list with the primary synonym.

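library(tm)

# Custom tm transformation: for each entry in `syn`, replace any whole-word
# match of one of its `syns` with its `word`.
replaceSynonyms <- content_transformer(function(x, syn = NULL) {
    Reduce(function(a, b) {
        gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"), b$word, a)
    }, syn, x)
})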
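To make the regular expression concrete, here is a minimal sketch in base R (no tm needed) of the pattern built for a single entry and what `gsub` does with it. The names `entry`, `stemmed`, and `replaced` and the sample tokens are only illustrative; they are not part of the original answer.

```r
# One entry of the synonyms list, as in the answer
entry <- list(word = "account", syns = c("acount", "accounnt"))

# The same pattern construction that replaceSynonyms uses internally
pattern <- paste0("\\b(", paste(entry$syns, collapse = "|"), ")\\b")
cat(pattern, "\n")

# Illustrative tokens: some already stemmed, one not, one unrelated
stemmed <- c("account", "acount", "accounnt", "acounting", "discount")
replaced <- gsub(pattern, entry$word, stemmed)
print(replaced)
```

Because of the `\\b` word boundaries, only whole-word matches are replaced: "acounting" is left untouched here, which is why the replacement is meant to run after stemming.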

The replaceSynonyms definition uses the content_transformer function to create a custom transformation; internally it runs one gsub per synonym entry to replace each of the listed words. We can then use this on a corpus:

tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)

We can see that all of these values are transformed into "account", as desired. To add other synonyms, just add additional lists to the main synonyms list. Each sub-list should have the names "word" and "syns".
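As a rough illustration of adding more entries, the sketch below extends the list with a second, made-up entry and applies the same Reduce/gsub logic to a plain character vector. The "ledger" entry, its misspellings, and the helper name `replace_synonyms_plain` are assumptions for demonstration only. The function is a base-R stand-in for the tm transformation, not part of the tm API.

```r
# Synonyms list with a second, hypothetical entry added
synonyms <- list(
    list(word = "account", syns = c("acount", "accounnt")),
    list(word = "ledger",  syns = c("leger", "ledgr"))  # made-up example entry
)

# Base-R stand-in for replaceSynonyms: applies each entry in turn
replace_synonyms_plain <- function(x, syn) {
    Reduce(function(a, b) {
        gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"), b$word, a)
    }, syn, x)
}

# Tokens assumed to be already stemmed
tokens <- c("acount", "accounnt", "leger", "ledgr", "acount leger")
result <- replace_synonyms_plain(tokens, synonyms)
print(result)
```

Each sub-list is processed in order, so a misspelling should appear under only one master term to keep the outcome unambiguous.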
