Text-mining with the tm-package - word stemming


Question

I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Some words share the same stem, but it is important that they are not "thrown together", because they mean different things.

For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" (or "association" and "associate") interchangeably. However, this is what happens in step 4.

Is there an elegant way to handle this manually for certain cases/words (e.g. so that "lecturer" and "lecture" are kept as two different things)?

library(tm)

texts <- c("i am member of the XYZ association",
    "apply for our open associate position", 
    "xyz memorial lecture takes place on wednesday", 
    "vote for the most popular lecturer")

# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts)))

# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus

# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")  

inspect(corpus.temp)

# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)  

inspect(corpus.final)

Recommended answer

I'm not 100% sure what you're after, and I don't totally get how tm_map works, but if I understand correctly, the following works. As I understand it, you want to supply a list of words that should not be stemmed. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.

Note that I got frustrated with using mgsub inside tm_map, as it kept throwing an error, so I just used lapply instead.

texts <- c("i am member of the XYZ association",
    "apply for our open associate position", 
    "xyz memorial lecture takes place on wednesday", 
    "vote for the most popular lecturer")

library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))

library(qdap)
# Step 2: list of words to retain and their identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")

# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)

# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")  

# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)

inspect(corpus)       #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)

# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)  
inspect(corpus.final)

Basically it works by:

  1. subbing in a unique identifier key for each supplied "NO STEM" word (the first mgsub)
  2. stemming (using stemDocument)
  3. reversing the substitution, replacing the identifier keys with the "NO STEM" words (the second mgsub)
  4. completing the stems (stemCompletion)
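
For readers who want the same effect without qdap, here is a minimal sketch of a related but different approach: rather than swapping in placeholder keys and swapping them back, it splits each text on spaces and simply skips the stemmer for any token found in the retain list. The function name `stem_except` and its arguments are hypothetical names made up for this illustration. The default stemmer is assumed to be `SnowballC::wordStem`, and the example only runs if SnowballC is installed. It works on plain character vectors, not on a tm corpus, and does not do stem completion.

```r
# Stem every space-separated token except those listed in `retain`.
# `stem_except` is a hypothetical helper, not part of tm or qdap.
stem_except <- function(text, retain, stemmer = SnowballC::wordStem) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  # Tokens that match a protected word (case-insensitively) are left alone
  keep <- tolower(tokens) %in% tolower(retain)
  tokens[!keep] <- stemmer(tokens[!keep])
  paste(tokens, collapse = " ")
}

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

retain <- c("lecturer", "lecture")

# Only run the demo when the assumed stemmer package is available
if (requireNamespace("SnowballC", quietly = TRUE)) {
  stemmed <- vapply(texts, stem_except, character(1),
                    retain = retain, USE.NAMES = FALSE)
  print(stemmed)
}
```

Because protected tokens never reach the stemmer, there is no risk of the stemmer altering a placeholder key. The trade-off is that matching is on whole tokens only, so punctuation attached to a word (e.g. "lecture,") would need to be stripped first.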

The output is as follows:

## >     inspect(corpus.final)
## A corpus with 4 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID 
## 
## $`1`
## i am member of the XYZ associate
## 
## $`2`
##  for our open associate position
## 
## $`3`
## xyz memorial lecture takes place on wednesday
## 
## $`4`
## vote for the most popular lecturer
