Text mining with the tm package: word stemming
Question
I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Some words share the same stem, but it is important that they are not "thrown together", because they mean different things.
For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" (or "association" and "associate") interchangeably. However, this is what happens in step 4.
Is there an elegant way to handle this manually for specific cases/words (e.g. so that "lecturer" and "lecture" are kept as two different things)?
library(tm)

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")
# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts)))
# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus
# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
inspect(corpus.temp)
# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
Recommended answer
I'm not 100% sure what you're after, and I don't totally get how tm_map works, but if I understand correctly, the following works. As I understand it, you want to supply a list of words that should not be stemmed. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.
Note that I got frustrated trying to use mgsub with tm_map, as it kept throwing an error, so I just used lapply instead.
texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
library(qdap)
# Step 2: list of words to retain and their identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")
# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)
# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
Basically, it works by:
- subbing out a unique identifier key for each of the supplied "NO STEM" words (mgsub)
- then stemming (using stemDocument)
- next reversing it, subbing the identifier keys back with the "NO STEM" words (mgsub)
- last, completing the stems (stemCompletion)
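To make the placeholder round trip easier to see in isolation, here is a minimal base R sketch of the first three steps (swap in keys, stem, swap back). It is not the tm/qdap code above: `stem_except`, `toy_stem` and the `zzkeepNzz` key format are hypothetical names made up for illustration, and `toy_stem` is a crude stand-in that only strips a trailing "er", "e" or "r", not the Porter stemmer that stemDocument applies. It also assumes the retained words contain no regular-expression metacharacters and that the stemmer leaves the keys untouched. Stem completion is left out.

```r
# Toy stand-in for a real stemmer (hypothetical, NOT the Porter algorithm):
# strips a trailing "er", "e" or "r" from every space-separated word.
toy_stem <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  paste(sub("(er|e|r)$", "", words), collapse = " ")
}

# Stem every text, but leave the words in `retain` untouched.
stem_except <- function(texts, retain, stem_fun) {
  # One placeholder key per retained word; the key format is an assumption
  # and must be something the chosen stemmer does not modify.
  keys <- paste0("zzkeep", seq_along(retain), "zz")

  # Swap each retained word (whole-word match only) for its key,
  # so "lecture" does not match inside "lecturer".
  for (i in seq_along(retain)) {
    texts <- gsub(paste0("\\b", retain[i], "\\b"), keys[i], texts, perl = TRUE)
  }

  # Stem each text; the keys pass through unchanged.
  texts <- vapply(texts, stem_fun, character(1), USE.NAMES = FALSE)

  # Swap the keys back for the original words.
  for (i in seq_along(keys)) {
    texts <- gsub(keys[i], retain[i], texts, fixed = TRUE)
  }
  texts
}

texts <- c("xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")
stemmed <- stem_except(texts, retain = c("lecturer", "lecture"),
                       stem_fun = toy_stem)
print(stemmed)
```

Without the protection step, the toy stemmer maps "lecture" and "lecturer" to the same string; with it, both words come through intact while the surrounding words are still stemmed. The whole-word match is a design choice that removes the need to replace longer words before shorter ones.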
The output is as follows:
## > inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer