Snowball Stemmer only stems last word


Problem description

I want to stem the documents in a Corpus of plain text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.

library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/directory")
corp <- Corpus(DirSource(path),
               readerControl = list(reader = readPlain, language = "en_US",
                                    load = TRUE))
tm_map(corp, SnowballStemmer)  # stemDocument has the same problem

I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:

> vec<-c("running runner runs","happyness happies")
> stemDocument(vec) 
   [1] "running runner run" "happyness happi" 

> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
   [1] "run"    "runner" "run"    "happy"  "happi"

> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
   Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   [[1]]
   run runner run

   [[2]]
   happy happi

> corp2<-Corpus(DirSource(path), readerControl=list(reader=readPlain, language="en_US", load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   $`1.txt`
   running runner runs

   $`2.txt`
   happyness happies

Answer

Load the required libraries:

library(tm)
library(Snowball)

Create a vector:

vec<-c("running runner runs","happyness happies")

Create a corpus from the vector:

vec<-Corpus(VectorSource(vec))

It is very important to check the class of our corpus and preserve it, because we want a standard corpus that the tm functions understand:

class(vec[[1]])

vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs

This should tell you that it is a PlainTextDocument.

So now we work around the faulty stemDocument behavior. First we convert the plain text document to a character vector, then split the text into words, apply stemDocument (which works correctly on single words), and paste the result back together. Most importantly, we convert the output back to a PlainTextDocument, the class used by the tm package.

stemDocumentfix <- function(x)
{
    # split into words, stem each word, paste back together, restore the class
    PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),
                            collapse = " "))
}
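The split / stem-each-word / join pattern is the essential fix. As a rough language-agnostic illustration of the same logic (Python here, with a toy stand-in stemmer, since the real Snowball rules are more involved — the function names and suffix list are invented for this sketch):

```python
def toy_stem(word):
    # Toy stand-in for a real Snowball stemmer -- illustrative only.
    for suffix in ("ning", "ing", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_document(text):
    # Mirror of the R workaround: split on spaces, stem each token, join back.
    return " ".join(toy_stem(w) for w in text.split(" "))

print(stem_document("running runner runs"))  # run runner run
```

The point is that the stemmer is applied per word, never to the whole multi-word string at once, which is exactly what the original call was (incorrectly) doing.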

Now we can use the standard tm_map on our corpus:

vec1 = tm_map(vec, stemDocumentfix)

The result is:

vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run

The most important thing to remember is to always preserve the class of the documents in the corpus. I hope this is a simplified solution to your problem, using functions from within the two libraries loaded.

