Removing stopwords from a user-defined corpus in R

Question

I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert to a Corpus object:

library(tm) #Corpus, VectorSource, tm_map, removeWords and stopwords come from tm
documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But this last line results in the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here, but no answer was given. What does this error mean?

Edit

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

Recommended answer

When I run into tm problems, I often end up just editing the original text.

For removing words it's a little awkward, but you can paste together a regex from tm's stopword list and apply it to the plain character vector instead of the Corpus.

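# Join tm's English stopwords into one alternation, each word wrapped in \b word boundaries
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
# Replace every match with an empty string (the surrounding spaces are left behind)
documents = stringr::str_replace_all(documents, stopwords_regex, '')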

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
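
The output above still carries the spaces that surrounded each removed word. As a rough sketch of one way to tidy that up, the example below repeats the same regex idea in base R only and then collapses the leftover whitespace. The names my_stopwords, docs and cleaned are hypothetical, and the short stopword list is an assumption made so the example runs without tm; in practice you would pass stopwords('en') instead.

```r
# Hypothetical, abbreviated stopword list (an assumption for this sketch);
# tm's stopwords('en') is much longer.
my_stopwords <- c("the", "this", "was", "for", "she", "had")

# Two already lower-cased, punctuation-free documents from the question.
docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# One alternation wrapped in word boundaries: \b(?:the|this|...)\b
stopwords_regex <- paste0("\\b(?:", paste(my_stopwords, collapse = "|"), ")\\b")

# Remove the stopwords, leaving the surrounding spaces behind.
cleaned <- gsub(stopwords_regex, "", docs, perl = TRUE)

# Collapse runs of whitespace to one space, then strip leading/trailing space.
cleaned <- gsub("\\s+", " ", cleaned)
cleaned <- gsub("^\\s+|\\s+$", "", cleaned)

# cleaned[1] is "toast breakfast"
# cleaned[2] is "coffee morning excellent"
print(cleaned)
```

The trimming is done with gsub rather than trimws because trimws is not available in R releases as old as the 3.0.2 shown in the question.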
