Delete stop words in R

Problem description

I have a data frame with this structure:

Note.Reco  Review         Review.clean.lower
10         Good Products  good products
9          Nice film      nice film
....       ....

The first column is the rank of the film, the second column is the customer's review, and the third column is the review in lowercase letters.

I am now trying to remove stop words with this:

Data_clean$Raison.Reco.clean1 <- Corpus(VectorSource(Data_clean$Review.clean.lower))
Data_clean$Review.clean.lower1 <- tm_map(Data_clean$Review.clean.lower1, removeWords, stopwords("english"))

But RStudio crashes.

Can you help me resolve this problem, please?

Thanks

# Clean up: replace punctuation and digits with spaces, then lowercase
Data_clean$Review.clean.lower <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))

Data_corpus <- Corpus(VectorSource(Data_clean$Review.clean.lower))

Data_clean <- tm_map(Data_corpus,  removeWords, stopwords("french"))

train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]

So I get an error when I run the last two instructions.

Recommended answer

Try the following. You can do the cleaning on the corpus rather than directly on the column.

Data_corpus <- Corpus(VectorSource(Data_clean$Review.clean.lower))

Data_clean <- tm_map(Data_corpus, removeWords, stopwords("english"))
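
One thing to watch in the snippet above: assigning the result of tm_map to Data_clean replaces the data frame with a corpus, which has no rows and columns. That is a likely reason why Data_clean[train.index, ] fails in the question. The base-R sketch below illustrates the safer pattern of keeping the cleaned object under its own name and subsetting only the data frame. The toy data, the `cleaned` object (a plain list standing in for a corpus), and the index values are all made up for illustration.

```r
# Toy data frame standing in for Data_clean (values are made up).
Data_clean <- data.frame(
  Note.Reco = c(10, 9, 7, 8),
  Review.clean.lower = c("good products", "is a nice film",
                         "not the best", "a great story"),
  stringsAsFactors = FALSE
)

# Stand-in for the cleaned corpus: a list-like object that is not a data frame.
# It gets its own name, so Data_clean is never overwritten.
cleaned <- as.list(Data_clean$Review.clean.lower)

# Hypothetical split indices; the question does not show how they were built.
train.index <- c(1, 3)
test.index  <- c(2, 4)

# Row subsetting works because Data_clean is still a data frame.
train <- Data_clean[train.index, ]
test  <- Data_clean[test.index, ]

# The same two-index call on the list-like object raises an error,
# much as it does on a corpus.
res <- try(cleaned[train.index, ], silent = TRUE)
inherits(res, "try-error")
```

The design choice is simply to avoid reusing the name Data_clean for the corpus, so that anything expecting a data frame later in the script still gets one.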

As you mentioned, you want to be able to access the output after removing stop words. In that case, try the following instead of the corpus approach:

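library(tm)

# English stop word list shipped with tm
stopWords <- stopwords("en")

# Make sure the column is character, not factor
Data_clean$Review.clean.lower <- as.character(Data_clean$Review.clean.lower)

# "not in" operator: the negation of %in%
'%nin%' <- Negate('%in%')

# For each review: split on spaces, drop stop words, paste the rest back together
Data_clean$Review.clean.lower1 <- lapply(Data_clean$Review.clean.lower, function(x) {
  chk <- unlist(strsplit(x, " "))
  p <- chk[chk %nin% stopWords]
  paste(p, collapse = " ")
})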

Sample output from the code above:

>  print(Data_clean)
>       note Note.Reco.Review Review.clean.lower Review.clean.lower1
>     1   10    Good Products      good products       good products
>     2    9        Nice film     is a nice film           nice film
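
Note that lapply returns a list, so Review.clean.lower1 ends up as a list column. If a plain character column is preferred, a variant with vapply is one option. The sketch below is base R only: `stop_words` is a short made-up list standing in for stopwords("en"), and `remove_stop_words` is a hypothetical helper name, not a tm function. It also splits on runs of whitespace, so doubled spaces do not leave empty tokens behind.

```r
# Short illustrative stop word list (an assumption for this sketch);
# in practice tm::stopwords("en") would take its place.
stop_words <- c("is", "a", "the", "of", "and")

# Toy data frame standing in for Data_clean (values are made up).
Data_clean <- data.frame(
  Note.Reco = c(10, 9),
  Review.clean.lower = c("good products", "is a  nice film"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop stop words from a single string.
remove_stop_words <- function(text, words) {
  tokens <- unlist(strsplit(text, "\\s+"))   # split on runs of whitespace
  kept <- tokens[nzchar(tokens) & !(tokens %in% words)]
  paste(kept, collapse = " ")
}

# vapply guarantees one string per row, so the result is a character vector.
Data_clean$Review.clean.lower1 <- vapply(
  Data_clean$Review.clean.lower,
  remove_stop_words,
  character(1),
  words = stop_words,
  USE.NAMES = FALSE
)

print(Data_clean)
```

Because the new column is an ordinary character vector, it can be passed directly to functions that expect text, such as nchar or grepl, without unlisting first.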

Also check the following: R remove stop words from a character vector using %in%
