R: remove stopwords from a character vector using %in%
Question
I have a data frame of strings from which I'd like to remove stop words. I'm trying to avoid the tm package for the removal itself, because the data set is large and tm seems to run a bit slowly. I am still using the tm stopword dictionary.
library(plyr)
library(tm)
stopWords <- stopwords("en")
class(stopWords)
df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."
head(df1)
df1$string1 <- tolower(df1$string1)
str1 <- strsplit(df1$string1[5], " ")
> !(str1 %in% stopWords)
[1] TRUE
This is not the answer I'm looking for. I'm trying to get a vector or string of the words that are NOT in the stopWords vector.
What am I doing wrong?
Recommended answer
strsplit returns a list, and you are not accessing that list properly. You are also not getting the elements back from the result of %in% (which gives a logical vector of TRUE/FALSE); you need to use that logical vector to subset the words. You should do something like this:
unlist(str1)[!(unlist(str1) %in% stopWords)]
(or)
str1[[1]][!(str1[[1]] %in% stopWords)]
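To see why the original call collapses to a single value, it can help to inspect the object that strsplit returns. The following is a minimal sketch; stop_words is a short hand-written stand-in for stopwords("en") so the example runs without tm, and it is an assumption for illustration, not the real dictionary.

```r
# Hypothetical stand-in for tm::stopwords("en"), kept short for illustration.
stop_words <- c("this", "is", "a", "the", "of", "all", "other")

# strsplit() returns a list with one element per input string.
str1 <- strsplit("this string is the longest string of all the other strings.", " ")
class(str1)    # "list"
length(str1)   # 1

# %in% on the list tests the single list element as a whole,
# so the result is one logical value rather than one per word.
whole_list <- str1 %in% stop_words

# Extract the character vector first, then test each word.
words <- str1[[1]]
keep  <- !(words %in% stop_words)   # one TRUE/FALSE per word
kept  <- words[keep]                # subset the words with the logical vector
kept
```

The last two steps are what both one-liners above do: the first pulls the character vector out of the list, the second uses the logical vector as an index.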
For the whole data.frame df1, you could do something like:
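# Define a "not in" operator as the negation of %in%
'%nin%' <- Negate('%in%')
lapply(df1[,2], function(x) {
  # Split one string into words (named `words` to avoid masking base::t)
  words <- unlist(strsplit(x, " "))
  # Keep only the words that are not stop words
  words[words %nin% stopWords]
})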
# [[1]]
# [1] "string" "string."
#
# [[2]]
# [1] "string" "slightly" "string."
#
# [[3]]
# [1] "string" "string."
#
# [[4]]
# [1] "string" "slightly" "shorter" "string."
#
# [[5]]
# [1] "string" "string" "strings."
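If a cleaned string per row is wanted instead of a list of word vectors, the surviving words can be pasted back together and stored in a new column. This is a sketch under assumptions: remove_stop_words and the string1_clean column are hypothetical names introduced here, and stop_words is again a short stand-in for stopwords("en").

```r
# Hypothetical stand-in for tm::stopwords("en").
stop_words <- c("this", "is", "a", "an", "the", "of", "all", "other")

df1 <- data.frame(
  id = 1:2,
  string1 = c("this string is a string.",
              "this string is the longest string of all the other strings."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop stop words from each string and re-join the rest.
remove_stop_words <- function(x, stop_words) {
  tokens <- strsplit(x, " ", fixed = TRUE)   # list: one character vector per string
  vapply(
    tokens,
    function(w) paste(w[!(w %in% stop_words)], collapse = " "),
    character(1)                             # each result is a single string
  )
}

df1$string1_clean <- remove_stop_words(df1$string1, stop_words)
df1$string1_clean
```

Because the split is on spaces only, a word with trailing punctuation such as "string." is compared with its period attached, which matches the behaviour in the output above. Stripping punctuation first would be a separate step.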