R: remove stopwords from a character vector using %in%

Views: 23 | Posted: 2022/1/2 17:53:46 | Tags: r nlp subset tm stop-words

Question

I have a data frame with strings that I'd like to remove stop words from. I'm trying to avoid the text-processing functions of the tm package because the data set is large and tm seems to run a bit slowly. I am still using the tm stopword dictionary.

library(plyr)
library(tm)

stopWords <- stopwords("en")
class(stopWords)

df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."

head(df1)
df1$string1 <- tolower(df1$string1)
str1 <-  strsplit(df1$string1[5], " ")

> !(str1 %in% stopWords)
[1] TRUE

This is not the answer I'm looking for. I'm trying to get a vector or string of the words NOT in the stopWords vector.

What am I doing wrong?

Recommended answer

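You are not accessing the list properly, and you are not using the result of %in% (a logical vector of TRUE/FALSE) to get the elements back. strsplit() returns a list, so str1 is a list of length one, and str1 %in% stopWords tests that single list element as a whole, which is why you get a single TRUE. Extract the character vector first, then subset it with the logical vector. You should do something like this: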

unlist(str1)[!(unlist(str1) %in% stopWords)]

(or)

str1[[1]][!(str1[[1]] %in% stopWords)]
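To see why the original expression returned a single TRUE, it can help to inspect each step separately. The sketch below is illustrative only: it uses a small hand-written stop-word vector, `stop_words_demo` (a hypothetical name, chosen so the example runs without tm), instead of `stopwords("en")`.

```r
# Hypothetical small stop-word list so the example runs without tm
stop_words_demo <- c("this", "is", "a", "the", "of", "all", "other")

str1 <- strsplit("this string is a string.", " ")

class(str1)    # "list": strsplit() returns one list element per input string
length(str1)   # 1

# %in% on the list tests the single list element as a whole,
# so the result has length 1 instead of one value per word
length(str1 %in% stop_words_demo)

# Extract the character vector first, then build the logical mask
words <- str1[[1]]
keep  <- !(words %in% stop_words_demo)   # one TRUE/FALSE per word
kept_words <- words[keep]
kept_words
```

The logical mask `keep` has the same length as `words`, which is what makes it usable as a subsetting index.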

For the whole data.frame df1, you could apply the same idea to each row of the string column:

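# Define a "not in" operator by negating %in%
`%nin%` <- Negate(`%in%`)
lapply(df1$string1, function(x) {
    words <- unlist(strsplit(x, " "))   # split the string on spaces
    words[words %nin% stopWords]        # keep only the non-stop words
})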

# [[1]]
# [1] "string"  "string."
# 
# [[2]]
# [1] "string"   "slightly" "string." 
# 
# [[3]]
# [1] "string"  "string."
# 
# [[4]]
# [1] "string"   "slightly" "shorter"  "string." 
# 
# [[5]]
# [1] "string"   "string"   "strings."
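If one cleaned string per row is wanted rather than a list of word vectors, the kept words can be pasted back together. The following is a minimal sketch under stated assumptions, not part of the original answer: `stop_words_demo`, `df_demo`, `remove_stop_words`, and the `cleaned` column are hypothetical names, the stop-word list is a hand-written stand-in for `stopwords("en")`, and punctuation is stripped with `gsub()` so that tokens such as "string." lose their trailing period before matching.

```r
# Hypothetical stand-in for stopwords("en") so the example runs without tm
stop_words_demo <- c("this", "is", "a", "an", "the", "of", "all", "other")

df_demo <- data.frame(
  id = 1:2,
  string1 = c("This string is a string.",
              "This string is the longest string of all the other strings."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, drop punctuation, split on spaces,
# remove stop words, and paste the remaining words back into one string
remove_stop_words <- function(x, stop_words) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  paste(words[!(words %in% stop_words)], collapse = " ")
}

# vapply() guarantees exactly one character string per row
df_demo$cleaned <- vapply(df_demo$string1, remove_stop_words,
                          character(1), stop_words = stop_words_demo,
                          USE.NAMES = FALSE)
df_demo$cleaned
```

Using `vapply()` with `character(1)` rather than `lapply()` returns a plain character vector, which can be assigned directly to a data frame column.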
