Remove a list of whole words that may contain special chars from a character vector without matching parts of words

Published: 2020/11/21 18:34:58. Tags: r, regex, gsub, stringr


Problem description

I have a list of words in R, as shown below:

 myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

I want to remove every word found in the list above from a text such as this one:

 myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After removing the unwanted myList words, myText should look like this:

  This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I am currently using:

  stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")

But this does not help: it strips every character that is not a letter or whitespace instead of removing the listed words. What should I do?

Recommended answer

You can use a PCRE regex with the base R gsub function (the same pattern also works with the ICU regex engine used by str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

See the regex demo.

Details

\s* - 0 or more whitespace characters
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items from the character vector, i.e. the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
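
To illustrate why the two lookarounds matter, here is a minimal sketch that runs the pattern with and without them. The sample sentence and the variable names (txt, no_guard, guarded) are made up for this illustration and are not part of the original question.

```r
# Hypothetical sample text: "at" also occurs inside "that" and "matter"
txt <- "Look at that matter"

# Without lookarounds, "at" is also removed from inside longer words
no_guard <- gsub("\\s*(?:at)", "", txt, perl = TRUE)

# With lookarounds, only the standalone word "at" is removed
guarded <- gsub("\\s*(?<!\\w)(?:at)(?!\\w)", "", txt, perl = TRUE)

print(no_guard)  # "Look th mter"
print(guarded)   # "Look that matter"
```

The leading \s* is what removes the space in front of each deleted word, so no double spaces are left behind.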

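NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent. For example, \b placed before -1.00 only matches when the character to the left of the hyphen is a word character, so it fails when the item is preceded by a space.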
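
The difference can be checked with a short sketch. The sample string and variable names below are assumptions chosen for illustration only.

```r
# Hypothetical sample text where the item to remove starts with a non-word char
txt <- "not equal to -1.00."

# \b before "-" needs a word character on its left; here it is a space,
# so the pattern never matches and the text is returned unchanged
with_b <- gsub("\\s*\\b-1\\.00\\b", "", txt, perl = TRUE)

# The lookarounds only require that no word character is adjacent
with_look <- gsub("\\s*(?<!\\w)-1\\.00(?!\\w)", "", txt, perl = TRUE)

print(with_b)     # "not equal to -1.00."
print(with_look)  # "not equal to."
```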

See the R demo online:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
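# Prefix every regex metacharacter with a backslash so that it is matched literally
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
# Join the escaped items with | and wrap them in the whitespace and lookaround parts
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
# Print the resulting pattern for inspection
cat(pat, "\n")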
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
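
As a rough sketch of what the escaping step buys you, the example below applies the same escape_for_pcre helper to a made-up item list. The items "C++" and "a.b", the sample text, and the variable names are assumptions for illustration and do not come from the original question.

```r
# Same helper as in the answer: backslash-escape regex metacharacters
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }

# Hypothetical items containing regex metacharacters
items <- c("C++", "a.b")
escaped <- escape_for_pcre(items)
cat(escaped, sep = " | ")  # C\+\+ | a\.b

# Build the same kind of pattern as in the answer from the escaped items
pat <- paste0("\\s*(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")

# "axb" is kept because the dot in "a.b" is now matched literally
txt <- "We use C++ and a.b but not axb"
result <- gsub(pat, "", txt, perl = TRUE)
print(result)  # "We use and but not axb"
```

Without the escaping step, the unescaped dot in a.b would match any character, so axb would be removed as well.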
