Remove a list of whole words that may contain special chars from a character vector without matching parts of words


Question

I have a list of words in R as shown below:

 myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

I want to remove the words found in the above list from the text below:

 myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After removing the unwanted myList words, myText should look like:

  This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I am using:

  stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")

But this does not help: it replaces every character that is not a letter or whitespace with a space, instead of removing the listed words. What should I do?

Recommended answer

You can use a PCRE regex with the base R gsub function (the same pattern also works with the ICU regex engine used by str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

See the regex demo.

Details

  • \s* - 0 or more whitespace characters
  • (?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
  • (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items from the character vector, that is, the words you need to remove
  • (?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
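
To see how the two lookarounds keep the pattern from matching parts of longer words, here is a minimal sketch. The pattern is the one from the answer; the sentence in sample_text is a hypothetical example made up for illustration and is not part of the original question.

```r
# Pattern copied from the answer (backslashes doubled for an R string literal).
pat <- "\\s*(?<!\\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\\.00)(?!\\w)"

# Hypothetical sentence: "at" and "CL" appear both as whole words
# and inside longer words ("that", "cat", "attack", "CLEAN").
sample_text <- "Look at that cat, CL and CLEAN attack."

# perl = TRUE is required because lookarounds are a PCRE feature.
result <- gsub(pat, "", sample_text, perl = TRUE)
print(result)
# [1] "Look that cat, and CLEAN attack."
```

Only the standalone "at" and "CL" are removed, together with the whitespace in front of them; the occurrences inside "that", "cat", "attack" and "CLEAN" are rejected by the lookbehind or the lookahead.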

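NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent: it matches only between a word char and a non-word char (or the start/end of the string). For example, \b before the - in -1.00 fails when the item is preceded by a space, since both neighbouring characters are non-word chars.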
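
The context dependence of \b can be checked directly. The sketch below compares \b with the lookaround pair on the -1.00 item; the two test strings x and y are hypothetical examples chosen for illustration.

```r
# Hypothetical inputs: "-1.00" preceded by a space, and glued to a letter.
x <- "equal to -1.00."
y <- "x-1.00"

# With \b: a boundary needs a word char on exactly one side of the position.
b_space  <- grepl("\\b-1\\.00\\b", x, perl = TRUE)  # space then "-": no boundary
b_letter <- grepl("\\b-1\\.00\\b", y, perl = TRUE)  # "x" then "-": boundary exists

# With lookarounds: only require that no word char touches the item.
la_space  <- grepl("(?<!\\w)-1\\.00(?!\\w)", x, perl = TRUE)
la_letter <- grepl("(?<!\\w)-1\\.00(?!\\w)", y, perl = TRUE)

print(c(b_space = b_space, b_letter = b_letter,
        la_space = la_space, la_letter = la_letter))
#   b_space  b_letter  la_space la_letter
#     FALSE      TRUE      TRUE     FALSE
```

So \b misses the standalone item and accepts the one attached to a letter, which is the opposite of what is wanted here, while the lookarounds behave as intended.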

See the R demo online:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")  # print the generated pattern; cat() has no collapse argument
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
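
The answer mentions that the same pattern also works with the ICU engine in str_replace_all. A minimal sketch of that variant follows, assuming the stringr package is installed; the call is guarded so the snippet still runs without it. The variable names base_result and icu_result are introduced here for illustration only.

```r
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
pat <- "\\s*(?<!\\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\\.00)(?!\\w)"

# Base R with the PCRE engine, as in the demo above.
base_result <- gsub(pat, "", myText, perl = TRUE)
print(base_result)

# stringr (ICU engine); skipped when the package is not available.
if (requireNamespace("stringr", quietly = TRUE)) {
  icu_result <- stringr::str_replace_all(myText, pat, "")
  # Check that both engines agree on this input.
  print(identical(icu_result, base_result))
}
```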

Details

  • escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern by prefixing each one with a backslash
  • paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
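
The two steps above can be wrapped into a reusable helper. This is a sketch under stated assumptions, not the answer's own code: the names escape_for_regex and remove_whole_words are hypothetical, the escaping step is a variant that uses perl = TRUE with an explicitly escaped character class, and sorting the words by decreasing length is an added precaution so that a longer item (such as "C++") is tried before a shorter item that is its prefix (such as "C").

```r
# Escape regex metacharacters (variant of the answer's helper; hypothetical name).
escape_for_regex <- function(s) {
  gsub("([.|()\\\\^{}+$*?\\[\\]])", "\\\\\\1", s, perl = TRUE)
}

# Remove whole "words" (which may contain special chars) from text.
remove_whole_words <- function(text, words) {
  # Assumption: try longer items first so "C++" wins over "C".
  words <- words[order(nchar(words), decreasing = TRUE)]
  alternatives <- paste(escape_for_regex(words), collapse = "|")
  pat <- paste0("\\s*(?<!\\w)(?:", alternatives, ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

# Hypothetical inputs containing regex metacharacters.
r1 <- remove_whole_words("I like C++ and a.b but not axb.", c("C++", "a.b"))
r2 <- remove_whole_words("Use C++ or C here.", c("C", "C++"))
print(r1)
# [1] "I like and but not axb."
print(r2)
# [1] "Use or here."
```

Because gsub is vectorized over its input, escape_for_regex can be applied to the whole word vector at once, without sapply.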
