Remove a list of whole words that may contain special chars from a character vector without matching parts of words
Problem description
I have a list of words in R as shown below:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
I want to remove the words found in the above list from the text below:
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
After removing the unwanted myList words, myText should look like this:
This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.
I am using:
stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")
But this is not helping me. What should I do?
Recommended answer
You may use a PCRE regex with the base R gsub function (it will also work with the ICU regex engine in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
See the regex demo.
Details
- \s* - 0 or more whitespaces
- (?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
- (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items from the character vector with the words you need to remove
- (?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
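To make the effect of the two lookarounds concrete, here is a minimal sketch that applies the pattern to a made-up sentence (the input string is an assumption for illustration, not part of the original question). The words "cat", "axe", "CLI" and "C1000" only contain list items, so they should be left alone:

```r
# The pattern from the answer, written as an R string literal (backslashes doubled).
pat <- "\\s*(?<!\\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\\.00)(?!\\w)"

# Made-up input: "cat", "axe", "CLI" and "C1000" merely contain list items;
# only the standalone "at" and "C100" are whole-word matches.
x <- "The cat saw an axe at the CLI near C1000 and C100"

res <- gsub(pat, "", x, perl = TRUE)
print(res)
## => [1] "The cat saw an axe the CLI near C1000 and"
```

The leading \s* consumes the space before each removed word, which is why no double spaces are left behind.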
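NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start/end with non-word characters, and the meaning of \b is context-dependent: it only matches between a word char and a non-word char (or the start/end of the string), so it does not match between a space and the leading - of -1.00.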
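The difference can be checked with a small sketch. The input fragment is taken from myText; the variable names are illustrative only:

```r
x <- "is not equal to -1.00."

# \b requires a word char on exactly one side of the position.
# Before "-1.00" there is a space followed by "-", both non-word chars,
# so the leading \b cannot match and nothing is removed.
with_b <- gsub("\\b-1\\.00\\b", "", x, perl = TRUE)

# The lookarounds only require that no word char is adjacent,
# so the item is matched and removed.
with_lookarounds <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", x, perl = TRUE)

print(with_b)
## => [1] "is not equal to -1.00."
print(with_lookarounds)
## => [1] "is not equal to ."
```

Adding the \s* prefix from the full pattern would also remove the space left before the final period.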
See the R demo online:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
- escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
- paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated alternative list from the search term vector.
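If the same cleanup is needed in several places, the steps above can be wrapped in a small helper. This is only a sketch: remove_words is a hypothetical name, not a base R or stringr function, the guard for an empty word list is an added assumption, and the escaping function is the one from the answer:

```r
# Escaping function from the answer (gsub is vectorized, so it accepts a vector).
escape_for_pcre <- function(s) {
  gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)
}

# Hypothetical helper: removes whole-word occurrences of `words` from `text`.
remove_words <- function(text, words) {
  # Assumption: with no words to remove, return the text unchanged
  # (an empty alternation group would otherwise match empty strings).
  if (length(words) == 0) return(text)
  alternatives <- paste(escape_for_pcre(words), collapse = "|")
  pat <- paste0("\\s*(?<!\\w)(?:", alternatives, ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

out <- remove_words(
  c("Gm at Kg here", "Price C100 costs -1.00 today"),
  c("at", "Kg", "C100", "-1.00")
)
print(out)
## => [1] "Gm here"           "Price costs today"
```

Note that when a removed word sits at the very start of the string, the space that followed it remains; wrapping the result in trimws() is one way to clean that up.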