Keep only alphanumeric characters and spaces in a string using gsub


Question

I have a string that contains alphanumeric characters, special characters and non-UTF-8 characters. I want to strip the special and non-UTF-8 characters.

This is what I tried:

gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")

However, this removes the special characters (punctuation and non-UTF-8), but the output has no spaces.

gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")

The result has spaces, but non-UTF-8 characters are still present.

Is there a way to fix this?

For the sample string above, the output should be: Sample string here

Recommended answer

You can use the [:alnum:] and [:space:] classes for this:

sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Alternatively, you can use PCRE codes to refer to specific character sets:

gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Both cases show clearly that the characters left over are considered letters. The E, B, H and P in the trailing part are letters too, so the condition you are replacing on is not correct. You don't want to keep all letters, only A-Z, a-z and 0-9:

gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
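
Because the sample string contains bytes that print differently depending on the console's encoding, the result above can be hard to reproduce. As a minimal sketch, the same replacement can be tried on an ASCII-only stand-in. The string `ascii_sample` is hypothetical, made up for illustration, and is not part of the original question:

```r
# Hypothetical ASCII-only stand-in for the garbled sample string
ascii_sample <- "=+ Sample 2 string here ={>E<BH>P<]"

# Delete every character that is not an unaccented letter, a digit or a space
cleaned <- gsub("[^A-Za-z0-9 ]", "", ascii_sample)
print(cleaned)
#> [1] " Sample 2 string here EBHP"
```

The stray letters that sat between the punctuation survive here as well, because deleting single characters cannot tell them apart from the letters of the wanted phrase.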

This still contains the EBHP. If you really want to keep just the section that contains only letters and numbers, reverse the logic: select what you want and replace everything else, using a backreference:

gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "

Or, if you want to find the string even when it is not bounded by spaces, use the word boundary \\b instead:

gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"

What happens here:

  • .*? fits anything (.) zero or more times (*), but ungreedily (?). This means gsub will try to match as little as possible with this piece.
  • Everything between ( and ) is stored and can be referred to in the replacement as \\1.
  • \\b indicates a word boundary.
  • This is followed, at least once (+), by any character that is A-Z, a-z, 0-9 or a space. You have to write the ranges out like this because other characters sit between the uppercase and lowercase letters in the code table, so a single range A-z would match more than plain letters and could include the special letters (which are valid UTF-8, by the way!).
  • After that sequence, .* fits anything zero or more times, to remove the rest of the string.
  • The backreference \\1, combined with the .* parts of the regex, makes sure only the required part remains in the output.
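
The pieces described above can also be tried in isolation. The sketch below is an alternative and not the answer's own method: it uses base R's regexpr() and regmatches() to extract the first matching run directly instead of replacing everything around it, and it sets perl = TRUE so the pattern is handled by PCRE. The string `ascii_sample` is a hypothetical ASCII-only stand-in for the sample string.

```r
# Hypothetical ASCII-only stand-in for the garbled sample string
ascii_sample <- "=+ Sample 2 string here ={>E<BH>P<]"

# Same core pattern as in the answer, without the surrounding .*? and .*
pattern <- "\\b[A-Za-z0-9 ]+\\b"

# regexpr() locates the first match, regmatches() pulls it out
first_run <- regmatches(ascii_sample, regexpr(pattern, ascii_sample, perl = TRUE))
print(first_run)
#> [1] "Sample 2 string here"

# Why the two ranges are spelled out: in PCRE, A-z also covers the
# characters that lie between "Z" and "a", such as the underscore
wide_range_hits_underscore <- grepl("[A-z]", "_", perl = TRUE)
split_range_hits_underscore <- grepl("[A-Za-z]", "_", perl = TRUE)
print(wide_range_hits_underscore)
#> [1] TRUE
print(split_range_hits_underscore)
#> [1] FALSE
```

Extracting the match keeps the pattern shorter, but it returns a zero-length vector when nothing matches, whereas gsub returns the input unchanged in that case.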
