How to match all internationalized text?

Question

I'm on a search-and-destroy mission for anything Amazon finds distasteful. In the past I've dealt with this by using iconv to convert from "UTF-8" to "latin1", but I can't do that here because it's encoded as "unknown":

test <- "Gwena\xeblle M"
> gsub("\xeb","", df[306,"primauthfirstname"] )
[1] "Gwenalle M"
> Encoding(df[306,"primauthfirstname"])
[1] "unknown"

So what regex eliminates all the \x## codes?

Recommended answer

I believe this pattern should work:

pat <- "[\x80-\xFF]"

test <- c("Gwena\xeblle M", "\x92","\xe4","\xe1","\xeb") 
gsub(pat, "", test, perl=TRUE)
# [1] "Gwenalle M" ""           ""           ""           ""     
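
In a UTF-8 session, R may refuse to process these strings because the high bytes are not valid in the current locale. A minimal sketch, assuming that is the situation, adds useBytes = TRUE so that matching is done byte by byte rather than character by character. The variable name cleaned is only illustrative and is not part of the original answer.

```r
# Same pattern as above: any single byte in the range 0x80-0xFF.
pat <- "[\x80-\xFF]"

test <- c("Gwena\xeblle M", "\x92", "\xe4", "\xe1", "\xeb")

# useBytes = TRUE makes gsub() compare raw bytes, so strings that are
# not valid in the session's encoding can still be processed.
cleaned <- gsub(pat, "", test, perl = TRUE, useBytes = TRUE)
print(cleaned)
```

Every byte outside the ASCII range is dropped here, so legitimately accented letters are lost along with the stray codes.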

Explanation:

It works because the character class "[\x00-\xFF]" would match every single-byte character of the form \x##. The first half of that range, bytes 0 to 127 (00 to 7F in hex), are the ASCII characters. So it is the second half, bytes 128 to 255 (80 to FF in hex), that you want to search out and destroy.
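
The split between the two halves of the byte range can be checked directly on the example string. The sketch below assumes nothing beyond base R; the names bytes, high, and ascii_only are illustrative. It uses charToRaw() to expose the underlying bytes and keeps only those below 0x80.

```r
x <- "Gwena\xeblle M"

# charToRaw() returns the string's bytes regardless of declared encoding.
bytes <- charToRaw(x)

# TRUE for bytes in the upper half of the range (128 to 255).
high <- as.integer(bytes) >= 0x80

print(bytes[high])                      # the offending byte(s)
ascii_only <- rawToChar(bytes[!high])   # rebuild the string from ASCII bytes
print(ascii_only)
```

This does the same job as the regex by hand, which makes it easy to see which bytes are being discarded.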
