How to match all internationalized text?

Views: 76 | Published: 2020/11/30 0:12:54 | Tags: regex, r, internationalization

Question

I'm on a search-and-destroy mission for anything Amazon finds distasteful. In the past I've dealt with this by using iconv to convert from "UTF-8" to "latin1", but I can't do that here because the encoding is reported as "unknown":

test <- "Gwena\xeblle M"
> gsub("\xeb","", df[306,"primauthfirstname"] )
[1] "Gwenalle M"
> Encoding(df[306,"primauthfirstname"])
[1] "unknown"

So what regex eliminates all the \x## codes?

Recommended answer

I believe this pattern should work:

pat <- "[\x80-\xFF]"

test <- c("Gwena\xeblle M", "\x92","\xe4","\xe1","\xeb") 
gsub(pat, "", test, perl=TRUE)
# [1] "Gwenalle M" ""           ""           ""           ""
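
Depending on the locale, the call above may stop with an invalid-input error instead of returning the result shown, because these strings are not valid UTF-8. A minimal sketch of a byte-wise variant follows. It assumes the useBytes argument of gsub, which is not part of the original answer, and writes the range as regex-level escapes so that the pattern string itself stays plain ASCII. The name cleaned is only illustrative.

```r
# Pattern written with doubled backslashes: the regex engine (PCRE, via
# perl = TRUE) interprets \x80 and \xFF, so the R string holds only ASCII.
pat <- "[\\x80-\\xFF]"

test <- c("Gwena\xeblle M", "\x92", "\xe4", "\xe1", "\xeb")

# useBytes = TRUE asks for byte-by-byte matching, so the input is not
# checked for validity in the current locale's encoding.
cleaned <- gsub(pat, "", test, perl = TRUE, useBytes = TRUE)
print(cleaned)
```

The inputs and the intended result are the same as in the answer; only the matching mode differs.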

Explanation:

It works because the character class "[\x00-\xFF]" would match every byte that can be written in the form \x##. The first half of those, bytes 0 through 127 (00 through 7F in hex), are the ASCII characters. So it is the second half, bytes 128 through 255 (80 through FF in hex), that you want to search out and destroy.
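
The same split between the two halves of the byte range can be shown without a regex at all. The sketch below is not from the original answer: strip_high_bytes is a hypothetical helper that converts each string to its raw bytes with charToRaw, keeps only the bytes below 0x80 (the ASCII half), and rebuilds the string with rawToChar.

```r
# Hypothetical helper (an illustration, not the answer's method):
# drop every byte in the range 0x80-0xFF from each element of x.
strip_high_bytes <- function(x) {
  vapply(x, function(s) {
    bytes <- charToRaw(s)                    # raw bytes of the string
    rawToChar(bytes[bytes < as.raw(0x80)])   # keep the ASCII half only
  }, character(1), USE.NAMES = FALSE)
}

test <- c("Gwena\xeblle M", "\x92", "\xe4", "\xe1", "\xeb")
print(strip_high_bytes(test))
# [1] "Gwenalle M" ""           ""           ""           ""
```

Because it works on raw bytes, this version does not depend on the declared encoding of the strings, which is convenient when Encoding() reports "unknown".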
