Searching for non-ASCII characters

Problem description

I have a file, a.out, which contains a number of lines. Each line is one character only: either the Unicode character U+2013 or a lowercase letter a-z.

Running the file command on a.out reports UTF-8 Unicode text.

The locale command reports:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

If I issue the command grep -P -n "[^\x00-\xFF]" a.out, I would expect only the lines containing U+2013 to be returned, and that is what happens when I run the test under Cygwin. The problem environment, however, is Oracle Linux Server release 6.5, and there the grep command returns no lines. If I issue grep -P -n "[\x00-\xFF]" a.out, then all lines are returned.
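For context on the two patterns: U+2013 lies above U+00FF as a code point, but its UTF-8 encoding is the three bytes E2 80 93, each of which falls inside the range \x00-\xFF. One possible reading, which is an assumption and not something the question confirms, is that a grep -P build comparing raw bytes instead of decoded characters would produce exactly the results described. The minimal sketch below only displays the byte sequence; the variable names dash and bytes are illustrative.

```shell
# U+2013 written as its three UTF-8 bytes (octal escapes, since POSIX
# printf does not guarantee support for \x hex escapes).
dash=$(printf '\342\200\223')

# Dump the bytes in hex; od pads its output with spaces, so strip whitespace.
bytes=$(printf '%s' "$dash" | od -An -tx1 | tr -d ' \n')

# Show the hex form of the encoded character.
printf '%s\n' "$bytes"
```

Every byte shown is at most FF, so a byte-oriented match against [\x00-\xFF] would succeed on every line and [^\x00-\xFF] would succeed on none.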

I realise that "[grep -P] ... is highly experimental and grep -P may warn of unimplemented features", but no warnings are issued.

Am I missing something?

Recommended answer

I recommend avoiding dodgy grep -P implementations and using the real thing. This works:

perl -CSD -nle 'print "$.: $_" if /\P{ASCII}/' utfile1 utfile2 utfile3 ...

where:

  • The -CSD option says that both the stdio trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.

  • $. represents the current record (line) number.

  • $_ represents the current line.

  • \P{ASCII} matches any code point that is not ASCII.
