Searching for non-ASCII characters
I have a file, a.out, which contains a number of lines. Each line is one character only, either the Unicode character U+2013 or a lowercase letter a-z.
Running the file command on a.out reports: UTF-8 Unicode text.
The locale command reports:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
If I issue the command grep -P -n "[^\x00-\xFF]" a.out I would expect only the lines containing U+2013 to be returned, and this is the case if I carry out the test under Cygwin. The problem environment, however, is Oracle Linux Server release 6.5, and there the grep command returns no lines. If I issue grep -P -n "[\x00-\xFF]" a.out then all lines are returned.
I realise that "[grep -P] ...is highly experimental and grep -P may warn of unimplemented features", but no warnings are issued.
Am I missing something?
I recommend avoiding dodgy grep -P implementations and using the real thing. This works:
perl -CSD -nle 'print "$.: $_" if /\P{ASCII}/' utfile1 utfile2 utfile3 ...
Where:
The -CSD option says that both the stdio trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.
$. represents the current record (line) number.
$_ represents the current line.
\P{ASCII} matches any code point that is not ASCII.
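One plausible explanation for the empty result, which nothing in the question confirms, is that the grep build on that system matched the pattern against raw bytes rather than decoded code points. In byte mode every byte falls inside \x00-\xFF, so the negated class can never match. The sketch below illustrates that difference in Python (a swapped-in language, not the grep or perl implementation); the sample lines and variable names are assumptions made for illustration.

```python
import re

# Hypothetical sample mirroring the question: one character per line.
lines = ["a", "\u2013", "b"]

# Code-point mode: U+2013 lies above U+00FF, so the negated class matches it.
char_pattern = re.compile(r"[^\x00-\xFF]")
char_hits = [n for n, text in enumerate(lines, 1) if char_pattern.search(text)]

# Byte mode: U+2013 is encoded as bytes E2 80 93, all within 00-FF,
# so the same negated class matches nothing.
byte_pattern = re.compile(rb"[^\x00-\xFF]")
byte_hits = [n for n, text in enumerate(lines, 1)
             if byte_pattern.search(text.encode("utf-8"))]

# Testing against the ASCII range instead, which is what \P{ASCII} expresses.
non_ascii = re.compile(r"[^\x00-\x7F]")
non_ascii_hits = [n for n, text in enumerate(lines, 1) if non_ascii.search(text)]

print(char_hits, byte_hits, non_ascii_hits)
```

Note that [^\x00-\xFF] excludes the whole Latin-1 range, not just ASCII, so even in code-point mode it would skip characters such as U+00E9; a class based on \x00-\x7F, or \P{ASCII} in perl, is the closer fit for "non-ASCII".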