Searching for non-ASCII characters

Published: 2018/5/28 19:22:54. Tags: linux, unicode, grep


Problem description

I have a file, a.out, which contains a number of lines. Each line holds a single character: either the Unicode character U+2013 or a lower-case letter a-z.

Running the file command on a.out reports UTF-8 Unicode text.

The locale command reports:

    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

If I issue the command grep -P -n "[^\x00-\xFF]" a.out I would expect only the lines containing U+2013 to be returned. And this is the case if I carry out the test under cygwin. The problem environment however is Oracle Linux Server release 6.5 and the issue is that the grep command returns no lines. If I issue grep -P -n "[\x00-\xFF]" a.out then all lines are returned.
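
One possible reading of the two results above, offered as an assumption rather than a diagnosis of that particular grep build: if the pattern is applied byte by byte instead of code point by code point, every byte of the UTF-8 encoding of U+2013 (E2 80 93) lies inside \x00-\xFF, so the negated class can never match, while the non-negated class matches every line. The Python sketch below only illustrates that difference; the helper names are made up for this example.

```python
import re

EN_DASH = "\u2013"  # the non-ASCII character from the question


def matches_as_code_points(line: str) -> bool:
    # Pattern applied to decoded text: U+2013 is above U+00FF, so it matches.
    return re.search(r"[^\x00-\xFF]", line) is not None


def matches_as_bytes(line: str) -> bool:
    # Pattern applied to the raw UTF-8 bytes: every byte is within 0x00-0xFF,
    # so the negated class cannot match anything.
    return re.search(rb"[^\x00-\xFF]", line.encode("utf-8")) is not None


if __name__ == "__main__":
    for line in (EN_DASH, "a"):
        print(repr(line), matches_as_code_points(line), matches_as_bytes(line))
```

Whether the grep on Oracle Linux 6.5 really behaves like the byte-wise variant is not confirmed here; the sketch only shows that such behaviour would be consistent with what the question reports.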

I realise that "[grep -P]...is highly experimental and grep -P may warn of unimplemented features." but no warnings are issued.

Am I missing something?
Solution
I recommend avoiding dodgy grep -P implementations and using the real thing. This works:

    perl -CSD -nle 'print "$.: $_" if /\P{ASCII}/' utfile1 utfile2 utfile3 ...

Where:

The -CSD option says that both the stdio trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.

The $. represents the current record (line) number.

The $_ represents the current line.

The \P{ASCII} matches any code point that is not ASCII.
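
For readers without Perl at hand, the same idea can be sketched in Python. This is an illustrative equivalent and not part of the original answer: the function name find_non_ascii_lines is hypothetical, the files are assumed to be valid UTF-8, and str.isascii() needs Python 3.7 or later.

```python
import sys


def find_non_ascii_lines(path):
    """Yield (line_number, line) for lines containing any non-ASCII code point."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            # str.isascii() is True only if every code point is below U+0080.
            if not text.isascii():
                yield number, text


if __name__ == "__main__":
    # Usage (hypothetical script name): python find_non_ascii.py utfile1 utfile2 ...
    for name in sys.argv[1:]:
        for number, text in find_non_ascii_lines(name):
            print(f"{number}: {text}")
```

One difference worth noting: this sketch restarts line numbering for each file, whereas $. in the Perl one-liner keeps counting across the files named on the command line.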
