Searching for non-ASCII characters


Problem description

I have a file, a.out, which contains a number of lines. Each line holds a single character: either the Unicode character U+2013 (EN DASH) or a lowercase letter a-z.

Running the file command on a.out reports UTF-8 Unicode text.

The locale command reports:

    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
    

If I run grep -P -n "[^\x00-\xFF]" a.out, I would expect only the lines containing U+2013 to be returned, and that is what happens when I test under Cygwin. The problem environment, however, is Oracle Linux Server release 6.5, where the same grep command returns no lines. If I run grep -P -n "[\x00-\xFF]" a.out instead, all lines are returned.

I realise that "[grep -P] ... is highly experimental and grep -P may warn of unimplemented features", but no warnings are issued.

Am I missing something?
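
For anyone wanting to reproduce the setup, a file like the one described can be built with printf. This is only a sketch: it assumes a POSIX shell with od and tr available, and sample.txt is a hypothetical stand-in for a.out. The dump shows that U+2013 is stored as the three bytes e2 80 93, each of which falls inside \x00-\xFF. That would be consistent with a grep -P build that compares bytes rather than code points, though the question itself does not confirm the cause.

```shell
# Build a three-line sample: "a", U+2013 (EN DASH), "b".
# Octal escapes are used because \x escapes are not portable in printf.
printf 'a\n\342\200\223\nb\n' > sample.txt

# Dump the file as hex bytes and squeeze the whitespace onto one line.
bytes=$(od -An -tx1 sample.txt | tr -s ' \n' ' ')
echo "$bytes"
```

The sequence e2 80 93 in the output is the UTF-8 encoding of U+2013; the surrounding 61, 62 and 0a bytes are the letters and newlines.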

Solution

I recommend avoiding dodgy grep -P implementations and using the real thing, Perl itself. This works:

    perl -CSD -nle 'print "$.: $_" if /\P{ASCII}/' utfile1 utfile2 utfile3 ...
    

Where:

  • The -CSD option says that both the stdio trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.

  • $. represents the current record (line) number.

  • $_ represents the current line.

  • \P{ASCII} matches any code point that is not ASCII.
