Does UTF-8 encoding mess up file globbing and grep'ing?

Problem description


I'm playing with bash, experimenting with UTF-8 encoding. I'm new to Unicode. The following commands (well, their output) surprise me:

$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=
$ printf '1\né\n12\n123\n' | egrep '^(.|...)$'
1
é
12
$ touch 1 é 12 123
$ ls | egrep '^(.|...)$'
1
123

OK. Both egrep commands filter for lines with one or three characters. Their input is quite similar, but the output differs for the character é. Any explanation?

More details on my environment:

$ uname -a
Darwin macbook-pro-de-admin-6.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
$ egrep -V
egrep (GNU grep) 2.5.1

Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Solution

Any variable-length encoding can confuse tools that are not aware of the encoding: when you use single-character wildcards, such a tool matches bytes, not characters, because it assumes that one byte equals one character. If you use literal characters, this doesn't matter for UTF-8, since the structure of UTF-8 prevents matches in the middle of a character (assuming proper encoding).
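As a rough illustration of the byte-versus-character distinction, here is a minimal Python sketch. Python's `re` module stands in for an encoding-unaware tool when it is given `bytes`, and for an encoding-aware one when it is given `str`; this is an assumption made for the demonstration, not a description of how grep is implemented.

```python
import re

PATTERN = r"(.|...)"                 # one or three "characters", as in the egrep call

precomposed = "\u00e9"               # é as a single code point (U+00E9)
raw = precomposed.encode("utf-8")    # b'\xc3\xa9', two bytes

# Byte-oriented matching: "." means one byte, so the 2-byte é matches neither branch.
byte_match = re.fullmatch(PATTERN.encode("ascii"), raw) is not None

# Character-oriented matching: "." means one code point, so é matches the first branch.
char_match = re.fullmatch(PATTERN, precomposed) is not None

# A literal pattern still works at byte level: the bytes of é are found in "café".
literal_match = re.search(re.escape(raw), "caf\u00e9".encode("utf-8")) is not None

print(raw.hex(" "), byte_match, char_match, literal_match)
# prints: c3 a9 False True True
```

Only the wildcard case changes between the two modes; the literal search behaves the same either way.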

At least some versions of grep are supposed to be UTF-8 aware. According to http://mailman.uib.no/public/corpora/2006-December/003760.html, that includes GNU grep 2.5.1 and later, as long as an appropriate LANG is set. If you use an older version, or something other than GNU grep, that is likely the cause of your problem, since é is a two-byte character in UTF-8 (0xC3 0xA9).

EDIT: Based on your recent comment, your grep is probably Unicode-aware, but it does not perform any sort of Unicode normalization (and I wouldn't really expect it to, to be honest).

0x65 0xCC 0x81 is an e followed by COMBINING ACUTE ACCENT (U+0301). These are two code points, but they are rendered as a single glyph because of the semantics of combining characters. grep therefore counts two characters: one for the e and one for the accent.
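To make that byte sequence concrete, here is a small Python sketch (Python is used only for illustration, and the helper name `one_or_three` is hypothetical). It decomposes é with `unicodedata.normalize` and applies the same one-or-three-characters test to both forms.

```python
import re
import unicodedata

precomposed = "\u00e9"                                   # é, one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # "e" + U+0301
decomposed_bytes = decomposed.encode("utf-8")            # 0x65 0xCC 0x81

def one_or_three(s):
    """Mimic egrep '^(.|...)$', counting code points."""
    return re.fullmatch(r"(.|...)", s) is not None

print(len(precomposed), one_or_three(precomposed))
# prints: 1 True
print(len(decomposed), decomposed_bytes.hex(" "), one_or_three(decomposed))
# prints: 2 65 cc 81 False
```

The decomposed form is two code points long, so it falls between the "one" and "three" branches of the pattern and is filtered out.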

It seems likely that decomposed Unicode is how the file name is actually stored in your file system. Otherwise, you could store files that, for all intents and purposes, have the exact same name and differ only in their use of combining characters.
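If the file system does return decomposed names, one possible workaround is to normalize the names to NFC before matching. The sketch below assumes the names have already been read into a list. `filter_names` and the sample listing are hypothetical, and the egrep pattern is modelled with Python's `re` rather than by calling grep.

```python
import re
import unicodedata

def filter_names(names, pattern=r"(.|...)"):
    """Hypothetical helper: NFC-normalize each name, then keep full matches."""
    kept = []
    for name in names:
        composed = unicodedata.normalize("NFC", name)
        if re.fullmatch(pattern, composed):
            kept.append(composed)
    return kept

# Names as a decomposing file system might return them (é stored as e + U+0301).
listing = ["1", "12", "123", "e\u0301"]
result = filter_names(listing)
print(result)
# prints: ['1', '123', 'é']
```

After normalization, é counts as a single character again and passes the one-or-three-characters filter.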
