Different UTF-8 encoding in filenames on OS X
I have a small shell script in .x:
$ cat .x
u="Böhmáí"
touch "$u"
ls > .list
echo "$u" >.text
cat .list .text
diff .list .text
od -bc .list
od -bc .text
When I run this script with sh -x .x
(-x is only there to show the commands as they run):
$ sh -x .x
+ u=Böhmáí
+ touch Böhmáí
+ ls
+ echo Böhmáí
+ cat .list .text
Böhmáí
Böhmáí
+ diff .list .text
1c1
< Böhmáí
---
> Böhmáí
+ od -bc .list
0000000 102 157 314 210 150 155 141 314 201 151 314 201 012
B o ̈ ** h m a ́ ** i ́ ** \n
0000015
+ od -bc .text
0000000 102 303 266 150 155 303 241 303 255 012
B ö ** h m á ** í ** \n
0000012
The same string, Böhmáí, is encoded into different bytes in the filename than in the contents of a file. In the terminal (UTF-8 encoded) the string looks the same in both variants.
Where is the rabbit?
(This is mostly stolen from a previous answer of mine...)
Unicode allows some accented characters to be represented in several different ways: as a single "code point" representing the accented character, or as a series of code points representing the unaccented version of the character, followed by the accent(s). For example, "ä" could be represented either precomposed as U+00E4 (UTF-8 0xc3a4, Latin small letter a with diaeresis) or decomposed as U+0061 U+0308 (UTF-8 0x61cc88, Latin small letter a + combining diaeresis).
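To make the two representations concrete, here is a minimal sketch that builds both byte sequences for "Böhmáí" from the octal values in the od output above and compares them. The helper names hex and len and the variable names are my own, chosen only for illustration.

```shell
#!/bin/sh
# Precomposed form: one code point per accented letter (the bytes seen in .text).
nfc=$(printf 'B\303\266hm\303\241\303\255')
# Decomposed form: base letter followed by a combining accent (the bytes seen in .list).
nfd=$(printf 'Bo\314\210hma\314\201i\314\201')

# Hypothetical helpers: dump a string as hex, and count its bytes.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }
len() { printf '%s' "$1" | wc -c | tr -d ' '; }

nfc_hex=$(hex "$nfc"); nfc_len=$(len "$nfc")
nfd_hex=$(hex "$nfd"); nfd_len=$(len "$nfd")

echo "precomposed: $nfc_hex ($nfc_len bytes)"
echo "decomposed:  $nfd_hex ($nfd_len bytes)"

# The shell compares the strings byte by byte, so they are not equal
# even though a terminal renders them identically.
if [ "$nfc" = "$nfd" ]; then same=yes; else same=no; fi
echo "byte-identical: $same"
```

The byte counts (9 and 12) match the od offsets in the question once the trailing newline is added: 0000012 and 0000015 in octal are 10 and 13.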
OS X's HFS+ filesystem requires that all filenames be stored in the UTF-8 representation of their fully decomposed form. In an HFS+ filename, "ä" MUST be encoded as 0x61cc88, and "ö" MUST be encoded as 0x6fcc88.
So what's happening here is that your shell script contains "Böhmáí" in precomposed form, so it gets stored that way in the variable u, and stored that way in the .text file. But when you create a file with that name (with touch), the filesystem converts it to the decomposed form for the actual filename. And when you ls it, it shows the form the filesystem has: the decomposed form.
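One way to check this on a given machine is to create a file with a precomposed name in a scratch directory and compare the bytes that ls reports with the bytes that were requested. This is only a sketch: the variable names and the hex helper are made up for illustration, and it assumes mktemp -d is available. On HFS+ it should report "rewritten"; a filesystem that stores names as given will report "unchanged".

```shell
#!/bin/sh
# Hypothetical helper: dump a string as hex.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

# Precomposed bytes for "Böhmáí", as in the script's variable u.
name=$(printf 'B\303\266hm\303\241\303\255')

# Work in a fresh scratch directory so ls returns exactly one name.
dir=$(mktemp -d) || exit 1
touch "$dir/$name"

# When its output is not a terminal, ls prints the raw bytes stored by the filesystem.
listed=$(ls "$dir")

if [ "$(hex "$listed")" = "$(hex "$name")" ]; then
    result="unchanged"
else
    result="rewritten"
fi

echo "requested: $(hex "$name")"
echo "listed:    $(hex "$listed")"
echo "filename was $result by the filesystem"

rm -rf "$dir"
```

If the result is "rewritten", any script that compares ls output against a literal string needs to bring both sides to the same form before comparing.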