Different UTF-8 encoding in filenames on OS X

Question

I have this in a file named .x:

$ cat .x
u="Böhmáí"
touch "$u"
ls > .list
echo "$u" >.text

cat .list .text
diff .list .text
od -bc .list
od -bc .text

When I run this script with sh -x .x (-x is only there to show the commands):

$ sh -x .x
+ u=Böhmáí
+ touch Böhmáí
+ ls
+ echo Böhmáí
+ cat .list .text
Böhmáí
Böhmáí
+ diff .list .text
1c1
< Böhmáí
---
> Böhmáí
+ od -bc .list
0000000   102 157 314 210 150 155 141 314 201 151 314 201 012
           B   o   ̈    **   h   m   a   ́    **   i   ́    **  \n
0000015
+ od -bc .text
0000000   102 303 266 150 155 303 241 303 255 012
           B   ö  **   h   m   á  **   í  **  \n
0000012

The same string Böhmáí is encoded as different bytes in the filename (13 bytes in .list) than in the file's contents (10 bytes in .text). In the terminal (UTF-8 encoded) the string looks the same in both variants.

Where is the rabbit?

Recommended answer

(This is mostly stolen from a previous answer of mine...)

Unicode allows some accented characters to be represented in several different ways: as a single code point representing the accented character (the precomposed form), or as a series of code points representing the unaccented version of the character followed by the accent(s) (the decomposed form). For example, "ä" could be represented either precomposed as U+00E4 (UTF-8 0xc3a4, Latin small letter a with diaeresis) or decomposed as U+0061 U+0308 (UTF-8 0x61cc88, Latin small letter a + combining diaeresis).
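To make the two representations concrete, here is a minimal sketch that builds both forms of "ä" from their raw bytes and compares them. The variable names (precomposed, decomposed, result) are only illustrative, and the bytes are written as octal escapes so the example does not depend on how an editor saves the file.

```shell
# Precomposed "ä": U+00E4, UTF-8 bytes c3 a4 (octal 303 244).
precomposed=$(printf '\303\244')
# Decomposed "ä": U+0061 U+0308, UTF-8 bytes 61 cc 88 (octal 141 314 210).
decomposed=$(printf 'a\314\210')

# The shell compares strings byte by byte, so the two forms are unequal
# even though a terminal renders them the same way.
if [ "$precomposed" = "$decomposed" ]; then
    result=same
else
    result=different
fi
echo "$result"

# Dump the bytes of each form in hex (-An drops the offset column).
printf '%s' "$precomposed" | od -An -tx1
printf '%s' "$decomposed" | od -An -tx1
```

The two od lines should show the byte sequences quoted above: two bytes for the precomposed form and three for the decomposed one.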

OS X's HFS+ filesystem requires that all filenames be stored in the UTF-8 representation of their fully decomposed form. In an HFS+ filename, "ä" MUST be encoded as 0x61cc88, and "ö" MUST be encoded as 0x6fcc88.

So what's happening here is that your shell script contains "Böhmáí" in precomposed form, so it gets stored that way in the variable u, and stored that way in the .text file. But when you create a file with that name (with touch), the filesystem converts it to the decomposed form for the actual filename. And when you ls it, it shows the form the filesystem has: the decomposed form. That is why .list holds 13 bytes while .text holds 10, and why diff reports a difference between two lines that look identical.
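If the goal is to compare names read from the filesystem with precomposed text, one option is to convert the filesystem's form back before comparing. The sketch below assumes the iconv shipped with OS X, which provides an encoding named UTF-8-MAC for the decomposed form; other iconv builds generally lack it, so the helper (to_precomposed, a hypothetical name) passes its input through unchanged in that case.

```shell
# Hypothetical helper: read decomposed (HFS+-style) UTF-8 on stdin and
# write precomposed UTF-8 on stdout.
# Assumption: the local iconv knows the UTF-8-MAC encoding (Apple's iconv
# does). If it does not, the input is copied through unchanged.
to_precomposed() {
    if iconv -l 2>/dev/null | grep -q 'UTF-8-MAC'; then
        iconv -f UTF-8-MAC -t UTF-8
    else
        echo "iconv lacks UTF-8-MAC; leaving input unchanged" >&2
        cat
    fi
}

# Possible use with the files from the question:
#   ls | to_precomposed > .list
#   diff .list .text
```

On a system where the conversion is available, the diff from the question should then report no difference for this filename.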
