Different UTF-8 signature for same diacritics (umlauts) - 2 binary ways to write umlauts


Question

I have a fairly big problem that I couldn't find any help for on the web:

I moved a page of a website from OS X to Linux (both systems run with de_DE.UTF-8) and ran into a rather unusual problem: some files were no longer found, although they obviously existed on the hard drive with (visibly) the same name. All of those files contained German umlauts.

I took one sample image, copied the original request URI from the web page and called it directly: same error. After retyping the file name it worked. And yes, I did not mistype it!

This surprised me, so I took a look at the Apache log, where I found these entries:

192.168.56.10 - - [27/Aug/2012:20:03:21 +0200] "GET /images/Sch%C3%B6ne-Lau-150x150.jpg HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
192.168.56.10 - - [27/Aug/2012:20:03:57 +0200] "GET /images/Scho%CC%88ne-Lau-150x150.jpg HTTP/1.1" 404 4205 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"

That was something to investigate ... Here's what I found in the UTF-8 chartable at http://www.utf8-chartable.de/:

ö   c3 b6   LATIN SMALL LETTER O WITH DIAERESIS
¨   cc 88   COMBINING DIAERESIS

I think you've already heard of dead keys: http://en.wikipedia.org/wiki/Dead_key If not, read the article. It's quite interesting ;)

Does that mean that OS X saves all diacritics separately from the letter? Does it really mean that OS X saves the character ö as o followed by ¨ instead of using the single precomposed character that results from the combination?
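The two request paths from the Apache log can be compared directly. The following is a minimal Python sketch, not part of the original post: it decodes both percent-encoded path fragments and shows that they differ as byte sequences but become equal once both are brought into the same Unicode normalisation form (NFC is the precomposed form, NFD the decomposed one).

```python
import unicodedata
from urllib.parse import unquote

# Path fragments taken from the two log entries above.
composed = unquote("Sch%C3%B6ne")     # request answered with 304
decomposed = unquote("Scho%CC%88ne")  # request answered with 404

# Both strings render as "Schöne", but they are different code point sequences.
print(composed == decomposed)                 # False
print(len(composed), len(decomposed))         # 6 7
print(composed.encode("utf-8").hex(" "))      # ... c3 b6 ...
print(decomposed.encode("utf-8").hex(" "))    # ... 6f cc 88 ...

# Normalising to a common form makes them comparable.
as_nfc = unicodedata.normalize("NFC", decomposed)
as_nfd = unicodedata.normalize("NFD", composed)
print(as_nfc == composed)    # True
print(as_nfd == decomposed)  # True
```

A Linux filesystem compares filenames byte by byte, so a file stored under one form is not found when it is requested under the other.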

If so, do you know of a good script that I could use to rename these files? This won't be the first page I move from OS X to Linux ...

Recommended answer

Thanks to Jon Hanna for the extensive background information here! It was important for getting to the full answer: a way to convert from one normalisation form to the other.

Since my changes are in the filesystem (because of file uploads), which is referenced from the database, I now have to update my database dump. The files were already renamed during the move (maybe by the FTP client ...).

Command-line tools for converting charsets on Linux are:

  • iconv: converts the content of a stream (for example a file)
  • convmv: converts filenames in a directory

The charset utf-8-mac (described in http://loopkid.net/articles/2011/03/19/groking-hfs-character-encoding), which I could use with iconv, seems to exist only on OS X systems, so I have to move my SQL dump to my Mac, convert it there and move it back. Another option would be to rename the files back to NFD using convmv, but I think that would hinder more than help in the future.
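As an alternative that does not depend on the utf-8-mac charset, the dump could also be normalised on the Linux side. This is only a sketch under assumptions: the function name and file names are hypothetical, the dump is assumed to be plain UTF-8 text, and the function normalises every string in the file to NFC, not just the filenames, so it is worth diffing the result before importing it.

```python
import unicodedata


def normalize_dump_to_nfc(src_path, dst_path):
    """Copy a UTF-8 text file, converting its content to NFC line by line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(unicodedata.normalize("NFC", line))


# Hypothetical usage:
# normalize_dump_to_nfc("dump.sql", "dump.nfc.sql")
```

Text that is already in NFC passes through unchanged, so running it on an already converted dump is harmless.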

The tool convmv has a built-in (OS-independent) option to enforce NFC- or NFD-compatible filenames: http://www.j3e.de/linux/convmv/man/
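Where convmv is not available, the same idea can be sketched in a few lines of Python. This is not how convmv is implemented; the function name and the dry_run parameter are hypothetical, and the sketch is meant to run on the Linux side, where filenames are compared byte by byte.

```python
import os
import unicodedata


def rename_to_nfc(root, dry_run=True):
    """Rename files and directories under root whose names are not in NFC.

    Returns the list of (old_path, new_path) pairs that were, or with
    dry_run=True would be, renamed.
    """
    planned = []
    # Walk bottom-up so that children are handled before their parent
    # directory is renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            nfc_name = unicodedata.normalize("NFC", name)
            if nfc_name == name:
                continue  # already in NFC
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, nfc_name)
            if os.path.exists(new_path):
                continue  # never overwrite an existing file
            planned.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
    return planned
```

Calling it first with the default dry_run=True lists what would change, and a second call with dry_run=False performs the renames.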

PHP itself (the language my system, WordPress, is based on) supports a compatibility layer here; see the question "In PHP, how do I deal with the difference in encoded filenames on HFS+ vs. elsewhere?" Once I have fixed this issue for myself, I will write some tests and may also file a bug report with WordPress and the other systems I work with ;)
