UTF8 Filenames in PHP and Different Unicode Encodings


Question

I have a file whose name contains Unicode characters on a server running Linux. If I SSH into the server and use tab-completion to navigate to the file or folder, I have no problem accessing it. The problem arises when I try to access the file from PHP (the filesystem function I was calling was stat). If I output the path generated by the PHP script to the browser and paste it into the terminal, the file also seems not to exist, even though the two paths look exactly the same in the terminal.

I set PHP to use UTF-8 as its default encoding in php.ini and also set mb_internal_encoding. I checked the encoding of the PHP file path string and it comes out as UTF-8, as it should. Poking around a bit more, I decided to hexdump the é character produced by the terminal's tab-completion and compare it to the hexdump of the 'regular' é character created by the PHP script or typed manually on the keyboard (Option+E, then E, on OS X). Here is the result:


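# é as produced by tab-completion: bytes 65 cc 81 (e + combining acute accent)
echo -n é | hexdump
0000000 cc65 0081
0000003
# é as produced by PHP or typed on the keyboard: bytes c3 a9 (single code point)
echo -n é | hexdump
0000000 a9c3
0000002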

The é character that allows a correct file reference in the terminal is the 3-byte one. I'm not sure where to go from here. What encoding should I use in PHP? Should I be converting the path to another encoding via iconv or mb_convert_encoding?
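The two byte sequences in the hexdump can be reproduced with a short Python sketch. It is only an illustration, and the variable names are made up for it. Note that hexdump prints 16-bit little-endian words by default, so the bytes 65 cc 81 show up as "cc65 0081" and c3 a9 shows up as "a9c3".

```python
import unicodedata

# The same visible character, written as two different code point sequences.
composed = "\u00e9"      # é as one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Both strings are valid UTF-8, but they encode to different bytes.
print(composed.encode("utf-8").hex())    # c3a9   (2 bytes)
print(decomposed.encode("utf-8").hex())  # 65cc81 (3 bytes)

# A plain comparison treats them as different strings...
print(composed == decomposed)            # False

# ...until both are brought to the same normalization form.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both strings are already UTF-8, which is why converting between encodings with iconv or mb_convert_encoding would not change anything here. The difference lies in the Unicode normalization form, not in the encoding.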

Recommended answer

Thanks to the tips given in the two answers, I was able to poke around and find some methods for normalizing the different Unicode decompositions of a given character. In my case, I was accessing files created by an OS X Carbon application. It is a fairly popular application, and its file names seemed to adhere to a specific Unicode decomposition.

PHP 5.3 introduced a new set of functions (the Normalizer class in the intl extension) that lets you normalize a Unicode string to a particular form. There are four normalization forms you can convert a Unicode string to: NFC, NFD, NFKC, and NFKD. Python has had Unicode normalization capabilities since version 2.3 via unicodedata.normalize. This article on Python's handling of Unicode strings was helpful in understanding encoding and string handling a bit better.
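As a rough illustration of how the four normalization forms differ, the sketch below runs one sample string through each of them and prints the resulting code points. The sample string is an arbitrary choice for this example. It combines a ligature (which only the compatibility forms NFKC and NFKD break apart) with a composed é (which NFD and NFKD split into a base letter plus a combining accent).

```python
import unicodedata

# Arbitrary sample: the "fi" ligature (U+FB01) followed by a composed é (U+00E9).
sample = "\ufb01\u00e9"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, sample)
    # Show the code points so the differences are visible even when
    # the rendered text looks identical.
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in normalized)
    print(f"{form:5} {code_points}")
```

For filenames, the canonical forms NFC and NFD are usually the relevant pair, since the compatibility forms can change which characters the name contains.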

Here is a quick example of normalizing a Unicode file path in Python:

import unicodedata
filePath = unicodedata.normalize('NFD', filePath)

I found that the NFD form worked for all my purposes. I wonder whether this is the standard decomposition for Unicode filenames.
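If it is not known in advance which form a filename was stored in, one option is to try the path as given and then its NFC and NFD variants. The sketch below assumes a filesystem that compares names byte for byte, as is typical on Linux. The helper name find_existing_path is hypothetical and not part of any library.

```python
import os
import unicodedata


def find_existing_path(path, forms=("NFC", "NFD")):
    """Return the first variant of `path` that exists on disk, or None.

    Hypothetical helper: tries the path unchanged, then each requested
    normalization form applied to the whole path string.
    """
    candidates = [path] + [unicodedata.normalize(form, path) for form in forms]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None
```

Calling find_existing_path on a path built with a composed é would return the decomposed variant if that is how the file was actually stored. Because the whole path is normalized at once, this sketch does not cover directories and files that use different forms within the same path.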
