glob() can't find file names with multibyte characters on Windows?

Problem description

    I'm writing a file manager and need to scan directories and deal with renaming files that may have multibyte characters. I'm working on it locally on Windows/Apache PHP 5.3.8, with the following file names in a directory:

    • filename.jpg
    • имяфайла.jpg
    • file件name.jpg
    • פילענאַמע.jpg
    • 文件名.jpg

    Testing on a live UNIX server worked fine. Testing locally on Windows using glob('./path/*') returns only the first one, filename.jpg.

    Using scandir(), the correct number of files is returned at least, but I get names like ?????????.jpg (note: those are regular question marks, not the � character).

    I'll end up needing to write a "search" feature to search recursively through the entire tree for filenames matching a pattern or with a certain file extension, and I assumed glob() would be the right tool for that, rather than scan all the files and do the pattern matching and array building in the application code. I'm open to alternate suggestions if need be.
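
    For the search feature described above, one alternative to glob() is to walk the tree with SPL iterators and match each base name with fnmatch(). The following is only a minimal sketch: findFiles() is a hypothetical helper name, not something from the original post, and on the PHP 5.3 Windows setup described here it would likely run into the same file name encoding limitation as scandir(), since it reads names through the same filesystem layer.

```php
<?php
// Hypothetical helper (not from the original post): walk a directory tree
// and collect the paths of files whose base name matches a shell-style pattern.
function findFiles($root, $pattern)
{
    $matches = array();

    // SKIP_DOTS leaves out the "." and ".." entries of every directory.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        // $file is an SplFileInfo; only the base name is tested against the pattern.
        if ($file->isFile() && fnmatch($pattern, $file->getFilename())) {
            $matches[] = $file->getPathname();
        }
    }

    // Iteration order depends on the filesystem, so sort for stable results.
    sort($matches);

    return $matches;
}

// Example call (assumes an ./uploads directory exists):
// print_r(findFiles('./uploads', '*.jpg'));
```

    Matching on the base name keeps the pattern independent of directory depth, which a single glob() pattern cannot express.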

    Assuming this was a common problem, I immediately searched Google and Stack Overflow and found nothing even related. Is this a Windows issue? PHP shortcoming? What's the solution: is there anything I can do?

    Addendum: Not sure how related this is, but file_exists() is also returning FALSE for these files when I pass in the full absolute path (the PHP file itself is saved with Notepad++ as UTF-8 without BOM). I'm certain the path is correct, as neighboring files without multibyte characters return TRUE.

    EDIT: glob() can find a file named filename-äöü.jpg. Previously in my .htaccess file, I had AddDefaultCharset utf-8, which I didn't consider before. filename-äöü.jpg was printing as filename-���.jpg. The only effect removing that htaccess line seemed to have was now that file name prints normally.

    I've deleted the .htaccess file completely, and this is my actual test script in its entirety (I changed a couple of file names from the original post):

    print_r(scandir('./uploads/')); 
    print_r(glob('./uploads/*'));
    

    Output locally on Windows:

    Array
    (
        [0] => .
        [1] => ..
        [2] => ??? ?????.jpg
        [3] => ???.jpg
        [4] => ?????????.jpg
        [5] => filename-äöü.jpg
        [6] => filename.jpg
        [7] => test?test.jpg
    )
    Array
    (
        [0] => ./uploads/filename-äöü.jpg
        [1] => ./uploads/filename.jpg
    )
    

    Output on remote UNIX server:

    Array
    (
        [0] => .
        [1] => ..
        [2] => filename-äöü.jpg
        [3] => filename.jpg
        [4] => test이test.jpg
        [5] => имя файла.jpg
        [6] => פילענאַמע.jpg
        [7] => 文件名.jpg
    )
    Array
    (
        [0] => ./uploads/filename-äöü.jpg
        [1] => ./uploads/filename.jpg
        [2] => ./uploads/test이test.jpg
        [3] => ./uploads/имя файла.jpg
        [4] => ./uploads/פילענאַמע.jpg
        [5] => ./uploads/文件名.jpg
    )
    

    Since this is a different server, the configuration could differ regardless of platform, so I'm not sure what to think, and I can't fully pin it on Windows yet (it could be my PHP installation, ini settings, or Apache config). Any ideas?

    Solution

    It looks like the glob() function depends on how your copy of PHP was built and whether it was compiled with a Unicode-aware Win32 API (I don't believe the standard build is).

    Cf. http://www.rooftopsolutions.nl/blog/filesystem-encoding-and-php

    Excerpt from comments on the article:

    Philippe Verdy 2010-09-26 8:53 am

    The output from your PHP installation on Windows is easy to explain: you installed the wrong version of PHP, and used a version not compiled to use the Unicode version of the Win32 API. For this reason, the filesystem calls used by PHP will use the legacy "ANSI" API, and so the C/C++ libraries linked with this version of PHP will first try to convert your UTF-8-encoded PHP string into the local "ANSI" codepage selected in the running environment (see the CHCP command before starting PHP from a command line window).

    Your version of Windows is MOST PROBABLY NOT responsible for this weird thing. Actually, this is YOUR version of PHP which is not compiled correctly, and that uses the legacy ANSI version of the Win32 API (for compatibility with the legacy 16-bit versions of Windows 95/98 whose filesystem support in the kernel actually had no direct support for Unicode, but used an internal conversion layer to convert Unicode to the local ANSI codepage before using the actual ANSI version of the API).

    Recompile PHP using the compiler option to use the UNICODE version of the Win32 API (which should be the default today, and anyway always the default for PHP installed on a server that will NEVER be Windows 95 or Windows 98...)

    Then Windows will be able to store UTF-16 encoded filenames (including on FAT32 volumes, even if, on these volumes, it will also generate an aliased short name in 8.3 format using the filesystem's default codepage, something that can be avoided in NTFS volumes).

    Everything you describe is a problem of PHP (incorrect porting to Windows, or incorrect system version identification at runtime): reread the README files coming with the PHP sources explaining the compilation flags. I really think that the makefile on Windows should be able to configure and autodetect if it really needs to use ONLY the ANSI version of the API. If you are compiling it for a server, make sure that the Configure script will effectively detect the full support of the UNICODE version of the Win32 API and will use it when compiling PHP and when selecting the runtime libraries to link.

    I use PHP on Windows, correctly compiled, and I absolutely DON'T know the problems you cite in your article.

    Let's forget now forever these non-UNICODE versions of the Win32 API (which are using inconsistently the local ANSI codepage for the Windows graphical UI, and the OEM codepage for the filesystem APIs, the DOS/BIOS-compatible APIs, the Console APIs): these non-Unicode versions of the APIs are even MUCH slower and more costly than the Unicode versions of the APIs, because they are actually translating the codepage to Unicode before using the core Unicode APIs (the situation on Windows NT-based kernels is exactly the reverse from the situation on versions of Windows based on a virtual DOS extender, such as Windows 95/98/ME).

    When you don't use the native version of the API, your API call will pass through a thunking layer that will transcode the strings between Unicode and one of the legacy ANSI or CHCP-selected OEM codepages, or the OEM codepage hinted on the filesystem: this requires additional temporary memory allocation within the non-native version of the Win32 API. This takes additional time to convert things before doing the actual work by calling the native API.

    In summary: the PHP binary you install on Windows MUST be different depending on if you compiled it for Windows 95/98/SE (or the old Win16s emulation layer for Windows 3.x, which had a very minimal support of UTF-8, only to support the subsets of Unicode used by the ANSI and OEM codepages selected when starting Windows from a DOS extender) or if it was compiled for any other version of Windows based on the NT kernel.

    The best proof that this is a problem of PHP and not Windows is that your weird results will NOT occur in other languages like C#, Javascript, VB, Perl, Ruby... PHP has a very bad history in tracking versions (and too many historical source code quirks and wrong assumptions that should be disabled today, and an inconsistent library that has inherited all those quirks initially made in old versions of PHP for old versions of Windows that are no longer officially supported, by Microsoft or even by PHP itself!).

    In other words: RTM! Or download and install a binary version of PHP for Windows precompiled with the correct settings: I really think that PHP should distribute Windows binaries already compiled by default for the Unicode version of the Win32 API, and using the Unicode version of the C/C++ libraries: internally the PHP code will convert its UTF-8 strings to UTF-16 before calling the Win32 API, and back from UTF-16 to UTF-8 when retrieving Win32 results, instead of converting PHP's internal UTF-8 strings back/to the local OEM codepage (for the filesystem calls) or the local ANSI codepage (for all other Win32 APIs, including the registry or process).
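
    The code page conversion described in the excerpt can be illustrated with a small round-trip check. This is only a sketch under stated assumptions: survivesCodepage() is a hypothetical helper, the mbstring extension is assumed to be loaded, and Windows-1252 is assumed as the local "ANSI" code page (the post does not say which one was active). It does not call the Win32 API; it only shows which UTF-8 names can be represented in a single-byte code page at all.

```php
<?php
// Assumption: the local "ANSI" code page is Windows-1252 (not stated in the post).
// Hypothetical helper: does a UTF-8 file name survive conversion to a legacy
// code page and back? Characters with no mapping are replaced by mbstring's
// substitute character, so the round trip no longer equals the input.
function survivesCodepage($utf8Name, $codepage = 'Windows-1252')
{
    $legacy = mb_convert_encoding($utf8Name, $codepage, 'UTF-8');
    $roundTrip = mb_convert_encoding($legacy, 'UTF-8', $codepage);

    return $roundTrip === $utf8Name;
}

$names = array('filename.jpg', 'filename-äöü.jpg', 'имяфайла.jpg', '文件名.jpg');
foreach ($names as $name) {
    echo $name, ' => ', survivesCodepage($name) ? 'representable' : 'lossy', PHP_EOL;
}
```

    Under that assumption, the result is consistent with the Windows output in the question: filename-äöü.jpg fits in the code page and is found by glob(), while the Cyrillic and Chinese names do not.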
