In what encoding does readdir return a filename?


Question

    Here's a Perl script that I expected to print found when executed:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use utf8;
    use Encode;
    
    use constant filename => 'Bärlauch';
    
    open (my $out, '>', filename) or die;
    close $out;
    
    opendir(my $dir, '.') or die;
    while (my $filename_read = readdir($dir)) {
      # $filename_read = encode('utf8', $filename_read);
      print "found\n" if $filename_read eq filename;
    }
    

    The script first creates a file with the name of the constant filename. (After running the script, I can verify the existence of the file with ls and the file is not created with "funny" characters.)

    Then the script iterates over the files in the current working directory and prints found if there is a file whose name is equal to the file just created. This should obviously be the case.

    However, it doesn't print found (Ubuntu, bash, LANG=en_US.UTF8).

    If I change the constant to Barlauch, it works as expected and prints found.

    Uncommenting $filename_read = encode('utf8', $filename_read); does not change the behavior.

    Is there an explanation for this, and what do I have to do in order to recognize a filename with umlauts in it?

    Solution

    The question rephrased (as I interpret it) is:

    Why doesn't readdir return the newly created filename? (Here, represented by the constant filename, which is set to Bärlauch.)

    (Note: filename is a Perl constant, declared with use constant, which is why it has no $ sigil in front.)

    Background:

    First note: due to the use utf8 statement at the beginning of your program, filename will be upgraded to a Unicode string at compile time, since it contains non-ASCII characters. From the documentation of the utf8 pragma:

    Enabling the utf8 pragma has the following effect: Bytes in the source text that are not in the ASCII character set will be treated as being part of a literal UTF-8 sequence. This includes most literals such as identifier names, string constants, and constant regular expression patterns.

    And also, according to perluniintro, section "Perl's Unicode Model":

    The general principle is that Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be avoided, the data is transparently upgraded to Unicode.

    ...

    Internally, Perl currently uses whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings.

    The non-ASCII character in filename is the letter ä. If you use ISO 8859-1 extended ASCII encoding (Latin-1), it is encoded as the byte value 0xE4, see this table at ascii-code.com. However, if you removed the ä character from filename, it would contain only ASCII characters, and therefore it would not be internally upgraded to Unicode, even if you used the utf8 pragma.

    So filename is now a Unicode string with the internal UTF-8 flag set (see the utf8 pragma for more information on the UTF-8 flag). Note that the letter ä is encoded in UTF-8 as the two bytes 0xC3 0xA4.
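
    The difference between characters and UTF-8 bytes can be checked directly. The following is a minimal sketch, not part of the original script; the variable names are made up for illustration, and the escape "\x{E4}" stands in for a literal ä so that the result does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# "\x{E4}" is the character U+00E4 (a with diaeresis).
my $name = "B\x{E4}rlauch";

# Encoding turns the one character U+00E4 into the two bytes 0xC3 0xA4.
my $bytes = encode_utf8($name);

printf "characters: %d\n", length($name);    # 8
printf "bytes:      %d\n", length($bytes);   # 9
printf "hex:        %s\n",
    join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
```

    The hex line shows C3 A4 in the second and third position, which is the byte sequence that ends up on disk.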

    Writing the file:

    When writing the file, what happens with the filename? If filename is a Unicode string, it will be encoded as UTF-8. However, note that it is not necessary to encode filename first (encode_utf8( filename )). See Creating filenames with unicode characters for more information. So the filename is written to disk as UTF-8 encoded bytes.

    Reading the filename back:

    When reading the filename back from disk, readdir does not return Unicode strings (strings with the UTF-8 flag set), even if the filename contains bytes encoded in UTF-8. It returns binary or byte strings; see perlunitut for a discussion of byte strings vs. character (Unicode) strings.

    Why doesn't readdir return Unicode strings? First, according to perlunicode, section "When Unicode Does Not Happen":
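
    This can be observed with a small experiment. The sketch below is an illustration under assumptions: it uses File::Temp to work in a throwaway directory (not part of the original script), encodes the name explicitly, and assumes a filesystem that stores the bytes unchanged, as is typical on Linux.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use File::Temp qw(tempdir);

# Throwaway directory, removed automatically when the program exits.
my $tmp = tempdir(CLEANUP => 1);

my $name = "B\x{E4}rlauch";                 # character string
my $path = "$tmp/" . encode_utf8($name);    # explicit UTF-8 bytes for the OS

open(my $out, '>', $path) or die "Cannot create $path: $!";
close $out;

opendir(my $dir, $tmp) or die "Cannot open $tmp: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dir;
closedir $dir;

# Report what readdir handed back: raw bytes, UTF-8 flag off.
for my $entry (@entries) {
    printf "%d bytes, UTF-8 flag %s\n",
        length($entry), utf8::is_utf8($entry) ? 'on' : 'off';
}
```

    On such a filesystem the single entry is reported as 9 bytes with the flag off, that is, the raw UTF-8 encoding of the name rather than an 8-character string.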

    There are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both in Perl, but it is not. (...)

    The following are such interfaces. For all of these interfaces Perl currently (as of v5.16.0) simply assumes byte strings both as arguments and results. (...)

    One reason that Perl does not attempt to resolve the role of Unicode in these situations is that the answers are highly dependent on the operating system and the file system(s). For example, whether filenames can be in Unicode and in exactly what kind of encoding, is not exactly a portable concept. (...)

    • chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime, -X
    • %ENV
    • glob (aka the <*>)
    • open, opendir, sysopen
    • qx (aka the backtick operator), system
    • readdir, readlink

    So readdir returns byte strings, since it is in general impossible to know the encoding of a file name a priori.

    String comparison:

    Now, finally you try to compare the read filename $filename_read with the variable filename:

    print "found\n" if $filename_read eq filename;
    

    In this case $filename_read holds the UTF-8 encoded bytes of the name and does not have the UTF-8 flag set (it is not what Perl internally recognizes as a "Unicode string"), whereas filename is a decoded character string.

    The interesting thing now is that the result of the eq operator will depend upon whether the bytes in $filename_read are pure ASCII or not. According to the documentation of the Encode module:

    Before the introduction of Unicode support in Perl, The eq operator just compared the strings represented by two scalars. Beginning with Perl 5.8, eq compares two strings with simultaneous consideration of the UTF8 flag.

    ...

    When you decode, the resulting UTF8 flag is on--unless you can unambiguously represent data.

    So in your case, eq will consider the UTF-8 flag since $filename_read does not contain pure ASCII, and as a result it will consider the two strings not equal: the undecoded bytes 0xC3 0xA4 are compared as two separate characters against the single character ä. If $filename_read and filename were identical and contained only pure ASCII bytes (with filename still having the UTF-8 flag set and $filename_read not), then eq would consider the two strings equal. See the discussion in the documentation for Encode for more information regarding the background for this behavior.
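
    Both cases can be reproduced without touching the filesystem. In this sketch (variable names are illustrative only), encode_utf8 stands in for the byte string that readdir would return:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $chars = "B\x{E4}rlauch";        # 8 characters, like the constant under "use utf8"
my $bytes = encode_utf8($chars);    # 9 bytes, like the value returned by readdir

# eq compares character by character. The undecoded bytes 0xC3 0xA4 count as
# two characters (U+00C3, U+00A4), so they cannot match the single U+00E4.
my $non_ascii_equal = $bytes eq $chars ? 1 : 0;

# For a pure ASCII name, bytes and characters coincide, so eq succeeds.
my $ascii_equal = encode_utf8("Barlauch") eq "Barlauch" ? 1 : 0;

print "non-ASCII: $non_ascii_equal, ASCII: $ascii_equal\n";
```

    This prints "non-ASCII: 0, ASCII: 1", which matches the behavior described in the question.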

    Conclusion:

    So if you are relatively confident that all your filenames are UTF-8 encoded, you could solve the issue in your question by decoding the byte string returned from readdir into a Unicode string (forcing the UTF-8 flag to be set):

    $filename_read = Encode::decode_utf8( $filename_read );
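
    Put together, a corrected version of the loop could look like the following sketch. It deviates from the original script in a few labeled ways: it works in a temporary directory via File::Temp, encodes the name explicitly when creating the file, and records the result in a hypothetical $found variable. It also assumes a filesystem that returns the name byte for byte, as Linux filesystems typically do.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use File::Temp qw(tempdir);

my $wanted = "B\x{E4}rlauch";        # the expected name, as characters
my $tmp    = tempdir(CLEANUP => 1);  # throwaway directory for the demo

open(my $out, '>', "$tmp/" . encode_utf8($wanted))
    or die "Cannot create file: $!";
close $out;

my $found = 0;
opendir(my $dir, $tmp) or die "Cannot open $tmp: $!";
while (defined(my $entry = readdir $dir)) {
    my $decoded = decode_utf8($entry);   # bytes -> characters
    $found = 1 if $decoded eq $wanted;
}
closedir $dir;

print $found ? "found\n" : "not found\n";
```

    The defined() test in the loop condition guards against a file literally named 0, which would otherwise end the loop early in older Perl versions.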
    

    More details

    Note: since Unicode allows multiple representations of the same character, there exist two forms of the ä in Bärlauch:

    • U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) is the NFC (Normalization Form C, canonical composition) form,
    • U+0061 U+0308 (LATIN SMALL LETTER A followed by COMBINING DIAERESIS) is the NFD (Normalization Form D, canonical decomposition) form.

    On my platform (Linux), UTF-8 encoded filenames are typically stored in NFC form, but on Mac OS they use NFD form. See Encode::UTF8Mac for more information. This means that if you work on a Linux machine and, for example, clone a Git repository that was created by a Mac user, you can easily end up with NFD encoded filenames on your Linux machine. The Linux filesystem does not care what encoding a filename is in; it just treats it as a sequence of bytes. Hence, I could easily write a script that creates an ISO-Latin-1 encoded filename, even though my locale is "en_US.UTF-8". The current locale settings are just guidelines for applications; if an application ignores them, nothing stops it from doing so.

    So if you are unsure whether filenames returned from readdir use NFC or NFD, you should always decompose them after you have decoded them:
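
    Since a directory may therefore contain names in more than one encoding, a defensive variant is to decode strictly and fall back when the bytes are not valid UTF-8. The helper decode_filename below is hypothetical and only a sketch; the choice of ISO-8859-1 as the fallback is an assumption that will not suit every system.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: decode as UTF-8 when the bytes are valid UTF-8,
# otherwise treat them as ISO-8859-1 (an assumed fallback encoding).
sub decode_filename {
    my ($bytes) = @_;
    my $copy  = $bytes;   # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    return defined $chars ? $chars : decode('ISO-8859-1', $bytes);
}

my $from_utf8   = decode_filename("B\xC3\xA4rlauch");   # UTF-8 encoded name
my $from_latin1 = decode_filename("B\xE4rlauch");       # Latin-1 encoded name

print "same name\n" if $from_utf8 eq $from_latin1;
```

    Both calls yield the same 8-character string, so a later eq comparison no longer depends on which encoding was used when the file was created.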

    use Unicode::Normalize;
    print "found\n" if NFD( $filename_read ) eq NFD( filename );
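
    The effect of normalization can be seen in isolation. This sketch builds both forms by hand with escapes (the variable names are illustrative) and compares them before and after NFD:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $composed   = "B\x{E4}rlauch";      # NFC: a single U+00E4
my $decomposed = "Ba\x{308}rlauch";    # NFD: "a" followed by U+0308

# The raw strings differ in length and content; after NFD they are identical.
my $raw_equal        = $composed eq $decomposed           ? 1 : 0;
my $normalized_equal = NFD($composed) eq NFD($decomposed) ? 1 : 0;

printf "raw: %d, normalized: %d, lengths: %d vs %d\n",
    $raw_equal, $normalized_equal, length($composed), length($decomposed);
```

    This prints "raw: 0, normalized: 1, lengths: 8 vs 9". Normalizing both sides with NFC instead would work equally well, as long as the same form is applied to both strings.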
    

    See also Perl Unicode Cookbook section "Always Decompose and Recompose".

