Question about pathname encoding
Problem description
What have I done to get such a strange encoding in this path-name?
In my file manager (Dolphin) the path-name looks good.
#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Find;
use Devel::Peek;
use Encode qw(decode);
my $string;
find( sub { $string = $File::Find::name }, 'Delibes, Léo' );
$string =~ s|Delibes,\ ||;
$string =~ s|\..*\z||;
my ( $s1, $s2 ) = split m|/|, $string, 2;
say Dump $s1;
say Dump $s2;
# SV = PV(0x824b50) at 0x9346d8
# REFCNT = 1
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
# CUR = 4
# LEN = 16
# SV = PV(0x7a7150) at 0x934c30
# REFCNT = 1
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
# CUR = 8
# LEN = 16
say $s1;
say $s2;
# Léo
# Lakmé
$s1 = decode( 'utf-8', $s1 );
$s2 = decode( 'utf-8', $s2 );
say $s1;
say $s2;
# L�o
# Lakmé
Unfortunately, your operating system's pathname API is another "binary interface" where you will have to use Encode::encode and Encode::decode to get predictable results.
Most operating systems treat pathnames as a sequence of octets (i.e. bytes). Whether that sequence should be interpreted as Latin-1, UTF-8 or some other character encoding is an application decision. Consequently, the value returned by readdir() is simply a sequence of octets, and File::Find doesn't know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied) with the value returned by your OS via readdir(), and that's how you got code points mashed with octets.
Rule of thumb: whenever you pass a path name to the OS, Encode::encode() it to make sure it is a sequence of octets. When you get a path name from the OS, Encode::decode() it into the character string that your application wants to work with.
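The rule of thumb can be sketched with plain opendir/readdir. This is a minimal illustration, not part of the original answer: the scratch directory from File::Temp and the variable names are hypothetical, and it assumes the filesystem stores names as UTF-8 octets exactly as given (typical on Linux; some filesystems normalize names).

```perl
use strict;
use warnings;
use utf8;                          # string literals in this file are characters
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Hypothetical scratch directory so the example does not touch real data.
my $root = tempdir( CLEANUP => 1 );

# Characters -> octets before the name goes to the OS.
my $name_chars  = 'Lakmé';
my $name_octets = encode( 'UTF-8', $name_chars );
mkdir "$root/$name_octets" or die "mkdir failed: $!";

# Octets -> characters after the name comes back from the OS.
opendir my $dh, $root or die "opendir failed: $!";
my @entries = map  { decode( 'UTF-8', $_ ) }
              grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

# Encode once more on the way out, this time for the terminal.
binmode STDOUT, ':encoding(UTF-8)';
printf "%s (%d characters)\n", $_, length $_ for @entries;
```

Inside the program every name is a character string; the encoded form exists only at the boundary with the OS.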
You can make your program work by calling find this way:
find( sub { ... }, Encode::encode('utf8', 'Delibes, Léo') );
And then calling Encode::decode() when using the value of $File::Find::name:
my $path = Encode::decode('utf8', $File::Find::name);
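Putting the two calls together, a self-contained version might look like the sketch below. The temporary directory tree is a hypothetical stand-in for the real music folder, and the example assumes names are stored on disk as UTF-8.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical scratch tree: <tmp>/Delibes, Léo/Lakmé
my $root = tempdir( CLEANUP => 1 );
my $top  = $root . '/' . encode( 'UTF-8', 'Delibes, Léo' );
my $sub  = $top  . '/' . encode( 'UTF-8', 'Lakmé' );
mkdir $top or die "mkdir failed: $!";
mkdir $sub or die "mkdir failed: $!";

my @paths;
find(
    sub {
        # $top was all octets, so $File::Find::name is all octets too
        # and can be decoded in one step.
        push @paths, decode( 'UTF-8', $File::Find::name );
    },
    $top,
);

# Strip the scratch prefix for display; decode it first so that
# characters are compared with characters.
my $root_chars = decode( 'UTF-8', $root );
binmode STDOUT, ':encoding(UTF-8)';
for my $path ( sort @paths ) {
    ( my $shown = $path ) =~ s/^\Q$root_chars\E\///;
    print "$shown\n";
}
```

Because the starting directory is encoded before find sees it, the path never mixes code points with octets, and a single decode at the point of use is enough.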
To be more clear, this is how $File::Find::name was formed:
use feature 'say';
use Encode;
# Appending chr(256) forces Perl to store $dir internally as UTF-8;
# chop then removes that extra character again.
my $dir = 'L' . chr(233) . 'o' . chr(256);
chop $dir;
say "dir: ", d($dir); # length = 3
# This is what readdir() is returning:
my $leaf = encode('utf8', 'Lakm' . chr(233));
say "leaf: ", d($leaf); # length = 6
$File::Find::name = $dir . '/' . $leaf;
say "File::Find::name: ", d($File::Find::name);
sub d {
join(' ', map { sprintf("%02X", ord($_)) } split('', $_[0]))
}
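The same construction also accounts for the two different results of decode() in the question. The sketch below is an illustration under the same assumptions as the demo above; hex_dump is a hypothetical helper that prints code points instead of relying on terminal output.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $dir  = 'L' . chr(233) . 'o';                  # 3 characters (code points)
my $leaf = encode( 'UTF-8', 'Lakm' . chr(233) );  # 6 octets, as readdir() would return

# What File::Find effectively does: code points joined with octets.
my $mixed = "$dir/$leaf";
my ( $s1, $s2 ) = split m|/|, $mixed, 2;

# decode() treats its input as octets.
my $d1 = decode( 'UTF-8', $s1 );  # a lone 0xE9 is not valid UTF-8
my $d2 = decode( 'UTF-8', $s2 );  # 0xC3 0xA9 is valid and becomes U+00E9

# Hypothetical helper: show each character as a hex code point.
sub hex_dump { join ' ', map { sprintf '%04X', ord } split //, $_[0] }

print "s1 decoded: ", hex_dump($d1), "\n";
print "s2 decoded: ", hex_dump($d2), "\n";
```

The first part was already characters, so decoding it again damages it (the replacement character seen in the question's output), while the second part was still octets and decodes cleanly.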