Question about pathname encoding

Problem description

What have I done to get such a strange encoding in this path-name?
In my file manager (Dolphin) the path-name looks good.

#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Find;
use Devel::Peek;
use Encode qw(decode);

my $string;
find( sub { $string = $File::Find::name }, 'Delibes, Léo' );
$string =~ s|Delibes,\ ||;
$string =~ s|\..*\z||;
my ( $s1, $s2 ) = split m|/|, $string, 2;

say Dump $s1;
say Dump $s2;

# SV = PV(0x824b50) at 0x9346d8
#   REFCNT = 1
#   FLAGS = (PADMY,POK,pPOK,UTF8)
#   PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
#   CUR = 4
#   LEN = 16

# SV = PV(0x7a7150) at 0x934c30
#   REFCNT = 1
#   FLAGS = (PADMY,POK,pPOK,UTF8)
#   PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
#   CUR = 8
#   LEN = 16

say $s1;
say $s2;

# Léo
# Lakmé

$s1 = decode( 'utf-8', $s1 );
$s2 = decode( 'utf-8', $s2 );

say $s1;
say $s2;

# L�o
# Lakmé

Solution

Unfortunately, your operating system's pathname API is another "binary interface": you have to use Encode::encode and Encode::decode to get predictable results.

Most operating systems treat pathnames as a sequence of octets (i.e. bytes). Whether that sequence should be interpreted as Latin-1, UTF-8 or some other character encoding is an application decision. Consequently the value returned by readdir() is simply a sequence of octets, and File::Find doesn't know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied as a character string) with the value returned by your OS via readdir(). In that concatenation Perl treats each octet as if it were a Latin-1 character, and that's how you got code points mashed with octets.

Rule of thumb: whenever you pass a path name to the OS, Encode::encode() it to make sure it is a sequence of octets. Whenever you get a path name from the OS, Encode::decode() it into a character string, using the encoding in which the file names are stored.
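
The rule of thumb can be wrapped in two small helpers. This is only a minimal sketch: path_to_os and path_from_os are hypothetical names, not part of Encode or File::Find, and the code assumes that file names are stored as UTF-8, which is common on Linux but not guaranteed.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Assumption: file names on this system are stored as UTF-8 octets.
my $FS_ENCODING = 'UTF-8';

# Hypothetical helper: character string -> octets, for passing to the OS.
sub path_to_os {
    my ($chars) = @_;
    return encode($FS_ENCODING, $chars);
}

# Hypothetical helper: octets from the OS -> character string.
sub path_from_os {
    my ($octets) = @_;
    return decode($FS_ENCODING, $octets);
}

my $octets = path_to_os('Delibes, Léo');    # 13 octets: é becomes C3 A9
my $chars  = path_from_os($octets);         # back to 12 characters
```

Keeping the encoding name in one place makes it easier to change if the file names turn out to be stored in something other than UTF-8.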

You can make your program work by calling find this way:

find( sub { ... }, Encode::encode('utf8', 'Delibes, Léo') );

And then calling Encode::decode() when using the value of $File::Find::name:

my $path = Encode::decode('utf8', $File::Find::name);
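
Putting the two calls together might look like the sketch below. The function name find_decoded is hypothetical, and the code again assumes the filesystem stores names as UTF-8 octets.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);
use File::Find;

# Hypothetical helper: walk $dir (given as a character string) and return
# every path found beneath it as a character string.
# Assumption: the filesystem stores names as UTF-8 octets.
sub find_decoded {
    my ($dir) = @_;
    my @paths;
    find(
        # The start directory was passed as octets, so $File::Find::name
        # consists of octets only and a single decode is enough.
        sub { push @paths, decode('UTF-8', $File::Find::name) },
        encode('UTF-8', $dir),    # hand the OS octets, not code points
    );
    return @paths;
}
```

With an :encoding(UTF-8) layer on STDOUT, as in the question, printing the values returned by find_decoded('Delibes, Léo') should then show the names once-encoded rather than double-encoded.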

To be more clear, this is how $File::Find::name was formed:

use 5.014;    # enables say
use Encode;

# Force $dir into Perl's internal UTF-8 representation (UTF8 flag on):
# append a code point above 255, then chop it off again.

my $dir = 'L' . chr(233) . 'o' . chr(256);
chop $dir;

say "dir: ", d($dir); # length = 3

# This is what readdir() is returning:

my $leaf = encode('utf8', 'Lakem' . chr(233));

say "leaf: ", d($leaf); # length = 7

$File::Find::name = $dir . '/' . $leaf;

say "File::Find::name: ", d($File::Find::name);

sub d {
  join(' ', map { sprintf("%02X", ord($_)) } split('', $_[0]))
}
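
The same mechanism accounts for the output shown in the question. The following is a minimal sketch that uses literal strings in place of real directory entries; it illustrates why the second component is printed double-encoded and why calling decode on the first component yields a replacement character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $dir  = "L\x{e9}o";                       # 3 characters, as typed in the source
my $leaf = encode('UTF-8', "Lakm\x{e9}");    # 6 octets, as readdir() would return

# Joining a character string with octets treats each octet as a
# Latin-1 character, so C3 and A9 become U+00C3 and U+00A9.
my $name = "$dir/$leaf";
my ($s1, $s2) = split m|/|, $name, 2;

# Printing through an :encoding(UTF-8) layer encodes those two characters
# again, which gives the 8 octets seen in the Dump output.
my $printed = encode('UTF-8', $s2);          # 4C 61 6B 6D C3 83 C2 A9

# decode() expects octets. $s2 happens to hold valid UTF-8 octets, but
# "L\xE9o" is not valid UTF-8, so \xE9 is replaced by U+FFFD.
my $fixed  = decode('UTF-8', $s2);           # "Lakm\x{e9}"
my $broken = decode('UTF-8', $s1);           # "L\x{FFFD}o"
```

This is why decoding after the fact repairs only the part of the path that came from readdir(), and damages the part that was already a character string.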
