Perl: utf8::decode vs. Encode::decode


Problem description

I am having some interesting results trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in an error "Cannot decode string with wide characters at..." whereas the latter method will happily run as many times as you want, simply returning false.
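The difference in failure behaviour can be reproduced with a minimal sketch. It assumes a string that has already been decoded and contains a wide character (U+8ACB is used purely as an illustration). The exact wording of the Encode error may vary between Encode versions, so the sketch only checks that an exception was raised.

```perl
use strict;
use warnings;
use Encode ();

# Already-decoded text: one wide character plus a newline (illustrative value).
my $wide = "\x{8acb}\n";

# utf8::decode works in place and reports failure through its return value.
my $copy = $wide;
my $ok   = utf8::decode($copy);

# Encode::decode raises an exception for the same input, so trap it with eval.
my $result = eval { Encode::decode("utf8", $wide) };
my $died   = defined $result ? 0 : 1;

print "utf8::decode succeeded: ", ($ok ? "yes" : "no"), "\n";
print "string left unchanged:  ", ($copy eq $wide ? "yes" : "no"), "\n";
print "Encode::decode died:    ", ($died ? "yes" : "no"), "\n";
```

Wrapping the call in eval keeps the script running, which makes it easier to compare the two functions side by side.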

What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" UTF-8 text from an outside file. To demonstrate this issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+008b, U+000a. These Unicode characters are the double-encoding of the Unicode character U+8acb, along with a newline character. The file was encoded to disk in UTF-8. I then ran the following Perl script:

#!/usr/bin/perl
use strict;
use warnings;
require "Encode.pm";
require "utf8.pm";

# Parentheses are needed so that "or die" applies to open, not to the filename.
open(FILE, "<", "test.txt") or die $!;
my @lines = <FILE>;
my $test =  $lines[0];

print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));

print "Hex:\n@hex\n";

This gives the following output:

Length: 7
utf8 flag: 
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a

This is what I would expect. The length is originally 7 because Perl thinks that $test is just a series of bytes. After decoding once, Perl knows that $test is a series of UTF-8-encoded characters (i.e. instead of returning a length of 7 bytes, it returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect, since Encode::decode took the 4 code points and interpreted them as UTF-8-encoded bytes, resulting in 2 characters. The strange thing happens when I modify the code to call utf8::decode instead (replacing every $test = Encode::decode("utf8", $test); with utf8::decode($test);).

This gives almost identical output, only the result of length differs:

Length: 7
utf8 flag: 
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a

It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?

Thanks,
Matt

Solution

You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

Always use the Encode module, and also see the question "Checklist for going the Unicode way with Perl". unpack is too low-level; it does not even give you error checking.

You are going wrong with the assumption that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the character U+8AC6 and a newline. They are the single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
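To make the distinction between single and double encoding concrete, here is a small sketch that encodes the same two characters once and then twice and prints the octets as hex. The helper name hex_of is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render every octet of a byte string as two hex digits.
sub hex_of {
    my ($octets) = @_;
    return join '', map { sprintf('%02x', ord($_)) } split(//, $octets);
}

my $chars = "\x{8ac6}\n";             # two characters: U+8AC6 and a newline
my $once  = encode('UTF-8', $chars);  # single encoding: 4 octets
my $twice = encode('UTF-8', $once);   # each octet is treated as a character and encoded again

my $once_hex  = hex_of($once);
my $twice_hex = hex_of($twice);

print "encoded once:  $once_hex\n";   # e8ab860a
print "encoded twice: $twice_hex\n";  # c3a8c2abc2860a
```

A single encoding yields E8 AB 86 0A, while a double encoding yields seven octets, because every octet above 0x7F expands to two octets on the second pass.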

length is inappropriately overloaded: at certain times it determines the length in characters, at others the length in octets. Use better tools such as Devel::Peek.
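One way to keep the two meanings of length apart is to keep octets and characters in separate variables and measure each explicitly. The following is a minimal sketch under that assumption. The variable names are illustrative, and Encode::LEAVE_SRC is passed so that decode does not empty the source variable.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw octets, as they would arrive from a file or socket.
my $octets = "\xe8\xab\x86\x0a";

# Decode once, croaking on malformed input and leaving $octets untouched.
my $chars = decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());

my $char_count  = length $chars;                   # counts characters
my $octet_count = length encode('UTF-8', $chars);  # counts octets of the encoded form

print "characters: $char_count\n";  # 2
print "octets:     $octet_count\n"; # 4
```

Encoding the character string again just to count octets is wasteful for large inputs, but it makes the intent unambiguous.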

#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);

my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter

Dump $test;
#  FLAGS = (PADMY,POK,pPOK)
#  PV = 0x8d8520 "\350\253\206\n"\0

$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
#  FLAGS = (PADMY,POK,pPOK,UTF8)
#  PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
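The comment about reading the octets from a file without implicit decoding could look roughly like the sketch below. It writes a temporary file first so that it is self-contained. The use of File::Temp and the :raw layer are choices made for this illustration, not something the answer prescribes.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Create a throwaway file holding the four octets E8 AB 86 0A.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xe8\xab\x86\x0a";
close $out or die "Cannot close $path: $!";

# Read the octets back with no decoding layer applied.
open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
my $octets = <$in>;
close $in;

# Decode exactly once, croaking on malformed UTF-8.
my $chars = decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "read %d octets, decoded to %d characters, first is U+%04X\n",
    length($octets), length($chars), ord($chars);
```

Opening the file with '<:encoding(UTF-8)' instead would decode while reading, in which case the explicit decode call must be dropped to avoid decoding twice.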
