perl Encode::Guess with and without hints - detecting utf8

Question

I am confused about Encode::Guess. Suppose this is my Perl code:

use strict; 
use warnings;
use 5.18.2;
use Encode;
use Encode::Guess qw/utf8 iso-8859-1/;
use open IO => ':encoding(UTF-8)', ':std';
my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";
my $str2 =  "2 = educa\x{e7}\x{e3}o";

say "A: ".&fixEnc($str1);
say "B: ".&fixEnc($str1,'hint');
say "C: ".&fixEnc($str2);
say "D: ".&fixEnc($str2,'hint');
say "";

sub fixEnc {
    my $data = $_[0];
    my $enc = "";
    if ($_[1]) {
        $enc = guess_encoding($data,qw/utf8 iso-8859-1/);
    } else {
        $enc = guess_encoding($data);
    };
    if (!ref($enc)) {
        return "ERROR: Can't guess: $enc for $data";
    } else {
        my $utf8 = decode($enc->name, $data);
        $utf8 = "encoding guess: ".$enc->name."; result: $utf8";
        return $utf8;
    };
};

It produces:

A1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educação
B1: ERROR: Can't guess: utf8 or iso-8859-1 for 1 = educação
C1: encoding guess: iso-8859-1; result: 2 = educação
D1: encoding guess: iso-8859-1; result: 2 = educação

Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;', I get

A2: encoding guess: utf8; result: 1 = educação
B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educação
C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação
D2: encoding guess: iso-8859-1; result: 2 = educação

What causes the difference? In particular, why is utf8 not guessed when I hint with utf8?

I have posted an answer below. Basically, the realisation is that Guess goes by character encodings and doesn't speak Portuguese! 'educaÃ§Ã£o', while not Portuguese, is a perfectly valid latin-1 reading of string 1 above, and Guess (unlike a Portuguese speaker) cannot distinguish it from the UTF8 reading, 'educação'.

Recommended answer

I think this is what's going on. With 'use Encode::Guess qw/utf8 iso-8859-1/;' the 'hint' makes no difference (sorry for being unclear!), because both encodings are already suspects, so we only have

A1/B1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educação

and

C1/D1: encoding guess: iso-8859-1; result: 2 = educação

For A1/B1, the string could be UTF8 (educação) or it could be latin1 (educaÃ§Ã£o). The second one looks incorrect, but Encode::Guess cannot tell: Guess goes by character encodings and doesn't speak Portuguese!
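
The ambiguity can be checked directly with Encode's decode. The following is a minimal sketch, not part of the original question; the variable names are chosen here for illustration. It decodes the bytes of string 1 both ways and also shows that the reverse case (the latin-1 bytes of string 2) is not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bytes of string 1: c3 a7 c3 a3 is "çã" when read as UTF-8.
my $bytes = "educa\xc3\xa7\xc3\xa3o";

# Strict UTF-8 decoding succeeds and yields 8 characters.
my $as_utf8 = decode('UTF-8', $bytes,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Latin-1 decoding also succeeds, because every byte value maps to a
# Latin-1 character; it yields 10 characters (mojibake to a human reader).
my $as_latin1 = decode('iso-8859-1', $bytes);

printf "as UTF-8:   %d characters\n", length $as_utf8;
printf "as Latin-1: %d characters\n", length $as_latin1;

# The reverse does not hold: the Latin-1 bytes of string 2 are not
# well-formed UTF-8, so strict decoding dies and eval returns false.
my $latin1_bytes = "educa\xe7\xe3o";
my $ok = eval {
    decode('UTF-8', $latin1_bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "Latin-1 bytes decode as UTF-8: ", ($ok ? "yes" : "no"), "\n";
```

Because both decodes of string 1 succeed, a guesser that only tests whether the bytes fit an encoding has no basis for preferring one over the other.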

Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;', I get

A2: encoding guess: utf8; result: 1 = educação

latin-1 is no longer an option (it's not among the default suspects, which are ascii, utf8 and UTF-16/32 with BOM), so the result comes out as utf8.

B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educação

In B2, with the hint, latin-1 is a candidate again, so we're back in the above scenario, and Guess cannot decide.
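
The A2 and B2 cases can be reproduced in isolation. This is a small sketch under the same assumptions as the question (a plain 'use Encode::Guess;' and the bytes of string 1); the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode::Guess;    # no import list: default suspects only

my $utf8_bytes = "educa\xc3\xa7\xc3\xa3o";

# A2: no hint. Of the default suspects only utf8 decodes the bytes
# cleanly, so guess_encoding returns an encoding object.
my $no_hint = guess_encoding($utf8_bytes);
print "no hint:   ",
    (ref $no_hint ? $no_hint->name : "ERROR: $no_hint"), "\n";

# B2: the hint adds iso-8859-1 for this call. Now two suspects fit,
# so a plain string naming the candidates is returned instead.
my $with_hint = guess_encoding($utf8_bytes, qw/utf8 iso-8859-1/);
print "with hint: ",
    (ref $with_hint ? $with_hint->name : "ERROR: $with_hint"), "\n";
```

Note that the order of the names in the error string ('iso-8859-1 or utf8' versus 'utf8 or iso-8859-1') is not significant and should not be relied on.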

For C2:

C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação

This makes sense, as latin-1 isn't part of the default. Finally, in D2

D2: encoding guess: iso-8859-1; result: 2 = educação

latin-1 is hinted, so the encoding is detected.
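
The C2 and D2 cases can be reproduced the same way. The sketch below also includes one common way around the ambiguity, which is not something the answer above proposes: try strict UTF-8 first and fall back to latin-1 only if that fails. The helper decode_utf8_or_latin1 is a hypothetical name introduced here, not part of Encode, and it assumes the input can only be one of these two encodings.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # no import list: default suspects only

my $latin1_bytes = "educa\xe7\xe3o";    # string 2: Latin-1 bytes

# C2: none of the default suspects fits, so an error string comes back.
my $c2 = guess_encoding($latin1_bytes);
# D2: with the hint, iso-8859-1 is the only suspect that fits.
my $d2 = guess_encoding($latin1_bytes, qw/utf8 iso-8859-1/);

print "C2: ", (ref $c2 ? $c2->name : "ERROR: $c2"), "\n";
print "D2: ", (ref $d2 ? $d2->name : "ERROR: $d2"), "\n";

# Hypothetical helper (not part of Encode): prefer strict UTF-8 and
# fall back to Latin-1 only when the bytes are not well-formed UTF-8.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('iso-8859-1', $bytes);
}

# Both byte strings from the question decode to the same text.
my $from_utf8   = decode_utf8_or_latin1("educa\xc3\xa7\xc3\xa3o");
my $from_latin1 = decode_utf8_or_latin1($latin1_bytes);
print "same text: ", ($from_utf8 eq $from_latin1 ? "yes" : "no"), "\n";
```

The design choice here is a bias, not a detection: genuine latin-1 text that happens to be well-formed UTF-8 (such as 'educaÃ§Ã£o') will be read as UTF-8, which is usually, but not always, what is wanted.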
