使用带有"utf8"的Encode :: encode [英] Using Encode::encode with "utf8"

查看:116
本文介绍了使用带有"utf8"的Encode :: encode的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如您所知,在Perl中,"utf8"表示Perl对UTF-8的理解较松散,这允许从技术上讲在UTF-8中不是有效代码点的字符.相比之下,"UTF-8"(或"utf-8")是Perl对UTF-8的更严格的理解,不允许有无效的代码点.

So as you probably know, in Perl "utf8" means Perl's looser understanding of UTF-8 which allows characters that technically aren't valid code points in UTF-8. By contrast "UTF-8" (or "utf-8") is Perl's stricter understanding of UTF-8 which doesn't allow invalid code points.

我有一些与此区别有关的用法问题:

I have a few usage questions related to this distinction:

    默认情况下,
  1. Encode :: encode将用替换字符替换无效字符.即使您传递宽松的"utf8"作为编码,这是真的吗?

  1. Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?

使用"UTF-8"读取和写入open的文件时会发生什么?字符替换会发生在不良字符上还是其他事情发生?

What happens when you read and write files which were open'd using "UTF-8"? Does character substitution happen to bad characters or does something else happen?

open与像'>:utf8'这样的层和像'>:encoding(utf8)'这样的层一起使用有什么区别?两种方法都可以与'utf8'和'UTF-8'一起使用吗?

What is the difference between using open with a layer like '>:utf8' and a layer like '>:encoding(utf8)' ? Can both approaches be used with both 'utf8' and 'UTF-8'?

推荐答案

                   ╔════════════════════════════════════════════╤══════════════════════╗
                   ║                                            │                      ║
                   ║                  On Read                   │       On Write       ║
                   ║                                            │                      ║
        Perl       ╟─────────────────────┬──────────────────────┼──────────────────────╢
        5.26       ║                     │                      │                      ║
                   ║ Invalid encoding    │ Outside of Unicode,  │ Outside of Unicode,  ║
                   ║ other than sequence │ Unicode nonchar, or  │ Unicode nonchar, or  ║
                   ║ length              │ Unicode surrogate    │ Unicode surrogate    ║
                   ║                     │                      │                      ║
╔══════════════════╬═════════════════════╪══════════════════════╪══════════════════════╣
║                  ║                     │                      │                      ║
║ :encoding(UTF-8) ║ Warns and Replaces  │ Warns and Replaces   │ Warns and Replaces   ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :encoding(utf8)  ║ Warns and Replaces  │ Accepts              │ Warns and Outputs    ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :utf8            ║ Corrupt scalar      │ Accepts              │ Warns and Outputs    ║
║                  ║                     │                      │                      ║
╚══════════════════╩═════════════════════╧══════════════════════╧══════════════════════╝

如果在查看上表时遇到问题,请单击此处

请注意,:encoding(UTF-8)实际上使用utf8进行解码,然后检查结果字符是否在可接受的范围内.这样可以减少由于输入错误而导致的错误消息数量,因此很好.

Note that :encoding(UTF-8) actually decodes using utf8, then checks if the resulting character is in the acceptable range. This reduces the number of error messages for bad input, so it's good.

(编码名称不区分大小写.)

(Encoding names are case-insensitive.)

用于生成上表的测试:

  • :encoding(UTF-8)

  • :encoding(UTF-8)

$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
   perl -MB -nle'
      use open ":std", ":encoding(UTF-8)";
      my $sv = B::svref_2object(\$_);
      printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
   '
utf8 "\xFFFF" does not map to Unicode.
utf8 "\xD800" does not map to Unicode.
utf8 "\x200000" does not map to Unicode.
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)

  • :encoding(utf8)

  • :encoding(utf8)

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":encoding(utf8)";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    

  • :utf8

  • :utf8

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":utf8";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
    0 (internal: 80, UTF8=1)
    

    • :encoding(UTF-8)

    • :encoding(UTF-8)

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";
       print "\x{E9}\n";
       print "\x{FFFF}\n";
       print "\x{D800}\n";
       print "\x{20_0000}\n";
    ' >a
    Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
    "\x{ffff}" does not map to utf8.
    "\x{d800}" does not map to utf8.
    "\x{200000}" does not map to utf8.
    
    $ od -t c a
    0000000 303 251  \n   \   x   {   F   F   F   F   }  \n   \   x   {   D
    0000020   8   0   0   }  \n   \   x   {   2   0   0   0   0   0   }  \n
    0000040
    
    $ cat a
    é
    \x{FFFF}
    \x{D800}
    \x{200000}
    

  • :encoding(utf8)

  • :encoding(utf8)

    $ perl -e'
       use open ":std", ":encoding(utf8)";
       print "\x{E9}\n";
       print "\x{FFFF}\n";
       print "\x{D800}\n";
       print "\x{20_0000}\n";
    ' >a
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
    
    $ od -t c a
    0000000 303 251  \n 355 240 200  \n 370 210 200 200 200  \n
    0000015
    
    $ cat a
    é
    ▒
    ▒
    

  • :utf8

  • :utf8

    结果与:encoding(utf8)相同.

    使用Perl 5.26测试.

    Tested using Perl 5.26.

    Encode :: encode默认情况下会将无效字符替换为替换字符.即使您将宽松的"utf8"作为编码传递,这是真的吗?

    Perl字符串是32位或64位字符的字符串,具体取决于内部版本. utf8可以编码任何72位整数.因此,它能够对所有可能要求编码的字符进行编码.

    Perl strings are strings of 32-bit or 64-bit characters depending on the build. utf8 can encode any 72-bit integer. It is therefore capable of encoding all characters it can be asked to encode.

    这篇关于使用带有"utf8"的Encode :: encode的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆