使用 Encode::encode 和“utf8" [英] Using Encode::encode with "utf8"

查看:32
本文介绍了使用 Encode::encode 和“utf8"的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

因此,您可能知道,在 Perl 中,utf8"意味着 Perl 对 UTF-8 的理解较为宽松,它允许字符在技术上不是 UTF-8 中的有效代码点.相比之下,UTF-8"(或utf-8")是 Perl 对 UTF-8 的更严格理解,不允许无效代码点.

So as you probably know, in Perl "utf8" means Perl's looser understanding of UTF-8 which allows characters that technically aren't valid code points in UTF-8. By contrast "UTF-8" (or "utf-8") is Perl's stricter understanding of UTF-8 which doesn't allow invalid code points.

我有几个与此区别相关的用法问题:

I have a few usage questions related to this distinction:

  1. Encode::encode 默认会用替换字符替换无效字符.即使您将更宽松的utf8"作为编码传递,也是如此吗?

  1. Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?

当您读写使用UTF-8"打开的文件时会发生什么?字符替换发生在坏字符上还是发生了其他事情?

What happens when you read and write files which were open'd using "UTF-8"? Does character substitution happen to bad characters or does something else happen?

open 与像 '>:utf8' 这样的层和像 '>:encoding(utf8)' 这样的层一起使用有什么区别?'utf8' 和 'UTF-8' 可以同时使用这两种方法吗?

What is the difference between using open with a layer like '>:utf8' and a layer like '>:encoding(utf8)' ? Can both approaches be used with both 'utf8' and 'UTF-8'?

推荐答案

                   ╔════════════════════════════════════════════╤══════════════════════╗
                   ║                                            │                      ║
                   ║                  On Read                   │       On Write       ║
                   ║                                            │                      ║
        Perl       ╟─────────────────────┬──────────────────────┼──────────────────────╢
        5.26       ║                     │                      │                      ║
                   ║ Invalid encoding    │ Outside of Unicode,  │ Outside of Unicode,  ║
                   ║ other than sequence │ Unicode nonchar, or  │ Unicode nonchar, or  ║
                   ║ length              │ Unicode surrogate    │ Unicode surrogate    ║
                   ║                     │                      │                      ║
╔══════════════════╬═════════════════════╪══════════════════════╪══════════════════════╣
║                  ║                     │                      │                      ║
║ :encoding(UTF-8) ║ Warns and Replaces  │ Warns and Replaces   │ Warns and Replaces   ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :encoding(utf8)  ║ Warns and Replaces  │ Accepts              │ Warns and Outputs    ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :utf8            ║ Corrupt scalar      │ Accepts              │ Warns and Outputs    ║
║                  ║                     │                      │                      ║
╚══════════════════╩═════════════════════╧══════════════════════╧══════════════════════╝

如果您无法查看上表,请单击此处

请注意,:encoding(UTF-8) 实际上使用 utf8 进行解码,然后检查结果字符是否在可接受的范围内.这减少了错误输入的错误消息数量,因此很好.

Note that :encoding(UTF-8) actually decodes using utf8, then checks if the resulting character is in the acceptable range. This reduces the number of error messages for bad input, so it's good.

(编码名称不区分大小写.)

(Encoding names are case-insensitive.)

用于生成上表的测试:

  • :encoding(UTF-8)

$ printf "xC3xA9
xEFxBFxBF
xEDxA0x80
xF8x88x80x80x80
x80
" |
   perl -MB -nle'
      use open ":std", ":encoding(UTF-8)";
      my $sv = B::svref_2object($_);
      printf "%vX%s (internal: %vX, UTF8=%d)
", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
   '
utf8 "xFFFF" does not map to Unicode.
utf8 "xD800" does not map to Unicode.
utf8 "x200000" does not map to Unicode.
utf8 "x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
5C.78.7B.46.46.46.46.7D = x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
5C.78.7B.44.38.30.30.7D = x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
5C.78.7B.32.30.30.30.30.30.7D = x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
5C.78.38.30 = x80 (internal: 5C.78.38.30, UTF8=1)

  • :encoding(utf8)

    $ printf "xC3xA9
    xEFxBFxBF
    xEDxA0x80
    xF8x88x80x80x80
    x80
    " |
       perl -MB -nle'
          use open ":std", ":encoding(utf8)";
          my $sv = B::svref_2object($_);
          printf "%vX%s (internal: %vX, UTF8=%d)
    ", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    5C.78.38.30 = x80 (internal: 5C.78.38.30, UTF8=1)
    

  • :utf8

    $ printf "xC3xA9
    xEFxBFxBF
    xEDxA0x80
    xF8x88x80x80x80
    x80
    " |
       perl -MB -nle'
          use open ":std", ":utf8";
          my $sv = B::svref_2object($_);
          printf "%vX%s (internal: %vX, UTF8=%d)
    ", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    Malformed UTF-8 character: x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
    0 (internal: 80, UTF8=1)
    

    • :encoding(UTF-8)

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";
       print "x{E9}
    ";
       print "x{FFFF}
    ";
       print "x{D800}
    ";
       print "x{20_0000}
    ";
    ' >a
    Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
    "x{ffff}" does not map to utf8.
    "x{d800}" does not map to utf8.
    "x{200000}" does not map to utf8.
    
    $ od -t c a
    0000000 303 251  
          x   {   F   F   F   F   }  
          x   {   D
    0000020   8   0   0   }  
          x   {   2   0   0   0   0   0   }  
    
    0000040
    
    $ cat a
    é
    x{FFFF}
    x{D800}
    x{200000}
    

  • :encoding(utf8)

    $ perl -e'
       use open ":std", ":encoding(utf8)";
       print "x{E9}
    ";
       print "x{FFFF}
    ";
       print "x{D800}
    ";
       print "x{20_0000}
    ";
    ' >a
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
    
    $ od -t c a
    0000000 303 251  
     355 240 200  
     370 210 200 200 200  
    
    0000015
    
    $ cat a
    é
    ▒
    ▒
    

  • :utf8

    结果与 :encoding(utf8) 相同.

    使用 Perl 5.26 测试.

    Tested using Perl 5.26.

    Encode::encode 默认会用替换字符替换无效字符.即使您将更宽松的utf8"作为编码传递也是如此吗?

    Perl 字符串是 32 位或 64 位字符的字符串,具体取决于构建.utf8 可以编码任何 72 位整数.因此,它能够编码所有可以被要求编码的字符.

    Perl strings are strings of 32-bit or 64-bit characters depending on the build. utf8 can encode any 72-bit integer. It is therefore capable of encoding all characters it can be asked to encode.

    这篇关于使用 Encode::encode 和“utf8"的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆