将UTF-8字节流转换为Unicode [英] Convert UTF-8 byte stream to Unicode

查看:199
本文介绍了将UTF-8字节流转换为Unicode的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何轻松创建从UTF-8 bytestream到Unicode代码点阵列的映射?为了澄清,例如我有字节序列:

How can I easily create a mapping from a UTF-8 bytestream to a Unicode codepoint array? To clarify, if for example I have the byte sequence:

c3 a5 76 aa e2 82 ac

映射应该生成两个相同长度的数组;一个具有UTF-8字节序列,另一个具有相应的Unicode码点。然后,阵列可以并排打印,如下所示:

The mapping should produce two arrays of the same length; one with UTF-8 byte sequences, and the other with the corresponding Unicode codepoint. Then, the arrays could be printed side-by-side like:

UTF8                UNICODE             
----------------------------------------
C3 A5               000000E5            
76                  00000076            
AA                  0000FFFD            
E2 82 AC            000020AC            


推荐答案

适用于流的解决方案:

use READ_SIZE => 64*1024;

my $buf = '';
while (1) {
   my $rv = sysread($fh, $buf, READ_SIZE, length($buf));
   die("Read error: $!\n") if !defined($rv);
   last if !$rv;

   while (length($buf)) {
      if ($buf =~ s/
         ^
         ( [\x00-\x7F]
         | [\xC2-\xDF] [\x80-\xBF]
         | \xE0        [\xA0-\xBF] [\x80-\xBF]
         | [\xE1-\xEF] [\x80-\xBF] [\x80-\xBF]
         | \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
         | [\xF1-\xF7] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
         )
      //x) {
         # Something valid
         my $utf8 = $1;
         utf8::decode( my $ucp = $utf8 );
         handle($utf8, $ucp);
      }

      elsif ($buf =~ s/
         ^
         (?: [\xC2-\xDF]
         |   \xE0            [\xA0-\xBF]?
         |   [\xE1-\xEF]     [\x80-\xBF]?
         |   \xF0        (?: [\x90-\xBF] [\x80-\xBF]? )?
         |   [\xF1-\xF7] (?: [\x80-\xBF] [\x80-\xBF]? )?
         )
         \z
      //x) {
         # Something possibly valid
         last;
      }

      else {
         # Something invalid
         handle(substr($buf, 0, 1, ''), "\x{FFFD}");
      }
}

while (length($buf)) {
   handle(substr($buf, 0, 1, ''), "\x{FFFD}");
}

以上只返回U + FFFD为什么编码:: decode('UTF-8',$ bytes)认为不合格。换句话说,当它遇到以下情况时,它只返回U + FFFD:

The above only returns U+FFFD for what Encode::decode('UTF-8', $bytes) considered ill-formed. In other words, it only returns U+FFFD when it encounters on of the following:


  • 一个意想不到的继续字节。

  • 一个开始字节后面没有足够的连续字节。

  • 超编码的第一个字节。

还需要后解码检查来返回U + FFFD为什么 Encode :: decode('UTF-8',$ bytes)另有违法。

Post-decoding checks are still needed to return U+FFFD for what Encode::decode('UTF-8', $bytes) considers otherwise illegal.

这篇关于将UTF-8字节流转换为Unicode的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆