将UTF-8字节流转换为Unicode [英] Convert UTF-8 byte stream to Unicode

查看：199 发布时间：2017/8/16 23:23:05 perl unicode encoding utf-8

本文介绍了将UTF-8字节流转换为Unicode的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

如何轻松创建从UTF-8 bytestream到Unicode代码点阵列的映射？为了澄清，例如我有字节序列：

How can I easily create a mapping from a UTF-8 bytestream to a Unicode codepoint array? To clarify, if for example I have the byte sequence:

c3 a5 76 aa e2 82 ac

映射应该生成两个相同长度的数组;一个具有UTF-8字节序列，另一个具有相应的Unicode码点。然后，阵列可以并排打印，如下所示：

The mapping should produce two arrays of the same length; one with UTF-8 byte sequences, and the other with the corresponding Unicode codepoint. Then, the arrays could be printed side-by-side like:

UTF8                UNICODE             
----------------------------------------
C3 A5               000000E5            
76                  00000076            
AA                  0000FFFD            
E2 82 AC            000020AC

推荐答案

适用于流的解决方案：

use READ_SIZE => 64*1024;

my $buf = '';
while (1) {
   my $rv = sysread($fh, $buf, READ_SIZE, length($buf));
   die("Read error: $!\n") if !defined($rv);
   last if !$rv;

   while (length($buf)) {
      if ($buf =~ s/
         ^
         ( [\x00-\x7F]
         | [\xC2-\xDF] [\x80-\xBF]
         | \xE0        [\xA0-\xBF] [\x80-\xBF]
         | [\xE1-\xEF] [\x80-\xBF] [\x80-\xBF]
         | \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
         | [\xF1-\xF7] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
         )
      //x) {
         # Something valid
         my $utf8 = $1;
         utf8::decode( my $ucp = $utf8 );
         handle($utf8, $ucp);
      }

      elsif ($buf =~ s/
         ^
         (?: [\xC2-\xDF]
         |   \xE0            [\xA0-\xBF]?
         |   [\xE1-\xEF]     [\x80-\xBF]?
         |   \xF0        (?: [\x90-\xBF] [\x80-\xBF]? )?
         |   [\xF1-\xF7] (?: [\x80-\xBF] [\x80-\xBF]? )?
         )
         \z
      //x) {
         # Something possibly valid
         last;
      }

      else {
         # Something invalid
         handle(substr($buf, 0, 1, ''), "\x{FFFD}");
      }
}

while (length($buf)) {
   handle(substr($buf, 0, 1, ''), "\x{FFFD}");
}

以上只返回U + FFFD为什么编码:: decode（'UTF-8'，$ bytes）认为不合格。换句话说，当它遇到以下情况时，它只返回U + FFFD：

The above only returns U+FFFD for what Encode::decode('UTF-8', $bytes) considered ill-formed. In other words, it only returns U+FFFD when it encounters on of the following:

一个意想不到的继续字节。

一个开始字节后面没有足够的连续字节。

超编码的第一个字节。

还需要后解码检查来返回U + FFFD为什么 Encode :: decode（'UTF-8'，$ bytes）另有违法。

Post-decoding checks are still needed to return U+FFFD for what Encode::decode('UTF-8', $bytes) considers otherwise illegal.

这篇关于将UTF-8字节流转换为Unicode的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

将UTF-8字节流转换为Unicode [英] Convert UTF-8 byte stream to Unicode

问题描述

推荐答案

相关文章

开发方法最新文章

热门教程

热门工具

登录关闭

将UTF-8字节流转换为Unicode [英] Convert UTF-8 byte stream to Unicode

问题描述

推荐答案

相关文章

开发方法最新文章

热门教程

热门工具

登录 关闭

登录关闭