Convert UTF-8 byte stream to Unicode
Question
How can I easily create a mapping from a UTF-8 bytestream to a Unicode codepoint array? To clarify, if for example I have the byte sequence:
c3 a5 76 aa e2 82 ac
The mapping should produce two arrays of the same length; one with UTF-8 byte sequences, and the other with the corresponding Unicode codepoint. Then, the arrays could be printed side-by-side like:
UTF8                 UNICODE
----------------------------------------
C3 A5                000000E5
76                   00000076
AA                   0000FFFD
E2 82 AC             000020AC
Recommended answer
A solution that works with streams:
use constant READ_SIZE => 64*1024;

# $fh is a filehandle opened by the caller; handle() is supplied by the caller.
my $buf = '';
while (1) {
    # Append the next chunk to whatever is left over from the previous pass.
    my $rv = sysread($fh, $buf, READ_SIZE, length($buf));
    die("Read error: $!\n") if !defined($rv);
    last if !$rv;

    while (length($buf)) {
        if ($buf =~ s/
            ^
            (                [\x00-\x7F]
            | [\xC2-\xDF]    [\x80-\xBF]
            | \xE0           [\xA0-\xBF] [\x80-\xBF]
            | [\xE1-\xEF]    [\x80-\xBF] [\x80-\xBF]
            | \xF0           [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
            | [\xF1-\xF7]    [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
            )
        //x) {
            # Something valid
            my $utf8 = $1;
            utf8::decode( my $ucp = $utf8 );
            handle($utf8, $ucp);
        }
        elsif ($buf =~ /
            ^
            (?: [\xC2-\xDF]
            |   \xE0           [\xA0-\xBF]?
            |   [\xE1-\xEF]    [\x80-\xBF]?
            |   \xF0        (?: [\x90-\xBF] [\x80-\xBF]? )?
            |   [\xF1-\xF7] (?: [\x80-\xBF] [\x80-\xBF]? )?
            )
            \z
        /x) {
            # Something possibly valid: keep the partial sequence in the
            # buffer and wait for more input.
            last;
        }
        else {
            # Something invalid
            handle(substr($buf, 0, 1, ''), "\x{FFFD}");
        }
    }
}

# End of input: whatever remains is an incomplete sequence.
while (length($buf)) {
    handle(substr($buf, 0, 1, ''), "\x{FFFD}");
}
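The answer leaves handle() undefined. As a minimal sketch, assuming the goal is the side-by-side table from the question, it could look like the following. The helper name format_row and the column width of 20 are assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: build one table row from the raw UTF-8 bytes and the
# decoded character (or U+FFFD for an invalid byte).
sub format_row {
    my ($utf8, $ucp) = @_;
    # Hex-dump the bytes, e.g. "E2 82 AC".
    my $hex_bytes = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $utf8;
    # Left-align the bytes, then print the code point as 8 hex digits.
    return sprintf '%-20s %08X', $hex_bytes, ord $ucp;
}

# Hypothetical handle() for the loop above: print one row per unit.
sub handle {
    my ($utf8, $ucp) = @_;
    print format_row($utf8, $ucp), "\n";
}
```

Printing a header line and a row of dashes before the read loop would complete the layout shown in the question.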
The above only returns U+FFFD for what Encode::decode('UTF-8', $bytes)
considers ill-formed. In other words, it only returns U+FFFD when it encounters one of the following:
- An unexpected continuation byte.
- A start byte that is not followed by enough continuation bytes.
- The first byte of an overlong encoding.
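To see these three cases in isolation, here is a small sketch that feeds one sample of each to Encode::decode and reports whether U+FFFD appears in the result. The helper name has_replacement and the sample byte strings are chosen for illustration. How many U+FFFD characters Encode emits per ill-formed sequence may differ between versions, so the sketch only checks for presence.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if decoding the bytes as strict UTF-8 produced
# at least one U+FFFD replacement character.
sub has_replacement {
    my ($bytes) = @_;
    my $copy = $bytes;    # work on a copy; decode() can modify its argument
    my $text = decode('UTF-8', $copy);
    return index($text, "\x{FFFD}") >= 0 ? 1 : 0;
}

my %samples = (
    'unexpected continuation byte' => "\xAA",
    'truncated sequence'           => "\xE2\x82",
    'overlong encoding'            => "\xC0\x80",
);

for my $name (sort keys %samples) {
    printf "%-30s %s\n", $name,
        has_replacement($samples{$name}) ? 'U+FFFD' : 'ok';
}
```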
Post-decoding checks are still needed to return U+FFFD for what Encode::decode('UTF-8', $bytes)
considers otherwise illegal.
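A post-decoding check might be sketched as follows. This assumes the otherwise-illegal cases are surrogate code points and code points above U+10FFFF. The exact set that Encode rejects can depend on its version, so treat the ranges as an assumption. The name post_check is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical post-decoding filter: takes the single character produced by
# utf8::decode and returns U+FFFD if it is not a legal Unicode scalar value.
sub post_check {
    my ($ucp) = @_;
    my $cp = ord $ucp;
    return "\x{FFFD}" if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogate
    return "\x{FFFD}" if $cp > 0x10FFFF;                    # beyond Unicode
    return $ucp;
}
```

In the loop above, the valid branch would then call handle($utf8, post_check($ucp)) instead of passing $ucp directly.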