How to convert a UTF-8 string to a byte array?


Question

The .charCodeAt function returns the Unicode code of the character (strictly speaking, its UTF-16 code unit). But I would like to get the byte array instead. I know that if the char code is over 127, the character is stored in two or more bytes.

var arr=[];
for(var i=0; i<str.length; i++) {
    arr.push(str.charCodeAt(i))
}


Recommended answer

The logic of encoding Unicode in UTF-8 is basically:


  • Up to 4 bytes are used per character, and always the fewest bytes possible.
  • Characters up to U+007F are encoded with a single byte.
  • For multibyte sequences, the number of leading 1 bits in the first byte gives the number of bytes for the character. The remaining bits of the first byte hold bits of the character.
  • Continuation bytes begin with the bits 10, and their other 6 bits hold bits of the character.
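
To make the lead-byte and continuation-byte rules concrete, here is a minimal sketch that reads them back from already-encoded bytes. The helper names utf8SequenceLength and isContinuationByte are hypothetical and used only for illustration. The sketch checks bit patterns only and does not validate the encoding (for example, it does not reject overlong forms).

```javascript
// Hypothetical helper: how many bytes a UTF-8 sequence occupies,
// judged from the bit pattern of its first byte alone.
function utf8SequenceLength(firstByte) {
    if ((firstByte & 0x80) === 0x00) return 1; // 0xxxxxxx: single byte, up to U+007F
    if ((firstByte & 0xe0) === 0xc0) return 2; // 110xxxxx: two leading 1 bits
    if ((firstByte & 0xf0) === 0xe0) return 3; // 1110xxxx: three leading 1 bits
    if ((firstByte & 0xf8) === 0xf0) return 4; // 11110xxx: four leading 1 bits
    return 0; // 10xxxxxx (continuation byte) or 11111xxx: not a lead byte
}

// Hypothetical helper: true for continuation bytes (10xxxxxx).
function isContinuationByte(byte) {
    return (byte & 0xc0) === 0x80;
}

// The euro sign, U+20AC, is encoded in UTF-8 as the bytes E2 82 AC.
var euro = [0xe2, 0x82, 0xac];
console.log(utf8SequenceLength(euro[0]));             // 3
console.log(euro.slice(1).every(isContinuationByte)); // true
```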

Here's a function I wrote a while back for encoding a JavaScript UTF-16 string in UTF-8:

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6), 
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >>18), 
                      0x80 | ((charcode>>12) & 0x3f), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}
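
For comparison, environments that provide the standard TextEncoder API (current browsers and recent Node.js versions) can produce UTF-8 bytes without hand-written bit manipulation. The following is a minimal sketch, assuming TextEncoder is available as a global. The name utf8Bytes is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper name; assumes a global TextEncoder exists.
function utf8Bytes(str) {
    // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
    var bytes = new TextEncoder().encode(str);
    // Copy into a plain Array so the result is an ordinary array of numbers.
    return Array.from(bytes);
}

console.log(utf8Bytes("A"));          // bytes: 65
console.log(utf8Bytes("\u20ac"));     // bytes: 226, 130, 172 (E2 82 AC)
console.log(utf8Bytes("\ud83d\ude00")); // bytes: 240, 159, 152, 128 (F0 9F 98 80)
```

One difference worth knowing: TextEncoder replaces an unpaired surrogate with U+FFFD before encoding, whereas the hand-written function above assumes every surrogate is part of a valid pair.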
