JavaScript strings outside of the BMP

Question

BMP is the Basic Multilingual Plane.

According to JavaScript: The Good Parts:

JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide.

This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF.

Further investigation confirms this:

> String.fromCharCode(0x20001);

The fromCharCode method seems to use only the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) returns U+0001 instead.
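The truncation described above can be checked directly. The following is a minimal sketch, assuming a standard ECMAScript engine, where String.fromCharCode converts each argument to an unsigned 16-bit integer; the variable names are illustrative only.

```javascript
// String.fromCharCode keeps only the low 16 bits of each argument.
const truncated = String.fromCharCode(0x20001);

console.log(truncated.length);                      // 1
console.log(truncated.charCodeAt(0).toString(16));  // "1"
console.log(truncated === "\u0001");                // true

// The same masking done by hand, for comparison.
const low16 = 0x20001 & 0xFFFF;
console.log(low16 === truncated.charCodeAt(0));     // true
```

So the call does not fail; it silently produces a different character.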

Question: is it at all possible to handle post-BMP characters in JavaScript?

2011-07-31: slide twelve from "Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly" covers issues related to this quite well.

Recommended answer

Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.

However, each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: the standard String members (length, split, slice, etc.) all deal with code units, not characters, so they will quite happily split surrogate pairs or hold invalid surrogate sequences.
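To make the code-unit behaviour concrete, here is a small sketch using U+20001 written as its surrogate pair (0xD840, 0xDC01). The variable names are illustrative and not part of the original answer.

```javascript
// "a", then U+20001 as a surrogate pair, then "b": three characters, four code units.
const s = "a\uD840\uDC01b";

console.log(s.length);            // 4
console.log(s.split("").length);  // 4

// slice works on code units, so it can cut the pair in half.
const broken = s.slice(0, 2);
console.log(broken.length);                      // 2
console.log(broken.charCodeAt(1).toString(16));  // "d840" (unpaired high surrogate)

// A lone low surrogate is an accepted string value even though it is not valid UTF-16.
const lone = "\uDC01";
console.log(lone.length);         // 1
```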

If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:

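// Number of code points: every matched surrogate pair counts as one character.
// split() yields (pairs + 1) pieces, so the result is length - pairs.
String.prototype.getCodePointLength= function() {
    return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};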

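// Build a string from code points, encoding anything above U+FFFF as a surrogate pair.
// Note: this assignment replaces any built-in String.fromCodePoint the engine provides.
String.fromCodePoint= function() {
    var chars= Array.prototype.slice.call(arguments);
    // Walk backwards so that splicing in an extra unit does not shift unvisited items.
    for (var i= chars.length; i-->0;) {
        var n = chars[i]-0x10000;
        if (n>=0)
            // High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
            chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
    }
    return String.fromCharCode.apply(null, chars);
};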
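Going the other way, from code units back to code points, follows the same arithmetic in reverse. The sketch below is not from the original answer: toCodePoints is a hypothetical helper name, and it assumes that unpaired surrogates should be passed through unchanged rather than rejected.

```javascript
// Hypothetical helper: decode a string into an array of code points.
function toCodePoints(str) {
    var out = [];
    for (var i = 0; i < str.length; i++) {
        var hi = str.charCodeAt(i);
        // A high surrogate followed by a low surrogate forms one code point.
        if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
            var lo = str.charCodeAt(i + 1);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                out.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
                i++; // skip the low surrogate that was just consumed
                continue;
            }
        }
        // BMP character or unpaired surrogate (assumption: kept as-is).
        out.push(hi);
    }
    return out;
}

var points = toCodePoints("a\uD840\uDC01b");
console.log(points.length);           // 3
console.log(points[1].toString(16));  // "20001"
```

Engines that implement ES2015 or later also provide String.prototype.codePointAt and string iteration by code point, which cover much of this natively.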
