String length in bytes in JavaScript
Question
In my JavaScript code I need to compose a message to the server in this format:
<size in bytes>CRLF
<data>CRLF
Example:
3
foo
The data may contain Unicode characters. I need to send them as UTF-8.
I'm looking for the most cross-browser way to calculate the length of the string in bytes in JavaScript.
I've tried this to compose my payload:
return unescape(encodeURIComponent(str)).length + "\r\n" + str + "\r\n"
But it does not give me accurate results in older browsers (or maybe the strings in those browsers are in UTF-16?).
Any clues?
Update:
Example: length in bytes of the string ЭЭХ! Naïve?
in UTF-8 is 15 bytes, but some browsers report 23 bytes instead.
Recommended answer
There is no way to do it in JavaScript natively. (See Riccardo Galli's answer for a modern approach.)
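The referenced answer is not reproduced here. Since the note just below mentions the TextEncoder API, a minimal sketch of that approach might look like the following. It assumes an environment where `TextEncoder` is available as a global; the names `utf8ByteLength` and `composeMessage` are hypothetical helpers, not taken from the original thread.

```javascript
// Sketch only: utf8ByteLength and composeMessage are hypothetical helper names.
function utf8ByteLength(str) {
  // TextEncoder always encodes to UTF-8; encode() returns a Uint8Array,
  // so its length is the number of UTF-8 bytes.
  return new TextEncoder().encode(str).length;
}

// Builds "<size in bytes>CRLF<data>CRLF", the format described in the question.
function composeMessage(str) {
  return utf8ByteLength(str) + "\r\n" + str + "\r\n";
}
```

With these definitions, `composeMessage("foo")` produces `"3\r\nfoo\r\n"`.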
What follows is kept for historical reference, or for environments where the TextEncoder API is still unavailable.
If you know the character encoding, you can calculate it yourself, though.
encodeURIComponent assumes UTF-8 as the character encoding, so if you need that encoding, you can do:
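function lengthInUtf8Bytes(str) {
  // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
  var m = encodeURIComponent(str).match(/%[89ABab]/g);
  // str.length counts UTF-16 code units, so a surrogate pair (one code point,
  // four UTF-8 bytes) is counted twice; subtract one per pair to compensate.
  var pairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
  return str.length + (m ? m.length : 0) - (pairs ? pairs.length : 0);
}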
This should work because of the way UTF-8 encodes multi-byte sequences. The first encoded byte is either a byte with a high bit of zero, for a single-byte sequence, or a byte whose first hex digit is C, D, E, or F. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.
The table in Wikipedia makes it clearer:
Bits  Last code point  Byte 1    Byte 2    Byte 3
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
...
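The ranges in the table can also be applied directly, without going through encodeURIComponent. The following is a rough sketch rather than part of the original answer: `utf8LengthByRange` is a hypothetical name, and the choice to count an unpaired surrogate as three bytes (the size of the U+FFFD replacement character) is an assumption.

```javascript
// Hypothetical helper: counts UTF-8 bytes by walking the string's UTF-16 code units
// and applying the per-range byte counts from the table.
function utf8LengthByRange(str) {
  var bytes = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c < 0x80) {
      bytes += 1; // up to U+007F: one byte
    } else if (c < 0x800) {
      bytes += 2; // up to U+07FF: two bytes
    } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length &&
               str.charCodeAt(i + 1) >= 0xDC00 && str.charCodeAt(i + 1) <= 0xDFFF) {
      bytes += 4; // surrogate pair: one code point above U+FFFF, four bytes
      i++;        // skip the low surrogate, already accounted for
    } else {
      bytes += 3; // up to U+FFFF: three bytes (assumption: lone surrogates also count as 3)
    }
  }
  return bytes;
}
```

Unlike the encodeURIComponent version, this sketch does not throw on malformed strings that contain an unpaired surrogate.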
If instead you need the length in the page's encoding, you can use this trick:
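function lengthInPageEncoding(s) {
  // Let the browser percent-encode the string as part of a URL fragment.
  var a = document.createElement('A');
  a.href = '#' + s;
  var sEncoded = a.href;
  sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
  // Each %XX escape is three characters but one byte; the i flag is needed
  // because browsers emit uppercase hex digits.
  var m = sEncoded.match(/%[0-9a-f]{2}/gi);
  return sEncoded.length - (m ? m.length * 2 : 0);
}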