String length in bytes in JavaScript

Question

In my JavaScript code I need to compose a message to the server in this format:

<size in bytes>CRLF
<data>CRLF

Example:

3
foo

The data may contain Unicode characters. I need to send them as UTF-8.

I'm looking for the most cross-browser way to calculate the length of a string in bytes in JavaScript.

I've tried this to compose my payload:

return unescape(encodeURIComponent(str)).length + "\r\n" + str + "\r\n";

But it does not give me accurate results in older browsers (or maybe the strings in those browsers are in UTF-16?).

Any clues?

Update:

Example: the length in bytes of the string ЭЭХ! Naïve? in UTF-8 is 15 bytes, but some browsers report 23 bytes instead.

Recommended answer

There is no way to do it in JavaScript natively. (See Riccardo Galli's answer for a modern approach.)
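
For readers who can rely on the TextEncoder API, the modern approach can be sketched roughly as follows. This is a minimal sketch, not a reproduction of the referenced answer: it assumes a runtime that provides TextEncoder, and the names utf8ByteLength and composeMessage are hypothetical, chosen here to match the message format in the question.

```javascript
// Assumes TextEncoder is available (modern browsers, Node.js).
function utf8ByteLength(str) {
  // TextEncoder encodes to UTF-8; encode() returns a Uint8Array of the bytes.
  return new TextEncoder().encode(str).length;
}

// Hypothetical helper: builds "<size in bytes>CRLF<data>CRLF".
function composeMessage(data) {
  return utf8ByteLength(data) + "\r\n" + data + "\r\n";
}
```

With this sketch, composeMessage("foo") yields "3\r\nfoo\r\n", and utf8ByteLength applied to the string from the question's update returns 15.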

The rest of this answer is kept for historical reference, or for environments where the TextEncoder API is still unavailable.

If you know the character encoding, you can calculate it yourself though.

encodeURIComponent assumes UTF-8 as the character encoding, so if you need that encoding, you can do this:

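function lengthInUtf8Bytes(str) {
  var encoded = encodeURIComponent(str);
  // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
  var m = encoded.match(/%[89ABab]/g);
  // Matches the 11110... lead byte of a 4-byte sequence. Such a character is a
  // surrogate pair in the string, so str.length already counts it twice.
  var astral = encoded.match(/%[Ff][0-7]/g);
  return str.length + (m ? m.length : 0) - (astral ? astral.length : 0);
}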

This works because of the way UTF-8 encodes multi-byte sequences. The first byte of an encoded character either has a high bit of zero (a single-byte sequence) or has C, D, E, or F as its first hex digit. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.
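
To see which bytes the regular expression is picking out, the escapes produced by encodeURIComponent can be listed one byte at a time. The helper names utf8HexBytes and isContinuationByte below are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: lists the UTF-8 bytes of a string as hex pairs,
// derived from the output of encodeURIComponent.
function utf8HexBytes(str) {
  var encoded = encodeURIComponent(str);
  var bytes = [];
  for (var i = 0; i < encoded.length; i++) {
    if (encoded.charAt(i) === '%') {
      // A %XX escape stands for one byte.
      bytes.push(encoded.substring(i + 1, i + 3));
      i += 2;
    } else {
      // Characters left unescaped are ASCII, so each is a single byte.
      bytes.push(encoded.charCodeAt(i).toString(16).toUpperCase());
    }
  }
  return bytes;
}

// True for continuation bytes (bit pattern 10xxxxxx, hex 80 to BF).
function isContinuationByte(hex) {
  return /^[89AB]/i.test(hex);
}
```

For "ï" this gives the two bytes C3 and AF: C3 is the lead byte and AF is the one continuation byte, which is the extra byte added on top of the string length.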

The table in Wikipedia makes it clearer:

Bits        Last code point Byte 1          Byte 2          Byte 3
  7         U+007F          0xxxxxxx
 11         U+07FF          110xxxxx        10xxxxxx
 16         U+FFFF          1110xxxx        10xxxxxx        10xxxxxx
...
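
The table can also be applied directly, one code point at a time, without going through encodeURIComponent. The following is a sketch under stated assumptions rather than part of the original answer: the function name utf8LengthFromCodePoints is hypothetical, the 4-byte case for code points above U+FFFF is taken from the UTF-8 definition rather than from the rows shown above, and unpaired surrogates are counted as 3 bytes (the size of the replacement character U+FFFD).

```javascript
// Hypothetical helper: UTF-8 byte length computed from UTF-16 code units.
function utf8LengthFromCodePoints(str) {
  var bytes = 0;
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code <= 0x7F) {
      bytes += 1; // up to U+007F: 0xxxxxxx
    } else if (code <= 0x7FF) {
      bytes += 2; // up to U+07FF: 110xxxxx 10xxxxxx
    } else if (code >= 0xD800 && code <= 0xDBFF &&
               i + 1 < str.length &&
               str.charCodeAt(i + 1) >= 0xDC00 &&
               str.charCodeAt(i + 1) <= 0xDFFF) {
      bytes += 4; // valid surrogate pair: code point above U+FFFF
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // up to U+FFFF (assumption: lone surrogates also count as 3)
    }
  }
  return bytes;
}
```

Unlike encodeURIComponent, this loop does not throw on unpaired surrogates, which may matter when the input is not guaranteed to be well-formed.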

If instead you need the length in the page's encoding, you can use this trick:

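function lengthInPageEncoding(s) {
  // Let the browser percent-encode the string as the fragment of a link URL.
  var a = document.createElement('A');
  a.href = '#' + s;
  var sEncoded = a.href;
  sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
  // Each %XX escape is three characters but stands for one byte.
  // The i flag is needed because escapes may use uppercase hex digits.
  var m = sEncoded.match(/%[0-9a-f]{2}/gi);
  return sEncoded.length - (m ? m.length * 2 : 0);
}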
