Storing binary data in a UTF-8 string

Question

I want to use a WebSocket to transfer binary data, but WebSockets can only transfer UTF-8 strings.

Encoding the data with base64 is one option, but my understanding is that base64 is most desirable when the text might be converted from one format to another. In this case I know the data will always be UTF-8, so is there a better way to encode binary data in a UTF-8 string without paying base64's 33% size premium?

This question is mostly academic, as binary support will probably be added to WebSocket eventually, and base64 is a perfectly cromulent alternative in the meantime.

Recommended answer

You could use a Base-128 encoding instead of a Base-64 encoding. That results in an overhead of only 1/7 (about 14%), as opposed to 1/3 (about 33%).
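To make the 1/3 versus 1/7 comparison concrete, here is a small Python sketch that computes the encoded length for both schemes. The helper names `base64_length` and `base128_length` are hypothetical, introduced only for this illustration; the Base-128 figure assumes the plain 7-bits-per-character packing described in this answer, with no padding characters.

```python
import base64


def base64_length(n: int) -> int:
    """Characters in padded base64 output for n input bytes (4 per 3 bytes)."""
    return 4 * ((n + 2) // 3)


def base128_length(n: int) -> int:
    """Characters needed when each character carries 7 bits of payload."""
    return (n * 8 + 6) // 7  # ceil(8n / 7) using integer arithmetic


n = 7000
# Cross-check the base64 formula against the standard library.
assert len(base64.b64encode(bytes(n))) == base64_length(n)

print("base64 ratio: ", base64_length(n) / n)   # roughly 4/3
print("base128 ratio:", base128_length(n) / n)  # roughly 8/7
```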

The idea is to use only the Unicode code points that UTF-8 represents in a single byte (0–127). Every such byte begins with a 0 bit, so seven bits are left for the data:

0xxxxxxx
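The single-byte claim can be checked directly. The following Python snippet is only a sanity check, with `utf8_length` as a hypothetical helper name; it relies on nothing beyond the built-in `str.encode`.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))


# Code points 0..127 each take exactly one byte, and its top bit is 0.
assert all(utf8_length(cp) == 1 for cp in range(128))
assert all((chr(cp).encode("utf-8")[0] & 0x80) == 0 for cp in range(128))

# From code point 128 upward, UTF-8 needs at least two bytes.
assert utf8_length(128) == 2
```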

That results in an encoding where every 7 input bytes are encoded as 8 output bytes:

input:  aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg
output: 0aaaaaaa 0abbbbbb 0bbccccc 0cccdddd 0ddddeee 0eeeeeff 0ffffffg 0ggggggg

So the output-to-input ratio is 8/7.
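The bit layout above can be sketched in Python as follows. This is a minimal illustration rather than a standard codec: the function names `encode_base128` and `decode_base128` are hypothetical, and the handling of inputs whose length is not a multiple of 7 (zero-padding the final 7-bit group) is an assumption, since the answer only describes full 7-byte groups.

```python
def encode_base128(data: bytes) -> str:
    """Pack the bits of `data` into 7-bit groups, one ASCII character each."""
    out = []
    acc = 0    # bit accumulator
    nbits = 0  # number of unconsumed bits currently held in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append(chr((acc >> nbits) & 0x7F))
        acc &= (1 << nbits) - 1  # keep only the unconsumed low bits
    if nbits:
        # Assumption: left-align leftover bits and pad the group with zeros.
        out.append(chr((acc << (7 - nbits)) & 0x7F))
    return "".join(out)


def decode_base128(text: str) -> bytes:
    """Inverse of encode_base128; trailing padding bits are discarded."""
    out = bytearray()
    acc = 0
    nbits = 0
    for ch in text:
        value = ord(ch)
        if value > 0x7F:
            raise ValueError("character outside the 0-127 range")
        acc = (acc << 7) | value
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)


payload = bytes(range(256))
encoded = encode_base128(payload)
assert decode_base128(encoded) == payload
# Every character is a single UTF-8 byte, so string length equals byte length.
assert len(encoded.encode("utf-8")) == len(encoded)
```

Note that the output uses every code point from 0 to 127, including control characters such as U+0000. These are valid UTF-8, but they may not survive every text-handling layer, so it is worth testing against the actual transport before relying on this scheme.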
