Efficient binary-to-string formatting (like base64, but for UTF8/UTF16)?

Question

I have many bunches of binary data, ranging from 16 to 4096 bytes, which need to be stored to a database and which should be easily comparable as a unit (e.g. two bunches of data match only if the lengths match and all bytes match). Strings are nice for that, but converting binary data blindly to a string is apt to cause problems due to character encoding/reinterpretation issues.

Base64 was a common way to store binary data as strings in an era when 7-bit ASCII was the norm; its 33% space penalty was a little annoying, but not horrible. Unfortunately, if one is using UTF-16, the space penalty is about 167% (8 bytes to store 3), which seems pretty icky.

Is there any common storage method for storing binary data in a valid Unicode string which will allow better efficiency in UTF-16 (and hopefully not be too horrible in UTF-8)? A base-32768 coding would store 240 bits in sixteen characters, which would take 32 bytes of UTF-16 or 48 bytes of UTF-8. By comparison, base64 coding would use 40 characters, which would take 80 bytes of UTF-16 or 40 bytes of UTF-8. An approach which was designed to take the same space in UTF-8 or UTF-16 might store 48 bits in three characters that would take eight bytes in either UTF-8 or UTF-16, thus storing 240 bits in 40 bytes of either UTF-8 or UTF-16.

Is there a standard for anything like that?

Recommended answer

Base32768 does exactly what you wanted. Sorry it took five years to exist.

Usage (this is JavaScript, although porting the base32768 module to another programming language is eminently practical):

var base32768 = require("base32768");

var buf = Buffer.from("d41d8cd98f00b204e9800998ecf842", "hex"); // 15 bytes (new Buffer() is deprecated)

var str = base32768.encode(buf); 
console.log(str); // "迎裶垠⢀䳬Ɇ垙鸂", 8 code points

var buf2 = base32768.decode(str);
console.log(buf.equals(buf2)); // true

Base32768 selects 32,768 characters from the Basic Multilingual Plane. Each character takes 2 bytes when represented as UTF-16 or 3 bytes when represented as UTF-8, giving exactly the efficiency characteristics you describe: 240 bits can be stored in 16 characters i.e. 32 bytes of UTF-16 or 48 bytes of UTF-8. (Except for the occasional padding character, analogous to the = padding seen in Base64.)
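
The arithmetic above can be checked with a few lines of JavaScript. This is only a sketch: `encodedSizes` is a hypothetical helper, not part of the base32768 module, and it ignores padding and assumes every Base32768 character costs 3 bytes in UTF-8 and 2 bytes in UTF-16, as stated above.

```javascript
// Hypothetical helper: byte cost of storing `bits` bits of payload under
// base64 and under a base-32768 scheme. Padding is ignored.
function encodedSizes(bits) {
  const base64Chars = Math.ceil(bits / 6);     // 6 bits per ASCII character
  const base32768Chars = Math.ceil(bits / 15); // 15 bits per BMP character
  return {
    // ASCII: 1 byte per character in UTF-8, 2 bytes in UTF-16.
    base64: { chars: base64Chars, utf8: base64Chars, utf16: base64Chars * 2 },
    // Assumption: 3 bytes per character in UTF-8, 2 bytes in UTF-16.
    base32768: { chars: base32768Chars, utf8: base32768Chars * 3, utf16: base32768Chars * 2 },
  };
}

console.log(encodedSizes(240));
// base64:    40 chars, 40 bytes UTF-8, 80 bytes UTF-16
// base32768: 16 chars, 48 bytes UTF-8, 32 bytes UTF-16

// Sanity check of the per-character cost for one BMP character (U+4E00).
console.log(Buffer.byteLength("\u4e00", "utf8"));    // 3
console.log(Buffer.byteLength("\u4e00", "utf16le")); // 2
```

So for the same 240 bits, Base32768 saves 48 bytes over base64 in UTF-16 and costs 8 bytes more in UTF-8.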

This is done by dicing the input bytes (i.e. 8-bit unsigned numbers) into 15-bit unsigned numbers and assigning each resulting 15-bit number to one of the 32,768 characters.
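
A minimal sketch of that dicing step follows. It is not the module's actual source: `diceInto15BitNumbers` is a hypothetical name, the most-significant-bit-first order and the zero padding of a final partial group are assumptions made for illustration, and the mapping from each 15-bit number to a character is omitted because it depends on the module's own character table.

```javascript
// Hypothetical sketch: split bytes into 15-bit unsigned integers, taking bits
// most significant first. A final partial group is padded with zero bits
// (an assumption; the real module may pad differently).
function diceInto15BitNumbers(bytes) {
  const values = [];
  let acc = 0;     // bit accumulator
  let accBits = 0; // number of valid bits currently held in acc
  for (const byte of bytes) {
    acc = (acc << 8) | byte;
    accBits += 8;
    if (accBits >= 15) {
      accBits -= 15;
      values.push(acc >> accBits); // emit the top 15 bits
      acc &= (1 << accBits) - 1;   // keep the leftover low bits
    }
  }
  let padBits = 0;
  if (accBits > 0) {
    padBits = 15 - accBits;
    values.push(acc << padBits);   // left-align the leftover bits
  }
  return { values, padBits };
}

const diced = diceInto15BitNumbers(Buffer.from("d41d8cd98f00b204e9800998ecf842", "hex"));
console.log(diced.values.length, diced.padBits); // 8 0
```

The 15-byte buffer from the usage example holds 120 bits, which divides evenly into eight 15-bit numbers, matching the 8 code points shown there.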

Note that the characters chosen are also "safe" - no whitespace, control characters, combining diacritics or susceptibility to normalization corruption.
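
One way to test a given string against those "safe" criteria is sketched below. The two function names are hypothetical, and this is not how the module chooses its characters; it only checks a string after the fact, using the standard `String.prototype.normalize` method and Unicode property escapes.

```javascript
// True if the string is unchanged under all four Unicode normalization forms.
function isNormalizationStable(str) {
  return ["NFC", "NFD", "NFKC", "NFKD"].every((form) => str.normalize(form) === str);
}

// True if the string contains no whitespace, control characters (Cc),
// or combining marks (M).
function hasNoUnsafeCharacters(str) {
  return !/[\s\p{Cc}\p{M}]/u.test(str);
}

console.log(isNormalizationStable("\u4e00")); // true: U+4E00 has no decomposition
console.log(isNormalizationStable("\u00e9")); // false: NFD splits it into e + U+0301
console.log(hasNoUnsafeCharacters("a b"));    // false: contains a space
```

A string that fails either check could be silently altered by a database, editor, or transport layer, which would corrupt the decoded bytes.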
