Why the trailing 0x00 byte after a BSON string (not cstring/e_name)?


Problem description


Obviously, for a BSON cstring the trailing byte is what marks the end of the string, so it is written as: (byte*) "\x00". Cstrings are used for regex patterns, regex options, and element names (e_name), which are not long and are not used in iterations, so a length prefix is not necessary. But then comes...

A BSON string is written as: int32 (byte*) "\x00"

with the specification as follows: the int32 is the number of bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.
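To make the two layouts concrete, here is a minimal sketch of both encodings in Python. The function names `encode_bson_string` and `encode_bson_cstring` are made up for this illustration and are not part of any BSON library; the only assumption beyond the text above is that the int32 is stored little-endian, as BSON does for all its integers.

```python
import struct


def encode_bson_string(text: str) -> bytes:
    """Encode the BSON `string` layout: int32 length, UTF-8 bytes, 0x00."""
    payload = text.encode("utf-8")
    # The int32 counts the payload bytes plus the trailing 0x00.
    return struct.pack("<i", len(payload) + 1) + payload + b"\x00"


def encode_bson_cstring(text: str) -> bytes:
    """Encode the BSON `cstring` layout: UTF-8 bytes, 0x00, no length."""
    payload = text.encode("utf-8")
    # Without a length prefix, the first 0x00 is the end of the string,
    # so the payload itself must not contain one.
    if b"\x00" in payload:
        raise ValueError("a cstring cannot contain an embedded 0x00 byte")
    return payload + b"\x00"


print(encode_bson_string("hi").hex(" "))   # 03 00 00 00 68 69 00
print(encode_bson_cstring("hi").hex(" "))  # 68 69 00
```

For the two-byte payload "hi" the length field holds 3, which is exactly the "+ 1" the question is asking about.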

But why the trailing zero byte? If we already have the length of the UTF-8 encoded string, that is sufficient for working with the byte data, and the 0x00 byte just adds an unneeded byte. Am I missing something?

Solution

The reasoning for having both the length prefix and the null terminator is twofold: performance, and compatibility with existing C-style strings.

For performance, MongoDB needs to be able to quickly go to a specific field in a document without iterating through the whole BSON. This is especially important if you're looking for a field that is close to the end of a large (say 16 MB) document. With the length of the string encoded as one of the first pieces of information in a string value, it can just skip that number of bytes and get to the next field. Otherwise, it would need to iterate over the whole string until it finds the end of the string.
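The skipping described above can be sketched as follows. This is a simplified illustration, not MongoDB's implementation: the helpers `string_element`, `make_document`, and `find_string_field` are hypothetical names, and the walker only understands elements of type 0x02 (UTF-8 string).

```python
import struct


def string_element(name: str, value: str) -> bytes:
    """Build one BSON element of type 0x02 (UTF-8 string)."""
    data = value.encode("utf-8")
    return (b"\x02" + name.encode("utf-8") + b"\x00"
            + struct.pack("<i", len(data) + 1) + data + b"\x00")


def make_document(elements: bytes) -> bytes:
    """Wrap elements as: int32 total size, elements, terminating 0x00."""
    return struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"


def find_string_field(doc: bytes, wanted: str):
    """Return the value of the string field `wanted`, or None if absent."""
    target = wanted.encode("utf-8")
    pos = 4  # skip the document's own int32 size
    while doc[pos] != 0x00:
        if doc[pos] != 0x02:
            raise NotImplementedError("this sketch handles only string elements")
        # The element name is a cstring, so its end has to be searched for.
        name_end = doc.index(b"\x00", pos + 1)
        name = doc[pos + 1:name_end]
        (length,) = struct.unpack_from("<i", doc, name_end + 1)
        value_start = name_end + 1 + 4
        if name == target:
            # length includes the trailing 0x00, which is dropped here.
            return doc[value_start:value_start + length - 1].decode("utf-8")
        # Not the field we want: jump over the value without reading it.
        pos = value_start + length
    return None


doc = make_document(
    string_element("big", "x" * 1_000_000) + string_element("last", "found")
)
print(find_string_field(doc, "last"))  # found
```

Reaching "last" costs one int32 read and one addition to get past the million-byte value of "big"; only the short element names are scanned byte by byte.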

For compatibility, MongoDB is written in C++, where C-style strings are null-terminated. It could cut off that null terminator to save one byte, since the length is encoded, but getting that string out of BSON into a format that's usable by C++ would then require tacking the null back on. That would need specialized string-handling routines whose only advantage is saving a single byte.
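The compatibility point can be illustrated from Python with `ctypes`, which exposes the same "read until 0x00" behaviour that C string functions rely on. This is only a sketch of the idea, not code from MongoDB; the trailing `b"next element bytes"` stands in for whatever follows the string in a real document.

```python
import ctypes
import struct

text = "héllo".encode("utf-8")
# One BSON string value followed by unrelated bytes from the "next element".
encoded = struct.pack("<i", len(text) + 1) + text + b"\x00" + b"next element bytes"

buf = ctypes.create_string_buffer(encoded)
payload_addr = ctypes.addressof(buf) + 4  # the payload starts after the int32

# C-style read: no length is passed, reading stops at the first 0x00.
as_c_string = ctypes.string_at(payload_addr)

# Length-based read: use the int32 prefix, minus one for the terminator.
(length,) = struct.unpack_from("<i", encoded, 0)
by_length = ctypes.string_at(payload_addr, length - 1)

print(as_c_string == by_length == text)  # True
```

Because the terminator is already in the buffer, a pointer into the BSON data can be handed to code that expects a C string as-is. Without it, the C-style read would run on into the next element's bytes, so the payload would first have to be copied somewhere and have a 0x00 appended.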

Overall, it was decided that "wasting" a single byte is an acceptable tradeoff.
