寻找关于“组varint编码/解码”的更多细节在Jeff的幻灯片中提出 [英] Looking for more details about "Group varint encoding/decoding" presented in Jeff's slides
问题描述
我注意到,在Jeff的幻灯片构建大规模信息检索系统的挑战,也可以从这里下载: http://research.google.com/people/jeff/WSDM09-keynote.pdf ,提到了一种名为group varint encoding的整数压缩方法。据说比7位每字节整数编码(2X以上)快得多。我对这个很感兴趣,并寻找这个的实现,或任何更多的细节,可以帮助我自己实现这个。
I noticed that in Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can also be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, a method of integers compression called "group varint encoding" was mentioned. It was said much faster than 7 bits per byte integer encoding (2X more). I am very interested in this and looking for an implementation of this, or any more details that could help me implement this by myself.
我不是一个职业和新的这和任何帮助是欢迎的!
I am not a pro and new to this, and any help is welcome!
推荐答案
这是指可变整数编码,其中用于存储的位数序列化时的整数不固定为4字节。 协议缓冲区文档中的varint 有一个很好的说明
That's referring to "variable integer encoding", where the number of bits used to store an integer when serialized is not fixed at 4 bytes. There is a good description of varint in the protocol buffer documentation.
它用于编码 Google协议缓冲,您可以浏览协议缓冲区源代码。
It is used in encoding Google's protocol buffers, and you can browse the protocol buffer source code.
CodedOutputStream
包含确切的编码函数 WriteVarint32FallbackToArrayInline < a>:
The CodedOutputStream
contains the exact encoding function WriteVarint32FallbackToArrayInline:
inline uint8* CodedOutputStream::WriteVarint32FallbackToArrayInline(
uint32 value, uint8* target) {
target[0] = static_cast<uint8>(value | 0x80);
if (value >= (1 << 7)) {
target[1] = static_cast<uint8>((value >> 7) | 0x80);
if (value >= (1 << 14)) {
target[2] = static_cast<uint8>((value >> 14) | 0x80);
if (value >= (1 << 21)) {
target[3] = static_cast<uint8>((value >> 21) | 0x80);
if (value >= (1 << 28)) {
target[4] = static_cast<uint8>(value >> 28);
return target + 5;
} else {
target[3] &= 0x7F;
return target + 4;
}
} else {
target[2] &= 0x7F;
return target + 3;
}
} else {
target[1] &= 0x7F;
return target + 2;
}
} else {
target[0] &= 0x7F;
return target + 1;
}
}
级联 code> s只会在
目标
数组的末尾添加额外的字节,如果 value
那些额外的字节。 0x80
屏蔽正在写入的字节,并且值
向下移动。从我可以告诉, 0x7f
掩码导致它表示最后一个字节的编码。 (当OR'ing 0x80
时,最高位总是 1
,最后一个字节清除最高位因此,当你读取varints,你读取,直到你得到一个字节在最高位为零。
The cascading if
s will only add additional bytes onto the end of the target
array if the magnitude of value
warrants those extra bytes. The 0x80
masks the byte being written, and the value
is shifted down. From what I can tell, the 0x7f
mask causes it to signify the "last byte of encoding". (When OR'ing 0x80
, the highest bit will always be 1
, then the last byte clears the highest bit (by AND'ing 0x7f
). So, when reading varints you read until you get a byte with a zero in the highest bit.
我只是意识到你问关于组VarInt编码具体。对不起,该代码是关于基本的VarInt编码(仍然比7位)。基本的想法看起来是类似的不幸的是,
I just realized you asked about "Group VarInt encoding" specifically. Sorry, that code was about basic VarInt encoding (still faster than 7-bit). The basic idea looks to be similar. Unfortunately, it's not what's being used to store 64bit numbers in protocol buffers. I wouldn't be surprised if that code was open sourced somewhere though.
使用这些想法并不意味着在协议缓冲区中存储64位的数据。 varint
和幻灯片中Group varint的图表,你不应该太难以自己做:)
Using the ideas from varint
and the diagrams of "Group varint" from the slides, it shouldn't be too too hard to cook up your own :)
这是另一个描述 Group VarInt compression ,其中包含解码代码。不幸的是,它们涉及公开可用的实现,但是它们不提供引用。
Here is another page describing Group VarInt compression, which contains decoding code. Unfortunately they allude to publicly available implementations, but they don't provide references.
void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
const byte* limit = compressed + size;
uint32_t current_value = 0;
while (compressed != limit) {
const uint32_t selector = *compressed++;
const uint32_t selector1 = (selector & 3);
current_value += *((uint32_t*)(compressed)) & MASK[selector1];
*uncompressed++ = current_value;
compressed += selector1 + 1;
const uint32_t selector2 = ((selector >> 2) & 3);
current_value += *((uint32_t*)(compressed)) & MASK[selector2];
*uncompressed++ = current_value;
compressed += selector2 + 1;
const uint32_t selector3 = ((selector >> 4) & 3);
current_value += *((uint32_t*)(compressed)) & MASK[selector3];
*uncompressed++ = current_value;
compressed += selector3 + 1;
const uint32_t selector4 = (selector >> 6);
current_value += *((uint32_t*)(compressed)) & MASK[selector4];
*uncompressed++ = current_value;
compressed += selector4 + 1;
}
}
这篇关于寻找关于“组varint编码/解码”的更多细节在Jeff的幻灯片中提出的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!