寻找关于“组varint编码/解码”的更多细节在Jeff的幻灯片中提出 [英] Looking for more details about "Group varint encoding/decoding" presented in Jeff's slides

查看:164
本文介绍了寻找关于“组varint编码/解码”的更多细节在Jeff的幻灯片中提出的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我注意到,在Jeff的幻灯片构建大规模信息检索系统的挑战,也可以从这里下载: http://research.google.com/people/jeff/WSDM09-keynote.pdf ,提到了一种名为group varint encoding的整数压缩方法。据说比7位每字节整数编码(2X以上)快得多。我对这个很感兴趣,并寻找这个的实现,或任何更多的细节,可以帮助我自己实现这个。

I noticed that in Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can also be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, a method of integers compression called "group varint encoding" was mentioned. It was said much faster than 7 bits per byte integer encoding (2X more). I am very interested in this and looking for an implementation of this, or any more details that could help me implement this by myself.

我不是一个职业和新的这和任何帮助是欢迎的!

I am not a pro and new to this, and any help is welcome!

推荐答案

这是指可变整数编码,其中用于存储的位数序列化时的整数不固定为4字节。 协议缓冲区文档中的varint 有一个很好的说明

That's referring to "variable integer encoding", where the number of bits used to store an integer when serialized is not fixed at 4 bytes. There is a good description of varint in the protocol buffer documentation.

它用于编码 Google协议缓冲,您可以浏览协议缓冲区源代码

It is used in encoding Google's protocol buffers, and you can browse the protocol buffer source code.

CodedOutputStream 包含确切的编码函数 WriteVarint32FallbackToArrayInline < a>:

The CodedOutputStream contains the exact encoding function WriteVarint32FallbackToArrayInline:

inline uint8* CodedOutputStream::WriteVarint32FallbackToArrayInline(
    uint32 value, uint8* target) {
  target[0] = static_cast<uint8>(value | 0x80);
  if (value >= (1 << 7)) {
    target[1] = static_cast<uint8>((value >>  7) | 0x80);
    if (value >= (1 << 14)) {
      target[2] = static_cast<uint8>((value >> 14) | 0x80);
      if (value >= (1 << 21)) {
        target[3] = static_cast<uint8>((value >> 21) | 0x80);
        if (value >= (1 << 28)) {
          target[4] = static_cast<uint8>(value >> 28);
          return target + 5;
        } else {
          target[3] &= 0x7F;
          return target + 4;
        }
      } else {
        target[2] &= 0x7F;
        return target + 3;
      }
    } else {
      target[1] &= 0x7F;
      return target + 2;
    }
  } else {
    target[0] &= 0x7F;
    return target + 1;
  }
}

级联 code> s只会在目标数组的末尾添加额外的字节,如果 value 那些额外的字节。 0x80 屏蔽正在写入的字节,并且向下移动。从我可以告诉, 0x7f 掩码导致它表示最后一个字节的编码。 (当OR'ing 0x80 时,最高位总是 1 ,最后一个字节清除最高位因此,当你读取varints,你读取,直到你得到一个字节在最高位为零。

The cascading ifs will only add additional bytes onto the end of the target array if the magnitude of value warrants those extra bytes. The 0x80 masks the byte being written, and the value is shifted down. From what I can tell, the 0x7f mask causes it to signify the "last byte of encoding". (When OR'ing 0x80, the highest bit will always be 1, then the last byte clears the highest bit (by AND'ing 0x7f). So, when reading varints you read until you get a byte with a zero in the highest bit.

我只是意识到你问关于组VarInt编码具体。对不起,该代码是关于基本的VarInt编码(仍然比7位)。基本的想法看起来是类似的不幸的是,

I just realized you asked about "Group VarInt encoding" specifically. Sorry, that code was about basic VarInt encoding (still faster than 7-bit). The basic idea looks to be similar. Unfortunately, it's not what's being used to store 64bit numbers in protocol buffers. I wouldn't be surprised if that code was open sourced somewhere though.

使用这些想法并不意味着在协议缓冲区中存储64位的数据。 varint 和幻灯片中Group varint的图表,你不应该太难以自己做:)

Using the ideas from varint and the diagrams of "Group varint" from the slides, it shouldn't be too too hard to cook up your own :)

这是另一个描述 Group VarInt compression ,其中包含解码代码。不幸的是,它们涉及公开可用的实现,但是它们不提供引用。

Here is another page describing Group VarInt compression, which contains decoding code. Unfortunately they allude to publicly available implementations, but they don't provide references.

void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
  const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
  const byte* limit = compressed + size;
  uint32_t current_value = 0;
  while (compressed != limit) {
    const uint32_t selector = *compressed++;
    const uint32_t selector1 = (selector & 3);
    current_value += *((uint32_t*)(compressed)) & MASK[selector1];
    *uncompressed++ = current_value;
    compressed += selector1 + 1;
    const uint32_t selector2 = ((selector >> 2) & 3);
    current_value += *((uint32_t*)(compressed)) & MASK[selector2];
    *uncompressed++ = current_value;
    compressed += selector2 + 1;
    const uint32_t selector3 = ((selector >> 4) & 3);
    current_value += *((uint32_t*)(compressed)) & MASK[selector3];
    *uncompressed++ = current_value;
    compressed += selector3 + 1;
    const uint32_t selector4 = (selector >> 6);
    current_value += *((uint32_t*)(compressed)) & MASK[selector4];
    *uncompressed++ = current_value;
    compressed += selector4 + 1;
  }
}

这篇关于寻找关于“组varint编码/解码”的更多细节在Jeff的幻灯片中提出的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆