Which is the most efficient way to extract an arbitrary range of bits from a contiguous sequence of words?


Question

Suppose we have a std::vector, or any other sequence container (sometimes it will be a deque), which stores uint64_t elements.

Now, let's see this vector as a sequence of size() * 64 contiguous bits. I need to find the word formed by the bits in a given [begin, end) range, given that end - begin <= 64 so it fits in a word.
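
To pin down the intended result, here is a minimal bit-by-bit sketch of the task. It is not from the original post: the name `get_reference` is hypothetical, and it assumes the same bit order as the code in this thread, where bit i of the sequence is bit (i % 64) of word (i / 64), counted from the least significant bit. It is far too slow for production use, but it may serve as a specification to check faster versions against.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not part of the original post).
// Builds the result one bit at a time; assumes end - begin <= 64.
inline std::uint64_t get_reference(std::vector<std::uint64_t> const& words,
                                   std::size_t begin, std::size_t end)
{
    std::uint64_t result = 0;
    for (std::size_t i = begin; i < end; ++i) {   // no iterations if begin >= end
        assert(i / 64 < words.size());
        std::uint64_t bit = (words[i / 64] >> (i % 64)) & 1u;
        result |= bit << (i - begin);             // i - begin is at most 63
    }
    return result;
}
```

With this convention, the range [60, 68) takes the top four bits of word 0 as the low four bits of the result and the bottom four bits of word 1 as the next four.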

The solution I have right now finds the two words whose parts will form the result, then masks them separately and combines them. Since I need this to be as efficient as possible, I've tried to write everything without any if branch, to avoid branch mispredictions. For example, the code follows the same path whether the entire range fits into one word or spans two words. To do this I needed to write the shiftl and shiftr functions, which do nothing but shift a word by the specified amount, like the >> and << operators, but which gracefully handle the case when n is 64 or greater, which would otherwise be undefined behavior.

Another point is that the get() function, as coded now, also works for empty ranges in the mathematical sense, i.e. not only if begin == end, but also if begin > end, which is required by the main algorithm that calls this function. Again, I've tried to do this without simply branching and returning zero in that case.

However, looking at the assembly code as well, all this seems far too complex for such a seemingly simple task. This code runs in a performance-critical algorithm, which is running a bit too slowly. valgrind told us this function is called 230 million times and accounts for 40% of the total execution time, so I really need to make it faster.

So can you help me find a simpler and/or more efficient way to accomplish this task? I don't care too much about portability. Solutions using x86 SIMD intrinsics (SSE3/4/AVX etc.) or compiler builtins are fine, as long as I can compile them with both g++ and clang.

My current code is included below:

#include <cassert>
#include <cstddef>
#include <cstdint>

using word_type = uint64_t;
const size_t W = 64;

// Shift right, but without being undefined behaviour if n >= 64
word_type shiftr(word_type val, size_t n)
{
    uint64_t good = n < W;

    return good * (val >> (n * good));
}

// Shift left, but without being undefined behaviour if n >= 64
word_type shiftl(word_type val, size_t n)
{
    uint64_t good = n < W;

    return good * (val << (n * good));
}

// Mask the word preserving only the lower n bits.
word_type lowbits(word_type val, size_t n)
{
    word_type mask = shiftr(word_type(-1), W - n);

    return val & mask;
}

// Struct for return values of locate()
struct range_location_t {
    size_t lindex; // The word where is located the 'begin' position
    size_t hindex; // The word where is located the 'end' position
    size_t lbegin; // The position of 'begin' into its word
    size_t llen;   // The length of the lower part of the word
    size_t hlen;   // The length of the higher part of the word
};

// Locate the one or two words that will make up the result
range_location_t locate(size_t begin, size_t end)
{
    size_t lindex = begin / W;
    size_t hindex = end / W;
    size_t lbegin = begin % W;
    size_t hend   = end % W;

    size_t len = (end - begin) * size_t(begin <= end);
    size_t hlen = hend * (hindex > lindex);
    size_t llen = len - hlen;

    return { lindex, hindex, lbegin, llen, hlen };
}

// Main function.
template<typename Container>
word_type get(Container const&container, size_t begin, size_t end)
{
    assert(begin < container.size() * W);
    assert(end <= container.size() * W);

    range_location_t loc = locate(begin, end);

    word_type low = lowbits(container[loc.lindex] >> loc.lbegin, loc.llen);

    word_type high = shiftl(lowbits(container[loc.hindex], loc.hlen), loc.llen);

    return high | low;
}

Thank you very much.

Recommended answer

As announced in the chat, I am adding a refined answer. It contains three parts, each followed by a description of that part.

The 1st part, get.h, is my solution, but generalized and with one correction.

The 2nd part, got.h, is the original algorithm as posted in the question, generalized as well to work with any STL container of any unsigned type.

The 3rd part, main.cpp, contains unit tests which verify the correctness and measure performance.

#include <cstddef>

using std::size_t;

template < typename C >
typename C::value_type get ( C const &container, size_t lo, size_t hi )
{

   typedef typename C::value_type item; // a container entry
   static unsigned const bits = (unsigned)sizeof(item)*8u; // bits in an item
   static size_t const mask = ~(size_t)0u/bits*bits; // huge multiple of bits

   // everything above has been computed at compile time. Now do some work:

   size_t lo_adr = (lo       ) / bits; // the index in the container of ...
   size_t hi_adr = (hi-(hi>0)) / bits; // ... the lower or higher item needed

   // we read container[hi_adr] first and possibly delete the highest bits:

   unsigned hi_shift = (unsigned)(mask-hi)%bits;
   item hi_val = container[hi_adr] << hi_shift >> hi_shift;

   // if all bits are in the same item, we delete the lower bits and are done:

   unsigned lo_shift = (unsigned)lo%bits;
   if ( hi_adr <= lo_adr ) return (hi_val>>lo_shift) * (lo<hi);

   // else we have to read the lower item as well, and combine both.
   // Given hi-lo <= bits, lo_shift is nonzero here, so bits-lo_shift < bits.

   return ( (hi_val << (bits-lo_shift)) | (container[lo_adr] >> lo_shift) );

}

The 1st part, get.h above, is my original solution, but generalized to work with any STL containers of unsigned integer types. Thus you can use and test it for 32-bit integers or 128-bit integers as well. I still use unsigned for very small numbers, but you may as well replace them by size_t. The algorithm is nearly unchanged, with a small correction - if lo was the total number of bits in the container, my previous get() would access an item just above the container size. This is fixed now.
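
The least obvious step in get.h is the computation of hi_shift. The following sketch isolates it for 64-bit items; the names `kBits`, `kMask`, and `hi_shift_for` are hypothetical and only mirror `bits`, `mask`, and the `hi_shift` expression above. Because the mask is a multiple of the item width, (mask - hi) % bits equals the number of unused high bits in the item that holds bit hi-1, and it is 0 when hi falls exactly on an item boundary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; they mirror `bits` and `mask` in get.h
// for 64-bit items.
constexpr std::size_t kBits = 64;
constexpr std::size_t kMask = ~std::size_t(0) / kBits * kBits; // a huge multiple of kBits

// Number of high bits to discard from the item that holds bit hi-1.
constexpr unsigned hi_shift_for(std::size_t hi)
{
    return static_cast<unsigned>((kMask - hi) % kBits);
}

// hi on an item boundary: nothing to discard.
static_assert(hi_shift_for(0) == 0, "hi == 0 keeps the whole item");
static_assert(hi_shift_for(64) == 0, "hi == 64 keeps the whole item");
// hi == 70: only the low 6 bits of item 1 are wanted, so 58 are discarded.
static_assert(hi_shift_for(70) == 58, "hi == 70 discards 58 high bits");
```

Shifting left and then right by this amount clears exactly those high bits, and since the amount never reaches 64, neither shift is undefined.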

#include <cstddef>

using std::size_t;

// Shift right, but without being undefined behaviour if n >= the bit width of val_type
template < typename val_type >
val_type shiftr(val_type val, size_t n)
{
   val_type good = n < sizeof(val_type)*8;
   return good * (val >> (n * good));
}

// Shift left, but without being undefined behaviour if n >= the bit width of val_type
template < typename val_type >
val_type shiftl(val_type val, size_t n)
{
   val_type good = n < sizeof(val_type)*8;
   return good * (val << (n * good));
}

// Mask the word preserving only the lower n bits.
template < typename val_type >
val_type lowbits(val_type val, size_t n)
{
    val_type mask = shiftr<val_type>((val_type)(-1), sizeof(val_type)*8 - n);
    return val & mask;
}

// Struct for return values of locate()
struct range_location_t {
   size_t lindex; // The word where is located the 'begin' position
   size_t hindex; // The word where is located the 'end' position
   size_t lbegin; // The position of 'begin' into its word
   size_t llen;   // The length of the lower part of the word
   size_t hlen;   // The length of the higher part of the word
};

// Locate the one or two words that will make up the result
template < typename val_type >
range_location_t locate(size_t begin, size_t end)
{
   size_t lindex = begin / (sizeof(val_type)*8);
   size_t hindex = end / (sizeof(val_type)*8);
   size_t lbegin = begin % (sizeof(val_type)*8);
   size_t hend   = end % (sizeof(val_type)*8);

   size_t len = (end - begin) * size_t(begin <= end);
   size_t hlen = hend * (hindex > lindex);
   size_t llen = len - hlen;

   range_location_t l = { lindex, hindex, lbegin, llen, hlen };
   return l;
}

// Main function.
template < typename C >
typename C::value_type got ( C const&container, size_t begin, size_t end )
{
   typedef typename C::value_type val_type;
   range_location_t loc = locate<val_type>(begin, end);
   val_type low = lowbits<val_type>(container[loc.lindex] >> loc.lbegin, loc.llen);
   val_type high = shiftl<val_type>(lowbits<val_type>(container[loc.hindex], loc.hlen), loc.llen);
   return high | low;
}

This 2nd part, got.h above, is the original algorithm in the question, generalized as well to accept any STL containers of any unsigned integer types. Like get.h, this version does not use any definitions except the single template parameter that defines the container type, thus it can easily be tested for other item sizes or container types.

#include <vector>
#include <cstddef>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "get.h"
#include "got.h"

template < typename Container > class Test {

   typedef typename Container::value_type val_type;
   typedef val_type (*fun_type) ( Container const &, size_t, size_t );
   typedef void (Test::*fun_test) ( unsigned, unsigned );
   static unsigned const total_bits = 256; // number of bits in the container
   static unsigned const entry_bits = (unsigned)sizeof(val_type)*8u;

   Container _container;
   fun_type _function;
   bool _failed;

   void get_value ( unsigned lo, unsigned hi ) {
      _function(_container,lo,hi); // we call this several times ...
      _function(_container,lo,hi); // ... because we measure ...
      _function(_container,lo,hi); // ... the performance ...
      _function(_container,lo,hi); // ... of _function, ....
      _function(_container,lo,hi); // ... not the performance ...
      _function(_container,lo,hi); // ... of get_value and ...
      _function(_container,lo,hi); // ... of the loop that ...
      _function(_container,lo,hi); // ... calls get_value.
   }

   void verify ( unsigned lo, unsigned hi ) {
      val_type value = _function(_container,lo,hi);
      if ( lo < hi ) {
         for ( unsigned i=lo; i<hi; i++ ) {
            val_type val = _container[i/entry_bits] >> i%entry_bits & 1u;
            if ( val != (value&1u) ) {
               printf("lo=%u hi=%u [%u] isn't %u\n",lo,hi,i,(unsigned)val);
               _failed = true;
            }
            value >>= 1u;
         }
      }
      if ( value ) {
         printf("lo=%u hi=%u value contains high bits set to 1\n",lo,hi);
         _failed = true;
      }
   }

   void run ( fun_test fun ) {
      for ( unsigned lo=0; lo<total_bits; lo++ ) {
         unsigned h0 = 0;
         if ( lo > entry_bits ) h0 = lo - (entry_bits+1);
         unsigned h1 = lo+entry_bits;
         if ( h1 > total_bits ) h1 = total_bits;
         for ( unsigned hi=h0; hi<=h1; hi++ ) {
            (this->*fun)(lo,hi);
         }
      }
   }

   static uint64_t time_used ( ) {
      struct rusage ru;
      getrusage(RUSAGE_THREAD,&ru);
      struct timeval t = ru.ru_utime;
      return (uint64_t) t.tv_sec*1000 + t.tv_usec/1000;
   }

public:

   Test ( fun_type function ): _function(function), _failed() {
      val_type entry;
      unsigned index = 0; // position in the whole bit array
      unsigned value = 0; // last value assigned to a bit
      static char const entropy[] = "The quick brown Fox jumps over the lazy Dog";
      do {
         if ( ! (index%entry_bits) ) entry = 0;
         entry <<= 1;
         entry |= value ^= 1u & entropy[index/7%sizeof(entropy)] >> index%7;
         ++index;
         if ( ! (index%entry_bits) ) _container.push_back(entry);
      } while ( index < total_bits );
   }

   bool correctness() {
      _failed = false;
      run(&Test::verify);
      return !_failed;
   }

   void performance() {
      uint64_t t1 = time_used();
      for ( unsigned i=0; i<999; i++ ) run(&Test::get_value);
      uint64_t t2 = time_used();
      printf("used %u ms\n",(unsigned)(t2-t1));
   }

   void operator() ( char const * name ) {
      printf("testing %s\n",name);
      correctness();
      performance();
   }

};

int main()
{
   typedef typename std::vector<uint64_t> Container;
   Test<Container> test(get<Container>); test("get");
   Test<Container> tost(got<Container>); tost("got");
}

The 3rd part, main.cpp above, contains a class of unit tests and applies them to get.h and got.h, that is, to my solution and to the original code of the question, slightly modified. The unit tests verify correctness and measure speed. They verify correctness by creating a container of 256 bits, filling it with some data, reading all possible bit sections of up to as many bits as fit into a container entry, plus many pathological cases, and checking each result. They measure speed by reading the same sections again many times and reporting the thread's time spent in user space.
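
The correctness idea described above can also be sketched in a much smaller form. This is only an illustration, not the test class from main.cpp: the names `Words`, `Extractor`, `slow_get`, and `count_mismatches` are hypothetical, it is fixed to std::vector<uint64_t>, and it covers only ranges with lo <= hi, not the pathological hi < lo cases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint64_t> Words;
typedef std::uint64_t (*Extractor)(Words const&, std::size_t, std::size_t);

// Bit-by-bit reference: bit i of the sequence is bit (i % 64) of word (i / 64).
inline std::uint64_t slow_get(Words const& w, std::size_t lo, std::size_t hi)
{
    std::uint64_t result = 0;
    for (std::size_t i = lo; i < hi; ++i)
        result |= ((w[i / 64] >> (i % 64)) & 1u) << (i - lo);
    return result;
}

// Counts the ranges [lo, hi) with hi - lo <= 64 on which f disagrees with slow_get.
inline std::size_t count_mismatches(Extractor f, Words const& w)
{
    std::size_t const total = w.size() * 64;
    std::size_t bad = 0;
    for (std::size_t lo = 0; lo < total; ++lo)
        for (std::size_t hi = lo; hi <= lo + 64 && hi <= total; ++hi)
            if (f(w, lo, hi) != slow_get(w, lo, hi))
                ++bad;
    return bad;
}
```

Passing an instantiation such as get<Words> or got<Words> as the extractor should report zero mismatches if it agrees with the reference on every tested range.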
