Choosing between set<int> vs. vector<bool> vs. vector<boolean_t> to use as a bitmap (bitset / bit array)


Problem description

Given a range of indexes (identifiers), where I want to map each index to a boolean value, that is:

// interface pseudocode
interface bitmap {
  bool identifier_is_set(unsigned int id_idx) const;
  void set_identifier(unsigned int id_idx, bool val); // mutates state, so not const
};

so that I can set each ID (index) and query whether it is set, what would you prefer to use to implement this?

I think this is called a bit array or bitmap or bitset, correct me if I'm wrong.

Assume that the maximum identifier is predetermined and not greater than 1e6 (1m), possibly much smaller (10k - 100k). (Which means the size used by sizeof(int)*maximum_id_idx easily fits into memory.)

Possible solutions I see so far:

  • std::set<size_t> - Add or erase the identifier to this set as necessary. This would allow for arbitrarily large identifiers as long as we have a sparse bitmap.
  • std::vector<bool> - Sized to the appropriate maximum value, storing true or false for each id_idx.
  • std::vector<char> - Same thing, but not suffering from weird std::vector<bool> problems. Uses less memory than vector<int>.
  • std::vector<int> - Using an int as the boolean flag to have a container using the natural word size of the machine. (No clue if that could make a difference.)

Please answer which container type you would prefer and why, given the maximum id restriction cited above and especially considering performance aspects of querying the bitmap (inserting performance does not matter).

Note: The interface usage of vector vs. set does not matter, as it will be hidden behind its wrapping class anyway.
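To illustrate the note above, here is a minimal sketch of such a wrapping class, assuming the std::vector<char> option. The class name ByteBitmap, the constructor argument, and the choice to ignore out-of-range identifiers are all assumptions made for illustration and are not part of the question.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper: one byte per identifier, backed by std::vector<char>.
class ByteBitmap {
public:
    // max_id is the largest identifier that will ever be used (assumed known up front).
    explicit ByteBitmap(std::size_t max_id) : flags_(max_id + 1, 0) {}

    // Out-of-range identifiers are reported as "not set" (an assumed policy).
    bool identifier_is_set(unsigned int id_idx) const {
        return id_idx < flags_.size() && flags_[id_idx] != 0;
    }

    // Out-of-range identifiers are silently ignored (an assumed policy).
    void set_identifier(unsigned int id_idx, bool val) {
        if (id_idx < flags_.size()) flags_[id_idx] = val ? 1 : 0;
    }

private:
    std::vector<char> flags_;  // could be swapped for std::vector<bool> to pack the flags
};
```

Because callers only see identifier_is_set and set_identifier, the backing container can be exchanged later without touching client code, which makes it practical to measure the alternatives instead of guessing.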

EDIT: To add to the discussion about std::bitset: std::bitset incorporates the whole array into the object itself, so sizeof(std::bitset<1000000>) is approximately 1/8 megabyte. That makes for a huge single object, and for something you may no longer be able to put on the stack (which may or may not be relevant).
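The size concern can be checked at compile time, and one possible way around the stack limit is to allocate the bitset on the heap. The following is only a sketch: the constant kMaxIds and the helper make_heap_bitset are hypothetical names, and std::make_unique assumes C++14 or later.

```cpp
#include <bitset>
#include <cstddef>
#include <memory>

constexpr std::size_t kMaxIds = 1000000;  // assumed upper bound taken from the question

// The bits live inside the object, so its size grows with the template argument.
static_assert(sizeof(std::bitset<kMaxIds>) >= kMaxIds / 8,
              "bitset stores at least one bit per identifier inline");

// Hypothetical helper: allocating on the heap keeps the large object off the stack.
std::unique_ptr<std::bitset<kMaxIds>> make_heap_bitset() {
    return std::make_unique<std::bitset<kMaxIds>>();
}
```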

Solution

Without knowing the platform you are running this code on and your access patterns, it's hard to say whether vector<bool> will be faster than vector<char> (or vector<int>) or even set<int> or unordered_set<int>.

For example, if you have an extremely sparse array, a linear search of a vector<int> that contains just the indices that are set might be the best answer. (See Mike Abrash's article on optimizing Pixomatic for x86.)
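As a rough sketch of the extremely sparse case, the container holds only the identifiers that are set and a query scans it front to back. The function name is_set_linear is made up for this example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: set_ids holds only the identifiers that are currently set.
bool is_set_linear(const std::vector<int>& set_ids, int id) {
    // A plain linear scan; with only a handful of entries the whole vector
    // sits in one or two cache lines.
    return std::find(set_ids.begin(), set_ids.end(), id) != set_ids.end();
}
```

Whether this beats a bit vector depends on how many identifiers are set at once, so it is worth measuring with realistic data.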

On the other hand, you might have a somewhat sparse array. By somewhat sparse, I mean that the set elements take up much more memory than fits in the L1 or L2 cache. In that case, more low-level details start to come into play, as well as your actual access patterns.

For example, on some platforms, variable bit shifting is incredibly expensive. So, if you are querying a random set of identifiers, the more frequently you do this, the more a vector<char> or vector<int> becomes a better idea than bitset<...> or vector<bool>. (The latter two use bit shifts to look up bits.) On the other hand, if you are iterating through the sparse bit vector in order and just want the bits that are set, you can optimize that iteration to get rid of the overhead of variable shifts.
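The difference between the two access patterns can be sketched with a hand-rolled packed bitmap. This is an illustration only, assuming 64-bit words stored in a std::vector<std::uint64_t>; the function names are hypothetical, and real std::bitset or std::vector<bool> implementations may differ in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Random lookup in a packed bitmap: the shift amount is only known at run time.
bool test_bit(const std::vector<std::uint64_t>& words, std::size_t id) {
    return (words[id / 64] >> (id % 64)) & 1u;
}

// In-order scan: the mask advances by a constant shift of 1,
// and words with no bits set are skipped entirely.
std::vector<std::size_t> collect_set_bits(const std::vector<std::uint64_t>& words) {
    std::vector<std::size_t> out;
    for (std::size_t w = 0; w < words.size(); ++w) {
        if (words[w] == 0) continue;
        std::uint64_t mask = 1;
        for (std::size_t b = 0; b < 64; ++b, mask <<= 1) {
            if (words[w] & mask) out.push_back(w * 64 + b);
        }
    }
    return out;
}
```

With a byte-per-flag layout such as vector<char>, the random lookup reduces to a plain indexed load with no shift at all, at the cost of eight times the memory.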

At this point, you might also want to know how your sparse identifiers are actually distributed. If they are clumped, you need to know the tradeoff between the optimal memory read size and reading a char at a time. That will dictate whether hitting the cache more often will offset reading in non-native sized data.

If the identifiers are scattered, you may get a significant win by using a hash set (unordered_set<int>) instead of a bit vector. That depends on the load, however.
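For comparison, a hash-set-backed version of the same interface might look like the sketch below. The class name HashBitmap is hypothetical, and unsigned int is used as the element type to match the interface in the question rather than the int mentioned above.

```cpp
#include <unordered_set>

// Hypothetical wrapper: only identifiers that are set take up memory.
class HashBitmap {
public:
    bool identifier_is_set(unsigned int id_idx) const {
        return ids_.count(id_idx) != 0;
    }

    void set_identifier(unsigned int id_idx, bool val) {
        if (val) ids_.insert(id_idx);  // inserting an existing id is a no-op
        else     ids_.erase(id_idx);   // erasing a missing id is a no-op
    }

private:
    std::unordered_set<unsigned int> ids_;
};
```

Each stored identifier costs considerably more than one bit here, so this only pays off while the set stays small relative to the identifier range.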
