How to store binary data when you only care about speed?
Problem description
I have N points in D dimensions, where, say, N is 1 million and D is 100. All my points have binary coordinates, i.e. they lie in {0, 1}^D, and I am only interested in speed.
Currently my implementation uses std::vector<int>. I am wondering whether I could get faster execution by changing my data structure. I am only doing insertions and searches (I don't change the bits).
All the related questions I found mention std::vector<char>, std::vector<bool> and std::bitset, but they all focus on the space savings one gets from these structures.
What's the appropriate data structure, when speed is of main concern, for binary data in C++?
I intend to populate my data structure with the binary data and then do a lot of contiguous searches (by that I mean I don't really care about the i-th coordinate of a point on its own; if I access a point, I will access all of its coordinates one after another). I will compute the Hamming distances between the points.
Locality of reference will likely be the driving force. So it's fairly obvious that you should represent the D coordinates of a single point as a contiguous bit vector. std::bitset<D> would be a logical choice.
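As a minimal sketch of that representation (not the answerer's code), the snippet below packs a point from the question's 0/1 `std::vector<int>` layout into a `std::bitset<D>` and computes a Hamming distance as XOR followed by a bit count. The names `Point`, `pack`, and `hamming` are made up for illustration, and `D = 100` is taken from the question.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;  // dimensionality, as stated in the question
using Point = std::bitset<D>;   // one point = D bits stored contiguously

// Pack a point given as 0/1 ints (the question's current layout) into bits.
// Coordinates beyond D are ignored; missing coordinates stay 0.
Point pack(const std::vector<int>& coords) {
    Point p;
    for (std::size_t i = 0; i < D && i < coords.size(); ++i) {
        p[i] = (coords[i] != 0);
    }
    return p;
}

// Hamming distance: XOR marks the coordinates that differ, count() tallies them.
std::size_t hamming(const Point& a, const Point& b) {
    return (a ^ b).count();
}
```

With this layout, comparing two points touches a couple of machine words instead of 100 separate `int` values, which is where the speed-up would be expected to come from.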
However, the next important thing to realize is that locality benefits are easy to get for working sets of up to about 4 KB. This means that you should not pick a single point and compare it against all other N-1 points. Instead, group the points into sets of about 4 KB each, and compare those groups against each other. Both approaches are O(N*N), but the second will be much faster.
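The grouping idea can be sketched as a blocked (tiled) double loop. This is one possible reading of the advice, under several assumptions: the block size is derived from `sizeof(std::bitset<100>)`, which is typically 16 bytes on common 64-bit standard libraries (so about 256 points per 4 KB), and the function simply sums all pairwise distances as a stand-in for whatever is actually done with each distance. `sum_pairwise_blocked` and `kBlock` are hypothetical names.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;

// Points per block: as many as fit in roughly 4 KB (a tuning value, assumed here).
constexpr std::size_t kBlock = 4096 / sizeof(Point);
static_assert(kBlock > 0, "a single point must fit in one block");

// Sum of Hamming distances over all unordered pairs, visited block by block
// so that the two blocks being compared stay hot in cache.
unsigned long long sum_pairwise_blocked(const std::vector<Point>& pts) {
    unsigned long long total = 0;
    const std::size_t n = pts.size();
    for (std::size_t ib = 0; ib < n; ib += kBlock) {
        const std::size_t iend = std::min(ib + kBlock, n);
        for (std::size_t jb = ib; jb < n; jb += kBlock) {
            const std::size_t jend = std::min(jb + kBlock, n);
            for (std::size_t i = ib; i < iend; ++i) {
                // Inside the diagonal block, start at i + 1 to avoid
                // counting a pair twice or pairing a point with itself.
                const std::size_t jstart = (jb == ib) ? i + 1 : jb;
                for (std::size_t j = jstart; j < jend; ++j) {
                    total += (pts[i] ^ pts[j]).count();
                }
            }
        }
    }
    return total;
}
```

The work is still N*(N-1)/2 distance computations. Only the visiting order changes, so any gain would come from cache behaviour and should be measured on the target machine.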
You may be able to beat O(N*N) by using the triangle inequality, Hamming(a,b) + Hamming(b,c) >= Hamming(a,c). I'm not sure exactly how, though. It probably depends on the output you want: the naive output is an N*N set of distances, and producing that is unavoidably O(N*N).
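One way the triangle inequality could help, offered as a sketch and not as the answerer's method: if the goal were only to find pairs within some radius (an assumption; the question does not say so), a pivot point p gives the lower bound |Hamming(a,p) - Hamming(b,p)| <= Hamming(a,b). Sorting the points by their distance to the pivot then lets each point be compared only against neighbours whose pivot distance differs by at most the radius. `count_close_pairs`, `pivot`, and `radius` are hypothetical names, and the generic lambda needs C++14.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;

// Count unordered pairs with Hamming distance <= radius, skipping pairs
// that the triangle inequality rules out via their distance to a pivot.
std::size_t count_close_pairs(const std::vector<Point>& pts,
                              const Point& pivot, std::size_t radius) {
    // Pair each point with its distance to the pivot, then sort by it.
    std::vector<std::pair<std::size_t, Point>> keyed;
    keyed.reserve(pts.size());
    for (const Point& p : pts) {
        keyed.emplace_back((p ^ pivot).count(), p);
    }
    std::sort(keyed.begin(), keyed.end(),
              [](const auto& x, const auto& y) { return x.first < y.first; });

    std::size_t close = 0;
    for (std::size_t i = 0; i < keyed.size(); ++i) {
        // Once the pivot distances differ by more than radius, the true
        // distance must exceed radius too, so the inner scan can stop.
        for (std::size_t j = i + 1;
             j < keyed.size() && keyed[j].first - keyed[i].first <= radius;
             ++j) {
            if ((keyed[i].second ^ keyed[j].second).count() <= radius) {
                ++close;
            }
        }
    }
    return close;
}
```

How much this prunes depends on the data and the radius. For random bits, most points sit at a similar distance from any pivot, so the window can stay wide and the cost remains close to quadratic.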