How to store binary data when you only care about speed?


Question

I have N points in D dimensions, where let's say N is 1 million and D is 100. All my points have binary coordinates, i.e. {0, 1}^D, and I am only interested in speed.

Currently my implementation uses std::vector<int>. I am wondering if I could benefit in terms of faster execution by changing my data structure. I am only doing insertions and searches (I don't change the bits).

All related questions I found mention std::vector<char>, std::vector<bool> and std::bitset, but all mention the space benefits one should get by using such structures.

What's the appropriate data structure, when speed is of main concern, for binary data in C++?


I intend to populate my data structure with the binary data and then do a lot of contiguous searches (I mean that I don't really care about the i-th coordinate of a point; if I am accessing a point, I will access all of its coordinates contiguously). I will compute the Hamming distance between pairs of points.

Answer

Locality of reference will likely be the driving force. So it's fairly obvious that you represent the D coordinates of a single point as a contiguous bitvector. std::bitset<D> would be a logical choice.
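As a minimal sketch of that representation, assuming D is known at compile time (which std::bitset requires): the names `Point`, `hamming` and `make_point` are illustrative and not from the question. XOR marks the coordinates where two points differ, and `count()` tallies them; on many compilers this typically maps to hardware popcount instructions, though that is not guaranteed.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;   // dimension, fixed at compile time
using Point = std::bitset<D>;    // one point = D bits stored contiguously

// Hamming distance: number of coordinates in which a and b differ.
inline std::size_t hamming(const Point& a, const Point& b) {
    return (a ^ b).count();      // XOR sets differing bits, count() tallies them
}

// Hypothetical helper: build a point from an array of 0/1 coordinates,
// e.g. when migrating from the existing std::vector<int> representation.
inline Point make_point(const std::vector<int>& coords) {
    Point p;
    for (std::size_t i = 0; i < D && i < coords.size(); ++i)
        p[i] = (coords[i] != 0);
    return p;
}
```

Storing all points in a single `std::vector<Point>` keeps them contiguous in memory as well, which is what the locality argument relies on.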

However, the next important thing to realize is that you see locality benefits easily up to 4KB. This means that you should not pick a single point and compare it against all other N-1 points. Instead, group points in sets of 4KB each, and compare those groups. Both ways are O(N*N), but the second will be much faster.
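The grouping idea could be sketched roughly as below. This is an illustration under stated assumptions, not a tuned implementation: `all_pairs_blocked`, `kBlock` and the `visit` callback are hypothetical names, and the number of points per 4 KB block depends on `sizeof(std::bitset<100>)`, which is implementation-defined (commonly 16 bytes on 64-bit platforms, giving 256 points per block).

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;

// Number of points that fit in roughly 4 KB (the block size suggested above).
constexpr std::size_t kBlock = 4096 / sizeof(Point);
static_assert(kBlock > 0, "a single point must fit in one block");

// Visits every unordered pair (i, j) with i < j, one pair of blocks at a
// time, and calls visit(i, j, distance) for each.
template <class Visit>
void all_pairs_blocked(const std::vector<Point>& pts, Visit visit) {
    const std::size_t n = pts.size();
    for (std::size_t bi = 0; bi < n; bi += kBlock) {
        const std::size_t bi_end = std::min(bi + kBlock, n);
        for (std::size_t bj = bi; bj < n; bj += kBlock) {
            const std::size_t bj_end = std::min(bj + kBlock, n);
            // Both blocks are small enough to stay cached while compared.
            for (std::size_t i = bi; i < bi_end; ++i) {
                // In the diagonal block, start past i to avoid duplicates.
                const std::size_t j0 = (bj == bi) ? i + 1 : bj;
                for (std::size_t j = j0; j < bj_end; ++j)
                    visit(i, j, (pts[i] ^ pts[j]).count());
            }
        }
    }
}
```

The total work is still O(N*N); the only change is the order in which pairs are visited, so that each block of points is reused many times before it leaves the cache. Whether 4 KB is the best block size on a given machine is something to measure.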

You may be able to beat O(N*N) by using the triangle inequality: Hamming(a,b) + Hamming(b,c) >= Hamming(a,c). I'm just wondering how. It probably depends on how you want your output. The naive output would be an N*N set of distances, and that's unavoidably O(N*N).
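One possible way to exploit the triangle inequality, offered only as a sketch and not as the answerer's method: assume the desired output is the set of pairs within some radius r rather than the full distance matrix. Pick a pivot p, compute each point's distance to p, and note that Hamming(x,y) >= |Hamming(x,p) - Hamming(y,p)|. Sorting by distance to the pivot then lets the inner loop stop early. The function name `close_pairs`, the radius parameter `r` and the choice of `pts[0]` as pivot are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;
using IndexPair = std::pair<std::size_t, std::size_t>;

// Returns all index pairs (i, j), i < j, whose Hamming distance is <= r.
std::vector<IndexPair> close_pairs(const std::vector<Point>& pts, std::size_t r) {
    std::vector<IndexPair> out;
    if (pts.empty()) return out;

    // (distance to pivot, original index); the pivot is arbitrarily pts[0].
    std::vector<std::pair<std::size_t, std::size_t>> by_pivot;
    by_pivot.reserve(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i)
        by_pivot.emplace_back((pts[i] ^ pts[0]).count(), i);
    std::sort(by_pivot.begin(), by_pivot.end());

    for (std::size_t a = 0; a < by_pivot.size(); ++a) {
        for (std::size_t b = a + 1; b < by_pivot.size(); ++b) {
            // Lower bound from the triangle inequality. The list is sorted,
            // so once the gap exceeds r all later candidates are too far.
            if (by_pivot[b].first - by_pivot[a].first > r) break;
            const std::size_t i = by_pivot[a].second;
            const std::size_t j = by_pivot[b].second;
            if ((pts[i] ^ pts[j]).count() <= r)   // exact check
                out.emplace_back(std::min(i, j), std::max(i, j));
        }
    }
    return out;
}
```

This does not improve the worst case, which remains O(N*N) when many points lie at similar distances from the pivot. How much it prunes in practice depends on the data and on r.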

