计算未排序数据中唯一对和非唯一对实例的数量 [英] Counting number of unique pairs and instances of non-unique pairs in unsorted data
问题描述
我有以下形式的数据:
ID ATTR
3 10
1 20
1 20
4 30
... ...
ID和Attr未排序的地方,可能包含重复项。 ID的范围是1-20,000左右,并且ATTR是unsigned int。我可能一次需要处理10万至50万对。
Where ID and Attr are unsorted and may contain duplicates. The range for the IDs are 1-20,000 or so, and ATTR are unsigned int. There may be anywhere between 100,000 and 500,000 pairs that I need to process at a single time.
我正在寻找:
- 唯一对的数量。
- 非唯一对弹出的次数。
因此在上述数据中,我想知道(1,20)出现了两次并且有3对唯一的对。
So in the above data, I'd want to know that (1,20) appeared twice and that there were 3 unique pairs.
我目前在以幼稚的方式使用哈希表。我保留一个唯一对的计数器,如果要插入的项已经存在,则递减计数器。我还保留了一组非唯一ID的ID。 (所有初次接触)
I'm currently using a hash table in my naive approach. I keep a counter of unique pairs, and decrement the counter if the item I am inserting is already there. I also keep an array of IDs of the non-unique pairs. (All on first encounters)
性能和尺寸都受到同等关注。考虑到性能和尺寸方面的问题,我确实可以接受较高的误报率(例如0.5%)。 (我也使用频谱绽放实现了这一点)
Performance and size are about equal concerns. I'm actually OK with a relatively high (say 0.5%) rate of false positives given the performance and size concerns. (I've also implemented this using a spectral bloom)
我并不聪明,所以我敢肯定那里有更好的解决方案,我会想听听您最喜欢的哈希表实现/其他想法。 :)
I'm not that smart, so I'm sure there's a better solution out there, and I'd like to hear about your favorite hash table implementations/any other ideas. :)
推荐答案
哈希表,其键如< id> =< attr>
是解决此问题的绝佳方法。我想,如果您可以容忍错误,那么您可以花得更小/更快。但是您真的需要这样做吗?
A hash table with keys like <id>=<attr>
is an excellent solution to this problem. If you can tolerate errors, you can get smaller/faster with a bloom, I guess. But do you really need to do that?
这篇关于计算未排序数据中唯一对和非唯一对实例的数量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!