Java: optimize hashset for large-scale duplicate detection


PROBLEM DESCRIPTION



I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320".

I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with

tweetids = new HashSet<String>(220000,0.80F);

and that lets it get a little farther, but is still excruciatingly slow (by around 10 million it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million) should I create a HashSet that rehashes only two or three times, or would the overhead for such a set incur too many time-penalties? Would things work better if I wasn't using a String, or if I define a different HashCode function (which, in this case of a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.

tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))){
    duplicates++;
    continue; 
}
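
For reference, here is a minimal sketch (not the poster's code) of the presizing idea raised above, assuming roughly 22 million entries and the same 0.80 load factor: with an initial capacity of at least expected/loadFactor, the table stays below its resize threshold and should never rehash, although it pays the full memory cost up front.

import java.util.HashSet;
import java.util.Set;
...
    int expectedEntries = 22000000;  // rough upper bound from the question
    float loadFactor = 0.80F;
    // capacity >= expected / loadFactor keeps the set below its resize threshold,
    // so no rehash should occur; HashSet rounds the capacity up to a power of two internally.
    Set<String> tweetids = new HashSet<String>((int) Math.ceil(expectedEntries / loadFactor), loadFactor);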

SOLUTION

Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for because the String.hashCode() is exorbitant for this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. Final implementation was simple, and looked like this:

import gnu.trove.set.hash.TLongHashSet;
...
    TLongHashSet tweetids; // class variable
... 
    tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
    // inside for(each record)
    String twid = (String) tweet_twitter_data.get("id");
    if (!(tweetids.add(Long.parseLong(twid)))) {
        duplicates++;
        continue; 
    }

SOLUTION

You may want to look beyond the Java collections framework. I've done some memory-intensive processing, and you will face several problems:

  1. The number of buckets for large hashmaps and hash sets is going to cause a lot of overhead (memory). You can influence this by using some kind of custom hash function and a modulo of, e.g., 50000.
  2. Strings are represented using 16-bit characters in Java. You can halve that for most scripts by using UTF-8 encoded byte arrays.
  3. HashMaps are in general quite wasteful data structures, and HashSets are basically just a thin wrapper around those.

Given that, take a look at Trove or Guava for alternatives. Also, your ids look like longs: those are 64 bits, quite a bit smaller than the string representation.
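
To make that size comparison concrete, here is a rough back-of-the-envelope sketch (payload only; object headers and array overhead are ignored):

import java.nio.charset.StandardCharsets;
...
    String id = "166471306949304320";
    char[] utf16  = id.toCharArray();                     // 18 chars -> 36 bytes (16-bit chars)
    byte[] utf8   = id.getBytes(StandardCharsets.UTF_8);  // 18 bytes (the digits are ASCII)
    long   asLong = Long.parseLong(id);                   // 8 bytes, fixed width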

An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter would tell you if something is definitely not in a set, and with reasonable certainty (less than 100%) whether something is contained. That, combined with some disk-based solution (e.g. a database, MapDB, memcached, ...), should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the Bloom filter to check whether you need to look in the database, thus avoiding expensive lookups most of the time.
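
A minimal sketch of that combination, assuming Guava is on the classpath; bufferForBatchInsert and isInStore are hypothetical stand-ins for whatever batched writer and disk-backed store (database, MapDB, memcached, ...) you actually use:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
...
    // Sized for ~22 million ids with a 1% false-positive rate.
    BloomFilter<Long> seenIds = BloomFilter.create(Funnels.longFunnel(), 22000000, 0.01);
...
    // inside for (each record); twid as in the snippets above
    long id = Long.parseLong(twid);
    if (!seenIds.mightContain(id)) {
        // Definitely new: no need to touch the disk store at all.
        seenIds.put(id);
        bufferForBatchInsert(id);      // hypothetical helper that batches the writes
    } else if (isInStore(id)) {        // hypothetical disk-backed lookup
        duplicates++;
    } else {
        // Filter false positive: the id only looked familiar, so treat it as new.
        bufferForBatchInsert(id);
    }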
