HashSet<T>.removeAll 方法出奇的慢 [英] The HashSet<T>.removeAll method is surprisingly slow

查看:23
本文介绍了HashSet<T>.removeAll 方法出奇的慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Jon Skeet 最近在他的博客上提出了一个有趣的编程话题:我的抽象中有一个漏洞,亲爱的丽莎,亲爱的丽莎"(强调):

Jon Skeet recently raised an interesting programming topic on his blog: "There's a hole in my abstraction, dear Liza, dear Liza" (emphasis added):

我有一个集合——一个 HashSet,事实上.我想从中删除一些项目……许多项目可能不存在.事实上,在我们的测试案例中,removals"中的项目none集合将在原始集合中.这听起来——实际上——非常容易编码.毕竟,我们有 Set.removeAll 来帮助我们,对吧?

I have a set – a HashSet, in fact. I want to remove some items from it… and many of the items may well not exist. In fact, in our test case, none of the items in the "removals" collection will be in the original set. This sounds – and indeed is – extremely easy to code. After all, we’ve got Set<T>.removeAll to help us, right?

我们指定源"的大小设置和删除"的大小在命令行上收集,并构建它们.源集只包含非负整数;移除集仅包含负整数.我们使用 System.currentTimeMillis() 测量删除所有元素所需的时间,这不是世界上最准确的秒表,但在这种情况下已经足够了,正如您将看到的.代码如下:

We specify the size of the "source" set and the size of the "removals" collection on the command line, and build both of them. The source set contains only non-negative integers; the removals set contains only negative integers. We measure how long it takes to remove all the elements using System.currentTimeMillis(), which isn’t the world most accurate stopwatch but is more than adequate in this case, as you’ll see. Here’s the code:

import java.util.*;
public class Test 
{ 
    public static void main(String[] args) 
    { 
       int sourceSize = Integer.parseInt(args[0]); 
       int removalsSize = Integer.parseInt(args[1]); 
        
       Set<Integer> source = new HashSet<Integer>(); 
       Collection<Integer> removals = new ArrayList<Integer>(); 
        
       for (int i = 0; i < sourceSize; i++) 
       { 
           source.add(i); 
       } 
       for (int i = 1; i <= removalsSize; i++) 
       { 
           removals.add(-i); 
       } 
        
       long start = System.currentTimeMillis(); 
       source.removeAll(removals); 
       long end = System.currentTimeMillis(); 
       System.out.println("Time taken: " + (end - start) + "ms"); 
    }
}

让我们先给它一个简单的工作:一个包含 100 个项目的源集,100 个要删除:

Let’s start off by giving it an easy job: a source set of 100 items, and 100 to remove:

c:UsersJonTest>java Test 100 100
Time taken: 1ms

好的,所以我们没想到它会很慢……显然我们可以加快速度.一个 100 万个项目的来源和 300,000 个要删除的项目怎么样?

Okay, so we hadn’t expected it to be slow… clearly we can ramp things up a bit. How about a source of one million items and 300,000 items to remove?

c:UsersJonTest>java Test 1000000 300000
Time taken: 38ms

嗯.看来还是挺快的.现在我觉得我有点残忍,要求它完成所有删除操作.让我们简化一下——300,000 个源项和 300,000 个移除项:

Hmm. That still seems pretty speedy. Now I feel I’ve been a little bit cruel, asking it to do all that removing. Let’s make it a bit easier – 300,000 source items and 300,000 removals:

c:UsersJonTest>java Test 300000 300000
Time taken: 178131ms

对不起?将近三分钟?哎呀!与我们在 38 毫秒内管理的集合相比,从 较小 集合中删除项目肯定应该更容易?

Excuse me? Nearly three minutes? Yikes! Surely it ought to be easier to remove items from a smaller collection than the one we managed in 38ms?

有人可以解释为什么会发生这种情况吗?为什么 HashSet.removeAll 方法这么慢?

Can someone explain why this is happening? Why is the HashSet<T>.removeAll method so slow?

推荐答案

该行为(在某种程度上)记录在 javadoc:

The behaviour is (somewhat) documented in the javadoc:

此实现通过对每个集合调用 size 方法来确定此集合和指定集合中的较小者.如果这个集合的元素较少,那么实现会遍历这个集合,依次检查迭代器返回的每个元素,看是否包含它在指定的集合中.如果包含,则使用迭代器的 remove 方法将其从该集合中删除.如果指定集合的​​元素较少,则实现将迭代指定集合,使用该集合的 remove 方法从该集合中移除迭代器返回的每个元素.

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator's remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set's remove method.

这在实践中意味着什么,当您调用 source.removeAll(removals); 时:

What this means in practice, when you call source.removeAll(removals);:

  • 如果removals集合的大小小于source,则HashSetremove方法> 被调用,速度很快.

  • if the removals collection is of a smaller size than source, the remove method of HashSet is called, which is fast.

如果 removals 集合的大小等于或大于 source,则调用 removals.contains,即ArrayList 慢.

if the removals collection is of equal or larger size than the source, then removals.contains is called, which is slow for an ArrayList.

快速修复:

Collection<Integer> removals = new HashSet<Integer>();

请注意,有一个与您描述的非常相似的一个未解决的错误.底线似乎是它可能是一个糟糕的选择,但无法更改,因为它记录在 javadoc 中.

Note that there is an open bug that is very similar to what you describe. The bottom line seems to be that it is probably a poor choice but can't be changed because it is documented in the javadoc.

作为参考,这是removeAll的代码(在Java 8中 - 其他版本未检查):

For reference, this is the code of removeAll (in Java 8 - haven't checked other versions):

public boolean removeAll(Collection<?> c) {
    Objects.requireNonNull(c);
    boolean modified = false;

    if (size() > c.size()) {
        for (Iterator<?> i = c.iterator(); i.hasNext(); )
            modified |= remove(i.next());
    } else {
        for (Iterator<?> i = iterator(); i.hasNext(); ) {
            if (c.contains(i.next())) {
                i.remove();
                modified = true;
            }
        }
    }
    return modified;
}

这篇关于HashSet&lt;T&gt;.removeAll 方法出奇的慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆