What is the best way to find common elements from 2 sets?


Problem description

Recently I had an interview and I was asked one question.

I have 2 sets with around 1 Million records each. I have to find the common element in 2 sets.

My answer:

I will create a new empty Set. I gave him the solution below, but he was not happy with it. He said there are 1 million records, so the solution won't be good.

public Set<Integer> commonElements(Set<Integer> s1, Set<Integer> s2) {
    Set<Integer> res = new HashSet<>();
    for (Integer temp : s1) {
        if (s2.contains(temp)) {
            res.add(temp);
        }
    }
    return res;
}

What is the better way to solve this problem, then?

Recommended answer

First of all: in order to determine the intersection of two sets, you absolutely have to look at all entries of at least one of the two sets (to figure out whether each one is in the other set). There is no magic that would tell you that in less than O(min(size(s1), size(s2))). Period.

The next thing to tell the interviewer: "1 million entries. You must be kidding. It is 2019. Any decent piece of hardware crunches two 1-million-entry sets in less than a second."

Then you briefly mention that there are various built-in ways to solve this, as well as various 3rd party libraries. But you avoid the mistake that the other two answers make: pointing to a library that computes the intersection is not at all something you sell as the "solution" to this question.

You see, regarding coding: the Java Set interface has an easy solution for that: s1.retainAll(s2) computes the intersection of the two sets, as it removes from s1 all elements that aren't in s2.

Obviously, you have to mention within the interview that this will modify s1.
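A quick sketch of that side effect (the set contents here are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

public class RetainAllDemo {
    public static void main(String[] args) {
        Set<Integer> s1 = new HashSet<>(Set.of(1, 2, 3, 4));
        Set<Integer> s2 = Set.of(3, 4, 5);

        s1.retainAll(s2); // mutates s1: everything not in s2 is removed

        System.out.println(s1.equals(Set.of(3, 4))); // true — s1 itself is now the intersection
    }
}
```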

In case the requirement is to not modify s1 or s2, your solution is a viable way to go, and there isn't anything one can do about the runtime cost. If anything, you could call size() on both sets and iterate over the one that has fewer entries.
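That size check could be sketched like this (this helper is mine, not from the original answer): pick the smaller set up front, so the loop runs min(size(s1), size(s2)) times and neither input is touched.

```java
import java.util.HashSet;
import java.util.Set;

public class Intersection {
    // Non-destructive intersection: iterates the smaller set, probes the larger one.
    public static <T> Set<T> commonElements(Set<T> s1, Set<T> s2) {
        Set<T> smaller = s1.size() <= s2.size() ? s1 : s2;
        Set<T> larger = (smaller == s1) ? s2 : s1;
        Set<T> result = new HashSet<>();
        for (T element : smaller) {
            if (larger.contains(element)) {
                result.add(element);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(commonElements(Set.of(1, 2, 3), Set.of(2, 3, 4, 5)));
    }
}
```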

Or you go with:

Set<Integer> result = new HashSet<>(s1);
result.retainAll(s2);
return result;

but in the end, you have to iterate over one set and, for each element, determine whether it is in the second set.

But of course, the real answer to such questions is always, always, always to show the interviewer that you are able to dissect the problem into its different aspects. You outline basic constraints, you outline different solutions and discuss their pros and cons. Me, for example: I would expect you to sit down and maybe write a program like this:

import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.stream.Collectors;

import org.junit.Before;
import org.junit.Test;

public class Numbers {
    private final static int numberOfEntries = 20_000_000;
    private final static int maxRandom = numberOfEntries;

    private Set<Integer> s1;
    private Set<Integer> s2;

    @Before
    public void setUp() throws Exception {
        Random random = new Random(42);
        s1 = fillWithRandomEntries(random, numberOfEntries);
        s2 = fillWithRandomEntries(random, numberOfEntries);
    }

    private static Set<Integer> fillWithRandomEntries(Random random, int entries) {
        Set<Integer> rv = new HashSet<>();
        for (int i = 0; i < entries; i++) {
            rv.add(random.nextInt(maxRandom));
        }
        return rv;
    }

    @Test
    public void classic() {
        long start = System.currentTimeMillis();
        HashSet<Integer> intersection = new HashSet<>();
          s1.forEach((i) -> {
           if (s2.contains(i))
             intersection.add(i);
        });
        long end = System.currentTimeMillis();
        System.out.println("foreach duration: " + (end-start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }


    @Test
    public void retainAll() {
        long start = System.currentTimeMillis();
        s1.retainAll(s2);
        long end = System.currentTimeMillis();
        System.out.println("Retain all duration: " + (end-start) + " ms");
        System.out.println("intersection.size() = " + s1.size());
    }

    @Test
    public void streams() {
        long start = System.currentTimeMillis();
        Set<Integer> intersection = s1.stream().filter(i -> s2.contains(i)).collect(Collectors.toSet());
        long end = System.currentTimeMillis();
        System.out.println("streaming: " + (end-start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }

    @Test
    public void parallelStreams() {
        long start = System.currentTimeMillis();
        Set<Integer> intersection = s1.parallelStream().filter(i -> s2.contains(i)).collect(Collectors.toSet());
        long end = System.currentTimeMillis();
        System.out.println("parallel streaming: " + (end-start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }
}

The first observation here: I decided to run with 20 million entries. I started with 2 million, but all three tests ran well below 500 ms. Here is the printout for 20 million on my MacBook Pro:

foreach duration: 9304 ms
intersection.size() = 7990888 
streaming: 9356 ms
intersection.size() = 7990888
Retain all duration: 685 ms
intersection.size() = 7990888
parallel streaming: 6998 ms
intersection.size() = 7990888

As expected: all intersections have the same size (because I seeded the random number generator to get comparable results).

And surprise: modifying s1 in place ... is by far the cheapest option. It beats streaming by a factor of 10. Also note: parallel streaming is quicker here. When running with 1 million entries, the sequential stream was faster.

This is why I initially suggested mentioning that "1 million entries is not a performance problem". That is a very important statement, as it tells the interviewer that you are not one of those people who waste hours micro-optimizing non-existent performance issues.
