查找四种十亿的人给出的整数 [英] Find an integer not among four billion given ones

查看:141
本文介绍了查找四种十亿的人给出的整数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个面试问题:

鉴于具有四个十亿整数的输入文件,提供一种算法以产生不包含在该文件中的一个整数。假设你有1 吉布内存。你会做什么,如果你只有10 NBSP跟进;内存MIB

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GiB memory. Follow up with what you would do if you have only 10 MiB of memory.

我的分析:

的文件的大小为4 * 10 9 * 4字节= 16 吉布

The size of the file is 4 * 109 * 4 bytes = 16 GiB.

我们可以做外部排序,因此,我们结识了整数的范围。我的问题是什么是检测缺失的整数排序大整数集的最好的方式?

We can do external sorting, thus we get to know the range of the integers. My question is what is the best way to detect the missing integer in the sorted big integer sets?

我的理解(阅读所有的答案后):

My understanding(after reading all answers):

假设我们所谈论的32位整数。有2 ^ 32 = 4 * 10 9 不同的整数。

Assuming we are talking about 32-bit integers. There are 2^32 = 4*109 distinct integers.

案例一:我们有1 吉布= 1 * 10 9 字节* 8位/字节= 8十亿位的内存。   解决方法:如果我们用一个位重新presenting一个不同的整数,这就够了。我们不这样做   需要排序。   实施:

Case 1: we have 1 GiB = 1 * 109 bytes * 8 bits/byte = 8 billion bits memory. Solution: if we use one bit representing one distinct integer, it is enough. we don't need sort. Implementation:

int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
    Scanner in = new Scanner(new FileReader("a.txt"));
    while(in.hasNextInt()){
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }

    for(int i = 0; i< bitfield.lenght; i++){
        for(int j =0; j<radix; j++){
            if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
        }
    }
}

案例2:10&NBSP; MB内存= 10 * 10 6 * 8位= 80万比特

Case 2: 10 MB memory = 10 * 106 * 8 bits = 80 million bits

Solution: For all possible 16-bit prefixes, there are 2^16 number of
integers = 65536, we need 2^16 * 4 * 8 = 2 million bits. We need build
65536 buckets. For each bucket, we need 4 bytes holding all possibilities because
 the worst case is all the 4 billion integers belong to the same bucket.

Step 1: Build the counter of each bucket through the first pass through the file.
Step 2: Scan the buckets, find the first one who has less than 65536 hit.
Step 3: Build new buckets whose high 16-bit prefixes are we found in step2
through second pass of the file
Step 4: Scan the buckets built in step3, find the first bucket which doesnt
have a hit.

The code is very similar to above one.

结论:     我们通过增加文件通过减少内存。

Conclusion: We decrease memory through increasing file pass.

一个澄清那些迟到:现在的问题,是问,不说恰好有一个整数,不包含在文件中 - 至少这不是大多数人除$ P $如何PT它。在评论跟帖很多评论有关任务的变化,虽然。不幸的是的意见出台它的评论跟帖是由它的作者后来删除了,所以现在看起来孤立的答复只是误会了一切。这是非常令人困惑。对不起。

A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file -- at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing. Sorry.

推荐答案

据统计获悉算法解决使用更少的传球比确定性的方法这个问题。

Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.

如果非常大的整数允许,然后可以生成一个数字,很可能是在O(1)时间是唯一的。伪随机128位的整数像 GUID 将只与现有四十亿整数名冲突的集不到一出每64个十亿十亿十亿情况。

If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.

如果整数限制在32位,然后可以生成一个数字,很可能是在使用超过10 NBSP更不用说单传唯一的; MB。该伪随机的32位整数将与4十亿现有整数之一碰撞的几率为约93%(4E9 / 2 ^ 32)。这1000的伪随机整数都将发生碰撞的几率不到12,000十亿十亿十亿(赔率的一碰撞^ 1000)。因此,如果一个程序维护包含通过已知的整数1000的伪随机候选人和迭代的数据结构,消除了来自考生的比赛,它是所有,但一定要找到至少一个整数,它是不是在文件中。

If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.

这篇关于查找四种十亿的人给出的整数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆