生成一个不超过40亿个给定整数的整数 [英] Generate an integer that is not among four billion given ones

查看:108
本文介绍了生成一个不超过40亿个给定整数的整数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我被问到了这个面试问题:

I have been given this interview question:

给出具有40亿个整数的输入文件,提供一种算法来生成文件中不包含的整数.假设您有1 GB的内存.如果只有10 MB的内存,请继续执行下一步操作.

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.

我的分析:

文件大小为4×10 9 ×4字节= 16GB.

My analysis:

The size of the file is 4×109×4 bytes = 16 GB.

我们可以进行外部排序,从而使我们知道整数的范围.

We can do external sorting, thus letting us know the range of the integers.

我的问题是在排序的大整数集中检测丢失的整数的最佳方法是什么?

My question is what is the best way to detect the missing integer in the sorted big integer sets?

假设我们正在谈论32位整数,那么有2 32 = 4 * 10 9 个不同的整数.

Assuming we are talking about 32-bit integers, there are 232 = 4*109 distinct integers.

如果我们使用一位代表一个不同的整数,那就足够了.我们不需要排序.

If we use one bit representing one distinct integer, it is enough. we don't need sort.

int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
    Scanner in = new Scanner(new FileReader("a.txt"));
    while(in.hasNextInt()){
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }

    for(int i = 0; i< bitfield.lenght; i++){
        for(int j =0; j<radix; j++){
            if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
        }
    }
}

情况2:10 MB内存= 10 * 10 6 * 8位= 8000万位

Case 2: 10 MB memory = 10 * 106 * 8 bits = 80 million bits

解决方案:

对于所有可能的16位前缀,有2 16 个 整数= 65536,我们需要2 16 * 4 * 8 = 2百万个位.我们需要构建65536个存储桶.对于每个存储桶,我们需要4个字节来保存所有可能性,因为最坏的情况是所有40亿个整数都属于同一个存储桶.

Solution:

For all possible 16-bit prefixes, there are 216 number of integers = 65536, we need 216 * 4 * 8 = 2 million bits. We need build 65536 buckets. For each bucket, we need 4 bytes holding all possibilities because the worst case is all the 4 billion integers belong to the same bucket.

  1. 通过第一次遍历文件来构建每个存储桶的计数器.
  2. 扫描存储桶,找到第一个命中率低于65536的桶.
  3. 构建新的存储桶,我们在步骤2中发现了高16位前缀 通过文件的第二遍
  4. 扫描步骤3中内置的存储桶,找到第一个没有存储的存储桶 受到打击.
  1. Build the counter of each bucket through the first pass through the file.
  2. Scan the buckets, find the first one who has less than 65536 hit.
  3. Build new buckets whose high 16-bit prefixes are we found in step2 through second pass of the file
  4. Scan the buckets built in step3, find the first bucket which doesnt have a hit.

代码与上面的代码非常相似.

The code is very similar to above one.

结论: 我们通过增加文件传递来减少内存.

Conclusion: We decrease memory through increasing file pass.

对于迟到的人的澄清:问的问题不是说文件中没有正好包含一个整数,至少大多数人不是这样解释的.不过,评论线程中的许多评论 都与任务的变化有关.不幸的是,引入该注释的注释后来被其作者删除了,因此现在看来,孤立的答复似乎误解了所有内容.非常令人困惑,对不起.

A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.

推荐答案

假设整数"表示32位:10 MB的空间足以让您计算有多少个数字在输入文件中使用任何给定的16位前缀,对于所有可能的16位前缀,一次通过输入文件.至少有一个水桶命中次数少于2 16 次.再进行一遍,以查找该存储桶中哪些可能的数字已被使用.

Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have be hit less than 216 times. Do a second pass to find of which of the possible numbers in that bucket are used already.

如果它表示大于32位,但仍为有限大小:请执行上述操作,忽略所有碰巧落在(有符号或无符号;您选择的)32位范围之外的所有输入数字

If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.

如果整数"表示数学整数:请仔细阅读输入内容,并跟踪您见过的最长数字的最大数字长度.完成后,输出最大值加一个一个具有一位数字的随机数. (文件中的一个数字可能是一个大于10 MB的数字,要准确表示,但如果输入是文件,则您至少可以表示适合的任何内容的 length 它).

If "integer" means mathematical integer: Read through the input once and keep track of the largest number length of the longest number you've ever seen. When you're done, output the maximum plus one a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).

这篇关于生成一个不超过40亿个给定整数的整数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆