储存一千电话号码最有效的方法 [英] Most efficient way to store thousand telephone numbers

查看:158
本文介绍了储存一千电话号码最有效的方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个谷歌的面试问题:

This is a google interview question:

有千元左右被存储在每个具有10个数字的电话号码。你可以假设每个前5位跨越千年的数字相同。你必须执行以下操作: 一个。搜索如果给定数量存在。 湾打印所有数

There are around thousand phone numbers to be stored each having 10 digits. You can assume first 5 digits of each to be same across thousand numbers. You have to perform the following operations: a. Search if a given number exists. b. Print all the number

什么是最有效的节省空间的方式来做到这一点?

What is the most efficient space saving way to do this ?

我回答哈希表,后来霍夫曼编码,但我的面试官说我不会在正确的方向前进。请帮我在这里。

I answered hash table and later huffman coding but my interviewer said I was not going in right direction. Please help me here.

可以用一个后缀特里的帮助?

Could using a suffix trie help?

在理想情况下的1000个号码存储需要每个号码4字节,因此在所有则需4000个字节来存储1000号。从数量上看,我想减少存储到< 4000字节,这就是我的面试官向我解释说。

Ideally 1000 numbers storing takes 4 bytes per number so in all it would take 4000 bytes to store 1000 number. Quantitatively, I wish to reduce the storage to < 4000 bytes, this is what my interviewer explained to me.

推荐答案

下面是一个改进<一个href="http://stackoverflow.com/questions/7685649/most-efficient-way-to-store-thousand-telephone-numbers/7685746#7685746">aix's回答。考虑使用三个层次的数据结构:首先是一个常数前五位数字(17位);所以从这里,每一个电话号码仅剩下五个数字左。我们使用一种方法和17查看这些剩下的五位数字为那些位的17位二进制整数和存储的 K 的 - 的 K = M 的用不同的方法,确定的 K 的末尾,以尽量减少所需的空间。

Here's an improvement to aix's answer. Consider using three "layers" for the data structure: the first is a constant for the first five digits (17 bits); so from here on, each phone number has only the remaining five digits left. We view these remaining five digits as 17-bit binary integers and store k of those bits using one method and 17 - k = m with a different method, determining k at the end to minimize the required space.

我们首先排序的电话号码(全部降低到5位小数)。然后,我们看看有多少人的电话号码存在的二进制数第一的 M 的位都是0,对于电话多少个号码的第一个为其的 M 的位最多0 ... 01,进行电话多少个号码第一的 M 的位至多0 ... 10,等等,多达电话号码的数量为其第一的 M 位为1 ... 11 - 这最后数为1000(十进制)。有2 ^ M 的这种计数,每个数最多为1000。如果我们忽略了最后一个(因为我们知道这是1000无论如何),我们可以存储在一个连续的块中的所有这些数字(2 ^ M 的 - 1)* 10位。 (10比特是足够用于存储数小于1024。)

We first sort the phone numbers (all reduced to 5 decimal digits). Then we count how many phone numbers there are for which the binary number consisting of the first m bits is all 0, for how many phone numbers the first m bits are at most 0...01, for how many phone numbers the first m bits are at most 0...10, etcetera, up to the count of phone numbers for which the first m bits are 1...11 - this last count is 1000(decimal). There are 2^m such counts and each count is at most 1000. If we omit the last one (because we know it is 1000 anyway), we can store all of these numbers in a contiguous block of (2^m - 1) * 10 bits. (10 bits is enough for storing a number less than 1024.)

最后的 K 的所有(减少)电话号码位被连续存储在存储器中;因此,如果 K 的是,比方说,如图7所示,那么前7位的存储器该块(比特0直通6)对应于最后7位的第一个(还原的)电话号码,位7至13对应于最后7位的第二(减)的电话号码,等等。这要求1000 *的 K 的位,总共17 +(2 ^(17 - 的 K 的) - 1)* 10 + 1000 *的 K 的,其中达到其最低11287对的 K = 10,所以我们可以存储在CEIL所有的电话号码(八分之一万一千二百八十七)= 1411字节。

The last k bits of all (reduced) phone numbers are stored contiguously in memory; so if k is, say, 7, then the first 7 bits of this block of memory (bits 0 thru 6) correspond to the last 7 bits of the first (reduced) phone number, bits 7 thru 13 correspond to the last 7 bits of the second (reduced) phone number, etcetera. This requires 1000 * k bits for a total of 17 + (2^(17 - k) - 1) * 10 + 1000 * k, which attains its minimum 11287 for k = 10. So we can store all phone numbers in ceil(11287/8)=1411 bytes.

另外可以节省空间,通过观察,没有我们的数字可以与如启动1111111(二进制),因为这与启动最低的数字是130048,我们只有5个十进制数字。这使我们能够刮胡子几个条目掉内存的第一个块:而不是2 ^ M 的 - 1计数,我们只需要CEIL(二分之九万九千九百九十九^ K 的)。这指的是式变为

Additional space can be saved by observing that none of our numbers can start with e.g. 1111111(binary), because the lowest number that starts with that is 130048 and we have only five decimal digits. This allows us to shave a few entries off the first block of memory: instead of 2^m - 1 counts, we need only ceil(99999/2^k). That means the formula becomes

17 + CEIL(二分之九万九千九百九十九^ K 的)* 10 + 1000 *的 K

17 + ceil(99999/2^k) * 10 + 1000 * k

这令人惊讶的达到其最低10997为的 K = 9的 K = 10,或CEIL(8分之10997)= 1375字节。

which amazingly enough attains its minimum 10997 for both k = 9 and k = 10, or ceil(10997/8) = 1375 bytes.

如果我们想知道某个手机号码是否在我们的设置,我们首先检查前五个二进制数字匹配,我们已经存储了五位数。然后我们分裂其余五位数到它的顶部的 = 7位(这是,比方说,在的比特数的 M 的)及其较低的 K = 10位(数的 K 的)。我们现在发现的数量的的[M-1]的下降电话号码,其中第一的 M 的数字是最多的 M 的 - 1,和数的的[M]减少电话号码,其中第一的的数字为大部分的 M 的,无论是从比特的第一个块。我们现在的的[M-1]和第的的[M]个序列之间检查的 K 的在存储器中的第二块位看看我们发现的 K 的;在最坏的情况下,有1000这样的序列,因此,如果我们用二进制搜索,我们可以在O完成(日志1000)操作

If we want to know whether a certain phone number is in our set, we first check if the first five binary digits match the five digits we have stored. Then we split the remaining five digits into its top m=7 bits (which is, say, the m-bit number M) and its lower k=10 bits (the number K). We now find the number a[M-1] of reduced phone numbers for which the first m digits are at most M - 1, and the number a[M] of reduced phone numbers for which the first m digits are at most M, both from the first block of bits. We now check between the a[M-1]th and a[M]th sequence of k bits in the second block of memory to see if we find K; in the worst case there are 1000 such sequences, so if we use binary search we can finish in O(log 1000) operations.

伪code在打印所有的1000个号码如下,我在那里访问的 K 的'日的 K 的第一个存储块的比特条目的的[K]和 M 的'个的的的存储器中的第二块的比特条目的 B 的[M](这两种都需要几个位操作的繁琐写出来)。前五个数字是在数量上的 C 的。

Pseudocode for printing all 1000 numbers follows, where I access the K'th k-bit entry of the first block of memory as a[K] and the M'th m-bit entry of the second block of memory as b[M] (both of these would require a few bit operations that are tedious to write out). The first five digits are in the number c.

i := 0;
for K from 0 to ceil(99999 / 2^k) do
  while i < a[K] do
    print(c * 10^5 + K * 2^k + b[i]);
    i := i + 1;
  end do;
end do;

也许不顺心的事与边界情况下的 K 的= CEIL(二分之九万九千九百九十九^ K 的),但是这很容易解决。

Maybe something goes wrong with the boundary case for K = ceil(99999/2^k), but that's easy enough to fix.

最后,从熵点,所以不可能以存储10 ^ 3的正整数的子集的所有小于10 ^ 5在比CEIL更少(日志[2](二项式(10 ^ 5,10 ^ 3)))= 8073.包括我们需要的前5位的17,仍有10997的间隙 - 8090 = 2907比特。这是一个有趣的挑战,看看是否有更好的解决方案,你仍然可以访问人数相对有效!

Finally, from an entropy point of view, it is not possible to store a subset of 10^3 positive integers all less than 10^5 in fewer than ceil(log[2](binomial(10^5, 10^3))) = 8073. Including the 17 we need for the first 5 digits, there is still a gap of 10997 - 8090 = 2907 bits. It's an interesting challenge to see if there are better solutions where you can still access the numbers relatively efficiently!

这篇关于储存一千电话号码最有效的方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆