Efficiently computing the first 20-digit substring to repeat in the decimal expansion of Pi


Problem description


Pi = 3.14159 26 5358979323846 26 433... so the first 2-digit substring to repeat is 26.


What is an efficient way of finding the first 20-digit substring to repeat?


  • I have about 500 gigabytes of digits of Pi (1 byte per digit), and about 500 gigabytes of disk space free.


  • I have about 5 gigabytes of RAM free.


  • I am interested in an efficient algorithm that would work for arbitrary sequences rather than the specific answer for Pi itself. In other words, I am not interested in a solution of the form "print 123....456" even if the number it prints is correct.


I place each substring into a hash table and report the first collision.


(The hash table is constructed as an array of sorted linked lists. The index into the array is given by the bottom digits of the string (converted to an integer), and the value stored in each node is the location in the expansion of Pi where the substring first appeared.)
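For reference, a minimal sketch of that approach, assuming a hypothetical file pi_digits.bin with one ASCII digit per byte, and using a plain Python dict in place of the array-of-sorted-linked-lists described above:

    # Hypothetical sketch of the hash-table approach: map each 20-digit
    # window to the position where it first appeared and report the first
    # window seen twice. Only practical while the table fits in RAM.
    K = 20

    def first_repeat(path="pi_digits.bin", k=K):
        seen = {}                              # window -> first position
        with open(path, "rb") as f:
            digits = f.read()                  # fine only for small prefixes
        for pos in range(len(digits) - k + 1):
            window = digits[pos:pos + k]
            if window in seen:
                return window.decode(), seen[window], pos
            seen[window] = pos
        return None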


This worked fine until I ran out of RAM.


To scale to longer sequences I have considered:


  • Generating the hash for all substrings starting in a certain range, and then continuing the search over the remainder of the digits. This needs to rescan the entire sequence of Pi for each range, so becomes order N^2


  • Bucket sorting the set of 20-digit substrings to multiple files, and then using a hash table to find the first repeat in each file separately. Unfortunately, with this method I run out of disk space so would need 20 passes through the data. (If I start with 1000 digits, then I will end up with 1000 20-digit substrings.)


  • Storing 2 digits of Pi per byte to free up more memory (a small packing sketch follows this list).


  • Adding a disk-based backing store to my hash table. I worry that this would behave very poorly as there is no obvious locality of reference.
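As mentioned in the "storing 2 digits per byte" item above, here is a small sketch of that packing (function names are hypothetical); two decimal digits 0-99 fit in a single byte, halving the footprint of the raw digit stream:

    def pack_digits(ascii_digits: bytes) -> bytes:
        """Pack b'1415...' (one ASCII digit per byte) into two digits per byte."""
        if len(ascii_digits) % 2:
            ascii_digits += b"0"              # pad; the caller must ignore the pad
        out = bytearray(len(ascii_digits) // 2)
        for i in range(0, len(ascii_digits), 2):
            hi = ascii_digits[i] - 0x30       # ord('0') == 0x30
            lo = ascii_digits[i + 1] - 0x30
            out[i // 2] = hi * 10 + lo        # values 0..99 fit in one byte
        return bytes(out)

    def digit_at(packed: bytes, index: int) -> int:
        """Decimal digit at position `index` of the original stream."""
        b = packed[index // 2]
        return b // 10 if index % 2 == 0 else b % 10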

Is there a better approach?


  1. I tried Adrian McCarthy's qsort method, but this seemed a little slower than hashing for finding duplicates.


  2. I looked at btilly's MapReduce suggestion for parallelising the algorithm, but it was heavily IO bound on my single computer so not appropriate for me (with my single disk drive).


  3. I implemented supercat's method, spending last night splitting the files and searching for 19-digit substrings in the first 18 billion digits.


  4. This found 16 matches, so I used Jarred's suggestion to re-check the 19-digit matches and find the first 20-digit match.


Searching 18 billion digits took 3 hours to split the files and then 40 minutes to rescan the files looking for matches.


The 20-digit substring 84756845106452435773 is found at positions 15,494,062,637 and 17,601,613,330 within the decimal expansion of Pi.

Thanks very much, everyone!

Recommended answer


Your data set is pretty big, so some sort of "divide and conquer" will be necessary. I would suggest that as a first step, you subdivide the problem into some number of pieces (e.g. 100). Start by seeing if the file has any duplicated 20-digit sequences starting with 00, then see if it has any starting with 01, etc. up to 99. Start each of these "main passes" by writing out to a file all of the 20-digit sequences that start with the correct number. If the first two digits are constant, you'll only need to write out the last 18; since an 18-digit decimal number will fit into an 8-byte 'long', the output file will probably hold about 5,000,000,000 numbers, taking up 40GB of disk space. Note that it may be worthwhile to produce more than one output file at a time, so as to avoid having to read every byte of the source file 100 times, but disk performance may be better if you are simply reading one file and writing one file.
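A rough sketch of one such main pass, assuming the digits sit in a hypothetical file pi_digits.bin with one ASCII digit per byte (function and output names are also made up). Only the 8-byte packed value is written, matching the 40GB estimate; positions are not stored, since any duplicated value found later can be located again with a single rescan of the digits:

    import struct

    K = 20

    def run_main_pass(prefix: bytes, src="pi_digits.bin", chunk=1 << 24):
        """Write the last 18 digits of every 20-digit window starting with
        `prefix` (e.g. b"00") to pass_<prefix>.bin as 8-byte integers."""
        assert len(prefix) == 2
        with open(src, "rb") as f, open(f"pass_{prefix.decode()}.bin", "wb") as out:
            carry = b""
            while True:
                block = carry + f.read(chunk)
                if len(block) < K:
                    break
                for i in range(len(block) - K + 1):
                    if block[i:i + 2] == prefix:
                        # 18 decimal digits always fit in an unsigned 64-bit int
                        out.write(struct.pack("<Q", int(block[i + 2:i + K])))
                carry = block[-(K - 1):]       # keep the tail so windows spanning
                                               # chunk boundaries are not missed

A pure-Python loop over 500 gigabytes would of course be slow in practice; the sketch is only meant to show the record layout.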


Once one has generated the data file for a particular "main pass", one must then determine whether there are any duplicates in it. Subdividing it into some number of smaller sections based upon the bits in the numbers stored therein may be a good next step. If one subdivides it into 256 smaller sections, each section will have somewhere around 16-32 million numbers; the five gigabytes of RAM one has could be used to buffer a million numbers for each of the 256 buckets. Writing out each chunk of a million numbers would require a random disk seek, but the number of such writes would be pretty reasonable (probably about 10,000 disk seeks).
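A sketch of that subdivision step (file names hypothetical): each 64-bit value is routed to one of 256 bucket files by its low-order 8 bits, and each bucket is buffered to about a million values before being flushed, so the writes are large and sequential. The buffers top out around 2GB (256 x 1,000,000 x 8 bytes), comfortably inside the 5GB of RAM:

    import struct
    from array import array

    BUFFER = 1_000_000                        # values buffered per bucket (~8 MB)

    def split_into_buckets(pass_path):
        buckets = [array("Q") for _ in range(256)]
        outs = [open(f"{pass_path}.bucket{b:02x}", "wb") for b in range(256)]

        def flush(b):
            if buckets[b]:
                outs[b].write(struct.pack(f"<{len(buckets[b])}Q", *buckets[b]))
                del buckets[b][:]

        with open(pass_path, "rb") as f:
            while True:
                data = f.read(8 * BUFFER)
                if not data:
                    break
                for (value,) in struct.iter_unpack("<Q", data):
                    b = value & 0xFF          # route on the low 8 bits of the value
                    buckets[b].append(value)
                    if len(buckets[b]) >= BUFFER:
                        flush(b)
        for b in range(256):
            flush(b)
            outs[b].close()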


Once the data has been subdivided into files of 16-32 million numbers each, simply read each such file into memory and look for duplicates.
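A sketch of that final step (file name hypothetical): each bucket file fits in memory, so duplicates fall out of a sort followed by a comparison of neighbouring entries. Any duplicated 18-digit value, prefixed with the pass's two leading digits, is a candidate 20-digit substring, and one last rescan of the digits recovers its positions:

    import struct

    def duplicated_values(bucket_path):
        """Return the set of 64-bit values that occur more than once in the file."""
        data = open(bucket_path, "rb").read()
        values = sorted(struct.unpack(f"<{len(data) // 8}Q", data))
        return {values[i] for i in range(1, len(values)) if values[i] == values[i - 1]}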


The algorithm as described probably isn't optimal, but it should be reasonably close. Of greatest interest is the fact that cutting in half the number of main passes would cut in half the number of times one had to read through the source data file, but would more than double the time required to process each pass once its data had been copied. I would guess that using 100 passes through the source file probably isn't optimal, but the time required for the overall process using that split factor would be pretty close to the time using the optimal split factor.
