External shuffle: shuffling a large amount of data out of memory
Problem description
I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40 GB).

I have around 30 million entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data, which does not fit in RAM.
The only solution I have thought of is to shuffle an array containing the numbers from `1` to `N`, where `N` is the number of entries, with the Fisher-Yates algorithm, and then copy the entries to a new file in that order. Unfortunately, this solution involves a lot of seek operations, and thus would be very slow.
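The naive approach described above can be sketched as follows. This is a minimal illustration, not the asker's actual code; the `(start, length)` index format is an assumption. It shows where the cost comes from: one random seek per entry.

```python
import random

def shuffled_copy(entries, src, dst):
    """Copy variable-length entries from src to dst in random order.

    `entries` is a list of (start, length) pairs describing each entry's
    location in the source file (hypothetical index layout).
    """
    order = list(range(len(entries)))
    random.shuffle(order)  # in-place Fisher-Yates shuffle of the indices
    for i in order:
        start, length = entries[i]
        src.seek(start)          # one random seek per entry -> very slow on disk
        dst.write(src.read(length))
```

With ~30 million entries this performs ~30 million random seeks on the source file, which is exactly the bottleneck the question complains about.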
Is there a better solution to shuffle a large amount of data with uniform distribution?
Recommended answer
First, get the *shuffle* issue out of the way. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash values.
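One way to realize this idea is to derive a pseudo-random 64-bit sort key from each entry's index by hashing it with a fixed seed. This is a sketch under assumptions not in the answer (the function name, the use of BLAKE2, and the seed are all illustrative); sorting by such keys is equivalent to a uniform shuffle, up to the vanishingly rare chance of key collisions (break ties by entry id if you want to be safe).

```python
import hashlib
import struct

def random_key(entry_id, seed=b"shuffle-seed"):
    """Map an entry id to a repeatable pseudo-random 64-bit sort key.

    Hashing the id with a fixed seed gives a deterministic, random-looking
    order; sorting entries by this key shuffles them uniformly.
    """
    digest = hashlib.blake2b(seed + struct.pack("<q", entry_id),
                             digest_size=8).digest()
    return struct.unpack("<Q", digest)[0]
```

Because the key depends only on the entry id and the seed, the shuffle is reproducible; changing the seed produces a different permutation.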
Now that you have transformed your *shuffle* into a *sort*, your problem turns into finding an efficient external sort algorithm that fits your budget and memory limits. That should now be as easy as a Google search.
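The classic external sort the answer alludes to is an external merge sort: sort memory-sized runs, spill each to disk, then k-way merge the runs. A minimal sketch, assuming `(key, payload)` records and pickle-serialized temporary run files (both illustrative choices, not part of the answer):

```python
import heapq
import itertools
import pickle
import tempfile

def external_sort(records, max_in_memory=100_000):
    """Sketch of an external merge sort over (key, payload) records.

    Splits the input into sorted runs that fit in memory, spills each run
    to a temporary file, then k-way merges the runs with a heap.
    """
    runs = []
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, max_in_memory))
        if not chunk:
            break
        chunk.sort()                       # in-memory sort of one run
        f = tempfile.TemporaryFile()
        for rec in chunk:
            pickle.dump(rec, f)            # spill the sorted run to disk
        f.seek(0)
        runs.append(f)

    def read_run(f):
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

    # heapq.merge streams the runs, keeping only one record per run in memory
    yield from heapq.merge(*(read_run(f) for f in runs))
```

Feeding this the `(random_key, entry)` pairs from the hashing step yields the entries in shuffled order while reading and writing each run sequentially, avoiding the per-entry random seeks of the naive approach.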