External shuffle: shuffling a large amount of data that does not fit in memory


Question

I am looking for a way to shuffle a large amount of data (approx. 40 GB) that does not fit into memory.

I have around 30 million entries of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data, which does not fit in RAM.

The only solution I have thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm, and then copy the entries into a new file in that order. Unfortunately, this solution involves one random seek per entry, and so would be very slow.
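For reference, the approach described above could look roughly like the following Python sketch. The function name `shuffle_by_seeking`, the `offsets` list of `(start, end)` byte positions, and the convention that `end` is exclusive are assumptions made for illustration; the question does not specify them.

```python
import random


def shuffle_by_seeking(src_path, dst_path, offsets, seed=None):
    """Copy entries from src_path to dst_path in a random order.

    offsets: list of (start, end) byte positions, one pair per entry,
    with end assumed to be exclusive.
    Returns the permutation of entry indices that was used.
    """
    rng = random.Random(seed)
    order = list(range(len(offsets)))

    # Fisher-Yates: walk backwards, swapping each slot with a random
    # slot at or before it.
    for i in range(len(order) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        order[i], order[j] = order[j], order[i]

    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for idx in order:
            start, end = offsets[idx]
            src.seek(start)  # one random seek per entry: the slow part
            dst.write(src.read(end - start))
    return order
```

The index array itself is small (30 million integers), so the shuffle step is cheap. The cost is in the copy loop, which jumps to an unrelated position in the source file for every entry.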

Is there a better way to shuffle a large amount of data with a uniform distribution?

Recommended answer

First, get the shuffle problem out of the way. Do this by devising a hash function for your entries that produces random-looking results, then run a normal external sort on the hash values.
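As a small illustration of the idea, sorting by a random-looking key yields a shuffled order. The sketch below is one possible choice, not the answer's own method: it hashes each entry's index (rather than its content, so that identical entries do not receive identical keys) with a keyed BLAKE2b from the standard library. The names `shuffle_key` and `salt` are hypothetical, and how close the result is to a uniform shuffle depends on the quality of the hash.

```python
import hashlib


def shuffle_key(entry_id, salt):
    """Map an entry index to a pseudo-random 64-bit integer sort key."""
    digest = hashlib.blake2b(
        entry_id.to_bytes(8, "big"),  # hash the index, not the content
        key=salt,                     # a different salt gives a different order
        digest_size=8,
    ).digest()
    return int.from_bytes(digest, "big")


# Hypothetical salt; something like os.urandom(16) would give a fresh order.
salt = b"change-me-per-shuffle"
ids = list(range(10))

# In memory, "sort by hash" is just an ordinary sort with a key function.
shuffled = sorted(ids, key=lambda i: shuffle_key(i, salt))
```

Here the sort happens in memory only to show the principle; for the 40 GB file, the same keys would be fed to an external sort instead.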

Now that you have transformed your shuffle into a sort, your problem becomes finding an efficient external sort algorithm that fits your budget and memory limits. That should now be as easy as a Google search.
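One way such an external sort might be arranged is a classic two-phase merge sort, sketched below under several assumptions: entries are listed in `offsets` in file order (so the first pass reads the source sequentially), keys are drawn from a random number generator instead of a hash, and the names `external_shuffle`, `chunk_bytes`, and the run-file layout are invented for this example. Ties between 64-bit keys are possible in principle but very unlikely.

```python
import heapq
import os
import random
import struct
import tempfile

HEADER = struct.Struct("<QI")  # 64-bit random key, payload length in bytes


def _read_run(path):
    """Yield (key, payload) records from one sorted run file, in order."""
    with open(path, "rb") as f:
        while True:
            head = f.read(HEADER.size)
            if not head:
                return
            key, length = HEADER.unpack(head)
            yield key, f.read(length)


def external_shuffle(src_path, dst_path, offsets, chunk_bytes=1 << 30, seed=None):
    """Shuffle entries by sorting them on random keys with bounded memory."""
    rng = random.Random(seed)
    tmp_dir = tempfile.mkdtemp()
    run_paths = []

    def flush(chunk):
        # Sort one memory-sized chunk by key and write it out as a run.
        chunk.sort(key=lambda rec: rec[0])
        path = os.path.join(tmp_dir, "run%d.bin" % len(run_paths))
        with open(path, "wb") as f:
            for key, payload in chunk:
                f.write(HEADER.pack(key, len(payload)))
                f.write(payload)
        run_paths.append(path)

    # Phase 1: read the source front to back, tagging each entry with a key.
    chunk, used = [], 0
    with open(src_path, "rb") as src:
        for start, end in offsets:
            src.seek(start)
            payload = src.read(end - start)
            chunk.append((rng.getrandbits(64), payload))
            used += len(payload)
            if used >= chunk_bytes:
                flush(chunk)
                chunk, used = [], 0
    if chunk:
        flush(chunk)

    # Phase 2: k-way merge of the sorted runs; every file is read sequentially.
    runs = [_read_run(p) for p in run_paths]
    with open(dst_path, "wb") as dst:
        for _key, payload in heapq.merge(*runs, key=lambda rec: rec[0]):
            dst.write(payload)

    for p in run_paths:
        os.remove(p)
    os.rmdir(tmp_dir)
```

With 40 GB of data and the default 1 GiB chunk, this would produce about 40 run files, all of which are open at once during the merge. The data is read twice and written twice, but always sequentially, which is what avoids the per-entry seeks of the index-shuffling approach.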
