Java: fastest way to do random reads on huge disk file(s)

Question

I've got a moderately big data set, about 800 MB, that is basically one big precomputed table. I need it to speed up some computation by several orders of magnitude. (Creating that file took several multicore machines days, using an optimized, multi-threaded algorithm, so I really do need that file.)

Now that it has been computed once, that 800 MB of data is read-only.

I cannot hold it in memory.

As of now it is one huge 800 MB file, but splitting it into smaller files isn't a problem if that helps.

I need to read about 32 bits of data here and there in that file, many times over. I don't know beforehand where I'll need to read: the reads are uniformly distributed.

What would be the fastest way in Java to do my random reads in such a file or files? Ideally I should be doing these reads from several unrelated threads (but I could queue the reads in a single thread if needed).

Is Java NIO the way to go?

I'm not familiar with memory-mapped files: I think I don't want to map the 800 MB in memory.

All I want is the fastest random reads I can get to access these 800MB of disk-based data.

By the way, in case people wonder, this is not at all the same as the question I asked not long ago:

Java: fast disk-based hash set

Recommended answer

800MB is not that much to load up and store in memory. If you can afford to have multicore machines ripping away at a data set for days on end, you can afford an extra GB or two of RAM, no?
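If the extra RAM is available, the in-memory option could look roughly like the sketch below. The class name InMemoryTable is hypothetical, and the code assumes the file is a flat sequence of big-endian 32-bit values and that the JVM heap is sized comfortably above the file size (for example with -Xmx2g).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: the whole precomputed table held on the Java heap. */
final class InMemoryTable {
    private final IntBuffer values;

    InMemoryTable(Path file) throws IOException {
        // Reads the entire file into one byte array (works for files under ~2 GB).
        byte[] bytes = Files.readAllBytes(file);
        // Assumption: values are stored big-endian, which is ByteBuffer's default order.
        values = ByteBuffer.wrap(bytes).asIntBuffer();
    }

    /** Returns the 32-bit value at the given entry index (not byte offset). */
    int get(int index) {
        return values.get(index); // absolute read, does not move any position
    }

    /** Number of 32-bit entries in the table. */
    int size() {
        return values.limit();
    }
}
```

Lookups here never touch the disk after startup, at the cost of a load time proportional to the file size.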

That said, read up on Java's java.nio.MappedByteBuffer. Your comment "I think I don't want to map the 800 MB in memory" suggests the concept is not clear yet.

In a nutshell, a mapped byte buffer (MBB) lets you access the data programmatically as if it were in memory, although it may actually be on disk or in memory. That is for the OS to decide, as Java's MBB is based on the OS's virtual memory subsystem. It is also nice and fast. You can also read from a single MBB from multiple threads, as long as each thread uses absolute reads such as getInt(int index) or works on its own duplicate(): the buffer's position is shared state, so relative reads on one shared buffer are not thread-safe.
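A note on the multi-threading point: the ByteBuffer documentation says buffers are not safe for use by multiple concurrent threads, mainly because position and limit are mutable state. A minimal sketch of one way around that is shown below: each thread gets its own duplicate() view (shared content, independent position) and uses absolute getInt reads. The class and method names are hypothetical, and the same code works for a MappedByteBuffer since it is a ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical demo: several threads doing random 32-bit reads on one shared buffer. */
final class ConcurrentReads {
    /** Each thread sums readsPerThread randomly chosen ints; returns the grand total. */
    static long sumRandomInts(ByteBuffer shared, int threads, int readsPerThread)
            throws Exception {
        int intCount = shared.limit() / Integer.BYTES;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                // Create the per-thread view on the calling thread.
                // duplicate() does not inherit the byte order, so set it again.
                ByteBuffer view = shared.duplicate().order(shared.order());
                Random random = new Random(t);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = 0; i < readsPerThread; i++) {
                        int byteOffset = random.nextInt(intCount) * Integer.BYTES;
                        sum += view.getInt(byteOffset); // absolute read, position untouched
                    }
                    return sum;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

With absolute reads alone the duplicates are arguably belt and braces, but they keep things safe if someone later adds a relative get().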

Here are the steps I recommend you take:

  1. Instantiate a MappedByteBuffer that maps your data file. Creating the mapping is fairly expensive, so keep it around.
  2. In your lookup method:
     1. Instantiate a byte[4] array.
     2. Position the buffer at the offset you want, then call .get(byte[] dst, int offset, int length). This is a relative read from the buffer's current position; offset and length refer to the dst array, not to the file.
     3. The byte array now holds your data, which you can turn into a value.
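The lookup steps above can be sketched as follows. This is a minimal sketch, not the answerer's exact code: the class name MappedTable is hypothetical, the file is assumed to hold big-endian 32-bit values, and the byte[4] plus relative get() is swapped for the absolute getInt(int index), which does the same job without touching the buffer's position.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: read-only table of 32-bit values backed by a memory-mapped file. */
final class MappedTable {
    private final MappedByteBuffer buffer;

    MappedTable(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Step 1: map the file once and keep the buffer around.
            // A single mapping is limited to 2 GB; the mapping outlives the channel.
            buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
        buffer.order(ByteOrder.BIG_ENDIAN); // assumption about the file layout
    }

    /** Step 2: read the 32-bit value that starts at the given byte offset. */
    int readInt(int byteOffset) {
        return buffer.getInt(byteOffset); // absolute read; the OS pages data in on demand
    }

    public static void main(String[] args) throws IOException {
        // Tiny demo file standing in for the real 800 MB table.
        Path file = Files.createTempFile("table", ".bin");
        file.toFile().deleteOnExit();
        ByteBuffer data = ByteBuffer.allocate(4 * Integer.BYTES);
        for (int i = 0; i < 4; i++) {
            data.putInt(i * 10);
        }
        Files.write(file, data.array());

        MappedTable table = new MappedTable(file);
        System.out.println(table.readInt(2 * Integer.BYTES)); // prints 20
    }
}
```

Mapping the file does not load 800 MB up front: pages are faulted in as they are touched, and the OS page cache decides what stays resident.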


And presto! You have your data!

I'm a big fan of MBBs and have used them successfully for such tasks in the past.
