Advice on handling large data volumes


Question

I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process all of it sequentially at least once.

Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and faster to load.

Should I load everything into memory all at once?
If not, what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?

Recommended answer


So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?

I'm a big fan of memory-mapped I/O, aka 'direct byte buffers'. In Java they are called mapped byte buffers (class MappedByteBuffer) and are part of java.nio. Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.
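As a rough illustration of the mechanism described above, here is a minimal sketch that maps one file read-only and scans it byte by byte. The class name MappedRead and the line-counting task are made up for this example, and it assumes the file is smaller than 2 GB, since a single FileChannel.map call cannot cover more than Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of the original answer.
class MappedRead {
    /** Counts '\n' bytes in a file by mapping it read-only (assumes the file is under 2 GB). */
    static long countLines(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The OS pages the file contents in on demand; nothing is copied up front.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long lines = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    lines++;
                }
            }
            return lines;
        }
    }
}
```

For the "jumping around" case, the same buffer also supports absolute reads such as buffer.get(index), so a mapping can be created once per file and then accessed at arbitrary offsets without reopening the file.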

I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All too frequently, they know what is best better than we lowly programmers do. ;)

How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results.
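One way the "one MBB per file, keep only the results" idea might look is sketched below. It assumes the ASCII files have already been converted to raw big-endian 8-byte doubles (the layout DataOutputStream.writeDouble produces), that each file is under 2 GB, and that the per-file result is a simple sum; the class and method names are invented for the example.

```java
import java.io.IOException;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; assumes binary files of big-endian doubles.
class PerFileSums {
    /** Maps one file and sums every double in it. */
    static double sumDoubles(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A mapped buffer is big-endian by default, matching DataOutputStream output.
            DoubleBuffer values = channel
                    .map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                    .asDoubleBuffer();
            double sum = 0.0;
            while (values.hasRemaining()) {
                sum += values.get();
            }
            return sum;
        }
    }

    /** Processes each file in turn and keeps only the per-file result in memory. */
    static Map<Path, Double> sumAll(List<Path> files) throws IOException {
        Map<Path, Double> results = new LinkedHashMap<>();
        for (Path file : files) {
            results.put(file, sumDoubles(file));
        }
        return results;
    }
}
```

Only the small results map lives on the Java heap; the file contents stay in the OS page cache and are paged in as the loop advances.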

BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine, as the MBB implementation is dependent on the memory space addressable by the platform architecture. A 64-bit machine & OS will take you to 1TB or 128TB of mappable data.
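Separately from the platform limits above, a single FileChannel.map call accepts at most Integer.MAX_VALUE bytes, so files larger than about 2 GB have to be mapped in pieces even on a 64-bit JVM. The sketch below walks a file in fixed-size windows; it again assumes a binary file of big-endian doubles, and the class name, method name, and chunkBytes parameter are illustrative choices rather than anything from the original answer.

```java
import java.io.IOException;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; assumes a binary file of big-endian doubles.
class ChunkedMapping {
    /** Sums all doubles in a file, mapping at most chunkBytes of it at a time. */
    static double sumDoubles(Path file, long chunkBytes) throws IOException {
        if (chunkBytes <= 0 || chunkBytes > Integer.MAX_VALUE
                || chunkBytes % Double.BYTES != 0) {
            throw new IllegalArgumentException(
                    "chunkBytes must be a positive multiple of 8 no larger than Integer.MAX_VALUE");
        }
        double sum = 0.0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long position = 0; position < size; position += chunkBytes) {
                // The last window may be shorter than chunkBytes.
                long length = Math.min(chunkBytes, size - position);
                DoubleBuffer values = channel
                        .map(FileChannel.MapMode.READ_ONLY, position, length)
                        .asDoubleBuffer();
                while (values.hasRemaining()) {
                    sum += values.get();
                }
            }
        }
        return sum;
    }
}
```

Keeping the window size a multiple of 8 means no double straddles two windows. A call such as ChunkedMapping.sumDoubles(path, 1L << 30) would use 1 GiB windows.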

If you are thinking about performance, then look up Kirk Pepperdine (a somewhat famous Java performance guru). He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details (see its NIO Performance Tips) and other Java performance-related material.

