Reading zip files efficiently in Java

Question

I am working on a project that processes a very large amount of data. I have thousands of zip files, each containing ONE simple txt file of about 80k lines. What I am currently doing is the following:

for (File zipFile : dir.listFiles()) {
    ZipFile zf = new ZipFile(zipFile);
    ZipEntry ze = (ZipEntry) zf.entries().nextElement();
    BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
    ...

This way I can read the file line by line, but it is definitely too slow. Given the large number of files and lines to be read, I need a more efficient way to read them.

I have looked for a different approach, but I haven't been able to find anything. I think I should use the Java NIO APIs, which are intended for intensive I/O operations, but I don't know how to use them with zip files.

Any help would be appreciated.

Thanks,

Marco

Recommended answer

"I have a lot (thousands) of zip files. The zipped files are about 30MB each, while the txt inside the zip file is about 60/70 MB. Reading and processing the files with this code takes a lot of hours, around 15, but it depends."

Let's do some back-of-the-envelope calculations.

Let's say you have 5000 files. If it takes 15 hours to process them, that works out to ~10 seconds per file. The files are about 30MB each, so the throughput is ~3MB/s.
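The estimate above can be written out as a small calculation. This is only a sketch: the class and method names are made up for illustration, the 5000-file count is the assumption made in this answer, and the 15 hours and 30MB figures are taken from the question.

```java
public class ThroughputEstimate {

    /** Average seconds spent per file, given the total run time in hours. */
    static double secondsPerFile(double totalHours, int fileCount) {
        return totalHours * 3600.0 / fileCount;
    }

    /** Throughput in MB/s of compressed input, given the size of one file. */
    static double megabytesPerSecond(double fileSizeMb, double secondsPerFile) {
        return fileSizeMb / secondsPerFile;
    }

    public static void main(String[] args) {
        // 5000 files is an assumed count; 15 h and 30 MB come from the question.
        double perFile = secondsPerFile(15, 5000);
        double mbPerSecond = megabytesPerSecond(30, perFile);
        System.out.printf("%.1f s per file, %.1f MB/s%n", perFile, mbPerSecond);
    }
}
```

With these inputs the result is roughly 10.8 seconds per file and 2.8MB/s, which is where the rounded figures of ~10 seconds and ~3MB/s come from.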

That is one to two orders of magnitude slower than the rate at which ZipFile can decompress data.

Either there is a problem with the disks (are they local, or on a network share?), or the actual processing is taking most of the time.

The best way to find out for sure is to use a profiler.
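As a quick check before setting up a profiler, a rough timing of the read path on its own can hint at which side is slow. The sketch below is not from the question: the class and method names are hypothetical, it assumes each zip has at least one entry, and it assumes the text is UTF-8 (the question's code uses the platform default charset). It times raw decompression and plain line reading with no processing at all.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipReadTiming {

    /** Decompresses the first entry, discards the bytes, and returns the uncompressed size. */
    static long drainFirstEntry(File zip) throws IOException {
        try (ZipFile zf = new ZipFile(zip)) {
            ZipEntry entry = zf.entries().nextElement(); // assumes at least one entry
            try (InputStream in = zf.getInputStream(entry)) {
                byte[] buffer = new byte[64 * 1024];
                long total = 0;
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
                return total;
            }
        }
    }

    /** Reads the first entry line by line without processing it and returns the line count. */
    static long countLines(File zip) throws IOException {
        try (ZipFile zf = new ZipFile(zip);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     zf.getInputStream(zf.entries().nextElement()),
                     StandardCharsets.UTF_8))) { // UTF-8 is an assumption
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        File zip = new File(args[0]); // path to one sample zip file

        long start = System.nanoTime();
        long bytes = drainFirstEntry(zip);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("decompress only: %d bytes in %.2f s (%.1f MB/s uncompressed)%n",
                bytes, seconds, bytes / 1e6 / seconds);

        start = System.nanoTime();
        long lines = countLines(zip);
        seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("read lines only: %d lines in %.2f s%n", lines, seconds);
    }
}
```

If both timings are a small fraction of the ~10 seconds per file, the time is most likely going into the processing rather than the reading. Note that the second pass will usually hit the operating system's file cache, so the first timing is the more representative one for disk or network problems. A real profiler will still give a more reliable picture.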
