How to list a 2 million files directory in Java without having an "out of memory" exception


Question

I have to deal with a directory of about 2 million XML files that need to be processed.

I've already solved the processing itself, distributing the work between machines and threads using queues, and everything works fine.

But now the big problem is the bottleneck of reading the directory with the 2 million files in order to fill the queues incrementally.

I've tried using the File.listFiles() method, but it gives me a java.lang.OutOfMemoryError: Java heap space. Any ideas?

Solution

First of all, is there any possibility of using Java 7? There you have a FileVisitor and Files.walkFileTree, which should probably work within your memory constraints.
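
A minimal sketch of that approach; the directory path is a placeholder, and the println is a stand-in for handing each file off to your existing queue:

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class WalkExample {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/data/xml"); // placeholder for your 2-million-file directory
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // called once per file; no array of 2 million entries is ever built
                System.out.println(file); // stand-in: hand the file to your queue here
                return FileVisitResult.CONTINUE;
            }
        });
    }
}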

Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that captures the files to be processed along the way, perhaps putting them in a producer/consumer queue or writing the file names to disk for later traversal.
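
A sketch of that trick, with a placeholder path and an unbounded in-memory queue standing in for the question's producer/consumer setup (in the real setup, consumer threads would be draining it concurrently); note the update below for why this may still blow up:

import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FilterProducer {
    public static void main(String[] args) {
        final BlockingQueue<File> queue = new LinkedBlockingQueue<File>();
        File dir = new File("/data/xml"); // placeholder path
        dir.listFiles(new FileFilter() {
            public boolean accept(File pathname) {
                queue.offer(pathname); // capture each file as a side effect
                return false;          // reject everything, so the returned array stays empty
            }
        });
    }
}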

Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts file names of the form file0000000 to file0001000, then file0001000 to file0002000, and so on.
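
For example, a hypothetical FilenameFilter for one such chunk (the class name and bounds are illustrative, assuming lexicographically comparable names):

import java.io.File;
import java.io.FilenameFilter;

// accepts names in [from, to), e.g. new RangeFilter("file0000000", "file0001000")
class RangeFilter implements FilenameFilter {
    private final String from, to;
    RangeFilter(String from, String to) { this.from = from; this.to = to; }
    public boolean accept(File dir, String name) {
        return name.compareTo(from) >= 0 && name.compareTo(to) < 0;
    }
}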

If the files are not named in a nice way like this, you could try filtering them based on the hash code of the file name, which should be fairly evenly distributed over the set of integers.
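
Something like this hypothetical filter, run once per shard index (the shard count is an assumption):

import java.io.File;
import java.io.FilenameFilter;

// splits the directory into SHARDS buckets by name hash; do one pass per shard index
class HashShardFilter implements FilenameFilter {
    static final int SHARDS = 100; // assumed shard count
    private final int shard;
    HashShardFilter(int shard) { this.shard = shard; }
    public boolean accept(File dir, String name) {
        // mask off the sign bit so the modulo is never negative
        return (name.hashCode() & 0x7fffffff) % SHARDS == shard;
    }
}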


Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:

public File[] listFiles(FilenameFilter filter) {
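    // list() materializes the name of every file in one String[] before any filtering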
    String ss[] = list();
    if (ss == null) return null;
    ArrayList v = new ArrayList();
    for (int i = 0 ; i < ss.length ; i++) {
        if ((filter == null) || filter.accept(this, ss[i])) {
            v.add(new File(ss[i], this));
        }
    }
    return (File[])(v.toArray(new File[v.size()]));
}

so it will probably fail at the very first line, the call to list(), anyway... Sort of disappointing. I believe your best option is to put the files in different directories.

Btw, could you give an example of a file name? Are they "guessable"? Like

for (int i = 0; i < 100000; i++)
    tryToOpen(String.format("file%05d", i));
