Why does Lucene cause OOM when indexing large files?


Problem Description

I’m working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07). I’m consistently receiving OutOfMemoryError: Java heap space, when trying to index large text files.

Example 1: Indexing a 5 MB text file runs out of memory with a 64 MB max. heap size. So I increased the max. heap size to 512 MB. This worked for the 5 MB text file, but Lucene still used 84 MB of heap space to do this. Why so much?

The class FreqProxTermsWriterPerField appears to be the biggest memory consumer by far according to JConsole and the TPTP Memory Profiling plugin for Eclipse Ganymede.

Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB max. heap size. Increasing the max. heap size to 1024 MB works, but Lucene uses 826 MB of heap space while doing it. That still seems like far too much memory for the task. I’m sure even larger files would trigger the error, since memory use appears to scale with file size.

I’m on a Windows XP SP2 platform with 2 GB of RAM. So what is the best practice for indexing large files? Here is a code snippet that I’m using:

// Index the content of a text file.
private Boolean saveTXTFile(File textFile, Document textDocument) throws MyException {
    try {
        Boolean isFile = textFile.isFile();
        Boolean hasTextExtension = textFile.getName().endsWith(".txt");

        if (isFile && hasTextExtension) {
            System.out.println("File " + textFile.getCanonicalPath() + " is being indexed");
            Reader textFileReader = new FileReader(textFile);
            if (textDocument == null)
                textDocument = new Document();
            textDocument.add(new Field("content", textFileReader));
            indexWriter.addDocument(textDocument);   // BREAKS HERE!!!!
        }
    } catch (FileNotFoundException fnfe) {
        System.out.println(fnfe.getMessage());
        return false;
    } catch (CorruptIndexException cie) {
        throw new MyException("The index has become corrupt.");
    } catch (IOException ioe) {
        System.out.println(ioe.getMessage());
        return false;
    }
    return true;
}


Recommended Answer

In response to Gandalf's comment:

I can see you are setting the mergeFactor to 1000.

The API docs say:


setMergeFactor

public void setMergeFactor(int mergeFactor)

Determines how often segment indices are merged by addDocument(). With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained.

In short, the higher you set the mergeFactor, the more RAM is used during indexing.

What I would suggest is to set it to something like 15 or so (found on a trial-and-error basis), complemented with setRAMBufferSizeMB. Also call commit(), then optimize(), and then close() on the IndexWriter object (you could wrap all of these calls in one method, perhaps in a JavaBean), and call that method when you are closing the index.

Post back with your results and feedback =]

