Incremental indexing in Lucene


Problem Description

I'm building an application in Java with Lucene 3.6 and want to make the indexing incremental. I have already created the index, and I have read that what you have to do is open the existing index and, for each document, compare the indexed modification date with the file's modification date; if they differ, delete the document from the index and re-add it. My problem is that I don't know how to do that with Lucene in Java.

Thanks.

My code is:

public static void main(String[] args) 
    throws CorruptIndexException, LockObtainFailedException,
           IOException {

    File docDir = new File("D:\\PRUEBASLUCENE");
    File indexDir = new File("C:\\PRUEBA");

    Directory fsDir = FSDirectory.open(indexDir);
    Analyzer an = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriter indexWriter
        = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED);


    long numChars = 0L;
    for (File f : docDir.listFiles()) {
        String fileName = f.getName();
        Document d = new Document();
        d.add(new Field("Name",fileName,
                        Store.YES,Index.NOT_ANALYZED));
        d.add(new Field("Path",f.getPath(),Store.YES,Index.ANALYZED));
        long tamano = f.length();
        d.add(new Field("Size",""+tamano,Store.YES,Index.ANALYZED));
        long fechalong = f.lastModified();
        d.add(new Field("Modification_Date",""+fechalong,Store.YES,Index.ANALYZED));
        numChars += tamano;   // keep a running total of bytes indexed
        indexWriter.addDocument(d);
    }

    indexWriter.optimize();
    int numDocs = indexWriter.numDocs();   // must be read before close()
    indexWriter.close();

    System.out.println("Index Directory=" + indexDir.getCanonicalPath());
    System.out.println("Doc Directory=" + docDir.getCanonicalPath());
    System.out.println("num docs=" + numDocs);
    System.out.println("num chars=" + numChars);

}

Thanks Edmondo1984, you are helping me a lot.

In the end I wrote the code shown below: it stores a hash of the file and then checks the modification date.

Indexing 9300 files takes 15 seconds, and re-indexing (when nothing has changed, so no file should need touching) also takes 15 seconds. Am I doing something wrong, or can the code be optimized to take less time?

Thanks jtahlborn, with your suggestion I managed to bring the IndexReader times for creating and updating level with each other. But isn't updating an existing index supposed to be faster than recreating it? Is it possible to optimize the code further?

if(IndexReader.indexExists(dir))
{
    //reader is an IndexReader and is passed as a parameter to the function
    //searcher is an IndexSearcher and is passed as a parameter to the function
    term = new Term("Hash", String.valueOf(file.hashCode()));
    Query termQuery = new TermQuery(term);
    TopDocs topDocs = searcher.search(termQuery, 1);
    if(topDocs.totalHits == 1)
    {
        int docId = topDocs.scoreDocs[0].doc;
        Document doc = reader.document(docId);
        //the index stores File.lastModified() as a long (epoch millis)
        long dateIndLong = Long.parseLong(doc.get("Modification_Date"));
        long dateFichLong = file.lastModified();
        //truncate both to minute resolution before comparing
        long indexedMinutes = dateIndLong / 60000L;
        long fileMinutes = dateFichLong / 60000L;
        if(fileMinutes == indexedMinutes)
        {
            //unchanged: do nothing
            flag = 2;
        }
        else if(fileMinutes > indexedMinutes)
        {
            //file is newer: updateDocument
            flag = 1;
        }
    }
    else
    {
        //not in the index yet: addDocument
        flag = 0;
    }
}
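One caveat about the snippet above: in Java, `File.hashCode()` hashes the file's *path* string, not its contents, so it identifies the file but never changes when the contents do. If the goal is to detect real content changes, a content hash (e.g. SHA-256 over the file bytes) would do it. A minimal sketch; the class and method names here are mine, not from the question:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentHash {
    // SHA-256 of a byte array, rendered as lowercase hex.
    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

To hash a file, read its bytes (e.g. with a `FileInputStream`) and pass them to `sha256Hex`; the hex string can then be stored in the "Hash" field instead of `file.hashCode()`.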


Answer

According to the Lucene data model, you store documents inside the index. Inside each document you have the fields you want to index, the so-called "analyzed" fields, and the fields that are not analyzed, where you can store a timestamp and other information you might need later.

I have the feeling there is some confusion between files and documents, because in your first post you speak about documents and now you are trying to call IndexFileNames.isDocStoreFile(file.getName()), which actually only tells you whether a file is one of the files making up a Lucene index.

If you understand the Lucene object model, writing the code you need takes approximately three minutes:


  • You must check whether the document already exists in the index (for example, by storing a non-analyzed field containing a unique identifier), simply by querying Lucene.

  • If your query returns 0 documents, add the new document to the index.

  • If your query returns 1 document, retrieve its timestamp field and compare it with the timestamp of the new document you want to store. If necessary, use the document's docId to delete it from the index and add the new one.
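The three steps above boil down to one small decision: given the timestamp stored in the index (or none, if the unique-id query returned no hit) and the file's current timestamp, choose between adding, skipping, and updating. A minimal sketch of just that decision, with the Lucene query and writer calls left out (the enum and method names are mine, not Lucene API):

```java
// Decide what to do with a file during an incremental indexing pass.
public class IncrementalDecision {
    public enum Action { ADD, SKIP, UPDATE }

    /**
     * @param indexedModified lastModified stored in the index, or null if
     *                        the unique-id query returned no document
     * @param fileModified    current File.lastModified() of the file on disk
     */
    public static Action decide(Long indexedModified, long fileModified) {
        if (indexedModified == null) {
            return Action.ADD;      // not indexed yet -> IndexWriter.addDocument
        }
        if (indexedModified == fileModified) {
            return Action.SKIP;     // unchanged -> nothing to do
        }
        return Action.UPDATE;       // changed -> IndexWriter.updateDocument
    }
}
```

ADD and UPDATE then map onto `IndexWriter.addDocument(doc)` and `IndexWriter.updateDocument(term, doc)` respectively.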

If on the other hand you are sure that you always want to overwrite the previous value, you can refer to this snippet from Lucene in Action:

public void testUpdate() throws IOException {
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    doc.add(new Field("id", "1",
                      Field.Store.YES,
                      Field.Index.NOT_ANALYZED));
    doc.add(new Field("country", "Netherlands",
                      Field.Store.YES,
                      Field.Index.NO));
    doc.add(new Field("contents",
                      "Den Haag has a lot of museums",
                      Field.Store.NO,
                      Field.Index.ANALYZED));
    doc.add(new Field("city", "Den Haag",
                      Field.Store.YES,
                      Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "1"), doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}

As you can see, the snippet uses a non-analyzed ID, as I was suggesting, to keep a simple queryable attribute, and the updateDocument method to first delete and then re-add the document.
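That delete-then-re-add behaviour can be modeled on a plain map keyed by the unique id. This is purely illustrative, not Lucene API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Models updateDocument semantics: remove whatever matched the term,
// then add the new document, so at most one document per id survives.
public class UpdateSemantics {
    private final Map<String, String> docsById = new LinkedHashMap<>();

    public void addDocument(String id, String contents) {
        docsById.put(id, contents);
    }

    // Analogous to IndexWriter.updateDocument(new Term("id", id), doc)
    public void updateDocument(String id, String contents) {
        docsById.remove(id);        // delete the old document, if any
        docsById.put(id, contents); // then add the new one
    }

    public String get(String id) { return docsById.get(id); }
    public int numDocs() { return docsById.size(); }
}
```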

You may also want to check the javadoc directly at

http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,org.apache.lucene.document.Document)
