How to avoid duplicate document indexing in Lucene 6.0


Problem description

I am creating a Lucene index from values fetched from a database. I have set the index open mode to OpenMode.CREATE_OR_APPEND.

The index creation step is part of a Spring Batch job.

My understanding is that the first run of the job may take a while to index, but rerunning it against the same unchanged source data should be fast, because the documents already exist and no UPDATE or INSERT needs to be performed.

In my case, however, subsequent indexing runs against the same unchanged source data get slower and slower.

An answer to this question says that this is handled automatically based on a term.

I am not sure how to define the term in my case to handle this.

Below is my sample code:

        public Integer createIndex(IndexWriter writer, String str, LuceneIndexerInputVO luceneInputVO) throws Exception {
            int count = 0;
            // Index the field values as single, untokenized terms.
            txtFieldType.setTokenized(false);
            strFieldType.setTokenized(false);

            // Fetch the rows to be indexed.
            List<IndexVO> indexVO = jdbcTemplate.
                    query(Constants.SELECT_FROM_TABLE1, 
                            new Object[] {luceneInputVO.getId1(), luceneInputVO.getId2(), str}, 
                            new IndexRowMapper());

            // Add one Document per row returned by the query.
            for (IndexVO row : indexVO) {
                Document d = new Document();
                d.add(getStringField(Constants.ID, String.valueOf(luceneInputVO.getId())));
                .....
                ....
                writer.addDocument(d);
                count++;
            }
            return count;
        }

What should I change in the code above so that no indexing is performed when the source data has not changed?

I am a beginner with Lucene and am not sure how to define the Term that would determine whether a document is a duplicate.

I don't want the index to be recreated; if exactly the same Document already exists in the index, I want the new Document to be skipped (nothing done).

EDIT - I asked a long question, but after reading a few Lucene-related questions on SO, I realize that I am simply asking for an incremental indexing approach that avoids duplicates, given that each document represents a row of an RDBMS table with a primary key. If a DB row has changed, update its document; otherwise leave it alone; and add documents for new rows.


Recommended answer

I have verified that in Lucene 6.0.0, IndexWriter.updateDocument(Term term, Document doc) adds a new Document if no matching document exists, and updates the existing Document if one is found for the given term.

For my requirement, I defined a key field that is essentially a concatenation of all the other value fields of the Document. The key therefore identifies content-wise duplicates: two documents with the same key have the same content.
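One way such a key could be built is sketched below in plain Java. The class name ContentKey and the separator are assumptions made for illustration, not part of the original answer. A separator is used because a bare concatenation cannot distinguish "ab" + "c" from "a" + "bc".

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper (not from the original answer): builds a content key
// from the value fields of one row.
public class ContentKey {
    // Assumption: the ASCII unit separator never occurs inside a field value.
    private static final String SEP = "\u001F";

    // Joins the field values in order; null values are treated as empty strings.
    static String of(String... fieldValues) {
        return Arrays.stream(fieldValues)
                .map(v -> v == null ? "" : v)
                .collect(Collectors.joining(SEP));
    }
}
```

Two rows with identical values in the same order yield the same key, and any changed value yields a different key. Lucene limits the length of a single indexed term, so for very wide rows a hash of the joined string may be a safer key than the string itself.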

I construct the Term passed to IndexWriter.updateDocument(Term term, Document doc) from this key value. Simply calling IndexWriter.updateDocument(Term term, Document doc) instead of IndexWriter.addDocument(Document doc) solves the issue.
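A minimal sketch of that call against Lucene 6.x follows. The class name, the field names "key" and "name", the sample values, and the in-memory RAMDirectory are assumptions made for illustration; the original answer does not show its code. It indexes the same row twice and counts the live documents.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical sketch: field names and values are illustrative only.
public class UpsertByKeySketch {

    // Replaces any document whose "key" term equals the given key, or adds one if none exists.
    static void upsert(IndexWriter writer, String key, String name) throws IOException {
        Document doc = new Document();
        // StringField is indexed as a single untokenized term, so the Term below matches it exactly.
        doc.add(new StringField("key", key, Field.Store.YES));
        doc.add(new StringField("name", name, Field.Store.YES));
        writer.updateDocument(new Term("key", key), doc);
    }

    // Indexes the same row twice and returns the number of live documents.
    static int indexSameRowTwice() throws IOException {
        Directory dir = new RAMDirectory(); // in-memory index, for the sketch only
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            upsert(writer, "42|alice", "alice");
            upsert(writer, "42|alice", "alice"); // same key: replaces the first document
            writer.commit();
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.numDocs();
        }
    }
}
```

Note that updateDocument first deletes the documents matching the term and then adds the new one, so an unchanged row is still rewritten rather than skipped; the index simply does not accumulate duplicates.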
