How to avoid duplicate document indexing in Lucene 6.0

Views: 429 · Posted: 2020/5/4 7:49:09 · Tags: java, lucene, spring-batch


Question

I am creating a Lucene index for values fetched from a database. I have set the index OpenMode to OpenMode.CREATE_OR_APPEND.

The index creation step is part of a Spring Batch job.

My understanding is that the first run of the job may take a while to index, but rerunning the job against the same unchanged source data should be fast: the documents are already there, so no UPDATE or INSERT needs to be performed.

In my case, however, each subsequent indexing run over the same unchanged source data gets slower and slower.

The answer to this question says that it will be handled automatically based on a term.

I am not sure how to define the term in my case to handle this.

Below is my sample code:

        public Integer createIndex(IndexWriter writer, String str, LuceneIndexerInputVO luceneInputVO) throws Exception {
            int count = 0;
            txtFieldType.setTokenized(false);
            strFieldType.setTokenized(false);

            // Fetch the rows to index for this input.
            List<IndexVO> indexVO = jdbcTemplate.
                    query(Constants.SELECT_FROM_TABLE1,
                            new Object[] {luceneInputVO.getId1(), luceneInputVO.getId2(), str},
                            new IndexRowMapper());

            // Build and add one Document per fetched row.
            for (IndexVO row : indexVO) {
                Document d = new Document();
                d.add(getStringField(Constants.ID, String.valueOf(luceneInputVO.getId())));
                .....
                ....
                // addDocument always appends; it does not check for an existing document with the same ID.
                writer.addDocument(d);
                count++;
            }
            return count;
        }

What should I change in the above code so that no indexing is performed when the source data has not changed?

I am a beginner with Lucene and am not sure how to define the Term that would decide whether a document is a duplicate.
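One way to read the "term" here: Lucene's IndexWriter.addDocument always appends, whereas IndexWriter.updateDocument(Term, doc) first deletes every document containing the given term and then adds the new one. If the ID field holds a value that is unique per row and is indexed untokenized, a Term on that field can serve as the duplicate key. The sketch below is not Lucene code. It is a small in-memory model (the class KeyedIndexSketch and its methods are hypothetical) that only illustrates why reruns with append semantics grow the index while a keyed update does not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for an index; it does not use Lucene.
final class KeyedIndexSketch {
    // Each "document" is modelled as a two-element array: {id, body}.
    private final List<String[]> docs = new ArrayList<>();

    // Models append semantics: the document is stored without any lookup.
    void add(String id, String body) {
        docs.add(new String[] {id, body});
    }

    // Models keyed-update semantics: delete every document with this id,
    // then store the new one.
    void update(String id, String body) {
        docs.removeIf(d -> d[0].equals(id));
        docs.add(new String[] {id, body});
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        KeyedIndexSketch appended = new KeyedIndexSketch();
        KeyedIndexSketch keyed = new KeyedIndexSketch();
        // Two runs of the same job over the same two unchanged rows.
        for (int run = 0; run < 2; run++) {
            appended.add("1", "row one");
            appended.add("2", "row two");
            keyed.update("1", "row one");
            keyed.update("2", "row two");
        }
        System.out.println(appended.size() + " vs " + keyed.size()); // prints "4 vs 2"
    }
}
```

In the code above, the corresponding change would presumably be to call writer.updateDocument(new Term(Constants.ID, idValue), d) in place of writer.addDocument(d), where idValue is a placeholder for the row's unique key and Constants.ID is assumed to be the field that identifies a row. Note that a keyed update still rewrites the document on every run; it prevents duplicates but does not by itself skip unchanged rows.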

I don't want the indices to be recreated, and I want a new Document to be skipped (nothing done) if exactly the same Document already exists in the index.

EDIT - I asked a long question, but after reading a few Lucene-related questions on SO, I realize that I am simply asking for an incremental indexing approach that avoids duplicates, given that each document represents a row of an RDBMS table with a primary key. If a DB row has changed, update its document, otherwise leave it alone; and add documents for new rows.
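The incremental approach described in this edit can be sketched as a per-row decision: add when the primary key is new, update when the row content has changed, skip otherwise. The sketch below assumes a content fingerprint (SHA-256 here) is kept per primary key. IncrementalIndexPlanner, Action, and the in-memory map are hypothetical names for illustration only; a real job would need to persist the fingerprints between runs, for example as a stored field in each document or in a database column.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical planner: decides per row whether indexing work is needed.
final class IncrementalIndexPlanner {
    enum Action { ADD, UPDATE, SKIP }

    // Primary key -> fingerprint of the row content that was last indexed.
    private final Map<String, String> indexed = new HashMap<>();

    Action plan(String primaryKey, String rowContent) {
        String current = fingerprint(rowContent);
        String previous = indexed.get(primaryKey);
        if (previous == null) {
            indexed.put(primaryKey, current);
            return Action.ADD;      // key never seen: new row
        }
        if (previous.equals(current)) {
            return Action.SKIP;     // same key, same content: nothing to do
        }
        indexed.put(primaryKey, current);
        return Action.UPDATE;       // same key, different content
    }

    // Hex-encoded SHA-256 digest of the row content.
    static String fingerprint(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        IncrementalIndexPlanner planner = new IncrementalIndexPlanner();
        System.out.println(planner.plan("1", "alice")); // first time this key is seen
        System.out.println(planner.plan("1", "alice")); // unchanged row
        System.out.println(planner.plan("1", "alicia")); // changed row
    }
}
```

In a Lucene-backed job, ADD and UPDATE could both map to IndexWriter.updateDocument keyed on the primary-key term, while SKIP would write nothing, which is what keeps reruns over unchanged data cheap.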

Question 1, Question 2
