Approach to Incrementally Index Database Data from Multi-Table Join in Lucene with No Unique Key

Question

I have a particular SQL join such that:

select DISTINCT ... 100 columns
from ... 10 tables, some left joins

Currently I export the result of this query to XML using Toad (I'll query it straight from Java later). I use Java to parse the XML file, and I use Lucene (Java) to index it and to search the Lucene index. This works great: I get results 6-10 times faster than querying it from the database.
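
For context, a minimal sketch of that pipeline might look like the following (assuming a Lucene 5+ style API; the file name rows.xml, the one-&lt;row&gt;-element-per-result-row layout with one flat child element per column, and the index directory name are assumptions for illustration, not part of the original setup):

import java.io.FileInputStream;
import java.nio.file.Paths;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class XmlToLuceneIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("lucene-index")), config)) {
            // Stream through the exported XML rather than loading it all into memory.
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("rows.xml"));
            Document doc = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("row".equals(xml.getLocalName())) {
                        doc = new Document();                // start a new document per exported row
                    } else if (doc != null) {
                        // Each column element becomes one stored, not-tokenized field.
                        doc.add(new StringField(xml.getLocalName(), xml.getElementText(), Field.Store.YES));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "row".equals(xml.getLocalName()) && doc != null) {
                    writer.addDocument(doc);                 // one Lucene document per row
                    doc = null;
                }
            }
            xml.close();
        }
    }
}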

I need to think of a way to incrementally update this index when the data in the database changes.

Because I am joining tables (especially left joins) I'm not sure I can get a unique business key combination to do an incremental update. On the other hand, because I am using DISTINCT, I know that every single field is a unique combination. Given this information, I thought I could put the hashCode of a document as a field of the document, and call updateDocument on the IndexWriter like this:

public static void addDoc(IndexWriter w, Row row) throws IOException {
    // Row is simply a Java representation of a single row from the above query
    Document document = new Document();
    document.add(new StringField("fieldA", row.fieldA, Field.Store.YES));
    ...
    String hashCode = String.valueOf(document.hashCode());
    document.add(new StringField("HASH", hashCode, Field.Store.YES));
    w.updateDocument(new Term("HASH", hashCode), document);
}

Then I realized that updateDocument was actually deleting the document with the matching hash code and adding the identical document again, so this wasn't of any use.

What is the way to approach this?

Recommended answer

If you increment an id on each relevant update of your source DB tables, and if you log these ids on record deletion, you should then be able to list the deleted, updated, and new records of the data being indexed.

This step might be performed within a transitory table, itself extracted into the XML file used as input to Lucene.
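
On the Lucene side, that logged id (rather than a hash of the document contents) then becomes the key passed to updateDocument and deleteDocuments. A minimal sketch, assuming the question's Row class is extended with a hypothetical id field carrying the logged record/change id, and that deletions arrive as a list of those ids:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IncrementalUpdater {
    // Applies one batch of changes pulled from the transitory table / change log.
    public static void applyChanges(IndexWriter w, List<Row> newOrUpdatedRows,
                                    List<String> deletedIds) throws IOException {
        // New and updated rows: updateDocument deletes any existing document
        // whose ID term matches, then adds the fresh version.
        for (Row row : newOrUpdatedRows) {
            Document doc = new Document();
            doc.add(new StringField("ID", row.id, Field.Store.YES)); // hypothetical stable id field
            doc.add(new StringField("fieldA", row.fieldA, Field.Store.YES));
            // ... remaining columns, as in addDoc above ...
            w.updateDocument(new Term("ID", row.id), doc);
        }
        // Rows recorded in the deletion log: remove them by the same key.
        for (String id : deletedIds) {
            w.deleteDocuments(new Term("ID", id));
        }
        w.commit();
    }
}

The point is that this id stays stable across edits of the same record, which is exactly what the hash-of-contents field could not provide.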
